Collective Matrix Factorization

This is the documentation page for the Python package cmfrec. For more details, see the project’s GitHub page:

https://www.github.com/david-cortes/cmfrec/

For the R version, see the CRAN page:

https://cran.r-project.org/web/packages/cmfrec/index.html

Installation

The package is available on PyPI and can be installed with:

pip install cmfrec

This library will run faster when compiled from source with non-default compiler arguments, particularly -march=native (replace with -mcpu=native for ARM/PPC), which gets added automatically when installing from pip, and when using an optimized BLAS library for SciPy.

See this guide for details:

https://github.com/david-cortes/installing-optimized-libraries

Sample Usage

For an introduction to the package and methods, see the example IPython notebook building recommendation models on the MovieLens10M dataset with and without side information:

http://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb

Model evaluation

Metrics for implicit-feedback models or recommendation quality can be calculated using the recometrics library:

https://www.github.com/david-cortes/recometrics/

Naming conventions

This package uses the following general naming conventions:

About data:

  • ‘X’ -> data about interactions between users/rows and items/columns (e.g. ratings given by users to items).

  • ‘U’ -> data about user/row attributes (e.g. user’s age).

  • ‘I’ -> data about item/column attributes (e.g. a movie’s genre).
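As a rough illustration of these three inputs, the toy arrays below follow the conventions above. All sizes, values, and attribute meanings are invented for this sketch and do not come from the package.

```python
import numpy as np

# Toy data, invented for illustration: 3 users (rows) and 4 items (columns).
# 'X': ratings given by users to items; np.nan marks a missing interaction.
X = np.array([
    [5.0, np.nan, 3.0, np.nan],
    [np.nan, 4.0, np.nan, 1.0],
    [2.0, np.nan, np.nan, 5.0],
])

# 'U': one row per user, one column per user attribute (here: age, a 0/1 flag).
U = np.array([
    [25.0, 1.0],
    [41.0, 0.0],
    [33.0, 1.0],
])

# 'I': one row per item, one column per item attribute (here: genre indicators).
I = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

m, n = X.shape
# Rows of 'U' line up with rows of 'X'; rows of 'I' line up with columns of 'X'.
assert U.shape[0] == m
assert I.shape[0] == n
```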

About function naming:

  • ‘warm’ -> predictions based on new, unseen ‘X’ data, and potentially including new ‘U’ data along.

  • ‘cold’ -> predictions based on new user attributes data ‘U’, without ‘X’.

  • ‘new’ -> predictions about new items based on attributes data ‘I’.

About function descriptions:

  • ‘existing’ -> the user/item was present in the training data that was passed to ‘fit’.

  • ‘new’ -> the user/item was not present in the training data that was passed to ‘fit’.

Be aware that the package’s functions are user-centric (e.g. it will recommend items for users, but not users for items). If predictions about new items are desired, it’s recommended to use the method ‘swap_users_and_items’, as the item-based functions which are provided for convenience might run a lot slower than their user equivalents.

Implicit and explicit feedback

In recommender systems, data might come in the form of explicit user judgements about items (e.g. movie ratings) or in the form of logged user activity (e.g. number of times that a user listened to each song in a catalogue). The former is typically referred to as “explicit feedback”, while the latter is referred to as “implicit feedback”.

Historically, driven by the Netflix competition, formulations of the recommendation problem have been geared towards predicting the rating that users would give to items under explicit-feedback datasets. The components of the low-rank factorization are determined in a way that minimizes the deviation between predicted and observed numbers on the observed data only (i.e. predictions about items that a user did not rate play no role in the optimization problem, as they are simply ignored). This approach has turned out to often produce very low-quality recommendations, particularly for users with little data, and it is usually not suitable for implicit feedback, as the data in that case does not contain any examples of dislikes and might even come in just binary (yes/no) form.

As such, research has mostly shifted towards the implicit-feedback setting, in which items that are not consumed by users do play a role in the optimization objective for determining the low-rank components - that is, the goal is more to predict which items users would have consumed than to predict the exact rating that they’d give to them. The evaluation of recommendation quality has likewise shifted towards looking at how items that were consumed by users would be ranked compared to unconsumed items.
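The difference between the two settings can be sketched with two toy loss functions. This is a simplified illustration, not the library's objective: the function names are hypothetical, regularization is left out, and the implicit version assumes confidence weights of the form 1 + alpha*X, the same expression that appears in the NA_as_zero parameter description on this page.

```python
import numpy as np

def explicit_loss(X, A, B):
    """Squared error over the observed (non-NaN) entries of X only."""
    observed = ~np.isnan(X)
    err = np.where(observed, X - A @ B.T, 0.0)
    return float(np.sum(err ** 2))

def implicit_loss(X, A, B, alpha=1.0):
    """Weighted squared error over all entries; missing ones count as zeros."""
    counts = np.nan_to_num(X, nan=0.0)      # unconsumed items become 0
    target = (counts > 0).astype(float)     # 1 if consumed, else 0
    weights = 1.0 + alpha * counts          # assumed confidence weights
    return float(np.sum(weights * (target - A @ B.T) ** 2))

# Toy example: two users, two items, a rank-1 model that predicts 1 everywhere.
X = np.array([[2.0, np.nan],
              [np.nan, 1.0]])
A = np.ones((2, 1))
B = np.ones((2, 1))

loss_explicit = explicit_loss(X, A, B)   # only the two observed cells contribute
loss_implicit = implicit_loss(X, A, B)   # the two unconsumed cells are penalized
```

In this toy case the explicit loss only sees the error on the observed rating of 2, while the implicit loss is driven entirely by the two unconsumed cells that the model wrongly predicts as consumed.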

Other problem domains

The documentation and naming conventions in this library are all oriented towards recommender systems, with the assumption that users are rows in a matrix, items are columns, and values denote interactions between them, with the idea that values under different columns are comparable (e.g. the rating scale is the same for all items).

The concept of approximate low-rank matrix factorizations is however still useful for other problem domains, such as general dimensionality reduction for large sparse data (e.g. TF-IDF matrices) or imputation of high-dimensional tabular data, in which assumptions like values being comparable between different columns would not hold.

Be aware that classes like CMF come with some defaults that might not be reasonable in other applications, but which can be changed by passing non-default arguments - for example:

  • Global centering - the “explicit-feedback” models here will by default calculate a global mean for all entries in ‘X’ and center the matrix by subtracting this value from all entries. This is a reasonable thing to do when dealing with movie ratings as all ratings follow the same scale, but if columns of the ‘X’ matrix represent different things that might have different ranges or different distributions, global mean centering is probably not going to be desirable or useful.

  • User/row biases: models might also have one bias/intercept parameter per row, which in the approximation, would get added to every column for that user/row. This is again a reasonable thing to do for movie ratings, but if the columns of ‘X’ contain different types of information, it might not be a sensible thing to add.

  • Regularization for item/column biases: since the models perform global mean centering beforehand, the item/column-specific bias/intercept parameters will get a regularization penalty (“shrinkage”) applied to them, which might not be desirable if global mean centering is removed.
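The effect of global centering on columns with different scales can be seen in a small NumPy sketch. The data below is hypothetical, and the per-column centering is shown only for comparison; it is not something the text above says the library does for ‘X’.

```python
import numpy as np

# Hypothetical tabular data: column 0 lies in the range 0-1,
# column 1 is in the thousands.
X = np.array([
    [0.2, 1500.0],
    [np.nan, 2500.0],
    [0.8, np.nan],
])

glob_mean = np.nanmean(X)            # single mean over all non-missing entries
X_global = X - glob_mean             # what global centering would produce

col_means = np.nanmean(X, axis=0)    # one mean per column, for comparison
X_by_col = X - col_means
```

After global centering, every entry of column 0 sits near -1000, so the small-scale column is dominated by the mean of the large-scale one. In a case like this, passing center=False and user_bias=False (both constructor arguments listed in the CMF signature) and scaling the columns beforehand might be a more sensible starting point.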

Models

CMF

class cmfrec.CMF(k=40, lambda_=10.0, method='als', use_cg=True, user_bias=True, item_bias=True, center=True, add_implicit_features=False, scale_lam=False, scale_lam_sideinfo=False, scale_bias_const=False, k_user=0, k_item=0, k_main=0, w_main=1.0, w_user=1.0, w_item=1.0, w_implicit=0.5, l1_lambda=0.0, center_U=True, center_I=True, maxiter=800, niter=10, parallelize='separate', corr_pairs=4, max_cg_steps=3, precondition_cg=False, finalize_chol=True, NA_as_zero=False, NA_as_zero_user=False, NA_as_zero_item=False, nonneg=False, nonneg_C=False, nonneg_D=False, max_cd_steps=100, precompute_for_predictions=True, include_all_X=True, use_float=True, random_state=1, verbose=False, print_every=10, handle_interrupt=True, produce_dicts=False, nthreads=-1, n_jobs=None)

Collective or multi-view matrix factorization

Tries to approximate the ‘X’ interactions matrix by a formula as follows:

\(\mathbf{X} \sim \mathbf{A} \mathbf{B}^T\)

While at the same time also approximating the user/row side information matrix ‘U’ and the item/column side information matrix ‘I’ as follows:

\(\mathbf{U} \sim \mathbf{A} \mathbf{C}^T\),

\(\mathbf{I} \sim \mathbf{B} \mathbf{D}^T\)

The matrices (“A”, “B”, “C”, “D”) are obtained by minimizing the error with respect to the non-missing entries in the input data (“X”, “U”, “I”). The model can also apply sigmoid transformations to binary columns in U and I.
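A simplified version of this joint objective can be written down in a few lines of NumPy. This is a sketch under stated assumptions rather than the library's actual loss: it ignores biases, centering, non-shared factors, implicit features, and sigmoid transformations, and the exact scaling constants used by the package may differ. The function names are hypothetical; the weights w_main, w_user, w_item and the single regularization value mirror the constructor arguments of the same names.

```python
import numpy as np

def sq_err_observed(M, approx):
    """Sum of squared errors over the non-missing (non-NaN) entries of M."""
    mask = ~np.isnan(M)
    return float(np.sum((M[mask] - approx[mask]) ** 2))

def cmf_objective(X, U, I, A, B, C, D, lam=10.0,
                  w_main=1.0, w_user=1.0, w_item=1.0):
    """Simplified collective objective: X ~ A B^T, U ~ A C^T, I ~ B D^T."""
    loss = w_main * sq_err_observed(X, A @ B.T)
    loss += w_user * sq_err_observed(U, A @ C.T)
    loss += w_item * sq_err_observed(I, B @ D.T)
    # L2 penalty on all four factor matrices, not divided by entry counts.
    penalty = sum(float(np.sum(M ** 2)) for M in (A, B, C, D))
    return loss + lam * penalty

# Toy check with all-zero factors: the objective is the sum of squared inputs.
X = np.ones((3, 4))
X[0, 0] = np.nan                      # one missing interaction
U = np.ones((3, 2))
I = np.ones((4, 3))
k = 2
A, B = np.zeros((3, k)), np.zeros((4, k))
C, D = np.zeros((2, k)), np.zeros((3, k))
value = cmf_objective(X, U, I, A, B, C, D)
```

Note how the matrix A appears in both the ‘X’ and ‘U’ terms, and B in both the ‘X’ and ‘I’ terms; this sharing is what lets side information influence the interaction factors.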

This is the most flexible of the models available in this package, and can also mimic the implicit-feedback version through the option ‘NA_as_zero’ plus an array of weights.

Note

The default arguments are not geared towards speed. For faster fitting, use method="als", use_cg=True, finalize_chol=False, use_float=True, precompute_for_predictions=False, produce_dicts=False, and pass COO matrices or NumPy arrays instead of DataFrames to fit.
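For convenience, the settings from this note can be collected in a dictionary and unpacked into the constructor. The dictionary name is hypothetical; the keys and values are exactly those listed in the note.

```python
# Settings copied from the note above; all are constructor arguments of CMF.
fast_fit_args = dict(
    method="als",
    use_cg=True,
    finalize_chol=False,
    use_float=True,
    precompute_for_predictions=False,
    produce_dicts=False,
)

# Assumed usage (requires the cmfrec package to be installed):
# from cmfrec import CMF
# model = CMF(k=40, **fast_fit_args)
```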

Note

By default, the model optimization objective will not scale any of its terms according to number of entries (see parameter scale_lam), so hyperparameters such as lambda_ will require more tuning than in other software and trying out values over a wider range.
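To see how an unscaled regularization term interacts with the number of entries, consider a textbook regularized least-squares update for one user's factors with the item factors held fixed. This is a simplified sketch, not the library's solver: the helper name is hypothetical, biases and weights are ignored, and the scaled variant assumes that the regularization is multiplied by the number of observed entries, in the spirit of reference [7a].

```python
import numpy as np

def solve_user_factors(x_row, B, lam, scale_lam=False):
    """Ridge solution for one row of A given item factors B (hypothetical helper)."""
    observed = ~np.isnan(x_row)
    B_obs = B[observed]
    n_obs = int(observed.sum())
    # Unscaled: the same penalty regardless of how much data the user has.
    # Scaled: the penalty grows with the number of observed entries.
    # Note: with scale_lam=True and no observed entries, lam_eff is zero and
    # the system below becomes singular.
    lam_eff = lam * n_obs if scale_lam else lam
    lhs = B_obs.T @ B_obs + lam_eff * np.eye(B.shape[1])
    rhs = B_obs.T @ x_row[observed]
    return np.linalg.solve(lhs, rhs)

# Toy item factors (4 items, 2 factors) and a user who rated every item as 1.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
x_row = np.array([1.0, 1.0, 1.0, 1.0])

a_unscaled = solve_user_factors(x_row, B, lam=2.0)
a_scaled = solve_user_factors(x_row, B, lam=2.0, scale_lam=True)
```

With the same nominal value of 2.0, the scaled variant shrinks the factors considerably more, which is consistent with the advice on this page that the optimal lambda_ is much smaller when scale_lam=True.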

Parameters:
  • k (int) – Number of latent factors to use (dimensionality of the low-rank factorization), which will be shared between the factorization of the ‘X’ matrix and the side info matrices. Additional non-shared components can also be specified through k_user, k_item, and k_main. Typical values are 30 to 100.

  • lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: user_bias, item_bias, A, B, C, D. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries anywhere, so this parameter needs good tuning. For example, a good value for the MovieLens10M would be lambda_=35. (or lambda_=0.05 with scale_lam=True). Typical values are \(10^{-2}\) to \(10^2\).

  • method (str, one of “lbfgs” or “als”) – Optimization method used to fit the model. If passing 'lbfgs', will fit it through a gradient-based approach using an L-BFGS optimizer. L-BFGS is typically a much slower and a much less memory efficient method compared to 'als', but tends to reach better local optima and allows some variations of the problem which ALS doesn’t, such as applying sigmoid transformations for binary side information.

  • use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative to the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data, such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data. This option is not available when using L1 regularization and/or non-negativity constraints. Ignored when passing method="lbfgs".

  • user_bias (bool) – Whether to add user/row biases (intercepts) to the model. If using it for purposes other than recommender systems, including them is usually not advisable.

  • item_bias (bool) – Whether to add item/column biases (intercepts) to the model. Be aware that using item biases with low regularization for them will tend to favor items with high average ratings regardless of the number of ratings the item has received.

  • center (bool) – Whether to center the “X” data by subtracting the mean value. For recommender systems, it’s highly recommended to pass “True” here, the more so if the model has user and/or item biases.

  • add_implicit_features (bool) – Whether to automatically add so-called implicit features from the data, as in reference [5a] and similar. If using this for recommender systems with small amounts of data, it’s recommended to pass ‘True’ here.

  • scale_lam (bool) – Whether to scale (increase) the regularization parameter for each row of the model matrices (A, B, C, D) according to the number of non-missing entries in the data for that particular row, as proposed in reference [7a]. For the A and B matrices, the regularization will only be scaled according to the number of non-missing entries in “X” (see also the scale_lam_sideinfo parameter). Note that, when using the options NA_as_zero_*, all entries are considered to be non-missing. If passing “True” here, the optimal value for lambda_ will be much smaller (and likely below 0.1). This option tends to give better results, but requires more hyperparameter tuning. Only supported for method="als".

    When generating factors based on side information alone, if passing scale_lam_sideinfo, will regularize assuming there was one observation present. Be aware that using this option without scale_lam_sideinfo=True can lead to bad cold-start recommendations as it will set a very small regularization for users who have no ‘X’ data.

    Warning: in smaller datasets, using this option can result in top-N recommendations having mostly items with very few interactions (see parameter scale_bias_const).

  • scale_lam_sideinfo (bool) – Whether to scale (increase) the regularization parameter for each row of the “A” and “B” matrices according to the number of non-missing entries in both “X” and the side info matrices “U” and “I”. If passing “True” here, scale_lam will also be assumed to be “True”.

  • scale_bias_const (bool) – When passing scale_lam=True and user_bias=True or item_bias=True, whether to apply the same scaling to the regularization of the biases to all users and items, according to the average number of non-missing entries rather than to the number of entries for each specific user/item.

    While this tends to result in worse RMSE, it tends to make the top-N recommendations less likely to select items with only a few interactions from only a few users.

    Ignored when passing scale_lam=False or not using user/item biases.

  • k_user (int) – Number of factors in the factorizing A and C matrices which will be used only for the ‘U’ and ‘U_bin’ matrices, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.

  • k_item (int) – Number of factors in the factorizing B and D matrices which will be used only for the ‘I’ and ‘I_bin’ matrices, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.

  • k_main (int) – Number of factors in the factorizing A and B matrices which will be used only for the ‘X’ matrix, while being ignored for the ‘U’, ‘U_bin’, ‘I’, and ‘I_bin’ matrices. These will be the last factors of the matrices once the model is fit. Will be counted in addition to those already set by k.

  • w_main (float) – Weight in the optimization objective for the errors in the factorization of the ‘X’ matrix.

  • w_user (float) – Weight in the optimization objective for the errors in the factorization of the ‘U’ and ‘U_bin’ matrices. Ignored when passing neither ‘U’ nor ‘U_bin’ to ‘fit’.

  • w_item (float) – Weight in the optimization objective for the errors in the factorization of the ‘I’ and ‘I_bin’ matrices. Ignored when passing neither ‘I’ nor ‘I_bin’ to ‘fit’.

  • w_implicit (float) – Weight in the optimization objective for the errors in the factorizations of the implicit ‘X’ matrices. Note that, depending on the sparsity of the data, the sum of errors from these factorizations might be much larger than for the original ‘X’ and a smaller value will perform better. It is recommended to tune this parameter carefully. Ignored when passing add_implicit_features=False.

  • l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. Can also pass different values for each matrix (see lambda_ for details). Note that, when adding L1 regularization, the model will be fit through a coordinate descent procedure, which is significantly slower than the Cholesky method with L2 regularization. Only supported with method="als". Not recommended.

  • center_U (bool) – Whether to center the ‘U’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_user=True.

  • center_I (bool) – Whether to center the ‘I’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_item=True.

  • maxiter (int) – Maximum L-BFGS iterations to perform. The procedure will halt if it has not converged after this number of updates. Note that, compared to the other models, fewer iterations will be required for convergence here. Using higher regularization values might also decrease the number of required iterations. Pass zero for no L-BFGS iterations limit. If the procedure is spending hundreds of iterations without any significant decrease in the loss function or gradient norm, it’s highly likely that the regularization is too low. Ignored when passing method='als'.

  • niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Ignored when passing method='lbfgs'. Typical values are 6 to 30.

  • parallelize (str, “separate” or “single”) – How to parallelize gradient calculations when using more than one thread with method='lbfgs'. Passing 'separate' will iterate over the data twice - first by rows and then by columns, letting each thread calculate results for each row and column, whereas passing 'single' will iterate over the data only once, and then sum the obtained results from each thread. Passing 'separate' is much more memory-efficient and less prone to irreproducibility of random seeds, but might be slower for typical use-cases. Ignored when passing nthreads=1, or method='als', or when compiling without OpenMP support.

  • corr_pairs (int) – Number of correction pairs to use for the L-BFGS optimization routine. Recommended values are between 3 and 7. Note that higher values translate into higher memory requirements. Ignored when passing method='als'.

  • max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False or method="lbfgs".

  • precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="lbfgs".

  • finalize_chol (bool) – When passing use_cg=True and method="als", whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.

  • NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix or DataFrame) instead of ignoring them. Note that this is a different model from the implicit-feedback version with weighted entries, and it’s a much faster model to fit. Note that passing “True” will affect the results of the functions named “cold” (as it will assume zeros instead of missing). It is possible to obtain equivalent results to the implicit-feedback model if passing “True” here, and then passing an “X” to fit with all values set to one and weights corresponding to the actual values of “X” multiplied by alpha, plus 1 (W := 1 + alpha*X to imitate the implicit-feedback model). If passing this option, be aware that the defaults are also to perform mean centering and add user/item biases, which might be undesirable to have together with this option.

  • NA_as_zero_user (bool) – Whether to take missing entries in the ‘U’ matrix as zeros (only when the ‘U’ matrix is passed as sparse COO matrix) instead of ignoring them. Note that passing “True” will affect the results of the functions named “warm” if no data is passed there (as it will assume zeros instead of missing).

  • NA_as_zero_item (bool) – Whether to take missing entries in the ‘I’ matrix as zeros (only when the ‘I’ matrix is passed as sparse COO matrix) instead of ignoring them.

  • nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative. In order for this to work correctly, the ‘X’ input data must also be non-negative. This constraint will also be applied to the ‘Ai’ and ‘Bi’ matrices if passing add_implicit_features=True.

    Important: be aware that the default options are to perform mean centering and to add user and item biases, which might be undesirable and hinder performance when having non-negativity constraints (especially mean centering).

    This option is not available when using the L-BFGS method. Note that, when determining non-negative factors, it will always use a coordinate descent method, regardless of the value passed for use_cg and finalize_chol. When used for recommender systems, one usually wants to pass ‘False’ here. For better results, do not use centering alongside this option, and use a higher regularization coupled with more iterations.

  • nonneg_C (bool) – Whether to constrain the ‘C’ matrix to be non-negative. In order for this to work correctly, the ‘U’ input data must also be non-negative.

    Note: by default, the ‘U’ data will be centered by columns, which doesn’t play well with non-negativity constraints. One will likely want to pass center_U=False along with this.

  • nonneg_D (bool) – Whether to constrain the ‘D’ matrix to be non-negative. In order for this to work correctly, the ‘I’ input data must also be non-negative.

    Note: by default, the ‘I’ data will be centered by columns, which doesn’t play well with non-negativity constraints. One will likely want to pass center_I=False along with this.

  • max_cd_steps (int) – Maximum number of coordinate descent updates to perform per iteration. Pass zero for no limit. The procedure will only use coordinate descent updates when having L1 regularization and/or non-negativity constraints. This number should usually be larger than k.

  • precompute_for_predictions (bool) – Whether to precompute some of the matrices that are used when making predictions from the model. If ‘False’, it will take longer to generate predictions or top-N lists, but will use less memory and will be faster to fit the model. If passing ‘False’, can be recomputed later on-demand through method ‘force_precompute_for_predictions’.

  • include_all_X (bool) – When passing an input “X” to fit which has fewer columns than rows in “I”, whether to still make calculations about the items which are in “I” but not in “X”. This has three effects: (a) the topN functionality may recommend such items, (b) the precomputed matrices will be less usable as they will include all such items, (c) it will be possible to pass “X” data to the new factors or topN functions that include such columns (rows of “I”). This option is ignored when using NA_as_zero.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. Note however that, if using more than one thread, results might not be 100% reproducible with method='lbfgs' due to round-off errors in parallelized aggregations. If passing None, will draw a non-reproducible random integer to use as seed.

  • verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model. Be aware that, if passing ‘False’ and method='lbfgs', the optimization routine will not respond to interrupt signals.

  • print_every (int) – Print L-BFGS convergence messages every n-iterations. Ignored when passing verbose=False or method='als'.

  • produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.

  • handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads).

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
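The NA_as_zero description above mentions that the implicit-feedback model can be imitated by passing an ‘X’ with all values set to one together with weights W := 1 + alpha*X. A minimal sketch of building those two arrays is shown below. The triplets and the value of alpha are hypothetical, and the arrays are only constructed here, not passed to the model.

```python
import numpy as np

alpha = 40.0  # hypothetical confidence scaling; not a value given on this page

# Hypothetical implicit-feedback triplets: (user index, item index, play count).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 0, 3])
counts = np.array([3.0, 1.0, 7.0, 2.0])

X_values = np.ones_like(counts)   # every observed entry of 'X' set to one
W = 1.0 + alpha * counts          # weights: W := 1 + alpha * X
```

A sparse ‘X’ built from rows, cols, and X_values, with W as the observation weights, would then go to fit on a model created with NA_as_zero=True. Given the caveat in the NA_as_zero description about the defaults, it may also be worth trying center=False, user_bias=False, and item_bias=False in that setup.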

Variables:
  • is_fitted (bool) – Whether the model has been fitted to data.

  • reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).

  • user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’.

  • user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False, this array will be empty.

  • item_bias (array(n,)) – The obtained biases for each item (column in the ‘X’ matrix). If passing item_bias=False, this array will be empty.

  • A (array(m, k_user+k+k_main)) – The obtained user factors.

  • B (array(n, k_item+k+k_main)) – The obtained item factors.

  • C (array(p, k_user+k)) – The obtained user attribute factors.

  • D (array(q, k_item+k)) – The obtained item attribute factors.

  • Ai (array(m, k+k_main) or array(0, 0)) – The obtained implicit user factors.

  • Bi (array(n, k+k_main) or array(0, 0)) – The obtained implicit item factors.

  • nfev (int) – Number of function and gradient evaluations performed during the L-BFGS optimization procedure.

  • nupd (int) – Number of L-BFGS updates performed during the optimization procedure.
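The shapes listed above, combined with the descriptions of k_user, k_item, and k_main (non-shared side-info factors come first, non-shared ‘X’ factors come last), suggest the column layout sketched below. This is an illustration inferred from those descriptions, using random matrices and hypothetical sizes; it leaves out biases and centering, and is not taken from the library's source.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
m, n, p, q = 5, 6, 3, 4
k, k_user, k_item, k_main = 2, 1, 1, 1

rng = np.random.default_rng(0)
A = rng.normal(size=(m, k_user + k + k_main))
B = rng.normal(size=(n, k_item + k + k_main))
C = rng.normal(size=(p, k_user + k))
D = rng.normal(size=(q, k_item + k))

# 'X' ignores the first k_user columns of A and the first k_item columns of B.
X_approx = A[:, k_user:] @ B[:, k_item:].T
# 'U' and 'I' ignore the last k_main columns of A and B.
U_approx = A[:, :k_user + k] @ C.T
I_approx = B[:, :k_item + k] @ D.T
```

Only the middle k columns of A and B take part in both the ‘X’ approximation and the side-information approximations.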

References

[1a]

Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).

[2a]

Singh, Ajit P., and Geoffrey J. Gordon. “Relational learning via collective matrix factorization.” Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008.

[4a]

Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. “Applications of the conjugate gradient method for implicit feedback collaborative filtering.” Proceedings of the fifth ACM conference on Recommender systems. 2011.

[5a]

Rendle, Steffen, Li Zhang, and Yehuda Koren. “On the difficulty of evaluating baselines: A study on recommender systems.” arXiv preprint arXiv:1905.01395 (2019).

[6a]

Franc, Vojtěch, Václav Hlaváč, and Mirko Navara. “Sequential coordinate-wise algorithm for the non-negative least squares problem.” International Conference on Computer Analysis of Images and Patterns. Springer, Berlin, Heidelberg, 2005.

[7a]

Zhou, Yunhong, et al. “Large-scale parallel collaborative filtering for the netflix prize.” International conference on algorithmic applications in management. Springer, Berlin, Heidelberg, 2008.

drop_nonessential_matrices(drop_precomputed=True)

Drop matrices that are not used for prediction

Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.

This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.

Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.

After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:

  • factors_warm

  • factors_cold

  • factors_multiple

  • topN_warm

  • topN_cold

Parameters:

drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).

Returns:

self – This object with the non-essential matrices dropped.

Return type:

obj

factors_cold(U=None, U_bin=None, U_col=None, U_val=None)

Determine user-factors from new data, given U

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Parameters:
  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

factors – The user-factors as determined by the model.

Return type:

array(k_user+k+k_main,)
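Conceptually, a cold-start calculation finds the user factors that best reproduce the given attributes through the ‘C’ matrix. The sketch below shows a textbook regularized least-squares version of that idea. It is an assumption-laden simplification, not the method's implementation: the helper name is hypothetical, and it ignores centering of ‘U’, the w_user weight, binary attributes, non-shared factors, and the NA_as_zero behavior mentioned in the note above.

```python
import numpy as np

def cold_factors_sketch(u_row, C, lam=10.0):
    """Least-squares user factors from attributes alone (hypothetical helper)."""
    observed = ~np.isnan(u_row)          # skip missing attributes
    C_obs = C[observed]
    lhs = C_obs.T @ C_obs + lam * np.eye(C.shape[1])
    rhs = C_obs.T @ u_row[observed]
    return np.linalg.solve(lhs, rhs)

# Toy example: 2 attributes, 2 factors, second attribute missing.
C = np.eye(2)
u_row = np.array([3.0, np.nan])
factors = cold_factors_sketch(u_row, C, lam=2.0)
```

The factor tied to the missing attribute stays at zero here because, with no data for it, the regularization term is the only thing acting on it.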

factors_multiple(X=None, U=None, U_bin=None, W=None, return_bias=False)

Determine user latent factors based on new data (warm and cold)

Determines latent factors for multiple rows/users at once given new data for them.

Note

See the documentation of “fit” for details about handling of missing values.

Note

If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.

Parameters:
  • X (array(m_x, n), CSR matrix(m_x, n), COO matrix(m_x, n), or None) – New ‘X’ data.

  • U (array(m_u, p), CSR matrix(m_u, p), COO matrix(m_u, p), or None) – User attributes information for rows in ‘X’.

  • U_bin (array(m_ub, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.

  • W (array(m_x, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

  • return_bias (bool) – Whether to return also the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.

Returns:

  • A (array(max(m_x,m_u,m_ub), k_user+k+k_main)) – The new factors determined for all the rows given the new data.

  • bias (array(max(m_x,m_u,m_ub)) or None) – The user bias given the new ‘X’ data. Only returned if passing return_bias=True.
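As a rough illustration of the note above on internal reindexing, the sketch below lays out ratings for new users in the column order used by the model. It assumes that self.item_mapping_ behaves like a sequence in which position j holds the original ID of internal column j. The item_mapping list, the IDs in it, and the ratings_to_row helper are hypothetical stand-ins, and plain lists are used instead of NumPy arrays so that the snippet runs on its own.

```python
import math

# Hypothetical stand-in for `model.item_mapping_` (assumption):
# position = internal column number, value = original 'ItemId'.
item_mapping = ["item_c", "item_a", "item_b"]

# Reverse lookup from original item ID to internal column number.
col_of = {item_id: col for col, item_id in enumerate(item_mapping)}

def ratings_to_row(ratings):
    """Place a {item_id: rating} dict into a dense row, NaN where unobserved."""
    row = [math.nan] * len(item_mapping)
    for item_id, value in ratings.items():
        row[col_of[item_id]] = value
    return row

# Two new users, each described by the ratings they gave.
new_X = [
    ratings_to_row({"item_a": 4.0}),
    ratings_to_row({"item_b": 2.0, "item_c": 5.0}),
]
```

The rows in new_X could then be stacked into a 2-d array and passed as ‘X’.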

factors_warm(X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, return_bias=False)[source]

Determine user latent factors based on new ratings data

Parameters:
  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • return_bias (bool) – Whether to also return the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry is an array with the factors and the second entry is the estimated bias.

  • return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.

Returns:

  • factors (array(k_user+k+k_main,) or array(k+k_main,)) – User factors as determined from the data in ‘X’.

  • bias (float) – User bias as determined from the data in ‘X’. Only returned if passing return_bias=True.
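The dense (‘X’) and sparse (‘X_col’+‘X_val’) ways of describing a new user carry the same information. The sketch below converts one into the other, together with matching weights. It assumes the model was fitted to arrays or sparse matrices, so that ‘X_col’ holds column numbers rather than ‘ItemId’ entries. The dense_to_sparse helper and all values are made up for illustration, and plain lists stand in for NumPy arrays.

```python
import math

def dense_to_sparse(x_dense):
    """Split a dense row (NaN = not observed) into column indices and values."""
    x_col = [j for j, v in enumerate(x_dense) if not math.isnan(v)]
    x_val = [x_dense[j] for j in x_col]
    return x_col, x_val

# Dense description of a new user over 4 items; two ratings were observed.
x_dense = [math.nan, 4.0, math.nan, 1.5]
x_col, x_val = dense_to_sparse(x_dense)

# Weights follow the shape of the chosen format:
# 'n' entries for the dense form, 'nnz' entries for the sparse form.
w_dense = [1.0, 2.0, 1.0, 0.5]
w_sparse = [w_dense[j] for j in x_col]
```

Only one of the two forms would be passed in a given call, as stated in the parameter descriptions.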

fit(X, U=None, I=None, U_bin=None, I_bin=None, W=None)[source]

Fit model to explicit-feedback data and user/item attributes

Note

It’s possible to pass partially disjoint sets of users/items between the different matrices (e.g. it’s possible for both the ‘X’ and ‘U’ matrices to have rows that the other doesn’t have). The procedure supports missing values for all inputs (except for “W”). If any of the inputs has fewer rows/columns than the other(s) (e.g. “U” has more rows than “X”, or “I” has more rows than there are columns in “X”), will assume that the rest of the rows/columns have only missing values. Note however that when having partially disjoint inputs, the order of the rows/columns matters for speed, as it might run faster when the “U”/”I” inputs that do not have matching rows/columns in “X” have those unmatched rows/columns at the end (last rows/columns) and the “X” input is shorter. See also the parameter include_all_X for info about predicting with mismatched “X”.

Note

When passing NumPy arrays, missing (unobserved) entries should have value np.nan. When passing sparse inputs, the zero-valued entries will be considered as missing (unless using “NA_as_zero”), and it should not contain “NaN” values among the non-zero entries.

Note

In order to avoid potential decimal differences between the factors obtained when fitting the model and those obtained when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.

Parameters:
  • X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’. Might additionally have a column ‘Weight’. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.

  • U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.

  • U_bin (array(m, p_bin), DataFrame(m, p_bin+1), or None) – User binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. Cannot be passed as a sparse matrix. Note that ‘U’ and ‘U_bin’ are not mutually exclusive. Only supported with method='lbfgs'.

  • I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.

  • I_bin (array(n, q_bin), DataFrame(n, q_bin+1), or None) – Item binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. Cannot be passed as a sparse matrix. Note that ‘I’ and ‘I_bin’ are not mutually exclusive. Only supported with method='lbfgs'.

  • W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array. Cannot have missing values.

Return type:

self

force_precompute_for_predictions()[source]

Precompute internal matrices that are used for predictions

Note

It’s not necessary to call this method if passing precompute_for_predictions=True.

Return type:

self

static from_model_matrices(A, B, glob_mean=0.0, precompute=True, user_bias=None, item_bias=None, lambda_=10.0, scale_lam=False, l1_lambda=0.0, nonneg=False, NA_as_zero=False, scaling_biasA=None, scaling_biasB=None, use_float=False, nthreads=-1, n_jobs=None)[source]

Create a CMF model object from fitted matrices

Creates a CMF model object based on fitted latent factor matrices, which might have been obtained from different software. For example, the package python-libmf has functionality for obtaining these matrices, but not for producing recommendations or latent factors for new users, for which this function can come in handy as it will turn such a model into a CMF model which provides all such functionality.

This is only available for models without side information, and does not support user/item mappings.

Note

This is a static method, and should be called like this:

CMF.from_model_matrices(...)

(i.e. no parentheses after ‘CMF’)

Parameters:
  • A (array(n_users, k)) – The obtained user factors.

  • B (array(n_items, k)) – The obtained item factors.

  • glob_mean (float) – The obtained global mean, if the model underwent centering. If passing zero, will assume that the values are not to be centered.

  • precompute (bool) – Whether to generate pre-computed matrices which can help to speed up computations on new data.

  • user_bias (None or array(n_users,)) – The obtained user biases. If passing None, will assume that the model did not include user biases.

  • item_bias (None or array(n_items,)) – The obtained item biases. If passing None, will assume that the model did not include item biases.

  • lambda_ (float or array(6,)) – Regularization parameter. See the documentation for __init__ for details.

  • scale_lam (bool) – Whether to scale (increase) the regularization parameter for each row of the model matrices according to the number of non-missing entries in the data for that particular row.

  • l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. See the documentation for __init__ for details.

  • nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative.

  • NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix) instead of ignoring them. See the documentation for __init__ for details.

  • scaling_biasA (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the user biases.

  • scaling_biasB (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the item biases.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads, so that -1 uses all available threads).

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.

Returns:

model – A CMF model object without side information, for which the usual prediction methods such as topN and topN_warm can be used as if it had been fitted through this software.

Return type:

CMF
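To make the role of each argument more concrete, the sketch below computes a prediction by hand using the usual formula of a biased matrix factorization without side information: glob_mean + user_bias[u] + item_bias[i] + A[u]·B[i]. It is an illustration under that assumption, not the library's internal code. The predict_one helper and all numbers are made up, and nested lists stand in for arrays of shapes (n_users, k) and (n_items, k).

```python
# Made-up factor matrices with k=2: 2 users and 3 items.
A = [[0.5, 1.0], [1.5, -0.5]]
B = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Made-up centering and bias terms.
glob_mean = 3.0
user_bias = [0.25, -0.25]
item_bias = [0.0, 0.5, -0.5]

def predict_one(u, i):
    """Prediction for user row u and item row i under the assumed formula."""
    dot = sum(a * b for a, b in zip(A[u], B[i]))
    return glob_mean + user_bias[u] + item_bias[i] + dot

score = predict_one(0, 1)
```

Passing user_bias=None or item_bias=None corresponds to dropping the respective term, and glob_mean=0.0 to a model without centering, as described in the parameters above.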

get_params(deep=True)[source]

Get parameters for this estimator.

Kept for compatibility with scikit-learn.

Parameters:

deep (bool) – Ignored.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

item_factors_cold(I=None, I_bin=None, I_col=None, I_val=None)[source]

Determine item-factors from new data, given I

Note

Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.

Parameters:
  • I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_bin (array(q_bin,), or None) – Binary attributes for the new item, in dense format. Only supported with method='lbfgs'.

  • I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

Returns:

factors – The item-factors as determined by the model.

Return type:

array(k_item+k+k_main,)

predict(user, item)

Predict ratings/values given by existing users to existing items

Note

For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)
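Because ‘user’ and ‘item’ are matched position by position, scoring every combination of a few users and a few items requires expanding both into parallel sequences first. The sketch below shows one way to do that with the standard library. The IDs are made-up row/column numbers, and the variable names are hypothetical.

```python
from itertools import product

users = [0, 1]       # made-up row numbers
items = [3, 5, 7]    # made-up column numbers

# Every (user, item) combination, users varying slowest.
pairs = list(product(users, items))

# Parallel sequences of equal length: entry j of one matches entry j of the other.
user_arg = [u for u, _ in pairs]
item_arg = [i for _, i in pairs]
```

The two sequences have the same length and could be passed as the ‘user’ and ‘item’ arguments, yielding one score per pair.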

predict_cold(items, U=None, U_bin=None, U_col=None, U_val=None)[source]

Predict rating given by a new user to existing items, given U

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

scores – Predicted ratings for the requested items, for this user.

Return type:

array(n,)

predict_cold_multiple(item, U=None, U_bin=None)[source]

Predict rating given by new users to existing items, given U

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Parameters:
  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.

  • U_bin (array(m, p_bin), or None) – Binary attributes for the users to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item. Only supported with method='lbfgs'.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

predict_new(user, I=None, I_bin=None)[source]

Predict rating given by existing users to new items, given I

Note

Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(n, q), CSR matrix(n, q), COO matrix(n, q), or None) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user. Might contain missing values.

  • I_bin (array(n, q_bin), or None) – Binary attributes for the items to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user. Might contain missing values. Only supported with method='lbfgs'.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

predict_warm(items, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None)[source]

Predict ratings for existing items, for a new user, given ‘X’

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

Returns:

scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.

Return type:

array(n,)

predict_warm_multiple(X, item, U=None, U_bin=None, W=None)[source]

Predict ratings for existing items, for new users, given ‘X’

Note

See the documentation of “fit” for details about handling of missing values.

Parameters:
  • X (array(m, n), CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Must have one row per entry of item.

  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding row of X.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.

  • U_bin (array(m, p_bin)) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.

  • W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

set_params(**params)

Set the parameters of this estimator.

Kept for compatibility with scikit-learn.

Note

Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

swap_users_and_items(precompute=True)

Swap the users and items in a factorization model

This method will generate a new object that has the users and items of this object swapped. The result can be used with the same methods, such as topN, in which any mention of users will now mean items and vice versa.

Note

The resulting object will not generate any deep copies of the original model’s objects.

Parameters:

precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.

Returns:

model – An object of the same class as this one, but with the user and items swapped.

Return type:

obj

topN(user, n=10, include=None, exclude=None, output_score=False)

Rank top-N highest-predicted items for an existing user

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • n (int) – Number of top-N highest-predicted results to output.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
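The exact ranking described in the note, together with the ‘include’/‘exclude’ filters, can be sketched as follows. This is a simplified illustration and not the library's implementation: it scores items by the dot product of factor vectors only (biases and the global mean are left out for brevity), the top_n helper is hypothetical, and the factors are made up.

```python
import heapq

def top_n(user_factors, item_factors, n=10, include=None, exclude=None):
    """Rank items by dot-product score, optionally restricting the candidates."""
    if include is not None and exclude is not None:
        raise ValueError("Pass only one of 'include' or 'exclude'.")
    candidates = range(len(item_factors))
    if include is not None:
        candidates = list(include)
    elif exclude is not None:
        excluded = set(exclude)
        candidates = [i for i in candidates if i not in excluded]
    # Score every candidate item: this is the part that grows with the catalog.
    scores = {
        i: sum(a * b for a, b in zip(user_factors, item_factors[i]))
        for i in candidates
    }
    best = heapq.nlargest(n, scores, key=scores.get)
    return best, [scores[i] for i in best]

user_factors = [1.0, 0.5]
item_factors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.0]]
items, item_scores = top_n(user_factors, item_factors, n=2)
```

Every candidate item is scored before the top entries are selected, which is why the cost grows with the number of items.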

topN_cold(n=10, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)[source]

Compute top-N highest-predicted items for a new user, given ‘U’

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_new(user, I=None, I_bin=None, n=10, output_score=False)[source]

Rank top-N highest-predicted items for an existing user, given ‘I’

Note

If the model was fit to both ‘I’ and ‘I_bin’, can pass a partially-disjoint set to both - that is, both can have rows that the other doesn’t. In such case, the rows that they have in common should come first, and then one of them should be padded with missing values so that one of the matrices ends up containing all the rows of the other.

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(m, q), CSR matrix(m, q), COO matrix(m, q), or None) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.

  • I_bin (array(m, q_bin), or None) – Binary attributes for the items to rank. Data frames with ‘ItemId’ column are not supported. Only supported with method='lbfgs'.

  • n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in ‘I’/’I_bin’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’/’I_bin’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_warm(n=10, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)[source]

Compute top-N highest-predicted items for a new user, given ‘X’

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

transform(X=None, y=None, U=None, U_bin=None, W=None, replace_existing=False)[source]

Reconstruct missing entries of the ‘X’ matrix

Will reconstruct/impute all the missing entries in the ‘X’ matrix as determined by the model. This method is intended to be used for imputing tabular data, and can be used as part of SciKit-Learn pipelines.

Note

It’s possible to use this method with ‘X’ alone, with ‘U’/’U_bin’ alone, or with both ‘X’ and ‘U’/’U_bin’ together, in which case both matrices must have the same rows.

Note

If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.

Parameters:
  • X (array(m, n), or None) – New ‘X’ data with potentially missing entries which are to be imputed. Missing entries should have value np.nan when passing a dense array.

  • y (None) – Not used. Kept as a placeholder for compatibility with SciKit-Learn pipelines.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.

  • U_bin (array(m, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.

  • W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Returns:

X – The ‘X’ matrix as a dense array with all missing entries imputed according to the model.

Return type:

array(m, n)
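
The idea behind the imputation can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the library's code: the helper impute_missing is hypothetical, the factor matrices are taken as given, and centering and biases are ignored.

```python
import numpy as np

def impute_missing(X, A, B, replace_existing=False):
    """Fill NaN entries of X with the low-rank reconstruction A @ B.T."""
    X_hat = A @ B.T
    if replace_existing:
        # Return the full reconstruction, overwriting observed entries too
        return X_hat
    # Keep observed entries, fill only the missing ones
    return np.where(np.isnan(X), X_hat, X)

X = np.array([[5.0, np.nan],
              [np.nan, 4.0]])
A = np.eye(2)                     # hypothetical row factors
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # hypothetical column factors
X_imputed = impute_missing(X, A, B)
X_replaced = impute_missing(X, A, B, replace_existing=True)
```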

CMF_implicit

class cmfrec.CMF_implicit(k=50, lambda_=1.0, alpha=1.0, use_cg=True, k_user=0, k_item=0, k_main=0, w_main=1.0, w_user=10.0, w_item=10.0, l1_lambda=0.0, center_U=True, center_I=True, niter=10, NA_as_zero_user=False, NA_as_zero_item=False, nonneg=False, nonneg_C=False, nonneg_D=False, max_cd_steps=100, apply_log_transf=False, precompute_for_predictions=True, use_float=True, max_cg_steps=3, precondition_cg=False, finalize_chol=False, random_state=1, verbose=False, produce_dicts=False, handle_interrupt=True, nthreads=-1, n_jobs=None)[source]

Collective model for implicit-feedback data

Tries to approximate the ‘X’ interactions matrix by a formula as follows:

\(\mathbf{X} \sim \mathbf{A} \mathbf{B}^T\)

While at the same time also approximating the user side information matrix ‘U’ and the item side information matrix ‘I’ as follows:

\(\mathbf{U} \sim \mathbf{A} \mathbf{C}^T\),

\(\mathbf{I} \sim \mathbf{B} \mathbf{D}^T\)

Compared to the CMF class, here the interactions matrix ‘X’ treats missing entries as zeros and non-missing entries as ones, while the values supplied for interactions are applied as weights over this binarized matrix ‘X’ (see references for more details). Roughly speaking, it is a more efficient version of CMF with hard-coded arguments NA_as_zero=True, center=False, user_bias=False, item_bias=False, scale_lam=False, plus a different initialization of factor matrices, and ‘X’ converted to a weighted binary matrix as explained earlier.
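
The weighted binary formulation can be sketched as follows. This assumes the weighting scheme from [3b], in which an entry with value x gets weight 1 + alpha * x; the function name is hypothetical, and regularization and side information are left out.

```python
import numpy as np

def implicit_weighted_loss(X, A, B, alpha=1.0):
    """Weighted squared error over the binarized matrix (regularization omitted)."""
    P = (X > 0).astype(float)      # binarized: 1 where an interaction was observed
    W = 1.0 + alpha * X            # weights; missing entries (zeros) get weight 1
    return float(np.sum(W * (P - A @ B.T) ** 2))

X = np.array([[0.0, 3.0, 0.0],
              [5.0, 0.0, 1.0]])
A = np.zeros((2, 2))               # all-zero factors predict 0 everywhere
B = np.zeros((3, 2))
loss_at_zero = implicit_weighted_loss(X, A, B, alpha=1.0)
loss_unweighted = implicit_weighted_loss(X, A, B, alpha=0.0)
```

With all-zero factors, only the observed entries contribute to the loss, each with its own weight.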

Note

The default hyperparameters in this software are very different from those of other packages. For example, to match those of the package implicit, the corresponding hyperparameters here would be use_cg=True, finalize_chol=False, k=100, lambda_=0.01, niter=15, use_float=True, alpha=1.0 (see the individual documentation of each hyperparameter for details).

Note

The default arguments are not geared towards speed. For faster fitting, use use_cg=True, finalize_chol=False, use_float=True, precompute_for_predictions=False, produce_dicts=False, and pass COO matrices or NumPy arrays instead of DataFrames to fit.

Note

The model optimization objective will not scale any of its terms according to number of entries, so hyperparameters such as lambda_ will require more tuning than in other software and trying out values over a wider range.

Note

This model is fit through the alternating least-squares method only, it does not offer a gradient-based approach like the explicit-feedback version.

Note

This model will not perform mean centering and will not fit user/item biases. If desired, an equivalent problem formulation can be made through CMF which can accommodate mean centering and biases.

Note

Recommendation quality metrics for this model can be calculated with the recometrics library.

Parameters:
  • k (int) – Number of latent factors to use (dimensionality of the low-rank factorization), which will be shared between the factorization of the ‘X’ matrix and the side info matrices. Additional non-shared components can also be specified through k_user, k_item, and k_main. Typical values are 30 to 100.

  • lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: <ignored>, <ignored>, A, B, C, D. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. For example, a good number for the LastFM-360K could be lambda_=5. Typical values are \(10^{-2}\) to \(10^2\).

  • alpha (float) – Weighting parameter for the non-zero entries in the implicit-feedback model. See [3b] for details. Note that, while the author’s suggestion for this value is 40, other software such as implicit use a value of 1, whereas Spark uses a value of 0.01 by default, and values higher than 10 are unlikely to improve results. If the data has very high values, might even be beneficial to put a very low value here - for example, for the LastFM-360K, values below 1 might give better results.

  • use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative than the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data. This option is not available when using L1 regularization and/or non-negativity constraints.

  • k_user (int) – Number of factors in the factorizing A and C matrices which will be used only for the ‘U’ matrix, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.

  • k_item (int) – Number of factors in the factorizing B and D matrices which will be used only for the ‘I’ matrix, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.

  • k_main (int) – Number of factors in the factorizing A and B matrices which will be used only for the ‘X’ matrix, while being ignored for the ‘U’ and ‘I’ matrices. These will be the last factors of the matrices once the model is fit. Will be counted in addition to those already set by k.

  • w_main (float) – Weight in the optimization objective for the errors in the factorization of the ‘X’ matrix. Note that, since the “X” matrix is considered to be full with mostly zero values, the overall sum of errors for “X” will be much larger than for the side info matrices (especially if using large alpha), thus it’s recommended to give higher weights to the side info matrices than to the main matrix.

  • w_user (float) – Weight in the optimization objective for the errors in the factorization of the ‘U’ matrix. Ignored when not passing ‘U’ to ‘fit’. Note that, since the “X” matrix is considered to be full with mostly zero values, the overall sum of errors for “X” will be much larger than for the side info matrices (especially if using large alpha), thus it’s recommended to give higher weights to the side info matrices than to the main matrix.

  • w_item (float) – Weight in the optimization objective for the errors in the factorization of the ‘I’ matrix. Ignored when not passing ‘I’ to ‘fit’. Note that, since the “X” matrix is considered to be full with mostly zero values, the overall sum of errors for “X” will be much larger than for the side info matrices (especially if using large alpha), thus it’s recommended to give higher weights to the side info matrices than to the main matrix.

  • l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. Can also pass different values for each matrix (see lambda_ for details). Note that, when adding L1 regularization, the model will be fit through a coordinate descent procedure, which is significantly slower than the Cholesky method with L2 regularization. Not recommended.

  • center_U (bool) – Whether to center the ‘U’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_user=True.

  • center_I (bool) – Whether to center the ‘I’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_item=True.

  • niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Typical values are 6 to 30.

  • NA_as_zero_user (bool) – Whether to take missing entries in the ‘U’ matrix as zeros (only when the ‘U’ matrix is passed as sparse COO matrix) instead of ignoring them. Note that passing “True” will affect the results of the functions named “warm” if no data is passed there (as it will assume zeros instead of missing).

  • NA_as_zero_item (bool) – Whether to take missing entries in the ‘I’ matrix as zeros (only when the ‘I’ matrix is passed as sparse COO matrix) instead of ignoring them.

  • nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative. In order for this to work correctly, the ‘X’ input data must also be non-negative. This constraint will also be applied to the ‘Ai’ and ‘Bi’ matrices if passing add_implicit_features=True. This option is not available when using the L-BFGS method. Note that, when determining non-negative factors, it will always use a coordinate descent method, regardless of the value passed for use_cg and finalize_chol. When used for recommender systems, one usually wants to pass ‘False’ here. For better results, use a higher regularization and more iterations.

  • nonneg_C (bool) – Whether to constrain the ‘C’ matrix to be non-negative. In order for this to work correctly, the ‘U’ input data must also be non-negative.

  • nonneg_D (bool) – Whether to constrain the ‘D’ matrix to be non-negative. In order for this to work correctly, the ‘I’ input data must also be non-negative.

  • max_cd_steps (int) – Maximum number of coordinate descent updates to perform per iteration. Pass zero for no limit. The procedure will only use coordinate descent updates when having L1 regularization and/or non-negativity constraints. This number should usually be larger than k.

  • apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’ (i.e. ‘X := log(X)’).

  • precompute_for_predictions (bool) – Whether to precompute some of the matrices that are used when making predictions from the model. If ‘False’, it will take longer to generate predictions or top-N lists, but will use less memory and will be faster to fit the model. If passing ‘False’, can be recomputed later on-demand through method ‘force_precompute_for_predictions’.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False.

  • precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="als".

  • finalize_chol (bool) – When passing use_cg=True, whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.

  • random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. If passing None, will draw a non-reproducible random integer to use as seed.

  • verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model.

  • produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.

  • handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 means using all available threads.

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
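
To illustrate the trade-off behind use_cg, max_cg_steps and finalize_chol, the following NumPy sketch solves one regularized least-squares sub-problem with an exact Cholesky solve and with a truncated conjugate gradient. It is a generic textbook version with hypothetical function names, not the solver used internally.

```python
import numpy as np

def solve_cholesky(M, b):
    """Exact solve of M x = b for symmetric positive-definite M."""
    L = np.linalg.cholesky(M)
    y = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, y)

def solve_cg(M, b, x0, max_steps=3, tol=1e-20):
    """Approximate solve of M x = b with at most max_steps CG iterations."""
    x = x0.copy()
    r = b - M @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_steps):
        if rs <= tol * (b @ b):
            break
        Mp = M @ p
        step = rs / (p @ Mp)
        x += step * p
        r -= step * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 8))      # hypothetical item factors
x_obs = rng.standard_normal(50)       # hypothetical dense row of data
M = B.T @ B + 1.0 * np.eye(8)         # normal equations with lambda_ = 1
b = B.T @ x_obs
exact = solve_cholesky(M, b)
approx = solve_cg(M, b, np.zeros(8), max_steps=3)
converged = solve_cg(M, b, np.zeros(8), max_steps=50)
```

A few CG steps give an approximate solution at a lower cost per update, which is why the result can differ slightly from what the Cholesky solver produces on the same data.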

Variables:
  • is_fitted (bool) – Whether the model has been fitted to data.

  • reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).

  • user_mapping_ (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • item_mapping_ (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • A (array(m, k_user+k+k_main)) – The obtained user factors.

  • B (array(n, k_item+k+k_main)) – The obtained item factors.

  • C (array(p, k_user+k)) – The obtained user-attributes factors.

  • D (array(q, k_item+k)) – The obtained item attributes factors.
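
The way the factor columns are split between the matrices, as described for k_user, k_item and k_main, can be sketched with NumPy slicing. The matrices below are random placeholders with the documented shapes; the sketch only shows which columns take part in each approximation and ignores centering of ‘U’ and ‘I’.

```python
import numpy as np

k_user, k, k_main, k_item = 2, 3, 1, 4
m, n, p, q = 5, 6, 7, 8

rng = np.random.default_rng(1)
A = rng.standard_normal((m, k_user + k + k_main))   # user factors
B = rng.standard_normal((n, k_item + k + k_main))   # item factors
C = rng.standard_normal((p, k_user + k))            # user-attribute factors
D = rng.standard_normal((q, k_item + k))            # item-attribute factors

# 'X' skips the first k_user columns of A and the first k_item columns of B
X_hat = A[:, k_user:] @ B[:, k_item:].T
# 'U' skips the last k_main columns of A
U_hat = A[:, :k_user + k] @ C.T
# 'I' skips the last k_main columns of B
I_hat = B[:, :k_item + k] @ D.T
```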

References

[1b]

Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).

[2b]

Singh, Ajit P., and Geoffrey J. Gordon. “Relational learning via collective matrix factorization.” Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008.

[3b]

Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” 2008 Eighth IEEE International Conference on Data Mining. Ieee, 2008.

[4b]

Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. “Applications of the conjugate gradient method for implicit feedback collaborative filtering.” Proceedings of the fifth ACM conference on Recommender systems. 2011.

[5b]

Franc, Vojtěch, Václav Hlaváč, and Mirko Navara. “Sequential coordinate-wise algorithm for the non-negative least squares problem.” International Conference on Computer Analysis of Images and Patterns. Springer, Berlin, Heidelberg, 2005.

drop_nonessential_matrices(drop_precomputed=True)

Drop matrices that are not used for prediction

Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.

This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.

Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.

After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:

  • factors_warm

  • factors_cold

  • factors_multiple

  • topN_warm

  • topN_cold

Parameters:

drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).

Returns:

self – This object with the non-essential matrices dropped.

Return type:

obj

factors_cold(U=None, U_col=None, U_val=None)[source]

Determine user-factors from new data, given U

Parameters:
  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

factors – The user-factors as determined by the model.

Return type:

array(k_user+k+k_main,)
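
The relation between the dense input ‘U’ and the sparse input ‘U_col’+’U_val’ can be sketched as below. The helper is hypothetical and follows the convention described in the notes for “fit”: entries absent from the sparse input count as missing, unless they are to be taken as zeros as with NA_as_zero_user=True.

```python
import numpy as np

def sparse_row_to_dense(U_col, U_val, p, na_as_zero=False):
    """Build the dense 1-row equivalent of a sparse (U_col, U_val) input."""
    fill = 0.0 if na_as_zero else np.nan   # absent entries: zero or missing
    row = np.full(p, fill)
    row[np.asarray(U_col)] = U_val
    return row

U_col = np.array([1, 4])        # column indices, starting at zero
U_val = np.array([2.5, 1.0])    # values in those columns
U_missing = sparse_row_to_dense(U_col, U_val, p=5)
U_zeros = sparse_row_to_dense(U_col, U_val, p=5, na_as_zero=True)
```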

factors_multiple(X=None, U=None)[source]

Determine user latent factors based on new data (warm and cold)

Determines latent factors for multiple rows/users at once given new data for them.

Note

See the documentation of “fit” for details about handling of missing values.

Note

If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.

Parameters:
  • X (CSR matrix(m_x, n), COO matrix(m_x, n), or None) – New ‘X’ data.

  • U (array(m_u, p), CSR matrix(m_u, p), COO matrix(m_u, p), or None) – User attributes information for rows in ‘X’.

Returns:

A – The new factors determined for all the rows given the new data.

Return type:

array(max(m_x,m_u), k_user+k+k_main)

factors_warm(X_col, X_val, U=None, U_col=None, U_val=None)[source]

Determine user latent factors based on new interactions data

Parameters:
  • X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.

Returns:

factors – User factors as determined from the data in ‘X_col’ and ‘X_val’.

Return type:

array(k_user+k+k_main,)
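
For intuition, the closed-form solution from [3b] for the factors of a new user can be sketched in NumPy. The sketch makes several assumptions: no side information, no non-shared factors, weights of the form 1 + alpha * x, and a hypothetical function name. The library's numerical procedure may differ.

```python
import numpy as np

def warm_factors(B, X_col, X_val, lambda_=1.0, alpha=1.0):
    """Closed-form factors for one new user from observed (X_col, X_val)."""
    k = B.shape[1]
    B_obs = B[X_col]                       # factors of the observed items
    extra_weight = alpha * X_val           # weight beyond the baseline of 1
    # B' W B, written as B'B plus a correction over the observed items only
    lhs = B.T @ B + (B_obs.T * extra_weight) @ B_obs + lambda_ * np.eye(k)
    # B' W p, where p is 1 for observed items and 0 elsewhere
    rhs = B_obs.T @ (1.0 + extra_weight)
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 3))            # hypothetical item factors
X_col = np.array([1, 4])
X_val = np.array([2.0, 5.0])
a_new = warm_factors(B, X_col, X_val, lambda_=1.0, alpha=1.0)
```

Splitting the left-hand side this way means B'B can be computed once and reused for every user, so only the observed items need to be visited per user.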

fit(X, U=None, I=None)[source]

Fit model to implicit-feedback data and user/item attributes

Note

It’s possible to pass partially disjoint sets of users/items between the different matrices (e.g. it’s possible for both the ‘X’ and ‘U’ matrices to have rows that the other doesn’t have), but note that missing values in ‘X’ are treated as zeros. The procedure supports missing values for “U” and “I”. If any of the inputs has fewer rows/columns than the other(s) (e.g. “U” has more rows than “X”, or “I” has more rows than there are columns in “X”), will assume that the rest of the rows/columns have only missing values (zero values for “X”). Note however that when having partially disjoint inputs, the order of the rows/columns matters for speed, as it might run faster when the “U”/”I” inputs that do not have matching rows/columns in “X” have those unmatched rows/columns at the end (last rows/columns) and the “X” input is shorter.

Note

When passing NumPy arrays, missing (unobserved) entries should have value np.nan. When passing sparse inputs, the zero-valued entries will be considered as missing (unless using “NA_as_zero”, and except for “X” for which missing will always be treated as zero), and it should not contain “NaN” values among the non-zero entries.

Note

In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.

Parameters:
  • X (DataFrame(nnz, 3), or sparse COO(m, n)) – Matrix to factorize. Can be passed as a SciPy sparse COO matrix (recommended), or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Value’. If passing a DataFrame, the IDs will be internally remapped.

  • U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix too.

  • I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix too.

Return type:

self
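
The internal remapping of IDs that takes place for DataFrame inputs can be pictured with the NumPy sketch below. It only illustrates the idea of mapping arbitrary IDs to row/column numbers: the variable names are hypothetical, and the order in which the library numbers the IDs may differ from the sorted order produced here.

```python
import numpy as np

# Hypothetical triplets, as in the 'UserId', 'ItemId', 'Value' columns
user_ids = np.array(["u3", "u1", "u3", "u2"])
item_ids = np.array([10, 30, 20, 10])
values = np.array([1.0, 4.0, 2.0, 5.0])

# Position i of each mapping holds the original ID of internal index i
user_mapping, row = np.unique(user_ids, return_inverse=True)
item_mapping, col = np.unique(item_ids, return_inverse=True)

# Dense view of the resulting matrix; entries without data stay at zero
X_dense = np.zeros((user_mapping.shape[0], item_mapping.shape[0]))
X_dense[row, col] = values
```

Methods that take matrices or column numbers as inputs expect this internal numeration rather than the original IDs.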

force_precompute_for_predictions()[source]

Precompute internal matrices that are used for predictions

Note

It’s not necessary to call this method if passing precompute_for_predictions=True.

Return type:

self

static from_model_matrices(A, B, precompute=True, lambda_=1.0, l1_lambda=0.0, nonneg=False, apply_log_transf=False, alpha=1.0, use_float=False, nthreads=-1, n_jobs=None)[source]

Create a CMF_implicit model object from fitted matrices

Creates a CMF_implicit model object based on fitted latent factor matrices, which might have been obtained from a different software. For example, the package python-libmf has functionality for obtaining these matrices, but not for producing recommendations or latent factors for new users, for which this function can come in handy as it will turn such a model into a CMF_implicit model which provides all such functionality.

This is only available for models without side information, and does not support user/item mappings.

Note

This is a static method and should be called like this:

CMF_implicit.from_model_matrices(...)

(i.e. no parentheses after ‘CMF_implicit’)

Parameters:
  • A (array(n_users, k)) – The obtained user factors.

  • B (array(n_items, k)) – The obtained item factors.

  • precompute (bool) – Whether to generate pre-computed matrices which can help to speed up computations on new data.

  • lambda_ (float or array(6,)) – Regularization parameter. See the documentation for __init__ for details.

  • l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. See the documentation for __init__ for details.

  • nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative.

  • apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’.

  • alpha (float) – Multiplier to apply to the confidence scores given by ‘X’.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 means using all available threads.

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.

Returns:

model – A CMF_implicit model object without side information, for which the usual prediction methods such as topN and topN_warm can be used as if it had been fitted through this software.

Return type:

CMF_implicit

get_params(deep=True)[source]

Get parameters for this estimator.

Kept for compatibility with scikit-learn.

Parameters:

deep (bool) – Ignored.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

item_factors_cold(I=None, I_col=None, I_val=None)[source]

Determine item-factors from new data, given I

Note

Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.

Parameters:
  • I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

Returns:

factors – The item-factors as determined by the model.

Return type:

array(k_item+k+k_main,)

predict(user, item)

Predict ratings/values given by existing users to existing items

Note

For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)
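
The position-by-position matching of user and item can be sketched as a row-wise dot product between factor matrices. This is an illustration only: it assumes inputs given as row/column numbers, no non-shared factors, and a hypothetical function name, with invalid combinations set to NaN as described in the note above.

```python
import numpy as np

def predict_pairs(A, B, user, item):
    """Score user[i] against item[i]; out-of-range pairs become NaN."""
    user = np.asarray(user)
    item = np.asarray(item)
    valid = (user >= 0) & (user < A.shape[0]) & (item >= 0) & (item < B.shape[0])
    out = np.full(user.shape[0], np.nan)
    # Row-wise dot products between matched user and item factors
    out[valid] = np.einsum("ij,ij->i", A[user[valid]], B[item[valid]])
    return out

A = np.array([[1.0, 0.0],
              [0.0, 2.0]])        # hypothetical user factors
B = np.array([[3.0, 1.0],
              [1.0, 1.0],
              [0.0, 5.0]])        # hypothetical item factors
pair_scores = predict_pairs(A, B, user=[0, 1, 5], item=[0, 2, 1])
```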

predict_cold(items, U=None, U_col=None, U_val=None)[source]

Predict value/confidence given by a new user to existing items, given U

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

scores – Predicted ratings for the requested items, for this user.

Return type:

array(n,)

predict_cold_multiple(item, U)[source]

Predict value/confidence given by new users to existing items, given U

Note

See the documentation of “fit” for details about handling of missing values.

Parameters:
  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(m, p), CSR matrix(m, p), or COO matrix(m, p)) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

predict_new(user, I)[source]

Predict rating given by existing users to new items, given I

Note

Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(n, q), CSR matrix(n, q), or COO matrix(n, q)) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

predict_warm(items, X_col, X_val, U=None, U_col=None, U_val=None)[source]

Predict scores for existing items, for a new user, given ‘X’

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, so both can be omitted.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, so both can be omitted.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, so both can be omitted.

Returns:

scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’, plus ‘U’ if passed.

Return type:

array(n,)
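As an illustration of the sparse format described for ‘X_col’ and ‘X_val’, the following sketch splits a dense vector with missing entries into index and value arrays using NumPy alone. The helper name dense_to_sparse is hypothetical and not part of cmfrec, and the sketch assumes the model was fit to arrays or sparse matrices, so that items are identified by zero-based column numbers.

```python
import numpy as np

def dense_to_sparse(x):
    """Split a dense 1-d vector (NaN = missing) into (column indices, values).

    Hypothetical helper for illustration; not part of cmfrec.
    """
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    cols = np.flatnonzero(observed)   # zero-based column numbers of observed entries
    vals = x[observed]                # values at those columns, in the same order
    return cols, vals

# One new user who rated items 1 and 4 out of 5 items
x_new = np.array([np.nan, 4.0, np.nan, np.nan, 2.5])
X_col, X_val = dense_to_sparse(x_new)
```

The same idea would apply to ‘U_col’ and ‘U_val’, except that the mask would select the non-zero entries instead of the non-missing ones.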

predict_warm_multiple(X, item, U=None)

Predict scores for existing items, for new users, given ‘X’

Note

See the documentation of “fit” for details about handling of missing values.

Note

If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.

Parameters:
  • X (CSR matrix(m, n) , or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Must have one row per entry of item.

  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding row of X.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)
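When the model was fit to DataFrame inputs, the columns of a new ‘X’ matrix have to follow the internal numeration mentioned in the note above. A minimal sketch of translating original item IDs into internal column indices is shown below. The array item_mapping is a made-up stand-in for the model’s item_mapping_ attribute, under the assumption that position means internal index and value means original ID.

```python
import numpy as np

# Stand-in for `model.item_mapping_` (made-up IDs):
# position = internal index, value = original ID from the 'ItemId' column.
item_mapping = np.array(["tt010", "tt007", "tt042", "tt003"])

# Reverse lookup: original ID -> internal column index
item_to_index = {item_id: idx for idx, item_id in enumerate(item_mapping)}

# Original IDs observed for some new user, translated to internal columns
requested = ["tt042", "tt010"]
internal = np.array([item_to_index[item_id] for item_id in requested])
```

The resulting indices could then be used as column numbers when building the sparse ‘X’ matrix.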

set_params(**params)

Set the parameters of this estimator.

Kept for compatibility with scikit-learn.

Note

Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

swap_users_and_items(precompute=True)

Swap the users and items in a factorization model

This method generates a new object in which the users and items of this object are swapped. The result can be used with the same methods, such as topN, in which any mention of users will then mean items and vice versa.

Note

The resulting object will not generate any deep copies of the original model’s objects.

Parameters:

precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.

Returns:

model – An object of the same class as this one, but with the users and items swapped.

Return type:

obj
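The idea behind swapping can be sketched with plain NumPy: in a low-rank factorization, exchanging the roles of the two factor matrices yields the transposed score matrix. The matrices below are random placeholders rather than fitted model parameters, and biases and side information are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # user factors (4 users, 3 latent factors)
B = rng.normal(size=(5, 3))   # item factors (5 items)

scores = A @ B.T              # rows = users, columns = items
scores_swapped = B @ A.T      # rows = items, columns = users
```

Here scores_swapped equals the transpose of scores, which is why methods phrased in terms of users can serve items after the swap.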

topN(user, n=10, include=None, exclude=None, output_score=False)

Rank top-N highest-predicted items for an existing user

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • n (int) – Number of top-N highest-predicted results to output.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
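To make the cost of an exact ranking concrete, here is a minimal NumPy sketch that scores every item for one user and keeps the n best. It is an illustration under simplifying assumptions (no biases, items identified by row number), and top_n_exact is a hypothetical helper, not the library’s implementation.

```python
import numpy as np

def top_n_exact(user_factors, item_factors, n=10, exclude=None):
    """Exact top-N by scoring every item (illustrative; ignores biases)."""
    scores = item_factors @ user_factors          # one score per item
    if exclude is not None:
        scores = scores.copy()
        scores[np.asarray(exclude, dtype=int)] = -np.inf
    n = min(n, scores.shape[0])
    # argpartition selects the n largest scores without a full sort,
    # after which only those n entries are sorted.
    top = np.argpartition(-scores, n - 1)[:n]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

item_factors = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]])
user_factors = np.array([1.0, 0.0])
items, item_scores = top_n_exact(user_factors, item_factors, n=2)
items_excl, _ = top_n_exact(user_factors, item_factors, n=2, exclude=[2])
```

The matrix-vector product touches every item, which is the part that an approximate nearest-neighbor index would avoid.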

topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘U’

Note

For better cold-start recommendations, one can also add item biases by using the CMF class with parameters that would mimic CMF_implicit plus the biases.

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_new(user, I=None, n=10, output_score=False)

Rank top-N highest-predicted items for an existing user, given ‘I’

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(m, q), CSR matrix(m, q), or COO matrix(m, q)) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.

  • n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in I.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_warm(n=10, X_col=None, X_val=None, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘X’

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, so both can be omitted.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, so both can be omitted.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, so both can be omitted.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

OMF_explicit

class cmfrec.OMF_explicit(k=50, lambda_=10.0, method='lbfgs', use_cg=True, user_bias=True, item_bias=True, center=True, k_sec=0, k_main=0, add_intercepts=True, w_user=1.0, w_item=1.0, maxiter=10000, niter=10, parallelize='separate', corr_pairs=7, max_cg_steps=3, precondition_cg=False, finalize_chol=True, NA_as_zero=False, use_float=False, random_state=1, verbose=False, print_every=100, produce_dicts=False, handle_interrupt=True, nthreads=-1, n_jobs=None)

Offsets model for explicit-feedback data

Tries to approximate the ‘X’ ratings matrix using the user side information ‘U’ and item side information ‘I’ by a formula as follows:

\(\mathbf{X} \sim (\mathbf{A} + \mathbf{U} \mathbf{C}) * (\mathbf{B} + \mathbf{I} \mathbf{D})^T\)
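The formula above can be sketched in NumPy as follows. All matrices are random placeholders with made-up dimensions, and the sketch leaves out biases, centering, the k_sec and k_main factors, and the w_user and w_item multipliers, so it shows only the shape of the computation rather than the fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, q, k = 6, 5, 3, 2, 4     # users, items, user attrs, item attrs, factors

A = rng.normal(size=(m, k))       # free user offsets
B = rng.normal(size=(n, k))       # free item offsets
U = rng.normal(size=(m, p))       # user attributes
I = rng.normal(size=(n, q))       # item attributes
C = rng.normal(size=(p, k))       # coefficients for user attributes
D = rng.normal(size=(q, k))       # coefficients for item attributes

Am = A + U @ C                    # effective user factors
Bm = B + I @ D                    # effective item factors
X_approx = Am @ Bm.T              # approximation of X, shape (m, n)
```

Each entry of X_approx is the dot product of one row of Am with one row of Bm.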

Note

This model is meant to be fit to ratings data with side info about either users or items. If there is side info about both, it’s better to use the content-based model instead.

Note

This model is meant for cold-start predictions (that is, based on side information alone). It is extremely unlikely to bring improvements compared to situations in which the classical model is able to make predictions.

Note

The ALS method works by first fitting a model with no side info and then reconstructing the parameters by least squares approximations, so when making warm-start predictions, the results will be exactly the same as if not using any side information (user/item attributes). The ALS procedure for this model was implemented for experimentation purposes only, and it’s recommended to use L-BFGS instead.

Note

It’s advised to experiment with tuning the maximum number of L-BFGS iterations and stopping earlier. Be aware that this model requires a lot more iterations to reach convergence compared to the classic and the collective models.

Note

The model optimization objective will not scale any of its terms according to number of entries, so hyperparameters such as lambda_ will require more tuning than in other software and trying out values over a wider range.

Parameters:
  • k (int) – Number of latent factors to use (dimensionality of the low-rank factorization), which will have a free component and an attribute-dependent component. Other additional separate factors can be specified through k_sec and k_main. Optionally, this parameter might be set to zero while setting k_sec and k_main for a different type of model. Typical values are 30 to 100.

  • lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: user_bias, item_bias, A, B, C, D. The attribute biases will have the same regularization as the matrices to which they apply (C and D). Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. For example, a good value for the MovieLens10M dataset would be lambda_=35. Typical values are \(10^{-2}\) to \(10^2\). Passing different regularization for each matrix is not supported with method='als'.

  • method (str, one of “lbfgs” or “als”) – Optimization method used to fit the model. If passing 'lbfgs', will fit it through a gradient-based approach using an L-BFGS optimizer. If passing 'als', will first obtain the solution ignoring the side information using an alternating least-squares procedure (the classical model described in other papers), then reconstruct the model matrices by a least-squares approximation. The ALS approach was implemented for experimentation purposes only and is not recommended.

  • use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative than the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data. Ignored when passing method="lbfgs".

  • user_bias (bool) – Whether to add user biases (intercepts) to the model.

  • item_bias (bool) – Whether to add item biases (intercepts) to the model. Be aware that using item biases with low regularization for them will tend to favor items with high average ratings regardless of the number of ratings the item has received.

  • center (bool) – Whether to center the “X” data by subtracting the mean value. For recommender systems, it’s highly recommended to pass “True” here, the more so if the model has user and/or item biases.

  • k_sec (int) – Number of factors in the factorizing matrices which are determined exclusively from user/item attributes. These will be at the beginning of the C and D matrices once the model is fit. If there are no attributes for a given matrix (user/item), then that matrix will have an extra k_sec factors (e.g. if passing user side info but not item side info, then the B matrix will have an extra k_sec factors). Will be counted in addition to those already set by k. Not supported when using method='als'.

    For a different model having only k_sec with k=0 and k_main=0, see the ContentBased class.

  • k_main (int) – Number of factors in the factorizing matrices which are determined without any user/item attributes. These will be at the end of the A and B matrices once the model is fit. Will be counted in addition to those already set by k. Not supported when using method='als'.

  • add_intercepts (bool) – Whether to add intercepts/biases to the user/item attribute matrices.

  • w_user (float) – Multiplier for the effect of the attributes contribution to the factorizing matrix A (that is, Am = A + w_user*U*C). Passing values larger than 1 has the effect of giving less freedom to the free offset term.

  • w_item (float) – Multiplier for the effect of the attributes contribution to the factorizing matrix B (that is, Bm = B + w_item*I*D). Passing values larger than 1 has the effect of giving less freedom to the free offset term.

  • maxiter (int) – Maximum L-BFGS iterations to perform. The procedure will halt if it has not converged after this number of updates. Note that, compared to the collective model, more iterations will be required for convergence here. Using higher regularization values might also decrease the number of required iterations. Pass zero for no L-BFGS iterations limit. If the procedure is spending thousands of iterations without any significant decrease in the loss function or gradient norm, it’s highly likely that the regularization is too low. Ignored when passing method='als'.

  • niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Ignored when passing method='lbfgs'. Typical values are 6 to 30.

  • parallelize (str, “separate” or “single”) – How to parallelize gradient calculations when using more than one thread with method='lbfgs'. Passing 'separate' will iterate over the data twice - first by rows and then by columns, letting each thread calculate results for each row and column, whereas passing 'single' will iterate over the data only once, and then sum the obtained results from each thread. Passing 'separate' is much more memory-efficient and less prone to irreproducibility of random seeds, but might be slower for typical use-cases. Ignored when passing nthreads=1, or method='als', or when compiling without OpenMP support.

  • corr_pairs (int) – Number of correction pairs to use for the L-BFGS optimization routine. Recommended values are between 3 and 7. Note that higher values translate into higher memory requirements. Ignored when passing method='als'.

  • max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False or method="lbfgs".

  • precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="lbfgs".

  • finalize_chol (bool) – When passing use_cg=True and method="als", whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.

  • NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix or DataFrame) instead of ignoring them. Note that this is a different model from the implicit-feedback version with weighted entries, and it’s a much faster model to fit. Be aware that this option will be ignored later when predicting on new data - that is, non-present values will be treated as missing. If passing this option, be aware that the defaults are also to perform mean centering and add user/item biases, which might be undesirable to have together with this option.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. Note however that, if using more than one thread, results might not be 100% reproducible with method='lbfgs' due to round-off errors in parallelized aggregations. If passing None, will draw a non-reproducible random integer to use as seed.

  • verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model. Be aware that, if passing ‘False’ and method='lbfgs', the optimization routine will not respond to interrupt signals.

  • print_every (int) – Print L-BFGS convergence messages every n-iterations. Ignored when passing verbose=False or method='als'.

  • handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).

  • produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 means all available threads.

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
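As a small illustration of the per-matrix form of lambda_ described above, the sketch below assembles the 6-entry array in the documented order (user_bias, item_bias, A, B, C, D). The numeric values are arbitrary placeholders chosen for the example, not recommendations.

```python
import numpy as np

# Documented order of the entries in a 6-element lambda_
order = ["user_bias", "item_bias", "A", "B", "C", "D"]

# Arbitrary placeholder values, keyed by name for readability
per_matrix = {
    "user_bias": 5.0,
    "item_bias": 5.0,
    "A": 35.0,
    "B": 35.0,
    "C": 10.0,
    "D": 10.0,
}

lambda_ = np.array([per_matrix[name] for name in order])
```

Building the array from named entries avoids mixing up positions. As stated in the parameter description, this form is not supported with method='als'.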

Variables:
  • is_fitted (bool) – Whether the model has been fitted to data.

  • reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).

  • user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’.

  • user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False, this array will be empty.

  • item_bias (array(n,)) – The obtained biases for each item (column in the ‘X’ matrix). If passing item_bias=False, this array will be empty.

  • A (array(m, k+k_main) or array(m, k_sec+k+k_main)) – The free offset for the user-factors obtained from user attributes and matrix C_. If passing k_sec>0 and no user side information, this matrix will have an extra k_sec columns at the beginning.

  • B (array(n, k+k_main) or array(n, k_sec+k+k_main)) – The free offset for the item-factors obtained from item attributes and matrix D_. If passing k_sec>0 and no item side information, this matrix will have an extra k_sec columns at the beginning.

  • C (array(p, k_sec+k)) – The obtained coefficients for the user attributes.

  • D (array(q, k_sec+k)) – The obtained coefficients for the item attributes.

  • C_bias (array(k_sec+k)) – The intercepts/biases for the C matrix.

  • D_bias (array(k_sec+k)) – The intercepts/biases for the D matrix.

  • nfev (int) – Number of function and gradient evaluations performed during the L-BFGS optimization procedure.

  • nupd (int) – Number of L-BFGS updates performed during the optimization procedure.
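A rough sketch of how the fitted components listed above could combine into a single predicted rating is given below. It assumes the usual additive form (global mean, plus user bias, plus item bias, plus the dot product of the effective factors) and uses hand-picked toy values. Am and Bm are assumed to already contain the attribute contributions, and predict_one is a hypothetical helper.

```python
import numpy as np

def predict_one(glob_mean, user_bias, item_bias, Am, Bm, user, item):
    """Illustrative score for one (user, item) pair; not the library's code."""
    return glob_mean + user_bias[user] + item_bias[item] + Am[user] @ Bm[item]

glob_mean = 3.5
user_bias = np.array([0.2, -0.1])
item_bias = np.array([0.0, 0.3, -0.4])
Am = np.array([[1.0, 0.0], [0.5, 0.5]])                 # effective user factors
Bm = np.array([[0.2, 0.1], [0.4, -0.2], [0.0, 1.0]])    # effective item factors

score = predict_one(glob_mean, user_bias, item_bias, Am, Bm, user=1, item=2)
```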

References

[1c]

Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).

drop_nonessential_matrices(drop_precomputed=True)

Drop matrices that are not used for prediction

Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.

This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.

Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.

After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:

  • factors_warm

  • factors_cold

  • factors_multiple

  • topN_warm

  • topN_cold

Parameters:

drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).

Returns:

self – This object with the non-essential matrices dropped.

Return type:

obj

factors_cold(U=None, U_col=None, U_val=None)

Determine user-factors from new data, given U

Note

For large-scale usage, these factors can be obtained by a matrix multiplication of the attributes matrix and the attribute (model parameter) C_, plus the intercept if present (C_bias_).

Note

The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.

Parameters:
  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

factors – The user-factors as determined by the model.

Return type:

array(k_sec+k+k_main,)
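The matrix multiplication mentioned in the first note can be sketched as below for many new users at once. The arrays C and C_bias are random stand-ins for the model attributes C_ and C_bias_, and the sketch assumes w_user=1 and k_main=0, so no scaling or extra free factors are involved.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 4, 3
C = rng.normal(size=(p, k))        # stand-in for model.C_
C_bias = rng.normal(size=k)        # stand-in for model.C_bias_

U_new = rng.normal(size=(1000, p)) # attributes for 1000 new users
factors = U_new @ C + C_bias       # cold-start factors, shape (1000, k)
```

A single matrix product like this avoids calling the method once per user.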

factors_warm(X=None, X_col=None, X_val=None, W=None, U=None, U_col=None, U_val=None, return_bias=False, return_raw_A=False, exact=False)

Determine user latent factors based on new ratings data

Note

The argument ‘NA_as_zero’ is ignored here.

Parameters:
  • X (array(n,) or None) – Observed new ‘X’ data for a given user, in dense format. Non-observed entries should have value np.nan.

  • X_col (array(nnz,) or None) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,) or None) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • return_bias (bool) – Whether to return also the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.

  • return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.

  • exact (bool) – Whether to calculate “A” and “Am” with the regularization applied to “A” instead of to “Am”. This is usually a slower procedure. Only relevant when passing “X” data.

Returns:

  • factors (array(k_sec+k+k_main,) or array(k+k_main,)) – User factors as determined from the data in ‘X’.

  • bias (float) – User bias as determined from the data in ‘X’. Only returned if passing return_bias=True.
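The dense and sparse inputs describe the same observations. As a minimal sketch in plain Python (the helper name `dense_to_sparse_x` is hypothetical and not part of cmfrec), a dense row with NaN for unobserved entries could be converted into the ‘X_col’/‘X_val’ pair as follows, assuming the model was fitted to a matrix so that items are identified by zero-based column numbers:

```python
import math

def dense_to_sparse_x(x_dense):
    """Split a dense row into (X_col, X_val), skipping NaN entries.

    Hypothetical helper for illustration; it is not part of cmfrec.
    """
    x_col = [j for j, v in enumerate(x_dense) if not math.isnan(v)]
    x_val = [x_dense[j] for j in x_col]
    return x_col, x_val

# A new user who rated items 1 and 4 out of n = 5 items.
x_dense = [math.nan, 4.0, math.nan, math.nan, 2.5]
X_col, X_val = dense_to_sparse_x(x_dense)
print(X_col, X_val)
```

Weights ‘W’, if used, would follow the layout of whichever form is passed: ‘n’ entries for the dense row, ‘nnz’ entries for the sparse pair.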

fit(X, U=None, I=None, W=None)

Fit model to explicit-feedback data and user/item attributes

Note

None of the side info inputs should have missing values. If passing side information ‘U’ and/or ‘I’, all entries (users/items) must be present in both the main matrix and the side info matrix.

Note

In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.

Parameters:
  • X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’, and might additionally have a column ‘Weight’. If passing a NumPy array, missing (unobserved) entries should have value np.nan. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.

  • U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.

  • I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.

  • W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Return type:

self
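The sorting note for this method can be read as ordering the sparse entries by their column index before fitting and before calling the prediction functions. The following is a minimal sketch in plain Python under that reading, using hypothetical triplet lists in place of a SciPy COO matrix. The weights are reordered together with the values so that ‘W’ keeps one entry per observed value:

```python
# Hypothetical COO-style triplets: (row, col, value) plus one weight per entry.
rows = [0, 1, 0, 2]
cols = [3, 0, 1, 0]
vals = [5.0, 3.0, 4.0, 1.0]
weights = [1.0, 0.5, 2.0, 1.0]

# Order entries by column index (ties broken by row), keeping everything aligned.
order = sorted(range(len(cols)), key=lambda i: (cols[i], rows[i]))
rows_s = [rows[i] for i in order]
cols_s = [cols[i] for i in order]
vals_s = [vals[i] for i in order]
weights_s = [weights[i] for i in order]

print(cols_s)  # column indices are now non-decreasing
```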

get_params(deep=True)

Get parameters for this estimator.

Kept for compatibility with scikit-learn.

Parameters:

deep (bool) – Ignored.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

item_factors_cold(I=None, I_col=None, I_val=None)

Determine item-factors from new data, given I

Parameters:
  • I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

Returns:

factors – The item-factors as determined by the model.

Return type:

array(k_sec+k+k_main,)

predict(user, item)

Predict ratings/values given by existing users to existing items

Note

For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)
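The position-wise matching of ‘user’ and ‘item’ can be illustrated with a generic low-rank model. This is only a sketch: the factor lists `A` and `B` are made-up values, `predict_pairs` is a hypothetical helper, and the global mean, biases and side-information components that the actual model may add are left out.

```python
# Made-up factors: 3 users and 2 items, k = 2.
A = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]   # one row per user
B = [[2.0, 1.0], [1.0, 3.0]]               # one row per item

def predict_pairs(user, item):
    """Score user[i] against item[i]; both inputs must have the same length."""
    if len(user) != len(item):
        raise ValueError("'user' and 'item' must have the same number of entries")
    return [sum(a * b for a, b in zip(A[u], B[i])) for u, i in zip(user, item)]

# Three (user, item) pairs: (0, 1), (2, 0) and (1, 1).
scores = predict_pairs(user=[0, 2, 1], item=[1, 0, 1])
print(scores)
```

The output has one score per pair, not one per combination of all users with all items.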

predict_cold(items, U=None, U_col=None, U_val=None)

Predict rating/confidence given by a new user to existing items, given U

Note

The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

scores – Predicted ratings for the requested items, for this user.

Return type:

array(n,)
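The sparse ‘U_col’/‘U_val’ pair carries the same information as the dense ‘U’ vector, with zeros left implicit (unlike ‘X’, where omitted entries are unobserved rather than zero). A minimal sketch of the round trip in plain Python; both helper names are hypothetical and not cmfrec functions:

```python
def dense_to_sparse_u(u_dense):
    """Return (U_col, U_val): zero-based indices and values of non-zero entries."""
    u_col = [j for j, v in enumerate(u_dense) if v != 0]
    u_val = [u_dense[j] for j in u_col]
    return u_col, u_val

def sparse_to_dense_u(u_col, u_val, p):
    """Rebuild the dense vector of length p from the sparse pair."""
    u_dense = [0.0] * p
    for j, v in zip(u_col, u_val):
        u_dense[j] = v
    return u_dense

u = [0.0, 1.0, 0.0, 0.0, 34.0]          # p = 5 attributes for one new user
U_col, U_val = dense_to_sparse_u(u)
u_roundtrip = sparse_to_dense_u(U_col, U_val, p=len(u))
print(U_col, U_val)
```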

predict_cold_multiple(item, U)

Predict rating/confidence given by new users to existing items, given U

Parameters:
  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(m, p), CSR matrix(m, p), or COO matrix(m, p)) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

predict_new(user, I)

Predict rating given by existing users to new items, given I

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(n, q), or COO matrix(n, q)) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

predict_warm(items, X=None, X_col=None, X_val=None, W=None, U=None, U_col=None, U_val=None)

Predict ratings for existing items, for a new user, given ‘X’

Note

The argument ‘NA_as_zero’ is ignored here.

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’. Not used when using k_sec=0.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. Not used when using k_sec=0.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. Not used when using k_sec=0.

Returns:

scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.

Return type:

array(n,)

predict_warm_multiple(X, item, U=None, W=None)

Predict ratings for existing items, for new users, given ‘X’

Note

The argument ‘NA_as_zero’ is ignored here.

Parameters:
  • X (array(m, n), CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Missing entries should have value np.nan when passing a dense array. Must have one row per entry of item.

  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding row of X.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’. Should not contain any missing values.

  • W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

set_params(**params)

Set the parameters of this estimator.

Kept for compatibility with scikit-learn.

Note

Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

swap_users_and_items(precompute=True)

Swap the users and items in a factorization model

This method generates a new object in which the users and items of this object are swapped. The result can be used with the same methods (such as topN), in which any mention of users will now mean items and vice versa.

Note

The resulting object will not generate any deep copies of the original model’s objects.

Parameters:

precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.

Returns:

model – An object of the same class as this one, but with the users and items swapped.

Return type:

obj
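In a plain low-rank factorization, swapping users and items amounts to transposing the approximated matrix. The NumPy sketch below uses made-up factor matrices and ignores biases and side information; it only illustrates why the same methods can serve both directions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # made-up user factors (m=4, k=3)
B = rng.normal(size=(5, 3))   # made-up item factors (n=5, k=3)

X_hat = A @ B.T               # users in rows, items in columns
X_hat_swapped = B @ A.T       # roles exchanged: items in rows, users in columns

# The swapped model approximates the transpose of the original matrix.
same = np.allclose(X_hat.T, X_hat_swapped)
print(same)
```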

topN(user, n=10, include=None, exclude=None, output_score=False)

Rank top-N highest-predicted items for an existing user

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • n (int) – Number of top-N highest-predicted results to output.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
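The exact ranking described in the note for this method amounts to scoring every candidate item and keeping the n best. A minimal sketch in plain Python, assuming the scores are already computed; `top_n` is a hypothetical helper, not the library's implementation:

```python
import heapq

def top_n(scores, n=10, include=None, exclude=None):
    """Return (items, top_scores) for the n highest-scoring candidates.

    'scores' is a list indexed by item number. Only one of 'include' or
    'exclude' may be given, mirroring the restriction documented above.
    """
    if include is not None and exclude is not None:
        raise ValueError("pass only one of 'include' or 'exclude'")
    if include is not None:
        candidates = list(include)
    else:
        banned = set(exclude or [])
        candidates = [j for j in range(len(scores)) if j not in banned]
    best = heapq.nlargest(n, candidates, key=lambda j: scores[j])
    return best, [scores[j] for j in best]

scores = [0.1, 0.9, 0.4, 0.7, 0.3]
items, top_scores = top_n(scores, n=2, exclude=[1])
print(items, top_scores)
```

Because every candidate is scored, the cost grows with the number of items, which is the reason the note suggests approximate nearest-neighbour software for serving.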

topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘U’

Note

The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_new(user, I, n=10, output_score=False)

Rank top-N highest-predicted items for an existing user, given ‘I’

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(m, q), or COO matrix(m, q)) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.

  • n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in I.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_warm(n=10, X=None, X_col=None, X_val=None, W=None, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘X’

Note

The argument ‘NA_as_zero’ is ignored here.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’. Not used when using k_sec=0.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. Not used when using k_sec=0.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. Not used when using k_sec=0.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

transform(X, y=None, U=None, W=None, replace_existing=False)

Reconstruct entries of the ‘X’ matrix

Will reconstruct all the entries in the ‘X’ matrix as determined by the model. This method is intended to be used for imputing tabular data, and can be used as part of SciKit-Learn pipelines.

Note

The argument ‘NA_as_zero’ is ignored here.

Note

If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.

Parameters:
  • X (array(m, n)) – New ‘X’ data with potentially missing entries which are to be imputed. Missing entries should have value np.nan.

  • y (None) – Not used. Kept as a placeholder for compatibility with SciKit-Learn pipelines.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’. Should not contain any missing values.

  • W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Returns:

X – The ‘X’ matrix with all entries as determined by the model, returned as a dense NumPy array.

Return type:

array(m, n)
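Imputation with a low-rank model can be pictured as filling the NaN positions of ‘X’ with reconstructed values. The NumPy sketch below uses made-up row and column factors in place of fitted ones, and it keeps the observed entries untouched, which is an assumption of this sketch rather than a statement about the library's behaviour.

```python
import numpy as np

# Made-up factors for m=2 rows and n=3 columns, k=2.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[2.0, 1.0], [0.5, 3.0], [1.0, 1.0]])

X = np.array([[2.0, np.nan, 1.0],
              [np.nan, 3.0, np.nan]])

X_hat = A @ B.T                               # dense reconstruction, shape (m, n)
X_imputed = np.where(np.isnan(X), X_hat, X)   # fill only the missing entries
print(X_imputed)
```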

OMF_implicit

class cmfrec.OMF_implicit(k=50, lambda_=1.0, alpha=1.0, use_cg=True, add_intercepts=True, niter=10, apply_log_transf=False, use_float=False, max_cg_steps=3, precondition_cg=False, finalize_chol=False, random_state=1, verbose=False, produce_dicts=False, handle_interrupt=True, nthreads=-1, n_jobs=None)

Offsets model for implicit-feedback data

Tries to approximate the ‘X’ interactions matrix using the user side information ‘U’ and item side information ‘I’ by a formula as follows:

\(\mathbf{X} \sim (\mathbf{A} + \mathbf{U} \mathbf{C}) * (\mathbf{B} + \mathbf{I} \mathbf{D})^T\)
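As a shape check, the approximation above can be written out in NumPy. All matrices below are random placeholders rather than fitted parameters, intercepts are omitted, and the `*` in the formula is taken to be a matrix product:

```python
import numpy as np

m, n, p, q, k = 6, 5, 3, 2, 4     # users, items, user attrs, item attrs, factors
rng = np.random.default_rng(1)

A = rng.normal(size=(m, k))       # free offsets for users
B = rng.normal(size=(n, k))       # free offsets for items
C = rng.normal(size=(p, k))       # coefficients for user attributes
D = rng.normal(size=(q, k))       # coefficients for item attributes
U = rng.normal(size=(m, p))       # user side information
I = rng.normal(size=(n, q))       # item side information

user_factors = A + U @ C          # shape (m, k)
item_factors = B + I @ D          # shape (n, k)
X_approx = user_factors @ item_factors.T
print(X_approx.shape)             # (6, 5)
```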

Note

This model was implemented for experimentation purposes only. Performance is likely to be bad. Be warned.

Note

This works by first fitting a model with no side info and then reconstructing the parameters by least squares approximations, so when making warm-start predictions, the results will be exactly the same as if not using any side information (user/item attributes).

Note

The model optimization objective will not scale any of its terms according to number of entries, so hyperparameters such as lambda_ will require more tuning than in other software and trying out values over a wider range.

Note

Recommendation quality metrics for this model can be calculated with the recometrics library.

Parameters:
  • k (int) – Number of latent factors to use (dimensionality of the low-rank approximation). Typical values are 30 to 100.

  • lambda_ (float) – Regularization parameter. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. For example, a good number for the LastFM-360K could be lambda_=5. Typical values are \(10^{-2}\) to \(10^2\).

  • alpha (float) – Weighting parameter for the non-zero entries in the implicit-feedback model. See [2d] for details. Note that, while the author’s suggestion for this value is 40, other software such as implicit use a value of 1, whereas Spark uses a value of 0.01 by default. If the data has very high values, might even be beneficial to put a very low value here - for example, for the LastFM-360K, values below 1 might give better results.

  • use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative to the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data, such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data.

  • add_intercepts (bool) – Whether to add intercepts/biases to the user/item attribute matrices.

  • niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Typical values are 6 to 30.

  • apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’ (i.e. ‘X := log(X)’)

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False.

  • precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="als".

  • finalize_chol (bool) – When passing use_cg=True, whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.

  • random_state (int, RandomState, Generator, None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. If passing None, will draw a non-reproducible random integer to use as seed.

  • verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model.

  • handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).

  • produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads, so that -1 means all available threads).

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
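For reference, joblib resolves a negative ‘n_jobs’ as the number of available CPUs plus one plus the (negative) value, so -1 means all CPUs and -2 means all but one. A minimal sketch of that rule in plain Python; `resolve_nthreads` is a hypothetical helper, and the library's own handling may differ in details such as how the CPU count is obtained:

```python
import os

def resolve_nthreads(nthreads, max_threads=None):
    """Map a possibly negative thread count to a positive one (joblib-style)."""
    if max_threads is None:
        max_threads = os.cpu_count() or 1
    if nthreads < 0:
        nthreads = max_threads + 1 + nthreads
    return max(1, nthreads)

print(resolve_nthreads(-1, max_threads=8))
print(resolve_nthreads(-2, max_threads=8))
```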

Variables:
  • is_fitted_ (bool) – Whether the model has been fitted to data.

  • reindex_ (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).

  • user_mapping_ (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • item_mapping_ (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • user_dict_ (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • item_dict_ (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • A_ (array(m, k)) – The free offset for the user-factors obtained from user attributes and matrix C_.

  • B_ (array(n, k)) – The free offset for the item-factors obtained from item attributes and matrix D_.

  • C_ (array(p, k)) – The obtained coefficients for the user attributes.

  • D_ (array(q, k)) – The obtained coefficients for the item attributes.

  • C_bias_ (array(k)) – The intercepts/biases for the C matrix.

  • D_bias_ (array(k)) – The intercepts/biases for the D matrix.

References

[1d]

Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).

[2d]

Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” 2008 Eighth IEEE International Conference on Data Mining. Ieee, 2008.

[3d]

Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. “Applications of the conjugate gradient method for implicit feedback collaborative filtering.” Proceedings of the fifth ACM conference on Recommender systems. 2011.

drop_nonessential_matrices(drop_precomputed=True)

Drop matrices that are not used for prediction

Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.

This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.

Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.

After dropping these non-essential matrices, it will no longer be possible to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:

  • factors_warm

  • factors_cold

  • factors_multiple

  • topN_warm

  • topN_cold

Parameters:

drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).

Returns:

self – This object with the non-essential matrices dropped.

Return type:

obj

factors_cold(U=None, U_col=None, U_val=None)

Determine user-factors from new data, given U

Note

For large-scale usage, these factors can be obtained by a matrix multiplication of the attributes matrix and the attribute (model parameter) C_, plus the intercept if present (C_bias_).

Note

The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.

Parameters:
  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

factors – The user-factors as determined by the model.

Return type:

array(k_sec+k+k_main,)
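The sparse format described above for ‘U_col’ and ‘U_val’ can be derived from a dense attribute vector. The following is a minimal NumPy sketch; the vector U_dense and its values are made up for illustration and are not part of the library.

```python
import numpy as np

# Hypothetical dense attribute vector for one new user (p = 6 attributes).
U_dense = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0])

# Column indices of the non-zero entries (zero-based) and their values.
U_col = np.flatnonzero(U_dense)
U_val = U_dense[U_col]

# Rebuilding the dense vector from the sparse pair recovers the original.
U_rebuilt = np.zeros_like(U_dense)
U_rebuilt[U_col] = U_val
```

Either the dense vector would be passed as ‘U’, or the pair would be passed as ‘U_col’ and ‘U_val’, but not both.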

factors_warm(X_col, X_val, return_raw_A=False)[source]

Determine user latent factors based on new interactions data

Parameters:
  • X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

  • return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.

Returns:

factors – User factors as determined from the data in ‘X_col’ and ‘X_val’.

Return type:

array(k,)

fit(X, U=None, I=None)[source]

Fit model to implicit-feedback data and user/item attributes

Note

None of the side info inputs should have missing values. If passing side information ‘U’ and/or ‘I’, all entries (users/items) must be present in both the main matrix and the side info matrix.

Note

In order to avoid potential decimal differences between the factors obtained when fitting the model and those obtained when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also to pass the data with indices sorted (by column) to the prediction functions.

Parameters:
  • X (DataFrame(nnz, 3), or sparse COO(m, n)) – Matrix to factorize. Can be passed as a SciPy sparse COO matrix (recommended), or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Value’. If passing a NumPy array, missing (unobserved) entries should have value np.nan. If passing a DataFrame, the IDs will be internally remapped.

  • U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix too. Should not contain any missing values.

  • I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix too. Should not contain any missing values.

Return type:

self
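The note above about sorting sparse data by columns could be handled along the following lines. This is only a sketch of one possible reading of that note: it assumes the data is held as plain integer index arrays (not data frame IDs), and all the arrays below are hypothetical.

```python
import numpy as np

# Hypothetical COO-style triplets (row, column, value) in arbitrary order.
row = np.array([1, 0, 1, 0, 2])
col = np.array([3, 2, 0, 0, 1])
val = np.array([5.0, 1.0, 2.0, 4.0, 3.0])

# Stable sort by column index; ties keep their original relative order.
order = np.argsort(col, kind="stable")
row_sorted, col_sorted, val_sorted = row[order], col[order], val[order]

# The same idea for one new user's observed entries (X_col / X_val).
X_col = np.array([7, 2, 5])
X_val = np.array([1.0, 3.0, 2.0])
o = np.argsort(X_col, kind="stable")
X_col_sorted, X_val_sorted = X_col[o], X_val[o]
```

The sorted triplets would then be used to build the matrix passed to ‘fit’, and the sorted pair would be passed to the warm-start prediction functions.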

get_params(deep=True)[source]

Get parameters for this estimator.

Kept for compatibility with scikit-learn.

Parameters:

deep (bool) – Ignored.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

item_factors_cold(I=None, I_col=None, I_val=None)

Determine item-factors from new data, given I

Parameters:
  • I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

Returns:

factors – The item-factors as determined by the model.

Return type:

array(k_sec+k+k_main,)

predict(user, item)

Predict ratings/values given by existing users to existing items

Note

For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)
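Since ‘user’ and ‘item’ are matched position by position, scoring every combination of a set of users and a set of items requires expanding both inputs first. The sketch below shows one way to build such inputs with NumPy; the IDs are hypothetical and no model call is made.

```python
import numpy as np

users = np.array([0, 1, 2])   # hypothetical user IDs / row numbers
items = np.array([10, 11])    # hypothetical item IDs / column numbers

# Expand to every (user, item) combination: 3 users x 2 items = 6 pairs.
user_arg = np.repeat(users, items.shape[0])
item_arg = np.tile(items, users.shape[0])

pairs = list(zip(user_arg.tolist(), item_arg.tolist()))
```

Passing user_arg and item_arg to predict should then give one score per pair, which can be reshaped to (len(users), len(items)) if a grid is wanted.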

predict_cold(items, U=None, U_col=None, U_val=None)

Predict rating/confidence given by a new user to existing items, given U

Note

The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

scores – Predicted ratings for the requested items, for this user.

Return type:

array(n,)

predict_cold_multiple(item, U)

Predict rating/confidence given by new users to existing items, given U

Parameters:
  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(m, p), CSR matrix(m, p), or COO matrix(m, p)) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

predict_new(user, I)

Predict rating given by existing users to new items, given I

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(n, q), or COO matrix(n, q)) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

predict_warm(items, X_col, X_val)[source]

Predict scores for existing items, for a new user, given ‘X’

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

Returns:

scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.

Return type:

array(n,)

predict_warm_multiple(X, item, U=None)[source]

Predict scores for existing items, for new users, given ‘X’

Parameters:
  • X (array(m, n), CSR matrix(m, n) , or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Missing entries should have value np.nan when passing a dense array. Must have one row per entry of item.

  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding row of X.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’. Should not contain any missing values.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

set_params(**params)

Set the parameters of this estimator.

Kept for compatibility with scikit-learn.

Note

Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

swap_users_and_items(precompute=True)

Swap the users and items in a factorization model

This method will generate a new object that will have the users and items of this object swapped. The result can be used with the same methods (such as topN), in which any mention of users will now mean items and vice-versa.

Note

The resulting object will not generate any deep copies of the original model’s objects.

Parameters:

precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.

Returns:

model – An object of the same class as this one, but with the user and items swapped.

Return type:

obj
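Conceptually, swapping users and items corresponds to factorizing the transpose of ‘X’. The sketch below illustrates only that relation with stand-in factor matrices A and B (hypothetical names), ignoring biases and side information; it is not the library’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # stand-in user factors (m=4, k=3)
B = rng.standard_normal((5, 3))  # stand-in item factors (n=5, k=3)

X_hat = A @ B.T            # users as rows, items as columns
X_hat_swapped = B @ A.T    # roles exchanged: items as rows, users as columns
```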

topN(user, n=10, include=None, exclude=None, output_score=False)

Rank top-N highest-predicted items for an existing user

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • n (int) – Number of top-N highest-predicted results to output.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
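The exact ranking described in the note above amounts to scoring every candidate item and keeping the best n. The following is a rough NumPy sketch of that idea under simplifying assumptions: scores are plain dot products of factor vectors (no biases), items are identified by row number, and the function top_n_exact and its inputs are hypothetical rather than part of the library.

```python
import numpy as np

def top_n_exact(user_vec, item_factors, n=10, include=None, exclude=None):
    """Score every candidate item for one user and return the n best."""
    if include is not None and exclude is not None:
        raise ValueError("Pass only one of 'include' or 'exclude'.")
    candidates = np.arange(item_factors.shape[0])
    if include is not None:
        candidates = np.asarray(include)
    elif exclude is not None:
        candidates = np.setdiff1d(candidates, np.asarray(exclude))
    scores = item_factors[candidates] @ user_vec
    n = min(n, candidates.shape[0])
    # argpartition selects the n largest scores; only those are then sorted.
    top = np.argpartition(-scores, n - 1)[:n]
    top = top[np.argsort(-scores[top], kind="stable")]
    return candidates[top], scores[top]

item_factors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                         [2.0, 0.25], [0.5, 0.5]])
user_vec = np.array([1.0, 2.0])
items, scores = top_n_exact(user_vec, item_factors, n=3)
items_excl, _ = top_n_exact(user_vec, item_factors, n=2, exclude=[2])
```

The cost grows linearly with the number of candidate items, which is why an approximate nearest-neighbor index can be preferable for serving.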

topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘U’

Note

The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_new(user, I, n=10, output_score=False)

Rank top-N highest-predicted items for an existing user, given ‘I’

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(m, q), or COO matrix(m, q)) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.

  • n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in I.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_warm(n=10, X_col=None, X_val=None, include=None, exclude=None, output_score=False)[source]

Compute top-N highest-predicted items for a new user, given ‘X’

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).

  • X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

ContentBased

class cmfrec.ContentBased(k=20, lambda_=100.0, user_bias=False, item_bias=False, add_intercepts=True, maxiter=3000, corr_pairs=3, parallelize='separate', verbose=False, print_every=100, random_state=1, use_float=True, produce_dicts=False, handle_interrupt=True, start_with_ALS=True, nthreads=-1, n_jobs=None)[source]

Content-based recommendation model

Fits a recommendation model to explicit-feedback data based on user and item attributes only, which makes it better suited to cold-start recommendations and gives faster prediction times. Follows the same factorization approach as the classical model, but with the latent-factor matrices being determined as linear combinations of the user and item attributes - this is similar to a two-layer neural network with separate layers for each input.

The ‘X’ is approximated using the user side information ‘U’ and item side information ‘I’ by a formula as follows:

\(\mathbf{X} \sim (\mathbf{U} \mathbf{C}) * (\mathbf{I} \mathbf{D})^T\)

Note

This is a highly non-linear model that will take many more L-BFGS iterations to converge compared to the other models. It’s advised to experiment with tuning the maximum number of iterations.

Note

The input data for attributes does not undergo any transformations when fitting this model, which is to some extent sensitive to the scales of the variables and their means in the same way as regularized linear regression.

Note

In order to obtain the final user-factors and item-factors matrices that are used to factorize ‘X’ from a fitted-model object, you’ll need to perform a matrix multiplication between the side info (‘U’ and ‘I’) and the fitted parameters (‘C_’ and ‘D_’) - e.g. ‘A = U*model.C_ + model.C_bias_’.
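The matrix multiplications mentioned in this note can be sketched with NumPy as follows. The arrays C, C_bias, D and D_bias below are random stand-ins for the fitted parameters (no model is fitted here), and any global mean or user/item bias terms that a fitted model might hold are left out.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, q, k = 4, 5, 3, 2, 6

U = rng.standard_normal((m, p))   # user attributes
I = rng.standard_normal((n, q))   # item attributes

# Stand-ins for the fitted parameters C_, C_bias_, D_, D_bias_.
C, C_bias = rng.standard_normal((p, k)), rng.standard_normal(k)
D, D_bias = rng.standard_normal((q, k)), rng.standard_normal(k)

A = U @ C + C_bias    # user factors, shape (m, k)
B = I @ D + D_bias    # item factors, shape (n, k)
X_approx = A @ B.T    # low-rank part of the approximation, shape (m, n)
```

Note that the product here is a matrix product (the @ operator in NumPy), not an element-wise multiplication.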

Parameters:
  • k (int) – Number of latent factors to use (dimensionality of the low-rank approximation). Recommended values are 30 to 100.

  • lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: user_bias, item_bias, [ignored], [ignored], C, D. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. Recommended values are \(10^{-2}\) to \(10^2\).

  • user_bias (bool) – Whether to add user biases (intercepts) to the model.

  • item_bias (bool) – Whether to add item biases (intercepts) to the model. Be aware that using item biases with low regularization for them will tend to favor items with high average ratings regardless of the number of ratings the item has received.

  • add_intercepts (bool) – Whether to add intercepts/biases to the user/item attribute matrices.

  • maxiter (int) – Maximum L-BFGS iterations to perform. The procedure will halt if it has not converged after this number of updates. Note that, compared to the collective model, more iterations will be required for convergence here. Using higher regularization values might also decrease the number of required iterations. Pass zero for no L-BFGS iterations limit. If the procedure is spending thousands of iterations without any significant decrease in the loss function or gradient norm, it’s highly likely that the regularization is too low.

  • corr_pairs (int) – Number of correction pairs to use for the L-BFGS optimization routine. Recommended values are between 3 and 7. Note that higher values translate into higher memory requirements.

  • parallelize (str, “separate” or “single”) – How to parallelize gradient calculations when using more than one thread. Passing 'separate' will iterate over the data twice - first by rows and then by columns, letting each thread calculate results for each row and column, whereas passing 'single' will iterate over the data only once, and then sum the obtained results from each thread. Passing 'separate' is much more memory-efficient and less prone to irreproducibility of random seeds, but might be slower for typical use-cases. Ignored when passing nthreads=1 or compiling without OpenMP support.

  • verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model. Be aware that, if passing ‘False’, the optimization routine will not respond to interrupt signals.

  • print_every (int) – Print L-BFGS convergence messages every n-iterations. Ignored when passing verbose=False.

  • random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. Note however that, if using more than one thread, results might not be 100% reproducible due to round-off errors in parallelized aggregations. If passing None, will draw a non-reproducible random integer to use as seed.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.

  • handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).

  • start_with_ALS (bool) – Whether to determine the initial coefficients through an ALS procedure. This might help to speed up the procedure by starting closer to an optimum. This option is not available when the side information is passed as sparse matrices.

    Note that this option will not work (will throw an error) if there are users or items without side information, or if the input data is otherwise problematic (e.g. users/items which are duplicates of each other).

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 means all available threads.

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
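As an illustration of the joblib-style convention for negative thread counts mentioned under ‘nthreads’, the sketch below resolves a requested value into a positive number of threads. It assumes the joblib convention in which -1 means all available threads; the helper resolve_nthreads is hypothetical, and its clamping to a minimum of one thread (including for zero) is an assumption rather than documented behavior.

```python
def resolve_nthreads(nthreads, max_threads):
    """Map a joblib-style thread count to a positive number of threads."""
    if nthreads < 0:
        # -1 -> max_threads, -2 -> max_threads - 1, and so on.
        nthreads = max_threads + 1 + nthreads
    # Assumption: never return fewer than one thread.
    return max(1, nthreads)

all_threads = resolve_nthreads(-1, 8)
all_but_one = resolve_nthreads(-2, 8)
explicit = resolve_nthreads(4, 8)
```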

Variables:
  • is_fitted (bool) – Whether the model has been fitted to data.

  • reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).

  • user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’.

  • user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False (the default), this array will be empty.

  • item_bias (array(n,)) – The obtained biases for each item (column in the ‘X’ matrix). If passing item_bias=False (the default), this array will be empty.

  • C (array(p, k)) – The obtained coefficients for the user attributes.

  • D (array(q, k)) – The obtained coefficients for the item attributes.

  • C_bias (array(k)) – The intercepts/biases for the C matrix.

  • D_bias (array(k)) – The intercepts/biases for the D matrix.

  • nfev (int) – Number of function and gradient evaluations performed during the L-BFGS optimization procedure.

  • nupd (int) – Number of L-BFGS updates performed during the optimization procedure.

References

[1e]

Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).

drop_nonessential_matrices(drop_precomputed=True)

Drop matrices that are not used for prediction

Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.

This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.

Can additionally drop some of the precomputed matrices which are only used in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.

After dropping these non-essential matrices, it will no longer be possible to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:

  • factors_warm

  • factors_cold

  • factors_multiple

  • topN_warm

  • topN_cold

Parameters:

drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).

Returns:

self – This object with the non-essential matrices dropped.

Return type:

obj

factors_cold(U=None, U_col=None, U_val=None)[source]

Determine user-factors from new data, given U

Note

For large-scale usage, these factors can be obtained by a matrix multiplication of the attributes matrix and the attribute coefficients (model parameter C_), plus the intercept if present (C_bias_).

Parameters:
  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

factors – The user-factors as determined by the model.

Return type:

array(k,)
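For a single user given in sparse format, the matrix multiplication described in the note above only involves the rows of the coefficient matrix that correspond to non-zero attributes. The sketch below shows this with NumPy, using random stand-ins for C_ and C_bias_ (no fitted model is involved) and made-up attribute values.

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 6, 4
C = rng.standard_normal((p, k))    # stand-in for the fitted C_
C_bias = rng.standard_normal(k)    # stand-in for the fitted C_bias_

# One new user in sparse format: attributes 1 and 4 are non-zero.
U_col = np.array([1, 4])
U_val = np.array([1.5, 2.0])

# Only the rows of C that match non-zero attributes contribute.
factors_sparse = U_val @ C[U_col] + C_bias

# Dense equivalent for comparison.
U_dense = np.zeros(p)
U_dense[U_col] = U_val
factors_dense = U_dense @ C + C_bias
```

Both routes give the same vector of length k, so the sparse form is mainly a convenience when most attributes are zero.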

factors_multiple(U=None)[source]

Determine user-factors from new data for multiple rows, given U

Parameters:

U (array-like(m, p)) – User attributes in the new data.

Returns:

factors – The user-factors as determined by the model.

Return type:

array(m, k)

fit(X, U, I, W=None)[source]

Fit model to explicit-feedback data based on user-item attributes

Note

None of the side info inputs should have missing values. All entries (users/items) must be present in both the main matrix and the side info matrix.

Note

In order to avoid potential decimal differences between the factors obtained when fitting the model and those obtained when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also to pass the data with indices sorted (by column) to the prediction functions.

Parameters:
  • X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’. If passing a NumPy array, missing (unobserved) entries should have value np.nan. Might additionally have a column ‘Weight’. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.

  • U (array(m, p), COO(m, p), DataFrame(m, p+1)) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.

  • I (array(n, q), COO(n, q), DataFrame(n, q+1)) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.

  • W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Return type:

self
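The shape requirement for ‘W’ described above can be illustrated as follows. This sketch builds a dense ‘X’ with a matching 2-d ‘W’, then the triplet form of the same data with one weight per stored value; all values are made up, and how weights at unobserved positions of a dense ‘W’ are treated is not covered here.

```python
import numpy as np

# Dense case: 'X' is (m, n) with np.nan for unobserved entries, 'W' is also (m, n).
X_dense = np.array([[5.0, np.nan, 3.0],
                    [np.nan, 4.0, np.nan]])
W_dense = np.ones_like(X_dense)
W_dense[0, 2] = 0.5   # down-weight one observation

# Sparse case: keep only observed entries as COO-style triplets,
# with one weight per stored value.
rows, cols = np.nonzero(~np.isnan(X_dense))
vals = X_dense[rows, cols]
W_sparse = W_dense[rows, cols]
```

In the sparse case the weights array lines up element by element with the stored values, so both must be kept in the same order.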

get_params(deep=True)[source]

Get parameters for this estimator.

Kept for compatibility with scikit-learn.

Parameters:

deep (bool) – Ignored.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(user, item)

Predict ratings/values given by existing users to existing items

Note

For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

predict_cold(U, items)[source]

Predict rating given by new users to existing items, given U

Parameters:
  • U (array(n, p), CSR(n, p), or COO(n, p)) – Attributes for the users whose ratings are to be predicted. Each row will be matched to the corresponding row of ‘items’.

  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

predict_new(U, I)[source]

Predict rating given by new users to new items, given U and I

Parameters:
  • U (array(n, p), CSR(n, p), or COO(n, p)) – Attributes for the users whose ratings are to be predicted. Each row will be matched to the corresponding row of ‘I’.

  • I (array(n, q), CSR(n, q), or COO(n, q)) – Attributes for the items whose ratings are to be predicted. Each row will be matched to the corresponding row of ‘U’.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

set_params(**params)

Set the parameters of this estimator.

Kept for compatibility with scikit-learn.

Note

Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

swap_users_and_items(precompute=True)

Swap the users and items in a factorization model

This method generates a new object in which the users and items of this object are swapped. The result can be used with the same methods, such as topN, with any mention of users now meaning items and vice-versa.

Note

The resulting object will not generate any deep copies of the original model’s objects.

Parameters:

precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.

Returns:

model – An object of the same class as this one, but with the user and items swapped.

Return type:

obj
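Conceptually, swapping users and items amounts to transposing the factorization. The sketch below shows this with plain NumPy under the assumption of a model with no biases and no side information; the matrices are made-up values, not output from the library.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # user factors (3 users)
B = np.array([[2.0, 0.0], [0.0, 3.0]])               # item factors (2 items)

scores = A @ B.T             # rows are users, columns are items
scores_swapped = B @ A.T     # rows are items, columns are users

# "Which user scores highest for item 1" becomes an ordinary row-wise
# ranking once the roles are swapped.
best_user_for_item_1 = int(np.argmax(scores_swapped[1]))
```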

topN(user, n=10, include=None, exclude=None, output_score=False)

Rank top-N highest-predicted items for an existing user

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • n (int) – Number of top-N highest-predicted results to output.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
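An exact top-N ranking with ‘include’/‘exclude’ filtering can be sketched in NumPy as follows. This is only an illustration under the assumption that a score for every item is already available; the helper top_n_exact is hypothetical and may differ from how the library computes its ranking.

```python
import numpy as np


def top_n_exact(scores, n=10, include=None, exclude=None):
    """Return (item indices, scores) of the n highest scores, best first."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'")
    scores = np.asarray(scores, dtype=float)
    if include is not None:
        candidates = np.asarray(include)
    else:
        candidates = np.arange(scores.shape[0])
        if exclude is not None:
            candidates = np.setdiff1d(candidates, np.asarray(exclude))
    n = min(n, candidates.shape[0])
    if n == 0:
        return candidates[:0], scores[:0]
    cand_scores = scores[candidates]
    # argpartition isolates the n largest entries; only those get sorted.
    top = np.argpartition(-cand_scores, n - 1)[:n]
    top = top[np.argsort(-cand_scores[top])]
    return candidates[top], cand_scores[top]


item_scores = np.array([0.3, 2.5, 1.1, 4.0, -0.7])
ids, vals = top_n_exact(item_scores, n=2, exclude=[3])
ids_inc, vals_inc = top_n_exact(item_scores, n=2, include=[0, 3, 4])
```

Every item is scored before ranking, which is why the cost grows with the number of items.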

topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘U’

Note

The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
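The dense (‘U’) and sparse (‘U_col’ + ‘U_val’) forms of the attributes for one user carry the same information. The sketch below converts between them with NumPy, assuming that entries absent from the sparse form are zeros; the variable names mirror the parameter names, but the values are made up.

```python
import numpy as np

# Dense attribute vector for one new user (p = 6 attributes).
U = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0])

# Equivalent sparse form: zero-based column indices and their values.
U_col = np.flatnonzero(U)
U_val = U[U_col]

# Rebuilding the dense row recovers the original vector.
U_rebuilt = np.zeros_like(U)
U_rebuilt[U_col] = U_val
```

Only one of the two forms should be passed in a given call.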

topN_new(n=10, U=None, U_col=None, U_val=None, I=None, output_score=False)[source]

Compute top-N highest-predicted items for a given user, given U

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • U (array(p,), or None) – User attributes for the user for whom to rank items. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_col (None or array(nnz)) – User attributes for the user for whom to rank items, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes for the user for whom to rank items, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • I (array(n2, q), CSR(n2, q), or COO(n2, q)) – Attributes for the items to rank (each row corresponding to an item). Must have at least ‘n’ rows.

  • output_score (bool) – Whether to output the scores in addition to the row numbers. If passing ‘False’, will return a single array with the item numbers, otherwise will return a tuple with the item numbers and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items among ‘I’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

MostPopular

class cmfrec.MostPopular(implicit=False, center=True, user_bias=False, lambda_=10.0, alpha=1.0, NA_as_zero=False, scale_lam=False, scale_bias_const=False, apply_log_transf=False, use_float=False, produce_dicts=False, nthreads=-1, n_jobs=None)[source]

Non-Personalized recommender model

Fits a model with only the intercept terms (biases), in order to provide non-personalized recommendations.

This class is provided as a benchmark - if your personalized-recommendations model does not manage to beat this under the evaluation metrics of interest, chances are, that model needs to be reworked.

It minimizes the same objective functions as the other classes and offers the same options (e.g. centering, scaling regularization, etc.), but fits only the biases.

Note

Implicit-feedback recommendation quality metrics for this model can be calculated with the recometrics library.

Parameters:
  • implicit (bool) – Whether to use the implicit-feedback model, in which the ‘X’ matrix is assumed to have only binary entries, each of them having a weight in the loss function given by the observed user-item interactions and other parameters.

  • center (bool) – Whether to center the “X” data by subtracting the mean value. Ignored (assumed “False”) when passing implicit=True.

  • user_bias (bool) – Whether to add user biases to the model. Not supported for implicit feedback (implicit=True).

  • lambda_ (float) – Regularization parameter. For the explicit-feedback case (default), lower values will tend to favor the highest-rated items regardless of the number of observations. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries.

  • alpha (float) – Weighting parameter for the non-zero entries in the implicit-feedback model. See [2f] for details. Note that, while the author’s suggestion for this value is 40, other software such as implicit use a value of 1, whereas Spark uses a value of 0.01 by default. See the documentation of CMF_implicit for more details.

  • NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix or DataFrame) instead of ignoring them.

  • scale_lam (bool) – Whether to scale (increase) the regularization parameter for each estimated bias according to the number of non-missing entries in the data. This is only available when passing implicit=False.

    It is not recommended to use this option, as when passing True, it tends to recommend items which have a single user interaction with the maximum possible value (e.g. 5-star movies from only 1 user). By default, scale_bias_const is also set to True, so in order to have the regularization scale for each user/item, that option also needs to be turned off.

  • scale_bias_const (bool) – When passing scale_lam=True, whether to apply the same scaling to the regularization for all users and items, according to the average number of non-missing entries rather than to the number of entries for each specific user/item.

    While this tends to result in worse RMSE, it tends to make the top-N recommendations less likely to select items with only a few interactions from only a few users.

    Ignored when passing scale_lam=False.

  • apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’ (i.e. ‘X := log(X)’). This is only available with implicit=True.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 means all available threads. Most of the work is done single-threaded, however.

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.

Variables:
  • is_fitted (bool) – Whether the model has been fitted to data.

  • reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).

  • user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.

  • user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.

  • glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’ (only for explicit-feedback case).

  • user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False (the default), this array will be empty.

  • item_bias (array(n,)) – The obtained biases for each item (column in the ‘X’ matrix). Items are ranked according to these values.
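The effect of lambda_ on the ranking can be seen in a simplified closed form. The sketch below assumes explicit feedback, centering, no user biases, and an objective equal to the sum of squared errors plus lambda_ times the squared bias (not divided by the number of entries), in which case each item bias is sum(x - mean) / (count + lambda_). The helper item_biases is hypothetical, and the library's exact objective may differ.

```python
import numpy as np

# Ratings matrix (users x items); np.nan marks unobserved entries.
X = np.array([
    [5.0,    4.0, np.nan],
    [np.nan, 4.0, np.nan],
    [np.nan, 5.0, 1.0],
    [np.nan, 4.0, np.nan],
])


def item_biases(X, lam):
    """Closed-form ridge estimate of per-item biases around the global mean."""
    observed = ~np.isnan(X)
    glob_mean = X[observed].mean()
    resid_sum = np.where(observed, X - glob_mean, 0.0).sum(axis=0)
    counts = observed.sum(axis=0)
    return resid_sum / (counts + lam)


bias_low_reg = item_biases(X, lam=0.01)    # barely regularized
bias_high_reg = item_biases(X, lam=10.0)   # the class default
```

With little regularization, the item holding a single 5-star rating ranks first; with lambda_=10, the item with four good ratings overtakes it, which is the behavior described for this parameter.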

References

[1f]

Koren, Yehuda, Robert Bell, and Chris Volinsky. “Matrix factorization techniques for recommender systems.” Computer 42.8 (2009): 30-37.

[2f]

Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

drop_nonessential_matrices(drop_precomputed=True)

Drop matrices that are not used for prediction

Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.

This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.

Can additionally drop some of the precomputed matrices which are only used in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.

After dropping these non-essential matrices, it will no longer be possible to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:

  • factors_warm

  • factors_cold

  • factors_multiple

  • topN_warm

  • topN_cold

Parameters:

drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).

Returns:

self – This object with the non-essential matrices dropped.

Return type:

obj

fit(X, W=None)[source]

Fit intercepts-only model to data.

Parameters:
  • X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and either ‘Rating’ (explicit-feedback, default) or ‘Value’ (implicit feedback). If passing a NumPy array, missing (unobserved) entries should have value np.nan under both explicit and implicit feedback. Might additionally have a column ‘Weight’ for the explicit-feedback case. If passing a DataFrame, the IDs will be internally remapped.

  • W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’; if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Return type:

self
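The shape requirement for ‘W’ follows from the layout chosen for ‘X’. The NumPy sketch below builds the same small data set in both layouts; the triplet arrays stand in for the row, col, and data attributes that a SciPy COO matrix stores, and all values are made up for illustration.

```python
import numpy as np

# Dense layout: an (m, n) array where np.nan marks unobserved entries.
X_dense = np.array([
    [5.0,    np.nan, 3.0],
    [np.nan, 2.0,    np.nan],
])
W_dense = np.ones_like(X_dense)   # weights must also be (m, n)
W_dense[0, 2] = 0.5

# Triplet (COO-style) layout: one row index, column index, and value per
# observed entry.
rows, cols = np.nonzero(~np.isnan(X_dense))
data = X_dense[rows, cols]
W_sparse = W_dense[rows, cols]    # 1-d, one weight per stored entry
```

In the sparse layout the weights line up position by position with the stored values, so both arrays have length nnz.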

get_params(deep=True)[source]

Get parameters for this estimator.

Kept for compatibility with scikit-learn.

Parameters:

deep (bool) – Ignored.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(user, item)

Predict ratings/values given by existing users to existing items

Note

For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

set_params(**params)

Set the parameters of this estimator.

Kept for compatibility with scikit-learn.

Note

Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

swap_users_and_items(precompute=True)

Swap the users and items in a factorization model

This method generates a new object in which the users and items of this object are swapped. The result can be used with the same methods, such as topN, with any mention of users now meaning items and vice-versa.

Note

The resulting object will not generate any deep copies of the original model’s objects.

Parameters:

precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.

Returns:

model – An object of the same class as this one, but with the user and items swapped.

Return type:

obj

topN(user=None, n=10, include=None, exclude=None, output_score=False)[source]

Compute top-N highest-predicted items

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’. Only relevant if using user biases and outputting score.

  • n (int) – Number of top-N highest-predicted results to output.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

CMF_imputer

class cmfrec.CMF_imputer(k=40, lambda_=10.0, method='als', use_cg=True, user_bias=True, item_bias=True, center=True, add_implicit_features=False, scale_lam=False, scale_lam_sideinfo=False, scale_bias_const=False, k_user=0, k_item=0, k_main=0, w_main=1.0, w_user=1.0, w_item=1.0, w_implicit=0.5, l1_lambda=0.0, center_U=True, center_I=True, maxiter=800, niter=10, parallelize='separate', corr_pairs=4, max_cg_steps=3, precondition_cg=False, finalize_chol=True, NA_as_zero=False, NA_as_zero_user=False, NA_as_zero_item=False, nonneg=False, nonneg_C=False, nonneg_D=False, max_cd_steps=100, precompute_for_predictions=True, include_all_X=True, use_float=True, random_state=1, verbose=False, print_every=10, handle_interrupt=True, produce_dicts=False, nthreads=-1, n_jobs=None)[source]

A wrapper for CMF allowing argument ‘y’ in ‘fit’ and ‘transform’ (left as a placeholder only, not used for anything), which can be used as part of scikit-learn pipelines due to having this extra parameter.

Everything else is exactly the same as for ‘CMF’ - see its documentation for details.
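The role of the placeholder ‘y’ can be sketched with a hypothetical stand-in class. The example below is a simple column-mean imputer, not CMF, and only illustrates the calling convention in which ‘y’ is accepted by both methods and never read.

```python
import numpy as np


class ColumnMeanImputer:
    """Hypothetical stand-in following the fit(X, y) / transform(X, y) shape."""

    def fit(self, X, y=None):
        # 'y' is accepted so that callers which always pass it do not fail;
        # it is never read.
        self.col_means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X, y=None):
        X = np.array(X, dtype=float)   # copy, so the input is left untouched
        missing = np.isnan(X)
        # Fill each missing cell with the mean of its column.
        X[missing] = np.take(self.col_means_, np.nonzero(missing)[1])
        return X


X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
y_unused = np.array([0, 1, 0])
X_filled = ColumnMeanImputer().fit(X, y_unused).transform(X, y_unused)
```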

drop_nonessential_matrices(drop_precomputed=True)

Drop matrices that are not used for prediction

Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.

This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.

Can additionally drop some of the precomputed matrices which are only used in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.

After dropping these non-essential matrices, it will no longer be possible to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:

  • factors_warm

  • factors_cold

  • factors_multiple

  • topN_warm

  • topN_cold

Parameters:

drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).

Returns:

self – This object with the non-essential matrices dropped.

Return type:

obj

factors_cold(U=None, U_bin=None, U_col=None, U_val=None)

Determine user-factors from new data, given U

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Parameters:
  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

factors – The user-factors as determined by the model.

Return type:

array(k_user+k+k_main,)

factors_multiple(X=None, U=None, U_bin=None, W=None, return_bias=False)

Determine user latent factors based on new data (warm and cold)

Determines latent factors for multiple rows/users at once given new data for them.

Note

See the documentation of “fit” for details about handling of missing values.

Note

If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match the numbering that was produced by the model. In that case, the mappings are available under attributes self.user_mapping_ and self.item_mapping_.

Parameters:
  • X (array(m_x, n), CSR matrix(m_x, n), COO matrix(m_x, n), or None) – New ‘X’ data.

  • U (array(m_u, p), CSR matrix(m_u, p), COO matrix(m_u, p), or None) – User attributes information for rows in ‘X’.

  • U_bin (array(m_ub, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.

  • W (array(m_x, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’; if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

  • return_bias (bool) – Whether to return also the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.

Returns:

  • A (array(max(m_x,m_u,m_ub), k_user+k+k_main)) – The new factors determined for all the rows given the new data.

  • bias (array(max(m_x,m_u,m_ub)) or None) – The user bias given the new ‘X’ data. Only returned if passing return_bias=True.

factors_warm(X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, return_bias=False)

Determine user latent factors based on new ratings data

Parameters:
  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.

  • return_bias (bool) – Whether to return also the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.

  • return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.

Returns:

  • factors (array(k_user+k+k_main,) or array(k+k_main,)) – User factors as determined from the data in ‘X’.

  • bias (float) – User bias as determined from the data in ‘X’. Only returned if passing return_bias=True.
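For intuition, the warm-start calculation can be sketched as a regularized least-squares problem against the item factors of the observed entries. The sketch assumes a model with no side information and no biases, with a sum-of-squares objective; the helper warm_factors and all values are hypothetical, and the library's actual procedure handles more cases.

```python
import numpy as np


def warm_factors(B, x, lam, glob_mean=0.0):
    """Ridge solution for one new user's factors given item factors B."""
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    B_obs = B[observed]                  # factors of the rated items only
    target = x[observed] - glob_mean     # centered ratings
    k = B.shape[1]
    lhs = B_obs.T @ B_obs + lam * np.eye(k)
    rhs = B_obs.T @ target
    return np.linalg.solve(lhs, rhs)


B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # 3 items, k = 2
x_new = np.array([4.0, np.nan, 5.0])                  # item 1 is unobserved
a_new = warm_factors(B, x_new, lam=1.0, glob_mean=3.0)
```

The unobserved item does not enter the system at all, which corresponds to treating np.nan entries as missing rather than as zeros.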

fit(X, y=None, U=None, I=None, U_bin=None, I_bin=None, W=None)[source]

Fit model to explicit-feedback data and user/item attributes

Note

It’s possible to pass partially disjoint sets of users/items between the different matrices (e.g. it’s possible for both the ‘X’ and ‘U’ matrices to have rows that the other doesn’t have). The procedure supports missing values for all inputs (except for “W”). If any of the inputs has fewer rows/columns than the other(s) (e.g. “U” has more rows than “X”, or “I” has more rows than there are columns in “X”), will assume that the rest of the rows/columns have only missing values. Note however that when having partially disjoint inputs, the order of the rows/columns matters for speed, as it might run faster when the “U”/“I” inputs that do not have matching rows/columns in “X” have those unmatched rows/columns at the end (last rows/columns) and the “X” input is shorter. See also the parameter include_all_X for info about predicting with mismatched “X”.

Note

When passing NumPy arrays, missing (unobserved) entries should have value np.nan. When passing sparse inputs, the zero-valued entries will be considered as missing (unless using “NA_as_zero”), and it should not contain “NaN” values among the non-zero entries.

Note

In order to avoid potential small numerical differences between the factors obtained when fitting the model and those obtained when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also to pass the data with indices sorted (by column) to the prediction functions.

Parameters:
  • X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’. Might additionally have a column ‘Weight’. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.

  • U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.

  • U_bin (array(m, p_bin), DataFrame(m, p_bin+1), or None) – User binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. Cannot be passed as a sparse matrix. Note that ‘U’ and ‘U_bin’ are not mutually exclusive. Only supported with method='lbfgs'.

  • I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.

  • I_bin (array(n, q_bin), DataFrame(n, q_bin+1), or None) – Item binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. Cannot be passed as a sparse matrix. Note that ‘I’ and ‘I_bin’ are not mutually exclusive. Only supported with method='lbfgs'.

  • W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’; if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array. Cannot have missing values.

Return type:

self

force_precompute_for_predictions()

Precompute internal matrices that are used for predictions

Note

It’s not necessary to call this method if passing precompute_for_predictions=True.

Return type:

self

static from_model_matrices(A, B, glob_mean=0.0, precompute=True, user_bias=None, item_bias=None, lambda_=10.0, scale_lam=False, l1_lambda=0.0, nonneg=False, NA_as_zero=False, scaling_biasA=None, scaling_biasB=None, use_float=False, nthreads=-1, n_jobs=None)

Create a CMF model object from fitted matrices

Creates a CMF model object based on fitted latent factor matrices, which might have been obtained from different software. For example, the package python-libmf has functionality for obtaining these matrices, but not for producing recommendations or latent factors for new users. This function can come in handy in such cases, as it turns the matrices into a CMF model which provides all that functionality.

This is only available for models without side information, and does not support user/item mappings.

Note

This is a static method, which should be called on the class itself, like this:

CMF.from_model_matrices(...)

(i.e. no parentheses after ‘CMF’)

Parameters:
  • A (array(n_users, k)) – The obtained user factors.

  • B (array(n_items, k)) – The obtained item factors.

  • glob_mean (float) – The obtained global mean, if the model underwent centering. If passing zero, will assume that the values are not to be centered.

  • precompute (bool) – Whether to generate pre-computed matrices which can help to speed up computations on new data.

  • user_bias (None or array(n_users,)) – The obtained user biases. If passing None, will assume that the model did not include user biases.

  • item_bias (None or array(n_items,)) – The obtained item biases. If passing None, will assume that the model did not include item biases.

  • lambda_ (float or array(6,)) – Regularization parameter. See the documentation for __init__ for details.

  • scale_lam (bool) – Whether to scale (increase) the regularization parameter for each row of the model matrices according to the number of non-missing entries in the data for that particular row.

  • l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. See the documentation for __init__ for details.

  • nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative.

  • NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix) instead of ignoring them. See the documentation for __init__ for details.

  • scaling_biasA (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the user biases.

  • scaling_biasB (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the item biases.

  • use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.

  • nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 uses all available threads.

  • n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.

Returns:

model – A CMF model object without side information, for which the usual prediction methods such as topN and topN_warm can be used as if it had been fitted through this software.

Return type:

CMF
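
The sketch below shows how the inputs of this method are assumed to combine into a prediction, using the usual formulation for a factorization model with centering and biases: global mean, plus user bias, plus item bias, plus the dot product of the two factor vectors. The helper score is hypothetical and is not part of the package; it uses plain lists in place of NumPy arrays.

```python
def score(A, B, u, i, glob_mean=0.0, user_bias=None, item_bias=None):
    """Hypothetical helper: predicted value for user row u and item row i."""
    s = glob_mean + sum(a * b for a, b in zip(A[u], B[i]))
    if user_bias is not None:   # None means the model has no user biases
        s += user_bias[u]
    if item_bias is not None:   # None means the model has no item biases
        s += item_bias[i]
    return s

# Made-up factors with k=2: two users and three items.
A = [[1.0, 0.0], [0.5, 2.0]]
B = [[2.0, 1.0], [0.0, 1.0], [1.0, 1.0]]

full = score(A, B, 1, 2, glob_mean=3.0,
             user_bias=[0.25, -0.5], item_bias=[0.0, 0.25, 0.5])
plain = score(A, B, 0, 0)
print(full)   # 5.5
print(plain)  # 2.0
```

With NumPy arrays of the same shapes, these are the values that would be passed as A, B, glob_mean, user_bias and item_bias to CMF.from_model_matrices.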

get_params(deep=True)

Get parameters for this estimator.

Kept for compatibility with scikit-learn.

Parameters:

deep (bool) – Ignored.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

item_factors_cold(I=None, I_bin=None, I_col=None, I_val=None)

Determine item-factors from new data, given I

Note

Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.

Parameters:
  • I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_bin (array(q_bin,), or None) – Binary attributes for the new item, in dense format. Only supported with method='lbfgs'.

  • I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.

  • I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.

Returns:

factors – The item-factors as determined by the model.

Return type:

array(k_item+k+k_main,)

predict(user, item)

Predict ratings/values given by existing users to existing items

Note

For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)
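
Since user and item are matched position by position, the output has one score per pair rather than one per combination. The following sketch, with made-up IDs, shows the pairing and one way to expand the inputs when every combination of a few users and items is wanted.

```python
from itertools import product

# Element-wise pairing: entry j of 'user' goes with entry j of 'item'.
users = [0, 0, 2]
items = [1, 3, 1]
pairs = list(zip(users, items))
print(pairs)  # [(0, 1), (0, 3), (2, 1)]

# To score every user against every item, build the full grid first
# and pass the two expanded sequences as 'user' and 'item'.
grid = list(product([0, 2], [1, 3]))
grid_users = [u for u, _ in grid]
grid_items = [i for _, i in grid]
print(grid_users)  # [0, 0, 2, 2]
print(grid_items)  # [1, 3, 1, 3]
```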

predict_cold(items, U=None, U_bin=None, U_col=None, U_val=None)

Predict rating given by a new user to existing items, given U

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

Returns:

scores – Predicted ratings for the requested items, for this user.

Return type:

array(n,)
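
The dense and sparse ways of passing the user attributes describe the same row. A minimal sketch of the conversion, assuming a made-up attribute vector with p=5 and using plain lists in place of NumPy arrays:

```python
# Dense format: one value per attribute column.
U = [0.0, 1.0, 0.0, 0.0, 2.5]

# Sparse format: column indices (starting at zero) of the non-zero
# entries, plus the values found in those columns.
U_col = [j for j, v in enumerate(U) if v != 0.0]
U_val = [U[j] for j in U_col]

print(U_col)  # [1, 4]
print(U_val)  # [1.0, 2.5]
```

Either ‘U’ alone or the pair ‘U_col’ and ‘U_val’ would then be passed, but not both.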

predict_cold_multiple(item, U=None, U_bin=None)

Predict rating given by new users to existing items, given U

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Parameters:
  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.

  • U_bin (array(m, p_bin), or None) – Binary attributes for the users to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item. Only supported with method='lbfgs'.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

predict_new(user, I=None, I_bin=None)

Predict rating given by existing users to new items, given I

Note

Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.

Parameters:
  • user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(n, q), CSR matrix(n, q), COO matrix(n, q), or None) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user. Might contain missing values.

  • I_bin (array(n, q_bin), or None) – Binary attributes for the items to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user. Might contain missing values. Only supported with method='lbfgs'.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(n,)

predict_warm(items, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None)

Predict ratings for existing items, for a new user, given ‘X’

Parameters:
  • items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.

  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

Returns:

scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.

Return type:

array(n,)
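
For the ‘X’ data of a new user, the sparse format lists the observed entries, and anything not listed is treated as not observed. The sketch below converts a made-up dense row into ‘X_col’ and ‘X_val’, assuming a model fit on five item columns without DataFrame inputs, and shows how the weights shrink from one per column to one per observed entry.

```python
import math

# Dense format: one slot per item column, NaN where nothing was observed.
X = [math.nan, 4.0, math.nan, 1.5, math.nan]
W = [1.0, 2.0, 1.0, 0.5, 1.0]

# Sparse format: column numbers (starting at zero) of the observed
# entries, their values, and one weight per observed entry.
X_col = [j for j, v in enumerate(X) if not math.isnan(v)]
X_val = [X[j] for j in X_col]
W_sparse = [W[j] for j in X_col]

print(X_col)     # [1, 3]
print(X_val)     # [4.0, 1.5]
print(W_sparse)  # [2.0, 0.5]
```

If the model was fit to a DataFrame, ‘X_col’ would hold entries from the ‘ItemId’ column instead of column numbers.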

predict_warm_multiple(X, item, U=None, U_bin=None, W=None)

Predict ratings for existing items, for new users, given ‘X’

Note

See the documentation of “fit” for details about handling of missing values.

Parameters:
  • X (array(m, n), CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Must have one row per entry of item.

  • item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding row of X.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.

  • U_bin (array(m, p_bin)) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.

  • W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Returns:

scores – Predicted ratings for the requested user-item combinations.

Return type:

array(m,)

set_params(**params)

Set the parameters of this estimator.

Kept for compatibility with scikit-learn.

Note

Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

swap_users_and_items(precompute=True)

Swap the users and items in a factorization model

This method will generate a new object that will have the users and items of this object swapped. The result can be used with the same methods, such as topN, in which any mention of users will now mean items and vice-versa.

Note

The resulting object will not generate any deep copies of the original model’s objects.

Parameters:

precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.

Returns:

model – An object of the same class as this one, but with the users and items swapped.

Return type:

obj

topN(user, n=10, include=None, exclude=None, output_score=False)

Rank top-N highest-predicted items for an existing user

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • n (int) – Number of top-N highest-predicted results to output.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
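
The exact ranking described in the note amounts to scoring every candidate item and keeping the n best. A minimal sketch of that idea, including the ‘include’/‘exclude’ behavior, is shown below. It assumes the scores for one user are already available as a list indexed by item column; the helper top_n is hypothetical and does not reflect the package’s internal implementation.

```python
import heapq

def top_n(scores, n=10, include=None, exclude=None):
    """Hypothetical helper: indices of the n highest scores, best first."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    if include is not None:
        candidates = list(include)          # rank only among these items
    else:
        excluded = set(exclude or [])       # rank everything except these
        candidates = [i for i in range(len(scores)) if i not in excluded]
    return heapq.nlargest(n, candidates, key=lambda i: scores[i])

scores = [0.2, 0.9, 0.4, 0.7, 0.1]
print(top_n(scores, n=2))                     # [1, 3]
print(top_n(scores, n=2, exclude=[1]))        # [3, 2]
print(top_n(scores, n=2, include=[0, 2, 4]))  # [2, 0]
```

Every candidate is visited once, which is why the cost grows with the number of items.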

topN_cold(n=10, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘U’

Note

If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_new(user, I=None, I_bin=None, n=10, output_score=False)

Rank top-N highest-predicted items for an existing user, given ‘I’

Note

If the model was fit to both ‘I’ and ‘I_bin’, can pass a partially-disjoint set to both - that is, both can have rows that the other doesn’t. In such case, the rows that they have in common should come first, and then one of them should have missing values appended so that one of the matrices ends up containing all the rows of the other.

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.

  • I (array(m, q), CSR matrix(m, q), COO matrix(m, q), or None) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.

  • I_bin (array(m, q_bin), or None) – Binary attributes for the items to rank. Data frames with ‘ItemId’ column are not supported. Only supported with method='lbfgs'.

  • n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in ‘I’/’I_bin’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’/’I_bin’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

topN_warm(n=10, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)

Compute top-N highest-predicted items for a new user, given ‘X’

Note

This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.

Parameters:
  • n (int) – Number of top-N highest-predicted results to output.

  • X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.

  • W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.

  • U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.

  • U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.

  • include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.

  • output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.

Returns:

  • items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.

  • scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.

transform(X=None, y=None, U=None, U_bin=None, W=None, replace_existing=False)

Reconstruct missing entries of the ‘X’ matrix

Will reconstruct/impute all the missing entries in the ‘X’ matrix as determined by the model. This method is intended to be used for imputing tabular data, and can be used as part of scikit-learn pipelines.

Note

It’s possible to use this method with ‘X’ alone, with ‘U’/’U_bin’ alone, or with both ‘X’ and ‘U’/’U_bin’ together, in which case both matrices must have the same rows.

Note

If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.

Parameters:
  • X (array(m, n), or None) – New ‘X’ data with potentially missing entries which are to be imputed. Missing entries should have value np.nan when passing a dense array.

  • y (None) – Not used. Kept as a placeholder for compatibility with scikit-learn pipelines.

  • U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.

  • U_bin (array(m, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.

  • W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.

Returns:

X – The ‘X’ matrix as a dense array with all missing entries imputed according to the model.

Return type:

array(m, n)
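
Conceptually, the imputation fills each missing entry with the model’s reconstruction for that position. The sketch below illustrates this under the assumption that observed entries are left untouched, which is what the default replace_existing=False suggests. The helper impute and the reconstruction X_hat are hypothetical stand-ins, with made-up values, for what the fitted model would produce.

```python
import math

def impute(X, X_hat):
    """Hypothetical helper: fill the NaN entries of X from X_hat."""
    return [
        [x_hat if math.isnan(x) else x for x, x_hat in zip(row, row_hat)]
        for row, row_hat in zip(X, X_hat)
    ]

# Input data with missing entries, and a made-up model reconstruction.
X = [[5.0, math.nan], [math.nan, 2.0]]
X_hat = [[4.5, 3.0], [1.0, 2.5]]

X_imputed = impute(X, X_hat)
print(X_imputed)  # [[5.0, 3.0], [1.0, 2.0]]
```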

Indices and tables