Collective Matrix Factorization
This is the documentation page for the Python package cmfrec. For more details, see the project’s GitHub page:
https://www.github.com/david-cortes/cmfrec/
For the R version, see the CRAN page:
Sample Usage
For an introduction to the package and methods, see the example IPython notebook building recommendation models on the MovieLens10M dataset with and without side information:
Model evaluation
Metrics for implicit-feedback models or recommendation quality can be calculated using the recometrics library:
Naming conventions
This package uses the following general naming conventions:
- About data:
‘X’ -> data about interactions between users/rows and items/columns (e.g. ratings given by users to items).
‘U’ -> data about user/row attributes (e.g. user’s age).
‘I’ -> data about item/column attributes (e.g. a movie’s genre).
- About function naming:
‘warm’ -> predictions based on new, unseen ‘X’ data, potentially also including new ‘U’ data.
‘cold’ -> predictions based on new user attributes data ‘U’, without ‘X’.
‘new’ -> predictions about new items based on attributes data ‘I’.
- About function descriptions:
‘existing’ -> the user/item was present in the training data that was passed to ‘fit’.
‘new’ -> the user/item was not present in the training data that was passed to ‘fit’.
Be aware that the package’s functions are user-centric (e.g. it will recommend items for users, but not users for items). If predictions about new items are desired, it’s recommended to use the method ‘swap_users_and_items’, as the item-based functions which are provided for convenience might run a lot slower than their user equivalents.
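To make the user-centric remark concrete: if ‘X’ is approximated by A Bᵀ, then its transpose is approximated by B Aᵀ, so exchanging the roles of users and items turns an item-based query into a user-based one. The NumPy sketch below only illustrates that identity with made-up factor matrices. It does not call cmfrec, and it is an assumption that ‘swap_users_and_items’ builds on the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # hypothetical user factors (4 users, 3 latent factors)
B = rng.normal(size=(5, 3))   # hypothetical item factors (5 items, 3 latent factors)

X_hat = A @ B.T               # users in rows, items in columns
X_hat_swapped = B @ A.T       # items in rows, users in columns

# Scoring all users for item 2 in the original orientation gives the same
# numbers as scoring all "items" for "user" 2 after swapping the factors.
scores_for_item_2 = X_hat[:, 2]
scores_after_swap = X_hat_swapped[2, :]
```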
Models
CMF
- class cmfrec.CMF(k=40, lambda_=10.0, method='als', use_cg=True, user_bias=True, item_bias=True, center=True, add_implicit_features=False, scale_lam=False, scale_lam_sideinfo=False, scale_bias_const=False, k_user=0, k_item=0, k_main=0, w_main=1.0, w_user=1.0, w_item=1.0, w_implicit=0.5, l1_lambda=0.0, center_U=True, center_I=True, maxiter=800, niter=10, parallelize='separate', corr_pairs=4, max_cg_steps=3, precondition_cg=False, finalize_chol=True, NA_as_zero=False, NA_as_zero_user=False, NA_as_zero_item=False, nonneg=False, nonneg_C=False, nonneg_D=False, max_cd_steps=100, precompute_for_predictions=True, include_all_X=True, use_float=True, random_state=1, verbose=True, print_every=10, handle_interrupt=True, produce_dicts=False, nthreads=-1, n_jobs=None)
Collective or multi-view matrix factorization
Tries to approximate the ‘X’ interactions matrix by a formula as follows:
\(\mathbf{X} \sim \mathbf{A} \mathbf{B}^T\)
While at the same time also approximating the user/row side information matrix ‘U’ and the item/column side information matrix ‘I’ as follows:
\(\mathbf{U} \sim \mathbf{A} \mathbf{C}^T\),
\(\mathbf{I} \sim \mathbf{B} \mathbf{D}^T\)
The matrices (“A”, “B”, “C”, “D”) are obtained by minimizing the error with respect to the non-missing entries in the input data (“X”, “U”, “I”). Might apply sigmoid transformations to binary columns in U and I too.
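As a rough illustration of the objective described above, the following NumPy sketch sums the squared errors of the three factorizations over the non-missing (non-NaN) entries and adds an L2 penalty. It is a simplified stand-in, not the package's implementation: biases, centering, implicit features, sigmoid transformations, and the exact scaling constants are all left out, and the function name `cmf_loss` is hypothetical.

```python
import numpy as np

def cmf_loss(X, U, I, A, B, C, D, lam=10.0, w_main=1.0, w_user=1.0, w_item=1.0):
    """Weighted squared error over non-missing entries plus an L2 penalty."""
    def sq_err(M, M_hat):
        mask = ~np.isnan(M)                      # skip missing entries
        return np.sum((M[mask] - M_hat[mask]) ** 2)

    err = (w_main * sq_err(X, A @ B.T)           # X ~ A B^T
           + w_user * sq_err(U, A @ C.T)         # U ~ A C^T
           + w_item * sq_err(I, B @ D.T))        # I ~ B D^T
    reg = lam * sum(np.sum(M ** 2) for M in (A, B, C, D))
    return err + reg                             # not divided by the number of entries

# Made-up data that the factors reproduce exactly.
rng = np.random.default_rng(0)
m, n, p, q, k = 4, 5, 2, 3, 2
A, B = rng.normal(size=(m, k)), rng.normal(size=(n, k))
C, D = rng.normal(size=(p, k)), rng.normal(size=(q, k))
X = A @ B.T
X[0, 1] = np.nan                                 # a missing rating is simply ignored
U = A @ C.T
I = B @ D.T

loss_no_reg = cmf_loss(X, U, I, A, B, C, D, lam=0.0)
loss_reg = cmf_loss(X, U, I, A, B, C, D, lam=10.0)
```

Because the sum is not normalized by the number of entries, the balance between the error term and the penalty shifts with the amount of data, which is one way to see why the regularization needs careful tuning.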
This is the most flexible of the models available in this package, and can also mimic the implicit-feedback version through the option ‘NA_as_zero’ plus an array of weights.
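The recipe for imitating the implicit-feedback model, as given in the description of the ‘NA_as_zero’ parameter, is to set every observed ‘X’ value to one and to use weights W := 1 + alpha*X. A minimal sketch of preparing those two arrays follows; the counts and the value of `alpha` are made up for illustration.

```python
import numpy as np

counts = np.array([3.0, 1.0, 7.0])   # hypothetical interaction counts (observed entries only)
alpha = 40.0                          # hypothetical confidence multiplier

X_values = np.ones_like(counts)       # every observed interaction becomes a 1
W_values = 1.0 + alpha * counts       # W := 1 + alpha * X
```

These arrays would then serve as the ‘X’ values and the weights ‘W’ passed to ‘fit’ on a model created with NA_as_zero=True. As that parameter's description points out, the default mean centering and user/item biases might be undesirable in this setup.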
Note
The default arguments are not geared towards speed. For faster fitting, use method="als", use_cg=True, finalize_chol=False, use_float=True, precompute_for_predictions=False, produce_dicts=False, and pass COO matrices or NumPy arrays instead of DataFrames to fit.
Note
By default, the model optimization objective will not scale any of its terms according to the number of entries (see parameter scale_lam), so hyperparameters such as lambda_ will require more tuning than in other software, and values should be tried over a wider range.
- Parameters
k (int) – Number of latent factors to use (dimensionality of the low-rank factorization), which will be shared between the factorization of the ‘X’ matrix and the side info matrices. Additional non-shared components can also be specified through k_user, k_item, and k_main. Typical values are 30 to 100.
lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: user_bias, item_bias, A, B, C, D. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries anywhere, so this parameter needs good tuning. For example, a good value for the MovieLens10M would be lambda_=35.0 (or lambda_=0.05 with scale_lam=True). Typical values are \(10^{-2}\) to \(10^2\).
method (str, one of “lbfgs” or “als”) – Optimization method used to fit the model. If passing 'lbfgs', will fit it through a gradient-based approach using an L-BFGS optimizer. L-BFGS is typically much slower and much less memory-efficient than 'als', but tends to reach better local optima and allows some variations of the problem which ALS doesn’t, such as applying sigmoid transformations for binary side information.
use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative to the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data, such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data. This option is not available when using L1 regularization and/or non-negativity constraints. Ignored when passing method="lbfgs".
user_bias (bool) – Whether to add user/row biases (intercepts) to the model. If using it for purposes other than recommender systems, including them is usually not recommended.
item_bias (bool) – Whether to add item/column biases (intercepts) to the model. Be aware that using item biases with low regularization for them will tend to favor items with high average ratings regardless of the number of ratings the item has received.
center (bool) – Whether to center the “X” data by subtracting the mean value. For recommender systems, it’s highly recommended to pass “True” here, the more so if the model has user and/or item biases.
add_implicit_features (bool) – Whether to automatically add so-called implicit features from the data, as in reference [5a] and similar. If using this for recommender systems with small amounts of data, it’s recommended to pass ‘True’ here.
scale_lam (bool) – Whether to scale (increase) the regularization parameter for each row of the model matrices (A, B, C, D) according to the number of non-missing entries in the data for that particular row, as proposed in reference [7a]. For the A and B matrices, the regularization will only be scaled according to the number of non-missing entries in “X” (see also the scale_lam_sideinfo parameter). Note that, when using the options NA_as_zero_*, all entries are considered to be non-missing. If passing “True” here, the optimal value for lambda_ will be much smaller (and likely below 0.1). This option tends to give better results, but requires more hyperparameter tuning. Only supported for method="als".
When generating factors based on side information alone, if passing scale_lam_sideinfo, will regularize assuming there was one observation present. Be aware that using this option without scale_lam_sideinfo=True can lead to bad cold-start recommendations as it will set a very small regularization for users who have no ‘X’ data.
Warning: in smaller datasets, using this option can result in top-N recommendations having mostly items with very few interactions (see parameter scale_bias_const).
scale_lam_sideinfo (bool) – Whether to scale (increase) the regularization parameter for each row of the “A” and “B” matrices according to the number of non-missing entries in both “X” and the side info matrices “U” and “I”. If passing “True” here, scale_lam will also be assumed to be “True”.
scale_bias_const (bool) – When passing scale_lam=True and user_bias=True or item_bias=True, whether to apply the same scaling to the regularization of the biases to all users and items, according to the average number of non-missing entries rather than to the number of entries for each specific user/item.
While this tends to result in worse RMSE, it tends to make the top-N recommendations less likely to select items with only a few interactions from only a few users.
Ignored when passing scale_lam=False or not using user/item biases.
k_user (int) – Number of factors in the factorizing A and C matrices which will be used only for the ‘U’ and ‘U_bin’ matrices, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.
k_item (int) – Number of factors in the factorizing B and D matrices which will be used only for the ‘I’ and ‘I_bin’ matrices, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.
k_main (int) – Number of factors in the factorizing A and B matrices which will be used only for the ‘X’ matrix, while being ignored for the ‘U’, ‘U_bin’, ‘I’, and ‘I_bin’ matrices. These will be the last factors of the matrices once the model is fit. Will be counted in addition to those already set by k.
w_main (float) – Weight in the optimization objective for the errors in the factorization of the ‘X’ matrix.
w_user (float) – Weight in the optimization objective for the errors in the factorization of the ‘U’ and ‘U_bin’ matrices. Ignored when passing neither ‘U’ nor ‘U_bin’ to ‘fit’.
w_item (float) – Weight in the optimization objective for the errors in the factorization of the ‘I’ and ‘I_bin’ matrices. Ignored when passing neither ‘I’ nor ‘I_bin’ to ‘fit’.
w_implicit (float) – Weight in the optimization objective for the errors in the factorizations of the implicit ‘X’ matrices. Note that, depending on the sparsity of the data, the sum of errors from these factorizations might be much larger than for the original ‘X’ and a smaller value will perform better. It is recommended to tune this parameter carefully. Ignored when passing add_implicit_features=False.
l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. Can also pass different values for each matrix (see lambda_ for details). Note that, when adding L1 regularization, the model will be fit through a coordinate descent procedure, which is significantly slower than the Cholesky method with L2 regularization. Only supported with method="als". Not recommended.
center_U (bool) – Whether to center the ‘U’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_user=True.
center_I (bool) – Whether to center the ‘I’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_item=True.
maxiter (int) – Maximum L-BFGS iterations to perform. The procedure will halt if it has not converged after this number of updates. Note that, compared to the other models, fewer iterations will be required for convergence here. Using higher regularization values might also decrease the number of required iterations. Pass zero for no L-BFGS iterations limit. If the procedure is spending hundreds of iterations without any significant decrease in the loss function or gradient norm, it’s highly likely that the regularization is too low. Ignored when passing method='als'.
niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Ignored when passing method='lbfgs'. Typical values are 6 to 30.
parallelize (str, “separate” or “single”) – How to parallelize gradient calculations when using more than one thread with method='lbfgs'. Passing 'separate' will iterate over the data twice - first by rows and then by columns, letting each thread calculate results for each row and column, whereas passing 'single' will iterate over the data only once, and then sum the obtained results from each thread. Passing 'separate' is much more memory-efficient and less prone to irreproducibility of random seeds, but might be slower for typical use-cases. Ignored when passing nthreads=1, or method='als', or when compiling without OpenMP support.
corr_pairs (int) – Number of correction pairs to use for the L-BFGS optimization routine. Recommended values are between 3 and 7. Note that higher values translate into higher memory requirements. Ignored when passing method='als'.
max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False or method="lbfgs".
precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="lbfgs".
finalize_chol (bool) – When passing use_cg=True and method="als", whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.
NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix or DataFrame) instead of ignoring them. Note that this is a different model from the implicit-feedback version with weighted entries, and it’s a much faster model to fit. Note that passing “True” will affect the results of the functions named “cold” (as it will assume zeros instead of missing). It is possible to obtain equivalent results to the implicit-feedback model if passing “True” here, and then passing an “X” to fit with all values set to one and weights corresponding to the actual values of “X” multiplied by alpha, plus 1 (W := 1 + alpha*X to imitate the implicit-feedback model). If passing this option, be aware that the defaults are also to perform mean centering and add user/item biases, which might be undesirable to have together with this option.
NA_as_zero_user (bool) – Whether to take missing entries in the ‘U’ matrix as zeros (only when the ‘U’ matrix is passed as sparse COO matrix) instead of ignoring them. Note that passing “True” will affect the results of the functions named “warm” if no data is passed there (as it will assume zeros instead of missing).
NA_as_zero_item (bool) – Whether to take missing entries in the ‘I’ matrix as zeros (only when the ‘I’ matrix is passed as sparse COO matrix) instead of ignoring them.
nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative. In order for this to work correctly, the ‘X’ input data must also be non-negative. This constraint will also be applied to the ‘Ai’ and ‘Bi’ matrices if passing add_implicit_features=True.
Important: be aware that the default options are to perform mean centering and to add user and item biases, which might be undesirable and hinder performance when having non-negativity constraints (especially mean centering).
This option is not available when using the L-BFGS method. Note that, when determining non-negative factors, it will always use a coordinate descent method, regardless of the value passed for use_cg and finalize_chol. When used for recommender systems, one usually wants to pass ‘False’ here. For better results, do not use centering alongside this option, and use a higher regularization coupled with more iterations.
nonneg_C (bool) – Whether to constrain the ‘C’ matrix to be non-negative. In order for this to work correctly, the ‘U’ input data must also be non-negative.
Note: by default, the ‘U’ data will be centered by columns, which doesn’t play well with non-negativity constraints. One will likely want to pass center_U=False along with this.
nonneg_D (bool) – Whether to constrain the ‘D’ matrix to be non-negative. In order for this to work correctly, the ‘I’ input data must also be non-negative.
Note: by default, the ‘I’ data will be centered by columns, which doesn’t play well with non-negativity constraints. One will likely want to pass center_I=False along with this.
max_cd_steps (int) – Maximum number of coordinate descent updates to perform per iteration. Pass zero for no limit. The procedure will only use coordinate descent updates when having L1 regularization and/or non-negativity constraints. This number should usually be larger than k.
precompute_for_predictions (bool) – Whether to precompute some of the matrices that are used when making predictions from the model. If ‘False’, it will take longer to generate predictions or top-N lists, but will use less memory and will be faster to fit the model. If passing ‘False’, can be recomputed later on-demand through method ‘force_precompute_for_predictions’.
include_all_X (bool) – When passing an input “X” to fit which has fewer columns than rows in “I”, whether to still make calculations about the items which are in “I” but not in “X”. This has three effects: (a) the topN functionality may recommend such items, (b) the precomputed matrices will be less usable as they will include all such items, (c) it will be possible to pass “X” data to the new factors or topN functions that include such columns (rows of “I”). This option is ignored when using NA_as_zero.
use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. Note however that, if using more than one thread, results might not be 100% reproducible with method='lbfgs' due to round-off errors in parallelized aggregations. If passing None, will draw a non-reproducible random integer to use as seed.
verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model. Be aware that, if passing ‘False’ and method='lbfgs', the optimization routine will not respond to interrupt signals.
print_every (int) – Print L-BFGS convergence messages every n-iterations. Ignored when passing verbose=False or method='als'.
produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.
handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 means using all available threads.
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
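The speed-oriented settings listed in the note at the top of this class description can be collected in one place and passed to the constructor. The sketch below uses only argument names that appear in the signature above; the import is guarded so that the snippet still runs where cmfrec is not installed, and the values of k and lambda_ are simply the documented defaults, not tuned recommendations.

```python
# Settings suggested in the class note for faster fitting.
fast_settings = dict(
    method="als",
    use_cg=True,
    finalize_chol=False,
    use_float=True,
    precompute_for_predictions=False,
    produce_dicts=False,
)

try:
    from cmfrec import CMF            # requires the cmfrec package
    model = CMF(k=40, lambda_=10.0, **fast_settings)
except ImportError:
    model = None                      # settings remain available as a reference
```

With precompute_for_predictions=False, prediction helpers will be slower until ‘force_precompute_for_predictions’ is called, as described for that parameter.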
- Variables
is_fitted (bool) – Whether the model has been fitted to data.
reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).
user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’.
user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False, this array will be empty.
item_bias (array(n,)) – The obtained biases for each item (column in the ‘X’ matrix). If passing item_bias=False, this array will be empty.
A (array(m, k_user+k+k_main)) – The obtained user factors.
B (array(n, k_item+k+k_main)) – The obtained item factors.
C (array(p, k_user+k)) – The obtained user-attributes factors.
D (array(q, k_item+k)) – The obtained item-attributes factors.
Ai (array(m, k+k_main) or array(0, 0)) – The obtained implicit user factors.
Bi (array(n, k+k_main) or array(0, 0)) – The obtained implicit item factors.
nfev (int) – Number of function and gradient evaluations performed during the L-BFGS optimization procedure.
nupd (int) – Number of L-BFGS updates performed during the optimization procedure.
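The shapes listed above follow from how the shared and non-shared factors are laid out: per the parameter descriptions, the k_user / k_item columns come first and are ignored for ‘X’, while the k_main columns come last and are ignored for the side information. The NumPy sketch below illustrates that layout with random stand-in matrices; the sizes are made up, and biases and centering are omitted.

```python
import numpy as np

m, n, p, q = 6, 5, 3, 2                 # hypothetical: users, items, user attrs, item attrs
k, k_user, k_item, k_main = 4, 1, 2, 3

rng = np.random.default_rng(0)
A = rng.normal(size=(m, k_user + k + k_main))
B = rng.normal(size=(n, k_item + k + k_main))
C = rng.normal(size=(p, k_user + k))
D = rng.normal(size=(q, k_item + k))

# 'X' uses the shared and main-only factors (skips the first k_user / k_item columns).
X_hat = A[:, k_user:] @ B[:, k_item:].T
# 'U' uses the user-only and shared factors (skips the last k_main columns).
U_hat = A[:, :k_user + k] @ C.T
# 'I' uses the item-only and shared factors (skips the last k_main columns).
I_hat = B[:, :k_item + k] @ D.T
```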
References
- [1a] Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).
- [2a] Singh, Ajit P., and Geoffrey J. Gordon. “Relational learning via collective matrix factorization.” Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008.
- [4a] Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. “Applications of the conjugate gradient method for implicit feedback collaborative filtering.” Proceedings of the fifth ACM conference on Recommender systems. 2011.
- [5a] Rendle, Steffen, Li Zhang, and Yehuda Koren. “On the difficulty of evaluating baselines: A study on recommender systems.” arXiv preprint arXiv:1905.01395 (2019).
- [6a] Franc, Vojtěch, Václav Hlaváč, and Mirko Navara. “Sequential coordinate-wise algorithm for the non-negative least squares problem.” International Conference on Computer Analysis of Images and Patterns. Springer, Berlin, Heidelberg, 2005.
- [7a] Zhou, Yunhong, et al. “Large-scale parallel collaborative filtering for the netflix prize.” International conference on algorithmic applications in management. Springer, Berlin, Heidelberg, 2008.
- drop_nonessential_matrices(drop_precomputed=True)
Drop matrices that are not used for prediction
Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.
This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.
Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.
After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are: factors_warm, factors_cold, factors_multiple, topN_warm, and topN_cold.
- Parameters
drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).
- Returns
self – This object with the non-essential matrices dropped.
- Return type
obj
- factors_cold(U=None, U_bin=None, U_col=None, U_val=None)
Determine user-factors from new data, given U
Note
If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.
- Parameters
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
factors – The user-factors as determined by the model.
- Return type
array(k_user+k+k_main,)
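The sparse ‘U_col’ + ‘U_val’ form described above can be derived from a dense attribute vector. The NumPy sketch below shows one way to do that for a single user; the attribute values are made up, and no cmfrec call is involved.

```python
import numpy as np

U_dense = np.array([0.0, 2.5, 0.0, 0.0, 1.0])   # hypothetical attributes for one user (p = 5)

U_col = np.flatnonzero(U_dense)                  # zero-based column indices of non-zero entries
U_val = U_dense[U_col]                           # values found at those columns
```

Either `U_dense` alone, or the pair `U_col` and `U_val`, would be passed to the method, but not both.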
- factors_multiple(X=None, U=None, U_bin=None, W=None, return_bias=False)
Determine user latent factors based on new data (warm and cold)
Determines latent factors for multiple rows/users at once given new data for them.
Note
See the documentation of “fit” for details about handling of missing values.
Note
If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match the numbering that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.
- Parameters
X (array(m_x, n), CSR matrix(m_x, n), COO matrix(m_x, n), or None) – New ‘X’ data.
U (array(m_u, p), CSR matrix(m_u, p), COO matrix(m_u, p), or None) – User attributes information for rows in ‘X’.
U_bin (array(m_ub, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.
W (array(m_x, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
return_bias (bool) – Whether to also return the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.
- Returns
A (array(max(m_x,m_u,m_ub), k_user+k+k_main)) – The new factors determined for all the rows given the new data.
bias (array(max(m_x,m_u,m_ub)) or None) – The user bias given the new ‘X’ data. Only returned if passing return_bias=True.
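When the model was fit to DataFrames, new data passed here has to follow the internal numbering. The sketch below shows one way to place ratings keyed by original item IDs into the right columns. The array `item_mapping` is a stand-in for the model's item mapping attribute (position = internal column index, value = original ID), and the IDs and ratings are invented for illustration.

```python
import numpy as np

# Stand-in for the fitted model's item mapping: position = internal column index.
item_mapping = np.array(["movie_c", "movie_a", "movie_b"])
original_to_internal = {item_id: idx for idx, item_id in enumerate(item_mapping.tolist())}

# Hypothetical new ratings for one user, keyed by original item ID.
new_ratings = {"movie_a": 4.0, "movie_b": 2.0}

n_items = item_mapping.shape[0]
X_new = np.full((1, n_items), np.nan)            # NaN marks unobserved entries
for item_id, rating in new_ratings.items():
    X_new[0, original_to_internal[item_id]] = rating
```

The resulting `X_new` has one row per new user and one column per item seen at fitting time, which is the layout expected for a dense ‘X’ input.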
- factors_warm(X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, return_bias=False)
Determine user latent factors based on new ratings data
- Parameters
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, so both can be omitted.
return_bias (bool) – Whether to also return the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.
return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.
- Returns
factors (array(k_user+k+k_main,) or array(k+k_main,)) – User factors as determined from the data in ‘X’.
bias (float) – User bias as determined from the data in ‘X’. Only returned if passing return_bias=True.
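The dense and sparse ways of passing a new user's ratings carry the same information. The NumPy sketch below converts a dense vector with np.nan for unobserved entries into the ‘X_col’ + ‘X_val’ pair. It assumes a model fit to arrays or sparse matrices, so that ‘X_col’ holds zero-based column numbers rather than ‘ItemId’ values; the ratings are made up.

```python
import numpy as np

X_dense = np.array([np.nan, 4.0, np.nan, 0.0, 5.0])   # hypothetical ratings; NaN = not observed

observed = ~np.isnan(X_dense)
X_col = np.flatnonzero(observed)     # zero-based item columns with an observed value
X_val = X_dense[observed]            # the observed values, in the same order
```

Note that an observed rating of zero is kept here, since only NaN denotes a missing entry in the dense form.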
- fit(X, U=None, I=None, U_bin=None, I_bin=None, W=None)
Fit model to explicit-feedback data and user/item attributes
Note
It’s possible to pass partially disjoint sets of users/items between the different matrices (e.g. it’s possible for both the ‘X’ and ‘U’ matrices to have rows that the other doesn’t have). The procedure supports missing values for all inputs (except for “W”). If any of the inputs has fewer rows/columns than the other(s) (e.g. “U” has more rows than “X”, or “I” has more rows than there are columns in “X”), will assume that the rest of the rows/columns have only missing values. Note however that when having partially disjoint inputs, the order of the rows/columns matters for speed, as it might run faster when the “U”/“I” inputs that do not have matching rows/columns in “X” have those unmatched rows/columns at the end (last rows/columns) and the “X” input is shorter. See also the parameter include_all_X for info about predicting with mismatched “X”.
Note
When passing NumPy arrays, missing (unobserved) entries should have value np.nan. When passing sparse inputs, the zero-valued entries will be considered as missing (unless using “NA_as_zero”), and it should not contain “NaN” values among the non-zero entries.
Note
In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.
- Parameters
X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’. Might additionally have a column ‘Weight’. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.
U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.
U_bin (array(m, p_bin), DataFrame(m, p_bin+1), or None) – User binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. Cannot be passed as a sparse matrix. Note that ‘U’ and ‘U_bin’ are not mutually exclusive. Only supported with method='lbfgs'.
I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.
I_bin (array(n, q_bin), DataFrame(n, q_bin+1), or None) – Item binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. Cannot be passed as a sparse matrix. Note that ‘I’ and ‘I_bin’ are not mutually exclusive. Only supported with method='lbfgs'.
W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array. Cannot have missing values.
- Return type
self
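The missing-value conventions described for ‘fit’ (np.nan marks unobserved entries in dense arrays, while sparse inputs store only the observed entries, sorted by column) can be illustrated with a small NumPy sketch. The helper name dense_to_coo_triplets is hypothetical and not part of cmfrec, and ordering by column first and row second is only one assumed reading of "sorted by column".

```python
import numpy as np

def dense_to_coo_triplets(X):
    """Hypothetical helper (not part of cmfrec): return (row, col, val)
    for the observed (non-NaN) entries of a dense array, ordered by
    column first and by row second."""
    X = np.asarray(X, dtype=float)
    row, col = np.nonzero(~np.isnan(X))
    # np.lexsort uses the LAST key as the primary sort key
    order = np.lexsort((row, col))
    row, col = row[order], col[order]
    return row, col, X[row, col]

X_dense = np.array([
    [5.0, np.nan, 3.0],
    [np.nan, 4.0, np.nan],
])
row, col, val = dense_to_coo_triplets(X_dense)
```

The resulting triplets could then be wrapped as scipy.sparse.coo_matrix((val, (row, col)), shape=X_dense.shape) before being passed as ‘X’.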
- force_precompute_for_predictions()¶
Precompute internal matrices that are used for predictions
Note
It’s not necessary to call this method if passing precompute_for_predictions=True.
- Return type
self
- static from_model_matrices(A, B, glob_mean=0.0, precompute=True, user_bias=None, item_bias=None, lambda_=10.0, scale_lam=False, l1_lambda=0.0, nonneg=False, NA_as_zero=False, scaling_biasA=None, scaling_biasB=None, use_float=False, nthreads=-1, n_jobs=None)¶
Create a CMF model object from fitted matrices
Creates a CMF model object based on fitted latent factor matrices, which might have been obtained from different software. For example, the package python-libmf has functionality for obtaining these matrices, but not for producing recommendations or latent factors for new users, for which this function can come in handy as it will turn such a model into a CMF model which provides all such functionality.
This is only available for models without side information, and does not support user/item mappings.
Note
- This is a static class method, should be called like this:
CMF.from_model_matrices(...)
(i.e. no parentheses after ‘CMF’)
- Parameters
A (array(n_users, k)) – The obtained user factors.
B (array(n_items, k)) – The obtained item factors.
glob_mean (float) – The obtained global mean, if the model underwent centering. If passing zero, will assume that the values are not to be centered.
precompute (bool) – Whether to generate pre-computed matrices which can help to speed up computations on new data.
user_bias (None or array(n_users,)) – The obtained user biases. If passing None, will assume that the model did not include user biases.
item_bias (None or array(n_items,)) – The obtained item biases. If passing None, will assume that the model did not include item biases.
lambda_ (float or array(6,)) – Regularization parameter. See the documentation for __init__ for details.
scale_lam (bool) – Whether to scale (increase) the regularization parameter for each row of the model matrices according to the number of non-missing entries in the data for that particular row.
l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. See the documentation for __init__ for details.
nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative.
NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix) instead of ignoring them. See the documentation for __init__ for details.
scaling_biasA (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the user biases.
scaling_biasB (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the item biases.
use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads).
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
- Returns
model – A CMF model object without side information, for which the usual prediction methods such as topN and topN_warm can be used as if it had been fitted through this software.
- Return type
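To make the role of the arguments to from_model_matrices more concrete, here is a NumPy-only sketch of how scores could be formed from such matrices. It assumes the usual biased matrix factorization form (global mean plus user bias plus item bias plus the dot product of the factors), and the function predict_from_matrices is a hypothetical illustration, not a cmfrec function.

```python
import numpy as np

def predict_from_matrices(A, B, user, item, glob_mean=0.0,
                          user_bias=None, item_bias=None):
    """Illustrative only: score user[j] against item[j] for each j,
    assuming X[u, i] ~ glob_mean + user_bias[u] + item_bias[i] + A[u] . B[i]."""
    user = np.asarray(user)
    item = np.asarray(item)
    # Row-wise dot products between the selected user and item factors
    scores = np.einsum("ij,ij->i", A[user], B[item]) + glob_mean
    if user_bias is not None:
        scores = scores + user_bias[user]
    if item_bias is not None:
        scores = scores + item_bias[item]
    return scores

A = np.array([[1.0, 0.0], [0.5, 2.0]])              # 2 users, k=2
B = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 1.0]])  # 3 items, k=2
scores = predict_from_matrices(
    A, B, user=[0, 1, 1], item=[0, 1, 2],
    glob_mean=3.0, user_bias=np.array([0.1, -0.2]),
)
```

Passing None for a bias vector simply drops that term, mirroring the description of the user_bias and item_bias arguments.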
- get_params(deep=True)¶
Get parameters for this estimator.
Kept for compatibility with scikit-learn.
- Parameters
deep (bool) – Ignored.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- item_factors_cold(I=None, I_bin=None, I_col=None, I_val=None)¶
Determine item-factors from new data, given I
Note
Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.
- Parameters
I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_bin (array(q_bin,), or None) – Binary attributes for the new item, in dense format. Only supported with method='lbfgs'.
I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
- Returns
factors – The item-factors as determined by the model.
- Return type
array(k_item+k+k_main,)
- predict(user, item)¶
Predict ratings/values given by existing users to existing items
Note
For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
item (array-like(n,)) – Items for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
- predict_cold(items, U=None, U_bin=None, U_col=None, U_val=None)¶
Predict rating given by a new user to existing items, given U
Note
If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
scores – Predicted ratings for the requested items, for this user.
- Return type
array(n,)
- predict_cold_multiple(item, U=None, U_bin=None)¶
Predict rating given by new users to existing items, given U
Note
If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.
- Parameters
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.
U_bin (array(m, p_bin), or None) – Binary attributes for the users to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item. Only supported with method='lbfgs'.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- predict_new(user, I=None, I_bin=None)¶
Predict rating given by existing users to new items, given I
Note
Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(n, q), CSR matrix(n, q), COO matrix(n, q), or None) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user. Might contain missing values.
I_bin (array(n, q_bin), or None) – Binary attributes for the items to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user. Might contain missing values. Only supported with method='lbfgs'.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
- predict_warm(items, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None)¶
Predict ratings for existing items, for a new user, given ‘X’
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
- Returns
scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.
- Return type
array(n,)
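The dense (‘X’) and sparse (‘X_col’ + ‘X_val’) ways of describing a new user’s observed entries carry the same information. A minimal NumPy sketch of converting one into the other follows. The helper dense_row_to_sparse is hypothetical, and it assumes the model was fit to arrays or sparse matrices, so that ‘X_col’ holds zero-based column numbers rather than ‘ItemId’ entries.

```python
import numpy as np

def dense_row_to_sparse(x):
    """Hypothetical helper: split a dense 1-d row with np.nan for
    non-observed entries into (X_col, X_val)."""
    x = np.asarray(x, dtype=float)
    X_col = np.flatnonzero(~np.isnan(x))  # zero-based indices of observed items
    X_val = x[X_col]                      # values at those positions
    return X_col, X_val

x_new = np.array([np.nan, 4.0, np.nan, 2.5])
X_col, X_val = dense_row_to_sparse(x_new)
```

Only one of the two representations should be passed in a given call, as stated in the parameter descriptions.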
- predict_warm_multiple(X, item, U=None, U_bin=None, W=None)¶
Predict ratings for existing items, for new users, given ‘X’
Note
See the documentation of “fit” for details about handling of missing values.
- Parameters
X (array(m, n), CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Must have one row per entry of item.
item (array-like(m,)) – Items for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding row of X.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.
U_bin (array(m, p_bin)) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.
W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- set_params(**params)¶
Set the parameters of this estimator.
Kept for compatibility with scikit-learn.
Note
Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- swap_users_and_items(precompute=True)¶
Swap the users and items in a factorization model
This method will generate a new object that will have the users and items of this object swapped, and such result can be used under the same methods such as topN, in which any mention of users will now mean items and vice-versa.
Note
The resulting object will not generate any deep copies of the original model’s objects.
- Parameters
precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.
- Returns
model – An object of the same class as this one, but with the user and items swapped.
- Return type
obj
- topN(user, n=10, include=None, exclude=None, output_score=False)¶
Rank top-N highest-predicted items for an existing user
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
n (int) – Number of top-N highest-predicted results to output.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
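The exact ranking described for topN, including the mutually exclusive ‘include’ and ‘exclude’ arguments, can be sketched with NumPy as follows. This is an illustrative re-implementation under simplifying assumptions (no biases, items identified by column number), not the code used by cmfrec, and rank_top_n is a hypothetical name.

```python
import numpy as np

def rank_top_n(scores, n=10, include=None, exclude=None):
    """Illustrative exact top-N: rank candidate items by score, descending."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    candidates = np.arange(scores.shape[0])
    if include is not None:
        candidates = np.asarray(include)
    elif exclude is not None:
        candidates = np.setdiff1d(candidates, np.asarray(exclude))
    n = min(n, candidates.shape[0])
    cand_scores = scores[candidates]
    order = np.argsort(-cand_scores, kind="stable")[:n]
    return candidates[order], cand_scores[order]

a = np.array([1.0, -1.0])            # factors of one user (k=2)
B = np.array([[0.2, 0.1],
              [0.9, 0.0],
              [0.5, 0.5],
              [0.0, -1.0]])          # factors of four items
scores = B @ a                       # one score per item
items_excl, _ = rank_top_n(scores, n=2, exclude=[3])
items_incl, _ = rank_top_n(scores, n=2, include=[2, 3])
```

Because every item is scored, the cost grows linearly with the number of items, which is why the note above suggests approximate nearest-neighbor software for serving.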
- topN_cold(n=10, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘U’
Note
If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_new(user, I=None, I_bin=None, n=10, output_score=False)¶
Rank top-N highest-predicted items for an existing user, given ‘I’
Note
If the model was fit to both ‘I’ and ‘I_bin’, can pass a partially- disjoint set to both - that is, both can have rows that the other doesn’t. In such case, the rows that they have in common should come first, and then one of them appended missing values so that one of the matrices ends up containing all the rows of the other.
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(m, q), CSR matrix(m, q), COO matrix(m, q), or None) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.
I_bin (array(m, q_bin), or None) – Binary attributes for the items to rank. Data frames with ‘ItemId’ column are not supported. Only supported with method='lbfgs'.
n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in ‘I’/’I_bin’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’/’I_bin’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_warm(n=10, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘X’
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- transform(X=None, y=None, U=None, U_bin=None, W=None, replace_existing=False)¶
Reconstruct missing entries of the ‘X’ matrix
Will reconstruct/impute all the missing entries in the ‘X’ matrix as determined by the model. This method is intended to be used for imputing tabular data, and can be used as part of SciKit-Learn pipelines.
Note
It’s possible to use this method with ‘X’ alone, with ‘U’/’U_bin’ alone, or with both ‘X’ and ‘U’/’U_bin’ together, in which case both matrices must have the same rows.
Note
If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.
- Parameters
X (array(m, n), or None) – New ‘X’ data with potentially missing entries which are to be imputed. Missing entries should have value np.nan when passing a dense array.
y (None) – Not used. Kept as a placeholder for compatibility with SciKit-Learn pipelines.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.
U_bin (array(m, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.
W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Returns
X – The ‘X’ matrix as a dense array with all missing entries imputed according to the model.
- Return type
array(m, n)
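The imputation idea behind transform can be sketched in NumPy: entries that are np.nan are filled from a low-rank reconstruction, while observed entries are left untouched. This is a conceptual sketch only. It assumes that observed values are kept as they are (the replace_existing argument in the signature is not described in this section), it uses already-known row factors instead of estimating them from ‘X’, and impute_missing is a hypothetical name.

```python
import numpy as np

def impute_missing(X, X_hat):
    """Illustrative only: fill np.nan entries of X with the matching
    entries of the reconstruction X_hat, keeping observed values."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), X_hat, X)

A = np.array([[1.0, 0.0], [0.0, 1.0]])              # row factors (assumed known)
B = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]])  # column factors
glob_mean = 1.0
X_hat = A @ B.T + glob_mean                          # dense reconstruction

X_new = np.array([[5.0, np.nan, np.nan],
                  [np.nan, np.nan, 2.0]])
X_imputed = impute_missing(X_new, X_hat)
```

In the actual method the row factors would first be determined from the observed entries of ‘X’ (and ‘U’/’U_bin’ if passed) before the reconstruction is computed.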
CMF_implicit¶
- class cmfrec.CMF_implicit(k=50, lambda_=1.0, alpha=1.0, use_cg=True, k_user=0, k_item=0, k_main=0, w_main=1.0, w_user=10.0, w_item=10.0, l1_lambda=0.0, center_U=True, center_I=True, niter=10, NA_as_zero_user=False, NA_as_zero_item=False, nonneg=False, nonneg_C=False, nonneg_D=False, max_cd_steps=100, apply_log_transf=False, precompute_for_predictions=True, use_float=True, max_cg_steps=3, precondition_cg=False, finalize_chol=False, random_state=1, verbose=False, produce_dicts=False, handle_interrupt=True, nthreads=-1, n_jobs=None)¶
Collective model for implicit-feedback data
Tries to approximate the ‘X’ interactions matrix by a formula as follows:
\(\mathbf{X} \sim \mathbf{A} \mathbf{B}^T\)
While at the same time also approximating the user side information matrix ‘U’ and the item side information matrix ‘I’ as follows:
\(\mathbf{U} \sim \mathbf{A} \mathbf{C}^T\),
\(\mathbf{I} \sim \mathbf{B} \mathbf{D}^T\)
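As a rough illustration of the three approximations above, the following NumPy sketch only checks how the shared factor matrices line up dimensionally. It assumes k_user=k_item=k_main=0, uses random placeholders instead of fitted factors, and the size names m, n, p, q are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 5   # users (rows of X) and items (columns of X)
p, q = 3, 2   # number of user attributes and item attributes
k = 2         # shared latent factors

A = rng.normal(size=(m, k))  # user factors, shared by X and U
B = rng.normal(size=(n, k))  # item factors, shared by X and I
C = rng.normal(size=(p, k))  # factors for the user attributes
D = rng.normal(size=(q, k))  # factors for the item attributes

X_approx = A @ B.T  # approximates the interactions matrix 'X'
U_approx = A @ C.T  # approximates the user side information 'U'
I_approx = B @ D.T  # approximates the item side information 'I'
```

Since ‘A’ appears in both the ‘X’ and ‘U’ reconstructions (and ‘B’ in both ‘X’ and ‘I’), the side information influences the factors used for recommendations.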
Note
The default hyperparameters in this software are very different from others. For example, to match those of the package implicit, the corresponding hyperparameters here would be use_cg=True, finalize_chol=False, k=100, lambda_=0.01, niter=15, use_float=True, alpha=1.0 (see the individual documentation of each hyperparameter for details).
Note
The default arguments are not geared towards speed. For faster fitting, use use_cg=True, finalize_chol=False, use_float=True, precompute_for_predictions=False, produce_dicts=False, and pass COO matrices or NumPy arrays instead of DataFrames to fit.
Note
The model optimization objective will not scale any of its terms according to number of entries, so hyperparameters such as lambda_ will require more tuning than in other software and trying out values over a wider range.
Note
This model is fit through the alternating least-squares method only, it does not offer a gradient-based approach like the explicit-feedback version.
Note
This model will not perform mean centering and will not fit user/item biases. If desired, an equivalent problem formulation can be made through CMF which can accommodate mean centering and biases.
Note
Recommendation quality metrics for this model can be calculated with the recometrics library.
- Parameters
k (int) – Number of latent factors to use (dimensionality of the low-rank factorization), which will be shared between the factorization of the ‘X’ matrix and the side info matrices. Additional non-shared components can also be specified through k_user, k_item, and k_main. Typical values are 30 to 100.
lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: <ignored>, <ignored>, A, B, C, D. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. For example, a good number for the LastFM-360K could be lambda_=5. Typical values are \(10^{-2}\) to \(10^2\).
alpha (float) – Weighting parameter for the non-zero entries in the implicit-feedback model. See [3b] for details. Note that, while the authors’ suggestion for this value is 40, other software such as implicit use a value of 1, whereas Spark uses a value of 0.01 by default, and values higher than 10 are unlikely to improve results. If the data has very high values, it might even be beneficial to put a very low value here - for example, for the LastFM-360K, values below 1 might give better results.
use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative to the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data, such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data. This option is not available when using L1 regularization and/or non-negativity constraints.
k_user (int) – Number of factors in the factorizing A and C matrices which will be used only for the ‘U’ matrix, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.
k_item (int) – Number of factors in the factorizing B and D matrices which will be used only for the ‘I’ matrix, while being ignored for the ‘X’ matrix. These will be the first factors of the matrices once the model is fit. Will be counted in addition to those already set by k.
k_main (int) – Number of factors in the factorizing A and B matrices which will be used only for the ‘X’ matrix, while being ignored for the ‘U’ and ‘I’ matrices. These will be the last factors of the matrices once the model is fit. Will be counted in addition to those already set by k.
w_main (float) – Weight in the optimization objective for the errors in the factorization of the ‘X’ matrix. Note that, since the “X” matrix is considered to be full with mostly zero values, the overall sum of errors for “X” will be much larger than for the side info matrices (especially if using large alpha), thus it’s recommended to give higher weights to the side info matrices than to the main matrix.
w_user (float) – Weight in the optimization objective for the errors in the factorization of the ‘U’ matrix. Ignored when not passing ‘U’ to ‘fit’. Note that, since the “X” matrix is considered to be full with mostly zero values, the overall sum of errors for “X” will be much larger than for the side info matrices (especially if using large alpha), thus it’s recommended to give higher weights to the side info matrices than to the main matrix.
w_item (float) – Weight in the optimization objective for the errors in the factorization of the ‘I’ matrix. Ignored when not passing ‘I’ to ‘fit’. Note that, since the “X” matrix is considered to be full with mostly zero values, the overall sum of errors for “X” will be much larger than for the side info matrices (especially if using large alpha), thus it’s recommended to give higher weights to the side info matrices than to the main matrix.
l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. Can also pass different values for each matrix (see lambda_ for details). Note that, when adding L1 regularization, the model will be fit through a coordinate descent procedure, which is significantly slower than the Cholesky method with L2 regularization. Not recommended.
center_U (bool) – Whether to center the ‘U’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_user=True.
center_I (bool) – Whether to center the ‘I’ matrix column-by-column. Be aware that this is a simple mean centering without regularization. One might want to turn this option off when using NA_as_zero_item=True.
niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Typical values are 6 to 30.
NA_as_zero_user (bool) – Whether to take missing entries in the ‘U’ matrix as zeros (only when the ‘U’ matrix is passed as sparse COO matrix) instead of ignoring them. Note that passing “True” will affect the results of the functions named “warm” if no data is passed there (as it will assume zeros instead of missing).
NA_as_zero_item (bool) – Whether to take missing entries in the ‘I’ matrix as zeros (only when the ‘I’ matrix is passed as sparse COO matrix) instead of ignoring them.
nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative. In order for this to work correctly, the ‘X’ input data must also be non-negative. This constraint will also be applied to the ‘Ai’ and ‘Bi’ matrices if passing add_implicit_features=True. This option is not available when using the L-BFGS method. Note that, when determining non-negative factors, it will always use a coordinate descent method, regardless of the value passed for use_cg and finalize_chol. When used for recommender systems, one usually wants to pass ‘False’ here. For better results, use a higher regularization and more iterations.
nonneg_C (bool) – Whether to constrain the ‘C’ matrix to be non-negative. In order for this to work correctly, the ‘U’ input data must also be non-negative.
nonneg_D (bool) – Whether to constrain the ‘D’ matrix to be non-negative. In order for this to work correctly, the ‘I’ input data must also be non-negative.
max_cd_steps (int) – Maximum number of coordinate descent updates to perform per iteration. Pass zero for no limit. The procedure will only use coordinate descent updates when having L1 regularization and/or non-negativity constraints. This number should usually be larger than k.
apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’ (i.e. ‘X := log(X)’).
precompute_for_predictions (bool) – Whether to precompute some of the matrices that are used when making predictions from the model. If ‘False’, it will take longer to generate predictions or top-N lists, but will use less memory and will be faster to fit the model. If passing ‘False’, can be recomputed later on-demand through method ‘force_precompute_for_predictions’.
use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False.
precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="als".
finalize_chol (bool) – When passing use_cg=True, whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.
random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. If passing None, will draw a non-reproducible random integer to use as seed.
verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model.
produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.
handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads, so that -1 means all available threads).
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
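As a rough illustration of how alpha and lambda_ enter the model, the NumPy sketch below evaluates an implicit-feedback objective in the style of [3b] on a tiny dense matrix. The confidence formula 1 + alpha * x, the 0/1 preference target, and the function name implicit_objective are assumptions made for illustration; the package's exact objective (side information terms, weights, log transformation) is not reproduced here.

```python
import numpy as np

def implicit_objective(X, A, B, alpha=1.0, lambda_=1.0):
    """Confidence-weighted squared error over ALL entries plus an L2 penalty."""
    pref = (X > 0).astype(float)    # 1 where an interaction was observed
    conf = 1.0 + alpha * X          # confidence grows with the observed value
    err = pref - A @ B.T            # unobserved entries count as zeros
    # The sum is not divided by the number of entries, so the data term
    # scales with the size of X.
    data_term = np.sum(conf * err ** 2)
    penalty = lambda_ * (np.sum(A ** 2) + np.sum(B ** 2))
    return data_term + penalty

rng = np.random.default_rng(1)
X = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])
A = rng.normal(size=(2, 4))         # user factors (m x k)
B = rng.normal(size=(3, 4))         # item factors (n x k)

loss_low = implicit_objective(X, A, B, alpha=1.0)
loss_high = implicit_objective(X, A, B, alpha=40.0)
loss_zero = implicit_objective(X, np.zeros((2, 4)), np.zeros((3, 4)), alpha=1.0)
```

With larger alpha, errors on the non-zero entries dominate the sum, which is consistent with the advice above about keeping alpha low when ‘X’ has very high values.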
- Variables
is_fitted (bool) – Whether the model has been fitted to data.
reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).
user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
A (array(m, k_user+k+k_main)) – The obtained user factors.
B (array(n, k_item+k+k_main)) – The obtained item factors.
C (array(p, k_user+k)) – The obtained user-attributes factors.
D (array(q, k_item+k)) – The obtained item attributes factors.
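The column layout implied by k_user, k_item, k and k_main can be sketched with plain NumPy slicing. This is only an illustration built from the shapes listed above and from the parameter descriptions (side-information-only factors come first, ‘X’-only factors come last); the sizes are arbitrary, the matrices are random stand-ins, and any column centering of ‘U’ and ‘I’ is ignored.

```python
import numpy as np

m, n, p, q = 5, 7, 3, 2                 # users, items, user attrs, item attrs
k, k_user, k_item, k_main = 4, 1, 2, 3

rng = np.random.default_rng(0)
A = rng.normal(size=(m, k_user + k + k_main))   # user factors
B = rng.normal(size=(n, k_item + k + k_main))   # item factors
C = rng.normal(size=(p, k_user + k))            # user-attribute factors
D = rng.normal(size=(q, k_item + k))            # item-attribute factors

# 'X' ignores the first k_user / k_item columns (side-info-only factors).
X_hat = A[:, k_user:] @ B[:, k_item:].T
# 'U' and 'I' ignore the last k_main columns ('X'-only factors).
U_hat = A[:, :k_user + k] @ C.T
I_hat = B[:, :k_item + k] @ D.T
```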
References
- 1b
Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).
- 2b
Singh, Ajit P., and Geoffrey J. Gordon. “Relational learning via collective matrix factorization.” Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008.
- 3b
Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
- 4b
Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. “Applications of the conjugate gradient method for implicit feedback collaborative filtering.” Proceedings of the fifth ACM conference on Recommender systems. 2011.
- 5b
Franc, Vojtěch, Václav Hlaváč, and Mirko Navara. “Sequential coordinate-wise algorithm for the non-negative least squares problem.” International Conference on Computer Analysis of Images and Patterns. Springer, Berlin, Heidelberg, 2005.
- drop_nonessential_matrices(drop_precomputed=True)¶
Drop matrices that are not used for prediction
Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.
This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.
Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.
After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:
factors_warm
factors_cold
factors_multiple
topN_warm
topN_cold
- Parameters
drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).
- Returns
self – This object with the non-essential matrices dropped.
- Return type
obj
- factors_cold(U=None, U_col=None, U_val=None)¶
Determine user-factors from new data, given U
- Parameters
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
factors – The user-factors as determined by the model.
- Return type
array(k_user+k+k_main,)
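The sparse form of a single row (‘U_col’ plus ‘U_val’) can be derived from a dense vector as in the sketch below. The variable names are illustrative only. Note that the two forms are not necessarily interchangeable: according to the documentation of ‘fit’, dense inputs mark missing entries with np.nan, while zero-valued entries in sparse inputs are treated as missing unless NA_as_zero_user=True.

```python
import numpy as np

U_dense = np.array([0.0, 2.5, 0.0, 0.0, 1.0])   # one user, p = 5 attributes

U_col = np.flatnonzero(U_dense)                  # zero-based column indices
U_val = U_dense[U_col]                           # values at those columns

# Rebuilding the dense row from the sparse pair, for comparison.
U_rebuilt = np.zeros_like(U_dense)
U_rebuilt[U_col] = U_val
```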
- factors_multiple(X=None, U=None)¶
Determine user latent factors based on new data (warm and cold)
Determines latent factors for multiple rows/users at once given new data for them.
Note
See the documentation of “fit” for details about handling of missing values.
Note
If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.
- Parameters
X (CSR matrix(m_x, n), COO matrix(m_x, n), or None) – New ‘X’ data.
U (array(m_u, p), CSR matrix(m_u, p), COO matrix(m_u, p), or None) – User attributes information for rows in ‘X’.
- Returns
A – The new factors determined for all the rows given the new data.
- Return type
array(max(m_x,m_u), k_user+k+k_main)
- factors_warm(X_col, X_val, U=None, U_col=None, U_val=None)¶
Determine user latent factors based on new interactions data
- Parameters
X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
- Returns
factors – User factors as determined from the data in ‘X_col’ and ‘X_val’.
- Return type
array(k_user+k+k_main,)
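Conceptually, warm-start factors for a new user come from a regularized least-squares problem against the fixed item factors. The sketch below solves that problem in closed form for a model without side information, following the formulation in [3b]. It is a simplified illustration, not the package's implementation: the function name warm_user_factors, the confidence formula 1 + alpha * x, and the omission of ‘U’, w_main, and non-shared factors are all assumptions.

```python
import numpy as np

def warm_user_factors(B, X_col, X_val, alpha=1.0, lambda_=1.0):
    """Solve (B^T C B + lambda I) a = B^T C p for a single new user."""
    n, k = B.shape
    x = np.zeros(n)
    x[X_col] = X_val                    # densify the user's sparse row
    conf = 1.0 + alpha * x              # confidence per item
    pref = (x > 0).astype(float)        # 0/1 preference target
    lhs = (B * conf[:, None]).T @ B + lambda_ * np.eye(k)
    rhs = B.T @ (conf * pref)
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
B = rng.normal(size=(20, 3))            # stand-in for fitted item factors
X_col = np.array([2, 5, 11])            # items the new user interacted with
X_val = np.array([1.0, 4.0, 2.0])       # interaction values
factors = warm_user_factors(B, X_col, X_val, alpha=2.0, lambda_=0.5)
scores = B @ factors                    # predicted preference for every item
```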
- fit(X, U=None, I=None)¶
Fit model to implicit-feedback data and user/item attributes
Note
It’s possible to pass partially disjoint sets of users/items between the different matrices (e.g. it’s possible for both the ‘X’ and ‘U’ matrices to have rows that the other doesn’t have), but note that missing values in ‘X’ are treated as zeros. The procedure supports missing values for “U” and “I”. If any of the inputs has fewer rows/columns than the other(s) (e.g. “U” has more rows than “X”, or “I” has more rows than there are columns in “X”), will assume that the rest of the rows/columns have only missing values (zero values for “X”). Note however that when having partially disjoint inputs, the order of the rows/columns matters for speed, as it might run faster when the “U”/“I” inputs that do not have matching rows/columns in “X” have those unmatched rows/columns at the end (last rows/columns) and the “X” input is shorter.
Note
When passing NumPy arrays, missing (unobserved) entries should have value np.nan. When passing sparse inputs, the zero-valued entries will be considered as missing (unless using “NA_as_zero”, and except for “X” for which missing will always be treated as zero), and it should not contain “NaN” values among the non-zero entries.
Note
In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.
- Parameters
X (DataFrame(nnz, 3), or sparse COO(m, n)) – Matrix to factorize. Can be passed as a SciPy sparse COO matrix (recommended), or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Value’. If passing a DataFrame, the IDs will be internally remapped.
U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix too.
I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix too.
- Return type
self
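The internal remapping of DataFrame IDs can be pictured as follows. This sketch uses np.unique to assign consecutive integers to the IDs in the ‘UserId’ and ‘ItemId’ columns; the ordering that the package actually assigns is not specified here, so the sorted order below is an assumption, as are the variable names.

```python
import numpy as np

# Stand-ins for the 'UserId' and 'ItemId' columns of a DataFrame 'X'.
user_ids = np.array(["u7", "u2", "u7", "u9"])
item_ids = np.array([101, 205, 205, 101])

# mapping[internal_id] -> original ID; row/col hold the internal IDs.
user_mapping, row = np.unique(user_ids, return_inverse=True)
item_mapping, col = np.unique(item_ids, return_inverse=True)

# Dict form (original ID -> internal ID), analogous to produce_dicts=True.
user_dict = {uid: idx for idx, uid in enumerate(user_mapping.tolist())}
item_dict = {iid: idx for idx, iid in enumerate(item_mapping.tolist())}
```

The arrays row and col are what a sparse COO matrix would use as indices, while the mapping arrays translate internal IDs back to the original ones.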
- force_precompute_for_predictions()¶
Precompute internal matrices that are used for predictions
Note
It’s not necessary to call this method if passing precompute_for_predictions=True.
- Return type
self
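To give an idea of why precomputed matrices speed up predictions, the sketch below checks the well-known identity from [3b]: the confidence-weighted Gram matrix for one user equals a user-independent part (B^T B, computed once) plus a correction that only involves the items that user interacted with. Which matrices this package actually precomputes is not stated here, so this is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k = 50, 4
B = rng.normal(size=(n_items, k))       # stand-in for fitted item factors
BtB = B.T @ B                           # computed once, shared by all users

alpha = 1.0
X_col = np.array([3, 10, 42])           # items with observed interactions
X_val = np.array([2.0, 1.0, 5.0])

# Direct computation: touches every item for every user.
conf = np.ones(n_items)
conf[X_col] += alpha * X_val
full = (B * conf[:, None]).T @ B

# Using the precomputed part: touches only the observed items.
B_obs = B[X_col]
fast = BtB + (B_obs * (alpha * X_val)[:, None]).T @ B_obs
```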
- static from_model_matrices(A, B, precompute=True, lambda_=1.0, l1_lambda=0.0, nonneg=False, apply_log_transf=False, alpha=1.0, use_float=False, nthreads=-1, n_jobs=None)¶
Create a CMF_implicit model object from fitted matrices
Creates a CMF_implicit model object based on fitted latent factor matrices, which might have been obtained from a different software. For example, the package python-libmf has functionality for obtaining these matrices, but not for producing recommendations or latent factors for new users, for which this function can come in handy as it will turn such model into a CMF_implicit model which provides all such functionality.
This is only available for models without side information, and does not support user/item mappings.
Note
This is a static class method and should be called like this: CMF_implicit.from_model_matrices(...) (i.e. no parentheses after ‘CMF_implicit’).
- Parameters
A (array(n_users, k)) – The obtained user factors.
B (array(n_items, k)) – The obtained item factors.
precompute (bool) – Whether to generate pre-computed matrices which can help to speed up computations on new data.
lambda_ (float or array(6,)) – Regularization parameter. See the documentation for __init__ for details.
l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. See the documentation for __init__ for details.
nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative.
apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’.
alpha (float) – Multiplier to apply to the confidence scores given by ‘X’.
use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads, so that -1 means all available threads).
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
- Returns
model – A CMF_implicit model object without side information, for which the usual prediction methods such as topN and topN_warm can be used as if it had been fitted through this software.
- Return type
CMF_implicit
- get_params(deep=True)¶
Get parameters for this estimator.
Kept for compatibility with scikit-learn.
- Parameters
deep (bool) – Ignored.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- item_factors_cold(I=None, I_col=None, I_val=None)¶
Determine item-factors from new data, given I
Note
Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.
- Parameters
I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
- Returns
factors – The item-factors as determined by the model.
- Return type
array(k_item+k+k_main,)
- predict(user, item)¶
Predict ratings/values given by existing users to existing items
Note
For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
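The pairwise matching of user and item entries can be sketched as a row-wise dot product between factor matrices. This is a simplified illustration for a model without side information or non-shared factors, using integer row/column indices; the function name predict_pairs is hypothetical, and out-of-range combinations are set to NaN as described in the note above.

```python
import numpy as np

def predict_pairs(A, B, user, item):
    """Score (user[i], item[i]) pairs; invalid combinations become NaN."""
    user = np.asarray(user)
    item = np.asarray(item)
    valid = (user >= 0) & (user < A.shape[0]) & (item >= 0) & (item < B.shape[0])
    scores = np.full(user.shape[0], np.nan)
    # Row-wise dot product between the matched user and item factors.
    scores[valid] = np.einsum("ij,ij->i", A[user[valid]], B[item[valid]])
    return scores

A = np.array([[1.0, 0.0],
              [0.0, 2.0]])               # 2 users, k = 2
B = np.array([[1.0, 1.0],
              [3.0, 4.0],
              [0.0, 1.0]])               # 3 items, k = 2

# Pairs: (user 0, item 1), (user 1, item 2), (user 5, item 0) -> last is invalid.
scores = predict_pairs(A, B, user=[0, 1, 5], item=[1, 2, 0])
```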
- predict_cold(items, U=None, U_col=None, U_val=None)¶
Predict value/confidence given by a new user to existing items, given U
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
scores – Predicted ratings for the requested items, for this user.
- Return type
array(n,)
- predict_cold_multiple(item, U)¶
Predict value/confidence given by new users to existing items, given U
Note
See the documentation of “fit” for details about handling of missing values.
- Parameters
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(m, p), CSR matrix(m, p), or COO matrix(m, p)) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- predict_new(user, I)¶
Predict rating given by existing users to new items, given I
Note
Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(n, q), CSR matrix(n, q), or COO matrix(n, q)) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
- predict_warm(items, X_col, X_val, U=None, U_col=None, U_val=None)¶
Predict scores for existing items, for a new user, given ‘X’
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
- Returns
scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’, plus ‘U’ if passed.
- Return type
array(n,)
- predict_warm_multiple(X, item, U=None)¶
Predict scores for existing items, for new users, given ‘X’
Note
See the documentation of “fit” for details about handling of missing values.
Note
If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.
- Parameters
X (CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Must have one row per entry of item.
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding row of X.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- set_params(**params)¶
Set the parameters of this estimator.
Kept for compatibility with scikit-learn.
Note
Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- swap_users_and_items(precompute=True)¶
Swap the users and items in a factorization model
This method will generate a new object that will have the users and items of this object swapped, and such result can be used under the same methods such as topN, in which any mention of users will now mean items and vice-versa.
Note
The resulting object will not generate any deep copies of the original model’s objects.
- Parameters
precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.
- Returns
model – An object of the same class as this one, but with the users and items swapped.
- Return type
obj
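The effect of swapping, and what the absence of deep copies implies, can be sketched with a minimal stand-in class. TinyModel below is hypothetical and only holds two factor matrices; it is not the package's class and ignores side information and precomputed matrices.

```python
import numpy as np

class TinyModel:
    """Hypothetical stand-in holding only user (A) and item (B) factors."""
    def __init__(self, A, B):
        self.A = A
        self.B = B

    def swap_users_and_items(self):
        # Same underlying arrays, with their roles exchanged (no copies).
        return TinyModel(A=self.B, B=self.A)

rng = np.random.default_rng(0)
model = TinyModel(A=rng.normal(size=(4, 3)), B=rng.normal(size=(6, 3)))
swapped = model.swap_users_and_items()

# Scores of the swapped model are the transpose of the original scores.
original_scores = model.A @ model.B.T        # users x items
swapped_scores = swapped.A @ swapped.B.T     # items x users
shares_memory = np.shares_memory(swapped.A, model.B)
```

Because the arrays are shared, modifying the factors of one object in place would also change the other.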
- topN(user, n=10, include=None, exclude=None, output_score=False)¶
Rank top-N highest-predicted items for an existing user
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
n (int) – Number of top-N highest-predicted results to output.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
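An exact top-N ranking of this kind can be sketched in NumPy as a full scoring pass followed by a partial sort. The function top_n below is a hypothetical illustration that mimics the ‘include’, ‘exclude’ and ‘output_score’ options for a model without side information, using integer item indices; it is not the package's implementation.

```python
import numpy as np

def top_n(a_user, B, n=10, include=None, exclude=None, output_score=False):
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    scores = B @ a_user                          # one score per item
    if include is not None:
        candidates = np.asarray(include)
    else:
        candidates = np.arange(B.shape[0])
        if exclude is not None:
            candidates = np.setdiff1d(candidates, exclude)
    n = min(n, candidates.shape[0])
    if n == 0:
        empty = candidates[:0]
        return (empty, scores[:0]) if output_score else empty
    cand_scores = scores[candidates]
    # Partial selection of the n best, then sort only those n.
    top = np.argpartition(-cand_scores, n - 1)[:n]
    top = top[np.argsort(-cand_scores[top])]
    items = candidates[top]
    return (items, cand_scores[top]) if output_score else items

B = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]])  # 4 items
a_user = np.array([1.0, 0.1])                                   # one user

best = top_n(a_user, B, n=2)
best_excl = top_n(a_user, B, n=2, exclude=[2])
best_incl, incl_scores = top_n(a_user, B, n=5, include=[1, 3], output_score=True)
```

The scoring step is linear in the number of items, which is the cost that approximate nearest-neighbor libraries aim to avoid.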
- topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘U’
Note
For better cold-start recommendations, one can also add item biases by using the CMF class with parameters that would mimic CMF_implicit plus the biases.
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
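The interaction between ‘n’, ‘include’, ‘exclude’ and ‘output_score’ can be sketched in plain Python. This is not the package's implementation: the function top_n and the score list are hypothetical, and the scores stand in for the predictions the model would compute for one user. It only illustrates the ranking semantics described above.

```python
import heapq

def top_n(scores, n=10, include=None, exclude=None, output_score=False):
    """Hypothetical helper: rank item indices by score, highest first."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    if include is not None:
        # Rank only among the listed items.
        candidates = list(include)
    elif exclude is not None:
        # Rank every item except the listed ones.
        excluded = set(exclude)
        candidates = [i for i in range(len(scores)) if i not in excluded]
    else:
        candidates = list(range(len(scores)))
    ranked = heapq.nlargest(n, candidates, key=lambda i: scores[i])
    if output_score:
        # Tuple of (item IDs, scores), mirroring output_score=True.
        return ranked, [scores[i] for i in ranked]
    return ranked

# Made-up predicted scores for five items (columns 0..4).
item_scores = [0.2, 0.9, 0.5, 0.7, 0.1]
top_all = top_n(item_scores, n=3)
top_included = top_n(item_scores, n=2, include=[0, 2, 4])
top_excluded, top_excluded_scores = top_n(
    item_scores, n=2, exclude=[1], output_score=True
)
```

Because every candidate is scored before ranking, the cost grows linearly with the number of items, which is why the note above suggests approximate nearest-neighbor search for serving.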
- topN_new(user, I=None, n=10, output_score=False)¶
Rank top-N highest-predicted items for an existing user, given ‘I’
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows on ‘X’.
I (array(m, q), CSR matrix(m, q), or COO matrix(m, q)) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.
n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in I.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_warm(n=10, X_col=None, X_val=None, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘X’
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
OMF_explicit¶
- class cmfrec.OMF_explicit(k=50, lambda_=10.0, method='lbfgs', use_cg=True, user_bias=True, item_bias=True, center=True, k_sec=0, k_main=0, add_intercepts=True, w_user=1.0, w_item=1.0, maxiter=10000, niter=10, parallelize='separate', corr_pairs=7, max_cg_steps=3, precondition_cg=False, finalize_chol=True, NA_as_zero=False, use_float=False, random_state=1, verbose=True, print_every=100, produce_dicts=False, handle_interrupt=True, nthreads=-1, n_jobs=None)¶
Offsets model for explicit-feedback data
Tries to approximate the ‘X’ ratings matrix using the user side information ‘U’ and item side information ‘I’ by a formula as follows:
\(\mathbf{X} \sim (\mathbf{A} + \mathbf{U} \mathbf{C}) * (\mathbf{B} + \mathbf{I} \mathbf{D})^T\)
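Read together with the w_user and w_item parameters documented for this class, the formula can be written out in steps. This is only a restatement, assuming k_sec=0 and k_main=0 and leaving out centering, biases and attribute intercepts; the subscript-m names are illustrative:

```latex
\begin{aligned}
\mathbf{A}_m &= \mathbf{A} + w_{\text{user}} \, \mathbf{U} \mathbf{C} \\
\mathbf{B}_m &= \mathbf{B} + w_{\text{item}} \, \mathbf{I} \mathbf{D} \\
\mathbf{X} &\sim \mathbf{A}_m \mathbf{B}_m^T
\end{aligned}
```

Here A is (m, k), U is (m, p), C is (p, k), B is (n, k), I is (n, q) and D is (q, k). With the default weights w_user = w_item = 1 this reduces to the formula above.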
Note
This model is meant to be fit to ratings data with side info about either users or items. If there is side info about both, it’s better to use the content-based model instead.
Note
This model is meant for cold-start predictions (that is, based on side information alone). It is extremely unlikely to bring improvements compared to situations in which the classical model is able to make predictions.
Note
The ALS method works by first fitting a model with no side info and then reconstructing the parameters by least squares approximations, so when making warm-start predictions, the results will be exactly the same as if not using any side information (user/item attributes). The ALS procedure for this model was implemented for experimentation purposes only, and it’s recommended to use L-BFGS instead.
Note
It’s advised to experiment with tuning the maximum number of L-BFGS iterations and stopping earlier. Be aware that this model requires a lot more iterations to reach convergence compared to the classic and the collective models.
Note
The model optimization objective will not scale any of its terms according to number of entries, so hyperparameters such as lambda_ will require more tuning than in other software and trying out values over a wider range.
- Parameters
k (int) – Number of latent factors to use (dimensionality of the low-rank factorization), which will have a free component and an attribute-dependent component. Other additional separate factors can be specified through k_sec and k_main. Optionally, this parameter might be set to zero while setting k_sec and k_main for a different type of model. Typical values are 30 to 100.
lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: user_bias, item_bias, A, B, C, D. The attribute biases will have the same regularization as the matrices to which they apply (C and D). Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. For example, a good value for the MovieLens10M would be lambda_=35. Typical values are \(10^{-2}\) to \(10^2\). Passing different regularization for each matrix is not supported with method='als'.
method (str, one of “lbfgs” or “als”) – Optimization method used to fit the model. If passing 'lbfgs', will fit it through a gradient-based approach using an L-BFGS optimizer. If passing 'als', will first obtain the solution ignoring the side information using an alternating least-squares procedure (the classical model described in other papers), then reconstruct the model matrices by a least-squares approximation. The ALS approach was implemented for experimentation purposes only and is not recommended.
use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative to the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data, such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data. Ignored when passing method="lbfgs".
user_bias (bool) – Whether to add user biases (intercepts) to the model.
item_bias (bool) – Whether to add item biases (intercepts) to the model. Be aware that using item biases with low regularization for them will tend to favor items with high average ratings regardless of the number of ratings the item has received.
center (bool) – Whether to center the “X” data by subtracting the mean value. For recommender systems, it’s highly recommended to pass “True” here, the more so if the model has user and/or item biases.
k_sec (int) – Number of factors in the factorizing matrices which are determined exclusively from user/item attributes. These will be at the beginning of the C and D matrices once the model is fit. If there are no attributes for a given matrix (user/item), then that matrix will have an extra k_sec factors (e.g. if passing user side info but not item side info, then the B matrix will have an extra k_sec factors). Will be counted in addition to those already set by k. Not supported when using method='als'. For a different model having only k_sec with k=0 and k_main=0, see the ContentBased class.
k_main (int) – Number of factors in the factorizing matrices which are determined without any user/item attributes. These will be at the end of the A and B matrices once the model is fit. Will be counted in addition to those already set by k. Not supported when using method='als'.
add_intercepts (bool) – Whether to add intercepts/biases to the user/item attribute matrices.
w_user (float) – Multiplier for the effect of the attributes contribution to the factorizing matrix A (that is, Am = A + w_user*U*C). Passing values larger than 1 has the effect of giving less freedom to the free offset term.
w_item (float) – Multiplier for the effect of the attributes contribution to the factorizing matrix B (that is, Bm = B + w_item*I*D). Passing values larger than 1 has the effect of giving less freedom to the free offset term.
maxiter (int) – Maximum L-BFGS iterations to perform. The procedure will halt if it has not converged after this number of updates. Note that, compared to the collective model, more iterations will be required for convergence here. Using higher regularization values might also decrease the number of required iterations. Pass zero for no L-BFGS iterations limit. If the procedure is spending thousands of iterations without any significant decrease in the loss function or gradient norm, it’s highly likely that the regularization is too low. Ignored when passing method='als'.
niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Ignored when passing method='lbfgs'. Typical values are 6 to 30.
parallelize (str, “separate” or “single”) – How to parallelize gradient calculations when using more than one thread with method='lbfgs'. Passing 'separate' will iterate over the data twice - first by rows and then by columns, letting each thread calculate results for each row and column, whereas passing 'single' will iterate over the data only once, and then sum the obtained results from each thread. Passing 'separate' is much more memory-efficient and less prone to irreproducibility of random seeds, but might be slower for typical use-cases. Ignored when passing nthreads=1, or method='als', or when compiling without OpenMP support.
corr_pairs (int) – Number of correction pairs to use for the L-BFGS optimization routine. Recommended values are between 3 and 7. Note that higher values translate into higher memory requirements. Ignored when passing method='als'.
max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False or method="lbfgs".
precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="lbfgs".
finalize_chol (bool) – When passing use_cg=True and method="als", whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.
NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix or DataFrame) instead of ignoring them. Note that this is a different model from the implicit-feedback version with weighted entries, and it’s a much faster model to fit. Be aware that this option will be ignored later when predicting on new data - that is, non-present values will be treated as missing. If passing this option, be aware that the defaults are also to perform mean centering and add user/item biases, which might be undesirable to have together with this option.
use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. Note however that, if using more than one thread, results might not be 100% reproducible with method='lbfgs' due to round-off errors in parallelized aggregations. If passing None, will draw a non-reproducible random integer to use as seed.
verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model. Be aware that, if passing ‘False’ and method='lbfgs', the optimization routine will not respond to interrupt signals.
print_every (int) – Print L-BFGS convergence messages every n-iterations. Ignored when passing verbose=False or method='als'.
handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).
produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 means all available threads.
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
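As a small illustration of the joblib convention for negative thread counts mentioned under nthreads, the sketch below resolves a requested value into an actual number of threads. The helper resolve_nthreads is hypothetical and not part of the package; flooring the result at 1 is an assumption made here for safety.

```python
import os

def resolve_nthreads(nthreads, max_threads=None):
    """Hypothetical helper: turn a possibly negative count into a positive one."""
    if max_threads is None:
        # os.cpu_count() may return None on some platforms.
        max_threads = os.cpu_count() or 1
    if nthreads < 0:
        # joblib convention: -1 -> all threads, -2 -> all but one, and so on.
        # The floor at 1 is an assumption of this sketch.
        return max(1, max_threads + 1 + nthreads)
    return nthreads

all_threads = resolve_nthreads(-1, max_threads=8)
all_but_one = resolve_nthreads(-2, max_threads=8)
explicit = resolve_nthreads(4, max_threads=8)
```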
- Variables
is_fitted (bool) – Whether the model has been fitted to data.
reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).
user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’.
user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False, this array will be empty.
item_bias (array(n,), or array(0,)) – The obtained biases for each item (column in the ‘X’ matrix). If passing item_bias=False, this array will be empty.
A (array(m, k+k_main) or array(m, k_sec+k+k_main)) – The free offset for the user-factors obtained from user attributes and matrix C_. If passing k_sec>0 and no user side information, this matrix will have an extra k_sec columns at the beginning.
B (array(n, k+k_main) or array(n, k_sec+k+k_main)) – The free offset for the item-factors obtained from item attributes and matrix D_. If passing k_sec>0 and no item side information, this matrix will have an extra k_sec columns at the beginning.
C (array(p, k_sec+k)) – The obtained coefficients for the user attributes.
D (array(q, k_sec+k)) – The obtained coefficients for the item attributes.
C_bias (array(k_sec+k)) – The intercepts/biases for the C matrix.
D_bias (array(k_sec+k)) – The intercepts/biases for the D matrix.
nfev (int) – Number of function and gradient evaluations performed during the L-BFGS optimization procedure.
nupd (int) – Number of L-BFGS updates performed during the optimization procedure.
References
- 1c
Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).
- drop_nonessential_matrices(drop_precomputed=True)¶
Drop matrices that are not used for prediction
Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.
This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.
Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.
After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are: factors_warm, factors_cold, factors_multiple, topN_warm, topN_cold.
- Parameters
drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).
- Returns
self – This object with the non-essential matrices dropped.
- Return type
obj
- factors_cold(U=None, U_col=None, U_val=None)¶
Determine user-factors from new data, given U
Note
For large-scale usage, these factors can be obtained by a matrix multiplication of the attributes matrix and the attribute (model parameter) C_, plus the intercept if present (C_bias_).
Note
The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.
- Parameters
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
factors – The user-factors as determined by the model.
- Return type
array(k_sec+k+k_main,)
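The note about obtaining cold-start factors through a matrix multiplication can be sketched as follows. This is an illustrative re-implementation in plain Python, not the package's code: the function cold_factors and the toy values of C and C_bias are made up, and it assumes w_user=1 and k_main=0. It also shows that the dense ‘U’ input and the sparse ‘U_col’+‘U_val’ input describe the same attribute vector.

```python
def cold_factors(C, C_bias=None, U=None, U_col=None, U_val=None):
    """Hypothetical sketch of cold-start user factors: U * C (+ C_bias)."""
    k = len(C[0])
    if U is not None:
        # Dense input: one value for each of the p attribute columns.
        pairs = list(enumerate(U))
    else:
        # Sparse input: only the non-zero columns and their values.
        pairs = list(zip(U_col, U_val))
    factors = [0.0] * k
    for col, val in pairs:
        for f in range(k):
            factors[f] += val * C[col][f]
    if C_bias is not None:
        factors = [v + b for v, b in zip(factors, C_bias)]
    return factors

# Toy coefficients for p=3 attributes and k=2 factors (made-up values).
C = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
C_bias = [0.1, -0.1]

dense = cold_factors(C, C_bias, U=[2.0, 0.0, 4.0])
sparse = cold_factors(C, C_bias, U_col=[0, 2], U_val=[2.0, 4.0])
```

For many users at once, the same computation is a single product of the (m, p) attributes matrix with C_, which is what makes it suitable for large-scale usage.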
- factors_warm(X=None, X_col=None, X_val=None, W=None, U=None, U_col=None, U_val=None, return_bias=False, return_raw_A=False, exact=False)¶
Determine user latent factors based on new ratings data
Note
The argument ‘NA_as_zero’ is ignored here.
- Parameters
X (array(n,) or None) – Observed new ‘X’ data for a given user, in dense format. Non-observed entries should have value np.nan.
X_col (array(nnz,) or None) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,) or None) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
return_bias (bool) – Whether to return also the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.
return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.
exact (bool) – Whether to calculate “A” and “Am” with the regularization applied to “A” instead of to “Am”. This is usually a slower procedure. Only relevant when passing “X” data.
- Returns
factors (array(k_sec+k+k_main,) or array(k+k_main,)) – User factors as determined from the data in ‘X’.
bias (float) – User bias as determined from the data in ‘X’. Only returned if passing return_bias=True.
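The dense (‘X’) and sparse (‘X_col’+‘X_val’) inputs are two encodings of the same observations. The sketch below converts one into the other; dense_to_sparse is a hypothetical helper, and it assumes the model was fit on a matrix, so that items are identified by column number. If ‘fit’ received a data frame, ‘X_col’ would need to hold entries from the ‘ItemId’ column instead.

```python
import math

def dense_to_sparse(x_dense):
    """Hypothetical helper: keep only observed (non-NaN) entries of a dense row."""
    X_col = [j for j, v in enumerate(x_dense) if not math.isnan(v)]
    X_val = [x_dense[j] for j in X_col]
    return X_col, X_val

# One new user with ratings for items 1 and 3 out of n=4 items.
x_new = [float("nan"), 4.0, float("nan"), 2.5]
X_col, X_val = dense_to_sparse(x_new)
```

Weights ‘W’, if used, follow the same shape rule: ‘n’ entries alongside the dense form, ‘nnz’ entries alongside the sparse form.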
- fit(X, U=None, I=None, W=None)¶
Fit model to explicit-feedback data and user/item attributes
Note
None of the side info inputs should have missing values. If passing side information ‘U’ and/or ‘I’, all entries (users/items) must be present in both the main matrix and the side info matrix.
Note
In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.
- Parameters
X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’, and might additionally have a column ‘Weight’. If passing a NumPy array, missing (unobserved) entries should have value np.nan. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.
U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.
I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.
W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Return type
self
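The note about sorting sparse data by column can be illustrated on raw COO-style triplets. This is a minimal sketch in plain Python with a hypothetical helper, sort_triplets_by_column; breaking ties by row index is an assumption made here to get a deterministic order, not something the note prescribes.

```python
def sort_triplets_by_column(rows, cols, vals):
    """Hypothetical helper: reorder (row, col, value) triplets by column index."""
    # Ties between equal columns are broken by row index (an assumption).
    order = sorted(range(len(cols)), key=lambda i: (cols[i], rows[i]))
    return (
        [rows[i] for i in order],
        [cols[i] for i in order],
        [vals[i] for i in order],
    )

# Four observed ratings given as unsorted triplets.
rows = [0, 1, 0, 2]
cols = [2, 0, 1, 0]
vals = [5.0, 3.0, 4.0, 1.0]
sorted_rows, sorted_cols, sorted_vals = sort_triplets_by_column(rows, cols, vals)
```

The same ordering would then be applied to ‘X_col’ and ‘X_val’ (and ‘W’, if present) before calling the prediction functions on new data.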
- get_params(deep=True)¶
Get parameters for this estimator.
Kept for compatibility with scikit-learn.
- Parameters
deep (bool) – Ignored.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- item_factors_cold(I=None, I_col=None, I_val=None)¶
Determine item-factors from new data, given I
- Parameters
I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
- Returns
factors – The item-factors as determined by the model.
- Return type
array(k_sec+k+k_main,)
- predict(user, item)¶
Predict ratings/values given by existing users to existing items
Note
For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
item (array-like(n,)) – Items for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in item will be matched with the corresponding entry of user at the same position in the array/list.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
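The position-wise matching of ‘user’ and ‘item’ can be sketched as below. This is not the package's implementation: predict_pairs is a hypothetical function, Am and Bm stand for the final user and item factor matrices, and adding the global mean and the biases to the dot product is the usual convention for this kind of model, assumed here rather than taken from the text.

```python
def predict_pairs(Am, Bm, user, item, glob_mean=0.0,
                  user_bias=None, item_bias=None):
    """Hypothetical sketch: one score per (user[i], item[i]) pair."""
    if len(user) != len(item):
        raise ValueError("'user' and 'item' must have the same length.")
    scores = []
    for u, i in zip(user, item):
        # Dot product of the user's and the item's factor vectors.
        score = sum(a * b for a, b in zip(Am[u], Bm[i])) + glob_mean
        if user_bias is not None:
            score += user_bias[u]
        if item_bias is not None:
            score += item_bias[i]
        scores.append(score)
    return scores

# Toy factors for 2 users and 3 items with k=2 (made-up values).
Am = [[1.0, 0.0], [0.5, 0.5]]
Bm = [[2.0, 1.0], [0.0, 4.0], [1.0, 1.0]]
pair_scores = predict_pairs(Am, Bm, user=[0, 1, 1], item=[2, 0, 1], glob_mean=3.0)
```

The output has one entry per pair, not one per user-item combination: three pairs in, three scores out.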
- predict_cold(items, U=None, U_col=None, U_val=None)¶
Predict rating/confidence given by a new user to existing items, given U
Note
The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
scores – Predicted ratings for the requested items, for this user.
- Return type
array(n,)
- predict_cold_multiple(item, U)¶
Predict rating/confidence given by new users to existing items, given U
- Parameters
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(m, p), CSR matrix(m, p), or COO matrix(m, p)) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in item.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- predict_new(user, I)¶
Predict rating given by existing users to new items, given I
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(n, q), or COO matrix(n, q)) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in user.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
- predict_warm(items, X=None, X_col=None, X_val=None, W=None, U=None, U_col=None, U_val=None)¶
Predict ratings for existing items, for a new user, given ‘X’
Note
The argument ‘NA_as_zero’ is ignored here.
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan.
X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’ + ‘U_val’. Not used when using k_sec=0.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’ + ‘U_val’. Not used when using k_sec=0.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’ + ‘U_val’. Not used when using k_sec=0.
- Returns
scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.
- Return type
array(n,)
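The dense and sparse ways of passing the new user’s ‘X’ data are interchangeable. The following is a rough sketch, assuming the model was fitted to arrays so that ‘X_col’ holds column numbers; the helper observed_entries is hypothetical and only illustrates how the ‘n’-length and ‘nnz’-length inputs relate:

```python
import math


def observed_entries(x_dense, w_dense=None):
    """Keep only the non-NaN entries of a dense row, with matching weights."""
    cols = [j for j, v in enumerate(x_dense) if not math.isnan(v)]
    vals = [x_dense[j] for j in cols]
    weights = None if w_dense is None else [w_dense[j] for j in cols]
    return cols, vals, weights


X = [math.nan, 4.0, math.nan, 2.5]   # dense: 'n' entries, NaN where unobserved
W = [1.0, 2.0, 1.0, 0.5]             # dense weights: also 'n' entries
X_col, X_val, W_sparse = observed_entries(X, W)
print(X_col, X_val, W_sparse)        # sparse: 'nnz' entries each
```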
- predict_warm_multiple(X, item, U=None, W=None)¶
Predict ratings for existing items, for new users, given ‘X’
Note
The argument ‘NA_as_zero’ is ignored here.
- Parameters
X (array(m, n), CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Missing entries should have value np.nan when passing a dense array. Must have one row per entry of ‘item’.
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in ‘item’ will be matched with the corresponding row of ‘X’.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’. Should not contain any missing values.
W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- set_params(**params)¶
Set the parameters of this estimator.
Kept for compatibility with scikit-learn.
Note
Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- swap_users_and_items(precompute=True)¶
Swap the users and items in a factorization model
This method will generate a new object that will have the users and items of this object swapped, and such result can be used under the same methods such as topN, in which any mention of users will now mean items and vice-versa.
Note
The resulting object will not generate any deep copies of the original model’s objects.
- Parameters
precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.
- Returns
model – An object of the same class as this one, but with the user and items swapped.
- Return type
obj
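Conceptually, swapping amounts to exchanging the roles of the user-side and item-side matrices while sharing the underlying data. The toy class below is purely illustrative (TinyModel and its two fields are assumptions, not the actual cmfrec object layout) and only shows what "no deep copies" means in practice:

```python
from dataclasses import dataclass


@dataclass
class TinyModel:
    A: list  # user factors (hypothetical field)
    B: list  # item factors (hypothetical field)

    def swap_users_and_items(self):
        # The new object references the same lists; nothing is copied.
        return TinyModel(A=self.B, B=self.A)


model = TinyModel(A=[[1.0, 0.0]], B=[[0.5, 0.5], [0.2, 0.8]])
swapped = model.swap_users_and_items()
print(swapped.A is model.B)
```

Because the matrices are shared, modifying one object in place would also affect the other.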
- topN(user, n=10, include=None, exclude=None, output_score=False)¶
Rank top-N highest-predicted items for an existing user
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
n (int) – Number of top-N highest-predicted results to output.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
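The exact ranking described in the note above can be pictured as scoring every candidate item and sorting. This is a simplified sketch under stated assumptions: scores are plain dot products of user and item factors (biases and centering are ignored), items are identified by column number, and rank_top_n is a hypothetical helper rather than the library’s implementation:

```python
def rank_top_n(user_factors, item_factors, n=10, include=None, exclude=None):
    """Score all candidate items for one user and return the n best."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    if include is not None:
        candidates = list(include)
    else:
        excluded = set(exclude or [])
        candidates = [j for j in range(len(item_factors)) if j not in excluded]
    scores = {
        j: sum(a * b for a, b in zip(user_factors, item_factors[j]))
        for j in candidates
    }
    top = sorted(candidates, key=lambda j: scores[j], reverse=True)[:n]
    return top, [scores[j] for j in top]


a = [1.0, 0.5]                                        # factors of one user
B = [[0.2, 0.1], [0.9, 0.4], [0.5, 0.5], [0.0, 1.0]]  # factors of four items
items, scores = rank_top_n(a, B, n=2, exclude=[1])
print(items, scores)
```

The cost grows linearly with the number of items, which is why approximate nearest-neighbor software is suggested for serving.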
- topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘U’
Note
The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_new(user, I, n=10, output_score=False)¶
Rank top-N highest-predicted items for an existing user, given ‘I’
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(m, q), or COO matrix(m, q)) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.
n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in ‘I’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_warm(n=10, X=None, X_col=None, X_val=None, W=None, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘X’
Note
The argument ‘NA_as_zero’ is ignored here.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’ + ‘X_val’.
X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’ + ‘X_val’.
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’ + ‘U_val’. Not used when using k_sec=0.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’ + ‘U_val’. Not used when using k_sec=0.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’ + ‘U_val’. Not used when using k_sec=0.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- transform(X, y=None, U=None, W=None, replace_existing=False)¶
Reconstruct entries of the ‘X’ matrix
Will reconstruct all the entries in the ‘X’ matrix as determined by the model. This method is intended to be used for imputing tabular data, and can be used as part of SciKit-Learn pipelines.
Note
The argument ‘NA_as_zero’ is ignored here.
Note
If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.
- Parameters
X (array(m, n)) – New ‘X’ data with potentially missing entries which are to be imputed. Missing entries should have value np.nan.
y (None) – Not used. Kept as a placeholder for compatibility with SciKit-Learn pipelines.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’. Should not contain any missing values.
W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Returns
X – The ‘X’ matrix with all entries as determined by the model, returned as a dense NumPy array.
- Return type
array(m, n)
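The imputation idea behind this method can be illustrated with a toy low-rank reconstruction. The sketch below makes several assumptions that are not taken from the library: the row factors A and column factors B are given up front (the real method derives row factors from the new data), biases and centering are ignored, and observed entries are kept as-is while only missing ones are filled:

```python
import math


def impute(X, A, B):
    """Fill NaN entries of X with the dot product of row and column factors."""
    filled = []
    for i, row in enumerate(X):
        new_row = []
        for j, v in enumerate(row):
            if math.isnan(v):
                v = sum(a * b for a, b in zip(A[i], B[j]))
            new_row.append(v)
        filled.append(new_row)
    return filled


X = [[1.0, math.nan], [math.nan, 2.0]]
A = [[1.0, 0.0], [0.0, 1.0]]   # row factors (assumed known)
B = [[3.0, 4.0], [5.0, 6.0]]   # column factors (assumed known)
X_imputed = impute(X, A, B)
print(X_imputed)
```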
OMF_implicit¶
- class cmfrec.OMF_implicit(k=50, lambda_=1.0, alpha=1.0, use_cg=True, add_intercepts=True, niter=10, apply_log_transf=False, use_float=False, max_cg_steps=3, precondition_cg=False, finalize_chol=False, random_state=1, verbose=False, produce_dicts=False, handle_interrupt=True, nthreads=-1, n_jobs=None)¶
Offsets model for implicit-feedback data
Tries to approximate the ‘X’ interactions matrix using the user side information ‘U’ and item side information ‘I’ by a formula as follows:
\(\mathbf{X} \sim (\mathbf{A} + \mathbf{U} \mathbf{C}) * (\mathbf{B} + \mathbf{I} \mathbf{D})^T\)
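To make the shapes in the formula concrete, here is a small pure-Python sketch of the approximation. It is only an illustration: the helper names are hypothetical, intercepts are left out, and the tiny matrices are made-up values rather than fitted parameters:

```python
def matmul(P, Q):
    """Plain matrix product of two lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def add(P, Q):
    """Element-wise sum of two matrices with the same shape."""
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def transpose(P):
    return [list(col) for col in zip(*P)]


def offsets_approximation(A, U, C, B, I, D):
    user_factors = add(A, matmul(U, C))   # shape (m, k)
    item_factors = add(B, matmul(I, D))   # shape (n, k)
    return matmul(user_factors, transpose(item_factors))  # shape (m, n)


A = [[1.0], [2.0]]   # free user offsets, (m=2, k=1)
U = [[1.0], [0.0]]   # user attributes, (m=2, p=1)
C = [[0.5]]          # user attribute coefficients, (p=1, k=1)
B = [[1.0], [3.0]]   # free item offsets, (n=2, k=1)
I = [[2.0], [0.0]]   # item attributes, (n=2, q=1)
D = [[1.0]]          # item attribute coefficients, (q=1, k=1)
X_hat = offsets_approximation(A, U, C, B, I, D)
print(X_hat)
```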
Note
This model was implemented for experimentation purposes only. Performance is likely to be bad. Be warned.
Note
This works by first fitting a model with no side info and then reconstructing the parameters by least squares approximations, so when making warm-start predictions, the results will be exactly the same as if not using any side information (user/item attributes).
Note
The model optimization objective will not scale any of its terms according to number of entries, so hyperparameters such as lambda_ will require more tuning than in other software and trying out values over a wider range.
Note
Recommendation quality metrics for this model can be calculated with the recometrics library.
- Parameters
k (int) – Number of latent factors to use (dimensionality of the low-rank approximation). Typical values are 30 to 100.
lambda_ (float) – Regularization parameter. Note that the default value for lambda_ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. For example, a good number for the LastFM-360K could be lambda_=5. Typical values are \(10^{-2}\) to \(10^2\).
alpha (float) – Weighting parameter for the non-zero entries in the implicit-feedback model. See [2d] for details. Note that, while the author’s suggestion for this value is 40, other software such as implicit use a value of 1, whereas Spark uses a value of 0.01 by default. If the data has very high values, it might even be beneficial to put a very low value here - for example, for the LastFM-360K, values below 1 might give better results.
use_cg (bool) – In the ALS method, whether to use a conjugate gradient method to solve the closed-form least squares problems. This is a faster and more memory-efficient alternative than the default Cholesky solver, but less exact, less numerically stable, and will require slightly more ALS iterations (niter) to reach a good optimum. In general, better results are achieved with use_cg=False. Note that, if using this method, calculations after fitting which involve new data such as factors_warm, might produce slightly different results from the factors obtained from calling fit with the same data, due to differences in numerical precision. A workaround for this issue (factors on new data that might differ slightly) is to use finalize_chol=True. Even if passing “True” here, will use the Cholesky method in cases in which it is faster (e.g. dense matrices with no missing values), and will not use the conjugate gradient method on new data.
add_intercepts (bool) – Whether to add intercepts/biases to the user/item attribute matrices.
niter (int) – Number of alternating least-squares iterations to perform. Note that one iteration denotes an update round for all the matrices rather than an update of a single matrix. In general, the more iterations, the better the end result. Typical values are 6 to 30.
apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’ (i.e. ‘X := log(X)’)
use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
max_cg_steps (int) – Maximum number of conjugate gradient iterations to perform in an ALS round. Ignored when passing use_cg=False.
precondition_cg (bool) – Whether to use Jacobi preconditioning for the conjugate gradient procedure. In general, this type of preconditioning is not beneficial (makes the algorithm slower) as the factor variables tend to be in the same scale, but it might help when using non-shared factors. Note that, when using preconditioning, the procedure will not check for convergence, taking instead a fixed number of steps (given by max_cg_steps) at each iteration regardless of whether it has reached the optimum already. Ignored when passing use_cg=False or method="als".
finalize_chol (bool) – When passing use_cg=True, whether to perform the last iteration with the Cholesky solver. This will make it slower, but will avoid the issue of potential mismatches between the result from fit and calls to factors_warm or similar with the same data.
random_state (int, RandomState, Generator, None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. If passing None, will draw a non-reproducible random integer to use as seed.
verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model.
handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).
produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that -1 uses all available threads.
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
- Variables
is_fitted (bool) – Whether the model has been fitted to data.
reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).
user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
user_dict (dict) – Python dict version of user_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
item_dict (dict) – Python dict version of item_mapping_. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
A (array(m, k)) – The free offset for the user-factors obtained from user attributes and matrix C_.
B (array(n, k)) – The free offset for the item-factors obtained from item attributes and matrix D_.
C (array(p, k)) – The obtained coefficients for the user attributes.
D (array(q, k)) – The obtained coefficients for the item attributes.
C_bias (array(k)) – The intercepts/biases for the C matrix.
D_bias (array(k)) – The intercepts/biases for the D matrix.
References
- 1d
Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).
- 2d
Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
- 3d
Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. “Applications of the conjugate gradient method for implicit feedback collaborative filtering.” Proceedings of the fifth ACM conference on Recommender systems. 2011.
- drop_nonessential_matrices(drop_precomputed=True)¶
Drop matrices that are not used for prediction
Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.
This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.
Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.
After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as predict or swap_users_and_items. The methods which are intended to continue working afterwards are:
factors_warm
factors_cold
factors_multiple
topN_warm
topN_cold
- Parameters
drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).
- Returns
self – This object with the non-essential matrices dropped.
- Return type
obj
- factors_cold(U=None, U_col=None, U_val=None)¶
Determine user-factors from new data, given U
Note
For large-scale usage, these factors can be obtained by a matrix multiplication of the attributes matrix and the attribute (model parameter) C_, plus the intercept if present (C_bias_).
Note
The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.
- Parameters
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
factors – The user-factors as determined by the model.
- Return type
array(k_sec+k+k_main,)
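The matrix-multiplication shortcut mentioned in the first note can be sketched for a single user as follows. This is an illustration only: cold_factors is a hypothetical helper, and the small C and C_bias values are made up rather than taken from a fitted model:

```python
def cold_factors(u, C, C_bias=None):
    """Compute u @ C (+ C_bias) for one user's attribute vector."""
    k = len(C[0])
    factors = [sum(u[p] * C[p][f] for p in range(len(u))) for f in range(k)]
    if C_bias is not None:
        factors = [x + b for x, b in zip(factors, C_bias)]
    return factors


u = [1.0, 2.0]                    # attributes of the new user, length p
C = [[0.5, 0.0], [0.25, 1.0]]     # coefficients, shape (p, k)
C_bias = [0.5, -1.0]              # intercepts, length k
factors = cold_factors(u, C, C_bias)
print(factors)
```

For many users at once, the same computation becomes a single product of the (m, p) attributes matrix with the (p, k) coefficients.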
- factors_warm(X_col, X_val, return_raw_A=False)¶
Determine user latent factors based on new interactions data
- Parameters
X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.
- Returns
factors – User factors as determined from the data in ‘X_col’ and ‘X_val’.
- Return type
array(k,)
- fit(X, U=None, I=None)¶
Fit model to implicit-feedback data and user/item attributes
Note
None of the side info inputs should have missing values. If passing side information ‘U’ and/or ‘I’, all entries (users/items) must be present in both the main matrix and the side info matrix.
Note
In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.
- Parameters
X (DataFrame(nnz, 3), or sparse COO(m, n)) – Matrix to factorize. Can be passed as a SciPy sparse COO matrix (recommended), or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Value’. If passing a NumPy array, missing (unobserved) entries should have value np.nan. If passing a DataFrame, the IDs will be internally remapped.
U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix too. Should not contain any missing values.
I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix too. Should not contain any missing values.
- Return type
self
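For the sorting requirement in the second note, the following sketch shows one way to order a single user’s sparse entries by column index before passing them to a prediction function. It assumes column-number inputs (not ‘ItemId’ labels), and sort_by_column is a hypothetical helper:

```python
def sort_by_column(X_col, X_val):
    """Reorder sparse entries so that column indices are ascending."""
    order = sorted(range(len(X_col)), key=lambda t: X_col[t])
    return [X_col[t] for t in order], [X_val[t] for t in order]


X_col, X_val = sort_by_column([7, 2, 5], [1.0, 3.0, 2.0])
print(X_col, X_val)
```

Each value stays paired with its original column; only the order of the pairs changes.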
- get_params(deep=True)¶
Get parameters for this estimator.
Kept for compatibility with scikit-learn.
- Parameters
deep (bool) – Ignored.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- item_factors_cold(I=None, I_col=None, I_val=None)¶
Determine item-factors from new data, given I
- Parameters
I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+’I_val’.
I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+’I_val’.
- Returns
factors – The item-factors as determined by the model.
- Return type
array(k_sec+k+k_main,)
- predict(user, item)¶
Predict ratings/values given by existing users to existing items
Note
For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in ‘item’ will be matched with the corresponding entry of ‘user’ at the same position in the array/list.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
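The position-wise pairing of ‘user’ and ‘item’ can be pictured with the sketch below. It is a simplification under stated assumptions: predictions are plain dot products of already-available factor matrices, IDs are row/column numbers, and out-of-range combinations yield NaN as described in the note for models other than explicit-feedback CMF. The helper predict_pairs is hypothetical:

```python
import math


def predict_pairs(users, items, A, B):
    """Score users[t] against items[t] for every position t."""
    scores = []
    for u, i in zip(users, items):
        if 0 <= u < len(A) and 0 <= i < len(B):
            scores.append(sum(a * b for a, b in zip(A[u], B[i])))
        else:
            scores.append(math.nan)  # invalid user-item combination
    return scores


A = [[1.0, 0.0], [0.0, 2.0]]   # user factors (made-up values)
B = [[3.0, 1.0], [0.5, 0.5]]   # item factors (made-up values)
scores = predict_pairs([0, 1, 5], [1, 0, 0], A, B)
print(scores)
```

The output has one score per pair, not one per user-item cross combination.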
- predict_cold(items, U=None, U_col=None, U_val=None)¶
Predict rating/confidence given by a new user to existing items, given U
Note
The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
scores – Predicted ratings for the requested items, for this user.
- Return type
array(n,)
- predict_cold_multiple(item, U)¶
Predict rating/confidence given by new users to existing items, given U
- Parameters
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(m, p), CSR matrix(m, p), or COO matrix(m, p)) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in ‘item’.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- predict_new(user, I)¶
Predict rating given by existing users to new items, given I
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(n, q), or COO matrix(n, q)) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in ‘user’.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
- predict_warm(items, X_col, X_val)¶
Predict scores for existing items, for a new user, given ‘X’
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
- Returns
scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.
- Return type
array(n,)
- predict_warm_multiple(X, item, U=None)¶
Predict scores for existing items, for new users, given ‘X’
- Parameters
X (array(m, n), CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Missing entries should have value np.nan when passing a dense array. Must have one row per entry of ‘item’.
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in ‘item’ will be matched with the corresponding row of ‘X’.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’. Should not contain any missing values.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
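As a rough sketch of the dense input described above, the following builds an ‘X’ array in which unobserved entries are np.nan, with one row per entry of ‘item’. The number of items and the ratings are hypothetical, and the commented-out call assumes a fitted model object named ‘model’.

```python
import numpy as np

n_items = 5  # hypothetical number of columns in the 'X' passed to 'fit'

# Hypothetical observed ratings for two new users: column index -> rating.
new_ratings = [{0: 4.0, 3: 2.0}, {1: 5.0}]

# Start from an all-missing array and fill in only the observed entries.
X = np.full((len(new_ratings), n_items), np.nan)
for row, observed in enumerate(new_ratings):
    for col, value in observed.items():
        X[row, col] = value

# One item per row of 'X': row 0 is scored against item 2, row 1 against item 4.
item = np.array([2, 4])

# scores = model.predict_warm_multiple(X, item)
```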
- set_params(**params)¶
Set the parameters of this estimator.
Kept for compatibility with scikit-learn.
Note
Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- swap_users_and_items(precompute=True)¶
Swap the users and items in a factorization model
This method will generate a new object that will have the users and items of this object swapped, and such result can be used under the same methods such as ‘topN’, in which any mention of users will now mean items and vice-versa.
Note
The resulting object will not generate any deep copies of the original model’s objects.
- Parameters
precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.
- Returns
model – An object of the same class as this one, but with the user and items swapped.
- Return type
obj
- topN(user, n=10, include=None, exclude=None, output_score=False)¶
Rank top-N highest-predicted items for an existing user
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
n (int) – Number of top-N highest-predicted results to output.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
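The semantics of ‘n’, ‘include’, ‘exclude’ and ‘output_score’ can be illustrated with a small stand-alone sketch that ranks a vector of already-computed item scores. The function ‘rank_top_n’ is hypothetical and is not the library’s implementation; it assumes items are identified by their column index, as when ‘X’ is passed as a matrix.

```python
import numpy as np

def rank_top_n(scores, n=10, include=None, exclude=None, output_score=False):
    """Illustrative exact top-N ranking over precomputed item scores."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    scores = np.asarray(scores, dtype=float)
    candidates = np.arange(scores.shape[0])
    if include is not None:
        # Rank only among the requested items.
        candidates = np.asarray(include)
    elif exclude is not None:
        # Rank all items except the excluded ones.
        candidates = np.setdiff1d(candidates, np.asarray(exclude))
    # Order candidates from highest to lowest score and keep the first n.
    order = np.argsort(-scores[candidates], kind="stable")[:n]
    items = candidates[order]
    return (items, scores[items]) if output_score else items

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.3])
top = rank_top_n(scores, n=2, exclude=[1])
items, vals = rank_top_n(scores, n=3, include=[0, 2, 4], output_score=True)
```

Because every candidate is scored and sorted, the cost grows with the number of items, which is the reason the note above suggests approximate nearest-neighbor software for serving.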
- topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘U’
Note
The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_new(user, I, n=10, output_score=False)¶
Rank top-N highest-predicted items for an existing user, given ‘I’
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(m, q), or COO matrix(m, q)) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.
n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in I.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_warm(n=10, X_col=None, X_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘X’
- Parameters
n (int) – Number of top-N highest-predicted results to output.
X_col (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero).
X_val (array(nnz,)) – Observed new ‘X’ data for a given user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
ContentBased¶
- class cmfrec.ContentBased(k=20, lambda_=100.0, user_bias=False, item_bias=False, add_intercepts=True, maxiter=3000, corr_pairs=3, parallelize='separate', verbose=True, print_every=100, random_state=1, use_float=True, produce_dicts=False, handle_interrupt=True, start_with_ALS=True, nthreads=-1, n_jobs=None)¶
Content-based recommendation model
Fits a recommendation model to explicit-feedback data based on user and item attributes only, which makes it better suited to cold-start recommendations and gives faster prediction times. Follows the same factorization approach as the classical model, but with the latent-factor matrices being determined as linear combinations of the user and item attributes - this is similar to a two-layer neural network with separate layers for each input.
The ‘X’ is approximated using the user side information ‘U’ and item side information ‘I’ by a formula as follows:
\(\mathbf{X} \sim (\mathbf{U} \mathbf{C}) * (\mathbf{I} \mathbf{D})^T\)
Note
This is a highly non-linear model that will take many more L-BFGS iterations to converge compared to the other models. It’s advised to experiment with tuning the maximum number of iterations.
Note
The input data for attributes does not undergo any transformations when fitting this model, so the model is to some extent sensitive to the scales of the variables and their means, in the same way as regularized linear regression.
Note
In order to obtain the final user-factors and item-factors matrices that are used to factorize ‘X’ from a fitted-model object, you’ll need to perform a matrix multiplication between the side info (‘U’ and ‘I’) and the fitted parameters (‘C_’ and ‘D_’) - e.g. ‘A = U*model.C_ + model.C_bias_’, where ‘*’ denotes matrix multiplication.
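A minimal NumPy sketch of that matrix multiplication is shown here. The arrays ‘C’, ‘D’, ‘C_bias’ and ‘D_bias’ are random stand-ins for the fitted parameters ‘C_’, ‘D_’, ‘C_bias_’ and ‘D_bias_’, the dimensions are arbitrary, and the sketch leaves out the global mean and any user/item biases.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, q, k = 4, 5, 3, 2, 2  # users, items, user attrs, item attrs, factors

U = rng.normal(size=(m, p))    # user attributes, one row per user
I = rng.normal(size=(n, q))    # item attributes, one row per item

# Random stand-ins for the fitted parameters of a model (assumption).
C, D = rng.normal(size=(p, k)), rng.normal(size=(q, k))
C_bias, D_bias = rng.normal(size=k), rng.normal(size=k)

A = U @ C + C_bias             # user factors, shape (m, k)
B = I @ D + D_bias             # item factors, shape (n, k)
X_approx = A @ B.T             # low-rank approximation, shape (m, n)
```

Each entry of ‘X_approx’ is the dot product of one user-factor row and one item-factor row, matching the formula given above for the approximation of ‘X’.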
- Parameters
k (int) – Number of latent factors to use (dimensionality of the low-rank approximation). Recommended values are 30 to 100.
lambda_ (float or array(6,)) – Regularization parameter. Can also use different regularization for each matrix, in which case it should be an array with 6 entries, corresponding, in this order, to: user_bias, item_bias, [ignored], [ignored], C, D. Note that the default value for ‘lambda_’ here is much higher than in other software, and that the loss/objective function is not divided by the number of entries. Recommended values are \(10^{-2}\) to \(10^2\).
user_bias (bool) – Whether to add user biases (intercepts) to the model.
item_bias (bool) – Whether to add item biases (intercepts) to the model. Be aware that using item biases with low regularization for them will tend to favor items with high average ratings regardless of the number of ratings the item has received.
add_intercepts (bool) – Whether to add intercepts/biases to the user/item attribute matrices.
maxiter (int) – Maximum L-BFGS iterations to perform. The procedure will halt if it has not converged after this number of updates. Note that, compared to the collective model, more iterations will be required for convergence here. Using higher regularization values might also decrease the number of required iterations. Pass zero for no L-BFGS iterations limit. If the procedure is spending thousands of iterations without any significant decrease in the loss function or gradient norm, it’s highly likely that the regularization is too low.
corr_pairs (int) – Number of correction pairs to use for the L-BFGS optimization routine. Recommended values are between 3 and 7. Note that higher values translate into higher memory requirements.
parallelize (str, “separate” or “single”) – How to parallelize gradient calculations when using more than one thread. Passing 'separate' will iterate over the data twice - first by rows and then by columns, letting each thread calculate results for each row and column, whereas passing 'single' will iterate over the data only once, and then sum the obtained results from each thread. Passing 'separate' is much more memory-efficient and less prone to irreproducibility of random seeds, but might be slower for typical use-cases. Ignored when passing nthreads=1 or compiling without OpenMP support.
verbose (bool) – Whether to print informational messages about the optimization routine used to fit the model. Be aware that, if passing ‘False’, the optimization routine will not respond to interrupt signals.
print_every (int) – Print L-BFGS convergence messages every n-iterations. Ignored when passing verbose=False.
random_state (int, RandomState, Generator, or None) – Seed used to initialize parameters at random. If passing a NumPy RandomState or Generator, will use it to draw a random integer. Note however that, if using more than one thread, results might not be 100% reproducible due to round-off errors in parallelized aggregations. If passing None, will draw a non-reproducible random integer to use as seed.
use_float (bool) – Whether to use C float type for the model parameters (typically this is np.float32). If passing False, will use C double (typically this is np.float64). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.
handle_interrupt (bool) – When receiving an interrupt signal, whether the model should stop early and leave a usable object with the parameters obtained up to the point when it was interrupted (when passing ‘True’), or raise an interrupt exception without producing a fitted model object (when passing ‘False’).
start_with_ALS (bool) – Whether to determine the initial coefficients through an ALS procedure. This might help to speed up the procedure by starting closer to an optimum. This option is not available when the side information is passed as sparse matrices.
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads), so that passing -1 uses all available threads.
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
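The joblib convention referenced for ‘nthreads’ counts negative values back from the maximum number of threads, so -1 means all threads and -2 means all but one. The helper below is a hypothetical sketch of that rule, not code from the package; clamping the result to at least one thread is an added assumption.

```python
import os

def resolve_nthreads(nthreads, max_threads=None):
    """Translate a possibly negative thread count using joblib's convention."""
    if max_threads is None:
        # Fall back to the number of CPUs reported by the operating system.
        max_threads = os.cpu_count() or 1
    if nthreads < 0:
        # -1 -> all threads, -2 -> all but one, and so on.
        nthreads = max_threads + 1 + nthreads
    # Assumption: never return fewer than one thread.
    return max(1, nthreads)

all_threads = resolve_nthreads(-1, max_threads=8)
all_but_one = resolve_nthreads(-2, max_threads=8)
explicit = resolve_nthreads(4, max_threads=8)
```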
- Variables
is_fitted (bool) – Whether the model has been fitted to data.
reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).
user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
user_dict (dict) – Python dict version of ‘user_mapping_’. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
item_dict (dict) – Python dict version of ‘item_mapping_’. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’.
user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False (the default), this array will be empty.
item_bias (array(n,)) – The obtained biases for each item (column in the ‘X’ matrix). If passing item_bias=False (the default), this array will be empty.
C (array(p, k)) – The obtained coefficients for the user attributes.
D (array(q, k)) – The obtained coefficients for the item attributes.
C_bias (array(k)) – The intercepts/biases for the C matrix.
D_bias (array(k)) – The intercepts/biases for the D matrix.
nfev (int) – Number of function and gradient evaluations performed during the L-BFGS optimization procedure.
nupd (int) – Number of L-BFGS updates performed during the optimization procedure.
References
[1e] Cortes, David. “Cold-start recommendations in Collective Matrix Factorization.” arXiv preprint arXiv:1809.00366 (2018).
- drop_nonessential_matrices(drop_precomputed=True)¶
Drop matrices that are not used for prediction
Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.
This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.
Can additionally drop some of the precomputed matrices which are only taken in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.
After dropping these non-essential matrices, it will not be possible anymore to call certain methods such as ‘predict’ or ‘swap_users_and_items’. The methods which are intended to continue working afterwards are:
factors_warm
factors_cold
factors_multiple
topN_warm
topN_cold
- Parameters
drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).
- Returns
self – This object with the non-essential matrices dropped.
- Return type
obj
- factors_cold(U=None, U_col=None, U_val=None)¶
Determine user-factors from new data, given U
Note
For large-scale usage, these factors can be obtained by a matrix multiplication of the attributes matrix and the attribute (model parameter) ‘C_’, plus the intercept if present (‘C_bias_’).
- Parameters
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
- Returns
factors – The user-factors as determined by the model.
- Return type
array(k,)
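The two input formats for a single user’s attributes carry the same information, as the sketch below tries to show. The arrays ‘C’ and ‘C_bias’ are random stand-ins for the fitted ‘C_’ and ‘C_bias_’, the attribute values are hypothetical, and the manual computation follows the matrix-multiplication description in the note above rather than calling the library.

```python
import numpy as np

p, k = 5, 2
rng = np.random.default_rng(1)
C = rng.normal(size=(p, k))   # stand-in for the fitted 'C_' (assumption)
C_bias = rng.normal(size=k)   # stand-in for the fitted 'C_bias_' (assumption)

# Dense format: one value per attribute, shape (p,).
U = np.array([0.0, 2.0, 0.0, 0.0, -1.5])

# Sparse format: column indices of the non-zero entries and their values.
U_col = np.flatnonzero(U)
U_val = U[U_col]

# Both formats lead to the same user factors, shape (k,).
factors_dense = U @ C + C_bias
factors_sparse = U_val @ C[U_col] + C_bias
```

Only one of the two formats should be passed to ‘factors_cold’: either ‘U’, or ‘U_col’ together with ‘U_val’.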
- factors_multiple(U=None)¶
Determine user-factors from new data for multiple rows, given U
- Parameters
U (array-like(m, p)) – User attributes in the new data.
- Returns
factors – The user-factors as determined by the model.
- Return type
array(m, k)
- fit(X, U, I, W=None)¶
Fit model to explicit-feedback data based on user-item attributes
Note
None of the side info inputs should have missing values. All entries (users/items) must be present in both the main matrix and the side info matrix.
Note
In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.
- Parameters
X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’, and might additionally have a column ‘Weight’. If passing a NumPy array, missing (unobserved) entries should have value np.nan. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.
U (array(m, p), COO(m, p), DataFrame(m, p+1)) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.
I (array(n, q), COO(n, q), DataFrame(n, q+1)) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array. Should not contain any missing values.
W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, ‘W’ must be a 1-d array with the same number of non-zero entries as ‘X.data’; if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Return type
self
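One possible reading of the advice about sorting sparse data by columns is sketched below on raw COO triplets. The triplets are hypothetical, and breaking ties by row index is an illustrative choice rather than a documented requirement.

```python
import numpy as np

# Hypothetical COO triplets for a small ratings matrix.
row = np.array([2, 0, 1, 0])
col = np.array([1, 3, 0, 1])
val = np.array([4.0, 5.0, 3.0, 2.0])

# np.lexsort uses the last key as the primary one: sort by column, then by row.
order = np.lexsort((row, col))
row, col, val = row[order], col[order], val[order]

# The sorted triplets could then be used to build the sparse matrix, e.g. with
# scipy.sparse.coo_matrix((val, (row, col)), shape=(3, 4)).
```

Applying the same ordering to the indices passed later to the prediction functions keeps the order of arithmetic operations consistent between fitting and prediction.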
- get_params(deep=True)¶
Get parameters for this estimator.
Kept for compatibility with scikit-learn.
- Parameters
deep (bool) – Ignored.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- predict(user, item)¶
Predict ratings/values given by existing users to existing items
Note
For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
item (array-like(n,)) – Items for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in ‘item’ will be matched with the corresponding entry of ‘user’ at the same position in the array/list.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
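Since ‘user’ and ‘item’ are matched position by position, scoring every combination of a few users and items requires expanding both into aligned arrays first. The sketch below shows one way to do this with NumPy; the IDs are hypothetical and the commented-out call assumes a fitted model object named ‘model’.

```python
import numpy as np

users = np.array([0, 1])         # hypothetical user rows
items = np.array([10, 11, 12])   # hypothetical item columns

# Repeat each user once per item, and cycle through the items once per user.
user = np.repeat(users, items.size)
item = np.tile(items, users.size)

# scores = model.predict(user, item)
# scores.reshape(users.size, items.size) would then give a users-by-items grid.
```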
- predict_cold(U, items)¶
Predict rating given by new users to existing items, given U
- Parameters
U (array(n, p), CSR(n, p), or COO(n, p)) – Attributes for the users whose ratings are to be predicted. Each row will be matched to the corresponding row of ‘items’.
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
- predict_new(U, I)¶
Predict rating given by new users to new items, given U and I
- Parameters
U (array(n, p), CSR(n, p), or COO(n, p)) – Attributes for the users whose ratings are to be predicted. Each row will be matched to the corresponding row of ‘I’.
I (array(n, q), CSR(n, q), or COO(n, q)) – Attributes for the items whose ratings are to be predicted. Each row will be matched to the corresponding row of ‘U’.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
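The row-by-row matching of ‘U’ and ‘I’ can be sketched as a paired dot product between the implied user and item factors. The parameter arrays are random stand-ins for the fitted ‘C_’, ‘D_’, ‘C_bias_’ and ‘D_bias_’, and the sketch omits the global mean and any bias terms that the model may add to its predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q, k = 3, 4, 2, 2          # pairs to score, user attrs, item attrs, factors

U = rng.normal(size=(n, p))      # attributes of the new users, one row per pair
I = rng.normal(size=(n, q))      # attributes of the new items, one row per pair

# Random stand-ins for the fitted parameters (assumption).
C, D = rng.normal(size=(p, k)), rng.normal(size=(q, k))
C_bias, D_bias = rng.normal(size=k), rng.normal(size=k)

A = U @ C + C_bias               # user factors, shape (n, k)
B = I @ D + D_bias               # item factors, shape (n, k)

# Row i of A is paired only with row i of B, giving one score per pair.
scores = np.einsum("ij,ij->i", A, B)
```

This differs from a full matrix product: only the n matched pairs are scored, which corresponds to the diagonal of ‘A @ B.T’.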
- set_params(**params)¶
Set the parameters of this estimator.
Kept for compatibility with scikit-learn.
Note
Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- swap_users_and_items(precompute=True)¶
Swap the users and items in a factorization model
This method will generate a new object that will have the users and items of this object swapped, and such result can be used under the same methods such as ‘topN’, in which any mention of users will now mean items and vice-versa.
Note
The resulting object will not generate any deep copies of the original model’s objects.
- Parameters
precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.
- Returns
model – An object of the same class as this one, but with the user and items swapped.
- Return type
obj
- topN(user, n=10, include=None, exclude=None, output_score=False)¶
Rank top-N highest-predicted items for an existing user
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
n (int) – Number of top-N highest-predicted results to output.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_cold(n=10, U=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘U’
Note
The argument ‘NA_as_zero’ (if available) is ignored here - thus, it assumes all the ‘X’ values are missing.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
U (array(p,), or None) – Attributes for the new user, in dense format. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – Attributes for the new user, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
- topN_new(n=10, U=None, U_col=None, U_val=None, I=None, output_score=False)¶
Compute top-N highest-predicted items for a given user, given U
- Parameters
n (int) – Number of top-N highest-predicted results to output.
U (array(p,), or None) – User attributes for the user for whom to rank items. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_col (None or array(nnz)) – User attributes for the user for whom to rank items, in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes for the user for whom to rank items, in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
I (array(n2, q), CSR(n2, q), or COO(n2, q)) – Attributes for the items to rank (each row corresponding to an item). Must have at least ‘n’ rows.
output_score (bool) – Whether to output the scores in addition to the row numbers. If passing ‘False’, will return a single array with the item numbers, otherwise will return a tuple with the item numbers and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items among ‘I’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
MostPopular¶
- class cmfrec.MostPopular(implicit=False, center=True, user_bias=False, lambda_=10.0, alpha=1.0, NA_as_zero=False, scale_lam=False, scale_bias_const=False, apply_log_transf=False, use_float=False, produce_dicts=False, nthreads=-1, n_jobs=None)¶
Non-Personalized recommender model
Fits a model with only the intercept terms (biases), in order to provide non-personalized recommendations.
This class is provided as a benchmark - if your personalized-recommendations model does not manage to beat this under the evaluation metrics of interest, chances are, that model needs to be reworked.
It minimizes the same objective functions as the other classes and offers the same options (e.g. centering, scaling regularization, etc.), but fitting only the biases.
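To give an intuition for a biases-only model, the sketch below fits item biases in closed form under simplifying assumptions that are not taken from the package: centered explicit-feedback data, no user biases, no observation weights, no regularization scaling, and a squared-error loss with an L2 penalty. Under those assumptions, the bias of an item is the sum of its centered ratings divided by its number of ratings plus ‘lambda_’. The function name and the ratings are hypothetical.

```python
from collections import defaultdict

def regularized_item_biases(ratings, lambda_=10.0):
    """Closed-form item biases from an iterable of (item, value) pairs."""
    ratings = list(ratings)
    glob_mean = sum(value for _, value in ratings) / len(ratings)
    sums, counts = defaultdict(float), defaultdict(int)
    for item, value in ratings:
        sums[item] += value - glob_mean   # centered rating
        counts[item] += 1
    # Minimizing sum((x - mean - b)^2) + lambda_ * b^2 for each item gives this ratio.
    return {item: sums[item] / (counts[item] + lambda_) for item in sums}

# Item "a" has a single 5-star rating; item "b" has fifty 4.5-star ratings.
ratings = [("a", 5.0)] + [("b", 4.5)] * 50 + [("c", 2.0)] * 49
low = regularized_item_biases(ratings, lambda_=1e-6)
high = regularized_item_biases(ratings, lambda_=10.0)

best_low = max(low, key=low.get)     # top item with almost no regularization
best_high = max(high, key=high.get)  # top item with lambda_=10
```

With almost no regularization the item with one perfect rating ranks first, whereas a larger ‘lambda_’ shrinks biases that are supported by few observations and moves the widely rated item to the top.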
Note
Implicit-feedback recommendation quality metrics for this model can be calculated with the recometrics library.
- Parameters
implicit (bool) – Whether to use the implicit-feedback model, in which the ‘X’ matrix is assumed to have only binary entries and each of them having a weight in the loss function given by the observed user-item interactions and other parameters.
center (bool) – Whether to center the “X” data by subtracting the mean value. Ignored (assumed “False”) when passing
implicit=True
.user_bias (bool) – Whether to add user biases to the model. Not supported for implicit feedback (
implicit=True
).lambda_ (float) – Regularization parameter. For the explicit-feedback case (default), lower values will tend to favor the highest-rated items regardless of the number of observations. Note that the default value for
lambda_
here is much higher than in other software, and that the loss/objective function is not divided by the number of entries.alpha (float) – Weighting parameter for the non-zero entries in the implicit-feedback model. See [2f] for details. Note that, while the author’s suggestion for this value is 40, other software such as
implicit
use a value of 1, whereas Spark uses a value of 0.01 by default See the documentation ofCMF_implicit
for more details.NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix or DataFrame) instead of ignoring them.
scale_lam (bool) – Whether to scale (increase) the regularization parameter for each estimated bias according to the number of non-missing entries in the data. This is only available when passing implicit=False.
It is not recommended to use this option, as when passing ‘True’, it tends to recommend items which have a single user interaction with the maximum possible value (e.g. 5-star movies from only 1 user). By default, ‘scale_bias_const’ is also set to ‘True’, so in order to have the regularization scale for each user/item, that option also needs to be turned off.
scale_bias_const (bool) – When passing scale_lam=True, whether to apply the same scaling to the regularization for all users and items, according to the average number of non-missing entries rather than to the number of entries for each specific user/item.
While this tends to result in worse RMSE, it tends to make the top-N recommendations less likely to select items with only a few interactions from only a few users.
Ignored when passing scale_lam=False.
apply_log_transf (bool) – Whether to apply a logarithm transformation on the values of ‘X’ (i.e. ‘X := log(X)’). This is only available with implicit=True.
use_float (bool) – Whether to use C float type for the model parameters (typically this is ‘np.float32’). If passing ‘False’, will use C double (typically this is ‘np.float64’). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
produce_dicts (bool) – Whether to produce Python dicts from the mappings between user/item IDs passed to ‘fit’ and the internal IDs used by the class. Having these dicts might speed up some computations such as ‘predict’, but it will add some extra overhead at the time of fitting the model and extra memory usage. Ignored when passing the data as matrices and arrays instead of data frames.
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads). Most of the work is done single-threaded, however.
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
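As a rough illustration of how a negative thread count is resolved under joblib's convention (in which -1 means all available threads and -2 means all but one), the following sketch may help. The helper ‘resolve_nthreads’ and its ‘max_threads’ argument are hypothetical names used only for this example; they are not part of this package.

```python
import os


def resolve_nthreads(nthreads, max_threads=None):
    """Hypothetical helper: turn a possibly negative thread count into a positive one."""
    if max_threads is None:
        # os.cpu_count() may return None on some platforms
        max_threads = os.cpu_count() or 1
    if nthreads < 0:
        # joblib-style convention: -1 -> all threads, -2 -> all but one, ...
        nthreads = max_threads + 1 + nthreads
    # never go below a single thread
    return max(1, nthreads)


print(resolve_nthreads(-1, max_threads=8))  # 8
print(resolve_nthreads(-2, max_threads=8))  # 7
print(resolve_nthreads(4, max_threads=8))   # 4
```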
- Variables
is_fitted (bool) – Whether the model has been fitted to data.
reindex (bool) – Whether the IDs passed to ‘fit’ were reindexed internally (this will only happen when passing data frames to ‘fit’).
user_mapping (array(m,) or array(0,)) – Correspondence of internal user (row) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
item_mapping (array(n,) or array(0,)) – Correspondence of internal item (column) IDs to IDs in the data passed to ‘fit’. Will only be non-empty when passing a data frame as input to ‘X’.
user_dict (dict) – Python dict version of ‘user_mapping_’. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
item_dict (dict) – Python dict version of ‘item_mapping_’. Only filled-in when passing produce_dicts=True and when passing data frames to ‘fit’.
glob_mean (float) – The global mean of the non-missing entries in ‘X’ passed to ‘fit’ (only for explicit-feedback case).
user_bias (array(m,), or array(0,)) – The obtained biases for each user (row in the ‘X’ matrix). If passing user_bias=False (the default), this array will be empty.
item_bias (array(n,)) – The obtained biases for each item (column in the ‘X’ matrix). Items are ranked according to these values.
References
- 1f
Koren, Yehuda, Robert Bell, and Chris Volinsky. “Matrix factorization techniques for recommender systems.” Computer 42.8 (2009): 30-37.
- 2f
Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
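To make the effect of ‘lambda_’ on the ranking more concrete, here is a minimal sketch of a biases-only model under simplifying assumptions: explicit feedback, centering, item biases only, and a squared loss that is not divided by the number of entries. Under those assumptions each item bias has the closed form sum(x - mean) / (count + lambda_). The function ‘item_biases’ and the toy data are hypothetical and only illustrate the idea; this is not the package's actual fitting routine.

```python
from collections import defaultdict


def item_biases(ratings, lambda_=10.0, center=True):
    """Closed-form item biases from (user, item, rating) triplets (illustrative only)."""
    glob_mean = sum(r for _, _, r in ratings) / len(ratings) if center else 0.0
    resid_sum = defaultdict(float)
    count = defaultdict(int)
    for _, item, r in ratings:
        resid_sum[item] += r - glob_mean
        count[item] += 1
    # minimizer of sum (x - mean - b)^2 + lambda * b^2 for each item
    biases = {i: resid_sum[i] / (count[i] + lambda_) for i in resid_sum}
    return biases, glob_mean


# toy data: a widely liked item, an item with a single 5-star rating, a disliked item
ratings = [(u, "popular", 4.0) for u in range(50)]
ratings += [("u0", "niche", 5.0)]
ratings += [(u, "disliked", 2.0) for u in range(50)]

for lam in (0.01, 10.0):
    biases, _ = item_biases(ratings, lambda_=lam)
    ranking = sorted(biases, key=biases.get, reverse=True)
    print(lam, ranking)
```

With a very small ‘lambda_’ the single 5-star item ranks first, while the larger value pulls sparsely-rated items towards zero and lets the widely liked item lead.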
- drop_nonessential_matrices(drop_precomputed=True)¶
Drop matrices that are not used for prediction
Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.
This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.
Can additionally drop some of the precomputed matrices which are only used in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.
After dropping these non-essential matrices, it will no longer be possible to call certain methods such as ‘predict’ or ‘swap_users_and_items’. The methods which are intended to continue working afterwards are:
factors_warm
factors_cold
factors_multiple
topN_warm
topN_cold
- Parameters
drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).
- Returns
self – This object with the non-essential matrices dropped.
- Return type
obj
- fit(X, W=None)¶
Fit intercepts-only model to data.
- Parameters
X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and either ‘Rating’ (explicit-feedback, default) or ‘Value’ (implicit feedback). If passing a NumPy array, missing (unobserved) entries should have value ‘np.nan’ under both explicit and implicit feedback. Might additionally have a column ‘Weight’ for the explicit-feedback case. If passing a DataFrame, the IDs will be internally remapped.
W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Return type
self
- get_params(deep=True)¶
Get parameters for this estimator.
Kept for compatibility with scikit-learn.
- Parameters
deep (bool) – Ignored.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- predict(user, item)¶
Predict ratings/values given by existing users to existing items
Note
For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in ‘item’ will be matched with the corresponding entry of ‘user’ at the same position in the array/list.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
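The position-wise pairing of ‘user’ and ‘item’ can be sketched as follows for a biases-only model. The sketch assumes that a prediction is the global mean plus the item bias (plus the user bias, when present), and that unknown IDs yield NaN; ‘predict_pairs’ and the dict-based bias containers are hypothetical names chosen for the example, not the package's internals.

```python
import math


def predict_pairs(users, items, glob_mean, item_bias, user_bias=None):
    """Score users[i] against items[i] for every position i (illustrative only)."""
    if len(users) != len(items):
        raise ValueError("'users' and 'items' must have the same length.")
    scores = []
    for u, i in zip(users, items):
        unknown_item = i not in item_bias
        unknown_user = user_bias is not None and u not in user_bias
        if unknown_item or unknown_user:
            # invalid combination -> NaN
            scores.append(math.nan)
            continue
        score = glob_mean + item_bias[i]
        if user_bias is not None:
            score += user_bias[u]
        scores.append(score)
    return scores


item_bias = {"a": 0.5, "b": -0.25}
user_bias = {"u1": 0.25, "u2": -0.5}
scores = predict_pairs(["u1", "u2", "u1"], ["a", "a", "zzz"], 3.0, item_bias, user_bias)
print(scores)  # [3.75, 3.0, nan]
```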
- set_params(**params)¶
Set the parameters of this estimator.
Kept for compatibility with scikit-learn.
Note
Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- swap_users_and_items(precompute=True)¶
Swap the users and items in a factorization model
This method will generate a new object that will have the users and items of this object swapped, and such result can be used under the same methods such as ‘topN’, in which any mention of users will now mean items and vice-versa.
Note
The resulting object will not generate any deep copies of the original model’s objects.
- Parameters
precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.
- Returns
model – An object of the same class as this one, but with the user and items swapped.
- Return type
obj
- topN(user=None, n=10, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’. Only relevant if using user biases and outputting score.
n (int) – Number of top-N highest-predicted results to output.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
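The interplay between ‘n’, ‘include’, ‘exclude’ and ‘output_score’ can be illustrated with a small sketch that ranks items by their bias. The function ‘top_n’ and the dict of biases are hypothetical stand-ins for illustration; the package's own implementation is not shown here.

```python
def top_n(item_bias, n=10, include=None, exclude=None, output_score=False):
    """Rank items by bias, optionally restricted to or excluding some items."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    candidates = list(item_bias)
    if include is not None:
        allowed = set(include)
        candidates = [i for i in candidates if i in allowed]
    elif exclude is not None:
        banned = set(exclude)
        candidates = [i for i in candidates if i not in banned]
    # highest bias first, keep at most n items
    ranked = sorted(candidates, key=lambda i: item_bias[i], reverse=True)[:n]
    if output_score:
        return ranked, [item_bias[i] for i in ranked]
    return ranked


biases = {"a": 0.9, "b": 0.1, "c": 0.5, "d": -0.3}
print(top_n(biases, n=2))                   # ['a', 'c']
print(top_n(biases, n=2, exclude=["a"]))    # ['c', 'b']
print(top_n(biases, include=["b", "d"], output_score=True))  # (['b', 'd'], [0.1, -0.3])
```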
CMF_imputer¶
- class cmfrec.CMF_imputer(k=40, lambda_=10.0, method='als', use_cg=True, user_bias=True, item_bias=True, center=True, add_implicit_features=False, scale_lam=False, scale_lam_sideinfo=False, scale_bias_const=False, k_user=0, k_item=0, k_main=0, w_main=1.0, w_user=1.0, w_item=1.0, w_implicit=0.5, l1_lambda=0.0, center_U=True, center_I=True, maxiter=800, niter=10, parallelize='separate', corr_pairs=4, max_cg_steps=3, precondition_cg=False, finalize_chol=True, NA_as_zero=False, NA_as_zero_user=False, NA_as_zero_item=False, nonneg=False, nonneg_C=False, nonneg_D=False, max_cd_steps=100, precompute_for_predictions=True, include_all_X=True, use_float=True, random_state=1, verbose=True, print_every=10, handle_interrupt=True, produce_dicts=False, nthreads=-1, n_jobs=None)¶
A wrapper for CMF allowing argument ‘y’ in ‘fit’ and ‘transform’ (left as a placeholder only, not used for anything), which can be used as part of SciKit-Learn pipelines due to having this extra parameter.
Everything else is exactly the same as for ‘CMF’ - see its documentation for details.
- drop_nonessential_matrices(drop_precomputed=True)¶
Drop matrices that are not used for prediction
Drops all the matrices in the model object which are not used for calculating new user factors (either warm or cold), such as the user biases or the item factors.
This is intended to decrease memory usage in production systems which use this software for calculation of user factors or top-N recommendations.
Can additionally drop some of the precomputed matrices which are only used in special circumstances such as when passing dense data with no missing values - however, predictions that would have otherwise used these matrices will become slower afterwards.
After dropping these non-essential matrices, it will no longer be possible to call certain methods such as ‘predict’ or ‘swap_users_and_items’. The methods which are intended to continue working afterwards are:
factors_warm
factors_cold
factors_multiple
topN_warm
topN_cold
- Parameters
drop_precomputed (bool) – Whether to drop the less commonly used prediction matrices (see documentation above for more details).
- Returns
self – This object with the non-essential matrices dropped.
- Return type
obj
- factors_cold(U=None, U_bin=None, U_col=None, U_val=None)¶
Determine user-factors from new data, given U
Note
If using ‘NA_as_zero’, this function will assume that all the ‘X’ values are zeros rather than being missing.
- Parameters
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+‘U_val’.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value ‘np.nan’. Only supported with method='lbfgs'.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+‘U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+‘U_val’.
- Returns
factors – The user-factors as determined by the model.
- Return type
array(k_user+k+k_main,)
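The relationship between the dense ‘U’ input and the sparse ‘U_col’+‘U_val’ pair can be sketched as below: the sparse form lists only the non-zero entries of a single row, with zero-based column indices. The helper ‘sparse_row_to_dense’ and the number of attributes ‘p’ used here are hypothetical and only serve to show that both forms describe the same row.

```python
def sparse_row_to_dense(col, val, p):
    """Expand (column indices, values) of one sparse row into a dense list of length p."""
    if len(col) != len(val):
        raise ValueError("'col' and 'val' must have the same number of entries.")
    dense = [0.0] * p  # entries not listed in 'col' are zeros
    for c, v in zip(col, val):
        dense[c] = v
    return dense


U_col = [1, 4]      # zero-based column indices of the non-zero attributes
U_val = [2.5, 1.0]  # values at those columns
print(sparse_row_to_dense(U_col, U_val, p=6))  # [0.0, 2.5, 0.0, 0.0, 1.0, 0.0]
```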
- factors_multiple(X=None, U=None, U_bin=None, W=None, return_bias=False)¶
Determine user latent factors based on new data (warm and cold)
Determines latent factors for multiple rows/users at once given new data for them.
Note
See the documentation of “fit” for details about handling of missing values.
Note
If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numbering that was produced by the model. The mappings in such case are available under attributes ‘self.user_mapping_’ and ‘self.item_mapping_’.
- Parameters
X (array(m_x, n), CSR matrix(m_x, n), COO matrix(m_x, n), or None) – New ‘X’ data.
U (array(m_u, p), CSR matrix(m_u, p), COO matrix(m_u, p), or None) – User attributes information for rows in ‘X’.
U_bin (array(m_ub, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.
W (array(m_x, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
return_bias (bool) – Whether to return also the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.
- Returns
A (array(max(m_x,m_u,m_ub), k_user+k+k_main)) – The new factors determined for all the rows given the new data.
bias (array(max(m_x,m_u,m_ub)) or None) – The user bias given the new ‘X’ data. Only returned if passing return_bias=True.
- factors_warm(X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, return_bias=False)¶
Determine user latent factors based on new ratings data
- Parameters
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value ‘np.nan’. Should only pass one of ‘X’ or ‘X_col’+‘X_val’.
X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+‘X_val’.
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+‘X_val’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, and both can be skipped.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value ‘np.nan’. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, and both can be skipped.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+‘U_val’. User side information is not strictly required, and both can be skipped.
return_bias (bool) – Whether to return also the user bias determined by the model given the data in ‘X’. If passing ‘False’, will return an array with the factors. If passing ‘True’, will return a tuple in which the first entry will be an array with the factors, and the second entry will be the estimated bias.
return_raw_A (bool) – Whether to return the raw A factors (the free offset), or the factors used in the factorization, to which the attributes component has been added.
- Returns
factors (array(k_user+k+k_main,) or array(k+k_main,)) – User factors as determined from the data in ‘X’.
bias (float) – User bias as determined from the data in ‘X’. Only returned if passing return_bias=True.
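For the new user's ‘X’ data, the dense form marks non-observed entries with NaN, whereas the sparse form (‘X_col’+‘X_val’) lists only the observed ones. The sketch below converts one into the other, assuming the model was fitted on arrays so that items are identified by zero-based column numbers. The helper ‘dense_to_observed’ is a hypothetical name used only for this example.

```python
import math


def dense_to_observed(x):
    """Split a dense row with NaN for non-observed entries into (columns, values)."""
    cols = [j for j, v in enumerate(x) if not math.isnan(v)]
    vals = [x[j] for j in cols]
    return cols, vals


# a rating of 0.0 is still an observed entry; only NaN means "not observed"
x_dense = [math.nan, 4.0, math.nan, 0.0, 5.0]
X_col, X_val = dense_to_observed(x_dense)
print(X_col)  # [1, 3, 4]
print(X_val)  # [4.0, 0.0, 5.0]
```

Note the contrast with sparse user attributes, where entries absent from ‘U_col’ are zeros rather than missing values.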
- fit(X, y=None, U=None, I=None, U_bin=None, I_bin=None, W=None)¶
Fit model to explicit-feedback data and user/item attributes
Note
It’s possible to pass partially disjoint sets of users/items between the different matrices (e.g. it’s possible for both the ‘X’ and ‘U’ matrices to have rows that the other doesn’t have). The procedure supports missing values for all inputs (except for “W”). If any of the inputs has fewer rows/columns than the other(s) (e.g. “U” has more rows than “X”, or “I” has more rows than there are columns in “X”), will assume that the rest of the rows/columns have only missing values. Note however that when having partially disjoint inputs, the order of the rows/columns matters for speed, as it might run faster when the “U”/“I” inputs that do not have matching rows/columns in “X” have those unmatched rows/columns at the end (last rows/columns) and the “X” input is shorter. See also the parameter ‘include_all_X’ for info about predicting with mismatched “X”.
Note
When passing NumPy arrays, missing (unobserved) entries should have value ‘np.nan’. When passing sparse inputs, the zero-valued entries will be considered as missing (unless using “NA_as_zero”), and it should not contain “NaN” values among the non-zero entries.
Note
In order to avoid potential decimal differences in the factors obtained when fitting the model and when calling the prediction functions on new data, when the data is sparse, it’s necessary to sort it beforehand by columns and also pass the data with indices sorted (by column) to the prediction functions.
- Parameters
X (DataFrame(nnz, 3), DataFrame(nnz, 4), array(m, n), or sparse COO(m, n)) – Matrix to factorize (e.g. ratings). Can be passed as a SciPy sparse COO matrix (recommended), as a dense NumPy array, or as a Pandas DataFrame, in which case it should contain the following columns: ‘UserId’, ‘ItemId’, and ‘Rating’. Might additionally have a column ‘Weight’. If passing a DataFrame, the IDs will be internally remapped. If passing sparse ‘U’ or sparse ‘I’, ‘X’ cannot be passed as a DataFrame.
U (array(m, p), COO(m, p), DataFrame(m, p+1), or None) – User attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. If ‘U’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.
U_bin (array(m, p_bin), DataFrame(m, p_bin+1), or None) – User binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘UserId’. Cannot be passed as a sparse matrix. Note that ‘U’ and ‘U_bin’ are not mutually exclusive. Only supported with method='lbfgs'.
I (array(n, q), COO(n, q), DataFrame(n, q+1), or None) – Item attributes information. If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. If ‘I’ is sparse, ‘X’ should be passed as a sparse COO matrix or as a dense NumPy array.
I_bin (array(n, q_bin), DataFrame(n, q_bin+1), or None) – Item binary attributes information (all values should be zero, one, or missing). If ‘X’ is a DataFrame, should also be a DataFrame, containing column ‘ItemId’. Cannot be passed as a sparse matrix. Note that ‘I’ and ‘I_bin’ are not mutually exclusive. Only supported with method='lbfgs'.
W (None, array(nnz,), or array(m, n)) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array. Cannot have missing values.
- Return type
self
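The note about inputs with mismatched numbers of rows can be pictured with a small sketch: when ‘U’ has more rows than ‘X’, the extra users are treated as if their ‘X’ rows contained only missing values. The helper ‘pad_rows_with_missing’ below is hypothetical and merely builds the equivalent dense input; it does not reproduce how the package handles this internally.

```python
import math


def pad_rows_with_missing(X, n_rows):
    """Append all-NaN rows to a dense 'X' (list of lists) until it has n_rows rows."""
    n_cols = len(X[0]) if X else 0
    padded = [list(row) for row in X]  # copy so the input is left untouched
    while len(padded) < n_rows:
        padded.append([math.nan] * n_cols)
    return padded


X = [[5.0, math.nan],
     [math.nan, 3.0]]          # ratings for 2 users
U = [[0.1], [0.7], [0.4]]      # attributes for 3 users; the last has no ratings
X_equiv = pad_rows_with_missing(X, n_rows=len(U))
print(len(X_equiv))  # 3
```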
- force_precompute_for_predictions()¶
Precompute internal matrices that are used for predictions
Note
It’s not necessary to call this method if passing precompute_for_predictions=True.
- Return type
self
- static from_model_matrices(A, B, glob_mean=0.0, precompute=True, user_bias=None, item_bias=None, lambda_=10.0, scale_lam=False, l1_lambda=0.0, nonneg=False, NA_as_zero=False, scaling_biasA=None, scaling_biasB=None, use_float=False, nthreads=-1, n_jobs=None)¶
Create a CMF model object from fitted matrices
Creates a CMF model object based on fitted latent factor matrices, which might have been obtained from different software. For example, the package ‘python-libmf’ has functionality for obtaining these matrices, but not for producing recommendations or latent factors for new users, for which this function can come in handy as it will turn such a model into a CMF model which provides all such functionality.
This is only available for models without side information, and does not support user/item mappings.
Note
This is a static class method, and should be called like this: ‘CMF.from_model_matrices(...)’ (i.e. no parentheses after ‘CMF’).
- Parameters
A (array(n_users, k)) – The obtained user factors.
B (array(n_items, k)) – The obtained item factors.
glob_mean (float) – The obtained global mean, if the model underwent centering. If passing zero, will assume that the values are not to be centered.
precompute (bool) – Whether to generate pre-computed matrices which can help to speed up computations on new data.
user_bias (None or array(n_users,)) – The obtained user biases. If passing ‘None’, will assume that the model did not include user biases.
item_bias (None or array(n_items,)) – The obtained item biases. If passing ‘None’, will assume that the model did not include item biases.
lambda_ (float or array(6,)) – Regularization parameter. See the documentation for ‘__init__’ for details.
scale_lam (bool) – Whether to scale (increase) the regularization parameter for each row of the model matrices according to the number of non-missing entries in the data for that particular row.
l1_lambda (float or array(6,)) – Regularization parameter to apply to the L1 norm of the model matrices. See the documentation for ‘__init__’ for details.
nonneg (bool) – Whether to constrain the ‘A’ and ‘B’ matrices to be non-negative.
NA_as_zero (bool) – Whether to take missing entries in the ‘X’ matrix as zeros (only when the ‘X’ matrix is passed as sparse COO matrix) instead of ignoring them. See the documentation for ‘__init__’ for details.
scaling_biasA (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the user biases.
scaling_biasB (None or float) – If passing it, will assume that the model uses the option scale_bias_const=True, and will use this number as scaling for the regularization of the item biases.
use_float (bool) – Whether to use C float type for the model parameters (typically this is ‘np.float32’). If passing ‘False’, will use C double (typically this is ‘np.float64’). Using float types will speed up computations and use less memory, at the expense of reduced numerical precision.
nthreads (int) – Number of parallel threads to use. If passing a negative number, will use the same formula as joblib (maximum threads + 1 + nthreads).
n_jobs (None or int) – Synonym for nthreads, kept for better compatibility with scikit-learn.
- Returns
model – A ‘CMF’ model object without side information, for which the usual prediction methods such as ‘topN’ and ‘topN_warm’ can be used as if it had been fitted through this software.
- Return type
CMF
- get_params(deep=True)¶
Get parameters for this estimator.
Kept for compatibility with scikit-learn.
- Parameters
deep (bool) – Ignored.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- item_factors_cold(I=None, I_bin=None, I_col=None, I_val=None)¶
Determine item-factors from new data, given I
Note
Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.
- Parameters
I (array(q,), or None) – Attributes for the new item, in dense format. Should only pass one of ‘I’ or ‘I_col’+‘I_val’.
I_bin (array(q_bin,), or None) – Binary attributes for the new item, in dense format. Only supported with method='lbfgs'.
I_col (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘I’ or ‘I_col’+‘I_val’.
I_val (None or array(nnz)) – Attributes for the new item, in sparse format. ‘I_val’ should contain the values in the columns given by ‘I_col’. Should only pass one of ‘I’ or ‘I_col’+‘I_val’.
- Returns
factors – The item-factors as determined by the model.
- Return type
array(k_item+k+k_main,)
- predict(user, item)¶
Predict ratings/values given by existing users to existing items
Note
For CMF explicit, invalid combinations of users and items will be set to the global mean plus biases if applicable. For other models, invalid combinations will be set as NaN.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
item (array-like(n,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in ‘item’ will be matched with the corresponding entry of ‘user’ at the same position in the array/list.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
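For a factorization model without side information, a prediction for an existing user-item pair can be pictured as the global mean plus the biases plus the dot product of the user and item factor vectors. The sketch below assumes exactly that simplified form; ‘predict_from_factors’ is a hypothetical name, the factor matrices are toy lists indexed by internal row/column numbers, and the handling of invalid combinations is omitted.

```python
def predict_from_factors(users, items, A, B, glob_mean=0.0,
                         user_bias=None, item_bias=None):
    """Score users[i] against items[i] from factor matrices A (users) and B (items)."""
    scores = []
    for u, i in zip(users, items):
        # dot product between the user's and the item's latent factors
        score = glob_mean + sum(a * b for a, b in zip(A[u], B[i]))
        if user_bias is not None:
            score += user_bias[u]
        if item_bias is not None:
            score += item_bias[i]
        scores.append(score)
    return scores


A = [[1.0, 0.0], [0.5, 0.5]]  # 2 users, k=2
B = [[2.0, 1.0], [0.0, 4.0]]  # 2 items, k=2
print(predict_from_factors([0, 1], [1, 0], A, B, glob_mean=3.0))  # [3.0, 4.5]
print(predict_from_factors([0, 1], [1, 0], A, B, glob_mean=3.0,
                           user_bias=[0.5, -0.5], item_bias=[0.25, 0.0]))  # [3.5, 4.25]
```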
- predict_cold(items, U=None, U_bin=None, U_col=None, U_val=None)¶
Predict rating given by a new user to existing items, given U
Note
If using ‘NA_as_zero’, this function will assume that all the ‘X’ values are zeros rather than being missing.
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+‘U_val’.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value ‘np.nan’. Only supported with method='lbfgs'.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+‘U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+‘U_val’.
- Returns
scores – Predicted ratings for the requested items, for this user.
- Return type
array(n,)
- predict_cold_multiple(item, U=None, U_bin=None)¶
Predict rating given by new users to existing items, given U
Note
If using ‘NA_as_zero’, this function will assume that all the ‘X’ values are zeros rather than being missing.
- Parameters
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – Attributes for the users for which to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in ‘item’.
U_bin (array(m, p_bin), or None) – Binary attributes for the users to predict ratings/values. Data frames with ‘UserId’ column are not supported. Must have one row per entry in ‘item’. Only supported with method='lbfgs'.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
- predict_new(user, I=None, I_bin=None)¶
Predict rating given by existing users to new items, given I
Note
Calculating item factors might be a lot slower than user factors, as the model does not keep precomputed matrices that might speed up these factor calculations. If this function is going to be used frequently, it’s advised to build the model swapping the users and items instead.
- Parameters
user (array-like(n,)) – Users for whom ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(n, q), CSR matrix(n, q), COO matrix(n, q), or None) – Attributes for the items for which to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in ‘user’. Might contain missing values.
I_bin (array(n, q_bin), or None) – Binary attributes for the items to predict ratings/values. Data frames with ‘ItemId’ column are not supported. Must have one row per entry in ‘user’. Might contain missing values. Only supported with method='lbfgs'.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(n,)
- predict_warm(items, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None)¶
Predict ratings for existing items, for a new user, given ‘X’
- Parameters
items (array-like(n,)) – Items whose ratings are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’.
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value
np.nan
. Should only pass one of ‘X’ or ‘X_col’+’X_val’.X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
- Returns
scores – Predicted values for the requested items for a user defined by the given values of ‘X’ in ‘X_col’ and ‘X_val’.
- Return type
array(n,)
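As an illustration of the two input formats for ‘X’ described under predict_warm, the following minimal sketch converts a dense row (with np.nan for non-observed entries) into the equivalent ‘X_col’ + ‘X_val’ pair. It uses only NumPy and made-up ratings, and assumes the model was fit to arrays, so that items are identified by zero-based column numbers.

```python
import numpy as np

# Hypothetical ratings from one new user over 6 items; NaN = not observed.
X_dense = np.array([np.nan, 4.0, np.nan, 2.5, np.nan, 5.0])

# Sparse equivalent: column indices of the observed entries and their values.
observed = ~np.isnan(X_dense)
X_col = np.flatnonzero(observed)
X_val = X_dense[observed]

# Weights, if used, follow the shape of whichever format is passed.
W_dense = np.ones(X_dense.shape[0])   # 'n' entries for the dense format
W_sparse = np.ones(X_col.shape[0])    # 'nnz' entries for the sparse format

print(X_col.tolist())  # [1, 3, 5]
print(X_val.tolist())  # [4.0, 2.5, 5.0]
```

Only one of the two formats should be passed in a given call.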
- predict_warm_multiple(X, item, U=None, U_bin=None, W=None)¶
Predict ratings for existing items, for new users, given ‘X’
Note
See the documentation of “fit” for details about handling of missing values.
- Parameters
X (array(m, n), CSR matrix(m, n), or COO matrix(m, n)) – New ‘X’ data with potentially missing entries. Must have one row per entry of ‘item’.
item (array-like(m,)) – Items for which ratings/values are to be predicted. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Each entry in ‘item’ will be matched with the corresponding row of ‘X’.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.
U_bin (array(m, p_bin)) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.
W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Returns
scores – Predicted ratings for the requested user-item combinations.
- Return type
array(m,)
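The row-to-item pairing used by predict_warm_multiple (entry i of ‘item’ goes with row i of ‘X’) can be pictured with a small NumPy sketch. The factor matrices below are random placeholders, and the plain dot product leaves out biases and centering, so this only illustrates the output shape and the pairing, not the library’s computation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_items, k = 4, 6, 3

# Hypothetical factors: one row of A_new per new user (row of 'X'),
# one row of B per item known to the model.
A_new = rng.normal(size=(m, k))
B = rng.normal(size=(n_items, k))

# One item per row of 'X': entry i of 'item' is paired with row i of 'X'.
item = np.array([5, 0, 2, 2])

# Row-wise dot products: scores[i] = A_new[i] . B[item[i]]
scores = np.einsum("ij,ij->i", A_new, B[item])
print(scores.shape)  # (4,)
```

The result has one score per row of ‘X’, not a full users-by-items grid.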
- set_params(**params)¶
Set the parameters of this estimator.
Kept for compatibility with scikit-learn.
Note
Setting any parameter that is related to model hyperparameters (i.e. anything not related to verbosity or number of threads) will reset the model - that is, it will no longer be possible to use it for predictions without a new refit.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- swap_users_and_items(precompute=True)¶
Swap the users and items in a factorization model
This method will generate a new object that will have the users and items of this object swapped, and such result can be used under the same methods such as ‘topN’, in which any mention of users will now mean items and vice-versa.
Note
The resulting object will not generate any deep copies of the original model’s objects.
- Parameters
precompute (bool) – Whether to produce the precomputed matrices which might help to speed up predictions on new data.
- Returns
model – An object of the same class as this one, but with the user and items swapped.
- Return type
obj
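Conceptually, swapping users and items corresponds to working with the transpose of ‘X’, so the two factor matrices exchange roles. The sketch below shows this with random placeholder factors in NumPy (no biases or side information); it does not call cmfrec and says nothing about how the library implements the swap internally.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))   # hypothetical user factors (5 users)
B = rng.normal(size=(7, 3))   # hypothetical item factors (7 items)

scores = A @ B.T              # users x items
swapped = B @ A.T             # items x users: roles exchanged

# Ranking "users for item 4" in the original orientation is the same as
# ranking "items for user 4" in the swapped one.
top_users_for_item = np.argsort(-scores[:, 4])[:2]
top_in_swapped = np.argsort(-swapped[4])[:2]
```

This is why a user-centric method such as ‘topN’ can serve item-centric queries on the swapped object.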
- topN(user, n=10, include=None, exclude=None, output_score=False)¶
Rank top-N highest-predicted items for an existing user
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a DataFrame, must match with the entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
n (int) – Number of top-N highest-predicted results to output.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
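The exact ranking mentioned in the note for topN amounts to scoring every candidate item and keeping the best n. The sketch below shows one way to do that with NumPy, including the ‘include’/‘exclude’ filtering. The helper name exact_top_n and the factor arrays are hypothetical, the scores are plain dot products without biases, and none of this is cmfrec’s own code.

```python
import numpy as np

def exact_top_n(scores, n=10, include=None, exclude=None):
    """Exact top-n over a 1-d score vector (hypothetical helper, not cmfrec API)."""
    if include is not None and exclude is not None:
        raise ValueError("Can only pass one of 'include' or 'exclude'.")
    if include is not None:
        candidates = np.asarray(include, dtype=int)
    else:
        candidates = np.arange(scores.shape[0])
        if exclude is not None:
            candidates = np.setdiff1d(candidates, np.asarray(exclude, dtype=int))
    n = min(n, candidates.shape[0])
    if n <= 0:
        return candidates[:0], scores[:0]
    cand_scores = scores[candidates]
    # argpartition isolates the n largest scores; only those get fully sorted.
    part = np.argpartition(-cand_scores, n - 1)[:n]
    order = part[np.argsort(-cand_scores[part])]
    return candidates[order], cand_scores[order]

rng = np.random.default_rng(3)
user_factors = rng.normal(size=4)        # hypothetical factors for one user
item_factors = rng.normal(size=(50, 4))  # hypothetical factors for 50 items
all_scores = item_factors @ user_factors

items, top_scores = exact_top_n(all_scores, n=5, exclude=[0, 1, 2])
```

The cost grows with the number of items because every item is scored, which is the reason the note suggests approximate nearest-neighbor software for serving.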
- topN_cold(n=10, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘U’
Note
If using NA_as_zero, this function will assume that all the ‘X’ values are zeros rather than being missing.
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
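The sparse format for ‘U’ in topN_cold is described in terms of non-zero entries, which suggests that columns left out of ‘U_col’ are read as zeros (unlike ‘X_col’, where entries left out are non-observed). Under that reading, which is an interpretation of the parameter descriptions rather than something verified against the library, the sparse pair below and the dense vector carry the same information. The attribute count and values are made up.

```python
import numpy as np

p = 6  # hypothetical number of user attribute columns
U_col = np.array([0, 4])       # column indices of the non-zero entries
U_val = np.array([1.5, -2.0])  # values in those columns

# Dense equivalent: every column not listed in 'U_col' is zero.
U_dense = np.zeros(p)
U_dense[U_col] = U_val

print(U_dense.tolist())  # [1.5, 0.0, 0.0, 0.0, -2.0, 0.0]
```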
- topN_new(user, I=None, I_bin=None, n=10, output_score=False)¶
Rank top-N highest-predicted items for an existing user, given ‘I’
Note
If the model was fit to both ‘I’ and ‘I_bin’, can pass a partially-disjoint set to both - that is, both can have rows that the other doesn’t. In such case, the rows that they have in common should come first, and then one of them should have missing values appended so that one of the matrices ends up containing all the rows of the other.
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
user (int or obj) – User for which to rank the items. If ‘X’ passed to ‘fit’ was a data frame, must match with entries in its ‘UserId’ column, otherwise should match with the rows of ‘X’.
I (array(m, q), CSR matrix(m, q), COO matrix(m, q), or None) – Attributes for the items to rank. Data frames with ‘ItemId’ column are not supported.
I_bin (array(m, q_bin), or None) – Binary attributes for the items to rank. Data frames with ‘ItemId’ column are not supported. Only supported with method='lbfgs'.
n (int) – Number of top-N highest-predicted results to output. Must be less than or equal to the number of rows in ‘I’/’I_bin’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user, as integers matching to the rows of ‘I’/’I_bin’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
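The first note under topN_new, about passing ‘I’ and ‘I_bin’ with different rows, can be read as follows: shared rows go first, and the shorter matrix is padded with missing values so that row i refers to the same item in both. The sketch below shows that layout for the simple case where ‘I_bin’ is known for only the first 3 of 5 items. All values are invented, and this is one interpretation of the note, not a tested recipe.

```python
import numpy as np

# Hypothetical attributes for 5 candidate items (q = 2 real-valued columns).
I = np.arange(10, dtype=float).reshape(5, 2)

# Binary attributes known only for the first 3 of those items (q_bin = 3).
I_bin_known = np.array([[1, 0, 1],
                        [0, 0, 1],
                        [1, 1, 0]], dtype=float)

# Rows shared by both come first; the shorter matrix gets NaN rows appended
# so that row i refers to the same item in both matrices.
n_missing = I.shape[0] - I_bin_known.shape[0]
padding = np.full((n_missing, I_bin_known.shape[1]), np.nan)
I_bin = np.vstack([I_bin_known, padding])

print(I_bin.shape)  # (5, 3)
```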
- topN_warm(n=10, X=None, X_col=None, X_val=None, W=None, U=None, U_bin=None, U_col=None, U_val=None, include=None, exclude=None, output_score=False)¶
Compute top-N highest-predicted items for a new user, given ‘X’
Note
This method produces an exact ranking by computing all item predictions for a given user. As the number of items grows, this can become a rather slow operation - for model serving purposes, it’s usually a better idea to obtain an approximate top-N ranking through software such as “hnsw” or “Milvus” from the calculated user factors and item factors.
- Parameters
n (int) – Number of top-N highest-predicted results to output.
X (array(n,) or None) – Observed ‘X’ data for the new user, in dense format. Non-observed entries should have value np.nan. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_col (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_col’ should contain the column indices (items) of the observed entries. If ‘X’ passed to ‘fit’ was a data frame, should have entries from ‘ItemId’ column, otherwise should have column numbers (starting at zero). Should only pass one of ‘X’ or ‘X_col’+’X_val’.
X_val (array(nnz,) or None) – Observed ‘X’ data for the new user, in sparse format. ‘X_val’ should contain the values in the columns/items given by ‘X_col’. Should only pass one of ‘X’ or ‘X_col’+’X_val’.
W (array(nnz,), array(n,), or None) – Weights for the observed entries in ‘X’. If passed, should have the same shape as ‘X’ - that is, if ‘X’ is passed as a dense array, should have ‘n’ entries, otherwise should have ‘nnz’ entries.
U (array(p,), or None) – User attributes in the new data (1-row only). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_bin (array(p_bin,)) – User binary attributes in the new data (1-row only). Missing entries should have value np.nan. Only supported with method='lbfgs'. User side info is not strictly required and can be skipped.
U_col (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_col’ should contain the column indices of the non-zero entries (starting at zero). Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
U_val (None or array(nnz)) – User attributes in the new data (1-row only), in sparse format. ‘U_val’ should contain the values in the columns given by ‘U_col’. Should only pass one of ‘U’ or ‘U_col’+’U_val’. User side information is not strictly required, and can skip both.
include (array-like) – List of items which will be ranked. If passing this, will only make a ranking among these items. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
exclude (array-like) – List of items to exclude from the ranking. If passing this, will rank all the items except for these. If ‘X’ passed to fit was a DataFrame, must match with the entries in its ‘ItemId’ column, otherwise should match with the columns of ‘X’. Can only pass one of ‘include’ or ‘exclude’.
output_score (bool) – Whether to output the scores in addition to the IDs. If passing ‘False’, will return a single array with the item IDs, otherwise will return a tuple with the item IDs and the scores.
- Returns
items (array(n,)) – The top-N highest predicted items for this user. If the ‘X’ data passed to fit was a DataFrame, will contain the item IDs from its column ‘ItemId’, otherwise will be integers matching to the columns of ‘X’.
scores (array(n,)) – The predicted scores for the top-N items. Will only be returned when passing output_score=True, in which case the result will be a tuple with these two entries.
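To give an intuition for what a “warm” ranking such as topN_warm involves, the sketch below estimates factors for a new user from a few observed ratings with plain regularized least squares against fixed item factors, then ranks the items the user has not rated. This is the textbook ALS-style update, written with NumPy and random placeholder factors. It ignores biases, centering, weights and side information, and is not presented as cmfrec’s actual routine; the regularization value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, k, lam = 8, 3, 0.1

B = rng.normal(size=(n_items, k))        # hypothetical item factors
X_col = np.array([0, 3, 4, 6])           # items the new user has rated
X_val = np.array([4.0, 1.0, 5.0, 3.0])   # the ratings themselves

# Ridge solution for the new user's factors, using observed items only:
#   a = (B_obs' B_obs + lam * I)^-1 B_obs' x
B_obs = B[X_col]
a = np.linalg.solve(B_obs.T @ B_obs + lam * np.eye(k), B_obs.T @ X_val)

# Score every item, then leave out the ones already rated
# (the same effect as passing them as 'exclude').
scores = B @ a
scores[X_col] = -np.inf
top = np.argsort(-scores)[:3]
```

Solving one small k-by-k system per new user is what makes warm predictions possible without refitting the whole model.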
- transform(X=None, y=None, U=None, U_bin=None, W=None, replace_existing=False)¶
Reconstruct missing entries of the ‘X’ matrix
Will reconstruct/impute all the missing entries in the ‘X’ matrix as determined by the model. This method is intended to be used for imputing tabular data, and can be used as part of SciKit-Learn pipelines.
Note
It’s possible to use this method with ‘X’ alone, with ‘U’/’U_bin’ alone, or with both ‘X’ and ‘U’/’U_bin’ together, in which case both matrices must have the same rows.
Note
If fitting the model to DataFrame inputs (instead of NumPy arrays and/or SciPy sparse matrices), the IDs are reindexed internally, and the inputs provided here should match with the numeration that was produced by the model. The mappings in such case are available under attributes self.user_mapping_ and self.item_mapping_.
- Parameters
X (array(m, n), or None) – New ‘X’ data with potentially missing entries which are to be imputed. Missing entries should have value np.nan when passing a dense array.
y (None) – Not used. Kept as a placeholder for compatibility with SciKit-Learn pipelines.
U (array(m, p), CSR matrix(m, p), COO matrix(m, p), or None) – User attributes information for each row in ‘X’.
U_bin (array(m, p_bin) or None) – User binary attributes for each row in ‘X’. Only supported with method='lbfgs'.
W (array(m, n), array(nnz,), or None) – Observation weights. Must have the same shape as ‘X’ - that is, if ‘X’ is a sparse COO matrix, must be a 1-d array with the same number of non-zero entries as ‘X.data’, if ‘X’ is a 2-d array, ‘W’ must also be a 2-d array.
- Returns
X – The ‘X’ matrix as a dense array with all missing entries imputed according to the model.
- Return type
array(m, n)
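The imputation performed by transform can be pictured as filling the np.nan entries of ‘X’ from a low-rank reconstruction. The NumPy sketch below uses random placeholder factors and a plain product of factors (no biases or centering) to show the idea of keeping observed entries and filling only the missing ones. It is a simplified illustration, not the library’s procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 2))   # hypothetical row factors
B = rng.normal(size=(5, 2))   # hypothetical column factors
X_hat = A @ B.T               # low-rank reconstruction of every entry

# Noisy "observed" data with two entries marked as missing.
X = X_hat + rng.normal(scale=0.1, size=X_hat.shape)
X[0, 1] = np.nan
X[2, 3] = np.nan

# Keep observed entries; fill only the missing ones from the reconstruction.
X_imputed = np.where(np.isnan(X), X_hat, X)
```

The output is a dense array of the same shape as ‘X’ with no missing values left.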