Machine learns as data speak: Markov random fields for recommender systems II: Discovering latent space

In the previous post we talked about how Markov random fields (MRFs) can be used to model local structure in the recommendation data. Local structures are powerful enough to make our MRF work, but they model only second-order interactions. We now explore the option of discovering latent spaces in the data. The idea is to capture weaker but higher-order interactions globally.

Restricted Boltzmann machines for user latent space

Restricted Boltzmann machines (RBMs) offer an interesting framework: observed ratings can be encoded into the input layer, and the latent factors are represented by the hidden layer. More precisely a RBM is a special type of MRFs where only connections across input nodes and hidden nodes are allowed. As each hidden unit (latent factor) is connected to every input units, high-order correlations among input units are nicely modelled.

Now we proceed to build RBMs for users, assuming for the moment that users are independent in their choices. And thus we build one RBM per user. This is because each user rates only a handful number of items, and thus building the full model of all items is not well justified. For now let us assume that we include only seen items into the model for estimation. At test time, we will introduce unseen items into the model assuming that the model won't change.

For simplicity, assume that latent factors are binary. Denote by \( \mathbf{r} = (r_1,r_2,...,r_N) \) the vector of input variables, and \( \mathbf{h} = (h_1,h_2,..,h_K) \) the vector of latent vectors, where \( h_k \in \{0,1\}\). The energy of the RBM reads
\[ E = - \left ( \sum_i \sum_m\alpha_{im} f_m(r_i) + \sum_k \beta_k h_k + \sum_i\sum_m\sum_k w_{ikm} f_m(r_i)h_k \right ) \]
where \( f_m(r_i) \) is feature function and \( \alpha_{im}, \beta_k, w_{ikm} \) are parameters.

Depending on your argument of the nature of the ratings, we may end up different feature functions, and thus different models such as Gaussian, categorical or ordinal. We can also incorporate item contents and user profiles into the model. For example, profile aspects can be treated as some kind of items which we always know and never have to predict. For this, perhaps a conditional RBM would make sense: We may want to model \( P(\mathbf{r} | \mathbf{p}) \) instead of \( P(\mathbf{r}, \mathbf{p}) \) where \( \mathbf{p} \) is the profile vector.

The model can now be estimated, either by deterministic methods such as pseudo-likelihood or stochastic methods such as Contrastive Divergence or persistent MCMC. Note that for the pseudo-likelihood method, we need to integrate over the hidden units, so it is not entirely the same as the original Besag's proposal, but the principle is the same.

At test time, we basically look for
\[ r_{j}^{*} = \arg\max_{r_{j}}P(r_{j}|\mathbf{r}) = \arg\max_{r_{j}}\sum_{\mathbf{h}}P(r_{j},\mathbf{h}|\mathbf{r}) \]
Thus we need to sum over the hidden layer, which is efficient to do because conditioned on the seen ratings, the graphical model is now a tree. Alternatively, we may resort to mean-field approximation: the stochastic hidden unit is replaced by its deterministic conditional expectation given the seen ratings, i.e., \(h_k \leftarrow P(h_k=1|\mathbf{r}) \). The big O complexity is the same, but it is numerically faster since we do not need to deal with the computation of expensive exponential functions.

A more subtle question is that what if we don't care about exact rating prediction, but any numerical approximation would do? This is often the case in evaluation metrics such as MAE or RMSE. The problem with the RMSE is it penalises zero-one errors to much, while it may be the case that the RBM would make arbitrary zero-one errors. A smoother version of the rating prediction would definitely help here:
\[ \bar{r}_{j}=\sum_{S=1}^{L}P(r_{j}=S|\mathbf{r})S \]
for the case of \( L \) rating scale.

Finally, the choice of number of hidden unit \( K \) is hard to estimate directly. The random projection theory (e.g., see this KDD paper) suggests that \( K \) is about the same order as the log of the number of users. In practice, however, we can use cross-validation to choose the best value. My experience suggests that once it reaches some reasonable large (e.g., 50-100), the size doesn't matter much, and we don't have to worry too much about overfitting. Perhaps the stochastic nature of the model plays the role here.

A variant of this RBM with categorical assumption of rating (which largely ignores the ordinal constraints) was introduced in ICML'07. It was demonstrated for the first time that RBMs can be a serious competitor for large-scale collaborative filtering. Since then, RBMs are an integrated part of the so-called "ensemble methods" for high performance systems.

Not-so-restricted Boltzmann machines

Recall that in the RBMs, the direct interactions between items are not considered. This is orthogonal to the standard MRF we consider in the previous post. A natural idea is therefore to combine the two approaches: We model both the direct, local interactions and the indirect, global counterparts. Hopefully this would strengthen each method and lead to better predictions.

With energy-based model such as MRFs, this is fairly easy: We just need to add up the energy function and removing the duplicate biases! The results so far are encouraging -- the combination actually works suggesting that the two approaches capture different type of data aspects.

The joint model of everything

For now we have a semi-restricted Boltzmann machine for each user. However, it is also plausible that each item deserves the same treatment! Besides, the original assumption that the user's choices are independent is not correct: Users only choose those items which are reachable, and it is known that users are influenced by the current choices of their peers.

Thus, we need to integrate all the user-specific models and item-specific models together. Fortunately, this is not so difficult for MRFs. Again, we just need to add energy functions and remove duplicates. The thing we need to worry about is learning for this complicated model. But a closer look that the model structure would suggest a way out: Conditioned on user-specific units, item-specific models are independent. The same holds if we flip the roles of users and items. Thus, an efficient method would be structured pseudo-likelihood (and mean-fields): We look for a block of units that the same time while conditioning on the rest.

Until next time, I wish you enjoy the giant model with a lot of connections and hidden units! See our UAI'09 paper for more technical details.

Previous post: Part 1: learning local dependency | Next post: Part 3: embedding ordinal matrix factorisation

Machine learns as data speak

Thursday, 21 June 2012

Markov random fields for recommender systems II: Discovering latent space

No comments:

Post a Comment