Sunday, 24 June 2012
Markov random fields for recommender systems III: Embedding ordinal matrix factorisation
In the previous posts I described how Markov random fields (MRFs) can be used to represent the local structures and latent spaces in recommendation. The advantage of MRFs lies in their disciplined interpretation and inference. However, the log-linear modelling in MRFs does not lend itself to a simple treatment of ordinal ratings, which we have claimed to be of great importance.
The goal of this post is to develop an MRF-based model that achieves both the representational power of probabilistic graphical models and the ease of modelling ordinal ratings by matrix factorisation. In short, our new model combines an MRF and an ordinal matrix factorisation (OMF) method. For simplicity, we shall focus only on the MRF with local structures, since the latent spaces are already captured by the OMF. Of course, nothing prevents us from modelling the latent spaces by the MRF itself, because the two approaches are quite different: one is linear and the other is non-linear.
The OMF
Recall that the OMF aims to model an ordinal distribution \( Q(r|u,i) \). The standard way is to assume that there exists a latent utility function \( f(u,i) \) that captures how much the user \( u \) values the item \( i \). For simplicity we assume that the utility has the linear, parametric form \( f(u,i) = a_u + b_i + W_u' H_i \), although nonlinear and nonparametric options are possible. The assumption, due to McCullagh, is that the rating is chosen based on the interval to which the utility belongs:
\[ r_{ui} = l \mbox{ if } f(u,i) \in (\theta_{l-1},\theta_{l}] \]
for \( l < L \) and
\[ r_{ui} = L \mbox{ if } f(u,i) > \theta_{L-1} \]
where \( L \) is the number of ordinal levels. Here we usually fix \( \theta_{0} = -\infty \). Other assumptions are possible, but this is by far the most popular. The probability of receiving a rating is therefore
\[ Q(r=l|u,i) = \int_{\theta_{l-1}}^{\theta_l} P(f(u,i) \mid \theta)\, df = F(\theta_l) - F(\theta_{l-1}) \]
where \( F(\theta_l) \) is the cumulative distribution evaluated at \( \theta_l \). The thresholds can be parameterised to depend on the user and item, but care must be taken to ensure that they remain monotonic in \( l \), e.g., \( \theta_{ui,l} = \theta_{ui,l-1} + e^{\eta_{u,l} + \gamma_{i,l}} \).
Depending on the choice of the distribution over \( f(u,i) \), we may end up with the logistic form of \( F(\theta_l) \) or its probit alternatives. See our recent AAAI'12 paper for a comparison.
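To make this concrete, here is a minimal sketch (Python/NumPy; the function and parameter names are my own, not from any paper) of the ordinal probabilities under the logistic choice of \( F \), with the thresholds kept monotonic via the exponential parameterisation above:

```python
import numpy as np

def ordinal_probs(f_ui, theta_base, eta_u, gamma_i):
    """Q(r=l|u,i) under a logistic cumulative-link model.

    f_ui       : scalar utility f(u,i) = a_u + b_i + W_u' H_i
    theta_base : first finite threshold theta_1
    eta_u, gamma_i : arrays of length L-2 for the remaining gaps;
                     thresholds stay monotonic because each gap is exp(.) > 0
    """
    gaps = np.exp(eta_u + gamma_i)                       # positive increments
    thetas = theta_base + np.concatenate(([0.0], np.cumsum(gaps)))
    # F(theta) = logistic CDF of the noisy utility, evaluated at the thresholds
    F = 1.0 / (1.0 + np.exp(-(thetas - f_ui)))
    # Pad with F(theta_0) = 0 and F(theta_L) = 1, then take differences
    F = np.concatenate(([0.0], F, [1.0]))
    return np.diff(F)                                    # Q(r=l|u,i), l = 1..L

probs = ordinal_probs(f_ui=0.7, theta_base=-1.5,
                      eta_u=np.zeros(3), gamma_i=np.zeros(3))  # L = 5 levels
assert np.isclose(probs.sum(), 1.0)
```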
The hybrid model of MRF and OMF
Now we wish to build an MRF model by multiplying the point-wise OMF with the item-item interactions:
\[ P(\mathbf{r}|u) \propto \Psi_u(\mathbf{r}) \prod_{i \in I(u)} Q(r_{ui}|u,i) \]
where \( \Psi_u(\mathbf{r}) > 0 \) is a potential function capturing the interactions among items, and \( \mathbf{r} \) is the vector of ratings over the items rated by user \( u \). In essence, this model promotes agreement between the latent aspects discovered by the OMF and the local structures in the data. The usual form of the potential function factorises into pairwise potentials
\[ \Psi_u(\mathbf{r}) = \exp \left(\sum_i\sum_{j>i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right) \]
where \( \beta_{ijm} \) is a parameter and \( g_m(r_i,r_j) \) is a feature function. For example, the feature functions can be simple indicators
\[ g_{lk}(r_i,r_j) = \delta(r_i,l) \delta(r_j,k) \]
as in the standard treatment of categorical variables; or a bilinear association
\[ g(r_i,r_j) = (r_i - \bar{r}_i)( r_j - \bar{r}_j)\]
as in the usual Gaussian assumption, where \( \bar{r}_i \) is the mean rating for item \( i \); or the metric cost
\[ g(r_i,r_j) = |r_i - r_j| ^ p \]
for \( p > 0\).
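To make the three choices concrete, here is a small sketch (plain Python; the item means are assumed precomputed, and the function names are my own):

```python
def g_indicator(r_i, r_j, l, k):
    """Categorical treatment: one indicator feature per rating pair (l, k)."""
    return float(r_i == l and r_j == k)

def g_bilinear(r_i, r_j, mean_i, mean_j):
    """Gaussian-style association of mean-centred ratings."""
    return (r_i - mean_i) * (r_j - mean_j)

def g_metric(r_i, r_j, p=1.0):
    """Metric cost |r_i - r_j|^p for some p > 0."""
    return abs(r_i - r_j) ** p
```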
Model estimation
Like most MRFs, this model does not admit easy parameter estimation by standard maximum likelihood. Currently there are two effective alternatives: the pseudo-likelihood of Besag, and stochastic gradient with truncated MCMC. The pseudo-likelihood leads to exact computation of the loss function and its gradient with respect to the parameters, and thus may be preferred by some practitioners. MCMC-based methods, on the other hand, may give better estimates given enough time.
Let us present here the case of pseudo-likelihood. We need the local conditional distribution
\[ P(r_i | \mathbf{r}_{\neg i}) \propto Q(r_{ui}|u,i) \exp \left(\sum_{j\in I(u), j \ne i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right) \]
Maximising the log of the product of all local conditional distributions with respect to the parameters gives us the estimate.
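As a sketch, the local conditional for one item under a user's rated set might look like the following (Python/NumPy, using the bilinear feature from above with one symmetric weight per item pair; all names are hypothetical):

```python
import numpy as np

def local_conditional(i, ratings, omf_probs, beta, means, L=5):
    """P(r_i = l | r_{-i}) for l = 1..L, for one user.

    ratings   : dict item -> observed rating for this user
    omf_probs : length-L array with Q(r_i = l | u, i) from the OMF
    beta      : dict (i, j) -> interaction weight (symmetric)
    means     : dict item -> mean rating, for the bilinear feature
    """
    levels = np.arange(1, L + 1)
    interaction = np.zeros(L)
    for j, r_j in ratings.items():
        if j == i:
            continue
        w = beta.get((i, j), beta.get((j, i), 0.0))
        interaction += w * (levels - means[i]) * (r_j - means[j])
    scores = omf_probs * np.exp(interaction)
    return scores / scores.sum()
```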
Rating prediction
Predicting a new rating is easy: we just compute \( P(r_j | \mathbf{r}) \) in the same way as in the pseudo-likelihood, then search for the most probable rating (for metrics such as MAE) or take its expectation (for metrics such as RMSE).
Joining user-based models and item-based models
As we have always argued, each item should have its own life, much in the same way as the users. Thus we can build an item-based model with user-user interactions. Since users and items are complementary entities, the two model types can be fused into one. The idea is quite simple: we just multiply all the models together and remove the duplicates (otherwise the OMF components would appear twice):
\[ P(\mathbf{R}) \propto \left[\prod_u\Psi_u(\mathbf{r}_u)\right] \left[\prod_i\Phi_i(\mathbf{r}_i)\right] \prod_{u,i} Q(r_{ui}|u,i) \]
where \( \mathbf{R} \) is the collection of all seen ratings.
Connection to our AAAI'12 paper
In our AAAI'12 paper we suggest the following form of the utility function
\[ f(u,i) = a_u + b_i + W_u' H_i + \sum_{j \in I(u)} w_{ij} g(r_j) \]
This appears quite similar to our local (log) probability in the pseudo-likelihood method. Of course the fine details differ, but in essence the information captured by the two approaches may be similar. Another subtle difference is that in the MRF treatment the pairwise interactions are symmetric (i.e., \( w_{ij} = w_{ji} \)), while in this model they are asymmetric (\( w_{ij} \ne w_{ji} \)).
Previous posts: Part 1: Modelling local dependency | Part 2: Discovering latent spaces.
Thursday, 21 June 2012
Markov random fields for recommender systems II: Discovering latent space
In the previous post we talked about how Markov random fields (MRFs) can be used to model local structure in the recommendation data. Local structures are powerful enough to make our MRF work, but they model only second-order interactions. We now explore the option of discovering latent spaces in the data. The idea is to capture weaker but higher-order interactions globally.
Restricted Boltzmann machines for user latent space
Restricted Boltzmann machines (RBMs) offer an interesting framework: observed ratings are encoded into the input layer, and the latent factors are represented by the hidden layer. More precisely, an RBM is a special type of MRF where only connections between input nodes and hidden nodes are allowed. As each hidden unit (latent factor) is connected to every input unit, high-order correlations among input units are nicely modelled.
Now we proceed to build RBMs for users, assuming for the moment that users are independent in their choices; thus we build one RBM per user. This is because each user rates only a handful of items, so building a full model over all items is not well justified. For now, let us assume that we include only seen items in the model for estimation. At test time, we will introduce unseen items into the model, assuming that the model does not change.
For simplicity, assume that the latent factors are binary. Denote by \( \mathbf{r} = (r_1,r_2,...,r_N) \) the vector of input variables, and \( \mathbf{h} = (h_1,h_2,..,h_K) \) the vector of latent variables, where \( h_k \in \{0,1\}\). The energy of the RBM reads
\[ E = - \left ( \sum_i \sum_m\alpha_{im} f_m(r_i) + \sum_k \beta_k h_k + \sum_i\sum_m\sum_k w_{ikm} f_m(r_i)h_k \right ) \]
where \( f_m(r_i) \) is a feature function and \( \alpha_{im}, \beta_k, w_{ikm} \) are parameters.
Depending on one's view of the nature of the ratings, we may end up with different feature functions, and thus different models: Gaussian, categorical or ordinal. We can also incorporate item content and user profiles into the model. For example, profile aspects can be treated as a kind of item whose value we always know and never have to predict. For this, a conditional RBM would make sense: we may want to model \( P(\mathbf{r} | \mathbf{p}) \) instead of \( P(\mathbf{r}, \mathbf{p}) \), where \( \mathbf{p} \) is the profile vector.
The model can now be estimated, either by deterministic methods such as pseudo-likelihood or by stochastic methods such as Contrastive Divergence or persistent MCMC. Note that for the pseudo-likelihood method we need to integrate over the hidden units, so it is not entirely the same as Besag's original proposal, but the principle is the same.
At test time, we basically look for
\[ r_{j}^{*} = \arg\max_{r_{j}}P(r_{j}|\mathbf{r}) = \arg\max_{r_{j}}\sum_{\mathbf{h}}P(r_{j},\mathbf{h}|\mathbf{r}) \]
Thus we need to sum over the hidden layer, which is efficient because, conditioned on the seen ratings, the graphical model is a tree. Alternatively, we may resort to a mean-field approximation: each stochastic hidden unit is replaced by its deterministic conditional expectation given the seen ratings, i.e., \(h_k \leftarrow P(h_k=1|\mathbf{r}) \). The big-O complexity is the same, but it is numerically faster since we avoid computing expensive exponential functions.
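As a concrete sketch, here is how the mean-field shortcut might look for the categorical variant (one-hot rating features, as in the ICML'07 style of model); all array shapes and names here are my own choices:

```python
import numpy as np

def hidden_activations(r_onehot, W, b):
    """Mean-field update h_k <- P(h_k = 1 | r) for a categorical RBM.

    r_onehot : (N, L) one-hot encoding of the N seen ratings
    W        : (N, L, K) weights w_{ikm} with indicator features
    b        : (K,) hidden biases beta_k
    """
    act = b + np.einsum('il,ilk->k', r_onehot, W)
    return 1.0 / (1.0 + np.exp(-act))

def rating_posterior(a_j, W_j, h):
    """P(r_j = l | r) for an unseen item j, given mean-field h.

    a_j : (L,) visible biases for item j;  W_j : (L, K) weights for item j
    """
    scores = np.exp(a_j + W_j @ h)
    return scores / scores.sum()
```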
A more subtle question: what if we don't care about exact rating prediction, and any numerical approximation would do? This is often the case with evaluation metrics such as MAE or RMSE. The problem with the RMSE is that it penalises the errors of zero-one (most probable) predictions too much, while the RBM may make arbitrary zero-one errors. A smoother version of the rating prediction would definitely help here:
\[ \bar{r}_{j}=\sum_{S=1}^{L}P(r_{j}=S|\mathbf{r})S \]
for an \( L \)-level rating scale.
Finally, the number of hidden units \( K \) is hard to estimate directly. Random projection theory (e.g., see this KDD paper) suggests that \( K \) should be of the same order as the log of the number of users. In practice, however, we can use cross-validation to choose the best value. My experience suggests that once \( K \) reaches a reasonably large value (e.g., 50-100), the size doesn't matter much, and we don't have to worry much about overfitting. Perhaps the stochastic nature of the model plays a role here.
A variant of this RBM with a categorical assumption on ratings (which largely ignores the ordinal constraints) was introduced at ICML'07. It demonstrated for the first time that RBMs can be a serious competitor for large-scale collaborative filtering. Since then, RBMs have been an integral part of the so-called "ensemble methods" in high-performance systems.
Not-so-restricted Boltzmann machines
Recall that in RBMs the direct interactions between items are not considered. This is orthogonal to the standard MRF considered in the previous post. A natural idea is therefore to combine the two approaches: model both the direct, local interactions and the indirect, global counterparts. Hopefully this strengthens each method and leads to better predictions.
With energy-based models such as MRFs, this is fairly easy: we just add up the energy functions and remove the duplicate biases! The results so far are encouraging -- the combination actually works, suggesting that the two approaches capture different aspects of the data.
The joint model of everything
So far we have a semi-restricted Boltzmann machine for each user. However, it is plausible that each item deserves the same treatment! Besides, the original assumption that users' choices are independent is not correct: users only choose items that are reachable, and it is known that users are influenced by the current choices of their peers.
Thus we need to integrate all the user-specific and item-specific models. Fortunately, this is not so difficult for MRFs: again, we just add the energy functions and remove duplicates. What we do need to worry about is learning in this complicated model. A closer look at the model structure suggests a way out: conditioned on the user-specific units, the item-specific models are independent, and the same holds if we flip the roles of users and items. Thus, an efficient method is structured pseudo-likelihood (and mean-fields): we update a block of units at the same time while conditioning on the rest.
Until next time, I hope you enjoy the giant model with a lot of connections and hidden units! See our UAI'09 paper for more technical details.
Previous post: Part 1: learning local dependency | Next post: Part 3: embedding ordinal matrix factorisation
Saturday, 16 June 2012
Markov random fields for recommender systems I: Modelling local dependency
As I have mentioned, there are two general approaches to collaborative filtering: the latent space (e.g., using matrix factorisation) and the local structures (e.g., using a graph to estimate user-user or item-item similarity). A natural question is: can the two approaches be combined? Numerous pieces of evidence suggest that the answer is yes: the latent-space assumption and the local-structure assumption seem to be somewhat complementary, or at least do not overlap entirely. I guess they do partly overlap, because by spectral decomposition of the similarity matrices we can also discover a latent space, but it is not guaranteed to be the same as the one obtained by factorising the rating matrix.
The only issue is that the combinations of the two methods so far appear a bit heuristic, or at least the two approaches do not come from the same unified representation. The quest for such a unification may be more of theoretical interest than practical: after all, with ensemble methods we can combine any methods as long as they are fairly diverse.
It turns out that probabilistic graphical models (PGMs) nicely offer such a representation: the ratings/preferences data can be seen as a probabilistic database from which many queries can be answered probabilistically. For example, rating prediction can be cast as an inference problem where we infer the probability that a user will assign a rating value to a particular item. More importantly, PGMs separate the issue of model building (e.g., structure specification and parameter estimation) from inference (e.g., rating prediction or item ranking).
There are two general ways to specify a PGM: directed models and undirected models. Directed models for collaborative filtering appeared earlier; they include Bayesian networks and variants of the so-called topic models, PLSA and LDA. The main problem with directed models is that we must be careful in model design to ensure that the structure is a DAG (no directed loops) and that every link from a parent to a child makes sense. Incorporating features and domain knowledge can be hard and generally requires great design skill to get the model right.
Undirected models, also known as Markov random fields or Markov networks, and their variants such as Boltzmann machines, restricted Boltzmann machines, relational Markov networks and Markov logic networks, are more flexible: they allow loops and can incorporate arbitrary features extracted from data or side constraints. This post will focus solely on this undirected class.
Markov random fields for user-specific local structures
Let us start with the local structures (or the neighbourhood). Similarity scores such as Pearson's correlation or cosine are perfectly OK on their own (the parameter estimation stage), but their combination to make a prediction does not seem theoretically guided (the inference stage). Markov random fields (MRFs), on the other hand, allow both stages to be guided in a principled way.
First we need to specify the model structure. For clarity, let us first start with the assumption that users are largely independent in their choice. We will argue later that this assumption is usually wrong in practice, but let the assumption hold for a while.
Our goal is to build graphs, one per user, so that ratings are modelled as node variables and interactions between items as edges. Here we depart from the standard use of MRFs by building multiple graphs instead of just one. This is because each user rates only a handful of items, and building the full graph over all items is not very attractive: first, the graph would be too big for efficient handling; second, most of the variables would be missing, making learning and inference cumbersome; and finally, treating most missing variables as 'just' missing does not seem right: most items shouldn't be there at all, or at best are irrelevant to the taste of the user.
The key here is that although graphs are different, they all share the same set of parameters. Otherwise, learning will not happen robustly, and prediction will be impossible with unidentified parameters.
So we end up including only the seen items in the user-specific graph. Note that this is certainly not optimal, because there will surely be other unseen items that are highly relevant to the user. Indeed, at test time we need to include unseen items to make inference. But this does not seem to be an issue: at test time, we just plug in one test item at a time, assume that the user-specific MRF is unchanged in its parameters, and do the inference \[ r^* = \arg\max_r P(r | \mathbf{r}(u)) \] where \( r \) denotes the rating and \( \mathbf{r}(u) \) denotes the set of seen ratings by user \( u \).
The next question is: how dense should the graph be? Ideally, the graph would be fully connected, since all items rated by the same user interact with each other. However, this is not efficient: the number of edges is quadratic in the number of items, and it makes learning overfit to weakly interacting pairs.
A better way is to make the graph sparse by eliminating weakly interacting items. Estimating the graph structure automatically is the so-called structure learning problem in probabilistic graphical models. A more heuristic way is to use a correlation measure: if two items are strongly positively or negatively correlated, their interaction should be kept. However, it is not easy to determine the right threshold on the interaction strength; an easy alternative is to keep only the K most strongly interacting neighbours of each item.
A hidden question is how to compute the correlation measure. There are a number of information sources we can exploit: from the ratings alone, we can look at the common users between two items; from the content, we can match two items if their descriptions are similar (e.g., two movies directed by the same director or starring the same lead actor/actress). The use of content can thus be implicit.
Are we done with the structure yet? Not quite. MRFs are undirected, so the connections are symmetric: if node j is connected to i, then i must also be connected to j. However, the edges discovered from the K nearest neighbours are asymmetric: if j is a neighbour of i, it does not necessarily mean that i is a neighbour of j. Thus, starting from the K nearest neighbours per node, we need to add edges to make the neighbourhood consistent.
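As a small sketch (Python/NumPy, assuming the item-item similarity matrix is precomputed by whatever measure you prefer), the symmetrisation might look like this:

```python
import numpy as np

def knn_edges(S, K):
    """Build a symmetric K-nearest-neighbour graph from a similarity matrix.

    S : (N, N) item-item correlation matrix; absolute values are used,
        since strong negative correlation is also informative.
    Returns a set of undirected edges (i, j) with i < j.
    """
    A = np.abs(S).astype(float)
    np.fill_diagonal(A, 0.0)                     # ignore self-similarity
    edges = set()
    for i in range(A.shape[0]):
        for j in np.argsort(-A[i])[:K]:          # top-K neighbours of i
            edges.add((min(i, j), max(i, j)))    # union symmetrises the graph
    return edges
```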
The next step is parameterisation. We would use the following energy function
\[ E (\mathbf{r}_u) = -\Big ( \sum_i\sum_m \alpha_{im} g_m(r_i) + \sum_i\sum_{j>i} \sum_m\beta_{ijm} f_m(r_i,r_j) \Big ) \] where \( g_m(r_i) \) and \( f_m(r_i,r_j) \) are feature functions capturing the properties of the ratings and the interaction between two items; \( \alpha_{im} \) and \( \beta_{ijm} \) are parameters.
We will not go into the details of feature design as it is more an art than science. We wish to comment that we can come up with Gaussian random fields, categorical random fields or ordinal random fields, depending on the choice of these functions. See our UAI'09 paper for the comparison of these three model types.
Note that we can easily incorporate item content and user profiles into these feature functions, making the model an attractive hybrid network that helps combat the so-called cold-start problem. This is a clear advantage of undirected models: we can integrate diverse sources of information at little extra cost.
Once the model parameters have been estimated, the prediction for unseen item j is quite easy:
\[ r_j^* = \arg\min_{r_j} E (\mathbf{r}_u, r_j) \]
From user-model to item-model to integrated model
Let us return to the assumption we made earlier: users are independent in making their choices. This is not always the case: users are generally influenced by their peers and the media. From the data representation point of view, the rating matrix does not distinguish the roles of columns and rows -- a transpose gives us a matrix of exactly the same kind!
Thus it is plausible that we can build a model for each item in the same way that we did earlier with the users. The only difference is that now we need to model user-user interactions.
Given that we now have both user-specific and item-specific models, the natural question is: can the two approaches be unified?
The answer is yes, and in a simple way: we just add the two energy functions together. Now we have a giant MRF connecting every user and every item.
From rating-prediction to item-ranking
Recall that the real goal in recommender systems is less about rating prediction and more about producing a ranked list of unseen items for a particular user. It turns out there is a way: if a new item causes a larger expected reduction in model energy, the item is likely to be more influential, and it should be ranked higher.
To sum up, we now have a principled way to integrate users, items, contents and profiles into the same model which supports structure specification, parameter estimation, rating prediction and item ranking. For more technical details, we refer to our AusDM'07 paper.
In the next part of the post, we will focus on how MRFs can be used to discover latent spaces and how to integrate with the local structures.
Next posts Part 2: discovering latent spaces | Part 3: embedding ordinal matrix factorisation
Tuesday, 12 June 2012
Ranking by estimating a cone
Geometry is a fascinating world, and in fact many machine learning algorithms can be expressed in geometrical language: hyperplanes as in SVMs, space partitioning as in k-d trees, decision trees and other non-parametric Bayesian methods, manifold learning, information geometry, to name a few.
Here we will be concerned with a particular geometrical object known as the cone (or, to be more cultural, have a look at the so-called conical Asian hat). A cone has the special property that any non-negative combination of vectors belonging to it also belongs to it. Mathematically, for a cone \( C \), if \( x,y \in C \) then \( \alpha x + \beta y \in C \) for any \( \alpha,\beta > 0 \). On its own, this doesn't seem very interesting.
What interests us more is that a cone allows us to define a more general concept of inequality. Now we can freely compare vectors instead of points: a vector is said to be greater than another if their difference belongs to the cone. Thus we write
\[ x {\succeq}_C y \]
to denote that vector \( x \) is larger than or equal to vector \( y \) with respect to cone \( C \). By definition, we have
\[ x-y \in C. \]
It turns out, this concept of generalized inequality can be exploited in our problem of learning to rank (LTR) and collaborative ranking. Let's see how.
Recall that in LTR, one must order objects according to their preferences or perceived utility and relevance. A standard way is to estimate a ranking functional which takes a pair of (query,object) and returns a real score. As we have already mentioned, this is not the only way. We can take a pair of objects and return their ordering.
Now suppose we are given two objects and their true ordering (as always in training data). For convenience we will assume that each (query, object) pair is represented by a vector. How to come up with this vector is an interesting problem on its own, and this is pretty much domain-specific. For example, in Web search engines, the vector can contain relevance measures from different criteria such as TF-IDF, Okapi BM25 and title matching, and quality measures such as domain authority, structural design, number of in-links, PageRank, timeliness, and so on.
What interests us here is the fact that we have an ordering between the two vectors. By ordering, we assume that there exists an inequality between them, and this is where the generalized inequality with a cone comes in. What is missing is the cone itself: we don't know it in advance, so it must be estimated. We shall assume that a single cone suffices for the problem at hand, although for some problems more than one cone may be needed.
First, we need a parametric way to represent the cone. Recall that any non-negative combination of two vectors in a cone stays in the cone. This generalises easily: any non-negative combination of any number of vectors in a cone also stays in the cone. This suggests a representation: a cone can be represented by several basis vectors, with all other vectors generated from this basis set. Such a cone is polyhedral, in the sense that it is both a polyhedron and a cone.
Let us denote by \( u_1,u_2,..,u_K \) the \( K \) basis vectors. An ordering can be represented as
\[ x-y = \sum_{k=1}^K w_k^{(xy)} u_k \]
where \( w_k^{(xy)} > 0 \) are the coefficients for the pair \( (x,y) \). Thus we are left with two unknown sets of parameters to estimate: the shared basis vectors and the pairwise coefficients. However, we will not cover the details of the estimation for now but refer to our recent work (in progress) here.
Now, assuming we have estimated the basis vectors, what can we do for prediction? Since we do not know which order is correct, it seems we need to try both directions and compare their errors. But this is inconvenient because of the non-negativity constraints. Fortunately, we found an easy way: just do unconstrained regression in one direction and check the sign of the sum of the coefficients. If the sign is positive, that direction is correct; otherwise, the other direction gives the more accurate ordering.
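Here is a minimal sketch of that sign trick (Python/NumPy; the basis matrix and data are toy values of my own, not from the paper in progress):

```python
import numpy as np

def predict_order(x, y, U):
    """Predict whether x >= y w.r.t. the cone spanned by the columns of U.

    Solves the unconstrained least squares  x - y ~ U w  and checks the
    sign of sum(w): positive suggests x - y lies near the cone (x first),
    negative suggests the reverse ordering.
    """
    w, *_ = np.linalg.lstsq(U, x - y, rcond=None)
    return 1 if w.sum() >= 0 else -1   # +1: x ranked above y; -1: y above x

U = np.array([[1.0, 0.0], [0.5, 1.0]])        # two basis vectors as columns
print(predict_order(np.array([2.0, 2.0]), np.array([0.0, 0.0]), U))  # +1
```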
Saturday, 9 June 2012
Collaborative ranking II: The estimation
Designing the rank functional
Let us for now concentrate on the point-wise rank functional, which takes as input a (user, item) pair and returns a real score. Note that we do not mean point-wise loss functions, which will be discussed later in this post.
As inspired by the SVD approach to standard (real-valued) rating prediction (especially after the Netflix challenge), it is natural to go for a function of the form
\[ f(u,i) = W_u'V_i \]
where \( W_u \) is the feature vector of user \( u \) and \( V_i \) is the feature vector of item \( i \). This is not the only way: the function \( f \) is a measure of similarity in the feature space, so kernels are probably applicable here (this particular inner product is really a linear kernel).
Note that the features don't have to be discovered if they are available, e.g., user profiles and item content descriptions. Or the features can be mixed: some are latent (a.k.a. random effects), and some are given (a.k.a. fixed effects). Our forthcoming paper in AAAI'12 offers even more interesting fixed features: the popular items (and their associated ratings for this particular user) themselves can also be highly informative features.
What is more interesting is that the user features and item features don't have to stay in the same space. What we need to do is to build a bilinear functional form
\[ f(u,i) = W_u' A V_i \]
where \( A \) is the association matrix. Since \( A \) may be huge, we may choose to factorise it into two smaller matrices for robust estimation, i.e., \( A = B'C \), where the ranks of \( B \) and \( C \) are small compared to the number of features. That is effectively equivalent to applying linear transforms to the user and item features so that the projected features live in the same space: \( W_u \rightarrow BW_u \) and \( V_i \rightarrow CV_i \).
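As a sketch (Python/NumPy, with toy dimensions of my own choosing), the bilinear score with the low-rank factorisation reads:

```python
import numpy as np

def bilinear_score(w_u, v_i, B, C):
    """f(u,i) = w_u' A v_i with the low-rank factorisation A = B'C.

    B : (d, Du) and C : (d, Di) project user and item features into a
    shared d-dimensional space, where the score is a plain inner product.
    """
    return float((B @ w_u) @ (C @ v_i))

rng = np.random.default_rng(0)
w_u, v_i = rng.normal(size=5), rng.normal(size=8)   # different feature spaces
B, C = rng.normal(size=(3, 5)), rng.normal(size=(3, 8))
print(bilinear_score(w_u, v_i, B, C))
```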
The rank 1 problem
One annoying property of the point-wise rank functional with the linear factorisation discussed above is that on some datasets, the number of hidden features needed can be as small as 1 or 2. More precisely, the rank of the two matrices \( W, V \) can be 1 or 2. This is hugely different from standard SVD for rating prediction, where the effective dimensionality is often in the range of 50-500, depending on how much data we have. In fact, we don't even need any fancy learning: just ranking items by popularity does a good job. See this paper for more discussion.
It has been suggested that we should prepare test data that contains only items which are not popular and are controversial (i.e., the rating variance/entropy is high among users). With more diversity, more features will be needed.
The true reasons behind this phenomenon are still unexplored. My guess is that in standard rating prediction, since it is important to get each rating right, we need to map the user's profile into a high-dimensional space of item ratings; in ranking with a point-wise rank functional, we just need to map onto a single real line. Thus the number of bits needed for rating prediction is larger than that needed for ranking. My conjecture is that the number of bits needed grows with the degree of the polynomial that best describes the rank functional.
The loss functions
To estimate the rank functional, we need to minimise some loss function. Ideally the loss function should be the error metric used in evaluation. Typically, rank metrics discount items deep in the list; well-known examples include NDCG at top T, MRR, MAP and, more recently, Yahoo!'s ERR. Unfortunately, these metrics are non-smooth and thus difficult to optimise directly. A reasonable method is to approximate the rank by a smooth function, for example
\[ \bar{r}_i = \sum_{j \ne i} P(f(i) > f(j)) \]
where \( P(f(i) > f(j)) \) is the probability that item \( i \) is ranked higher than item \( j \). One useful probability form is the sigmoid function
\[ P(f(i) > f(j)) = \frac{1}{1+e^{-\gamma(f(i) - f(j))}} \]
where \( \gamma > 0 \) is a scale parameter. The main problem with this approximate loss is that it can be non-convex, and it is flat when the pairwise probability is close to 0 or 1, making optimisation not a straightforward exercise.
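A quick sketch of this smooth rank approximation (Python/NumPy; the convention that higher scores get larger smooth ranks is my own choice):

```python
import numpy as np

def smooth_ranks(f, gamma=1.0):
    """Approximate ranks: rank_i ~ sum_{j != i} P(f_i > f_j) via a sigmoid.

    f : (N,) scores of the items to be ranked; gamma : scale parameter.
    """
    diff = f[:, None] - f[None, :]                 # pairwise score gaps
    P = 1.0 / (1.0 + np.exp(-gamma * diff))        # P(f_i > f_j)
    np.fill_diagonal(P, 0.0)                       # exclude the j = i term
    return P.sum(axis=1)

print(smooth_ranks(np.array([0.1, 2.0, -1.0])))
```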
Another strategy is to apply a surrogate loss function. It is desirable that the loss function takes the whole list into account (a listwise loss). For example, Luce's model (subsequently rediscovered by Plackett) reads:
\[ P(f(1) > f(2) > \cdots > f(M)) = \prod_{i=1}^M \frac{e^{f(i)}}{\sum_{j=i}^M e^{f(j)}} \]
That said, there has been work proving that even pointwise and pairwise losses are good upper bounds of some popular rank metrics.
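For reference, here is a small sketch of the Plackett-Luce log-likelihood of a given ordering (Python/NumPy; the stabilised suffix log-sum-exp is just one of many ways to compute it):

```python
import numpy as np

def plackett_luce_loglik(f):
    """Log-probability of the ordering f[0] > f[1] > ... > f[M-1]
    under the Luce/Plackett model: prod_i exp(f_i) / sum_{j >= i} exp(f_j)."""
    f = np.asarray(f, dtype=float)
    # log-sum-exp over each suffix, computed stably from the end of the list
    lse = np.log(np.cumsum(np.exp(f[::-1] - f.max())))[::-1] + f.max()
    return float(np.sum(f - lse))

print(plackett_luce_loglik([3.0, 1.0, 0.5]))   # log P(item0 > item1 > item2)
```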
Taking ties into account
Up to this point, we haven't addressed the mismatch between the ratings in the training data and the ranking in the test output. In other words, the labels are ordinal variables, while the end goal is to produce a rank.
In theory it is easy to convert ordinal values into ranks. However, since the number of ordinal levels is usually small (e.g., a 5-star scale), there will be many tied ranks. Standard sorting typically ignores ties, giving a rank where ties are broken arbitrarily depending on how the items are indexed. Much information is thus lost in the conversion.
A better way is to treat ties directly. Early methods in the statistical literature include the Rao-Kupper method (1967) and the Davidson method (1970), both extending the standard pairwise Bradley-Terry model. Last year we contributed to this area (published in SDM'11) by treating ties entirely within the list, without breaking them into pairs. The idea is that ranking with ties can be seen as an instance of simultaneously partitioning a set into subsets and ordering the subsets. The space of such combinatorial operations is extremely large -- it is known in combinatorics that its size is the Fubini number, or ordered Bell number. However, we don't have to enumerate this combinatorial space when constructing the loss function; a nice result is that there exists at least one case where the loss function can be computed in linear time.
[Part 1] [Part 3]
Sunday, 3 June 2012
Collaborative ranking I: Problem setting
As I have mentioned in the previous post, there are two important aspects in modelling collaborative filtering (CF) data: one is the ordinal nature of the ratings, and the other is the end goal of CF, producing a ranked list (of new items for a particular user, or of new users for a particular item) rather than a set of ratings. This post is about the latter aspect. Indeed, I plan a series of posts, so stay tuned.
For clarity, we shall focus exclusively on producing a ranked item list for a particular user. We leave to another post the orthogonal problem of producing a ranked user list for a particular item.
There are really two steps involved in producing an item list: (i) identify a candidate list, and (ii) rank them. Let us look at each of them.
Identifying a candidate set
This seems trivial but it is not. One might think we can use all the remaining items, but that is not very effective. First, each user is really interested in only some items. Second, including all the good items is not necessarily satisfactory to the user: users often want a mix of items of different tastes, even if some of them are not of high quality. And third, how can you measure the quality of the candidate set? What if all the candidate items are of high quality, but none of them would actually appear in the user's future list?
There are a number of ways to produce a candidate list. The most intuitive way is to look for items which similar users have seen. More specifically, we first identify the top K similar users and aggregate all the items seen by them, minus those already seen by this particular user. This poses two questions: (a) what is the similarity measure, and (b) what is the best K? An adequate answer to each question deserves a research paper of its own, so we will not pursue them here.
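As a sketch of this procedure (Python/NumPy; cosine similarity over rating rows is just one choice among many for question (a)):

```python
import numpy as np

def candidate_set(u, R, K=20):
    """Candidate items for user u from the top-K most similar users.

    R : (num_users, num_items) rating matrix with 0 for unseen entries.
    """
    norms = np.linalg.norm(R, axis=1) + 1e-9
    sims = (R @ R[u]) / (norms * norms[u])         # cosine to every user
    sims[u] = -np.inf                              # exclude the user herself
    peers = np.argsort(-sims)[:K]                  # top-K similar users
    seen_by_peers = (R[peers] > 0).any(axis=0)
    seen_by_u = R[u] > 0
    return np.where(seen_by_peers & ~seen_by_u)[0]
```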
There are of course other system criteria -- one might look to maximise the purchase rate, the total revenue per user, the reduction of inventory, the degree of diversity, and so on.
Producing the ranked list
Given the candidate set, a straightforward way to produce the ranked list is to first predict the rating for each item and then sort. For this we do not need any specialised ranking algorithm, but it is less exciting: one might argue that since our objective is ranking, we should learn a ranking function in the first place.
What is a rank function? The simplest rank function takes a (user, item) pair and returns a real score. These scores are used to sort items, which is easy -- the cost is usually N log(N) with standard sorting algorithms. The drawback of this method is that it implicitly assumes all items are independent, whereas they are not: after all, they are liked by the same user (and not just one user). Nevertheless, this method is by far the most popular, dating back to Luce's axiom of choice in the 1950s and continuing through the recent rise of learning-to-rank in information retrieval and label ranking in machine learning.
There is another way: a rank function can take a triple (user, item1, item2) and return the ordering between item1 and item2. The ranking over all items is then based on voting, e.g., the number of times a particular item is ranked higher than the others. The cost of this method is N(N-1)/2 comparisons. One drawback is that it does not enforce transitivity, i.e., if item1 is ranked higher than item2, and item2 higher than item3, then item1 should be ranked higher than item3. It also does not take higher-order interactions among items into account. Despite these shortcomings, this method is widely practised, and one of the earliest citations is Bradley and Terry's work.
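A small sketch of the voting scheme (plain Python; the comparator here is a toy stand-in for a learned pairwise rank function):

```python
def rank_by_votes(items, order):
    """Rank items by pairwise votes: order(i, j) returns +1 if i beats j.

    This is the N(N-1)/2 scheme from the text; ties in vote counts are
    broken arbitrarily, and transitivity is not enforced.
    """
    votes = {i: 0 for i in items}
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            i, j = items[a], items[b]
            if order(i, j) > 0:
                votes[i] += 1
            else:
                votes[j] += 1
    return sorted(items, key=lambda i: -votes[i])

# usage with a toy comparator driven by hidden scores
scores = {'x': 2.0, 'y': 0.5, 'z': 1.0}
print(rank_by_votes(['x', 'y', 'z'],
                    lambda i, j: 1 if scores[i] > scores[j] else -1))
```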
This leaves the most informative method: think of the rank function as a way to produce the permutation whose utility for the user is maximal. In other words, we must solve a set-wise problem rather than a point-wise or pair-wise one. However, there is a big problem here: the computational complexity is bounded by N!, which is too expensive for even moderate N. Thus we will not address this method for the moment.
Updated references:
- Probabilistic Models over Ordered Partitions with Applications in Document Ranking and Collaborative Filtering, T. Truyen, D. Phung, and S. Venkatesh, in Proc. of SIAM Int. Conf. on Data Mining (SDM'11), Arizona, USA, April 2011.
- Learning from Ordered Sets and Applications in Collaborative Ranking, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of the 4th Asian Conference on Machine Learning (ACML'12), Singapore, November 2012.
- Preference Relation-based Markov Random Fields in Recommender Systems, Shaowu Liu, Gang Li, Truyen Tran, Jiang Yuan, in Proc. of the Asian Conference on Machine Learning (ACML'15), Hong Kong, November 2015.
[Part 2]