Machine learns as data speak: Markov random fields for recommender systems I: Modelling local dependency

As I have mentioned, there are two general approaches to collaborative filtering: the latent space (e.g., using matrix factorisation) and the local structures (e.g., using graph to estimate user-user similarity or item-item similarity). A natural question is then can the two approaches be combined? Numerous evidences have suggested that the answer is yes: The latent space assumption and the local structures assumption seem to be somewhat complementary, or at least do not overlap entirely. I guess they do partly overlap because by using spectral decomposition of the similarity matrices, we can also discover a latent space, but it does not guarantee to be the same as those obtained by factorising the rating matrix.

The only issue is that the combination of the two methods so far appears a bit heuristic, or at least the two approaches do not come from the same unified representation. The quest for such a unification may be more of theoretical interest rather than practical ones: After all with ensemble methods, we can combine any different methods as long as they are fairly diverse.

It turns out, probabilistic graphical models (PGMs) nicely offer such a representation so that the ratings/preferences data can be seen as a probabilistic database from which many queries can be answered probabilistically. For example, rating prediction can be cast as an inference problem where we need to infer the probability that an user will assign a rating value to a particular item. More importantly, it separates out the issue of model building (e.g., structure specification and parameter estimation) from inference (e.g., rating prediction or item ranking).

There are two general ways to specify a PGM: using directed models and using undirected models. Existing directed models for collaborative filtering appeared earlier, and they include Bayesian networks and those variants of the so-called topic models: the PLSA and LDA. The main problem with directed models is that we need to be really careful in our model design to make sure that the structure is a DAG (there will be no loops), and every link from a parent to a child makes some sense. Incorporating features and domain knowledge can be hard and generally requires great design skills to make the model correct.

Undirected models, also known as Markov random fields, Markov networks or their variants such as Botlzmann machines, restricted Boltzmann machines, relational Markov networks, and Markov logic networks are more flexible: They allow loops and can incorporate arbitrary features extracted from data or side constraints. This post will solely focus on this undirected class.

Markov random fields for user-specific local structures

Let us start with the local structures (or the neighbourhood). Similarity scores such as Pearson's correlation or cosine are perfectly OK on their own (parameter estimation stage) but their combination to make prediction does not seem to be theoretically guided (inference stage). Markov random fields (MRFs), on the other hand, allow both the stages to be guided in a principled way.

First we need to specify the model structure. For clarity, let us first start with the assumption that users are largely independent in their choice. We will argue later that this assumption is usually wrong in practice, but let the assumption hold for a while.

Our goal is to build graphs, one per user, so that ratings are modelled as node variables, and interaction between items is modelled as edges. Here we depart from the standard use of MRFs by building multiple graphs instead of just one. This is because users only rate only a handful number of items, and building the full graph for all items is not very attractive: First, the graph will be too big for efficient handling. Second, most of the variables will be missing, making learning and inference cumbersome. And finally, assuming most missing variables to be 'just' missing does not seem right: Most items shouldn't be there, or at best, are irrelevant to the taste of the user.

The key here is that although graphs are different, they all share the same set of parameters. Otherwise, learning will not happen robustly, and prediction will be impossible with unidentified parameters.

So we end up including only those seen items into the user-specific graph. Note that it is certainly not optimal, because they will be surely other unseen items which will be highly relevant to the user. For example, at test time we need to include unseen items to make inference. But it doesn't seem to be an issue: At test time, we just need to plug one test item at a time, assuming that the user-specific MRF is unchanged in their parameters, and then we do inference \[ r^* = \arg\max_r P(r | \mathbf{r}(u)) \] where\( r \) denotes the rating, \( \mathbf{r}(u) \) denotes the set of seen ratings by user \( u \) .

The next question is how dense the graph should be? Ideally, the graph should be fully connected because all the items will interact with each other under the same user. However, this is not efficient: The number of edges is quadratic in number of items. Second, it makes learning overfit to weakly interacting pairs.

A better way should be making the graph sparse, and it we would eliminate weakly interacting items. If we do the graph structure estimation automatically, this would be the so-called structure learning in probabilistic graphical models. A more heuristic way is to use some correlation measures: If the two items are strongly positively or negatively correlated, their interaction should be kept. However, it is not easy to determine the right amount of interaction strength. An easy way is to keep only K most strongly interacting neigbours for each item.

A hidden question is how should we compute the correlation measure. There are number of information sources we can exploit here: From the ratings alone, we can look for the common users between two items. From the content, we can match two items if their descriptions are similar (e.g., two movies directed by the same directors or played by the same lead actor/actress). Thus the use of content can be implicit.

Are we done with the structure yet? Not quite. MRFs are undirected, and thus the connections are symmetric: if node j is connected to i then i must also be connected to j. However, the edges discovered from the K-nearest neighours are asymmetric: if j is a neighbour of i, it does not necessarily mean i is a neighbour of j. Thus from the K-nearest neighbours per node we need to add more edges to make the neighbourhood consistent.

The next step is parameterisation. We would use the following energy function
\[ E (\mathbf{r}_u) = -\Big ( \sum_i\sum_m \alpha_{im} g_m(r_i) + \sum_i\sum_{j>i} \sum_m\beta_{ijm} f_m(r_i,r_j) \Big ) \] where \( g_m(r_i) \) and \( f_m(r_i,r_j) \) are feature functions capturing the properties of the ratings and the interaction between two items; \( \alpha_{im} \) and \( \beta_{ijm} \) are parameters.

We will not go into the details of feature design as it is more an art than science. We wish to comment that we can come up with Gaussian random fields, categorical random fields or ordinal random fields, depending on the choice of these functions. See our UAI'09 paper for the comparison of these three model types.

Note that we can easily incorporate item-content and user-profile into these feature functions, making the model an attractive hybrid network which will help combating against the so-called cold-start problem. This is a clear advantage of undirected models where we can just integrate diverse sources of information with little extra cost.

Once the model parameters have been estimated, the prediction for unseen item j is quite easy:
\[ r_j^* = \arg\min_{r_j} E (\mathbf{r}_u, r_j) \]
From user-model to item-model to integrated model

Let us return to the assumption we made earlier: Users are independent in making their choice. This is not always the case: Users are generally influenced by their peers and the media. From the data representation point of view, the rating matrix does not distinguish the role of columns and rows -- by a transpose we still have the same matrix!

Thus it is plausible that we can build a model for each item in the same way that we did earlier with the users. The only difference is that now we need to model user-user interactions.

Given the fact that we now have either user-specific models or item-specific models, the natural question is can these two approaches be unified?

The answer is yes, and in a simple way: We just need to add the two energies functions together. Now we have a giant MRF connecting every user and every item.

From rating-prediction to item-ranking

Recall that the real goal in recommender systems is less about rating prediction but more about producing a rank list of unseen items for a particular user. It turns out, there is a way: If an new item causes more expected reduction in model energy, the item is likely to be more influential, and it should be ranked higher.

To sum up, we now have a principled way to integrate users, items, contents and profiles into the same model which supports structure specification, parameter estimation, rating prediction and item ranking. For more technical details, we refer to our AusDM'07 paper.

In the next part of the post, we will focus on how MRFs can be used to discover latent spaces and how to integrate with the local structures.

Next posts Part 2: discovering latent spaces | Part 3: embedding ordinal matrix factorisation

Machine learns as data speak

Saturday, 16 June 2012

Markov random fields for recommender systems I: Modelling local dependency

No comments:

Post a Comment