Machine learns as data speak: Ratings in collaborative filtering should be treated ordinally

Collaborative filtering (CF), like its umbrella, recommender systems, is hot right now. Commercial CF systems have been implemented by major players like Amazon, Google News, TiVo, and Netflix. Within the academic world, CF research is now regularly delivered in popular venues such as information retrieval (or information filtering), information science, knowledge management, social media, data mining, machine learning, AI, statistics, human-computer interaction, drug-discovery, social sciences, marketing, and of course, the ACM RecSys annual conferences.

The basic idea behind CF is that, since people are interested in common items, and thus there must exist some shared structures, e.g., some people are really similar in their tastes or some items are really similar in their characteristics. The sharing will eventually help to predict how much an user will like an item he or she hasn't seen before.

Depending on your problem representation, you will end up with different algorithms.

One particularly appealing representation is matrix (e.g., see here). Here we are given an incomplete user-item matrix, where observed entries are often ratings by users on items they have seen. The tasks are either to predict the ratings of remaining entries, or to produce a list of unseen items for each user (or other way around, a list of new users for an item).

Under this representation, the shared structure can be captured by the low-rank decomposition, that is, the user-item matrix is really a product of an user-feature matrix and an item-feature matrix, where features are latent aspect of each user and each item. The number of latent features is often small compared to the original matrix sizes. In effect, we project both users and items onto the same latent space, and the preference is measured by some similarity between an user and an item in that space.

Another representation is bipartite graph, where users and items are nodes in the graph, and a weighted link between an user and an item indicates the level of interest. The problem is now to predict the missing links and their weights.

However, these two representations are somewhat complementary since they do not overlap entirely. These are evidenced in recent successful methods which combine both the approaches: here, here and here.

Surprisingly, both camps did not even consider the fact that (i) the ratings are indeed ordinal, (ii) we more or less interested in ranking items rather than rating them.

The second problem deserves separate posts, and here we shall focus on the first one.

Ordinal variables are those categorical variables whose relative orders between categories is the most important information, such as those in standard survey options: {very bad, bad, OK, good and very good}. You may be familiar with number of stars given to a movie, but these stars shouldn't be treated as a (real or integer) number. This is because, first, people are different in their use of rating values -- for some a 4 star movie means very excellent, but for other, it is may be just OK. Second, you can't be sure that these stars are spaced equally -- people are really bad at making consistent quantitative judgement.

On the other extreme, ordinal variables may be treated as standard categorical variable, where each time we pick only one category, assuming that the utility of that category is the highest among all categories. However, this treatment effectively ignores the fact that ordinal values are ordered. Under the categorical treatment, all mistakes are equally bad, but in a proper ordinal treatment, a mistake that is far from the true category should be considered as worse than those near by in the ordinal scale.

The absence of adequate ordinal treatment in CF might suggest some degree of ignorance in machine learning research: we sometimes do not pay attention to existing research in statistics. As for ordinal data, the statistical community has been addressed it for many decades under the umbrella of ordinal regression, and a hugely influential paper was written in 1980 by McCullagh. One may argue that the natures of the collaborative filtering and ordinal regression are different. In fact, they do not: The setting of collaborative filtering is identical to that of the so-called "multi-raters" problem, and the modelling of the raters itself has been long known as the random effect models.

Only very recently, in CF there has been some work addressing the ordinal nature of the data: here, here, here, here and here. The last two are ours, one was published in UAI'09 and the other one will appear in AAAI this year. The UAI'09 paper exploits the ordinal property in the feature construction but it doesn't really model ordinal ratings. The AAAI'12 paper addresses this shortcoming and the main contribution is the deviation from the standard McCullagh's model by adopting the sequential decision approach of Mare (1980).

To be more concrete, the McCullagh's model assumes that there exists a continuous utility that the user perceives when seeing an item. The ordinal level is determined by examining the interval to which the utility belongs. This property, when combining with the underlying assumption of the utility distribution, gives rise to several names: grouped continuous model, proportional odds models, cumulative logit model or ordered probit models. This model will work well if there exists the true interpretation of utility interval, such as income ranges, severity levels, etc.

On the other hand, the sequential decision approach assumes the decision is made progressively starting from the lowest level. If the utility is beyond the level's threshold, the next level will be considered. Otherwise, the current level is selected. This approach is attractive in the cases where stagewise progresses are made, for example, the education level, the health recovery status, etc.

The question is then, which approach is the best for CF? There are no easy answers because we do not know the true decision making behind the rating processes. For now we could only make some guess and verify on real datasets. That that is what we reveal in our forthcoming paper: The sequential approach is as good as, if not better than, the McCullagh's approach, for a number of large-scale datasets.

Updated References

Ordinal Boltzmann Machines for Collaborative Filtering. Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In Proc. of 25th Conference on Uncertainty in Artificial Intelligence, June, 2009, Montreal, Canada.
A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 25-th Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, July 2012.
Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 4th Asian Conference on Machine Learning (ACML2012), Singapore, Nov 2012.
Ordinal random fields for recommender systems, Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang, ACML'14, Nha Trang, Vietnam, Nov 2014.
Preference Relation-based Markov Random Fields in Recommender Systems, Shaowu Liu, Gang Li,Truyen Tran, Jiang Yuan, ACML'15, November 20-22, 2015, Hong Kong.
Collaborative filtering via sparse Markov random fields, Truyen Tran, Dinh Phung, Svetha Venkatesh, Information Science, doi:10.1016/j.ins.2016.06.027.

1 comment:

Unknown25 September 2016 at 17:08
Very nice post! I wish I have seen it earlier...

In addition to the models you've described, there's also another family of algorithms, namely, tensor-based algorithms, that allow to model ratings (or any other kind of feedback) as ordinal concept very naturally. Here's the link to our recent paper called "Fifty shades of ratings...", which among other things, shows how to deal with this problem:
http://dl.acm.org/citation.cfm?id=2959170

Thursday, 31 May 2012

Ratings in collaborative filtering should be treated ordinally

1 comment: