Machine learns as data speak: Thurstonian Boltzmann machines

Saturday, 17 December 2016

Mixed-type data analysis V: One size fits many with Thurstonian Boltzmann machines

This is part V of this series. The previous parts are here:

Part I: Overview
Part II: Pairwise methods
Part III: Mixed-variate Restricted Boltzmann Machines
Part IV: Ordinal data

The world of random variables is diverse and fascinating. Variables can be real-valued, count, categorical (e.g., single choice), multi-categorical (e.g., multiple choices), ordinal (e.g., ratings), rank, preference, boxed, and many other forms. What do they have in common?

A fundamental observation made in our recent ICML'13 paper is that these variables can all be expressed in the same form: a set of inequalities. For example, a real variable can be observed as a point or as an interval, and an interval is defined by two inequalities, one on each side. A categorical variable can be thought of as the choice with the highest "utility" among all options. A ranking is akin to an ordered list of "utilities".
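As a rough illustration of this view (not code from the paper), each observation type can be written as a test on a vector of latent utilities. The function names below are hypothetical and chosen only for this sketch.

```python
def interval_constraint(low, high):
    """Real, boxed or ordinal observation: low <= u[0] <= high."""
    return lambda u: low <= u[0] <= high


def categorical_constraint(choice):
    """Single choice: the chosen option has the highest utility."""
    return lambda u: all(u[choice] >= u[j] for j in range(len(u)) if j != choice)


def ranking_constraint(order):
    """Ranking: utilities are non-increasing along the ranked option indices."""
    return lambda u: all(u[a] >= u[b] for a, b in zip(order, order[1:]))


utilities = [0.1, -0.3, 1.2]
print(categorical_constraint(2)(utilities))      # option 2 has the top utility
print(ranking_constraint([2, 0, 1])(utilities))  # 2 is ranked above 0, 0 above 1
```

A point observation is the special case where `low == high`, and a multi-categorical observation could be handled with one such test per selected option.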

This line of thinking has a long history. Its roots can be traced back to Thurstone's work in the 1920s and 1930s. He posited that if we pick one option over another, the utility of the chosen option is higher than that of the other. A popular way to model utility is to assume a latent Gaussian variable, which gives rise to probit functions. Later, in the 1950s, Luce derived a generalized formula for choosing one category among several. It turns out that if the utility noise follows the Gumbel distribution, also known as the Type-I extreme value distribution, then the probability of choosing an option is proportional to the exponential of its utility. This is known as the multinomial logit model.
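The link between Gumbel noise and the exponential-of-utility formula can be checked numerically. The sketch below is an illustration only: it assumes independent standard Gumbel noise added to each utility, and the function names are made up for this example.

```python
import math
import random


def softmax(utilities):
    """Choice probabilities proportional to exp(utility)."""
    top = max(utilities)
    exps = [math.exp(u - top) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Standard Gumbel draw via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_choice_frequencies(utilities, n_samples=20000, seed=0):
    """Add Gumbel noise to each utility, pick the largest, and count the picks."""
    rng = random.Random(seed)
    counts = [0] * len(utilities)
    for _ in range(n_samples):
        noisy = [u + sample_gumbel(rng) for u in utilities]
        counts[noisy.index(max(noisy))] += 1
    return [c / n_samples for c in counts]


utilities = [1.0, 0.0, -1.0]
print(softmax(utilities))
print(gumbel_choice_frequencies(utilities))
```

The two printed lists should agree up to Monte Carlo error.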

The Gumbel distribution is interesting in itself. It is often used to model extreme values, for example, the highest tide of a year. It is little surprise, then, that it leads to the categorical distribution, which is about choosing the best option. My AAAI'12 paper studies this distribution in the recommender system context.

Now we need a joint model to glue these inequalities together. Again, let us return to Thurstone. For simplicity, assume that there exist latent Gaussian utilities that give rise to these inequalities. What we need now is a joint distribution over all of these Gaussian variables.
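For a single pair of options, the Gaussian assumption already gives a closed form. The sketch below assumes the two utilities are independent Gaussians (a simplification made for this example, not a claim about the paper's model), so their difference is Gaussian as well.

```python
from statistics import NormalDist


def pairwise_preference_prob(mean_a, mean_b, std_a=1.0, std_b=1.0):
    """P(u_a > u_b) for independent Gaussian utilities u_a and u_b."""
    # The difference u_a - u_b is Gaussian with the variances added up.
    diff_std = (std_a ** 2 + std_b ** 2) ** 0.5
    return NormalDist().cdf((mean_a - mean_b) / diff_std)


print(pairwise_preference_prob(1.0, 0.0))  # option a is preferred more often
print(pairwise_preference_prob(0.0, 0.0))  # equal means give a coin flip
```

This is the probit function mentioned earlier. With many correlated utilities no such simple formula is available, which is why a joint model is needed.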

Usually we can use multivariate Gaussian distributions. However, since we do not observe the Gaussian variables directly, inference and estimation become very difficult when there are many latent variables.

We found a particular architecture that is reasonably efficient to sample from: the restricted Boltzmann machine (RBM), a canonical model in the current wave of deep learning. The rest is just the tedious details of MCMC inference.
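To give a flavour of those details, here is a minimal sketch of one Gibbs sweep. It is not the paper's implementation. It assumes unit-variance Gaussian utilities, binary hidden units, and simple box constraints (one interval per utility); all names and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_truncated_normal(mean, low, high, rng):
    """Draw from N(mean, 1) restricted to [low, high] via the inverse CDF."""
    cdf_low = STD_NORMAL.cdf(low - mean)
    cdf_high = STD_NORMAL.cdf(high - mean)
    p = cdf_low + rng.random() * (cdf_high - cdf_low)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    draw = mean + STD_NORMAL.inv_cdf(p)
    return min(max(draw, low), high)  # guard against round-off at the edges


def gibbs_sweep(utilities, bounds, weights, visible_bias, hidden_bias, rng):
    """Sample hidden units given utilities, then utilities given hidden units."""
    n_visible, n_hidden = len(visible_bias), len(hidden_bias)
    hidden = []
    for k in range(n_hidden):
        act = hidden_bias[k] + sum(weights[i][k] * utilities[i] for i in range(n_visible))
        hidden.append(1 if rng.random() < sigmoid(act) else 0)
    new_utilities = []
    for i in range(n_visible):
        mean = visible_bias[i] + sum(weights[i][k] * hidden[k] for k in range(n_hidden))
        low, high = bounds[i]  # the inequalities implied by the observation
        new_utilities.append(sample_truncated_normal(mean, low, high, rng))
    return new_utilities, hidden


rng = random.Random(0)
weights = [[0.5, -0.2], [0.1, 0.3]]  # 2 utilities x 2 hidden units
bounds = [(0.0, 1.0), (float("-inf"), -0.5)]
u, h = gibbs_sweep([0.5, -1.0], bounds, weights, [0.0, 0.0], [0.0, 0.0], rng)
print(u, h)
```

Given the hidden units, the utilities are conditionally independent, so each one is a one-dimensional truncated Gaussian draw. For categorical or rank observations the bounds of one utility would depend on the current values of the others, which this sketch does not cover.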

Friday, 16 December 2016

Mixed-type data analysis IV: Representing multivariate ordinal data

Multivariate ordinal data is common whenever human judgement is involved. For example, in collaborative filtering, we rate multiple items, each with a number of stars. In a typical survey, we provide ordinal assessments of many things, ranging from the feeling of the day (happy, OK, sad) to the current state of worldwide security (safe, OK, dangerous). Since these come from the same person, they are correlated, and thus we need to model multiple ordinal variables simultaneously. This post presents an overview of the area.

Much of the existing work, however, focuses on a single ordinal variable, typically under the umbrella of "ordinal regression". What about multiple ordinal variables?

There are several approaches. One way is to assume that ordinal data are just quantized versions of an underlying continuous variable. Thus, each ordinal value corresponds to an interval of the underlying variable. This is intuitive, for example, when we say salary levels A, B and C, and they refer to ranges like A = $[50K, 60K), B = $[60K, 70K) and C = $70K+.
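The salary example can be written as a small lookup. This is a sketch only: the names are made up, and it assumes each boundary value belongs to the higher level.

```python
from bisect import bisect_right


def to_ordinal_level(value, boundaries, labels):
    """Map a continuous value to the label of the interval it falls in."""
    return labels[bisect_right(boundaries, value)]


salary_boundaries = [60_000, 70_000]  # cut points between A|B and B|C
salary_labels = ["A", "B", "C"]

for salary in (55_000, 60_000, 65_000, 82_000):
    print(salary, to_ordinal_level(salary, salary_boundaries, salary_labels))
```

Note that this sketch also maps salaries below $50K to level A, since the example does not say what happens there.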

This thinking is convenient, especially when the underlying variable is assumed to be Gaussian. We can build a multivariate Gaussian distribution. The problem is that we never observe these Gaussian variables directly, only indirectly through the intervals dictated by the ordinal levels. Things get more interesting when the intervals are unknown. The only requirement is that the intervals be consecutive (i.e., no gaps). We then need to estimate the interval boundaries from data.
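For a single ordinal variable, the probability of each level is the Gaussian mass between two consecutive boundaries, and candidate boundaries can be compared by their log-likelihood. The sketch below illustrates that idea only; it is not the estimation procedure from the paper, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def ordinal_level_probs(mean, thresholds, std=1.0):
    """Probability of each ordinal level under a latent N(mean, std^2)."""
    latent = NormalDist(mean, std)
    edges = [float("-inf")] + list(thresholds) + [float("inf")]
    # Consecutive intervals with no gaps, so the probabilities sum to 1.
    return [latent.cdf(hi) - latent.cdf(lo) for lo, hi in zip(edges, edges[1:])]


def ordinal_log_likelihood(levels, mean, thresholds, std=1.0):
    """Log-likelihood of observed level indices (0, 1, 2, ...)."""
    probs = ordinal_level_probs(mean, thresholds, std)
    return sum(math.log(probs[level]) for level in levels)


observed = [1, 1, 1, 0, 2]
print(ordinal_log_likelihood(observed, 0.0, [-1.0, 1.0]))
print(ordinal_log_likelihood(observed, 0.0, [-3.0, -2.0]))  # a poorer fit
```

Boundaries that place more mass on the levels actually observed get a higher log-likelihood, which is what an estimation procedure would seek to maximise.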

This is basically the main idea behind the Cumulative RBM paper published in ACML'12. However, we go further because multivariate Gaussian distributions are hard to sample from under interval constraints. We leverage Gaussian-Bernoulli Restricted Boltzmann Machines instead, which makes MCMC sampling quite efficient. The RBM style also makes it easy to extend the model to matrix data, with row and column RBMs linked together.

The other way is to use a log-linear model, treating the ordinal variable as categorical but with log-linear constraints among the ordered levels. This is the idea behind the Ordinal Boltzmann Machines paper published in UAI'09.

Updated references:

Ordinal Boltzmann Machines for Collaborative Filtering, Truyen Tran, Dinh Q. Phung and Svetha Venkatesh, in Proc. of the 25th Conference on Uncertainty in Artificial Intelligence (UAI'09), Montreal, Canada, June 2009.
A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of the 26th AAAI Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, July 2012.
Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of the 4th Asian Conference on Machine Learning (ACML'12), Singapore, Nov 2012.
Ordinal Random Fields for Recommender Systems, Shaowu Liu, Truyen Tran, Gang Li and Yuan Jiang, ACML'14, Nha Trang, Vietnam, Nov 2014.