This is part V of this series. The previous parts are here:
- Part I: Overview
- Part II: Pairwise methods
- Part III: Mixed-variate Restricted Boltzmann Machines
- Part IV: Ordinal data
The random variable world is very diverse and fascinating. They are real, count, categorical (e.g., single choice), multi-categorical (e.g., multiple choices), ordered (e.g., rating), rank, preference, boxed and many other forms. Wonder what they have in common?
A fundamental observation made in our recent ICML'13 paper is that, these variables can be expressed using the same form -- a set of inequalities. For example, real variables can receive values as a point, or an interval, which is essentially defined by two inequalities at two sides. A categorical variable can be thought as having the highest "utility" among all choices. A ranking is akin to having an ordered list of "utilities".
These kinds of thinking have a long history. The root can be traced back to the 1920s and 1930s under Thurstone. He posited that if we pick one choice over the other, it means the utility of that choice is higher than the other. A popular way to model utility is to assume a latent Gaussian variable, giving rise to probit functions. Later, in the 1950s, Luce derived a generalized formula for categorical choice among several. He found that if the utility follows the Gumbel distribution, also known as Extreme Value Distribution, then the probability of choosing the right choice is proportional to its (exponential of) utility. This is also known as multinomial distribution.
The Gumbel distribution is interesting itself. It is often used to model extreme values, for example, the highest tide of a year. Little surprise that it leads to categorical distribution, which is about choosing the best option. My AAAI'12 paper studies this distribution in the recommender system context.Now we need a joint tool to glue these inequalities. Again, let us return to Thurstone. For simplicity, assume that there exist latent Gaussian utilities that give rise to these inequalities. What we need now is a joint distribution of all Gaussian variables.
Usually we can use multivariate Gaussian distributions. However, since we do not observe the Gaussian directly, inference and estimation are very difficult with many latent variables.
We found a particular architecture that is reasonable efficient to sample -- the restricted Boltzmann machine (RBM), a canonical method in the current wave of deep learning. The rest are just tedious details of MCMC inference.