Machine learns as data speak: mixed-variate restricted Boltzmann machines

Showing posts with label mixed-variate restricted Boltzmann machines. Show all posts

Saturday, 17 December 2016

Mixed-type data analysis V: One size fits many with Thurstonian Boltzmann machines

This is part V of this series. The previous parts are here:

Part I: Overview
Part II: Pairwise methods
Part III: Mixed-variate Restricted Boltzmann Machines
Part IV: Ordinal data

The random variable world is very diverse and fascinating. They are real, count, categorical (e.g., single choice), multi-categorical (e.g., multiple choices), ordered (e.g., rating), rank, preference, boxed and many other forms. Wonder what they have in common?

A fundamental observation made in our recent ICML'13 paper is that, these variables can be expressed using the same form -- a set of inequalities. For example, real variables can receive values as a point, or an interval, which is essentially defined by two inequalities at two sides. A categorical variable can be thought as having the highest "utility" among all choices. A ranking is akin to having an ordered list of "utilities".

These kinds of thinking have a long history. The root can be traced back to the 1920s and 1930s under Thurstone. He posited that if we pick one choice over the other, it means the utility of that choice is higher than the other. A popular way to model utility is to assume a latent Gaussian variable, giving rise to probit functions. Later, in the 1950s, Luce derived a generalized formula for categorical choice among several. He found that if the utility follows the Gumbel distribution, also known as Extreme Value Distribution, then the probability of choosing the right choice is proportional to its (exponential of) utility. This is also known as multinomial distribution.

The Gumbel distribution is interesting itself. It is often used to model extreme values, for example, the highest tide of a year. Little surprise that it leads to categorical distribution, which is about choosing the best option. My AAAI'12 paper studies this distribution in the recommender system context.

Now we need a joint tool to glue these inequalities. Again, let us return to Thurstone. For simplicity, assume that there exist latent Gaussian utilities that give rise to these inequalities. What we need now is a joint distribution of all Gaussian variables.

Usually we can use multivariate Gaussian distributions. However, since we do not observe the Gaussian directly, inference and estimation are very difficult with many latent variables.

We found a particular architecture that is reasonable efficient to sample -- the restricted Boltzmann machine (RBM), a canonical method in the current wave of deep learning. The rest are just tedious details of MCMC inference.

Friday, 16 December 2016

Mixed-type data analysis IV: Representing multivariate ordinal data

Multivariate ordinal data is popular when human judgement is involved. For example, in collaborative filtering, we rate multiple items, each of which with a number of stars. In a typical survey, we provide ordinal assessment of many things, ranging from feeling of the day (happy, OK, sad) to the current situation of worldwide security (safe, OK, dangerous). Since these come from the same person, they are correlated, and thus we need to model multiple ordinal variables simultaneously. This blog will present an overview of the area.

Much of existing work, however, is focusing on single ordinal variable, typically under the umbrella of "ordinal regression". How about multiple ordinal variables?

There are several approaches. One way is to assume that ordinal data are just quantized version of an underlying continuous variable. Thus, each ordinal value corresponds to an interval of the underlying variable. This is intuitive, for example, when we says salary levels A, B and C, and they refer to ranges like A = $[50K,60K], B = $[60K,70K] and C = $70K+.

This thinking is convenient, especially when the underlying variable is assumed to be Gaussian. We can build a multivariate Gaussian distribution. The problem is that we will never observe these Gaussian variables directly but indirectly through the intervals dictated by the ordinal levels. Things get more interesting when the intervals are unknown. The only requirement is that the intervals have to be consecutive (i.e., no gaps). With this, we need to estimate the interval boundaries from data.

This is basically the main idea behind this paper published in ACML'12. However, we go further because the multivariate Gaussian distributions are hard to sample from under interval constraints. We leverage Gaussian-Bernoulli Restricted Boltzmann Machines instead. This makes MCMC sampling quite efficiently. The RBM style can also make it easy to extend to model the matrix with row and column RBMs linked together.

The other way is to use log-linear model, treating the ordinal as categorical but with log-linear constraints among the ordered levels. This is the idea behind this work published in UAI'09.

Updated references:

Ordinal Boltzmann Machines for Collaborative Filtering. Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In Proc. of 25th Conference on Uncertainty in Artificial Intelligence, June, 2009, Montreal, Canada.
A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 25-th Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, July 2012
Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 4th Asian Conference on Machine Learning (ACML2012), Singapore, Nov 2012.
Ordinal random fields for recommender systems, Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang, ACML'14, Nha Trang, Vietnam, Nov 2014.

Saturday, 10 December 2016

Mixed-type data analysis III: Mixed-variate Restricted Boltzmann machines

This is part III of this series. The previous parts are here:

Part I: Overview
Part II: Pairwise methods

It is clear from the previous post that we need a better way to represent arbitrary number of types without resorting to ad-hoc treatments. Restricted Boltzmann machines (RBMs) offering an unified way, leading to what is called mixed-variate RBM.

A RBM is a Markov random field with two layer of nodes, one for visible variables, the other for hidden variables. For our purpose, we shall assume that hidden variables are binary, and visible variables are typed. Standard RBMs often deal with only a single visible type, mostly binary and Gaussian. Other less popular types are count, categorical, ordinal, intervals. There are several treatments of count: Poisson, constrained Poisson and replicated multinomial by Ruslan Salakhutdinov and others.

Regardless of the types, the principles are the same: The model is a Boltzmann machine (clearly from its name) that admits a Boltzmann distribution (a.k.a. exponential family, Gibbs distribution, log-linear, etc). This form is flexible, virtually most known types can be expressed easily. The RBMs are one of the hot kids these days, thanks to the hype in deep learning driven by big guys such as Google, Facebook, Microsoft and Baidu.

It is natural that we can plug all types together, all share the same hidden layer. Then all the problems with mixed-type suddenly disappear! This is because now the interactions are limited to types and the hidden layer, which is usually binary. No need for all the types to mess up with each other directly.

Although it appears effortless, and we simply expect it to work, as we have shown in the mixed-variate RBM, there are several issues that are worth discussing.

First, we have intended only to work with primitive types, regardless of how the types arise. However, moving up to one more layer, we may be worried about multiple modalities rather than types. For example, an image with tags has two modalities, and they represent different kinds of data at different levels of semantics. Typically an image is considered as a vector of pixels, and thus it is natural that we can use either Gaussian (for continuous pixel intensity) or Beta types (for bounded intensity).

Second, we can efficiently estimate the data density up to a constant, using a quantity known as "free-energy". This offers a natural way for anomaly detection.

Third, since the representation layer is all binary, we can indeed stack multiple RBMs on top of the first layer, making the entire structure a mixed-variate Deep Belief Network.

Fourth, since the estimation of the posterior is straightforward, it paves a way to learn distance metric. The distance are useful in many tasks including information retrieval, k-nearest neighbor classification and clustering. These capabilities are not possible with other mixed-type modeling methods.

Updated references

Multilevel Anomaly Detection for Mixed Data, K Do, T Tran, S Venkatesh, arXiv preprint arXiv: 1610.06249.
Learning deep representation of multityped objects and tasks, Truyen Tran, D. Phung, and S. Venkatesh, arXiv preprint arXiv: 1603.01359.
Outlier Detection on Mixed-Type Data: An Energy-based Approach, K Do, T Tran, D Phung, S Venkatesh, International Conference on Advanced Data Mining and Applications (ADMA 2016).
Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), Gold Coast, Australia, April 2013.
Learning sparse latent representation and distance metric for image retrieval, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of IEEE International Conference on Multimedia and Expo (ICME), San Jose, California, USA, July 2013.
Mixed-Variate Restricted Boltzmann Machines, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the 3rd Asian Conference on Machine Learning (ACML2011), Taoyuan, Taiwan, Nov 2011.

Tuesday, 17 June 2014

Mixed-type data analysis II: Pairwise models

Analysing mixed-type data is hard. A popular way is to consider a pair of types, e.g., continuous and binary. A simple method is to utilise the standard rule $ P(A,B) = P(A)P(B|A) $, were $A$ can be continuous and $B$ can be binary. In this particular case, $P(A)$ can be a Gaussian distribution and $P(B|A)$ a logistic distribution, where A plays some role in the location, intercept or scale parameters of $P(B|A)$.

For example, you wonder the relation between IQ ($A$) and having a top job ($B$). The distribution of IQ $P(A)$ can be easily estimated from the population, but the conditional distribution of having a top job given your IQ (any nothing else) $P(B|A)$ requires a bit of thinking. So you may want to consider IQ as a feature and estimate its weight. You may find some positive weight there, but it may not be as strong as you might wish. You may then question such a linear relationship and start modeling a polynomial equation, or chopping up the IQ into ranges (oops, this creates another issue, really). For example, it could be the case that after a certain IQ threshold, job success does not really depend on IQ anymore (possibly not every job is like that, especially in theoretical physics and maths).

To compute the conditional distribution of IQ given job success $P(A|B)$ we need the Bayes' rule: $Q(A|B) = P(A)P(B|A) / P(B)$, where $P(B)$ is a normalising factor.

Alternatively, you can use the same rule, but differently: $P(A,B) = P(B)P(A|B)$. Again $P(B)$, the proportion of working people with top jobs, can be estimated from some national statistics. The IQ distribution among successful people P$(A|B)$. You may find that $P(A|B)$ may not even be a Gaussian but rather skewed to the right. I don't know for sure, just a guess.

It is interesting to compare $Q(A|B)$ and $P(A|B)$ using the two methods. In a perfect world, they should be the same, but in practice they may not. This is because of the assumptions in the model specification and the unreliable estimation. This is one problem.

Another problem with this approach is that it is limited to pairs and cannot be easily generalised to more than two types. Even more, the number of models will be quadratic in number of variables. And finally, it does not offer an easy way for further analysis using existing tools (such as visualisation in 2D or 3D).

A better way is to imagine there exists some hidden variables that govern the generation of these types. Given these hidden variables, types are conditionally independent, and thus we don't have to really model the interaction among types. We have to, however, just model the interaction between each type and the hidden variables. These are techniques behind our recent attempts: mixed-variate restricted Boltzmann machines and Thurstonian Boltzmann machines. These are the subjects of subsequent posts.