Sunday, 25 December 2016

30 years of a Swiss army knife: Restricted Boltzmann machines


I read somewhere, but cannot recall exactly who said so, that in ancient worlds, 30 years are long enough for the new generation to settle down with a new system, regime or ideology. As there are only a few days away from 2017, I would like to look back the history of a 30-year old model which has captured my research attention for the past 10 years.

To some of you, restricted Boltzmann machine (RBM) may be a familiar name, especially for those who follow the current deep learning literature since the beginning. But RBM has also passed its prime time, so you may have heard about it in passing.

I was attracted to RBM for several reasons. When I was studying conditional random fields in 2004 and was looking for a fast way to train with arbitrary structures, Contrastive Divergence (CD) appears to be an interesting one. While CD is a generic technique, it was derived especially for RBMs. Second, RBM has "Boltzmann" in the name, which is kind of interesting, because physicists are kind of sexy :)

Needless too say, another big reason is that RBM, together with its cousin, Autoencoder are building blocks of unsupervised deep nets, which started the current revolution -- deep learning.

The greatest reason is that I think RBM is one of the most important classes of data models known to date, perhaps comparable to PCA in dimensionality-reduction  and k-means in clustering in terms of usefulness.

First introduced in 1986 by  Paul Smolensky under the name Harmonium in a classic two-volume book known as PDP (Parallel Distributed Processing), co-edited by Rumelhart and McLelland. RBMs were subsequently popularised by Geoff Hinton in the 2000s, especially in 2001 with the introduction of Contrastive Divergence (CD), and  in 2006 with the introduction of a deep version known as Deep Belief Nets (DBN).

Statistically, RBM is a probabilistic model of data, i.e., it assigns a probability (or density) to a multivariate data. Initially, RBMs are limited to binary data (known as Bernoulli-Bernoulli RBM), but subsequently extended to Gaussian data (known as Gaussian-Bernoulli RBM), and mixed-types (known as Mixed-variate RBM, or Thurstonian Boltzmann machine).

Source: http://deeplearning.net/tutorial/_images/rbm.png

RBM is a special case of Boltzmann machine, which is in turn a special case of Markov random field. It has two layers, one for observed data, the other for latent representation. Due to its special bipartite structure, MCMC inference can be implemented in a block-wise fashion. Learning is relatively fast with CD or its Persistent version. Estimating of latent representation is very fast with a single matrix operation. RBM is also a powerful model in the sense that it can represent any distribution given enough hidden units. As a Markov random field, it has log-linear paramerization which makes it easy to incorporate a variety of domain knowledge.

With all of these advantages, RBMs have been used successfully in many applications, ranging from density modelling, feature extraction, dimensional reduction, clustering, topic modeling, imputation, classification, retrieval and anomaly detection.

Some bias selection of developments
  • 1986: first introduced as Harmonium.
  • 2001: fast approximate biased learning introduced as Contrastive Divergence (CD)
  • 2004: generalized Harmonium introduced
  • 2006: used successfully in Deep Belief Networks
  • 2007: demonstrated with great success on a very large-scale task within the Netflix challenge
  • 2007: temporal RBM
  • 2008: recurrent temporal RBM
  • 2008: classification RBM
  • 2008: persistent CD introduced, essentially another variant of Young's.
  • 2008: convolutional RBMs
  • 2008: universality property proved
  • 2009: topic models with Replicated Softmax
  • 2009: matrix modelling with non i.i.d. RBMs, ordinal data, semi-restricted RBM
  • 2009: implicit mixtures of RBMs
  • 2010: factored high-order RBM
  • 2010: mean-covariance RBM
  • 2010: rectifier linear units RBM
  • 2010: deep BM
  • 2011: mixed-variate RBM
  • 2012: a proper modeling of ordinal matrix data
  • 2013: Thurstonian BM for joint modeling of most known data types
  • 2013: nonnegative RBMs for parts-based representation
  • 2015: trained with graph priors, demonstrating better generalization
  • 2015: extended to tensor-objects
  • 2016: infinite RBM
In short, most of the work has been on extending the representational power of RBM to suit problem structures. The rest is about analysing theoretical properties of RBMs, making deep nets out of RBMs, and improving training speed & accuracy. For the past few years, research about RBMs has slowed down significantly, mostly because the superb accuracy of supervised deep nets, and the ease of deployment of deterministic nets on large-scale problems. 

Some of our own work

1 comment: