
Saturday, 17 December 2016

Mixed-type data analysis V: One size fits many with Thurstonian Boltzmann machines


This is part V of this series. The previous parts are here:

The world of random variables is diverse and fascinating. Variables can be real-valued, counts, categorical (e.g., single choice), multi-categorical (e.g., multiple choices), ordinal (e.g., ratings), ranks, preferences, boxed (bounded) and many other forms. What do they all have in common?

A fundamental observation made in our recent ICML'13 paper is that all these variables can be expressed in the same form -- as a set of inequalities. For example, a real variable can be observed as a point or as an interval, which is essentially defined by two inequalities, one at each side. A categorical variable can be thought of as having the highest "utility" among all choices. A ranking is akin to an ordered list of "utilities".
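
Concretely, writing \(x\) for an observed value and \(u\) for latent utilities, the encodings look like this (a schematic summary of the idea, not the paper's exact notation):
  • A real observation in an interval: \( l \le x \le r \) (a point is the degenerate case \( l = r \)).
  • A categorical choice \( c \) among \( K \) options: \( u_c \ge u_k \) for all \( k \ne c \).
  • A rank \( \pi \) over \( K \) items: \( u_{\pi_1} \ge u_{\pi_2} \ge \dots \ge u_{\pi_K} \).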

This line of thinking has a long history, tracing back to Thurstone in the 1920s and 1930s. He posited that if we pick one choice over another, the utility of that choice is higher than the other's. A popular way to model utility is to assume a latent Gaussian variable, giving rise to probit functions. Later, in the 1950s, Luce derived a generalised formula for categorical choice among several options. He found that if the utilities follow the Gumbel distribution, also known as the Extreme Value Distribution, then the probability of choosing an option is proportional to the exponential of its utility -- the familiar multinomial logit (softmax) form.

The Gumbel distribution is interesting in itself. It is often used to model extreme values, for example, the highest tide of a year. Little surprise, then, that it leads to the categorical distribution, which is about choosing the best option. My AAAI'12 paper studies this distribution in the recommender system context.
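
A quick way to see the Thurstone-Luce connection is the Gumbel-max identity: add independent Gumbel noise to each option's utility, pick the argmax, and the choice frequencies match the softmax (multinomial logit) probabilities. A minimal sketch (illustrative only, not code from the paper):

import numpy as np

rng = np.random.default_rng(0)
utilities = np.array([1.0, 0.5, -0.3])            # deterministic utilities of three options

# Thurstone/Luce view: perturb utilities with i.i.d. Gumbel noise and pick the best option
noisy = utilities + rng.gumbel(size=(100_000, 3))
empirical = np.bincount(noisy.argmax(axis=1), minlength=3) / 100_000

# Closed-form multinomial logit (softmax) probabilities
softmax = np.exp(utilities) / np.exp(utilities).sum()
print(empirical, softmax)                          # the two should match closely
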
Now we need a joint model to glue these inequalities together. Again, let us return to Thurstone. For simplicity, assume that there exist latent Gaussian utilities that give rise to these inequalities. What we need now is a joint distribution over all the Gaussian variables.

The usual choice would be a multivariate Gaussian distribution. However, since we never observe the Gaussians directly, inference and estimation become very difficult with many latent variables.

We found a particular architecture that is reasonably efficient to sample from -- the restricted Boltzmann machine (RBM), a canonical model in the current wave of deep learning. The rest is just the tedious detail of MCMC inference.
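
To make the MCMC concrete, here is a toy sketch of one Gibbs sweep for an RBM with binary hidden units and latent Gaussian utilities constrained to observed intervals. The parametrisation and names are simplifications for illustration, not the exact model of the paper; constraints that couple utilities (e.g., a categorical choice) would supply bounds computed from the current values of the competing utilities.

import numpy as np
from scipy.stats import truncnorm

def gibbs_sweep(u, lo, hi, W, a, b, rng):
    """One MCMC sweep: binary hiddens h, latent Gaussian utilities u kept inside [lo, hi]."""
    # 1) Hidden units given utilities: h_j ~ Bernoulli(sigmoid(b_j + sum_i u_i W_ij))
    p_h = 1.0 / (1.0 + np.exp(-(b + u @ W)))
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # 2) Utilities given hiddens: u_i ~ N(a_i + W_i . h, 1), truncated to the observed interval
    mean = a + W @ h
    u_new = truncnorm.rvs(lo - mean, hi - mean, loc=mean, scale=1.0, random_state=rng)
    return u_new, h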


Friday, 16 December 2016

Mixed-type data analysis IV: Representing multivariate ordinal data


Multivariate ordinal data is common wherever human judgement is involved. For example, in collaborative filtering we rate multiple items, each with a number of stars. In a typical survey, we give ordinal assessments of many things, ranging from the feeling of the day (happy, OK, sad) to the current state of worldwide security (safe, OK, dangerous). Since these answers come from the same person, they are correlated, and thus we need to model multiple ordinal variables simultaneously. This post presents an overview of the area.

Much of the existing work, however, focuses on a single ordinal variable, typically under the umbrella of "ordinal regression". What about multiple ordinal variables?

There are several approaches. One is to assume that ordinal data are just a quantised version of an underlying continuous variable, so that each ordinal value corresponds to an interval of that variable. This is intuitive: when we say salary levels A, B and C, we may mean ranges such as A = [$50K, $60K], B = [$60K, $70K] and C = $70K+.

This view is convenient, especially when the underlying variable is assumed to be Gaussian: we can then build a multivariate Gaussian distribution over all the variables. The catch is that we never observe the Gaussian variables directly, only indirectly through the intervals dictated by the ordinal levels. Things get more interesting when the intervals are unknown; the only requirement is that they be consecutive (i.e., no gaps), so the interval boundaries must be estimated from data.
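
In the single-variable case, this latent-Gaussian view is just the ordered probit model: the probability of level \(k\) is the Gaussian mass falling into the \(k\)-th interval. A minimal sketch (the thresholds below are made up for illustration):

import numpy as np
from scipy.stats import norm

# Consecutive intervals (-inf, -0.5], (-0.5, 0.7], (0.7, +inf) for three ordinal levels
thresholds = np.array([-np.inf, -0.5, 0.7, np.inf])

def ordinal_probs(mu, sigma=1.0):
    """P(level = k) = Phi((t_k - mu)/sigma) - Phi((t_{k-1} - mu)/sigma)."""
    return np.diff(norm.cdf((thresholds - mu) / sigma))

print(ordinal_probs(mu=0.2))   # one probability per level, summing to 1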

This is essentially the main idea behind this paper published at ACML'12. We go further, however, because multivariate Gaussian distributions are hard to sample from under interval constraints; we leverage Gaussian-Bernoulli restricted Boltzmann machines instead, which makes MCMC sampling quite efficient. The RBM formulation also makes it easy to extend to matrix data, with row and column RBMs linked together.

The other approach is to use a log-linear model, treating the ordinal variable as categorical but with log-linear constraints among the ordered levels. This is the idea behind this work published at UAI'09.

Updated references:

  • Ordinal Boltzmann Machines for Collaborative Filtering, Truyen Tran, Dinh Q. Phung and Svetha Venkatesh, in Proc. of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montreal, Canada, June 2009.
  • A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of the AAAI Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, July 2012.
  • Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of the 4th Asian Conference on Machine Learning (ACML 2012), Singapore, Nov 2012.
  • Ordinal Random Fields for Recommender Systems, Shaowu Liu, Truyen Tran, Gang Li and Yuan Jiang, in Proc. of the Asian Conference on Machine Learning (ACML'14), Nha Trang, Vietnam, Nov 2014.



Saturday, 10 December 2016

Mixed-type data analysis III: Mixed-variate Restricted Boltzmann machines


This is part III of this series. The previous parts are here:
It is clear from the previous post that we need a better way to represent an arbitrary number of types without resorting to ad-hoc treatments. Restricted Boltzmann machines (RBMs) offer a unified way, leading to what we call the mixed-variate RBM.

An RBM is a Markov random field with two layers of nodes, one for visible variables and the other for hidden variables. For our purpose, we shall assume that the hidden variables are binary and the visible variables are typed. Standard RBMs often deal with only a single visible type, mostly binary or Gaussian. Less popular types are counts, categorical, ordinal and interval. There are several treatments of counts: Poisson, constrained Poisson and replicated multinomial, by Ruslan Salakhutdinov and others.

Regardless of the types, the principle is the same: the model is a Boltzmann machine (as its name suggests) that admits a Boltzmann distribution (a.k.a. exponential family, Gibbs distribution, log-linear model, etc.). This form is flexible: virtually all common types can be expressed easily. RBMs are among the hot kids these days, thanks to the deep learning wave driven by big players such as Google, Facebook, Microsoft and Baidu.

It is then natural to plug all the types together, all sharing the same hidden layer. Suddenly the mixed-type problems disappear! This is because the interactions are now limited to those between each type and the hidden layer, which is binary; the types no longer need to interact with each other directly.
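
Concretely, each visible type only needs its own conditional distribution given the shared binary hidden layer, and the hidden units remain Bernoulli given all visibles. A toy sketch with one Gaussian, one binary and one categorical visible (the parametrisation and names are simplified assumptions, not the exact energy of the paper):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_visibles(h, params, rng):
    """Sample each typed visible unit given the shared binary hidden vector h."""
    x_real = rng.normal(params["a_real"] + params["W_real"] @ h, 1.0)      # Gaussian unit, unit variance
    x_bin = rng.random() < sigmoid(params["a_bin"] + params["W_bin"] @ h)  # Bernoulli unit
    logits = params["a_cat"] + params["W_cat"] @ h                         # categorical unit: softmax over K
    p = np.exp(logits - logits.max())
    x_cat = rng.choice(len(p), p=p / p.sum())
    return x_real, x_bin, x_cat

def sample_hiddens(x_real, x_bin, x_cat, params, rng):
    """Hidden units are Bernoulli; every type contributes additively to their activation."""
    act = (params["b"] + x_real * params["W_real"] + x_bin * params["W_bin"]
           + params["W_cat"][x_cat])
    return rng.random(act.shape) < sigmoid(act)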

Although it appears effortless and we might simply expect it to work, as we have shown with the mixed-variate RBM, there are several issues worth discussing.

First, we intended to work only with primitive types, regardless of how the types arise. Moving up one more level, however, we may care about multiple modalities rather than types. For example, an image with tags has two modalities, and they represent different kinds of data at different levels of semantics. Typically an image is treated as a vector of pixels, so we can use either Gaussian (for continuous pixel intensity) or Beta types (for bounded intensity).

Second, we can efficiently estimate the data density up to a constant, using a quantity known as the "free energy". This offers a natural route to anomaly detection.
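
For reference, with binary hidden units the hidden layer can be summed out analytically, giving \( P(\mathbf{v}) \propto e^{-F(\mathbf{v})} \) with free energy

\[ F(\mathbf{v}) = -\sum_i a_i v_i - \sum_j \log\left(1 + e^{\,b_j + \sum_i W_{ij} v_i}\right), \]

where in the mixed-variate case the first term is replaced by the appropriate type-specific visible terms. A data point with unusually high free energy (low unnormalised density) is flagged as an anomaly.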

Third, since the representation layer is all binary, we can indeed stack multiple RBMs on top of the first layer, making the entire structure a mixed-variate Deep Belief Network.

Fourth, since estimating the posterior is straightforward, it paves the way to learning a distance metric. Such distances are useful in many tasks, including information retrieval, k-nearest neighbour classification and clustering. These capabilities are not possible with other mixed-type modelling methods.
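
As a naive illustration of where the latent representation enters (the papers go further and actually learn the metric), the hidden posterior already gives a common real-valued space for all types (the names below are assumptions):

import numpy as np

def hidden_posterior(v, W, b):
    """p(h_j = 1 | v) = sigmoid(b_j + W_j . v): a real-valued code shared by all visible types."""
    return 1.0 / (1.0 + np.exp(-(b + W.T @ v)))

def latent_distance(v1, v2, W, b):
    """Compare two mixed-type records through their hidden posteriors."""
    return np.linalg.norm(hidden_posterior(v1, W, b) - hidden_posterior(v2, W, b))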

Updated references:
  • Multilevel Anomaly Detection for Mixed Data, K. Do, T. Tran, S. Venkatesh, arXiv preprint arXiv:1610.06249.
  • Learning deep representation of multityped objects and tasks, Truyen Tran, D. Phung, and S. Venkatesh, arXiv preprint arXiv:1603.01359.
  • Outlier Detection on Mixed-Type Data: An Energy-based Approach, K Do, T Tran, D Phung, S Venkatesh, International Conference on Advanced Data Mining and Applications (ADMA 2016).
  • Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh,  In Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), Gold Coast, Australia, April 2013.
  • Learning sparse latent representation and distance metric for image retrieval, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of IEEE International Conference on Multimedia and Expo (ICME), San Jose, California, USA, July 2013.
  • Mixed-Variate Restricted Boltzmann Machines, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of the 3rd Asian Conference on Machine Learning (ACML2011), Taoyuan, Taiwan, Nov 2011.

Tuesday, 17 June 2014

Mixed-type data analysis II: Pairwise models

Analysing mixed-type data is hard. A popular way is to consider a pair of types, e.g., continuous and binary. A simple method is to utilise the standard rule \( P(A,B) = P(A)P(B|A) \), where \(A\) can be continuous and \(B\) binary. In this particular case, \(P(A)\) can be a Gaussian distribution and \(P(B|A)\) a logistic model, where \(A\) enters the location, intercept or scale parameters of \(P(B|A)\).

For example, suppose you wonder about the relation between IQ (\(A\)) and having a top job (\(B\)). The distribution of IQ, \(P(A)\), can easily be estimated from the population, but the conditional distribution of having a top job given your IQ (and nothing else), \(P(B|A)\), requires a bit of thinking. You may want to treat IQ as a feature and estimate its weight. You may find some positive weight there, but it may not be as strong as you might wish. You may then question the linear relationship and start fitting a polynomial, or chopping the IQ into ranges (oops, this creates another issue, really). For example, it could be that beyond a certain IQ threshold, job success does not really depend on IQ anymore (possibly not every job is like that, especially in theoretical physics and maths).

To compute the conditional distribution of IQ given job success, \(P(A|B)\), we need Bayes' rule: \(Q(A|B) = P(A)P(B|A) / P(B)\), where \(P(B)\) is a normalising factor.

Alternatively, you can apply the same rule the other way around: \(P(A,B) = P(B)P(A|B)\). Again \(P(B)\), the proportion of working people with top jobs, can be estimated from national statistics. The IQ distribution among successful people, \(P(A|B)\), must then be modelled directly. You may find that \(P(A|B)\) is not even Gaussian but rather skewed to the right. I don't know for sure, just a guess.

It is interesting to compare the \(Q(A|B)\) and \(P(A|B)\) obtained by the two routes. In a perfect world they should be identical, but in practice they may not be, because of the assumptions built into each model specification and the unreliability of estimation. This is one problem.
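
A quick way to see the mismatch is to fit both factorisations on the same synthetic data and compare the implied conditionals; the sketch below uses made-up numbers purely for illustration:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
A = rng.normal(100, 15, size=5000)                        # "IQ": continuous
B = rng.random(5000) < 1 / (1 + np.exp(-(A - 115) / 5))   # "top job": binary, logistic in A

grid = np.linspace(50, 150, 201)

# Route 1: P(A) Gaussian, P(B|A) logistic, then Bayes' rule gives Q(A|B=1) on the grid
p_A = norm.pdf(grid, A.mean(), A.std())
p_B_given_A = 1 / (1 + np.exp(-(grid - 115) / 5))         # pretend the logistic part is known exactly
Q = p_A * p_B_given_A
Q /= Q.sum() * (grid[1] - grid[0])                        # normalise

# Route 2: model P(A|B=1) directly as a Gaussian fitted to the successful group
P = norm.pdf(grid, A[B].mean(), A[B].std())

print(grid[Q.argmax()], grid[P.argmax()])                 # the two conditionals need not agree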

Another problem with this approach is that it is limited to pairs and cannot easily be generalised to more than two types. Worse, the number of models grows quadratically with the number of variables. And finally, it does not offer an easy way to plug into existing analysis tools (such as visualisation in 2D or 3D).

A better way is to imagine that there exist some hidden variables governing the generation of these types. Given the hidden variables, the types are conditionally independent, so we do not have to model the interactions among the types directly; we only need to model the interaction between each type and the hidden variables. These are the techniques behind our recent attempts, the mixed-variate restricted Boltzmann machine and the Thurstonian Boltzmann machine, which are the subjects of subsequent posts.



Saturday, 7 July 2012

Mixed-type data analysis I: An overview

This series of posts is about analysing data of multiple types. By types, we mean primitive natures such as real-valued, interval, count, rank, binary, ordinal, categorical and multi-categorical. This is more detailed than simply saying data is numerical or nominal. Each of these types can be further subdivided into finer sub-types, depending on the nature of the generating process. For example, real-valued variables can be Gaussian or logistic (mean-value), Gumbel (extreme-value), exponential (rate), etc.

Type handling is important in many statistical application areas such as healthcare, business and engineering. There are good reasons for careful type modelling and analysis. First, interpretation and insight are often important goals, sometimes more important than final (predictive) performance, so we need to understand the generative process behind the data types. For example, modelling a queuing process is critical in many applications, ranging from networking to manufacturing to supermarket scheduling: the number of arrivals within a period of time is often approximated by a Poisson distribution, while the time between two arrivals follows an exponential distribution. In collaborative filtering, ordinal modelling of ratings should be preferred to Gaussian treatments because it offers insight into why and how a particular rating is chosen for a particular item by a particular user.

Second, even if predictive performance is the main goal, it is often the case that accurate modelling of data types would make it easier to achieve the goal.

Type-specific phenomena have been very well studied, and now is a good time to pay more attention to mixed types, because real-world processes are rarely isolated and are likely to involve more than one type. The problem itself is old and there is a rich literature on modelling mixed data, but existing work usually involves only two or three types (e.g., continuous and categorical). Handling arbitrarily many types does not seem to be adequately addressed in the literature. This series of posts aims to partly fill that gap.

As an example of the situation when multiple data types appear simultaneously, let us take a typical multimedia object, where we can always have multiple ways of viewing it:
  • If it is described by a set of tags, then the type can be binary (we care about the presence or absence of a word) or categorical (we care about why a particular word is present and not the others).
  • If it is described by a bag of visual words, then the type can be counts, or repeated categorical.
  • If it is described by a vector of histograms, then the type is real-valued.
In publicly available datasets like NUS-WIDE, all of these types are indeed present, yet I am not aware of any work in multimedia or machine learning that makes explicit use of them.

In healthcare, a patient can be characterised by various clinical and background attributes: number of previous admissions (count), age (continuous, or ordinal if binned into intervals), height (continuous), test results (binary outcomes or continuous measures), etc. A full account calls for an effective treatment of all these types at the same time. However, the development still seems to be at an early stage.

In social sciences and business, surveys are one of the primary tools for measuring people's opinions and interests. The scale and impact are huge: millions of surveys are conducted on millions of people each year. A typical survey uses a mixed bag of question types, and the answers often contain stated facts, preferences and choices. These translate into types such as binary (yes or no), counts (previous admissions), rank (of political candidates), ordinal (strongly dislike, dislike, like), pairwise preferences, etc. Unfortunately, the survey-analysis literature is full of advice about statistics for each data type, and not so much about simultaneous modelling and analysis of all of these types.

Type-specific analysis in machine learning

Type modelling seems like an unsexy topic to many machine learning people. In the early days, types were handled explicitly in decision trees (popular in the 80s and early 90s): nodes are split in ways that depend critically on their types, e.g., real-valued versus nominal. However, decision trees seem to be the problem of the day before yesterday (this is not to say they are no longer important -- on the contrary, ensembles of trees remain among the most competitive methods in predictive arts and are widely used in practical data mining). Type handling attracted far less interest in later developments, e.g., neural networks, perceptrons and linear models, or the more recent kernel methods.

While it is OK in practice to convert multiple data types into some unified form such as real-valued or binary (a process also known as data coding), it may be more desirable to deal with the data types directly. When the types are in the input, coding may or may not reduce the amount of predictive information, because coding can sometimes help to linearise the data, making it easier for linear algorithms to do their job. However, if we want to understand the structure of the input (e.g., as in semi-supervised, self-taught and transfer learning), ignoring the data types can be problematic because the coding process partially destroys the type-specific information. For example, in a bag-of-words representation of a document, it may be more natural to model the counts with Poisson models than to convert them into real values (an approximation), binary indicators (which lose information) or replicated binary indicators (which blow up the model size).
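
For instance, a word count of 3 can be coded in several ways, each trading away something (a toy illustration):

count = 3
as_real = float(count)                  # real-valued approximation of a discrete quantity
as_binary = int(count > 0)              # presence indicator: the magnitude is lost
as_replicated = [1, 1, 1] + [0] * 7     # replicated indicators capped at 10: dimension blows up
# A Poisson model keeps the count as a count: P(count = k) = exp(-lam) * lam**k / k!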

When the types are in the output, there have been two directions: one is to model the type directly; the reductionist alternative is to convert type-specific problems into the more popular binary problems.

The former case is quite interesting, and in doing so ML has invented a few more terminologies along the way:
  • For continuous outputs, the natural solution is some regression technique, such as ridge regression. However, this assumes a normal error structure, which is not always the case.
  • For binary outputs, this is really the canonical problem in machine learning.
  • For ordinal outputs, the problem is often called ordinal regression.
  • For counting outputs, a standard solution is Poisson regression.
  • For categorical outputs, this is often called multiclass problems, and now there are many solutions, under different names, e.g., multiclass SVM, maximum entropy, multinomial logit, multinomial probit.
  • For rank outputs, the problem is called label ranking.
  • For multiple binary outputs, the problem is often known as multilabel learning.
The main contribution from ML is the introduction of the notion of margins and the analysis of generalization and stability of algorithms, as well as specific algorithms such as decision trees, neural networks, boosting, random forests.

In the latter case, where we transform complex problems into simpler ones, we may convert ordinal regression and ranking into a series of binary classifications, from which existing methods such as SVMs can be reused. While this is a useful strategy -- we do not have to invent new ML algorithms -- it may also make the problem harder than necessary, because we need to recombine the binary predictors into the original form.
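
One common reduction of this kind trains a binary classifier per threshold, "is the level greater than k?", and then recombines the probabilities; the recombination step is exactly where the extra difficulty shows up. A hedged sketch (scikit-learn, illustrative names):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ordinal_as_binary(X, y, n_levels):
    """Train one binary classifier per threshold: does y exceed level k?"""
    return [LogisticRegression().fit(X, (y > k).astype(int)) for k in range(n_levels - 1)]

def predict_ordinal(models, X):
    """Recombine P(y > k) across thresholds into per-level scores (may need clipping if non-monotone)."""
    p_gt = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    cum = np.hstack([np.ones((len(X), 1)), p_gt, np.zeros((len(X), 1))])
    return (cum[:, :-1] - cum[:, 1:]).argmax(axis=1)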

With regard to mixed-type analysis, its presence in machine learning is very limited, and I am aware of only a couple of attempts: here and here. The latter is ours, and we will cover it in subsequent posts.

Some research challenges

An important goal of mixed type modelling is to derive a joint distribution of all types. Ultimately, we would want a representation scheme that aids understanding of interaction among types, and at the same time, supports efficient inference (e.g., estimating marginals or expectations) and learning (e.g., structure and parameters). Probabilistic graphical models would be an excellent candidate here.

Sometimes, a full joint distribution is not needed if we care about prediction of some output variables given the rest. In machine learning, mixed-type outputs would be an instance of multitask learning where each output is a task. Thus, estimating a shared structure between output types (given arbitrary input types) would be an interesting direction here.

In some applications, on the other hand, modelling the joint distribution might not be the ultimate goal. For example, we may need just a similarity measure between two mixed-type instances (e.g., for k-nearest neighbour classification, k-means clustering and kernel methods), or a good visualisation of all instances in 2D (which often boils down to similarity estimation). This is challenging because each type leads to a measure which may not be compatible with the others in scale or in interpretation.

And finally, we often want to apply handy data-analysis tools such as PCA/SVD/ICA/CCA/SVM/GLM on top of mixed data. Since these tools do not work on mixed data directly, the problem becomes one of turning the mixed data into real-valued vector representations which preserve as much information of the original data as possible. In other words, we have a problem of representation learning from mixed data.

Can these challenges be overcome by a unified framework? In the subsequent posts, we will attempt to provide some early clues. Stay tuned.

Updated references:

  • Mixed-Variate Restricted Boltzmann Machines, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of the 3rd Asian Conference on Machine Learning (ACML2011), Taoyuan, Taiwan, Nov 2011.
  • Embedded Restricted Boltzmann Machines for Fusion of Mixed Data Types and Applications in Social Measurements Analysis, Truyen Tran, Dinh Phung, Svetha Venkatesh, in Proc. of 15-th International Conference on Information Fusion (FUSION-12), Singapore, July 2012.
  • Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh,  In Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), Gold Coast, Australia, April 2013.
  • Thurstonian Boltzmann machines: Learning from multiple inequalities, Truyen Tran, D. Phung, and S. Venkatesh, in Proc. of the 30th International Conference on Machine Learning (ICML'13), Atlanta, USA, June 2013.
  • Learning sparse latent representation and distance metric for image retrieval, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In Proc. of IEEE International Conference on Multimedia and Expo (ICME), San Jose, California, USA, July 2013.
  • Outlier Detection on Mixed-Type Data: An Energy-based Approach, K. Do, T. Tran, D. Phung, S. Venkatesh, International Conference on Advanced Data Mining and Applications (ADMA 2016).
  • Multilevel Anomaly Detection for Mixed Data, K. Do, T. Tran, S. Venkatesh, arXiv preprint arXiv:1610.06249.