Machine learns as data speak: Conditional random fields: Big innovation by many small improvements

The subject will be Conditional Random Fields (CRF), which cost me roughly 4 years of PhD. After 7 years since its first appearance in ICML 2001, there has been roughly 1,300 citations, according to Google Scholar. This is a very impressive figure so far. It is impressive because anyone can rightly criticise about CRF's novelty compared to its prior arts.

From machine learning point of view, it is mostly Maximum Entropy, which was first introduced in 1957 by Jaynes, and had became popular since the work of IBM researchers in 1996. The most popular parameterisation is still log-linear, which has been known to physicists since the time of Gibbs. The famous Generalised Iterative Scaling method for optimising log-linear models was introduced in the 1970s.
From graphical models point of view, it is just a conditional Markov random field (MRF), i.e., when conditioned on an observational data, we have a MRF of the output variables.
From modelling point of view, it is just a correction of Maximum Entropy Markov Models (MEMM), introduced in 2000.

Then why is it so popular among researchers who value novelty above everything else? The answer is perhaps, because it works and it came at a right time. Let's see how:

It was the right time in 2001. Statistical machine learning had reached the point of wide acceptance. At the same time, structural modelling had been widely recognised as indispensable in many domains. Graphical models had been come a mainstream in the AI research agenda. Computing power was strong enough to support large-scale real world applications.
By 2003, it was discovered that a Limited memory Newton method known as L-BFGS works extremely well with CRFs. It helps to scale up to problems with several million parameters, the size unthinkable for statisticians and optimization hard-core guys. Of course, by 2002, it was known that a L-BFGS cousin, Conjugate Gradients outperform all previous known methods for log-linear models. L-BFGS and CGs are two major advances by the numerical optimisation community, and they have come to rescue.
MRFs had been around for a long time, since the age of Ising models in physics early the 20th century, and the famous work of Besag in 1974 and the Browns in image modelling in 1984. However, MRFs are quite difficult to model because of its intractability in the joint form. By using the conditional forms, we do not need to model the messy observational data, rather we need only to deal with the output of interest. It makes thing significantly easier and more effective.
The CRF works like magic because it overcome much inherent limits of the well-known HMM, and the related MEMM. It paves the way for domain experts to engineer features as they please - the things are not so easy in HMMs.

In short, CRFs are not a breakthrough that challenges the current belief. It is a right combination of several advances at the right time and makes incremental improvements in each aspect. As a net result, it became a winner and since then has triggered much interest in learning in structured output space.

Some of our related work (updated):

Hierarchical semi-Markov conditional random fields for deep recursive sequential data, Truyen Tran, Dinh Phung, Hung Bui, Svetha Venkatesh, Artificial Intelligence, Minor revision. (Extension of the NIPS'08 paper).
Preference Relation-based Markov Random Fields in Recommender Systems, Shaowu Liu, Gang Li,Truyen Tran, Jiang Yuan, ACML'15, November 20-22, 2015, Hong Kong.
Predicting delays in software projects using networked classification, Morakot Choetikertikul, Hoa Khanh Dam, Truyen Tran, Aditya Ghose, 30th IEEE/ACM International Conference on Automated Software Engineering, November 9–13, 2015 Lincoln, Nebraska, USA.
Tree-based Iterated Local Search for Markov Random Fields with Applications in Image Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, Journal of Heuristics, 2014, DOI:10.1007/s10732-014-9270-1
Ordinal random fields for recommender systems, Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang, ACML'14, Nha Trang, Vietnam, Nov 2014.
MCMC for Hierarchical Semi-Markov Conditional Random Fields, Truyen Tran, Dinh Q. Phung, Svetha Venkatesh and Hung H. Bui. In NIPS'09 Workshop on Deep Learning for Speech Recognition and Related Applications. December, 2009, Whistler, BC, Canada.
Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data, Truyen Tran, Dinh Q. Phung, Hung H. Bui, and Svetha Venkatesh. In Proc. of 21st Annual Conference on Neural Information Processing Systems, Dec 2008, Vancouver, Canada. [See technical report and thesis for more details and extensions.]
Constrained Sequence Classification for Lexical Disambiguation, Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In Proc. of 10th Pacific Rim International Conference on Artificial Intelligence (PRICAI-08), 15-19 Dec 2008, Hanoi, Vietnam.
Preference Networks: probabilistic models for recommendation systems, Truyen Tran, Dinh Q. Phung and Svetha Venkatesh, In Proc. of 6th Australasian Data Mining Conference: AusDM 2007, 3-4 Dec, Gold Coast, Australia.
AdaBoost.MRF: Boosted Markov random forests and application to multilevel activity recognition, Truyen Tran, Dinh Quoc Phung, Hung Hai Bui, and Svetha Venkatesh. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, volume Volume 2, pages 1686-1693, New York, USA, June 2006.
Boosted Markov networks for activity recognition, Truyen Tran, Hung Hai Bui and Svetha Venkatesh, In Proc. of International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP2005), 5-8 Dec, Melbourne, Australia.
Human Activity Learning and Segmentation using Partially Hidden Discriminative Models, Truyen Tran, Hung Hai Bui and Svetha Venkatesh, HAREM 2005: Proceedings of the International Workshop on Human Activity Recognition.

Machine learns as data speak

Friday, 29 August 2008

Conditional random fields: Big innovation by many small improvements

No comments:

Post a Comment