Machine learns as data speak: 2008

Thursday, 4 September 2008

Aspects of information systems

The goal of information systems is to satisfy user's information need. However, since the need is hard to quantify and we often get to know users through theirs interaction with the system. There are access modes

Active seeking. Search. Enter keywords. Issue queries.
Passive reading. Feeds, news. Stay for a while (e.g. 30s).
Active browsing. see what is interesting. Keep clicking.
Transactional. Perform tasks - buy stuffs, send e-mails, comment blogs, author content, add friends.

To satisfy user's need, much has been studied in the last 60 years. However, only very recently, part of the need could be met at the large-scale. Here we outline some aspects of current information systems:

Semantic Web. The W3C initiative is ambitious: to build a representation and connectivity that machines can confidently processed. This is a huge step from the current Web, which is for humans. Much has been said about Semantic Web, but the main problems are how to create ontologies, integrate them, and process them it the way we want the machines do. I guess this remains a far-reaching goal, and just keep people motivated. A long the way, other things get improved as a result.
User interface. This is where humans meet machines. Can't overestimate it more. Is list of result enough? Why many attempts of visualisation never catch on?
Deep analysis of content/context. Perhaps the NLP stuffs go here, as many people expected. But non has proved successful.
Serve targeted ads. The main goal of any commercial system is to monetise. An advertising is what most people aim for, given the great fortune made by Google. So, most machines are just constantly collecting data for just one purpose: serving better ads.
Mobile. This is not a trend anymore. It is a fact. The main difficulty is with the small screen, poor navigation (even with touch-screens).
Localisation. Everyone loves talking about things physically nearby. So the local languages.
Personalisation. Much has been tried but not sure when it works. From standard usability, provide proven standardised interface is much better in mental load.
Change in user access modes. People actively seek for something news when they are triggered by some things. They turn back to the passive mode if they are tired, and just want something to read, like news.
Large-scale architecture. Methods that are not scalable never catch on the Web.
Investment. Oops, after all, we need money to pay for everything.
Attract and keep talents. On the engineering side, it is about research, system engineering, user interface, etc. Everything needs to fit with each other, at the level never seen before. We need talents to move beyond the norms.
Enterprise search. This is meaty because no technologies will fit everything. A lot of customerisation is needed. Often, no hyperlinks, some is poorly organised, a lot of privacy, internal issues, type of documents, insights, leveraging resources (e.g. documentation, code, expertise), variety of platforms, protocols, legacy systems, database, XMLs, free texts, jobs, resumes, Web access log.
Social networks. People trust people in the circle of reach. Some messages are specific to the pair, some intended for public audience.
Query parsing, intention discovery. Query is currently the most effective way of expressing information need. When we search, we have a specific intention. However, people are often not good at forming query that can describe their intention. Does a query mean "give me a set of documents that contain these words", or "give me some further understanding of these topics"? Trained users with clear understanding of boolean formation may mean the first. It is likely that word order has some affect, and the collocation patterns of these keywords in the document may mean level of relevance.
Ranking. Ranking is the core of the search art. Users need a specific answers. Even a comprehensive description of some concepts may appear in only few pages. So it is not about recall but precision.
Learning to rank. This is currently a hot topic in machine learning. This is populated by the work of Cohen, Schapire and Singer 10 years ago. Ranking in traditional IR is heuristics-based, and it does not consistently use feedback to improve the ranking. Given implicit preference can be obtained from search engines, learning to rank has good promise. The hard part is that searching for the best ordering is NP-hard because this is a permutation proplem.
Hyperlink analysis. Google has proved that hyperlinks are the most important property of the Web when it comes to assessing quality of Web pages. There is much more that can be explored, e.g. at ranking at multiple levels (sites, clusters, objects, passages) and challenging the stationary assumption of PageRank.
Semantics emerged through users' interaction with systems, and other users. Interaction contexts count (time, location, keywords, keyword order, browsing/search history).
Clustering. Clustering and topic models likely increase recalls due to its capacity to group similar words together.
Insights. People like Ramesh Jain believe that the next generation of search should provide comprehensive picture of the world with respect to what users's need, and from this insights are drawn. Topic models, automatic annotation and visualisation may play a role here, but it is likely that most people will still need very simple form of keyword search.
Evaluation. No progress will be made if evaluation is not properly carried out. This is an area of research in its own right.
Common-sense. Common-sense assertion plus rough weighting may be important. For example, we may assert: Color(Roses,Red), HasProperty(Table,Leg,4), Hate(Cat,Dog). Top AI guys like Marvin Minsky believe common-sense is what current AI lacks to advance further. The most notable effort is the creation of CYC - a common-sense database, which was started arround 25 years ago, and is now claimed to reach the level that commercial applications can be developed.
Wiki as a common-sense database expressed in English. Wiki has become an important source of knowledge for both human and machines. Numerous papers appear just aim at making sense of Wiki texts. From the commercial side, the company Powerset does just that: it makes use of knowledge from Wiki only. It is now part of Microsoft.
Information extraction. Unlike Semantic Web which assumes formally structured concepts to be put right from the beginning, information extraction starts from unstructured texts to build up concepts. Some believe that this is the perhaps how ontologies, and thus Semantic Web, are materialised. However, the main challenge is to build a Web-scale extraction system that can extract thousands of concepts, making use of common-sense knowledge and collaborative effort. Current supervised learning, although promises high performance, cannot scale to this level yet.
Vertical search. Generic search is not enough for specific group of users. We are interested in high quality search systems in various domains: people, history, inventions, weather, health, musics, films, arts, images, wars, politics, mechanical parts, physics, chemistry, biology, family-line, schools, cafe, food, alcohol, writing, novels, poetry, IT, books, events,
User search experience. Beside making money, improving user search experience is perhaps the most important goal of search engines. Fancy interfaces may look interesting but simple text boxes are often enough.
Universal knowledge versus local knowledge. Roughly, facts described in (English) Wiki can be considered as universal. These facts, when well collected, can be translated without loss to local languages (e.g. Vietnamese) by humans. Other types of local knowledge must be appropriately learned for each language.
Business intelligence. This is a popular term. From the well-structured database community, this is about making sense of the data available to provide competitive-edge for business. However, current massive amount of free unstructured data asks for constantly brand monitoring, opinions, sentiments about products/services.
Question answering. It is sometimes believed that question answering is the next step of information retrieval. In fact, Google has gradually incorporated some degree of QA to their system. Shallow QA can exploit the great redundancy of expressions of facts on the Web to provide questions about simple facts like "Who", "Where", "What", "When". However, deep QA should be employed to deal with "How" and "Why" questions, because it is about reasoning.
Knowledge representation. First-order logic was once believed to represent most knowledge available. However, one has to deal with a whole lot of things other than pure knowledge: scale, uncertainty/noise, and emerging/shifting concepts.

Saturday, 30 August 2008

Data size matters

Most research is evaluated on small to medium-sized collection of data instances which seems like a toy to real-world collections. This is especially true in fields like information retrieval where real data size (about size of the Web) is 1,000 to 100,000 times larger than most research data collections. In recommender systems, the recent release of Netflix Prize data (which is clearly much smaller than the actual Netflix data) has about 100 millions ratings. This is 100 times larger than the best data set known before.

What does it mean to research and industry? Since research is only the place where models and results are freely published so that everyone can study, there results must be considered with great care when applying for real-world problems.

Computing power and programming. Much research today is done on single computers (may be information retrieval is the exception) and rapid development languages like Matlab, Python and Perl. When it comes to real-worlds, this assumption does not hold. It is likely that we will need clusters and low-level programming. This also puts more pressure on small research groups to upgrade their computers to match the state-of-the-art. For example, to freely do things with the Netflix data, we need somewhat 6GB RAM, which is clearly the top one at current time.
Simplifying models. Previously, our group had been playing around with several polynomial time algorithms like the famous Inside-Outside for text parsing. This is theoretically interesting because it is at least tractable. It also works fine with small data set (e.g. 100-10000 sequences whose length is limited to the range 10-50). When facing real-world data, however, we cannot continue this line because it will take years to run a single experiment. My rule of thumb is anything that is not close to linear-time complexity has to be re-considered.
Accepting approximation. The world is not perfect and it does not require perfect solutions anyway. Often exact computation is hard so approximation is the only way to go.
Doing things on-the-fly. In news classification, it is not good to assume that news articles are static because they come in every single second with much shift in content and wordings. So, one clever solution is on-line learning where we just update the parameters as soon as a data instance (an article in this case) comes in. Previously, it is sometimes believed that on-line learning performs a little bit worse than batch-learning because on-line learning is considered as an approximation to batch-learning. However, when a lot of data is available from time to time, then how can we still say that after a certain point of time?
Statistical change. This is a more fundamental problems. It will be very likely that large data collection will change the certain statistics. There are several reasons. First, in research collections, noise is usually less a problem, because the data is collected with care, or it is easier to remove noise, sometimes by hand. This is not true in real-world data where there is much noise, either from the nature of the data, or the collection procedure. Second, the change comes from the size of the data itself. With more data, there is more chance for extremal statistics to occur. If any method is sensitive to the outliers, it is more likely to fail. In addition, complex models generally require more data to estimate. For example, estimating n-gram language models with large n (n >= 4,5) is not reliable using small corpus.
Over-fitting. Often, due to limited resources, research conclusions are drawn from very limited data with not enough variations. So models that perform well on limited training data and one or two domains will be likely to fail on the Web, where anything can happen. In statistical machine translation, Google has observed that many well-studied methods fail when more data is available.
Simple statistics can do well. Google also observed in their translation experiments that it does not matter to strike for complex translation models. Simple unigrams are enough. More surprisingly, only first four characters are enough to represent a word! It seems that with enough data, simple models can be as powerful as complex models. Here, pure numerical statistics become very powerful, and it does not agree with common belief that complex, hand-crafted or semantics-driven models are better than simple counts. In my little study of Vietnamese text classification, even sub-syllable features like characters, vowels and consonants are indeed quite powerful although they are not comprehensive to readers. More general, in text classification, no linguistics-inspired document representations have proved to be widely effective.

As researchers have already warned, conclusions drawn from small data set must not be over-generalised.

Friday, 29 August 2008

Conditional random fields: Big innovation by many small improvements

The subject will be Conditional Random Fields (CRF), which cost me roughly 4 years of PhD. After 7 years since its first appearance in ICML 2001, there has been roughly 1,300 citations, according to Google Scholar. This is a very impressive figure so far. It is impressive because anyone can rightly criticise about CRF's novelty compared to its prior arts.

From machine learning point of view, it is mostly Maximum Entropy, which was first introduced in 1957 by Jaynes, and had became popular since the work of IBM researchers in 1996. The most popular parameterisation is still log-linear, which has been known to physicists since the time of Gibbs. The famous Generalised Iterative Scaling method for optimising log-linear models was introduced in the 1970s.
From graphical models point of view, it is just a conditional Markov random field (MRF), i.e., when conditioned on an observational data, we have a MRF of the output variables.
From modelling point of view, it is just a correction of Maximum Entropy Markov Models (MEMM), introduced in 2000.

Then why is it so popular among researchers who value novelty above everything else? The answer is perhaps, because it works and it came at a right time. Let's see how:

It was the right time in 2001. Statistical machine learning had reached the point of wide acceptance. At the same time, structural modelling had been widely recognised as indispensable in many domains. Graphical models had been come a mainstream in the AI research agenda. Computing power was strong enough to support large-scale real world applications.
By 2003, it was discovered that a Limited memory Newton method known as L-BFGS works extremely well with CRFs. It helps to scale up to problems with several million parameters, the size unthinkable for statisticians and optimization hard-core guys. Of course, by 2002, it was known that a L-BFGS cousin, Conjugate Gradients outperform all previous known methods for log-linear models. L-BFGS and CGs are two major advances by the numerical optimisation community, and they have come to rescue.
MRFs had been around for a long time, since the age of Ising models in physics early the 20th century, and the famous work of Besag in 1974 and the Browns in image modelling in 1984. However, MRFs are quite difficult to model because of its intractability in the joint form. By using the conditional forms, we do not need to model the messy observational data, rather we need only to deal with the output of interest. It makes thing significantly easier and more effective.
The CRF works like magic because it overcome much inherent limits of the well-known HMM, and the related MEMM. It paves the way for domain experts to engineer features as they please - the things are not so easy in HMMs.

In short, CRFs are not a breakthrough that challenges the current belief. It is a right combination of several advances at the right time and makes incremental improvements in each aspect. As a net result, it became a winner and since then has triggered much interest in learning in structured output space.

Some of our related work (updated):

Hierarchical semi-Markov conditional random fields for deep recursive sequential data, Truyen Tran, Dinh Phung, Hung Bui, Svetha Venkatesh, Artificial Intelligence, Minor revision. (Extension of the NIPS'08 paper).
Preference Relation-based Markov Random Fields in Recommender Systems, Shaowu Liu, Gang Li,Truyen Tran, Jiang Yuan, ACML'15, November 20-22, 2015, Hong Kong.
Predicting delays in software projects using networked classification, Morakot Choetikertikul, Hoa Khanh Dam, Truyen Tran, Aditya Ghose, 30th IEEE/ACM International Conference on Automated Software Engineering, November 9–13, 2015 Lincoln, Nebraska, USA.
Tree-based Iterated Local Search for Markov Random Fields with Applications in Image Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, Journal of Heuristics, 2014, DOI:10.1007/s10732-014-9270-1
Ordinal random fields for recommender systems, Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang, ACML'14, Nha Trang, Vietnam, Nov 2014.
MCMC for Hierarchical Semi-Markov Conditional Random Fields, Truyen Tran, Dinh Q. Phung, Svetha Venkatesh and Hung H. Bui. In NIPS'09 Workshop on Deep Learning for Speech Recognition and Related Applications. December, 2009, Whistler, BC, Canada.
Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data, Truyen Tran, Dinh Q. Phung, Hung H. Bui, and Svetha Venkatesh. In Proc. of 21st Annual Conference on Neural Information Processing Systems, Dec 2008, Vancouver, Canada. [See technical report and thesis for more details and extensions.]
Constrained Sequence Classification for Lexical Disambiguation, Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In Proc. of 10th Pacific Rim International Conference on Artificial Intelligence (PRICAI-08), 15-19 Dec 2008, Hanoi, Vietnam.
Preference Networks: probabilistic models for recommendation systems, Truyen Tran, Dinh Q. Phung and Svetha Venkatesh, In Proc. of 6th Australasian Data Mining Conference: AusDM 2007, 3-4 Dec, Gold Coast, Australia.
AdaBoost.MRF: Boosted Markov random forests and application to multilevel activity recognition, Truyen Tran, Dinh Quoc Phung, Hung Hai Bui, and Svetha Venkatesh. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, volume Volume 2, pages 1686-1693, New York, USA, June 2006.
Boosted Markov networks for activity recognition, Truyen Tran, Hung Hai Bui and Svetha Venkatesh, In Proc. of International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP2005), 5-8 Dec, Melbourne, Australia.
Human Activity Learning and Segmentation using Partially Hidden Discriminative Models, Truyen Tran, Hung Hai Bui and Svetha Venkatesh, HAREM 2005: Proceedings of the International Workshop on Human Activity Recognition.

Wednesday, 27 August 2008

Under/over-generating in learning structured output spaces

Consider learning the structured output spaces. By structured spaces, we roughly means that the space can only be jointly described by multiple variables. In other words, variables are not completely independent of each other. They can be conditionally independent given some other variables. From the graphical models perspective, those variables can be modelled as a graph whose edges depict direct dependency between variables.

While this is interesting because most real-world data can be described in this way, it is a hard problem as generally it is intractable to do any meaningful exact computation. Everything must be approximated.

In learning, we basically estimate some compact description of the structured space, whose size is exponentially many if we want to count every point in it. Since exact generation of the space is impossible, there are two ways to proceed: (1) under-generating where we limit to very small subspace so that we can deal with efficiently, and (2) over-generating where we try to describe a larger but much more relaxed space so many constraints are violated.

Examples of under-generating methods:

Pseudo-likelihood of Besag: we exploit the conditional independence property to explore a subset of variables and their surrounding (also known as Markov blanket). The basic theory assures that if we have enough data, then there will be no difference between this localised estimation and the true global one.
Re-ranking of Collin: the idea is that we should use some tractable and approximate methods to generate the most meaningful and tractable subspace, and then deal with the new subspace.

Examples of over-generating methods:

Piecewise training of Sutton & McCallumn: the idea is that we can duplicate variables and break the structure into independent pieces. Since more variables are added, the new space is larger than the original one. In return we will have much more independence, which makes computation easy. This method may work fine if the independent pieces approximate the data well.
Tree-superimposition of my own: we can always relax a graph by removing some edges until a spanning tree is left. Since spanning trees are tractable, they can be learned given the knowledge of all other trees accumulated so far.

Thursday, 1 May 2008

A generic code for sequential conditional random fields

I have made available the C++ code for sequential labelling tasks such as part-of-speech tagging, noun-phrase chunking or human activity annotation. The code is domain independent and supports the following features:

First & second-order sequential CRFs.
Training algorithms: L-BFGS (from Taku Kudo), stochastic gradient ascent and Collin's voted perceptron.
Handle missing training labels.
Decoding: Viterbi (forward/backward) and Pearl's max-product.
Report label-wise and segment-wise statistics.

The project page is here. The page contains a tutorial on the theory of CRFs and a discussion on the implementation.