Saturday, 30 August 2008

Data size matters

Most research is evaluated on small to medium-sized collections of data instances, which look like toys next to real-world collections. This is especially true in fields like information retrieval, where the real data (roughly the size of the Web) is 1,000 to 100,000 times larger than most research data collections. In recommender systems, the recently released Netflix Prize data (which is clearly much smaller than the actual Netflix data) has about 100 million ratings. This is 100 times larger than the best previously known data set.

What does this mean for research and industry? Since research is the only place where models and results are freely published for everyone to study, these results must be treated with great care when applied to real-world problems.
  • Computing power and programming. Much research today is done on single computers (information retrieval may be the exception) with rapid-development languages like Matlab, Python and Perl. In the real world this setup does not hold up: we will likely need clusters and low-level programming. This also puts more pressure on small research groups to upgrade their computers to match the state of the art. For example, to work freely with the Netflix data we need around 6GB of RAM, which is clearly at the top end at the current time.
  • Simplifying models. Our group had previously been playing around with several polynomial-time algorithms like the famous Inside-Outside algorithm for text parsing. This is theoretically interesting because it is at least tractable. It also works fine on small data sets (e.g. 100-10,000 sequences whose lengths are limited to the range 10-50). When facing real-world data, however, we cannot continue along this line because a single experiment would take years to run. My rule of thumb is that anything not close to linear-time complexity has to be reconsidered.
  • Accepting approximation. The world is not perfect, and it does not require perfect solutions anyway. Exact computation is often hard, so approximation is the only way to go.
  • Doing things on the fly. In news classification, it is not good to assume that news articles are static, because they arrive every second with large shifts in content and wording. One clever solution is on-line learning, where we update the parameters as soon as a data instance (an article in this case) comes in. On-line learning has sometimes been believed to perform slightly worse than batch learning, because it is considered an approximation to batch learning. However, when a lot of new data keeps arriving over time, can we still say that after a certain point?
  • Statistical change. This is a more fundamental problem. A large data collection is very likely to change certain statistics, for several reasons. First, in research collections noise is usually less of a problem, because the data is collected with care or because noise is easier to remove, sometimes by hand. This is not true of real-world data, where there is much noise, coming either from the nature of the data or from the collection procedure. Second, change comes from the size of the data itself. With more data, there is more chance for extreme values to occur. Any method that is sensitive to outliers is more likely to fail. In addition, complex models generally require more data to estimate. For example, estimating n-gram language models with large n (n >= 4 or 5) is not reliable on a small corpus.
  • Over-fitting. Often, due to limited resources, research conclusions are drawn from very limited data without enough variation. Models that perform well on limited training data and one or two domains are therefore likely to fail on the Web, where anything can happen. In statistical machine translation, Google has observed that many well-studied methods fail when more data is available.
  • Simple statistics can do well. Google also observed in their translation experiments that it does not pay to strive for complex translation models. Simple unigrams are enough. More surprisingly, only the first four characters are enough to represent a word! It seems that with enough data, simple models can be as powerful as complex ones. Here, pure numerical statistics become very powerful, which contradicts the common belief that complex, hand-crafted or semantics-driven models are better than simple counts. In my little study of Vietnamese text classification, even sub-syllable features like characters, vowels and consonants are indeed quite powerful, although they are not comprehensible to readers. More generally, in text classification, no linguistics-inspired document representations have proved to be widely effective.
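To make the on-line learning point in the list above concrete, here is a minimal sketch of updating a classifier one article at a time. It is only an illustration under my own assumptions: the model (logistic regression trained by stochastic gradient steps), the class name OnlineLogisticRegression, the helper bag_of_words, the learning rate and the toy stream are all hypothetical choices, not something taken from a particular system.

```python
import math


class OnlineLogisticRegression:
    """Binary logistic regression updated one instance at a time (SGD)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # sparse weights: feature name -> weight

    def predict_proba(self, features):
        """Return the estimated probability that the label is 1."""
        score = sum(self.weights.get(f, 0.0) * v for f, v in features.items())
        score = max(-30.0, min(30.0, score))  # clamp to keep exp() in range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        """Take one gradient step on a single (features, label) pair."""
        error = label - self.predict_proba(features)
        for f, v in features.items():
            step = self.learning_rate * error * v
            self.weights[f] = self.weights.get(f, 0.0) + step


def bag_of_words(text):
    """Whitespace tokenisation into lower-cased word counts (an assumption)."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0.0) + 1.0
    return counts


# Hypothetical toy stream: label 1 = business, label 0 = sport.
stream = [
    ("stocks fall as markets slide", 1),
    ("team wins cup final", 0),
    ("bank raises interest rates", 1),
    ("striker scores late goal", 0),
]

model = OnlineLogisticRegression()
for _ in range(20):  # replaying the toy stream stands in for a long feed
    for text, label in stream:
        # Each article is seen once and then discarded; nothing is stored.
        model.update(bag_of_words(text), label)
```

The point of the design is that memory and time per article stay constant however long the stream runs, which is what makes this style attractive when data never stops arriving.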
As researchers have already warned, conclusions drawn from small data sets must not be over-generalised.
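As a small illustration of the "first four characters" observation, the sketch below counts words by their truncated prefixes. The function name prefix_features, the whitespace tokenisation and the lower-casing are my own assumptions for the example; I am not claiming this is how Google's experiments were implemented.

```python
from collections import Counter


def prefix_features(text, prefix_len=4):
    """Count tokens after truncating each one to its first prefix_len characters."""
    return Counter(token[:prefix_len] for token in text.lower().split())


# Inflected forms of the same word collapse onto one crude feature.
features = prefix_features("Translation translated translating")
```

Truncation of this kind acts as a very crude stemmer: it needs no linguistic knowledge, and it shrinks the vocabulary so that each feature is seen more often in the data.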
