tag:blogger.com,1999:blog-72669073902672881132017-10-23T03:47:41.922-07:00Machine learns as data speakAI, machine learning, deep learning, data science and all those topics!Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.comBlogger37125tag:blogger.com,1999:blog-7266907390267288113.post-88380856204851147242017-03-08T16:03:00.001-08:002017-03-08T18:51:53.250-08:00The Matrix Encoded: Matrices as first-class citizen<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://i.ytimg.com/vi/cYKj9vCSCvA/maxresdefault.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://i.ytimg.com/vi/cYKj9vCSCvA/maxresdefault.jpg" width="640" /></a></div><br />One thing as popular as hydro in the universe, is vector. Most mathematical and data analytical analysis asks for this fundamental structure of the world. PCA, ICA, SVM, GMM, t-SNE, neural nets to name a few, all implicitly assume vector representation of data. The power of vector should not be underestimated. The so-called <i><a href="https://www.quora.com/Deep-Learning-What-is-meant-by-a-distributed-representation">distributed representation</a></i>, which is rocking the machine learning and cognitive science worlds, is nothing but <b>vector representation of thought</b> (in <a href="https://deeplearning4j.org/thoughtvectors">Geoff Hinton's words</a>, referring to <a href="https://arxiv.org/abs/1506.06726">Skip-Thought vectors</a>).<br /><br />The current love for distributed representation of things (yes, THINGS, as in Internet-of-Things) has gone really far. There is a huge line of work on [X]2vec, where you can substitute [X] by [word], [<a href="https://www.technologyreview.com/s/519581/how-google-converted-language-translation-into-a-problem-of-vector-space-mathematics/">sentence</a>],[paragraph], [document], [node], [tweet], [edge] and [subgraph]. I won't be surprised to see <b>thing2vec</b> very soon.<br /><br />But can you really compress structured things like sentences into vectors? I bet you could, given that the vector is long enough. After all, although the space of all possible sentences in a language is theoretically infinite, the majority of language usage is tightly packed, and in practice the sentence space can be mapped into a linear space of thousands of dimensions.<br /><br />However, compressing a data begs a question of decompressing it, e.g., to generate a target sentence in another language, as in machine translation. Surprisingly, the simplistic <b><a href="https://www.tensorflow.org/tutorials/seq2seq">seq2seq</a></b> trick works well in translation. But since the linguistic structures have been lost to vectorization, language generation from vector will be more difficult. A better way is to treat each sentence as a matrix, where each column is a word embedding. This gives rise to the <a href="https://arxiv.org/abs/1409.0473">attention scheme in machine translation</a>, which turns out to a huge success, as in the current <a href="https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation">Google's Neural Machine Translation system</a>.<br /><br />Indeed, it has been well-recognized that vectors alone are not enough to memorize long-distant events. The idea is to augment vector-based RNN with an external memory, giving rise to the recent <a href="http://distill.pub/2016/augmented-rnns/">Memory-augmented RNNs</a>. The external memory is nothing but a matrix.<br /><br /><b>Enter the world of matrices</b><br /><br />Matrices in vector space are used for linear transformation, that is, to map a vector from one space, to another vector in a different space. As a mathematical object, matrices have their own life, just like vectors, e.g., <a href="https://en.wikipedia.org/wiki/Matrix_calculus">matrix calculus</a>.<br /><br />In NLP, it has been suggested that <a href="http://dl.acm.org/citation.cfm?id=1870773">noun is a vector and adjective is really a matrix</a>. The idea is cute, because adjective "acts" on noun, which will transform the meaning of the noun.<br /><br />Matrices also form a basis for parameterization of neural layers. Hence a space of multilayered neural nets is a joint space of matrices.<br /><br />Our recent paper titled "<a href="https://arxiv.org/abs/1703.01454">Matrix-centric neural networks</a>" (co-authored with my PhD student, Kien Do and my boss, Professor Svetha Venkatesh) pushes the line of matrix thinking to the extreme. That is, <b>matrices are fist-class citizen</b>. They are no longer a collection of vectors. The input, hidden layers, and the output are all matrices. The RNNs is now a model of a sequence of input matrices and a sequence of output matrices. The internal memory (as in LSTM) is also a matrix, making it resemble the Memory-augmented RNNs.<br /><br />To rephrase Geoff Hinton, we want a <b>matrix representation of thought</b>. Somehow, our neocortex looks like a matrix -- it is really a huge thin sheet of grey matter.<br /><br />May be one day we will live in <a href="https://en.wikipedia.org/wiki/The_Matrix">the space created by matrices</a>.<br /><br /><b>References</b><br /><ul><li><a href="https://arxiv.org/abs/1703.01454">Matrix-Centric Neural Networks</a>, Kien Do, Truyen Tran, Svetha Venkatesh. <i>arXiv preprint</i> arXiv: 1703.01454.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-16924420204261251652017-02-25T00:46:00.002-08:002017-02-26T14:28:35.424-08:00Column bundle: a single model for multiple multipe<div class="separator" style="clear: both; text-align: center;"><a href="http://www.4thmedia.org/wp-content/uploads/2011/11/Chinese-performers-dazzle-despite-disabilities-in-Myanmar.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.4thmedia.org/wp-content/uploads/2011/11/Chinese-performers-dazzle-despite-disabilities-in-Myanmar.jpg" height="422" width="640" /></a></div><br />Supervised machine learning has a few recurring concepts: data instance, feature set and label. Often, a data instance has one feature set and one label. But there are situations when you have multi-[X], where X = instance, view (feature subset), or label. For example, in multiple instance learning, you have more then one instance, but only one label.<br /><br />Things are getting interesting when you have multiple instances, multiple views and multiple labels <i>at the same time</i>. For example, a video clip can be considered as a set of video segments (instances), each of which has views (audio, visual frames and may be textual subtitle), and the clip has many tags (labels).<br /><br />Enter <a href="https://arxiv.org/abs/1702.07021">Column Bundle</a> (CLB), the latest invention in <a href="https://truyentran.github.io/repLearn.html">my group</a>.<br /><br />CLB makes use of the concept of <a href="https://en.wikipedia.org/wiki/Cortical_column">columns in neocortex</a>. In brain, neurons are arranged in thin mini-columns, each of which is thought to cover a small sensory area called receptive field. Mini-columns are bundled into super-columns, which are inter-connected to form the entire neocortex. In our previous work, this cute concept has been exploited to build a <a href="https://arxiv.org/abs/1609.04508">network of columns</a> for collective classification. For CLB, columns are arranged in a special way:<br /><br /><ul><li>There is one central column that serves as the main processing unit (CPU).</li><li>There are input mini-columns to read inputs for multiple parts (Input)</li><li>There are output mini-columns to generate labels (Output)</li><li>Mini-columns are only connected to the central column.</li></ul><div>Columns are recurrent neural nets with skip-connections (e.g., Highway Net, Residual Net or LSTM). Input parts can be instances, or views. The difference is only at the feature mapping: different views are first mapped into the same space.</div><div><br /></div><div>In a sense, it looks like a neural computer without a RAM.</div><div><br /></div><div><b>References</b></div><div><div><ul><li><a href="https://arxiv.org/abs/1702.07021">On Size Fit Many: Column Bundle for Multi-X Learning</a>, Trang Pham, Truyen Tran, Svetha Venkatesh. <i>arXiv preprint</i>, arXiv:1702.07021</li><li><a href="https://arxiv.org/abs/1609.04508">Column Networks for Collective Classification</a>, T Pham, T Tran, D Phung, S Venkatesh, <i>AAAI'17</i></li></ul></div></div>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-85479228139448420492017-02-19T22:44:00.001-08:002017-08-15T23:55:48.753-07:00Living in the future: AI for healthcare<div class="separator" style="clear: both; text-align: center;"><a href="http://img12.deviantart.net/19f4/i/2012/133/f/e/fortune_teller_5_by_obliviate_stock-d4zkgb1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://img12.deviantart.net/19f4/i/2012/133/f/e/fortune_teller_5_by_obliviate_stock-d4zkgb1.jpg" height="426" width="640" /></a></div><br /><br />In a not-so-distant future, it will be a routine to chat to a machine and receive medical advice from it. In fact, many of us have done this - seeking advice from healthcare sites, asking questions online and being recommended for known answers by algorithms. The <a href="http://venturebeat.com/2017/01/10/how-ai-took-center-stage-at-ces-2017/">current wave of AI</a> will only accelerate this trend.<br /><br />Medicine is by large a discipline of information, where the knowledge power is very asymmetric between doctors and patients. Doctors do the job well because humans are all alike, so that cases can be documented in medical textbooks and findings can be shared in journal articles and validated by others. In other words, medical knowledge is statistical, leading to the so-called e<i>vidence-based medicine</i> (EBM). And this is exactly the reason why the current breed of machine learning - deep learning - will do well in majority of cases.<br /><br /><b>Predictive medicine</b><br /><br />In Yann LeCun's words, the future of AI rests on <i><a href="https://medium.com/intuitionmachine/predictive-learning-is-the-key-to-deep-learning-acceleration-93e063195fd0#.ofuryadq3">predictive learning</a></i>, which is basically an alternative way to say <i>unsupervised learning</i>. Technically, this is the capability to fill the missing slots. For those who are familiar with <a href="https://en.wikipedia.org/wiki/Graphical_model">probabilistic graphical models</a>, it is akin to computing <a href="https://en.wikipedia.org/wiki/Pseudolikelihood">pseudo-likelihood</a>, or estimating values of some variables given the rest.<br /><br />A significant part of medicine is inherently predictive. One is <i>diagnosis</i> - finding out what is happening now, and the other <i>prognosis</i> - figuring out what will be happening if an action (or absence of action) is done. While it is fair to say diagnosis is quite advanced, prognosis has a long way to go.<br /><br />To my surprise as a machine learning practitioner, doctors are unreasonably poor at prediction into the future, especially when it comes to mental health and genomics. Doctors are, however, excellent in explaining the results <i>after-the-fact</i>. In machine learning's terms, their models can <i>practically </i><i>fit </i><i>anything</i> but do not generalize well. This must come from the culture of <i>know-it-all</i>, where medical knowledge is limited to only a handful of people, and doctors are obliged to explain what has happened to the poor patients.<br /><br /><b>Physical health</b><br /><br />Human body is a physical (and to some extent, a statistical) system. Hence it follows physical laws. Physiological processes, in theory, can be fully understood and predictable - at least in a close environment. What are hard to predict, are the (results of) interactions with the open environment. For example, virus infection and car accidents are those hardly predictable. Hence, physical health is predictable up to an accuracy limit, beyond which computers have no hope in predicting. So don't expect the performance to be close to that we have seen in object recognition.<br /><div><br /><b>Mental health</b><br /><b><br /></b>Mental health is hard. No one can really tell what happens inside your brain, even if you have it opened. With hundreds of billions neurons and tens of trillions connections between them that give rise to mental processes, the complexity of the brain is beyond human reach at present. But mental health never goes alone. It goes hand-in-hand with physical health. A poor physical condition is likely to worsen a mental condition, and vice versa.<br /><br />A good sign is that mental health is going computational. There is an emerging field called <a href="http://www.nature.com/neuro/journal/v19/n3/full/nn.4238.html">Computational Psychiatry</a>. They are surprisingly open to new technological ideas.<br /><br /><b>The future</b><br /><b><br /></b>AI is also eating the healthcare stage with hundreds of startups popping up each month around the world. So what to expect in the near future within 5 years?<br /><div><ul><li><i>Medical imaging diagnosis</i>. This is perhaps <a href="http://techemergence.com/deep-learning-medical-applications/">the most ready space</a> due to the availability of affordable imaging options (CT-Scan, ultra-sound, fMRI, etc) and recent advances in computer vision, thanks to convolutional nets. One interesting form is <i>microscopy imaging</i> diagnosis since getting images from microscopes can be quite cheap. Another one is <i>facial diagnosis -- </i>It turns out, many diseases manifest through facial expression.</li><li><i>Medical text to be better understood</i>. There are several types of text: doctor narrative in medical records, user-generated medical text online, social health places, and medical research articles. This field will take more time to take off, but given the high concentration of talents in NLP at present, we have a reason to hope.</li><li><i>Cheap, fast sequencing techniques.</i> <a href="https://www.genome.gov/sequencingcosts/">Sequencing cost </a>has come down to a historic milestone of $1,000 recently, and we still have reasons to believe that it will go down to $100 in a not far future. For example, <a href="https://en.wikipedia.org/wiki/Nanopore_sequencing">nanopore sequencing</a> is emerging, and the sequencing using signal processing will be improved significantly<i>. </i></li><li><i>Faster and better understanding of genomics. </i>Once the sequencing reaches a critical mass, the understanding of it will be accelerated by AI. Check out, for example, the work of this Toronto professor, <a href="http://www.psi.toronto.edu/~frey/">Brendan Frey</a>.</li><li><i>Clinical data sharing</i> will remain a bottleneck for the years to come. Unless we have access to a massive clinical database, things will move very slowly in clinical settings. But machine learning will have to work in <i>data efficiency regimes</i>, too.</li></ul><div><br />Beyond 5 years, it is far more difficult to predict. Some are still in the realm of sci-fi.</div></div><div><ul><li><i>Automation of drug discovery</i>. Drug chemical and biological properties will be estimated accurately by machine. The search for a drug given a desirable function will be accelerated by hundred times.</li><li>A full <i>dialog system</i> for diagnosis and treatment recommendation. You don't need to see doctor for a $100 consultation for just 10 mins. You want a thorough consultation <i>for free</i>.</li><li><i>M-health</i>, with distant robotic surgery.</li><li><i>Brain-machine interfacing</i>, where humans will rely on machine for high bandwidth communication. <a href="https://techcrunch.com/2017/02/13/elon-musk-reiterates-the-need-for-brain-computer-interfaces-in-the-age-of-ai/">This idea</a> is from my favorite technologist <a href="https://en.wikipedia.org/wiki/Elon_Musk">Elon Musk</a>.</li><li><i>Nano chips</i> will enter the body in millions and kill the nasty bugs, fix the damages and get out without being kicked out by the immune system. This idea is from the 2006 book <a href="https://en.wikipedia.org/wiki/The_Singularity_Is_Near">The Singularity is Near</a> by my favorite futurist <a href="https://en.wikipedia.org/wiki/Ray_Kurzweil">Ray Kurzweil.</a></li><li><i>Robot doctors will be licensed</i>, just like self-driving cars now.</li><li><i>Patients will be in control</i>. No more know-it-all doctors. Patients will have a full knowledge of their own health. This implies that things must be explainable, and patients must be educated about their own bio & mental.</li></ul><div><br />However, like everything else, it is easy to imagine than done. Don't forget that <a href="https://www.journals.elsevier.com/artificial-intelligence-in-medicine/">AI in Medicine</a> (AIIM) is a very old journal, and nothing really magic has happened yet.</div></div><br /><b>What we do</b><br /><b><br /></b>At PRaDA (Deakin University, Australia), we have <a href="https://truyentran.github.io/healthcare.html">our own share</a> in this space. Some most recent contributions are:<br /><ul><li><b><a href="https://arxiv.org/abs/1708.04357">Predicting drug response from molecular structure</a></b> (2017), where we use molecular structure to compute a drug representation, which is then used for predicting its bioactivity given a disease.</li><li><b><a href="https://arxiv.org/abs/1707.05010">Attend to temporal ICU risk</a></b> (2017), where we figure out a way to deal with ICU time-series, which are irregular and mostly missing. Again, the work will be in the public domain soon.</li><li><b><a href="https://arxiv.org/abs/1703.01454">Matrix-LSTM</a></b> (2017) for EEG, where we capture the tensor-like nature of EEG signals over time. </li><li><a href="https://arxiv.org/abs/1602.00357" style="font-weight: bold;">DeepCare</a> (2016), where we model the course of health trajectory, which is occasionally intervened at irregular time.</li><li><b><a href="https://arxiv.org/abs/1607.07519">Deepr</a> </b>(2016), where we aim to discover explainable predictive motifs though CNN.</li><li><b><a href="https://arxiv.org/abs/1610.06249">Anomaly detection</a> </b> (2016), where we discover outliers in healthcare data, which is inherently mixed-type.</li><li><b><a href="https://arxiv.org/abs/1609.08752">Stable risk discovery through Autoencoder</a></b> (2016), where we discover structure among risk factors.</li><li><a href="https://truyentran.github.io/papers/mlhc16_preterm_main.pdf"><b>Generating stable prediction rules</b></a> (2016), where we demonstrate that simple, and statistically stable rules can be uncovered from lots of administrative data for preterm-birth prediction at 25 weeks of gestation.</li><li><b><a href="https://www.ncbi.nlm.nih.gov/pubmed/25661261">eNRBM</a> </b> (2015): understanding the group formation of medical concepts through competitive learning and prior medical knowledge.</li></ul><br /><div><ul></ul></div></div>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-90393079016246755922017-01-12T19:16:00.002-08:002017-01-17T22:26:54.650-08:00On expressiveness, learnability and generalizability of deep learning<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://www.aturingmachine.com/turingFull560.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://www.aturingmachine.com/turingFull560.jpg" height="288" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Turing machine (aturingmachine.com)</td></tr></tbody></table>It is a coincidence that Big Data and Deep Learning popped up at the same time, <a href="http://letdataspeak.blogspot.com.au/2016/12/machine-learning-four-years-after.html">roughly around 2012</a>. And it is told that data to deep learning is fuel to rockets (this line is often attributed to Andrew Ng, co-founder of Coursera and Chief Scientist at Baidu).<br /><br />It is true that current deep learning flourishes as it leverages big, complex data better than existing techniques. Equipped with advances in hardware (GPU, HPC), deep learning applications are more powerful and useful than ever. However, without theoretical advances, big data might have remained a big pile of junk artifacts.<br /><br />Let us examine three key principles to any learning system: expressiveness, learnability and generalizability, and see how deep learning fits in.<br /><br /><b>Expressiveness</b><br /><b><br /></b>This requires learning system that can:<br /><br /><ul><li><i>Represent the complexity of the world</i>. It was proved in early 1990s that <a href="http://www.sciencedirect.com/science/article/pii/089360809190009T">feedforward nets are universal function approximator</a>. It means that any function imaginable can be represented by a suitable neural network. Note that convolutional nets are also feedforward net which represents a function that maps an image to any target values.</li><li><i>Compute anything computable</i>. Roughly the same time, it was proved that <a href="http://ieeexplore.ieee.org/document/558801/">recurrent nets are Turing-complete</a>. It says that any program written down in a standard computer can be represented by a suitable recurrent neural network (RNN). It is even suggested that Turing machines (and even human brains) are indeed RNN.</li></ul><div>These two theoretical guarantees are powerful enough to enable any computable applications, from object recognition to video understanding to automated translation to conversational agents to automated programmers. For example, one biggest challenge set out by OpenAI is to <a href="https://openai.com/blog/special-projects/">write a program that wins all programming challenges</a>.</div><div><br /></div><br /><b>Learnability</b><br /><br />But merely proving that there exists a neural net to do a job does not mean that we can find the net within a budget of time, unless there are efficient ways to do so. In a supervised learning setting, learnability means at least three things:<br /><br /><ul><li>Have a correct <i>computational graph</i> that enables effective and efficient passing of information and gradient between inputs and outputs. Finding a near-optimal graph is the job of <i>architecture engineering</i>, which is rather an art than a science. This is because the space of architectures are exponentially large, if not infinite. A right graph helps at least two things: (i) essential information is captured, and (ii) information passing is much easier. For example, convolutional nets allow translation invariance, which is often seen in images, speech and signals. With parameter sharing and the pyramid structure, training signals are distributed evenly between layers, and even weak signals at each image patch can multiply, enabling easier learning. And current skip-connections allow much easier passing of information across hundreds of layers.</li><li>Have <i>flexible optimizers</i> to navigate the rugged landscape of objective functions. Complex computational graphs are generally non-convex, meaning it is usually impossible to find the global optima in limited time. Fortunately, adaptive stochastic gradient descents are fairly efficient, including Adam, AdaDelta, RMSProp, etc. They can find good local minima in less than a hundred of passes through data.</li><li>Have <i>enough data</i> to statistically cover all small variations in reality. Practically it means hundreds of thousand data points for moderate problems, and millions for complex problems. An immediate corollary is the need to have very powerful compute, which usually means lots of GPUs, RAM, time and patience.</li></ul><b>Generalizability</b><br /><br />Having a capacity to learn any function or program is not enough. The learnt program must be able to generalize to unseen data as expected. Overfitting easily occurs in modern models where millions of parameters are common. Fortunately, with lots of data, overfitting is less a problem. Also recent advances have introduced Dropout (and its cousin like Maxout, DropConnect, stochastic layers) and Batch-Norm, and they together help reduce overfitting significantly.<br /><br />This is evidenced in deep nets systems that work in the wild (Self-driving cars, Google Translate/Voice, AlphaGo).<br /><div><br /></div><div>Of course, these three concepts are not enough to make deep learning work in practice. There are <a href="http://letdataspeak.blogspot.com.au/2016/12/machine-learning-in-three-lines.html">hundreds of models, techniques, and programming frameworks</a> out there to make things happen.</div><div><br /></div>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com1tag:blogger.com,1999:blog-7266907390267288113.post-14163257782532442582016-12-27T19:14:00.000-08:002016-12-28T15:22:39.679-08:00Deep learning as new electronics<div class="separator" style="clear: both; text-align: center;"><a href="http://www.ultra-electronics.com/Uploads/BusinessImages/Motherboard_450.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.ultra-electronics.com/Uploads/BusinessImages/Motherboard_450.jpg" height="360" width="640" /></a></div><br />It is hard to imagine a modern life without <b>electronics</b>: radios, TVs, microwaves, mobile phones and many more gadgets. Dump or smart, they are all based on the principles of semi-conducting and electromagnetism. Now we are using these devices for granted without worrying about these underlying laws of physics. Most people do not care about circuits that run in chips and carry out most functions of the devices.<br /><br />For the past 5 years, a <a href="https://medium.com/intuitionmachine/the-ultimate-deep-learning-applications-list-434d1425da1d#.4jlv0c7ss">new breed of human-like functionalities</a> has emerged through advances of a new field called <b>deep learning</b>: self-driving cars, voice command in mobile phone, translation in hundreds of language pairs and a new kind of art. In 2016, ten years after its revival, <a href="https://www.wired.com/2016/12/2016-year-deep-learning-took-internet/">deep learning has taken over the Internet</a>. People have used deep learning-powered products in daily life without worrying about how the underlying neural nets work.<br /><br /><div>These two fields free us from many physical and psychological constraints:<br /><br /><ul><li><i>Electronic devices</i> give us freedom of communication over distance, a new kind of experiences with augmented reality and many more.</li><li><i>Deep learning</i> enables freedom from having to make tedious and incorrect decisions (e.g., driving a car), freedom of information access (personalization), of hand (e.g., voice command), of finance (automated trading), of feature extraction (through representation learning), and many more.</li></ul></div>It is worth noting that electronics and deep learning are different in principles.<br /><ul><li><i>Electronics devices</i> are designed with great precision for specific functions in mind. Imprecision comes from the quantum uncertainty principle and thermal fluctuations.</li><li><i>Neural nets</i> on the other hand, are designed to learn to perform a function of its own, where data (and sometimes model) uncertainty is built in.</li></ul>However, it is also striking that they are quite similar in many ways.<br /><br /><div><b>Super-city of interconnected simple parts</b><br /><br />Modern electronic devices are truly super-cities built out of just few kinds of primitive building blocks. The same holds for deep neural nets:</div><ul><li><i>Electronic primitives</i>: resistor, capacitor, transistor, coil, diode, logic gate and switch.</li><li><i>Neural net primitives</i>: integrate-and-fire neuron, multiplicative gating, differentiable logic gate, switch and attention module. Interestingly, one of the most recent idea is called "<a href="https://arxiv.org/abs/1505.00387">Highway networks</a>", borrowing the idea that highway traffic is free of traffic lights.</li></ul><div>These primitives are connected in graphs:</div><div><ul><li><i>Electronic devices</i> works by moving electrons in correct order and number. The force that makes them move is potential difference. A design circuit captures all necessary information.</li><li>In <i>neural nets</i>, the activation function is like the electronic current. The main difference is that the magnitude of "current" in neural nets can be learnt. A computational graph is what is needed for model execution.</li></ul></div><div><b>Not just analogy: A two-way relationship</b></div><div><ul><li><i>Electronics </i>→ <i>deep learning</i>: At present, advances in electronics have given huge boost in efficiency of deep learning with GPU, TPU and other initiatives. It is interesting to see if we can learn from electronics in designing deep nets? For example, will something analogous to integrated-circuits in deep architectures?</li><li><i>Deep learning </i>→<i> electronics</i>: I predict that soon the reverse will hold true: deep learning will play a great role in improving efficiency and functionalities of electronic devices. Stay tuned.</li></ul></div>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-75434507059398077322016-12-25T15:39:00.000-08:002017-01-12T19:33:33.826-08:00Making a dent in machine learning, or how to play a fast ball game<div class="separator" style="clear: both; text-align: center;"><a href="http://media.gettyimages.com/photos/englands-striker-wayne-rooney-shoots-from-the-penalty-spot-to-score-picture-id487304302" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://media.gettyimages.com/photos/englands-striker-wayne-rooney-shoots-from-the-penalty-spot-to-score-picture-id487304302" height="328" width="640" /></a></div><br />Neil Lawrence had an <a href="http://inverseprobability.com/2015/07/12/Thoughts-on-ICML-2015">interesting observation about the current state of machine learning</a>, and linked it to fast ball games:<br /><blockquote class="tr_bq">“[…] the dynamics of the game will evolve. In the long run, the right way of playing football is to <b>position yourself intelligently and to wait for the ball to come to you</b>. You’ll need to run up and down a bit, either to respond to how the play is evolving or to get out of the way of the scrum when it looks like it might flatten you.” </blockquote>Neil Lawrence is known for his work in Gaussian Processes and is a proponent of data efficiency. He used to be professor at University of Sheffield, is now with Amazon. Apparently the strategy works. The ball has come to him.<br /><br />I once heard about a professor who said he would come to top conferences just to learn what others were busy doing and tried to do something else.<br /><br />I also read somewhere from a top physicist that students who applied to work with him often expressed the wish to study shiny-and-clean fields. Some other fields were too messy and seemed unsexy. The professor insisted that the messy fields were exactly the best to work on.<br /><br />In "<a href="https://www.amazon.com/Letters-Young-Scientist-Edward-Wilson/dp/0871403854">Letters to a young scientist</a>", <a href="https://en.wikipedia.org/wiki/E._O._Wilson">Edward Osborne Wilson</a> told his life story. He spent his entire life cataloging ants since childhood, right at the time where ant ecology wasn't a shiny field. He is considered as father of biodiversity.<br /><br /><b>Wonder what to do in deep learning now?</b><br /><br />It is an extremely fast ball game with thousands of top players. You will be either crushed with ideas being stolen <b><i>weekly</i></b>, or out of steam pretty quickly.<br /><br />It looks like most of the <a href="http://letdataspeak.blogspot.com.au/2016/12/machine-learning-four-years-after.html">low hanging fruits have been picked</a>.<br /><br />Then ask yourself, what is your unique position? What are your strengths and advantages that people do not have? Can you move faster than others? It may be by having <a href="http://letdataspeak.blogspot.com.au/2016/12/caring-deeper-motif-detection-from.html">access to data</a>, access to expertise in the neighborhood, or borrowing <a href="http://letdataspeak.blogspot.com.au/2016/12/caring-deeply-intervened-long-short.html">angles outside the field</a>. Sometimes <a href="http://letdataspeak.blogspot.com.au/2016/12/everything-old-is-new-again-deep.html">digging up old ideas</a> is highly <a href="http://letdataspeak.blogspot.com.au/2016/12/everything-old-is-new-again-nested.html">beneficial</a>, too.<br /><br />Alternatively, just calm down, and do <a href="http://letdataspeak.blogspot.com.au/2016/12/model-stability-learn-it-many-more.html">boring-but-important stuffs</a>. Important problems are like the goal areas in ball games. The ball will surely come.<br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-91321898110445476752016-12-25T03:03:00.001-08:002017-10-09T00:35:47.930-07:0030 years of a Swiss army knife: Restricted Boltzmann machines <div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://cdn.dribbble.com/users/938469/screenshots/3052729/mega-swiss_army_knife.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="800" height="480" src="https://cdn.dribbble.com/users/938469/screenshots/3052729/mega-swiss_army_knife.jpg" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>I read somewhere, but cannot recall exactly who said so, that in ancient worlds, 30 years are long enough for the new generation to settle down with a new system, regime or ideology. As there are only a few days away from 2017, I would like to look back the history of a 30-year old model which has captured my research attention for the past 10 years.<br /><div class="separator" style="clear: both; text-align: center;"><br /></div>To some of you, <a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">restricted Boltzmann machine</a> (RBM) may be a familiar name, especially for those who follow the current deep learning literature since the beginning. But RBM has also passed its prime time, so you may have heard about it in passing.<br /><br />I was attracted to RBM for several reasons. When I was studying conditional random fields in 2004 and was looking for a fast way to train with arbitrary structures, Contrastive Divergence (CD) appears to be an interesting one. While CD is a generic technique, it was derived especially for RBMs. Second, RBM has "Boltzmann" in the name, which is kind of interesting, because physicists are kind of sexy :)<br /><br />Needless too say, another big reason is that RBM, together with its cousin, Autoencoder are building blocks of unsupervised deep nets, which started the current revolution -- deep learning.<br /><div><br /></div>The greatest reason is that I think RBM is one of the most important classes of data models known to date, perhaps comparable to PCA in dimensionality-reduction and k-means in clustering in terms of usefulness.<br /><br />First introduced in 1986 by Paul Smolensky under the name <i>Harmonium</i> in a classic two-volume book known as PDP (Parallel Distributed Processing), co-edited by Rumelhart and McLelland. RBMs were subsequently popularised by Geoff Hinton in the 2000s, especially in 2001 with the introduction of Contrastive Divergence (CD), and in 2006 with the introduction of a deep version known as Deep Belief Nets (DBN).<br /><br />Statistically, RBM is a probabilistic model of data, i.e., it assigns a probability (or density) to a multivariate data. Initially, RBMs are limited to binary data (known as Bernoulli-Bernoulli RBM), but subsequently extended to Gaussian data (known as Gaussian-Bernoulli RBM), and mixed-types (known as <a href="http://letdataspeak.blogspot.com.au/2016/12/mixed-type-data-analysis-iii-restricted.html">Mixed-variate RBM</a>, or <a href="http://letdataspeak.blogspot.com.au/2016/12/mixed-type-data-analysis-v-one-size.html">Thurstonian Boltzmann machine</a>).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://deeplearning.net/tutorial/_images/rbm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://deeplearning.net/tutorial/_images/rbm.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><span style="font-size: xx-small;">Source: http://deeplearning.net/tutorial/_images/rbm.png</span></div><br />RBM is a special case of Boltzmann machine, which is in turn a special case of Markov random field. It has two layers, one for observed data, the other for latent representation. Due to its special bipartite structure, MCMC inference can be implemented in a block-wise fashion. Learning is relatively fast with CD or its Persistent version. Estimating of latent representation is very fast with a single matrix operation. RBM is also a powerful model in the sense that it can represent any distribution given enough hidden units. As a Markov random field, it has log-linear paramerization which makes it easy to incorporate a variety of domain knowledge.<br /><br />With all of these advantages, RBMs have been used successfully in many applications, ranging from density modelling, feature extraction, dimensional reduction, clustering, topic modeling, imputation, classification, retrieval and anomaly detection.<br /><br /><b>Some bias selection of developments</b><br /><ul><li>1986: first introduced as Harmonium.</li><li>2001: fast approximate biased learning introduced as Contrastive Divergence (CD)</li><li>2004: generalized Harmonium introduced</li><li>2006: used successfully in Deep Belief Networks</li><li>2007: demonstrated with great success on a very large-scale task within the Netflix challenge</li><li>2007: temporal RBM</li><li>2008: recurrent temporal RBM</li><li>2008: classification RBM</li><li>2008: persistent CD introduced, essentially another variant of Young's.</li><li>2008: convolutional RBMs</li><li>2008: universality property proved</li><li>2009: topic models with Replicated Softmax</li><li>2009: matrix modelling with non <i>i.i.d.</i> RBMs, ordinal data, semi-restricted RBM</li><li>2009: implicit mixtures of RBMs</li><li>2010: factored high-order RBM</li><li>2010: mean-covariance RBM</li><li>2010: rectifier linear units RBM</li><li>2010: deep BM</li><li>2011: mixed-variate RBM</li><li>2012: a proper modeling of ordinal matrix data</li><li>2013: Thurstonian BM for joint modeling of most known data types</li><li>2013: nonnegative RBMs for parts-based representation</li><li>2015: trained with graph priors, demonstrating better generalization</li><li>2015: extended to tensor-objects</li><li>2016: infinite RBM</li></ul><div>In short, most of the work has been on extending the representational power of RBM to suit problem structures. The rest is about analysing theoretical properties of RBMs, making deep nets out of RBMs, and improving training speed & accuracy. For the past few years, research about RBMs has slowed down significantly, mostly because the superb accuracy of supervised deep nets, and the ease of deployment of deterministic nets on large-scale problems. </div><br /><b>Some of our own work</b><br /><ul><li><a href="https://arxiv.org/abs/1708.05594">Statistical Latent Space Approach for Mixed Data Modelling and Applications</a>, T Nguyen, T Tran, D Phung, S Venkatesh, <i>arXiv preprint arXiv:1708.05594</i></li><li><a href="https://arxiv.org/abs/1708.05603">Nonnegative Restricted Boltzmann Machines for Parts-based Representations Discovery and Predictive Model Stabilization</a>, T Nguyen, T Tran, D Phung, S Venkatesh, <i>arXiv preprint arXiv:1708.05603</i></li><li><a href="https://arxiv.org/abs/1610.06249">Multilevel Anomaly Detection for Mixed Data</a>, K Do, T Tran, S Venkatesh, <i>arXiv preprint arXiv: 1610.06249.</i></li><li><a href="https://arxiv.org/abs/1603.01359">Learning deep representation of multityped objects and tasks</a>, Truyen Tran, D. Phung, and S. Venkatesh, <i>arXiv preprint arXiv: 1603.01359</i>.</li><li><a href="https://arxiv.org/abs/1608.04830">Outlier Detection on Mixed-Type Data: An Energy-based Approach</a>, K Do, T Tran, D Phung, S Venkatesh, <i>International Conference on Advanced Data Mining and Applications (ADMA 2016)</i>.</li><li><a href="http://www.sciencedirect.com/science/article/pii/S002002551500609X">Graph-induced restricted Boltzmann machines for document modeling</a>, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, <i>Information Sciences</i>. doi:10.1016/j.ins.2015.08.023.</li><li><a href="https://www.ncbi.nlm.nih.gov/pubmed/25661261">Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (e-NRBM)</a>, Truyen Tran, Tu D. Nguyen, D. Phung, and S. Venkatesh, <i>Journal of Biomedical Informatics</i>, 2015, pii: S1532-0464(15)00014-3. doi: 10.1016/j.jbi.2015.01.012. </li><li><a href="http://dro.deakin.edu.au/view/DU:30076888">Tensor-variate Restricted Boltzmann Machines</a>, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, <i>AAAI 2015</i>. </li><li><a href="https://arxiv.org/abs/1408.0055">Thurstonian Boltzmann machines: Learning from multiple inequalities</a>, Truyen Tran, D. Phung, and S. Venkatesh, In<i> Proc. of 30th International Conference in Machine Learning (ICML’13),</i> Atlanta, USA, June, 2013.</li><li><a href="http://dro.deakin.edu.au/view/DU:30081487">Learning parts-based representations with Nonnegative Restricted Boltzmann Machine</a>, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, <i>Journal of Machine Learning Research (JMLR) Workshop and Conference Proceedings, Vol. 29, Proc. of 5th Asian Conference on Machine Learning,</i> Nov 2013.</li><li><a href="https://www.researchgate.net/publication/286292820_Latent_Patient_Profile_Modelling_and_Applications_with_Mixed-Variate_Restricted_Boltzmann_Machine">Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine</a>, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13)</i>, Gold Coast, Australia, April 2013.</li><li><a href="http://ieeexplore.ieee.org/document/6607435/">Learning sparse latent representation and distance metric for image retrieval</a>, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of IEEE International Conference on Multimedia and Expo (ICME)</i>, San Jose, California, USA, July 2013.</li><li><a href="https://arxiv.org/abs/1408.0043">Learning from Ordered Sets and Applications in Collaborative Ranking</a>, Truyen Tran, Dinh Phung and Svetha Venkatesh, in <i>Proc. of. the 4th Asian Conference on Machine Learning (ACML2012)</i>, Singapore, Nov 2012.</li><li><a href="http://dro.deakin.edu.au/view/DU:30052641">Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis</a>, Truyen Tran, Dinh Phung and Svetha Venkatesh, in <i>Proc. of. the 4th Asian Conference on Machine Learning (ACML2012)</i>, Singapore, Nov 2012.</li><li><a href="http://ieeexplore.ieee.org/document/6290523/">Embedded Restricted Boltzmann Machines for Fusion of Mixed Data Types and Applications in Social Measurements Analysis</a>, Truyen Tran, Dinh Phung, Svetha Venkatesh, in<i> Proc. of 15-th International Conference on Information Fusion (FUSION-12),</i> Singapore, July 2012.</li><li><a href="http://ieeexplore.ieee.org/document/6298401/">Learning Boltzmann Distance Metric for Face Recognition</a>, Truyen Tran, Dinh Phung, Svetha Venkatesh, in <i>Proc. of IEEE International Conference on Multimedia & Expo (ICME-12)</i>, Melbourne, Australia, July 2012.</li><li><a href="https://arxiv.org/abs/1408.1160">Mixed-Variate Restricted Boltzmann Machines</a>, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the <i>3rd Asian Conference on Machine Learning (ACML2011)</i>, Taoyuan, Taiwan, Nov 2011.</li><li><a href="https://arxiv.org/abs/1205.2611">Ordinal Boltzmann Machines for Collaborative Filtering</a>. Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In <i>Proc. of 25th Conference on Uncertainty in Artificial Intelligence</i>, June, 2009, Montreal, Canada. Runner-up for the best paper award.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-57666990400385867502016-12-22T03:03:00.000-08:002017-01-15T17:07:21.352-08:00Machine learning in three lines<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://i.ytimg.com/vi/b99UVkWzYTQ/maxresdefault.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://i.ytimg.com/vi/b99UVkWzYTQ/maxresdefault.jpg" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>How can we characterize machine learning as a field? What make machine learning work?<br /><br />Machine learning is a fast changing field. The list of ideas is practically endless: Decision trees, ensemble learning, random forests, boosting, neural networks, hidden Markov models, graphical models, kernel methods, conditional random fields, sparsity, compressed sensing, budgeted learning, multi-kernel learning, transfer learning, co-training, active learning, multitask learning, deep learning, lifelong learning and many more.<br /><br />The problem is, ideas <a href="http://letdataspeak.blogspot.com/2012/05/non-convexity-cccp-dca-em-and-log.html">come and go</a>, and <a href="http://letdataspeak.blogspot.com/2016/12/everything-old-is-new-again-deep.html">bounce back</a>, roughly every 10-15 years. Long enough for a grad student learns the tricks, makes a big noise, graduates when it is still hot and gets a good academic job, IF he is lucky to start early in the cycle. Also long enough so that the new batch of students are not aware of the hot things of the previous wave. How familiar is "particle filtering" to you?<br /><blockquote class="tr_bq">Popular in the late 1990s and early 2000s, <a href="https://en.wikipedia.org/wiki/Particle_filter">particle filtering</a> is a fast way to generate samples of state for a dynamic system when an observation is made.</blockquote>When I started my grad training in 2004, I asked one of my supervisors on what hot topic I should focus on. He said, pick either graphical models or kernel methods (which meant SVM at the time). I picked graphical models, and then <a href="http://letdataspeak.blogspot.com.au/2012/05/innovation-by-many-incremental.html">was given conditional random fields</a> (CRFs) to work on. By the time I submitted my PhD thesis in early 2008, CRFs were largely gone. SVMs were gone a couple of years before that, just around the time neural nets bounced back under a new umbrella, deep learning, in 2006. It used to be all about convex loss functions (SVMs & CRFs), now <a href="http://letdataspeak.blogspot.com.au/2012/05/non-convexity-cccp-dca-em-and-log.html">everything is non-convex</a>. Local minima? Doesn't matter, adaptive stochastic gradient descents such as Adagrad, Adam or RMSprop will find a really good one for you.<br /><br /><b>Applying machine learning is like flying commercial aircraft</b><br /><br />Ever wanted to apply a technique to your problem? A sure way is to employ a PhD in machine learning! Packages available, but what are the correct ways to use, let alone the best way? Think about flying commercial aircrafts. There are hundreds of knobs to tune. There are even autopilot mode. We just need to have two human pilots: one to tune the right knob at the right time, and the other making sure that the correct things are being done.<br /><br />Wanna use deep learning? You need to decide between: feedforward, recurrent, convolutional nets and any combination of these three. Will attention be used? How about memory? Which loss function? Embedding size? Optimizers and their parameters? and many many more.<br /><br />I work with clinicians on clinical problems. At least two of them -- young, smart and highly motivated -- insisted to come over and observe how I do machine learning and learn to do it themselves. They claimed they could do Statra, R and sometimes Python. My boss was crossed. This is not how collaboration should work, right? You want to learn our art for free, then trash us?<br /><br />But I told the boss, let them come.<br /><br />They came and left, even more puzzled. I ended up doing the job I usually did and so did they.<br /><br /><b>Machine learning in three lines</b><br /><br />I once delivered an internal talk on deep learning. My boss requested that I talked only about three things. Just three things. This bugged me a lot. But the trick actually worked.<br /><br />Here I am trying to characterize the current state of machine learning in general and it should apply to deep learning. Machine learning works by:<br /><ol><li><i>Having good priors of the problem at hand (80-90% gain)</i></li><li><i>Accounting for data uncertainty and model uncertainty with ensemble and Bayesian methods. (1-5% gain)</i></li><li><i>Reusing models when data/model redundancies are available (5-10% gain)</i></li></ol><b>Priors are king</b><br /><br />By "good priors", I meant several things:<br /><ul><li><i>Features that capture all meaningful signals from data</i>. Getting good features are the job of feature engineering, which usually accounts for 80-90% of total effort in a machine learning project. Once you have good features, most modern classifiers will work just fine. Deep learning succeed partly because it solves this problem.</li></ul><ul><li><i>Problem structures are respected</i>. For example, sequential data would suggest the use of Hidden Markov Models (HMM) or chain-like Conditional Random Fields (CRF). In deep learning, it reduces to architecture engineering!</li></ul><ul><li><i>Knowledge about the model class</i>. E.g., will linearity be the dominant model? What are the expected complexity and nonlinearity? Will interpretability needed? What is about transparency? Is sparsity important? For neural nets, how many layers? For SVMs, will one kernel type be enough?</li></ul><ul><li><i>Assumptions about data manifold</i>. One well-studied phenomenon is the intrinsic low dimensionality of data embedded in a very high dimensional space. This is usually manifested through data redundancies. Another assumption is separation of classes, e.g., the region at the class boundary is usually sparse, but is very dense near the class examplars. This assumption essentially gives rise to semi-supervised learning.</li></ul><ul><li><i>Assumptions about the data space</i>. <a href="http://letdataspeak.blogspot.com/2008/08/data-size-matters.html">How many data instances</a>? Will characterizing the data variance enough? If yes then use PCA. What about factors of variation are the key? If yes then RBM perhaps helps.</li></ul><b>Don't forget uncertainty</b><br /><br />Even with a good prior, we would never be sure that our choices are correct. Model uncertainty is there and must be accounted for. A popular way is to use many (diverse) models, then employ model averaging, ensemble methods and Bayesian approach. Deep learning has dropout as one of the best tricks invented in the past 10 years. It works like wonder. Bayesian neural nets, which were studied in mid 1990s, is also back!<br /><br />In fact, every single modern challenge was won by some ensemble, mostly gradient boosting by the time of this writing AND model blending. One of the best known example is the Netflix challenge, which was won by blending hundreds of models -- so complex that Netflix found it was useless to implement in practice.<br /><br /><b>Many are easier than one</b><br /><br />I often told my 5-year old daughter: do one thing at a time. But by listening to me AND playing at the same time, she has already multi-tasked. Humans seem to learn better that way. We learn by making senses of many co-occurring feedback signals.<br /><br />A key idea in recent machine learning is model reuse. It has many forms:<br /><br /><ul><li><i>Domain adaption</i>, which is about reusing previous model on similar domains with minimal changes.</li><li><i>Transfer learning</i>, which is about reusing models on similar tasks with minimal changes. In neural nets, it is equivalent to not forgetting the old trained nets when learning a new concepts.</li><li><i>Multi-task learning</i>, which is about learning more than ones correlated tasks at a time. The idea is that models can be partly shared among tasks, leading to less training data and less overfitting.</li><li><i>Lifelong learning</i>, which is like continual version of transfer learning. Just like humans, we learn to do new things from birth to death, every single day! Popularized by Sebastian Thrun in mid 1990s, lifelong learning is now back in various forms: never-ending learning at CMU, reinforce learning in robotics and games at a various labs.</li><li><i>Multi-X</i>, where X is substituted by view, modality, instance, label, output/outcome, target, type and so on.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com2tag:blogger.com,1999:blog-7266907390267288113.post-27994695796810086782016-12-20T01:34:00.001-08:002017-03-05T19:07:29.346-08:00Everything old is new again: Nested sequential models<div class="separator" style="clear: both; text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/en/thumb/d/dd/HHMM.png/400px-HHMM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="198" src="https://upload.wikimedia.org/wikipedia/en/thumb/d/dd/HHMM.png/400px-HHMM.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Recently, multi-layer RNN architectures have been demonstrated to work better than single-layer versions. The <a href="https://arxiv.org/abs/1609.08144">Google's Neural Machine Translation</a> machine, for example, has 8 layers of LSTMs as of Dec 2016.<br /><br />The idea goes back to earlier days of multi-layer HMMs in the 1990s, which are special cases of <a href="https://en.wikipedia.org/wiki/Dynamic_Bayesian_network">Dynamic Bayesian Networks</a>. These were then followed by multi-layer <a href="https://en.wikipedia.org/wiki/Conditional_random_field">Conditional Random Fields</a> (CRFs), which are also special case of Dynamic CRFs.<br /><br />The idea is that higher layers represent more abstract semantics. In temporal sequences, one would expect that the "clock" of the upper layers is slower than that of the lower layers. But most existing work has to explicitly design the temporal resolution by hand.<br /><br />Learning the temporal resolution automatically is an attractive idea. In 1998, <a href="https://en.wikipedia.org/wiki/Hierarchical_hidden_Markov_model">Hierarchical HMM</a> was introduced, here parent state is assumed to generate a child sequence, and each child in turn generates a grandchild subsequence and so forth. The network becomes nested. Learning and inference cost cubic time, which is prohibitive for long sequences.<br /><br />A CRF counterpart is known as <a href="https://arxiv.org/abs/1009.2009">Hierarchical Semi-Markov CRF</a> introduced by us in 2008.<br /><br />Both HHMMs and HSCRFs are member of the <a href="https://en.wikipedia.org/wiki/Stochastic_context-free_grammar">Stochastic Context-Free Grammar</a> family, which is known for its cubic time complexity. Not just being slow, HHMMs and HSCRFs are hopeless in large-scale tasks that require many bits to represent the world.<br /><br />Given the recent successes of RNNs (mostly LSTM and GRU) for sequential tasks, one would naturally ask whether we can achieve the same feat as in HHMMs, that is, the hierarchy is learnt automatically from data. It proves to be a difficult task, until very recently. Check this <a href="https://arxiv.org/abs/1609.01704">paper</a> by Bengio's group for more detail. I'm very curious to see how the idea plays out in practice. Let's wait and see.<br /><br /><b>Work by us</b>:<br /><ul><li><a href="http://www.sciencedirect.com/science/article/pii/S0004370217300231">Hierarchical semi-Markov conditional random fields for deep recursive sequential data</a>, Truyen Tran, Dinh Phung, Hung Bui, Svetha Venkatesh, <i> Artificial Intelligence, 2017.</i> (Extension of the NIPS'08 paper).</li><li>MCMC for Hierarchical Semi-Markov Conditional Random Fields, Truyen Tran, Dinh Q. Phung, Svetha Venkatesh and Hung H. Bui. In <i>NIPS'09 Workshop on Deep Learning for Speech Recognition and Related Applications</i>. December, 2009, Whistler, BC, Canada.</li><li>Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data, Truyen Tran, Dinh Q. Phung, Hung H. Bui, and Svetha Venkatesh. In P<i>roc. of 21st Annual Conference on Neural Information Processing Systems</i>, Dec 2008, Vancouver, Canada. </li><li>AdaBoost.MRF: Boosted Markov random forests and application to multilevel activity recognition, Truyen Tran, Dinh Quoc Phung, Hung Hai Bui, and Svetha Venkatesh. In <i>Proc. of IEEE Conference on Computer Vision and Pattern Recognition</i>, volume Volume 2, pages 1686-1693, New York, USA, June 2006.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com4tag:blogger.com,1999:blog-7266907390267288113.post-30069816029325282872016-12-19T15:32:00.001-08:002016-12-27T00:34:40.312-08:00Everything old is new again: Deep statistical relational learning<div class="separator" style="clear: both; text-align: center;"><a href="http://image.glamourdaze.com/2014/04/1920s-DRESS-TIMELINE-eveningwear.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://image.glamourdaze.com/2014/04/1920s-DRESS-TIMELINE-eveningwear.jpg" height="308" width="640" /></a></div><br />In the age of <a href="http://www.almaden.ibm.com/coevolution/pdf/varian_paper.pdf">combinatorial innovation</a>, old things will be given a new shiny face, <a href="http://letdataspeak.blogspot.com.au/2012/05/innovation-by-many-incremental.html">even nothing really new happens</a>. The same holds for <a href="https://en.wikipedia.org/wiki/Statistical_relational_learning">Statistical Relation Learning</a> (SRL) -- a sub-field of machine learning for characterizing the relational structure of the world.<br /><br />Started in late 1990s, SRL had gone through a fruitful period of about 10 years and reached its peak in 2007 with the publication of a book titled "<a href="http://www.cs.umd.edu/srl-book/">Introduction to Statistical Relational Learning</a>" co-edited by Lise Getoor and the late Ben Taskar (who died unexpectedly in 2013 at the age of 36 at his academic peak). Many significant models appeared in the first half of the 2000s, including <a href="https://en.wikipedia.org/wiki/Conditional_random_field">Conditional Random Fields</a> (CRF, 2001), Relational Markov networks (2002) and <a href="https://en.wikipedia.org/wiki/Markov_logic_network">Markov Logic Networks</a> (2006). Despite being more powerful than non-relational alternatives, SRL still relies on manual feature engineering, which will soon reach its limit of utility.<br /><br />Developed rather in parallel is Deep Learning (DL), where the current wave officially started in 2006 with the publication of <a href="https://en.wikipedia.org/wiki/Deep_belief_network">Deep Belief Networks</a> in Science. Deep learning is concerned about learning data abstraction (aka features), favoring end-to-end learning through multiple steps of non-linear computation.<br /><br />A <b>combinatorial thinking</b> would naturally lead to the question whether these two sub-fields can work together. The answer is a big YES, because SRL and DL are rather complementary. For example, in the past 3 years, there have been lots of papers marrying CRF and deep nets. While CRFs offer a semi-formal framework for joint learning and inference, deep nets offer learning of features (with feedforward nets), deterministic dynamics (with recurrent nets), and translation invariance (with convolutional nets). The marriage would be a happy one. But like any marriage of convenience, it won't go very far. Some genuine blending is needed.<br /><br />Our recent work, <a href="https://arxiv.org/abs/1609.04508">Column Networks</a>, scheduled to appear in AAAI'17, blends the SRL and DL even further so that learning and inference can be carried out naturally. The term "column" refers to the famous <a href="https://en.wikipedia.org/wiki/Cortical_column">columnar structure of neo-cortex</a> in mammals. Interestingly, Column Networks share design features of <b>all three main deep net architectures</b>:<br /><br /><ul><li>A column is a feedfoward net,</li><li>Parameters are tied across layers, which is essentially the idea behind recurrent nets.</li><li>The network between columns is designed so that the multi-relations between columns are invariant across columns, hence the translation invariance property of convolutional nets.</li></ul><div>As Column Networks are very generic, expect more to come in the next few months. Stay tuned.</div><div><br /></div><div><b>Updated references</b></div><br /><div><ul><li>Column Networks for Collective Classification, T Pham, T Tran, D Phung, S Venkatesh, <i>AAAI'17</i>.</li><li>Graph-induced restricted Boltzmann machines for document modeling, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, <i>Information Sciences</i>. doi: 10.1016/j.ins.2015.08.023</li><li>Neural Choice by Elimination via Highway Networks, Truyen Tran, Dinh Phung and Svetha Venkatesh, <i>PAKDD workshop on Biologically Inspired Techniques for Data Mining (BDM'16)</i>, April 19-22 2016, Auckland, NZ.</li><li>Predicting delays in software projects using networked classification, Morakot Choetikertikul, Hoa Khanh Dam, Truyen Tran, Aditya Ghose, <i>30th IEEE/ACM International Conference on Automated Software Engineering</i>, November 9–13, 2015 Lincoln, Nebraska, USA.</li><li>Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (e-NRBM), Truyen Tran, Tu D. Nguyen, D. Phung, and S. Venkatesh, <i>Journal of Biomedical Informatics</i>, 2015, pii: S1532-0464(15)00014-3. doi: 10.1016/j.jbi.2015.01.012. </li><li>Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in P<i>roc. of. the 4th Asian Conference on Machine Learning (ACML2012)</i>, Singapore, Nov 2012.</li><li>Ordinal Boltzmann Machines for Collaborative Filtering. Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In <i>Proc. of 25th Conference on Uncertainty in Artificial Intelligence</i>, June, 2009, Montreal, Canada. </li><li>Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data, Truyen Tran, Dinh Q. Phung, Hung H. Bui, and Svetha Venkatesh. In <i>Proc. of 21st Annual Conference on Neural Information Processing Systems</i>, Dec 2008, Vancouver, Canada.</li><li>AdaBoost.MRF: Boosted Markov random forests and application to multilevel activity recognition, Truyen Tran, Dinh Quoc Phung, Hung Hai Bui, and Svetha Venkatesh. In <i>Proc. of IEEE Conference on Computer Vision and Pattern Recognition</i>, volume Volume 2, pages 1686-1693, New York, USA, June 2006.</li></ul></div>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-74400270102882447172016-12-19T03:55:00.001-08:002017-01-19T03:22:22.778-08:00Machine learning four years after the turning point<div class="separator" style="clear: both; text-align: center;"><a href="http://media.tumblr.com/f66e77790fc385a75b1a98fef44dfb4a/tumblr_inline_mksd7gm8a81qz4rgp.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://media.tumblr.com/f66e77790fc385a75b1a98fef44dfb4a/tumblr_inline_mksd7gm8a81qz4rgp.jpg" height="322" width="640" /></a></div><br />In May 2012 I wrote a note titled "<a href="http://letdataspeak.blogspot.com.au/2012/05/non-convexity-cccp-dca-em-and-log.html">Machine at its turning point</a>" to argue for the new wave of machine learning in that we do not need to worry about having a convex loss but rather be happy with non-convex ones. At the time I did not know about AlexNet and its record-breaking result on the ImageNet benchmark. It was <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-">published</a> 7 months later in NIPS'12.<br /><br />AlexNet was truely a turning point for machine learning. It declared the winning of deep neural nets over others, which were combination of clever manual feature engineering and some variants of SVMs or random forests. AlexNet is remarkable in many ways: Dropout, rectifier linear units, end-to-end training on massive data with GPUs, data augmentation and carefully designed convolutional nets.<br /><br />It was the year that <a href="https://plus.google.com/+YannLeCunPhD/posts/gurGyczzsJ7">Yann LeCun posted his complaints about the computer vision community, but quickly retracted his boycott given the aftershock of AlexNet.</a><br /><br />Recently, there has been an interesting comment floating around: In machine learning, we ask what we can do for neural networks, and in applied domains, we ask what can neural networks do for X. And the list of Xs keeps growing from cognitive domain to non-cognitive domains. Andrew Ng made an interesting point that for domains where humans can do well to map A to B in less than a second, it is ripe for machine automation.<br /><br />This year also marks the 10th year after Deep Belief Nets, the model that announces the beginning of the current wave of neural nets. Early this year, <a href="https://techcrunch.com/2016/03/15/google-ai-beats-go-world-champion-again-to-complete-historic-4-1-series-victory/">AlhaGo of DeepMind defeated one of the best Go champions 4 to 1</a>, officially ending the superiority of human on this ancient game. AlphaGo is a mixture of convolutional nets to read the board positions and evaluate the moves, and random tree-search moves.<br /><br />Many things have changed since 2012. It is clear that supervised learning works if we have sufficient labels without pre-training. Unsupervised learning, after an initial burst with Boltzmann machines and Autoencoders, failed to deliver. There are new interesting developments, however, with Variational Autoencoder (VAE) and Generative Adversarial Nets (GAN), both invented in 2014. At this point, GAN is the best technique to generate faithful images. It is considered by Yann LeCun as one of the best ideas in recent years.<br /><br />The machine learning community has witnessed 10-15 year mini-cycles. Neural networks, graphical models, kernel methods, statistical relational learning and currently, deep learning. So what is up for deep learning? If we consider 2006 as the year of beginning of current deep learning, then it is already 10 years, enough for a mini-cycle. But if we consider 2012 as the true landmark, then we have 6 more years to count.<br /><br />Like other methodologies, deep learning will eventually morph into something else in 5 years time. We may call it by other names. With programming becomes reasonably effortless and with the availability of powerful CPUs/GPUs designed specifically for deep learning, the low hanging fruits will soon be picked up.<br /><br />Practice-wise, as feature engineering is an unsung hero of machine learning prior to 2012, <a href="http://smerity.com/articles/2016/architectures_are_the_new_feature_engineering.html">architecture engineering is at the core of deep learning these days</a>.<br /><br />It is also time for the hardcores. Data efficiency, statistics, geometry, information theory, Bayesian and other "serious" topics. Like any major progresses in science and engineering, nothing really occurs over night. At this point, deep learning is already mixed with graphical models, planning, inference, symbolic reasoning, memory, execution, Bayesian among other things. All together, something fancy will happen, just like what <a href="http://letdataspeak.blogspot.com.au/2012/05/innovation-by-many-incremental.html">I noted about Conditional Random Fields</a> years ago, that it is the combination of incremental innovations that pushes the boundary of certain field to a critical point. It also concurs with the idea of emergence intelligence, where human intelligence is really the emerging product of many small advances over apes.<br /><br />For a more comprehensive review, see my <a href="https://truyentran.github.io/ai16-tute.html">recent tutorials at AI'16</a> on the topic. Some incremental innovations were produced at PRaDA (Deakin University), listed below.<br /><br /><b>Work by us:</b><br /><ul><li>Multilevel Anomaly Detection for Mixed Data, K Do, T Tran, S Venkatesh, <i>arXiv preprint arXiv: 1610.06249</i>.</li><li>A deep learning model for estimating story points, M Choetkiertikul, HK Dam, T Tran, T Pham, A Ghose, T Menzies, <i>arXiv preprint arXiv: 1609.00489</i></li><li>Deepr: A Convolutional Net for Medical Records, Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, Svetha Venkatesh, To appear in<i> IEEE Journal of Biomedical and Health Informatics</i>.</li><li>Column Networks for Collective Classification, T Pham, T Tran, D Phung, S Venkatesh, <i>AAAI'17</i></li><li>DeepSoft: A vision for a deep model of software, Hoa Khanh Dam, Truyen Tran, John Grundy and Aditya Ghose, <i>FSE VaR 2016</i>.</li><li>Faster Training of Very Deep Networks Via p-Norm Gates, Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh, <i>ICPR'16</i>.</li><li>Hierarchical semi-Markov conditional random fields for deep recursive sequential data, Truyen Tran, Dinh Phung, Hung Bui, Svetha Venkatesh, To appear in <i>Artificial Intelligence</i>.</li><li>DeepCare: A Deep Dynamic Memory Model for Predictive Medicine, Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh, <i>PAKDD'16</i>, Auckland, NZ, April 2016. </li><li>Neural Choice by Elimination via Highway Networks, Truyen Tran, Dinh Phung and Svetha Venkatesh, <i>PAKDD workshop on Biologically Inspired Techniques for Data Mining (BDM'16)</i>, April 19-22 2016, Auckland, NZ.</li><li>Tensor-variate Restricted Boltzmann Machines, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, <i>AAAI 2015</i>. </li><li>Thurstonian Boltzmann machines: Learning from multiple inequalities, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of 30th International Conference in Machine Learning (ICML’13)</i>, Atlanta, USA, June, 2013.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com2tag:blogger.com,1999:blog-7266907390267288113.post-61407746704847762542016-12-19T00:29:00.002-08:002016-12-27T02:51:07.211-08:00Model stability: Learn it many more times, but on different datasets<div class="separator" style="clear: both; text-align: center;"><a href="https://replicationindex.files.wordpress.com/2015/01/the-simpsons-2x11-one-fish-two-fish-blowfish-blue-fish1.jpg?w=510&h=270&crop=1" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="337" src="https://replicationindex.files.wordpress.com/2015/01/the-simpsons-2x11-one-fish-two-fish-blowfish-blue-fish1.jpg?w=510&h=270&crop=1" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>How often do you actually publish your data-derived models? Chances are you almost never do, if you are machine learning type. Only quite recently, when training a big deep net is very expensive, people start publishing models. And it helps people who come afterward significantly.<br /><br />This is quite contrary to fields like medicine, where the models (often regression coefficients for a GLM) are routinely published. This is because in those fields, model is the actual finding, not the learning algorithm that produces it.<br /><br />In a way, empirical sciences progress as new models are found, published and verified. One key requirement is that the models are reproducible. Empirical models, those derived from data, must be stable across different datasets by different research groups to be accepted as a rule.<br /><br />But this has not been well-respected in data-driven fields.<br /><br />Anyone who use decision trees to derive a prediction rule from a reasonably complex data would experience a phenomenon that trees will differ drastically if you change just few data points. Unfortunately there have been many trees published in the medicine literature, probably because trees are highly interpretable. But I doubt that anyone could ever reproduce a tree from their own data.<br /><br />At a recent "Big Data" conference I asked a bioinformatics professor why people keep publishing new "findings" of genes, which are supposed to cause or worsen a medical condition. The trouble is that different groups claim different subsets, <b>many of which do not overlap at all</b>. Needless to say, all of those findings are statistically significant, <u>on their own dataset</u>. The professor did not answer my question directly. She said people had different hypotheses and thus focused on those genes whey suspected. Sometimes, the biases or resource limitations prevent people from looking elsewhere.<br /><div><br /></div>For the past few years I have worked on deriving simple prediction rules for healthcare from high-dimensional data. The standard method of the day is sparsity-induced techniques such as Lasso. Every time I changed the data a little bit, either by changing some patients due to different selection criteria, or changing some features (there are endless possibilites), I would have a different feature subset and their coefficients with comparable predictive power!<br /><br />For those who care, <a href="http://ieeexplore.ieee.org/document/5989836/">stability and sparsity are not the best friends</a>. Sparse models are known to be unstable. Same as feature selection techniques.<br /><br />Model instability is a daunting issue for empirical sciences (e.g., the so-called evidence-based medicine). There are two jobs that must be done. One is quantifying the instability. The other is deriving strategies to stabilize the estimation.<br /><br />The first job has been studied to a great detail in the context of confidence interval estimation. For standard GLMs, the estimation is well-known, but as soon as sparsity comes into play, the job is much harder. A solution is simulation-based, a.k.a., the one-size-fit-all bootstrap. That is, for a dataset, resample it to obtain the new set of the same size, and re-estimate the model. Parameter confidence intervals can then be calculated from multiple estimates, says B times, where B is usually in the order of thousands. While this method is straightforward with modern computer, its theoretical properties still need further investigation.<br /><br />The second job is much less studied. At PRaDA (Deakin University), we have attempted to solve the problem from several directions, and for several GLM instances such as logistic regression, ordinal regression and Cox's model. The main idea is to exploit the domain knowledge or statistics, so that the degree of freedom is limited. Some of the recent works are listed in the References below.<br /><br />In subsequent posts, we will cover some specific techniques. Stay tuned.<br /><br /><b>Updated references</b><br /><br /><ul><li>Preterm Birth Prediction: Deriving Stable and Interpretable Rules from High Dimensional Data, Truyen Tran, Wei Luo, Dinh Phung, Jonathan Morris, Kristen Rickard, Svetha Venkatesh, <i>Conference on Machine Learning in Healthcare</i>, LA, USA Aug 2016.</li><li>Stabilizing Linear Prediction Models using Autoencoder, Shivapratap Gopakumara, Truyen Tran, Dinh Phung, Svetha Venkatesh, <i>International Conference on Advanced Data Mining and Applications (ADMA 2016)</i>.</li><li>Stabilizing Sparse Cox Model using Statistic and Semantic Structures in Electronic Medical Records. Shivapratap Gopakumar, Tu Dinh Nguyen, Truyen Tran, Dinh Phung, and Svetha Venkatesh, <i>PAKDD'15</i>, HCM City, Vietnam, May 2015.</li><li>Stabilizing high-dimensional prediction models using feature graphs, Shivapratap Gopakumar, Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh, <i>IEEE Journal of Biomedical and Health Informatics</i>, 2014 DOI:10.1109/JBHI.2014.2353031S </li><li>Stabilizing sparse Cox model using clinical structures in electronic medical records, S Gopakumar, Truyen Tran, D Phung, S Venkatesh, <i>2nd International Workshop on Pattern Recognition for Healthcare Analytics,</i> August 2014</li><li>Stabilized sparse ordinal regression for medical risk stratification, Truyen Tran, Dinh Phung, Wei Luo, and Svetha Venkatesh, <i>Knowledge and Information Systems</i>, 2014, DOI: 10.1007/s10115-014-0740-4.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-63843178052884161542016-12-18T20:23:00.001-08:002016-12-27T02:52:55.045-08:00Caring deeper: motif detection from medical records using convolutional nets<div class="separator" style="clear: both; text-align: center;"><a href="https://cdn.strandofsilk.com/journeymap_social_img/strand_of_silk_-_journey_map_-_applique_-_motifs_and_colours_-_elephantandfloralmotif.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="384" src="https://cdn.strandofsilk.com/journeymap_social_img/strand_of_silk_-_journey_map_-_applique_-_motifs_and_colours_-_elephantandfloralmotif.jpg" width="640" /></a></div><br />In the <a href="http://letdataspeak.blogspot.com.au/2016/12/caring-deeply-intervened-long-short.html">previous post</a>, I have introduced <a href="https://arxiv.org/abs/1602.00357">DeepCare</a>, a LSTM-based model for (potentially very long) medical records with irregular timing and treatments.<br /><br />Here I introduce another deep net called <a href="https://arxiv.org/abs/1607.07519">Deepr</a>, a CNN-based model for relatively short medical records. The main purpose is learning to discover medical motifs that lead to some future events (e.g., death).<br /><br />Unlike DeepCare which assumes a clear temporal dynamics in the medical records, Deepr requires only repeated short patterns (motifs) over the data sequence. Time gaps are discretized into symbols which are treated in the same way as diagnoses, procedures and medications. All symbols are then sequenced. Those co-occurring will be randomly ordered.<br /><br />Once Deepr has been learnt, motif segments in a record that respond well to an outcome can be detected.<br /><br />Note that Deepr can be used in other situations where irregular time gaps and discrete data are present.<br /><br /><b>Update references</b><br /><br /><ul><li>Deepr: A Convolutional Net for Medical Records, Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, Svetha Venkatesh, <i>IEEE Journal of Biomedical and Health Informatics</i>, 2017.</li><li>DeepCare: A Deep Dynamic Memory Model for Predictive Medicine, Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh, <i>PAKDD'16</i>, Auckland, NZ, April 2016.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-54411795500455448202016-12-17T19:43:00.001-08:002016-12-27T02:59:15.048-08:00Caring deeply: Intervened long short-term memory for medical records<div class="separator" style="clear: both; text-align: center;"><a href="https://www.clementstheory.com/img/guides/khachaturian-sabre-dance.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="264" src="https://www.clementstheory.com/img/guides/khachaturian-sabre-dance.png" width="640" /></a></div><br />In the US, healthcare expenditure accounts for approximately 18% GDP, almost twice as much that in Australia. And the percentage keeps growing. One possible explanation is that after having cheap, accessible everything, people want to spend more and more on is their own health.<br /><br />Given this big fat cake, it is no surprise that healthcare is the next target for the current AI wave. At present, startups pop up every week, all hoping to claim a big share.<br /><br />Central to modern healthcare systems is Electronic Medical Records (EMRs), the personal database of any encounter with the healthcare systems, usually consists of information regarding diseases, treatments, billing, measurements, social care and more. EMRs are the promise of the modern healthcare to improve efficiency, accessibility and personalized medicine.<br /><br />We will focus our attention to predictive medicine, a new approach that is not just about diagnosis (what happens now), but also about prognosis (what will happen if we do X). Not surprisingly, to predict the future, we need to study the past. Ultimately, we end up modeling the entire health trajectory since birth (if the data is available).<br /><br />Two things that make EMRs a modeling challenge are:<br /><br /><ul><li><i>Data are episodic</i>. Data is only recorded when patient turns up at clinic or hospital. There are time gap in between, and the gap is irregular. </li><li><i>There is "care" in healthcare, that is, interventions done by clinician</i>. Treatments disrupt the natural course of health trajectory. Treatments are supposed to lessen or eliminate the illness. But medical errors do also occur, making the illness worse.</li></ul><br />Our recent model, <a href="https://arxiv.org/abs/1602.00357">DeepCare</a>, is a deep architecture that directly models the effect of irregular time gap and treatment. It modifies the gates of the popular Long Short-Term Memory (LSTM). "Memory" plays a great role here because there is weak and irrelevant information in the records, and we do not know which one! LSTM is great because it can decide to ignore or keep certain new information as well as forget or keep the old illness memory.<br /><br />What can DeepCare do? You can think of treatment recommendation, disease progression prediction, readmission prediction, attributing the past illness to future event and more. Check out the paper <a href="https://arxiv.org/abs/1602.00357">here</a>.<br /><br /><b>Update references</b><br /><br /><ul><li>DeepCare: A Deep Dynamic Memory Model for Predictive Medicine, Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh, <i>PAKDD'16</i>, Auckland, NZ, April 2016.</li><li>Deepr: A Convolutional Net for Medical Records, Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, Svetha Venkatesh, <i>IEEE Journal of Biomedical and Health Informatics</i>, 2017.</li></ul><br /><br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com1tag:blogger.com,1999:blog-7266907390267288113.post-40009436866527411452016-12-17T13:15:00.000-08:002016-12-27T03:01:09.874-08:00Mixed-type data analysis V: One size fits many with Thurstonian Boltzmann machines<div class="separator" style="clear: both; text-align: center;"><a href="http://www.bandt.com.au/information/uploads/2014/11/Diversity.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.bandt.com.au/information/uploads/2014/11/Diversity.jpg" height="324" width="640" /></a></div><br /><div>This is part V of this series. The previous parts are here:</div><ul><li>Part I: <a href="http://letdataspeak.blogspot.com.au/2012/07/mixed-type-data-analysis-i-overview.html">Overview</a></li><li>Part II: <a href="http://letdataspeak.blogspot.com.au/2014/06/mixed-type-data-analysis-ii-pairwise.html">Pairwise methods</a></li><li>Part III: <a href="http://letdataspeak.blogspot.com.au/2016/12/mixed-type-data-analysis-iii-restricted.html">Mixed-variate Restricted Boltzmann Machines</a></li><li>Part IV: <a href="http://letdataspeak.blogspot.com.au/2016/12/mixed-type-data-analysis-iv.html">Ordinal data</a></li></ul><br />The random variable world is very diverse and fascinating. They are real, count, categorical (e.g., single choice), multi-categorical (e.g., multiple choices), ordered (e.g., rating), rank, preference, boxed and many other forms. Wonder what they have in common?<br /><br />A fundamental observation made in our recent <a href="https://arxiv.org/abs/1408.0055">ICML'13 paper</a> is that, these variables can be expressed using the same form -- a set of inequalities. For example, real variables can receive values as a point, or an interval, which is essentially defined by two inequalities at two sides. A categorical variable can be thought as having the highest "utility" among all choices. A ranking is akin to having an ordered list of "utilities".<br /><br />These kinds of thinking have a long history. The root can be traced back to the 1920s and 1930s under Thurstone. He posited that if we pick one choice over the other, it means the utility of that choice is higher than the other. A popular way to model utility is to assume a latent Gaussian variable, giving rise to probit functions. Later, in the 1950s, Luce derived a generalized formula for categorical choice among several. He found that if the utility follows the <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel distribution</a>, also known as <a href="https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution">Extreme Value Distribution</a>, then the probability of choosing the right choice is proportional to its (exponential of) utility. This is also known as multinomial distribution.<br /><br /><blockquote class="tr_bq">The Gumbel distribution is interesting itself. It is often used to model extreme values, for example, the highest tide of a year. Little surprise that it leads to categorical distribution, which is about choosing the best option. My <a href="http://prada-research.net/~truyen/papers/truyen_etal_aaai12.pdf">AAAI'12 paper</a> studies this distribution in the recommender system context.</blockquote>Now we need a joint tool to glue these inequalities. Again, let us return to Thurstone. For simplicity, assume that there exist latent Gaussian utilities that give rise to these inequalities. What we need now is a joint distribution of all Gaussian variables.<br /><br />Usually we can use multivariate Gaussian distributions. However, since we do not observe the Gaussian directly, inference and estimation are very difficult with many latent variables.<br /><br />We found a particular architecture that is reasonable efficient to sample -- the restricted Boltzmann machine (RBM), a canonical method in the current wave of deep learning. The rest are just tedious details of MCMC inference.<br /><br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-52032880419887045892016-12-17T02:51:00.002-08:002016-12-27T03:04:55.274-08:00Markov random fields: Parameter learning I<div class="separator" style="clear: both; text-align: center;"><a href="http://blogs.oregonstate.edu/payetn/files/2011/03/cropped-reunion_cannes.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://blogs.oregonstate.edu/payetn/files/2011/03/cropped-reunion_cannes.jpg" height="128" width="640" /></a></div><br /><script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>In this post, I will review some practical learning methods which are applicable for large-scale Markov random fields (MRFs), e.g., with millions of parameters. This has some overlaps with <a href="http://letdataspeak.blogspot.com/2008/08/underover-generating-in-learning.html">this post</a>, but explains the ideas in more details.<br /><br />Here I use the term "Markov random fields" to refer to a class of multivariate models whose (density) probability distributions involves a <i>normalisation</i> constant \( Z \) known as the partition function:<br />\[ P(\mathbf{x};\theta) = \frac{1}{Z(\theta)} \Psi(\mathbf{x}) \] <br />where \( \theta \) is model parameter.<br /><br />This also includes <a href="http://en.wikipedia.org/wiki/Conditional_random_field">conditional random fields</a> (CRFs), <a href="http://www.blogger.com/ai.stanford.edu/~koller/Papers/Taskar+al:SRL07.pdf">Markov relational networks</a> (MRN), and to some extent, <a href="http://en.wikipedia.org/wiki/Markov_logic_network">Markov logic networks</a>. The main difference with CRFs and other conditional models is that the partition function is input dependent, that is, we need to write \( Z(\mathbf(y),\theta) \), i.e., we need to compute the partition function for <i>every</i> data point. A similar situation happens when the graph of the MRF is data dependent, e.g., in our <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender.html">previous</a> <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender_21.html">application</a> of MRFs for <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender_24.html">collaborative filtering</a>, we build a MRF for each user to exploit the sparsity in the data. On the other hand, in non-conditional MRF, we need only one partition function. <br /><br />The estimation of \( Z(\theta) \) is general intractable except for the tree-structured MRFs, and indeed if we know how to estimate \( Z(\theta) \), we are likely to be able to infer most quantities (e.g., expectation of some features). For example, <a href="http://www.cs.berkeley.edu/~wainwrig/">Wainwright</a> has demonstrated that by minimising its upper bound, we can arrive at a quite <a href="http://www.eecs.berkeley.edu/~wainwrig/Papers/WaiJaaWil05_Upper.pdf">stable message-passing algorithm</a>, from which marginals can be estimated efficiently. <br /><br />However, the approximate estimation of \( Z(\theta) \) can still be very expensive, especially in the setting of learning because it depends on the model parameters being estimated, i.e., for every parameter updates, we need to recompute \( Z(\theta) \). The trick is to avoid computing it explicitly.<br /><br /><b>The maximum likelihood</b><br /><br />Denote my \(G = (V,E)\) the underlying graph where \( V \) is a set of nodes, and \( E \) is a set of edges. For simplicity, we will assume the log-linear parameterisation for edges and nodes, that is, the model potential function will have the following energy-based form<br />\[ <br /> \Psi(\mathbf{x}) = \exp( - F(\mathbf{x}) )<br />\]<br />where the energy \( F(\mathbf{x}) \) is decomposed into parts<br /> \[<br />F(\mathbf{x}) = -\left( \sum_{i \in V} \sum_k\theta_{ik} f_{ik}(x_i) + \sum_{ij \in E} \sum_m\theta_{ijm} g_{ijm}(x_i,x_j)\right)<br />\]<br />where \(f_{ik}(x_i) \) and \( g_{ijm}(x_i,x_j) \) are feature functions.<br /><br />Note that this model is somewhat overparameterised, but this form will be of much help when we have to deal with rich input features in the conditional setting (aka <a href="http://en.wikipedia.org/wiki/Conditional_random_field">conditional random fields</a>).<br /><br />For now let us assume that the training data is fully observed. We will cover the partially observed variables later, e.g., for the case of <a href="http://en.wikipedia.org/wiki/Boltzmann_machine#Restricted_Boltzmann_machine">restricted Boltzmann machines</a> (RBM).<br /><br />Maximum likelihood estimation starts with the data log-likelihood<br />\[<br />L(\theta) = \frac{1}{D}\sum_{l=1}^D \log P(\mathbf{x}^{(l)};\theta)<br />\]<br />where \( D \) is the number of observations. This criterion is indeed statistically powerful as it leads to consistent and fast estimate in the asymptotic sense.<br /><br />This can be rewritten as<br />\[<br />L(\theta) = - \frac{1}{D}\sum_{l=1}^D E(\mathbf{x}^{(l)}) - \log Z(\theta)<br />\]<br /><br />Note that the partition function does not depend on the the observation in this setting, while it depends on the input in the conditional setting.<br /><br />The first part of the right-hand-side is easier to compute, but the partition function is still there. The nice property of the log-partition function is that it is <i>convex</i> in \( \theta \). Thus if we can find a way to maximise the log-likelihood, we are guaranteed to obtain the global maximum.<br /><br />In a typical optimisation, we would need the gradient of the log-likelihood. For example, for pairwise parameter:<br />\[<br /> \frac{\partial L}{\partial \theta_{ijm}} = \frac{1}{D}\sum_{l=1}^D g_{ijm}(x_i^{(l)},x_j^{(l)})-<br />\sum_{x_i,x_j} P( x_i,x_j) g_{ijm}(x_i,x_j)) <br />\]<br />In the literature, especially those associated with the <a href="http://en.wikipedia.org/wiki/Boltzmann_machine">Boltzmann machines</a>, the computation of the first part of the right-hand-side is often referred to as the <i>clamped phase</i> and the second part as the <i>free phase</i>. This is because in the former, variables are clamped to their specific observations, while in the latter, variables are free to vary according to the distribution.<br /><br />Another view is that the first part is the expectation of feature function with respect to the empirical distribution, while the second part is with respect to the model distribution. The maximum likelihood solution is equivalent to solving the gradient equation, e.g., by matching the two expectations (or moments). Thus, sometimes the term <i>moment-matching</i> is used.<br /><br />We now have three options for learning.<br />1. Approximate objective function<br />2. Approximate Z<br />3. Approximate gradients<br /><br />As for optimisation, we may not need a full evaluation of the log-likelihood, and thus the partition function is not needed. The idea here is to go with the gradient only, and hopefully the learning still progresses as we expect.<br /><br />In the subsequent posts, we will cover some of the following methods:<br /><ul><li>Pseudo-likelihood</li><li>Structured pseudo methods </li><li>Approximate gradients with truncated MCMCs, aka (Persistent) Contrastive Divergence</li><li>Piecewise methods</li><li>Tree-based methods</li><li>Herding</li><li>Label relaxation</li><li>Structured mean-fields</li><li>Approximate perceptron</li><li>Wainwright moment matching</li><li>Conditional versus non-conditional </li><li>Mean-field approximation as Recurrent Neural Networks.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-55516021252020331432016-12-17T02:45:00.001-08:002016-12-27T03:06:28.352-08:00Markov random fields: Structure learning I<div class="separator" style="clear: both; text-align: center;"><a href="http://theprovingground.wdfiles.com/local--files/usc-arch517-exercise1/USC_517_Exercise%201%20Space%20Truss.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://theprovingground.wdfiles.com/local--files/usc-arch517-exercise1/USC_517_Exercise%201%20Space%20Truss.jpg" height="318" width="640" /></a></div><br /><script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>Structure learning in Markov random fields is an important topic since without the correct structures, we can potentially estimate a wrong or unreliable model. Often structures are specified by domain knowledge such as the 2D grid in modelling images or the 1D chain in word labelling. However, there are cases which structures are not available. For example, a word co-occurrence graph can be useful to learn about the document semantics, or a disease correlation network is essential to understand the patient health trajectory. In these situations, there are no predetermined structures and thus we need to estimate from data.<br /><br />There are often an implicit requirement that the network should be sparse and interaction between neighboring variables must be strong. Otherwise, learning will be unreliable since the number of parameters can be quadratic in number of variables. Inference wise, powerful methods such as loopy belief propagation cannot work for densely connected networks, and Gibbs sampling can be too slow to move to reasonably distant states. Practically, learning and inference speed will suffer critically for large-scale problems (e.g., in our application of <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender.html">MRF for collaborative filtering</a>, the number of nodes (items or users) can be hundreds of thousands).<br /><br />Sparse structure learning is therefore needed but it is often difficult to do efficiently because the space of all possible structures is \( 2^{N(N-1)/2} \) for \( N \) variables. This is because we have a set of \( N(N-1)/2 \) possible symmetric edges and each structure corresponds to a subset of this set (empty set is possible as it means variables are totally independent). To avoid searching through this combinatorial space, often approximate methods are needed.<br /><br />There are a number of ways to do so. First, we can apply simple <i>screening</i> tests, e.g., by measuring the correlation between variables (e.g., Pearson's), Those pairs whose correlation falls below a threshold will be removed. There are a number of drawbacks here: First, we do not often know which measure is the most appropriate for the task. Second, the method considers pairs independently without paying attention to related pairs. And most importantly, the method is not connected to any objective function, e.g., data likelihood. However, due to its simplicity, this method can be widely applicable in certain problems. For example, in <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender.html">our MRF application to collaborative filtering</a>, this method is quite sufficient.<br /><br />Another way is to <i>embed</i> the edge selection process into learning. The key is to have a surrogate objective function which can be evaluated efficiently. <br /><br /><b>Pseudo-likelihood with edge shrinkage</b><br /><br />A recent embedded selection method exploits the power of pseudo-likelihood and shrinkage method such as Lasso. Assume that the model energy has the following additive form<br />\[ E(\mathbf{x},\gamma,\theta) = - \left( \sum_i \phi_i(x_i,\gamma) + \sum_i\sum_{j>i} \theta_{ij} \psi_{ij}(x_i, x_j) \right) \]<br />We want to optimise the following objective function<br />\[ F(\gamma,\theta) = \sum_i \log P(x_i | \mathbf{x}_{\neg i},\gamma, \theta) - \lambda \Omega(\theta) \]<br />where \( \theta \) is the edge parameter and \( \gamma \) is the rest of parameters. \( \Omega(\theta) \) is the penalty for non-zero values, and \( \lambda > 0\) is the penalty factor.<br /><br />The role of pseudo-likelihood is clear -- this is one of the consistent methods which is feasible to perform exactly. Now each edge is associated with a parameter, usually in a log-linear way, and thus a zero parameter would mean no edge. The strategy is then to actively shrink the edge parameters are until they are zeros or ignorable.<br /><br />This is achieved by placing a sparsity prior on the edge parameter. Ideally the prior is the exponential of the negative count of non-zeros edge parameters, but this leads to non-smooth optimisation problem. A solution is to relax the count to the sum of absolute parameter values.<br />\[ \Omega(\theta) = \sum_i\sum_{j>i} |\theta_{ij}| \]<br />The sparsity is still possible to obtain but the optimisation is much easier.<br /><br />For more detail, see our recent work: <a href="https://arxiv.org/abs/1602.02842">Collaborative filtering via sparse Markov random fields</a>.<br /><br /><ul></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-37141567393776316202016-12-16T18:35:00.001-08:002016-12-27T03:10:22.295-08:00Mixed-type data analysis IV: Representing multivariate ordinal data<div class="separator" style="clear: both; text-align: center;"><a href="https://blogs.baylor.edu/madison_farrell/files/2015/04/3-5-stars-2bmbspe.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="148" src="https://blogs.baylor.edu/madison_farrell/files/2015/04/3-5-stars-2bmbspe.png" width="640" /></a></div><br />Multivariate ordinal data is popular when human judgement is involved. For example, in collaborative filtering, we rate multiple items, each of which with a number of stars. In a typical survey, we provide ordinal assessment of many things, ranging from feeling of the day (<i>happy, OK, sad</i>) to the current situation of worldwide security (<i>safe, OK, dangerous</i>). Since these come from the same person, they are correlated, and thus we need to model multiple ordinal variables simultaneously. This blog will present an overview of the area.<br /><br />Much of existing work, however, is focusing on single ordinal variable, typically under the umbrella of "ordinal regression". How about multiple ordinal variables?<br /><br />There are several approaches. One way is to assume that ordinal data are just quantized version of an underlying continuous variable. Thus, each ordinal value corresponds to an interval of the underlying variable. This is intuitive, for example, when we says salary levels A, B and C, and they refer to ranges like A = $[50K,60K], B = $[60K,70K] and C = $70K+.<br /><br />This thinking is convenient, especially when the underlying variable is assumed to be Gaussian. We can build a multivariate Gaussian distribution. The problem is that we will never observe these Gaussian variables directly but indirectly through the intervals dictated by the ordinal levels. Things get more interesting when the intervals are unknown. The only requirement is that the intervals have to be consecutive (i.e., no gaps). With this, we need to estimate the interval boundaries from data.<br /><br />This is basically the main idea behind <a href="http://prada-research.net/~truyen/papers/acml12_recsys_revised.pdf">this paper</a> published in ACML'12. However, we go further because the multivariate Gaussian distributions are hard to sample from under interval constraints. We leverage Gaussian-Bernoulli Restricted Boltzmann Machines instead. This makes MCMC sampling quite efficiently. The RBM style can also make it easy to extend to model the matrix with row and column RBMs linked together.<br /><br />The other way is to use log-linear model, treating the ordinal as categorical but with log-linear constraints among the ordered levels. This is the idea behind <a href="https://arxiv.org/abs/1205.2611">this work</a> published in UAI'09.<br /><br /><b>Updated references</b>:<br /><br /><ul><li>Ordinal Boltzmann Machines for Collaborative Filtering. Truyen Tran, Dinh Q. Phung and Svetha Venkatesh. In <i>Proc. of 25th Conference on Uncertainty in Artificial Intelligence</i>, June, 2009, Montreal, Canada. </li><li>A Sequential Decision Approach to Ordinal Preferences in Recommender Systems, Truyen Tran, Dinh Phung, Svetha Venkatesh, in <i>Proc. of 25-th Conference on Artificial Intelligence (AAAI-12)</i>, Toronto, Canada, July 2012</li><li>Cumulative Restricted Boltzmann Machines for Ordinal Matrix Data Analysis, Truyen Tran, Dinh Phung and Svetha Venkatesh, in <i>Proc. of. the 4th Asian Conference on Machine Learning (ACML2012)</i>, Singapore, Nov 2012.</li><li>Ordinal random fields for recommender systems, Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang, <i>ACML'14</i>, Nha Trang, Vietnam, Nov 2014.</li></ul><br /><br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-75776902597881034692016-12-10T23:42:00.000-08:002016-12-27T03:13:06.698-08:00Markov random fields for recommender systems IV: Recommending a set<div class="separator" style="clear: both; text-align: center;"><a href="http://www.thehighlandtimes.com/shopping-basket-007.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.thehighlandtimes.com/shopping-basket-007.jpg" height="384" width="640" /></a></div><br /><script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>Let us start with the standard setting in collaborative filtering: We are given an incomplete rating matrix and one usual task is to fill those missing entries (aka rating prediction), and another task is to produce a ranked list of new items for each user (or a ranked list of new users for each item).<br /><br />In this post, we propose to view the two tasks slightly: We want to recommend a set of items for each user, ideally with rating per each item. This differs from the standard rating prediction in that the set itself is unknown. List ranking is now a by product.<br /><br />There are two variants: One is we do not need to predict the ratings in the set, and another is a ranked list of subsets.<br /><br />Let us start with the first variant, which is applicable in many real-world situations. When you go shopping at a grocery for the whole week, you usually get a basket full of items of varied qualities. Some items are the same product, but in general those items are different, and together they satisfy the nutrition need and food quality for the family as well as the budget constraints. Another situation is travel package: There are many constraints on the routes, airlines, ticket costs & promotions, waiting time, visa applications, the business deadlines, transits, and luggage allowances. Yet another example is music consumption: You don't always need all best rock pieces but sometimes you want a good mix of them even though some pieces are not very high quality.<br /><br />Let us assume for now that we will deal only with non-repeated items. The problem is now to specify the best set out of many possible sets. In fact, if there are \( N \) items, there will be \( 2^N -1 \) possible non-empty sets.<br /><br />One simple solution is to first rank all the items, and then estimate a threshold from which high quality items will be selected. This will sometimes work if items are homogeneous, but this will fail if items are complementary: In the case of shopping baskets, you will end up with high quality items you won't need. Even if items are fairly homogeneous, some items tend to go together more often than with other items:<br /><br />A better solution should deal with the item set directly.<br /><br />An immediate question is how can we judge the set quality?<br /><br />(To be continued)<br /><br /><br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-61960513817359089562016-12-10T23:40:00.001-08:002016-12-27T03:16:47.408-08:00On terminologies in machine learning and other related areas<div class="separator" style="clear: both; text-align: center;"><a href="http://static1.businessinsider.com/image/51c895abecad046c60000000/9-maps-that-show-how-americans-speak-in-different-regions.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://static1.businessinsider.com/image/51c895abecad046c60000000/9-maps-that-show-how-americans-speak-in-different-regions.jpg" height="480" width="640" /></a></div><br /><div>Every field speaks their own dialect. Some are here:</div><ul><li>Discrete outcome regression, multiclass logistic regression, softmax, maximum entropy (MaxEnt), multinomial logit, categorical data analysis.</li><li>Multiple response, multivariate outcome, multiple outcome regression, multitask learning</li><li>Ranking with co-variates, learning-to-rank</li><li>Random effect models, collaborative filtering, recommender systems, matrix factorisation</li><li>Subset selection, feature selection</li><li>Binary regression, logistic regression, dichotomous regression</li><li>Generalized linear additive models, neural networks</li><li>Ordinal variables, ordered categorical variables</li><li>Spatial models/processes with co-variates, conditional random fields</li><li>Multiple binary outcomes, multivariate logistic, multilabel learning, multivariate probits</li><li>Features, independent variables, input variables, risk factors, explanatory variables</li><li>Models, machines, networks, architectures</li><li>Structured outputs, networked classifier, collective classification, conditional random fields</li></ul><br />(To be continued).Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-57972634058068183912016-12-10T23:30:00.004-08:002017-01-20T17:52:57.749-08:00Tutorial on deep learning and applications in non-cognitive domains<div class="separator" style="clear: both; text-align: center;"><a href="http://blogs.yis.ac.jp/21muil/files/2015/01/tutorial-13mi946.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://blogs.yis.ac.jp/21muil/files/2015/01/tutorial-13mi946.jpg" height="426" width="640" /></a></div><br />In Dec 2016, I delivered a tutorial on deep learning and its applications in non-cognitive domains at <a href="http://ausdm16.ausdm.org/">AusDM'16</a>. It covers:<br /><br /><ul><li><i>Basic concepts of deep learning</i>: <a href="http://letdataspeak.blogspot.com.au/2016/12/machine-learning-in-three-lines.html">best practices in machine learning</a>, three architectures (feedforward, recurrent and convolutional nets), enablers of deep learning (adaptive stochastic gradient, skip-connections, gating, dropouts).</li><li><i>Current practices in data mining</i>: programming frameworks, tricks of the trade, case-studies in non-cognitive domains (<a href="http://letdataspeak.blogspot.com.au/2016/12/caring-deeply-intervened-long-short.html">healthcare</a>, software engineering & anomaly detection). </li><li><i><a href="http://letdataspeak.blogspot.com.au/2016/12/machine-learning-four-years-after.html">Advanced topics for research</a></i>: <a href="http://letdataspeak.blogspot.com.au/2016/12/everything-old-is-new-again-deep.html">deep learning in relational domains</a>, <a href="http://letdataspeak.blogspot.com.au/2016/12/30-years-of-restricted-boltzmann.html">unsupervised learning</a>, memory & attention, <a href="http://letdataspeak.blogspot.com.au/2016/12/making-dent-in-machine-learning-or-how.html">how to position yourself in this highly dynamic game</a>.</li></ul>The materials are accessible <a href="https://truyentran.github.io/ausdm16-tute.html">here</a>.<br /><br /><br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com1tag:blogger.com,1999:blog-7266907390267288113.post-65663671207795881552016-12-10T23:26:00.000-08:002017-01-28T17:10:19.519-08:00Mixed-type data analysis III: Mixed-variate Restricted Boltzmann machines<div class="separator" style="clear: both; text-align: center;"><a href="http://www.zastavki.com/pictures/1920x1200/2012/Archive_Miscellaneous_Mixing_colors_033853_.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://www.zastavki.com/pictures/1920x1200/2012/Archive_Miscellaneous_Mixing_colors_033853_.jpg" height="400" width="640" /></a></div><br /><div>This is part III of this series. The previous parts are here:</div><ul><li>Part I: <a href="http://letdataspeak.blogspot.com.au/2012/07/mixed-type-data-analysis-i-overview.html">Overview</a></li><li>Part II: <a href="http://letdataspeak.blogspot.com.au/2014/06/mixed-type-data-analysis-ii-pairwise.html">Pairwise methods</a></li></ul><script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>It is clear from the <a href="http://letdataspeak.blogspot.com/2014/06/mixed-type-data-analysis-ii-pairwise.html">previous post</a> that we need a better way to represent arbitrary number of types without resorting to ad-hoc treatments. <a href="http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">Restricted Boltzmann machines</a> (RBMs) offering an unified way, leading to what is called <i><a href="http://www.jmlr.org/proceedings/papers/v20/tran11/tran11.pdf">mixed-variate RBM</a></i>.<br /><br />A RBM is a Markov random field with two layer of nodes, one for visible variables, the other for hidden variables. For our purpose, we shall assume that hidden variables are binary, and visible variables are typed. Standard RBMs often deal with only a single visible type, mostly binary and Gaussian. Other less popular types are <a href="https://www.google.com.au/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CBwQFjAA&url=http%3A%2F%2Fwww.utstat.toronto.edu%2F~rsalakhu%2Fpapers%2Fsemantic_final.pdf&ei=smiiU-efN9bh8AXOyYCoBQ&usg=AFQjCNFd-45a64hRfR97AKq27vcnNzU47A&sig2=upmK0R-kv-jscA52kWVz6Q&bvm=bv.69411363,d.dGI">count</a>, <a href="http://www.cs.toronto.edu/~fritz/absps/netflix.pdf">categorical</a>, <a href="http://truyen.vietlabs.com/papers/uai09_final.pdf">ordinal</a>, intervals. There are several treatments of count: Poisson, constrained Poisson and replicated multinomial by <a href="http://www.cs.toronto.edu/~rsalakhu/">Ruslan Salakhutdinov</a> and others.<br /><br />Regardless of the types, the principles are the same: The model is a Boltzmann machine (clearly from its name) that admits a Boltzmann distribution (a.k.a. exponential family, Gibbs distribution, log-linear, etc). This form is flexible, virtually <a href="http://metaoptimize.com/qa/questions/3599/design-of-rbm-energy-functions-how-to-make-use-of-multi-parameter-exponential-distributions">most known types</a> can be expressed easily. The RBMs are one of the hot kids these days, thanks to the hype in <a href="http://en.wikipedia.org/wiki/Deep_learning">deep learning</a> driven by big guys such as <a href="http://www.wired.com/2014/01/geoffrey-hinton-deep-learning/">Google</a>, <a href="http://gigaom.com/2014/03/18/facebook-shows-off-its-deep-learning-skills-with-deepface/">Facebook</a>, <a href="http://gigaom.com/2013/11/01/the-gigaom-guide-to-deep-learning-whos-doing-it-and-why-it-matters/">Microsoft</a> and <a href="http://deeplearning.net/2014/05/17/andrew-ng-is-hired-by-baidu/">Baidu</a>.<br /><br />It is natural that we can plug all types together, <i>all share the same hidden layer</i>. Then all the problems with mixed-type suddenly disappear! This is because now the interactions are limited to types and the hidden layer, which is usually binary. No need for all the types to mess up with each other directly.<br /><br />Although it appears effortless, and we simply expect it to work, as we have shown in the mixed-variate RBM, there are several issues that are worth discussing.<br /><br />First, we have intended only to work with primitive types, regardless of how the types arise. However, moving up to one more layer, we may be worried about multiple modalities rather than types. For example, an image with tags has two modalities, and they represent different kinds of data at different levels of semantics. Typically an image is considered as a vector of pixels, and thus it is natural that we can use either Gaussian (for continuous pixel intensity) or Beta types (for bounded intensity).<br /><br />Second, we can efficiently estimate the data density up to a constant, using a quantity known as "free-energy". This offers a natural way for <a href="https://arxiv.org/abs/1608.04830">anomaly detection</a>.<br /><br />Third, since the representation layer is all binary, we can indeed stack multiple RBMs on top of the first layer, making the entire structure a <a href="https://arxiv.org/abs/1610.06249">mixed-variate Deep Belief Network</a>.<br /><br />Fourth, since the estimation of the posterior is straightforward, it paves a way to <a href="http://prada-research.net/~truyen/papers/icme13_142.pdf">learn distance metric</a>. The distance are useful in many tasks including information retrieval, k-nearest neighbor classification and <a href="http://prada-research.net/~truyen/papers/acml13_nguyen.pdf">clustering</a>. These capabilities are not possible with other mixed-type modeling methods.<br /><br /><b>Updated references</b><br /><ul><li>Multilevel Anomaly Detection for Mixed Data, K Do, T Tran, S Venkatesh, <i>arXiv preprint arXiv: 1610.06249.</i></li><li>Learning deep representation of multityped objects and tasks, Truyen Tran, D. Phung, and S. Venkatesh, <i>arXiv preprint arXiv: 1603.01359</i>.</li><li>Outlier Detection on Mixed-Type Data: An Energy-based Approach, K Do, T Tran, D Phung, S Venkatesh, <i>International Conference on Advanced Data Mining and Applications (ADMA 2016)</i>.</li><li>Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13)</i>, Gold Coast, Australia, April 2013.</li><li>Learning sparse latent representation and distance metric for image retrieval, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of IEEE International Conference on Multimedia and Expo (ICME)</i>, San Jose, California, USA, July 2013.</li><li>Mixed-Variate Restricted Boltzmann Machines, Truyen Tran, Dinh Phung and Svetha Venkatesh, in Proc. of. the <i>3rd Asian Conference on Machine Learning (ACML2011)</i>, Taoyuan, Taiwan, Nov 2011.</li></ul>Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com1tag:blogger.com,1999:blog-7266907390267288113.post-72259412337818207572014-06-17T21:00:00.002-07:002016-12-17T18:55:37.885-08:00Mixed-type data analysis II: Pairwise models<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> Analysing mixed-type data is hard. A popular way is to consider a pair of types, e.g., continuous and binary. A simple method is to utilise the standard rule \( P(A,B) = P(A)P(B|A) \), were \(A\) can be continuous and \(B\) can be binary. In this particular case, \(P(A)\) can be a Gaussian distribution and \(P(B|A)\) a logistic distribution, where A plays some role in the location, intercept or scale parameters of \(P(B|A)\).<br /><br />For example, you wonder the relation between IQ (\(A\)) and having a top job (\(B\)). The distribution of IQ \(P(A)\) can be easily estimated from the population, but the conditional distribution of having a top job given your IQ (any nothing else) \(P(B|A)\) requires a bit of thinking. So you may want to consider IQ as a feature and estimate its weight. You may find some positive weight there, but it may not be as strong as you might wish. You may then question such a linear relationship and start modeling a polynomial equation, or chopping up the IQ into ranges (oops, this creates another issue, really). For example, it could be the case that after a certain IQ threshold, job success does not really depend on IQ anymore (possibly not every job is like that, especially in theoretical physics and maths).<br /><br />To compute the conditional distribution of IQ given job success \(P(A|B)\) we need the Bayes' rule: \(Q(A|B) = P(A)P(B|A) / P(B)\), where \(P(B)\) is a normalising factor.<br /><br />Alternatively, you can use the same rule, but differently: \(P(A,B) = P(B)P(A|B)\). Again \(P(B)\), the proportion of working people with top jobs, can be estimated from some national statistics. The IQ distribution among successful people P\((A|B)\). You may find that \(P(A|B)\) may not even be a Gaussian but rather skewed to the right. I don't know for sure, just a guess.<br /><br />It is interesting to compare \(Q(A|B)\) and \(P(A|B)\) using the two methods. In a perfect world, they should be the same, but in practice they may not. This is because of the assumptions in the model specification and the unreliable estimation. This is one problem.<br /><br />Another problem with this approach is that it is limited to pairs and cannot be easily generalised to more than two types. Even more, the number of models will be quadratic in number of variables. And finally, it does not offer an easy way for further analysis using existing tools (such as visualisation in 2D or 3D).<br /><br />A better way is to imagine there exists some hidden variables that govern the generation of these types. Given these hidden variables, types are conditionally independent, and thus we don't have to really model the interaction among types. We have to, however, just model the interaction between each type and the hidden variables. These are techniques behind our recent attempts:<a href="http://www.jmlr.org/proceedings/papers/v20/tran11/tran11.pdf"> mixed-variate restricted Boltzmann machines</a> and <a href="http://www.jmlr.org/proceedings/papers/v28/tran13.pdf">Thurstonian Boltzmann machines</a>. These are the subjects of subsequent posts.<br /><br /><br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0tag:blogger.com,1999:blog-7266907390267288113.post-2248334202172871072012-07-07T10:38:00.002-07:002016-12-18T16:52:48.246-08:00Mixed-type data analysis I: An overviewThis post series is about analysing multiple data types. By <i>types</i>, we mean primitive natures such as real-valued, interval, counts, rank, binary, ordinal, categorical and multicategorical types. It is more detailed than simply saying data is numerical or nominal. For each of these types, we can still subdivide into finer sub-types, depending on the nature of the generating processes. For example, real-valued variables can be Gaussian or logistic (mean-value), Gumbel (extreme-value), or rate (exponential), etc.<br /><br /><i>Type handling</i> is indeed important in many statistical application areas such as healthcare, business and engineering. There are good reasons for performing adequate type modelling and analysis. First, interpretation and insights are often important goals, sometimes more important than final (predictive) performance. Thus we need to understand the generative process that governs the generation of data types. For example, modelling a queuing process is critical in many applications, ranging from networking to manufacturing to supermarket scheduling. The number of arrivals within a period of time is often approximated by a <a href="http://en.wikipedia.org/wiki/Poisson_distribution">Poisson distribution</a> while the time between two arrivals by an exponential distribution. In collaborative filtering, <a href="http://letdataspeak.blogspot.com/2012/05/on-ordinal-modelling-of-rating-matrices.html">ordinal modelling of ratings</a> should be preferred to Gaussian treatments, because it offers insights of why and how a particular rating is chosen for a particular item by a particular user.<br /><br />Second, even if predictive performance is the main goal, it is often the case that accurate modelling of data types would make it easier to achieve the goal.<br /><br /><i>Type-specific phenomena</i> have been very well-studied, and now it is a good time to pay more attention to mixed types because real world processes are rarely isolated and they are likely to contain more than one type. Indeed, the problem is <a href="http://www.google.com.au/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CFUQFjAA&url=http%3A%2F%2Fwww.jstor.org%2Fstable%2F10.2307%2F2237755&ei=UGj4T_URiZSJB_rU1YAH&usg=AFQjCNHTbVxVWk1Vy7OH7jYbLefqZfq3uA&sig2=AcvQbaUZTgKYq9Yy8NhOng">old</a>, and there is a <a href="http://math.ucalgary.ca/~adeleon/mixed_outcomes_data.pdf">rich literature</a> on modelling mixed data, but they often involve two or three types (e.g., continuous and categorical). Handling <i>arbitrarily many types</i> seems not to be adequately addressed in the existing literature. This series of post aim to partly fill that gap.<br /><br />As an example of the situation when multiple data types appear simultaneously, let us take a typical multimedia object, where we can always have multiple ways of viewing it:<br /><ul><li>If it is described by a set of tags, then the type can be binary (we care about the presence and appearance of a word) or categorical (we care about why a particular word is present and not the others),</li><li>If it is described by a bag of visual words, then the type can be counts, or repeated categorical.</li><li>If it is described by a vector of histograms, then the type is real-valued.</li></ul>Indeed, in the publicly available dataset like <a href="http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm">NUS-WIDE</a>, all of these types are present, and I am not aware of any work in multimedia or machine learning making use of these types <i>explicitly</i>.<br /><br />In healthcare, a patient can be characterised by various clinical and background information: number of previous admissions (count), age (continuous, or ordinal as in interval), height (continuous), test results (binary outcome or continuous measures), etc. A full account should call for an effective treatment of all the types at the same time. However, <a href="http://math.ucalgary.ca/~adeleon/mixed_outcomes_data.pdf">it seems</a> that the development is till in an early stage.<br /><br />In social sciences and business, surveys are one of the primary tools to measure the opinions and interests of people. The scale and impact are huge: Millions of surveys are conducted on millions of people each year. In a typical survey, a mixed bag of question types is used, and the answers often contain stated facts, preferences and choices. These can then be translated into the types such as binary (yes or no), counts (previous admissions), rank ( of political candidates), ordinal (strongly dislike, dislike, like), pairwise preferences, etc. Unfortunately, the literature of survey analysis is fully of advice about statistics for each data type and not so much about simultaneous modelling and analysis of all of these types.<br /><br /><b>Type-specific analysis in machine learning</b><br /><br /><i>Type modelling</i> seems like an unsexy topic to many machine learning people. In the early days, types are explicitly handled in <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">decision trees</a> (e.g., popular in the 80s and early 90s): Nodes are split depending critically on their types, e.g., real-valued versus nominal. However, decision trees seem to be the problem of the day before yesterday (this is not to say they are not important anymore -- on the contrary, ensemble of tress is still one of the most competitive methods in predictive arts and they are widely used in practical data mining). There was not much interest in other developments, e.g., with neural networks, perceptrons and linear models or the more recent kernel methods.<br /><br />While it is OK in practice to convert multiple data types into some unified form such as real-valued or binary (the process is also known as <i>data coding</i>), it may be more desirable to deal with data types directly. When the types are in the input, the coding may or may not reduce the amount of predictive information because coding can sometimes help to linearize the data, making easier for linear algorithms to do their job. However, if we want to understand the structure of the input (e.g., as in semi-supervised, self-taught and transfer learning), ignoring the data types can be problematic because the coding process partially destroys the type-specific information. For example, in a bag-of-word representation of a document, it may be more natural to deal with the counting using Poisson models than converting counts into real-values (which are an approximation), binary indicators (which lose information) or replicated binary indicators (which increase the model size).<br /><br />When the types are in the output, there have been two directions: One of to model the type directly, and <a href="http://hunch.net/~jl/projects/reductions/reductions.html">the reductionist approach</a> is to convert type-specific problems into the more popular binary problems.<br /><br />The former case is quite interesting, and in doing so, ML would invent a few more terminologies along the way:<br /><ul><li>For continuous outputs, the natural solution is using some regression techniques, such as <a href="http://en.wikipedia.org/wiki/Tikhonov_regularization">ridge regression</a>. However, this assumes the error structure is normal while it is not always necessarily so.</li><li>For binary outputs, this is really the canonical problem in machine learning. </li><li>For ordinal outputs, the problem is often called <a href="http://en.wikipedia.org/wiki/Ordinal_regression">ordinal regression</a>.</li><li>For counting outputs, a standard solution is <a href="http://en.wikipedia.org/wiki/Poisson_regression">Poisson regression</a>.</li><li>For categorical outputs, this is often called multiclass problems, and now there are many solutions, under different names, e.g., multiclass SVM, maximum entropy, multinomial logit, multinomial probit.</li><li>For rank outputs, the problem is called label ranking.</li><li>For multiple binary outputs, the problem is often known as multilabel learning. </li></ul>The main contribution from ML is the introduction of the notion of margins and the analysis of generalization and stability of algorithms, as well as specific algorithms such as decision trees, neural networks, boosting, random forests.<br /><br />In the latter case, where we transform the complex problems into simpler ones, we may convert ordinal regression and ranking into a series of binary classifications from which existing methods such as SVMs can be reused. This can be an useful way since we don't have to invent new ML algorithms, this may also make the problem harder than necessary because we need to recombine binary predictors into the original form.<br /><br />With regard to mixed-type analysis, its presence in machine learning is very limited, and I am only aware of a couple of attempts: <a href="http://www.cs.ubc.ca/~murphyk/Papers/nips2010.pdf">here</a> and <a href="http://truyen.vietlabs.com/papers/tran_phung_venkatesh_acml11.pdf">here</a>. The latter is ours, and we will cover in the subsequent posts.<br /><br /><b>Some research challenges</b><br /><br />An important goal of mixed type modelling is to derive a joint distribution of all types. Ultimately, we would want a <i>representation scheme</i> that aids understanding of interaction among types, and at the same time, supports efficient inference (e.g., estimating marginals or expectations) and learning (e.g., structure and parameters).<a href="http://en.wikipedia.org/wiki/Graphical_model"> Probabilistic graphical models</a> would be an excellent candidate here.<br /><br />Sometimes, a full joint distribution is not needed if we care about prediction of some output variables given the rest. In machine learning, mixed-type outputs would be an instance of multitask learning where each output is a task. Thus, estimating a shared structure between output types (given arbitrary input types) would be an interesting direction here.<br /><br />In some applications, on the other hand, joint distribution modelling might not be the ultimate goal. For example, we would need just a similarity measure between two mixed-type instances (e.g., in k-nearest neighbour classification, k-means clustering and kernel-methods), or a good visualisation of all instances in 2D (this often boils down to similarity estimation). This is a challenging problem because each type would lead to a measure which may not compatible with another in scale or in interpretation. <br /><br />And finally, often we want to make use of some handy data analysis tools such as PCA/SVD/ICA/CCA/SVM/GLM on top of mixed data. Since these tools do not work on mixed data, the problem now is to turn mixed data into some real-valued vector representations which preserve as much as information of the original mixed-data as possible. In other words, we now have a problem of representation learning from mixed-data.<br /><br />Can these challenges be overcome by a unified framework? In the subsequent posts, we would attempt to provide some early clues. Stay tuned.<br /><br /><b>Updated references</b>:<br /><br /><ul><li>Mixed-Variate Restricted Boltzmann Machines, Truyen Tran, Dinh Phung and Svetha Venkatesh, in <i>Proc. of. the 3rd Asian Conference on Machine Learning (ACML2011)</i>, Taoyuan, Taiwan, Nov 2011.</li><li>Embedded Restricted Boltzmann Machines for Fusion of Mixed Data Types and Applications in Social Measurements Analysis, Truyen Tran, Dinh Phung, Svetha Venkatesh, in <i>Proc. of 15-th International Conference on Information Fusion (FUSION-12)</i>, Singapore, July 2012.</li><li>Latent patient profile modelling and applications with Mixed-Variate Restricted Boltzmann Machine, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13)</i>, Gold Coast, Australia, April 2013.</li><li>Thurstonian Boltzmann machines: Learning from multiple inequalities, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of 30th International Conference in Machine Learning (ICML’13)</i>, Atlanta, USA, June, 2013.</li><li>Learning sparse latent representation and distance metric for image retrieval, Tu D. Nguyen, Truyen Tran, D. Phung, and S. Venkatesh, In <i>Proc. of IEEE International Conference on Multimedia and Expo (ICME)</i>, San Jose, California, USA, July 2013.</li><li>Outlier Detection on Mixed-Type Data: An Energy-based Approach, K Do, T Tran, D Phung, S Venkatesh,<i>International Conference on Advanced Data Mining and Applications (ADMA 2016)</i>.</li><li>Multilevel Anomaly Detection for Mixed Data, K Do, T Tran, S Venkatesh, <i>arXiv preprint arXiv: 1610.06249</i>.</li></ul><br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com3tag:blogger.com,1999:blog-7266907390267288113.post-41071260332470718662012-06-24T23:14:00.000-07:002012-07-01T21:33:23.080-07:00Markov random fields for recommender systems III: Embedding ordinal matrix factorisation<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>In the previous posts I have described how Markov random fields (MRFs) can be used to represent the <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender.html">local structures</a> and <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender_21.html">latent spaces</a> in recommendation. The advantage of MRFs is in its disciplined interpretation and inference. However, the employment of log-linear modelling in MRFs does not enable a simple <a href="http://truyen.vietlabs.com/papers/uai09_final.pdf">treatment of ordinal ratings</a>, which we <a href="http://letdataspeak.blogspot.com/2012/05/on-ordinal-modelling-of-rating-matrices.html">have claimed</a> to be of great importance. <br /><br />The goal of this post is to develop a MRF-based model that achieves both the representational power of the probabilistic graphical models and and the ease of modelling <a href="http://truyen.vietlabs.com/papers/truyen_etal_aaai12.pdf">ordinal rating</a> by <a href="http://dl.acm.org/citation.cfm?id=2043956">matrix factorisation</a>. In short, our new model combines a MRF and an ordinal matrix factorisation (OMF) method. For simplicity, we shall focus only on the MRF with local structures. This is because the latent spaces have been captured by the OMF. Of course, nothing will prevent us from modelling the latent spaces by the MRF itself because the two approaches (OMF and MRF) are very different, one is linear and another is non-linear.<br /><br /><b>The OMF</b><br /><br />Recall that the OMF aims to model an ordinal distribution \( Q(r|u,i) \). The standard way is to assume that there exists a latent utility function \( f(u,i) \) that captures how much value the user \( u \) judges the item \( i \) . For simplicity we assume that the utility has the linear, parametric form \( f(u,i) = a_u + b_i + W_u' H_i \), although nonlinear and nonparametric options can be possible. The assumption by <a href="http://people.csail.mit.edu/jrennie/papers/other/mccullagh-ordinal-80.pdf">McCullagh</a> is that the rating is chosen based on the interval to which the utility belongs:<br />\[ r_{ui} = l \mbox{ if } f(u,i) \in (\theta_{l-1},\theta_{l}] \]<br />for \( l < L \) and<br />\[ r_{ui} = L \mbox{ if } f(u,i) > \theta_{L-1} \]<br />where \( L \) is the number of ordinal levels. Here usually we fix \( \theta_{0} = -\infty \). <a href="http://letdataspeak.blogspot.com.au/2012/05/on-ordinal-modelling-of-rating-matrices.html">Other assumptions</a> are also possible but this is by far the most popular. The probability of receiving a rating is therefore<br />\[ Q(r=l|u,i) = \int_{\theta_{l-1}}^{\theta_l}P(f(u,i)| \theta) = F(\theta_l) - F(\theta_{l-1}) \]<br /> where \( F(\theta_l) \) is the cumulative distribution evaluated at \( \theta_l \). The thresholds can be prameterised to depend on user and item but care must be taken to ensure that they are monotonic in \( l \), e.g., \( \theta_{ui,l} = \theta_{ui,l-1} + e^{\eta_{u,l} + \gamma_{i,l}} \).<br /><br />Depending on the choice of the distribution over \( f(u,i) \) we may end up with the logistic form of \( F(\theta_l) \) or its probit alternatives. See our recent AAAI'12 paper for a <a href="http://truyen.vietlabs.com/papers/truyen_etal_aaai12.pdf">comparison</a>. <br /><br /><b>The hybrid model of MRF and OMF</b><br /><br />Now we wish to build a MRF model by multiplying the point-wise OMF and the item-item interactions:<br />\[ P(\mathbf{r}|u) \propto \Psi_u(\mathbf{r}) \prod_{i \in I(u)} Q(r_{ui}|u,i) \]<br />where \( \Psi_u(\mathbf{r}) > 0 \) is the potential function capturing the interaction among items and \( \mathbf{r} \) is the set of items rated by user \( u \). In essence this model promotes the agreement between the latent aspects discovered by the OMF and the local structures in the data. The usual form of the potential function is factorised into pairwise potentials<br />\[ \Psi_u(\mathbf{r}) = \exp \left(\sum_i\sum_{j>i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right) \]<br />where \( \beta_{ijm} \) is parameter and \( g_m(r_i,r_j) \) is feature function. For example, the feature function can be just indicators<br />\[ g_{lk}(r_i,r_j) = \delta(r_i,l) \delta(r_j,k) \]<br />as in standard treatment of categorical variables; or bilinear association<br />\[ g(r_i,r_j) = (r_i - \bar{r}_i)( r_j - \bar{r}_j)\]<br />as in usual Gaussian assumption where \( \bar{r}_i \) is the mean rating for item \( i \); or the metric cost<br />\[ g(r_i,r_j) = |r_i - r_j| ^ p \]<br />for \( p > 0\).<br /><br /><b>Model estimation</b><br /><b></b><br />Like most MRFs we cannot estimate the parameters easily using the standard maximum likelihood approach. Currently, there are two most effective alternatives: The pseudo-likelihood of Besag and the stochastic gradient using truncated MCMC. The pseudo-likelihood leads to exact computation of the loss function and its gradient with respect to parameters, and thus may be preferred by some practitioners. The MCMC-based methods may, on the other hand, lead to better estimation given enough time.<br /><br />Let us present here the case for pseudo-likelihood. We need to estimate the local conditional distribution<br />\[ P(r_i | \mathbf{r}_{\neg i}) \propto Q(r_{ui}|u,i) \exp \left(\sum_{j\in I(u), j \ne i}\sum_m \beta_{ijm} g_m(r_i,r_j)\right) \]<br />Maximising the log of product of all local conditional distributions with respect to parameters will give us the result.<br /><br /><b>Rating prediction</b><br /><br />Predicting the new rating is easy, we just need to compute \( P(r_j | \mathbf{r}) \) in the same way as we have done in the pseudo-likelihood and search for the most probable rating (for metrics such as MAE), or its expectation (for metrics such as RMSE).<br /><br /><b>Joining user-based models and item-based models</b><br /><br />As we have always argued, each item should has its own life, much in the same way as the users. Thus we can build item-based model with user-user interactions. Since user and item are complementary entities, the two model types can be fused into one. The idea is quite simple: We just need to multiply all models together and remove duplicates (because the OMF components will appear twice):<br />\[ P(\mathbf{R}) \propto \left[\prod_u\Psi_u(\mathbf{r}_u)\right] \left[\prod_i\Phi_i(\mathbf{r}_i)\right] \prod_{u,i} Q(r_{ui}|u,i) \] <br />where \( \mathbf{R} \) is the collection of all seen ratings.<br /><br /><b>Connection to our <a href="http://truyen.vietlabs.com/papers/truyen_etal_aaai12.pdf">AAAI'12 paper</a></b><br /><br />In our AAAI'12 paper we suggest the following form of the utility function<br />\[ f(u,i) = a_u + b_i + W_u' H_i + \sum_{j \in I(u)} w_{ij} g(r_j) \]<br />This appears to be quite similar to our local (log) probability in the pseudo-likelihood method. Of course the fine details will be different but in essence the information captured by both the approaches may be similar. Another subtle difference is that in the MRF treatment, the pairwise interactions are symmetric (i.e., \( w_{ij} = w_{ji} \)), while in this model, it is asymmetric \( w_{ij} \ne w_{ji} \).<br /><br /><b>Previous posts</b>: <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender.html">Part 1: Modelling local dependency</a> | <a href="http://letdataspeak.blogspot.com/2012/06/markov-random-fields-for-recommender_21.html">Part 2: Discovering latent spaces</a>.<br /><br />Truyen Tranhttp://www.blogger.com/profile/12750803948667104637noreply@blogger.com0