## Saturday, 17 December 2016

### Markov random fields: Structure learning I

Structure learning in Markov random fields is an important topic since without the correct structures, we can potentially estimate a wrong or unreliable model. Often structures are specified by domain knowledge such as the 2D grid in modelling images or the 1D chain in word labelling. However, there are cases which structures are not available. For example, a word co-occurrence graph can be useful to learn about the document semantics, or a disease correlation network is essential to understand the patient health trajectory. In these situations, there are no predetermined structures and thus we need to estimate from data.

There are often an implicit requirement that the network should be sparse and interaction between neighboring variables must be strong. Otherwise, learning will be unreliable since the number of parameters can be quadratic in number of variables. Inference wise, powerful methods such as loopy belief propagation cannot work for densely connected networks, and Gibbs sampling can be too slow to move to reasonably distant states. Practically, learning and inference speed will suffer critically for large-scale problems (e.g., in our application of MRF for collaborative filtering, the number of nodes (items or users) can be hundreds of thousands).

Sparse structure learning is therefore needed but it is often difficult to do efficiently because the space of all possible structures is $$2^{N(N-1)/2}$$ for $$N$$ variables. This is because we have a set of $$N(N-1)/2$$ possible symmetric edges and each structure corresponds to a subset of this set (empty set is possible as it means variables are totally independent). To avoid searching through this combinatorial space, often approximate methods are needed.

There are a number of ways to do so. First, we can apply simple screening tests, e.g., by measuring the correlation between variables (e.g., Pearson's), Those pairs whose correlation falls below a threshold will be removed. There are a number of drawbacks here: First, we do not often know which measure is the most appropriate for the task. Second, the method considers pairs independently without paying attention to related pairs. And most importantly, the method is not connected to any objective function, e.g., data likelihood. However, due to its simplicity, this method can be widely applicable in certain problems. For example, in our MRF application to collaborative filtering, this method is quite sufficient.

Another way is to embed the edge selection process into learning. The key is to have a surrogate objective function which can be evaluated efficiently.

Pseudo-likelihood with edge shrinkage

A recent embedded selection method exploits the power of pseudo-likelihood and shrinkage method such as Lasso. Assume that the model energy has the following additive form
$E(\mathbf{x},\gamma,\theta) = - \left( \sum_i \phi_i(x_i,\gamma) + \sum_i\sum_{j>i} \theta_{ij} \psi_{ij}(x_i, x_j) \right)$
We want to optimise the following objective function
$F(\gamma,\theta) = \sum_i \log P(x_i | \mathbf{x}_{\neg i},\gamma, \theta) - \lambda \Omega(\theta)$
where $$\theta$$ is the edge parameter and $$\gamma$$ is the rest of parameters. $$\Omega(\theta)$$ is the penalty for non-zero values, and $$\lambda > 0$$ is the penalty factor.

The role of pseudo-likelihood is clear -- this is one of the consistent methods which is feasible to perform exactly.  Now each edge is associated with a parameter, usually in a log-linear way, and thus a zero parameter would mean no edge. The strategy is then to actively shrink the edge parameters are until they are zeros or ignorable.

This is achieved by placing a sparsity prior on the edge parameter. Ideally the prior is the exponential of the  negative count of non-zeros edge parameters, but this leads to non-smooth optimisation problem. A solution is to relax the count to the sum of absolute parameter values.
$\Omega(\theta) = \sum_i\sum_{j>i} |\theta_{ij}|$
The sparsity is still possible to obtain but the optimisation is much easier.

For more detail, see our recent work: Collaborative filtering via sparse Markov random fields.