While this is appealing because most real-world data can be described this way, it is a hard problem: in general, no meaningful exact computation is tractable, so everything must be approximated.
In learning, we essentially estimate some compact description of the structured space, which contains exponentially many points if we were to enumerate them all. Since exact treatment of the space is impossible, there are two ways to proceed: (1) under-generating, where we restrict ourselves to a very small subspace that we can handle efficiently, and (2) over-generating, where we describe a larger but much more relaxed space in which many constraints are violated.
Examples of under-generating methods:
- Pseudo-likelihood of Besag: we exploit the conditional independence property and consider each variable together only with its surrounding variables (its Markov blanket). The basic theory assures that, given enough data, this localised estimation agrees with the true global one.
- Re-ranking of Collins: the idea is to use some tractable, approximate method to generate the most promising candidates as a small subspace, and then apply a richer model to that subspace.
- Piecewise training of Sutton & McCallum: the idea is to duplicate variables and break the structure into independent pieces. Since variables are added, the new space is larger than the original one; in return we gain much more independence, which makes computation easy. This method works well when the independent pieces approximate the data well.
- Tree-superimposition of my own: we can always relax a graph by removing edges until a spanning tree remains. Since spanning trees are tractable, each tree can be learned given the knowledge of all other trees accumulated so far.
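To make the pseudo-likelihood idea concrete, here is a minimal sketch on a toy Ising-style chain. The model, the coupling `w`, and all function names are my own illustration, not Besag's notation; the point is only that each conditional needs just the Markov blanket, while the exact likelihood needs the full exponential enumeration.

```python
import math
from itertools import product

# Toy Ising-style chain (illustrative): P(x) ∝ exp(w * sum_i x_i * x_{i+1}),
# with x_i ∈ {-1, +1} and a single hypothetical coupling parameter w.

def log_pseudo_likelihood(x, w):
    """Sum over sites of log P(x_i | neighbours of x_i).

    Each conditional only involves the Markov blanket (here the chain
    neighbours), so no global partition function is needed.
    """
    total = 0.0
    for i in range(len(x)):
        # Local field from the neighbours (the Markov blanket of site i).
        field = 0.0
        if i > 0:
            field += x[i - 1]
        if i < len(x) - 1:
            field += x[i + 1]
        # P(x_i = s | blanket) ∝ exp(w * s * field), s ∈ {-1, +1}.
        num = math.exp(w * x[i] * field)
        den = math.exp(w * field) + math.exp(-w * field)
        total += math.log(num / den)
    return total

def log_likelihood(x, w):
    """Exact log-likelihood, for comparison: enumerates all 2^n configurations."""
    def energy(cfg):
        return w * sum(a * b for a, b in zip(cfg, cfg[1:]))
    z = sum(math.exp(energy(cfg)) for cfg in product([-1, 1], repeat=len(x)))
    return energy(x) - math.log(z)
```

For a chain of length n, the pseudo-likelihood costs n cheap conditionals, whereas the exact likelihood enumerates 2^n configurations; that gap is the whole point of the localised estimation.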
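The re-ranking recipe can be sketched in a few lines: a cheap, tractable scorer carves out a k-best subspace, and an expensive model only ever sees that subspace. Both scorers below are toy stand-ins of my own invention, not Collins' actual models.

```python
import heapq

def cheap_score(cand):
    # Toy proxy model: prefers candidates of length 5 (purely illustrative).
    return -abs(len(cand) - 5)

def expensive_score(cand):
    # Stand-in for a richer model that would be intractable over the full space.
    return sum(ord(c) for c in cand)

def rerank(candidates, k=3):
    """Re-ranking: keep the k best under the cheap model, then pick the
    winner under the expensive model on that small subspace only."""
    subspace = heapq.nlargest(k, candidates, key=cheap_score)
    return max(subspace, key=expensive_score)
```

The expensive model is evaluated k times regardless of how large the full candidate space is, which is exactly the "deal with the new subspace" step.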
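For the tree relaxation, one standard way to drop edges until a tractable tree remains is a Kruskal-style maximum spanning tree: keep the heaviest edges that do not create a cycle. This is a generic sketch of the relaxation step only (the graph format and names are illustrative), not the tree-learning procedure itself.

```python
def max_spanning_tree(n, edges):
    """Kruskal-style maximum spanning tree.

    n: number of nodes (labelled 0..n-1).
    edges: list of (weight, u, v) tuples.
    Returns the list of edges kept in the tree.
    """
    parent = list(range(n))  # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    kept = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest edges first
        ru, rv = find(u), find(v)
        if ru != rv:          # keeping this edge creates no cycle
            parent[ru] = rv
            kept.append((w, u, v))
    return kept
```

Whatever edge weights one chooses (e.g. some measure of how important a dependency is), the result is a spanning tree, and spanning trees admit exact, tractable inference.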