next up previous
Next: Bayesian Learning 2: EM Up: 59.771 Research Topics in Previous: Bounds on Code Length

Introduction to MML/MDL

Read Sections 3.6.2 and 6.6 of Mitchell, esp. the last paragraph of Section 6.6.

Also read the papers given: Wallace and Boulton (1968), Wallace and Georgeff (1984) and Rissanen (1978).

MDL follows directly from Eqn 2:

\begin{displaymath}h_{MAP} \equiv \underset{h \in H}{\rm argmax}\quad P(D\vert h)\cdot P(h)
\end{displaymath}

Taking logs, we get:

\begin{eqnarray*}h_{MAP}&\equiv&\underset{h \in H}{\rm argmax}\quad \log P(D\ver...
...underset{h \in H}{\rm argmin}\quad -\log P(D\vert h) - \log P(h)
\end{eqnarray*}


From results we derived in information theory, we can interpret the RHS as the length of optimally coding both the hypothesis and the data with respect to this hypothesis.

Note that MDL will only produce the MAP hypothesis if the encodings chosen for h and D/h are optimal. It won't work for any arbitrary encoding strategy.

Why is MDL useful? Sometimes it is easier to design a code that captures the knowledge in the system when precise quantification of probabilities is difficult. In all other case use a strictly probabilistic framework and by all means strive for the latter.

Points to ponder/discuss:

1.
What would you expect the MDL of most general and most specific hypotheses (Mitchell, Section 2.3.1, pp.24-33) to be?
2.
How is the null hypothesis different from the data?
3.
Most importantly, why should we prefer short hypotheses to longer ones?

We will now read and discuss the distributed papers and look at an example of trying to learn PFSA using MDL.


next up previous
Next: Bayesian Learning 2: EM Up: 59.771 Research Topics in Previous: Bounds on Code Length
Anand Venkataraman
1999-09-16