
Why is LSE likely to output a ML hypothesis?

Read Section 6.4 of Mitchell. Under certain assumptions, any learning algorithm that minimizes the squared error between its output hypothesis's predictions and the training data (a least-squared error, or LSE, learner) will output a maximum likelihood (ML) hypothesis.

Consider a hypothesis space $H$ consisting of some real valued functions $h: X \rightarrow \Re$. Suppose we are to learn a target function $f: X \rightarrow \Re$, using hypotheses from $H$, given a set of $m$ training examples $(x_i, d_i)$. Assume that the data is corrupted by noise that is Normally distributed about the target values, i.e. $d_i = f(x_i) + e_i$, where each $e_i$ is drawn independently from a Normal distribution with zero mean. What is $h_{ML}$?

From Eqn 3

\begin{displaymath}h_{ML} = \underset{h \in H}{\rm argmax}\quad P(D\vert h)
\end{displaymath}

If we assume the training examples are independent of each other, we can write the above as:

\begin{displaymath}h_{ML} = \underset{h \in H}{\rm argmax}\quad \prod\limits_{i=1}^m P(d_i\vert h)
\end{displaymath}

But $P(d_i\vert h) = 0$ for any particular data item $d_i$, since the error is Normally distributed and therefore continuous: any single real value has probability zero.

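However, note that the probability of observing a value in a small interval of width $\epsilon$ at $d_i$ can be written in terms of the probability density $p$: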

\begin{displaymath}P(d_i\vert h) = \lim\limits_{\epsilon \rightarrow 0} \epsilon p(d_i\vert h)
\end{displaymath}
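
The relationship between probability and density can be checked numerically. The sketch below is only an illustration: the mean, standard deviation, and observed value are assumed numbers, not values from the notes. It compares the probability of landing in $[d, d + \epsilon)$ with $\epsilon\, p(d)$ for shrinking $\epsilon$.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of Normal(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(d, eps, mu, sigma):
    """Probability that a draw lands in the interval [d, d + eps)."""
    return normal_cdf(d + eps, mu, sigma) - normal_cdf(d, mu, sigma)

# Illustrative values (assumptions, not taken from the notes).
mu, sigma, d = 0.0, 1.0, 0.5

probs = []
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    prob = interval_probability(d, eps, mu, sigma)
    probs.append(prob)
    # Ratio of the true interval probability to the approximation eps * p(d).
    ratios.append(prob / (eps * normal_pdf(d, mu, sigma)))
    print(eps, prob, ratios[-1])
```

As $\epsilon$ shrinks, the interval probability itself goes to zero while its ratio to $\epsilon\, p(d)$ approaches 1, which is why the density can stand in for the probability when comparing hypotheses.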

So,

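\begin{eqnarray*}h_{ML} &=& \underset{h \in H}{\rm argmax}\quad \lim\limits_{\epsilon \rightarrow 0} \prod\limits_{i=1}^m \epsilon\, p(d_i\vert h)\\
&=& \underset{h \in H}{\rm argmax}\quad \lim\limits_{\epsilon \rightarrow 0} \epsilon^m \prod\limits_{i=1}^m p(d_i\vert h)\\
&=& \underset{h \in H}{\rm argmax}\quad \prod\limits_{i=1}^m p(d_i\vert h)\\
\end{eqnarray*}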
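
The factor $\epsilon$ can be dropped because every hypothesis's product is scaled by the same positive constant $\epsilon^m$, which cannot change which hypothesis attains the maximum. A minimal sketch of this, assuming a toy hypothesis space of constant predictions and made-up observations (both hypothetical, chosen only for illustration):

```python
import math

def density_product(mu, data, sigma=1.0):
    """Product of Normal densities p(d_i | h) for a constant hypothesis h(x) = mu."""
    prod = 1.0
    for d in data:
        z = (d - mu) / sigma
        prod *= math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return prod

data = [0.9, 1.4, 1.1, 0.7]          # illustrative observations (assumed)
hypotheses = [0.0, 0.5, 1.0, 1.5]    # candidate constant predictions (assumed)

winners = []
for eps in (1e-1, 1e-3, 1e-6):
    scale = eps ** len(data)          # the common factor eps^m
    scores = [scale * density_product(mu, data) for mu in hypotheses]
    # Record which hypothesis has the largest scaled score for this eps.
    winners.append(hypotheses[scores.index(max(scores))])

print(winners)
```

The same hypothesis wins for every value of $\epsilon$ tried, so the maximization can be carried out on the densities alone.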


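Now, since we assumed that the noise is Normal, under hypothesis $h$ each $d_i$ is Normally distributed with mean $\mu = h(x_i)$ and some fixed variance $\sigma^2$. Substituting the Normal density, taking the logarithm (which is monotonic and so does not change the maximizing $h$), and dropping the terms and factors that do not depend on $h$, we can write


\begin{eqnarray*}h_{ML} &=& \underset{h \in H}{\rm argmax}\quad \prod\limits_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(d_i-h(x_i))^2}\\
&=& \underset{h \in H}{\rm argmax}\quad \sum\limits_{i=1}^m \left[ \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i-h(x_i))^2 \right]\\
&=& \underset{h \in H}{\rm argmax}\quad \sum\limits_{i=1}^m -\frac{1}{2\sigma^2}(d_i-h(x_i))^2\\
&=& \underset{h \in H}{\rm argmin}\quad \sum\limits_{i=1}^m
(d_i-h(x_i))^2\\
\end{eqnarray*}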


So $h_{ML}$ is the hypothesis that minimizes the sum of squared errors between the training examples and the hypothesis predictions.
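
This equivalence can be illustrated on synthetic data. The following is a sketch under stated assumptions rather than anything from Mitchell or the notes: the hypothesis space is a hypothetical grid of linear functions $h(x) = w x$, the target slope and noise level are made up, and the noise is drawn with Python's standard `random` module. It picks the hypothesis with the highest Gaussian log-likelihood and, separately, the one with the lowest sum of squared errors.

```python
import math
import random

def log_likelihood(w, xs, ds, sigma):
    """Sum of ln p(d_i | h) for h(x) = w * x under Normal noise with std sigma."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (d - w * x) ** 2 / (2.0 * sigma ** 2)
               for x, d in zip(xs, ds))

def sum_squared_error(w, xs, ds):
    """Sum of squared differences between observations and predictions w * x."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

rng = random.Random(0)                # fixed seed so the run is repeatable
true_w, sigma = 2.0, 0.5              # assumed target slope and noise level
xs = [i / 10.0 for i in range(1, 31)]
ds = [true_w * x + rng.gauss(0.0, sigma) for x in xs]  # d_i = f(x_i) + e_i

# Hypothetical finite hypothesis space: slopes 1.00, 1.01, ..., 3.00.
candidates = [k / 100.0 for k in range(100, 301)]

h_ml = max(candidates, key=lambda w: log_likelihood(w, xs, ds, sigma))
h_lse = min(candidates, key=lambda w: sum_squared_error(w, xs, ds))

print(h_ml, h_lse)
```

Both selections land on the same slope, as the derivation predicts. The agreement depends on the assumptions made above: independent examples and Normal noise with the same variance for every example.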


Anand Venkataraman
1999-09-16