Naive Bayes Classifier next up previous
Next: Markov Networks Up: Bayesian Learning 1 Previous: Gibbs Algorithm

Naive Bayes Classifier

Read Section 6.9 of Mitchell. Suppose that each instance $x \in X$ is a sequence of attribute values $\langle a_1, a_2, \cdots, a_n
\rangle$ and that the discrete valued target function we are trying to learn is $f: X \rightarrow V$ where $V = \{v_1, v_2, \cdots, v_t\}$.

A set of m training instances of the form $\langle x_i, v_i \rangle$ is presented followed by a new instance x. What is the most likely value of f(x)?


 
vMAP = $\displaystyle \underset{v_j \in V}{\rm argmax}\quad P(x\vert v_j) \cdot P(v_j)$  
  = $\displaystyle \underset{v_j \in V}{\rm argmax}\quad P(a_1, a_2, \cdots, a_n \vert v_j) \cdot
P(v_j)$ (5)

The prior probability of vj is easy to estimate as the relative frequency of vj in the training examples.

But what about P(x|vj)? To estimate this we need to count the number of times the n-gram $\langle a_1, a_2, \cdots, a_n
\rangle$ occurs in the subset of the training data for which f(x) = vj. Technically this is infeasible.

So we make a simplifying assumption (like in Markov chains) that the ai are conditionally independent of each other.

That is, $P(a_1, a_2, \cdots, a_n \vert v_j) = \prod_{i=1}^n P(a_i\vert v_j)$

If we substitute the above in Eqn 6, we get the expression for the classification given by the Naive Bayes classifier:


 \begin{displaymath}
v_{NB} = \underset{v_j \in V}{\rm argmax}\quad \prod_{i=1}^n P(a_i\vert v_j) P(v_j)
\end{displaymath} (6)

Note that P(ai|vj) can be easily estimated as the ratio of the frequency of the attribute ai in the subset of the training examples for which f(x) = vj to the size of this subset.

How many different values must we estimate now? (n |V|). How many different values must we estimate if we didn't make the simplifying assumption? (n!|V|).


next up previous
Next: Markov Networks Up: Bayesian Learning 1 Previous: Gibbs Algorithm
Anand Venkataraman
1999-09-16