next up previous
Next: Issues in Decision Tree Up: 59.771 Research Topics in Previous: Estimating the means of

Decision Trees

Read Chapter 3 of Mitchell and Quinlan & Rivest (1987).

Decision trees are typically used for classification related problems and are best suited when

1.
Instances are described by a fixed set of attribute-value pairs.
2.
The target function has discrete output values (or can be partitioned into disjoint intervals).
3.
Disjunctive expressions may be required.

         
      Normal Yes
         
  Sun \fbox{Humidity?}    
         
      High No
         
\fbox{Outlook?} Overcast Yes    
         
      Strong No
         
  Rain \fbox{Wind?}    
         
      Weak Yes
         

A decision tree for the concept PlayTennis

A decision tree generalizes over the data. For example, the tree classifies all instances with Outlook = Overcast as suitable for playing tennis. But, in fact, the available training data may not have included cases for all combinations of other attributes where Outlook = Overcast. It is perfectly possible that a future instance arrives where Outlook=Overcast, and PlayTennis=No.

Here are some training examples for the target concept PlayTennis:

Day Outlook Temp Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rainy Mild High Weak Yes
5 Rainy Cool Normal Weak Yes
6 Rainy Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rainy Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rainy Mild High Strong No

How do we determine which attribute to test at each node of the tree?

Quinlan's ID3 takes advantage of the insight that knowledge reduces uncertainty and therefore information (entropy).

Knowing the value of an attribute narrows down the range of choices available and consequently reduces the entropy.

We define the information gain with respect to knowing the value of an attribute as the expected reduction in the entropy of the source.

If H(S) is the entropy of the original source, S, A is an attribute that can take on Values(A) distinct values, Sv is subset of S for which A = v and H(Sv) is the entropy of Sv, then the expected entropy upon being given a random value for the attribute A is just

\begin{displaymath}E[H(S)\vert A] = \sum_{v \in Values(A)} P(S_v) H(S_v)
\end{displaymath}

If we estimate P(Sv) by $\frac{\vert S_v\vert}{\vert S\vert}$, we get

\begin{displaymath}E[H(S)\vert A] = \sum_{v \in Values(A)} \frac{\vert S_v\vert}{\vert S\vert} H(S_v)
\end{displaymath}

Then the information gain with respect to attribute A is given by:

\begin{displaymath}Gain(S,A) = H(S) - \sum_{v \in Values(A)} \frac{\vert S_v\vert}{\vert S\vert} H(S_v)
\end{displaymath}

which is the number of bits that are saved on average by being told the value of A.

To select an attribute to classify instances at a given node, compute the information gain for all attributes and pick the attribute that promises most gain.

Obviously Gain(S,A) is at most the entropy of the original source H(S), but can you show that Gain(S,A) cannot be negative?

For the training examples given above ( PlayTennis), determine the best attribute to classify by at the root and the next level down. Compute Gain(S,A) in nits, i.e. using natural logs.

What guarantees are there that an attribute that promises most gain NOW is in fact the best attribute to classify by? None!

Points to ponder/discuss:



 
next up previous
Next: Issues in Decision Tree Up: 59.771 Research Topics in Previous: Estimating the means of
Anand Venkataraman
1999-09-16