Read Chapter 3 of Mitchell and Quinlan & Rivest (1987).
Decision trees are typically used for classification related problems and are best suited when
| Normal | Yes | |||
| Sun |
|
|||
| High | No | |||
|
|
Overcast | Yes | ||
| Strong | No | |||
| Rain |
|
|||
| Weak | Yes | |||
A decision tree for the concept PlayTennis
A decision tree generalizes over the data. For example, the tree classifies all instances with Outlook = Overcast as suitable for playing tennis. But, in fact, the available training data may not have included cases for all combinations of other attributes where Outlook = Overcast. It is perfectly possible that a future instance arrives where Outlook=Overcast, and PlayTennis=No.
Here are some training examples for the target concept PlayTennis:
| Day | Outlook | Temp | Humidity | Wind | PlayTennis |
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rainy | Mild | High | Weak | Yes |
| 5 | Rainy | Cool | Normal | Weak | Yes |
| 6 | Rainy | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rainy | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rainy | Mild | High | Strong | No |
How do we determine which attribute to test at each node of the tree?
Quinlan's ID3 takes advantage of the insight that knowledge reduces uncertainty and therefore information (entropy).
Knowing the value of an attribute narrows down the range of choices available and consequently reduces the entropy.
We define the information gain with respect to knowing the value of an attribute as the expected reduction in the entropy of the source.
If H(S) is the entropy of the original source, S, A is an
attribute that can take on Values(A) distinct values, Sv is
subset of S for which A = v and H(Sv) is the entropy of Sv,
then the expected entropy upon being given a random value for the
attribute A is just
If we estimate P(Sv) by
,
we get
Then the information gain with respect to attribute A is given by:
To select an attribute to classify instances at a given node, compute the information gain for all attributes and pick the attribute that promises most gain.
Obviously Gain(S,A) is at most the entropy of the original source H(S), but can you show that Gain(S,A) cannot be negative?
For the training examples given above ( PlayTennis), determine the best attribute to classify by at the root and the next level down. Compute Gain(S,A) in nits, i.e. using natural logs.
What guarantees are there that an attribute that promises most gain NOW is in fact the best attribute to classify by? None!
Points to ponder/discuss: