next up previous
Next: Artificial Neural Nets Up: Decision Trees Previous: Decision Trees

Issues in Decision Tree Learning

1. Overfitting the data:

Definition: given a hypothesis space H, a hypothesis $h \in H$ is said to overfit the training data if there exists some alternative hypothesis $h' \in H$, such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.

How can we prevent overfitting? Here are some common heuristics:

How does one tell if a given tree is one that has overfit the data?

If we use the validation set to guide pruning, how do we tell that the tree is not overfitting the validation set? In this case, we need to extract yet another set called the test set from the training set and use this for the final check.

2. Guarding against bad attribute choices:

The information gain measure has a bias that favors attributes with many values over those with only a few.

For example, if you imagine an attribute Date with unique values for each training example, then Gain(S,Date) will yield H(S) since $\forall v \in Values(Date): H(S_v) = 0$.

Obviously no other attribute can do better. This will result in a very broad tree of depth 1.

To guard against this, Quinlan uses GainRatio(S,A) instead of Gain(S,A).

\begin{displaymath}GainRatio(S,A) = \frac{Gain(S,A)}{SplitInformation(S,A)}
\end{displaymath}

where

\begin{displaymath}SplitInformation(S,A) = -\sum_{v\in Values(A)} P(S_v)\log P(S_v)
\end{displaymath}

with P(Sv) estimated by relative frequency as before.

This introduces a penalty for partitioning into equipossible groups.

3. Handling continuous valued attributes:

Note that continuous valued attributes can be partitioned into a discrete number of disjoint intervals. Then we can test for membership to these intervals.

If the Temperature attribute in the PlayTennis example took on continuous values in the range 40-90, note that we cannot just use the same approach as before.

Why? Because Temperature becomes a bad choice for classification (It alone may perfectly classify the training examples and therefore promise the highest information gain like in the earlier Date example) while remaining a poor predictor on the test set.

The solution to this problem is to classify based not on the actual temperature, but on dynamically determined intervals within which the temperature falls.

For instance we can introduce boolean attributes $T \le a$, $a < T \le
b$, $b < T \le c$ and T > c instead of real valued T. a, b and c can be computed dynamically based on boundaries where T is likely or known to change.

4. Handling missing attribute values:

What happens if some of the training examples contain one or more ``?'', meaning ``value not known'' instead of the actual attribute values?

Here are some common ad hoc solutions:

Quinlan's solution in C4.5 is slightly more complex: Instead of a value, he assigns a distribution over values to the ``?''. This distribution is used when computing Gain(S,A). The distribution determines how much of H(Sv=?) will contribute to each H(Sv).

5. Handling attributes with differing costs:

Sometimes we want to introduce additional bias against the selection of certain attributes. Eg. ``Wherever possible try not to use InvasiveTest as an attribute, since this might inconvenience the patient.

Here is another ad hoc heuristic to fix this:

Introduce a user-defined measure Cost(A), to specify a fixed bias against each attribute.

Now we can use a CostedGain(S,A) which is defined along the lines of:

\begin{eqnarray*}CostedGain(S,A) &=& \frac{Gain^2(S,A)}{Cost(A)} \qquad {\rm or }\\
CostedGain(S,A) &=& \frac{2^{Gain(S,A)}-1}{(Cost(A)+1)^w}
\end{eqnarray*}


where $w \in [0,1]$ may be a constant that determines the relative importance of cost versus information gain.


next up previous
Next: Artificial Neural Nets Up: Decision Trees Previous: Decision Trees
Anand Venkataraman
1999-09-16