Decision Trees
Introduction
Some facts about decision trees:
• They represent data-classification models.
• An internal node of the tree represents a question about a feature-vector attribute, whose answer
dictates which child node is queried next.
• Each leaf node represents a potential classification of the feature vector.
• Alternatively (from a logical perspective), decision trees allow classes of elements to be represented as
logical disjunctions of conjunctions of attribute values.
Here are some characteristics of data universes that admit good decision-tree models.
• Instances are represented by feature vectors whose attributes are preferably discrete.
• Instances are discretely classified.
• A disjunctive description represents a reasonable means of representing a class of elements.
• Errors may exist, such as missing attribute values or erroneous classification of some training vectors.
Entropy of a collection of classified feature vectors. Given a set of training vectors S, suppose that
the vectors are classified in c different ways, and that p_i represents the proportion of vectors in S that belong
to the i-th class. Then the classification entropy of S is defined as

H(S) = − ∑_{i=1}^{c} p_i log2 p_i.
Example 1. Calculate the classification entropy for the following set of feature vectors.

    Weight   Color    Texture   Classification
    ------   ------   -------   --------------
    medium   orange   smooth    orange
    heavy    green    smooth    melon
    medium   green    smooth    apple
    light    red      bumpy     berry
    medium   orange   bumpy     orange
    light    red      bumpy     berry
    heavy    green    rough     melon
    medium   red      smooth    apple
    heavy    yellow   smooth    melon
    medium   yellow   smooth    orange
    medium   red      smooth    apple
    medium   green    smooth    apple
    medium   orange   rough     orange
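As a numerical check on Example 1, the entropy H(S) can be computed directly from the class counts read off the table (4 oranges, 4 apples, 3 melons, 2 berries, out of 13 vectors). A minimal Python sketch:

```python
from math import log2

# Class counts from the table: orange 4, apple 4, melon 3, berry 2.
counts = [4, 4, 3, 2]
n = sum(counts)

# H(S) = -sum_i p_i * log2(p_i)
H = -sum((c / n) * log2(c / n) for c in counts)
print(round(H, 2))  # ≈ 1.95 bits
```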
Quinlan’s ID3 algorithm. At each phase of the construction of the decision tree, and for each branch of
the decision tree under construction, the attribute A that is considered next is the one which
1. has yet to be considered along that branch; and
2. minimizes the conditional classification entropy H(S|A). Indeed, for a given feature/attribute A,
H(S|A) = ∑_{a∈A} (|S_a|/|S|) H(S_a),

where S is the set of training vectors that reach the current branch under construction and, for
all a ∈ A, S_a represents the set of feature vectors v ∈ S such that v_A = a. Clearly, the smaller H(S|A),
the less classification information remains in the vectors of S once they are divided according to
their A-attribute.
Example 2. Using the table of feature vectors from Example 1 and the concept of conditional classification
entropy, construct a decision tree for classifying fruit as either apple, orange, melon, or berry.
Split Information. Let A be an attribute, considered here as a discrete set of possible values. Then the
split information relative to A and a set of feature vectors S is defined as
SI(A, S) = − ∑_{a∈A} (|S_a|/|S|) log2 (|S_a|/|S|),
where Sa represents the set of feature vectors v ∈ S such that vA = a.
For attributes that take on many values, using the gain ratio IG(S, A)/SI(A, S) instead of IG(S, A) can help avoid
favoring many-valued attributes which may not perform well in classifying the data. Here IG(S, A) is defined
as H(S) − H(S|A).
Example 3. Suppose attribute A has 8 possible values. If selecting A for the next node of a decision tree
yields IG(S, A) = 2 bits of information, then compute IG(S, A)/SI(A, S).
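Example 3 does not say how the 8 values partition S; assuming a uniform split (which maximizes SI at log2 8 = 3 bits), the computation is a few lines:

```python
from math import log2

k = 8     # number of values attribute A can take
ig = 2.0  # IG(S, A), as given in the example

# Uniform-split assumption: each value receives 1/8 of the vectors, so
# SI(A, S) = -8 * (1/8) * log2(1/8) = log2(8) = 3 bits.
si = -sum((1 / k) * log2(1 / k) for _ in range(k))

ratio = ig / si
print(ratio)  # 2/3
```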
Classifier Selection and The Bias-Variance Tradeoff
The hypothesis space of decision trees is complete in the sense that, for every set S of training vectors,
there exists a decision tree TS which correctly classifies all of the training vectors. Moreover, that tree can
be constructed quite easily, assuming discrete features and a finite number of classes. So why bother with
ID3 when there is already a tree that will correctly classify all the training data? The problem, of course, is
that this tree does not inform us about how to classify any of the data that is not part of the training set.
For those cases let us suppose that (assuming two classes that are equally probable) the tree is designed
so that non-training vectors are classified by the toss of a coin. Such a classifier that correctly classifies all
training vectors and randomly classifies non-training vectors represents an extreme example of an unbiased
classifier, in that it makes no assumptions about correlations between a vector’s class and attribute values.
In practice, being unbiased implies a lack of learning. For example, if you touch a very hot baking dish just
removed from the oven, chances are you will think twice about touching it again the next time you see it
on the counter. In other words, your next encounter with that dish on the counter will be biased from the
past encounter.
Another quality that TS suffers from is high variance: how a vector is classified depends strongly on the
training set. When a vector is in the training set, it is correctly classified,
but in all other cases it receives a random classification whose variance grows with the
number of possible classes. Ideally, a good learning algorithm should keep the variance low, meaning that the
classification of a vector does not change much from training set to training set. For example, given a large
enough basket of fruit to learn from, we would expect that our concept of an orange (e.g. medium-sized,
orange-colored, and smooth) would not change from basket to basket. In this case our learning algorithm
would display low variance.
It should also be noted that attempting to minimize variance can sometimes lead to an increase in bias,
which in turn may increase the overall classification error. For example, suppose we are biased towards
classifying medium-sized, orange-colored, smooth fruit as oranges. Doing so may cause the misclassification
of some smaller orange-complexioned grapefruits. In other words, by increasing bias for the sake of reducing
variance, we sometimes make errors on the “exceptions to the rule”.
The following mathematical derivation suggests that the ideal learning algorithm is one that strikes an
optimal balance when attempting to reduce both bias and variance.
For a given vector x, let P (c|x) denote the classification probability distribution associated with x. Let γ be
a classifier, and let γ(x) denote the class that γ assigns to x. Then the mean squared error of γ, denoted
mse(γ) is defined as
mse(γ) = E_x[(γ(x) − P(c|x))^2],
where the expectation is taken over a probability distribution over the data universe X . Now let Γ denote
a learning algorithm, and ΓD denote a particular classifier that is derived by the algorithm upon input of
a randomly drawn training-data sample D. Assume that all training samples have a fixed size, and that
they are obtained by independent sampling from the distribution over X. Define the learning error of
Γ, denoted learning-error(Γ), as
learning-error(Γ) = E_D[mse(Γ_D)] = E_D E_x[(Γ_D(x) − P(c|x))^2] = E_x E_D[(Γ_D(x) − P(c|x))^2],
where the last equality is a change in the order of summation. Now the inner summation can be simplified
using the following claim.
Claim. E[x − k]^2 = (Ex − k)^2 + E[x − Ex]^2, where Ex denotes the expectation of random variable x, and k
is some constant. The term (Ex − k)^2 is called the bias term. Here, we are thinking of k as representing a
desired target value that x is attempting to attain, while Ex denotes the average of what x actually attained
in practice. Finally, E[x − Ex]^2 is the definition of the variance of random variable x.
Proof of Claim. By linearity of expectation,
E[x − k]^2 = Ex^2 − 2kEx + k^2
= [(Ex)^2 − 2kEx + k^2] + [Ex^2 − 2(Ex)^2 + (Ex)^2]
= (Ex − k)^2 + E[x − Ex]^2.
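The claim can also be sanity-checked numerically. A short sketch, where the values of x and the target k are invented purely for illustration:

```python
# x takes each of these values with equal probability; k is a fixed target.
values = [1.0, 2.0, 3.0, 4.0]
k = 3.0

mean = sum(values) / len(values)                          # Ex = 2.5
mse = sum((v - k) ** 2 for v in values) / len(values)     # E[x - k]^2
bias2 = (mean - k) ** 2                                   # (Ex - k)^2
var = sum((v - mean) ** 2 for v in values) / len(values)  # E[x - Ex]^2

# The claim: E[x - k]^2 = (Ex - k)^2 + E[x - Ex]^2.
assert abs(mse - (bias2 + var)) < 1e-12
```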
Applying the claim to the inner expectation E_D[(Γ_D(x) − P(c|x))^2], we get

learning-error(Γ) = E_x[bias(Γ, x) + variance(Γ, x)],

where

bias(Γ, x) = (E_D Γ_D(x) − P(c|x))^2,

and

variance(Γ, x) = E_D[Γ_D(x) − E_D Γ_D(x)]^2.
Example 4. Suppose X = {1, 2, 3, 4} and that 1 and 2 are in class 0, while 3 and 4 are in class +1. Suppose
that |D| = 2 (one training vector from each class) and that our learning algorithm uses a nearest neighbor
algorithm, in that the resulting classifier classifies a number based on which training point it is nearest,
breaking ties by tossing a coin. Compute the learning-error for this nearest neighbor algorithm.
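Example 4 can be worked by brute-force enumeration over the four possible training sets; since each class is deterministic, the squared error reduces to a misclassification probability, and a tie contributes 1/2 via the coin toss. A sketch:

```python
from itertools import product

true_class = {1: 0, 2: 0, 3: 1, 4: 1}

def p_wrong(x, a, b):
    """Probability that the nearest-neighbor classifier trained on
    {a (class 0), b (class 1)} misclassifies x; ties go to a fair coin."""
    da, db = abs(x - a), abs(x - b)
    if da == db:
        return 0.5                      # coin flip: wrong half the time
    pred = 0 if da < db else 1
    return 0.0 if pred == true_class[x] else 1.0

# One training point from each class, all four combinations equally likely;
# x is uniform over {1, 2, 3, 4}.
samples = list(product([1, 2], [3, 4]))
err = sum(p_wrong(x, a, b)
          for (a, b) in samples
          for x in true_class) / (len(samples) * len(true_class))
print(err)  # 0.0625
```

Only two of the sixteen (D, x) combinations produce a tie (D = {1, 3} with x = 2, and D = {2, 4} with x = 3), each costing 1/2, so the learning error is 1/16.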
In light of the bias-variance tradeoff, we see that the ID3 algorithm attempts to reduce the length of
branches in the decision tree. This has the effect of reducing variance, at the expense of increasing bias. To
further achieve this, the ID3 algorithm is usually followed by a rule pruning phase in which it is attempted
to shorten rules.
Rule post-pruning steps.
1. Develop the decision tree without any concern for overfitting.
2. Convert the tree into an equivalent set of rules.
3. Prune each rule by removing any preconditions whose removal improves accuracy.
4. Sort the rules by their estimated accuracy and apply them in that order.
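A minimal Python sketch of the pruning step, assuming a rule is a list of (attribute, value) preconditions plus a class, and that accuracy is estimated on a held-out validation set; the toy validation rows below are invented for illustration:

```python
def accuracy(preconds, label, data):
    """Accuracy of a rule on the rows it matches; 0 if it matches nothing."""
    matched = [row for row in data if all(row[a] == v for a, v in preconds)]
    if not matched:
        return 0.0
    return sum(row["class"] == label for row in matched) / len(matched)

def prune(preconds, label, validation):
    """Greedily drop any precondition whose removal does not hurt estimated
    accuracy (ties favor the shorter rule)."""
    preconds = list(preconds)
    improved = True
    while improved and preconds:
        improved = False
        base = accuracy(preconds, label, validation)
        for p in list(preconds):
            trial = [q for q in preconds if q != p]
            if accuracy(trial, label, validation) >= base:
                preconds = trial
                improved = True
                break
    return preconds

# Toy validation set: weight alone determines the class, so the texture
# precondition in the rule below is redundant and gets pruned away.
validation = [
    {"weight": "medium", "texture": "smooth", "class": "apple"},
    {"weight": "medium", "texture": "rough",  "class": "apple"},
    {"weight": "heavy",  "texture": "smooth", "class": "melon"},
]
rule = [("weight", "medium"), ("texture", "smooth")]
print(prune(rule, "apple", validation))  # [('weight', 'medium')]
```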
Exercises
1. Draw a minimum-sized decision tree for the three-input XOR function which produces a 1 iff an odd
number of the inputs evaluate to one.
2. Provide decision trees to represent the following Boolean functions: A and not B, A or (B and C), A
xor B, (A and B) or (C and D).
3. Consider the following set of training examples:

    Instance   Classification   a1   a2
    --------   --------------   --   --
    1          +                T    T
    2          +                T    T
    3          −                T    F
    4          +                F    F
    5          −                F    T
    6          −                F    T
Calculate the entropy of the collection with respect to the classification, and determine which of the
two attributes provides the most information gain.
4. Repeat Example 2, but instead use the measure IG(S, A)/SI(A, S) to calculate the attribute to use at a given
node of the tree.
5. Create a decision tree using the ID3 Algorithm for the following table of data.

    Vector   A1   A2   A3   Class
    ------   --   --   --   -----
    v1       1    0    0    0
    v2       1    0    1    0
    v3       0    1    0    0
    v4       1    1    1    1
    v5       1    1    0    1
6. Suppose X = {1, 2, 3, 4, 5, 6} and that 1, 2, 3 are in class 0, while 4, 5, 6 are in class 1. Suppose that
each training set S has |S| = 2 (one training vector from each class) and that the learning algorithm Γ
is again the nearest neighbor algorithm (see Example 4). Compute learning-error(Γ). Also, compute
bias(Γ, 1) and variance(Γ, 1). Hint: there are only nine possible training sets to consider.