Decision trees

Directly optimize tree structure for good classification.

A decision tree is a function f : X → Y, represented by a binary tree in which:

- each internal node is associated with a splitting rule g : X → {0, 1};
- each leaf node is associated with a label y ∈ Y.

When X = R^d, typically only splitting rules of the form

    g(x) = 1{xi > t}

for some i ∈ [d] and t ∈ R are considered. These are called axis-aligned or coordinate splits.
(Notation: [d] := {1, 2, . . . , d}.)

[Figure: a small decision tree that first tests x1 > 1.7 and then x2 > 2.8, with leaves labeled ŷ = 1, ŷ = 2, ŷ = 3.]
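As a concrete illustration of this representation (a sketch in Python, not from the lecture; the leaf labels below are my reading of the pictured tree), internal nodes store an axis-aligned rule 1{xi > t}, leaves store a label, and prediction follows a single root-to-leaf path:

```python
from dataclasses import dataclass
from typing import Sequence, Union

@dataclass
class Leaf:
    label: int                    # label y ∈ Y assigned to this leaf

@dataclass
class Node:
    feature: int                  # coordinate index i ∈ [d] tested by the rule
    threshold: float              # threshold t in the rule 1{x_i > t}
    left: Union["Node", Leaf]     # subtree for g(x) = 0, i.e. x_i <= t
    right: Union["Node", Leaf]    # subtree for g(x) = 1, i.e. x_i > t

def predict(tree: Union[Node, Leaf], x: Sequence[float]) -> int:
    """Follow the axis-aligned splits from the root down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.right if x[tree.feature] > tree.threshold else tree.left
    return tree.label

# The small tree pictured above: test x1 > 1.7, then x2 > 2.8.
tree = Node(feature=0, threshold=1.7,
            left=Leaf(label=1),
            right=Node(feature=1, threshold=2.8,
                       left=Leaf(label=3), right=Leaf(label=2)))
print(predict(tree, [2.1, 3.0]))  # -> 2
```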
Decision tree example

Classifying irises by sepal and petal measurements:

- X = R², Y = {1, 2, 3}
- x1 = ratio of sepal length to width
- x2 = ratio of petal length to width

[Figure: scatter plot of the iris data, sepal length/width on the horizontal axis and petal length/width on the vertical axis, with the three classes marked.]

Basic decision tree learning algorithm

Basic "top-down" greedy algorithm:

- Initially, the tree is a single leaf node containing all (training) data.
- Loop:
  - Pick the leaf ℓ and rule h that maximally reduce uncertainty.
  - Split the data in ℓ using h, and grow the tree accordingly.
  . . . until some stopping criterion is satisfied.

[The label of a leaf is the plurality label among the data contained in the leaf.]

[Figure: the tree grows from a single leaf (ŷ = 2), to a split on x1 > 1.7 with leaves ŷ = 1 and ŷ = 3, to a further split of the impure leaf on x2 > 2.8, giving leaves ŷ = 1, ŷ = 2, ŷ = 3.]
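A rough sketch of this greedy loop (illustrative only, not the lecture's reference implementation): it measures uncertainty with the Gini index defined on the next slides, assumes NumPy arrays X (n × d) and y (length n), and returns the chosen rules together with the final leaf datasets rather than an explicit tree.

```python
import numpy as np

def gini(y):
    """Gini index of a label vector (one of the uncertainty notions defined below)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_rule(X, y):
    """Best axis-aligned rule 1{x_i > t} for this leaf's data,
    returned as (uncertainty reduction, feature index i, threshold t)."""
    best, base = (0.0, None, None), gini(y)
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i])[:-1]:            # candidate thresholds
            right = X[:, i] > t
            wR = right.mean()
            red = base - ((1 - wR) * gini(y[~right]) + wR * gini(y[right]))
            if red > best[0]:
                best = (red, i, t)
    return best

def grow_tree(X, y, max_leaves=4):
    """Repeatedly split the (leaf, rule) pair with the largest uncertainty
    reduction, until max_leaves is reached or no split helps."""
    leaves = [(X, y)]          # each leaf is stored with the data it contains
    rules = []                 # chosen (feature, threshold, reduction) triples
    while len(leaves) < max_leaves:
        scored = [(best_rule(Xl, yl), k) for k, (Xl, yl) in enumerate(leaves)]
        (red, i, t), k = max(scored, key=lambda s: s[0][0])
        if red <= 0:           # no rule reduces uncertainty at any leaf
            break
        Xl, yl = leaves.pop(k)
        right = Xl[:, i] > t
        leaves += [(Xl[~right], yl[~right]), (Xl[right], yl[right])]
        rules.append((i, t, red))
    # The label of each final leaf would be the plurality label of its yl.
    return rules, leaves
```

Building an explicit Node/Leaf tree as in the earlier sketch, and predicting with plurality leaf labels, is a straightforward extension.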
Notions of uncertainty: binary case (Y = {0, 1})

Suppose in a set of examples S ⊆ X × {0, 1}, a p fraction are labeled as 1.

1. Classification error:  u(S) := min{p, 1 − p}
2. Gini index:            u(S) := 2p(1 − p)
3. Entropy:               u(S) := p log(1/p) + (1 − p) log(1/(1 − p))

[Figure: the three measures plotted as functions of p ∈ [0, 1]; each is zero at p ∈ {0, 1} and largest at p = 1/2.]

Gini index and entropy (after some rescaling) are concave upper bounds on classification error.

Notions of uncertainty: general case (Y = {1, 2, . . . , K})

Suppose in S ⊆ X × Y, a p_y fraction are labeled as y (for each y ∈ Y).

1. Classification error:  u(S) := 1 − max_{y∈Y} p_y
2. Gini index:            u(S) := 1 − Σ_{y∈Y} p_y²
3. Entropy:               u(S) := Σ_{y∈Y} p_y log(1/p_y)

Each is maximized when p_y = 1/K for all y ∈ Y (i.e., equal numbers of each label in S).
Each is minimized when p_y = 1 for a single label y ∈ Y (so S is pure in label).
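A small sketch (illustrative Python/NumPy, not from the lecture) computing the three general-case measures from a vector of label fractions p_y:

```python
import numpy as np

def classification_error(p):
    """u(S) = 1 - max_y p_y, for a vector p of label fractions summing to 1."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p.max()

def gini_index(p):
    """u(S) = 1 - sum_y p_y^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """u(S) = sum_y p_y log(1/p_y), with the convention 0 * log(1/0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.sum(nz * np.log(1.0 / nz)))

# Binary sanity check at p = 1/2: error 0.5, Gini 0.5, entropy log 2.
p = [0.5, 0.5]
print(classification_error(p), gini_index(p), entropy(p))
```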
Uncertainty reduction

Suppose the data S at a leaf ℓ is split by a rule h into S_L and S_R, where
w_L := |S_L|/|S| and w_R := |S_R|/|S|.

[Diagram: the data S at leaf ℓ has uncertainty u(S); the w_L fraction with h(x) = 0 forms S_L with uncertainty u(S_L), and the w_R fraction with h(x) = 1 forms S_R with uncertainty u(S_R).]

The reduction in uncertainty from using rule h at leaf ℓ is

    u(S) − (w_L · u(S_L) + w_R · u(S_R)).

Uncertainty reduction: example

[Figure: scatter plot of the iris data (sepal length/width vs. petal length/width) after the first split x1 > 1.7, with leaves labeled ŷ = 1 and ŷ = 3.]

One leaf (with ŷ = 1) already has zero uncertainty (a pure leaf).
The other leaf (with ŷ = 3) has Gini index

    u(S) = 1 − (1/101)² − (50/101)² − (50/101)² = 0.5098.

Split S with 1{x2 > 2.7222} into S_L and S_R:

    u(S_L) = 1 − (1/30)² − (29/30)² = 0.0605,
    u(S_R) = 1 − (1/71)² − (49/71)² − (21/71)² = 0.4197.

Reduction in uncertainty:

    0.5098 − (30/101 · 0.0605 + 71/101 · 0.4197) = 0.2039.
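A sketch of this computation in code (illustrative, not from the lecture), working from per-label counts; as a check, the Gini index of the impure leaf's label counts (1, 50, 50) matches the 0.5098 above.

```python
import numpy as np

def gini_from_counts(counts):
    """Gini index u(S) computed from per-label counts in S."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def uncertainty_reduction(counts_L, counts_R, u=gini_from_counts):
    """u(S) - (w_L * u(S_L) + w_R * u(S_R)) for a split of S into S_L, S_R,
    where each argument is the vector of per-label counts on that side."""
    counts_L = np.asarray(counts_L, dtype=float)
    counts_R = np.asarray(counts_R, dtype=float)
    nL, nR = counts_L.sum(), counts_R.sum()
    wL, wR = nL / (nL + nR), nR / (nL + nR)
    return u(counts_L + counts_R) - (wL * u(counts_L) + wR * u(counts_R))

# The impure leaf from the iris example has label counts (1, 50, 50):
print(round(gini_from_counts([1, 50, 50]), 4))   # 0.5098, as on the slide
```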
Comparing notions of uncertainty

It is possible to have a splitting rule h with
- zero reduction in classification error, but
- non-zero reduction in Gini index or entropy.

Example (Y = {0, 1}): in the data S at leaf ℓ, 1/3 of S have y = 1. A rule h sends the 2/3 fraction of S with h(x) = 0 to S_L and the 1/3 fraction with h(x) = 1 to S_R, where 2/5 of S_L have y = 1 and 1/5 of S_R have y = 1.

Reduction in classification error:

    1/3 − (2/3 · 2/5 + 1/3 · 1/5) = 0.

Reduction in Gini index:

    4/9 − (2/3 · 12/25 + 1/3 · 8/25) = 4/225.

[Figure: the uncertainty curves plotted over the relevant range of p for this example.]

Aside: reduction in entropy

The (Shannon) entropy of a Z-valued random variable Z is

    H(Z) := Σ_{z∈Z} P(Z = z) log(1/P(Z = z)).

The conditional entropy of Z given a W-valued random variable W is

    H(Z|W) := Σ_{w∈W} P(W = w) · H(Z|W = w).

(This is a weighted average of the entropies of the random variables of the form Z|W = w.)

Think of (X, Y) as a random pair taking value (x, y) ∈ S with probability 1/|S|.
Using entropy as the uncertainty measure, u(S) = H(Y).

The reduction in uncertainty after a split g : X → {0, 1} is then

    H(Y) − (P(g(X) = 0) · H(Y | g(X) = 0) + P(g(X) = 1) · H(Y | g(X) = 1))
         = H(Y) − H(Y | g(X)).

This is called the mutual information between Y and g(X).
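As a quick check of the example above (an illustrative sketch, not from the lecture), exact rational arithmetic confirms that the split leaves classification error unchanged while reducing the Gini index by 4/225:

```python
from fractions import Fraction as F

def err(p):   # classification error u(p) = min{p, 1 - p}
    return min(p, 1 - p)

def gini(p):  # Gini index u(p) = 2 p (1 - p)
    return 2 * p * (1 - p)

pS, pL, pR = F(1, 3), F(2, 5), F(1, 5)   # fractions of y = 1 in S, S_L, S_R
wL, wR = F(2, 3), F(1, 3)                # fractions of S sent to S_L, S_R

print(err(pS)  - (wL * err(pL)  + wR * err(pR)))   # 0
print(gini(pS) - (wL * gini(pL) + wR * gini(pR)))  # 4/225
```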
Limitations of uncertainty notions

Suppose X = R² and Y = {red, blue}, and the data is as follows:

[Figure: a two-class point cloud in the (x1, x2) plane in which no single coordinate split separates red from blue.]

Every split of the form 1{xi > t} provides no reduction in uncertainty (whether based on classification error, Gini index, or entropy); a worked illustration follows the recap below.

Upshot: zero reduction in uncertainty may not be a good stopping condition.

Basic decision tree learning algorithm (recap)

Basic "top-down" greedy algorithm:

- Initially, the tree is a single leaf node containing all (training) data.
- Loop:
  - Pick the leaf ℓ and rule h that maximally reduce uncertainty.
  - Split the data in ℓ using h, and grow the tree accordingly.
  . . . until some stopping criterion is satisfied.

[The label of a leaf is the plurality label among the data contained in the leaf.]

[Figure: the same tree-growing sequence as before: a single leaf (ŷ = 2), then a split on x1 > 1.7, then a further split on x2 > 2.8.]
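To make the limitation concrete, here is a small check on hypothetical XOR-style data (my own illustration; the lecture's figure uses a similar but different dataset): for a perfectly balanced XOR layout, both coordinate splits leave each child with a 50/50 label mix, so none of the three measures registers any reduction.

```python
import numpy as np

def entropy(p):
    nz = p[p > 0]
    return float(np.sum(nz * np.log(1.0 / nz)))

def label_fractions(y):
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def reduction(X, y, i, t, u):
    """Uncertainty reduction of the rule 1{x_i > t} under the measure u."""
    mask = X[:, i] > t
    wR = mask.mean()
    return u(label_fractions(y)) - ((1 - wR) * u(label_fractions(y[~mask]))
                                    + wR * u(label_fractions(y[mask])))

# Balanced XOR-style data: the label is the XOR of the two coordinates.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

measures = {"error":   lambda p: 1 - p.max(),
            "gini":    lambda p: 1 - np.sum(p ** 2),
            "entropy": entropy}
for name, u in measures.items():
    print(name, [round(reduction(X, y, i, 0.5, u), 6) for i in range(2)])
# both coordinate splits at t = 0.5 give reduction 0 under all three measures
```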
Stopping criterion

Many alternatives; two common choices are:

1. Stop when the tree reaches a pre-specified size.
   Involves setting additional "tuning parameters" (similar to k in k-NN).
2. Stop when every leaf is pure. (More common.)
   Serious danger of overfitting spurious structure due to sampling.

[Figure: two-class data in the (x1, x2) plane illustrating spurious structure picked up when growing to pure leaves.]

Overfitting

[Figure: training error and true error plotted against the number of nodes in the tree.]

- Training error goes to zero as the number of nodes in the tree increases.
- True error decreases initially, but eventually increases due to overfitting.
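If a library implementation is used, both stopping criteria map onto standard size parameters. A minimal sketch with scikit-learn (assuming it is installed, and using the raw iris features rather than the ratio features from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Criterion 1: stop when the tree reaches a pre-specified size.
small = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=4).fit(X_tr, y_tr)

# Criterion 2: grow until every leaf is pure (essentially scikit-learn's default).
full = DecisionTreeClassifier(criterion="gini").fit(X_tr, y_tr)

print("small tree test accuracy:", small.score(X_te, y_te))
print("fully grown test accuracy:", full.score(X_te, y_te))
```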
What can be done about overfitting?

Preventing overfitting

Split the training data S into two parts, S′ and S″:
- Use the first part S′ to grow the tree until all leaves are pure.
- Use the second part S″ to choose a good pruning of the tree.

Pruning algorithm
- Loop:
  - Replace any tree node by a leaf node if doing so improves the error on S″.
  . . . until no more such improvements are possible.

This can be done efficiently using dynamic programming (a bottom-up traversal of the tree); a sketch of this pruning appears after the spam example below. The independence of S′ and S″ makes it unlikely for spurious structures in each to perfectly align.

Example: Spam filtering

Data
- 4601 e-mail messages, 39.4% of which are spam.
- Y = {spam, not spam}
- E-mails represented by 57 features:
  - 48: percentage of e-mail words that is a specific word (e.g., "free", "business")
  - 6: percentage of e-mail characters that is a specific character (e.g., "!")
  - 3: other features (e.g., average length of ALL-CAPS words)

Results
- A variant of the greedy algorithm is used to grow the tree; the tree is pruned using a validation set.
- The chosen tree has just 17 leaves. Test error is 9.3%.

                    y = not spam   y = spam
    ŷ = not spam        57.3%        5.3%
    ŷ = spam             4.0%       33.4%
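The pruning loop can be implemented as one bottom-up pass: prune both subtrees first, then keep a node only if collapsing it to a leaf would hurt accuracy on the held-out part S″. The sketch below assumes the Node/Leaf dataclasses and predict() from the earlier snippet; for brevity the collapsed leaf takes the plurality label of the validation data reaching that node, whereas the lecture's version would use the plurality training label stored there.

```python
from collections import Counter
# Assumes the Node / Leaf dataclasses and predict() from the earlier sketch.

def errors(tree, X, y):
    """Number of mistakes the (sub)tree makes on the validation data (X, y)."""
    return sum(predict(tree, x) != yi for x, yi in zip(X, y))

def prune(tree, X_val, y_val):
    """Reduced-error pruning: bottom-up, replace a node by a leaf whenever
    that does not increase the error on the validation data."""
    if isinstance(tree, Leaf) or len(y_val) == 0:
        return tree
    # Route the validation data the same way the tree routes it, then recurse.
    go_right = [x[tree.feature] > tree.threshold for x in X_val]
    left  = prune(tree.left,  [x for x, r in zip(X_val, go_right) if not r],
                              [c for c, r in zip(y_val, go_right) if not r])
    right = prune(tree.right, [x for x, r in zip(X_val, go_right) if r],
                              [c for c, r in zip(y_val, go_right) if r])
    pruned = Node(tree.feature, tree.threshold, left, right)
    candidate = Leaf(label=Counter(y_val).most_common(1)[0][0])
    # Keep whichever of the two makes fewer mistakes on the validation data.
    if errors(candidate, X_val, y_val) <= errors(pruned, X_val, y_val):
        return candidate
    return pruned
```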
Example: Spam filtering

[Figure: the pruned tree for the spam example, reproduced from Hastie, Tibshirani & Friedman, The Elements of Statistical Learning (2nd ed.), 2009, Figure 9.5 (Chapter 9). The split variables are shown on the branches, the classification (email or spam) is shown in every node, and the numbers under the terminal nodes indicate misclassification rates on the test data.]

Final remarks

- Decision trees are very flexible classifiers (like NN).
- Certain greedy strategies for training decision trees are consistent.
- But they are also very prone to overfitting in their most basic form.
- (It is NP-hard to find the smallest decision tree consistent with the data.)

Current theoretical understanding of (greedy) decision tree learning:
- As fitting a non-parametric model (like NN).
- As a meta-algorithm for combining classifiers (splitting rules).

Key takeaways

1. Structure of decision tree classifiers.
2. Greedy learning algorithm based on notions of uncertainty; limitations of the greedy algorithm.
3. High-level idea of overfitting and ways to deal with it.