Tree-based Models
• Input vector X = (X_1, X_2, . . . , X_p) ∈ X;
• Response variable Y ∈ R for regression, or Y ∈ {1, 2, . . . , K} for classification.
• Trees are constructed by recursively splitting subsets of X into two
subsets, beginning with the whole space X .
For simplicity, we focus on recursive binary partitions.
• Notation: node t; child nodes (t_L, t_R); leaf/terminal node; split (variable j, value s).
• Every leaf node (i.e., a rectangular region R_m in X) is assigned a class for a classification tree, or a constant for a regression tree:
      f̂(X) = Σ_m c_m I{X ∈ R_m}.
• R page: see the regression tree fitted to the Hitters data (a sketch follows).
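A minimal R sketch of such a fit. It assumes the Hitters data come from the ISLR package and uses rpart for the tree; the choice of predictors (Years, Hits) is only illustrative, not necessarily the one on the referenced R page.

## Regression tree on the Hitters data (sketch; assumes ISLR and rpart are installed)
library(ISLR)    # provides the Hitters data set
library(rpart)   # recursive partitioning (CART)

hitters <- na.omit(Hitters)                        # drop players with missing Salary
fit <- rpart(log(Salary) ~ Years + Hits,           # regression tree, two predictors
             data = hitters, method = "anova")
print(fit)                                         # splits (j, s) and leaf constants c_m
plot(fit); text(fit)                               # draw the fitted tree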
Advantages of Trees
• Variable selection and interactions between variables are handled
automatically
• Invariant under any monotone transformation of variables
• Robust to outliers
• Easy to interpret
How to Build a Tree?
Three elements:
1. Where to split?
2. When to stop?
3. How to predict at each leaf node?
Prediction at Leaf Nodes
Each leaf node contains some training samples. The prediction at a leaf node is taken to be (see the sketch after the list):
• the average (of the response variable Y ) for regression,
• the most frequent label (i.e., hard decision via majority voting) or
the frequency vector (i.e., soft decision rule – a probability vector
over all K classes) for classification.
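A minimal sketch of these two rules in R, applied to the responses falling into a single (hypothetical) leaf node:

## Leaf-node predictions (sketch)
y_reg <- c(4.2, 5.0, 3.8, 4.6)                 # regression: numeric responses in the leaf
pred_reg <- mean(y_reg)                        # the leaf constant c_m

y_cls <- factor(c("A", "B", "A", "A", "C"))    # classification: labels in the leaf
pred_hard <- names(which.max(table(y_cls)))    # majority vote ("A")
pred_soft <- prop.table(table(y_cls))          # class-frequency (probability) vector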
Where to Split?
• A split is denoted by (j, s): split the data into two parts according to whether variable j is below value s (X_j < s) or not.
• For each split, define a goodness-of-split criterion Φ(j, s):
  – the reduction in RSS for regression, or
  – the reduction in an impurity measure for classification.
• Start with the root node. Try all possible variables j = 1, . . . , p and all possible split values*, and pick the best split, i.e., the split with the best Φ value. The data are then divided into a left node and a right node. Repeat this procedure in each node.
  *For each variable j, sort its n observed values (from the n samples) and take s to be a midpoint of two adjacent values, so there are at most (n − 1) candidate values for s (a small sketch follows).
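A small R sketch of the footnote: the candidate split values for one variable are the midpoints of adjacent sorted sample values.

## Candidate split values s for a single variable (sketch)
candidate_splits <- function(x) {
  xs <- sort(unique(x))
  (xs[-1] + xs[-length(xs)]) / 2      # at most (n - 1) midpoints
}

candidate_splits(c(3.1, 0.5, 2.0, 2.0, 4.7))   # 1.25 2.55 3.90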
Goodness of Split Φ(j, s)
For a regression tree, we look at the reduction in RSS when the samples at node t are split into t_L and t_R:
      Φ(j, s) = RSS(t) − [RSS(t_L) + RSS(t_R)],
where
      RSS(t) = Σ_{x_i ∈ t} (y_i − c_t)²,   c_t = AVE{y_i : x_i ∈ t}.
Note that Φ(j, s) is always positive if we split the data into two groups (even randomly), unless the means of the left node and the right node are equal. A sketch of an exhaustive best-split search using this criterion follows.
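A sketch (not from the slides) of the exhaustive search for a regression node, scoring every candidate split (j, s) by this reduction in RSS; the toy data are made up.

## Exhaustive best-split search by RSS reduction (sketch)
rss <- function(y) sum((y - mean(y))^2)

best_split <- function(X, y) {
  best <- list(phi = -Inf)
  for (j in seq_len(ncol(X))) {
    xs <- sort(unique(X[, j]))
    for (s in (xs[-1] + xs[-length(xs)]) / 2) {          # candidate midpoints
      left <- X[, j] < s
      phi  <- rss(y) - (rss(y[left]) + rss(y[!left]))    # goodness of split Phi(j, s)
      if (phi > best$phi) best <- list(j = j, s = s, phi = phi)
    }
  }
  best
}

set.seed(1)
X <- matrix(rnorm(40), ncol = 2)
y <- 2 * (X[, 1] > 0) + rnorm(20, sd = 0.1)
best_split(X, y)    # recovers a split on variable j = 1 with s near 0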
For a classification tree, we look at the reduction in an impurity measure,
      Φ(j, s) = i(t) − [p_L · i(t_L) + p_R · i(t_R)],
where p_L and p_R are the proportions of the samples at node t sent to t_L and t_R, and
      i(t) = I(p̂_t(1), . . . , p̂_t(K)),
with p̂_t(j) the frequency of class j at node t and I(· · ·) an impurity measure.
Ideally we want the impurity measure to be small at each node (similar to the RSS for regression), so we pick the split (j, s) that yields the largest reduction in impurity. A sketch of this gain computation follows.
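A minimal sketch of the gain computation, taking the class counts in the two child nodes. Gini impurity is used here for concreteness; any impurity measure from the next slides can be substituted.

## Impurity gain Phi(j, s) from the class counts of the two children (sketch)
gini_counts <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

split_gain <- function(left, right, impurity = gini_counts) {
  parent <- left + right
  pL <- sum(left)  / sum(parent)     # proportion of samples sent to t_L
  pR <- sum(right) / sum(parent)     # proportion of samples sent to t_R
  impurity(parent) - (pL * impurity(left) + pR * impurity(right))
}

split_gain(c(30, 10), c(5, 25))      # counts are (class 1, class 2) in each child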
Impurity Measures
An impurity measure is a function I(p_1, . . . , p_K), where p_j ≥ 0 and Σ_j p_j = 1, with the properties:
1. the maximum occurs at (1/K, . . . , 1/K) (the most impure node);
2. the minimum occurs when p_j = 1 for some j (the purest node);
3. I(· · ·) is a symmetric function of p_1, . . . , p_K, i.e., permuting the p_j does not affect I.
Examples of impurity functions (sketched in R below):
• Misclassification rate: 1 − max_j p_j
• Entropy or deviance: −Σ_{j=1}^K p_j log p_j
• Gini index: Σ_{j=1}^K p_j(1 − p_j) = 1 − Σ_j p_j²
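Minimal R sketches of these three functions, each taking a vector of class proportions p (p_j ≥ 0, Σ_j p_j = 1):

## Impurity measures (sketch)
misclass <- function(p) 1 - max(p)
entropy  <- function(p) -sum(ifelse(p > 0, p * log(p), 0))   # convention: 0 * log 0 = 0
gini     <- function(p) 1 - sum(p^2)

p <- c(0.5, 0.3, 0.2)
c(misclass(p), entropy(p), gini(p))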
Specifically, when K = 2 (writing p for the frequency of one of the two classes), we have
• Misclassification rate: min(p, 1 − p)
• Entropy or deviance: p log(1/p) + (1 − p) log(1/(1 − p))
• Gini index: 2p(1 − p)
The latter two are strictly concave, so the corresponding goodness-of-split measure Φ is always positive (unless the class frequencies of the left and right nodes are exactly the same); for this reason they are often used in growing trees. The three curves are compared in the sketch below.
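A short R sketch plotting the three curves for K = 2 (assuming the natural logarithm in the entropy, as in the formula above); the strict concavity of entropy and Gini is visible in the plot.

## Impurity measures as functions of p when K = 2 (sketch)
p <- seq(0, 1, length.out = 201)
mis <- pmin(p, 1 - p)
ent <- ifelse(p %in% c(0, 1), 0, -p * log(p) - (1 - p) * log(1 - p))
gin <- 2 * p * (1 - p)

plot(p, ent, type = "l", ylab = "impurity", main = "Impurity measures, K = 2")
lines(p, gin, lty = 2)
lines(p, mis, lty = 3)
legend("topright", legend = c("entropy", "Gini", "misclassification"), lty = 1:3)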
Consider a binary classification problem. Evaluate the following split:
      (20, 5) =⇒ (10, 0) + (10, 5).
• A node with 20 samples from “class 1” and 5 samples from “class 0” is split into two:
• the left node has 10 samples from “class 1” and 0 samples from “class 0”, and
• the right node has 10 samples from “class 1” and 5 samples from “class 0”.
(20, 5) =⇒ (10, 0) + (10, 5).
• Impurity gain based on the misclassification rate:
      5/25 − (10/25) · (0/10) − (15/25) · (5/15) = 0.
• Impurity gain based on entropy:
      [(5/25) log(25/5) + (20/25) log(25/20)] − (10/25) · 0 − (15/25) · [(5/15) log(15/5) + (10/15) log(15/10)] > 0.
This split is regarded as zero gain under the misclassification rate, but positive gain under entropy or Gini (which favor splits that lead to pure nodes). The sketch below verifies this numerically.
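A quick numerical check of this example in R (a sketch; counts are written as (class 1, class 0)):

## Verify the (20, 5) => (10, 0) + (10, 5) example (sketch)
misclass <- function(counts) {
  p <- counts / sum(counts)
  1 - max(p)
}
entropy <- function(counts) {
  p <- counts / sum(counts)
  -sum(ifelse(p > 0, p * log(p), 0))
}

gain <- function(imp, left, right) {
  parent <- left + right
  pL <- sum(left) / sum(parent)
  pR <- sum(right) / sum(parent)
  imp(parent) - pL * imp(left) - pR * imp(right)
}

gain(misclass, c(10, 0), c(10, 5))   # 0: no gain under the misclassification rate
gain(entropy,  c(10, 0), c(10, 5))   # about 0.12 > 0: entropy rewards the pure left node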
When to Stop?
• A simple rule: stop splitting at a node t if no split achieves a gain above some pre-specified threshold.
• BUT, this is short-sighted.
• Another strategy: grow a large tree and then prune it (i.e., cut
some branches).
Preliminaries for Pruning
First, grow a very large tree T_max:
1. until all terminal nodes are nearly pure;
2. or until the number of observations in each terminal node falls below a certain threshold;
3. or until the tree reaches a certain size.
As long as the tree is sufficiently large, the size of this initial tree is not critical.
Notation: subtree T′ ≺ T; branch T_t (the subtree of T rooted at node t).
Minimum Cost-Complexity Pruning
For any subtree T ≺ T_max, define the cost-complexity criterion
      R_α(T) = R(T) + α|T|,
where
• R can be the sum of squares (over the terminal nodes) for regression trees, or a weighted sum (over the terminal nodes) of one of the impurity measures for classification trees;
• |T| is the tree size, i.e., the number of leaf nodes;
• α is the penalty parameter.
• Pick the best subtree that minimizes the cost:
      T(α) = argmin_{T ≺ T_max} R_α(T).
• T(α) may not be unique.
• Define the optimal subtree T*(α) to be the smallest one among the T(α)’s:
  (1) R_α(T*(α)) = min_{T ≺ T_max} R_α(T);
  (2) T*(α) ⪯ T(α) for every minimizer T(α).
  T*(α) is unique.
• For a pair of sibling leaf nodes (t_L, t_R) with parent t, there exists an α* such that for any α > α* you would rather collapse them into the single node t; otherwise, keep them.
That is, α* is the maximal price you are willing to pay to keep that split.
• For any non-leaf node t, do the following calculation to find the maximal price we are willing to pay to keep the whole branch T_t:
  – Define R_α({t}) = R({t}) + α (the cost if t is collapsed into a single leaf);
  – Define R_α(T_t) = R(T_t) + α|T_t|;
  – Calculate
        α* = [R({t}) − R(T_t)] / (|T_t| − 1).
That is, if the given α > α*, it is too expensive to keep this branch: we cut the whole branch and make t a leaf node. A tiny sketch of this computation follows.
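A tiny helper illustrating the formula above (a sketch with made-up numbers, not from the slides):

## Critical alpha for keeping branch T_t (sketch)
critical_alpha <- function(R_leaf, R_branch, n_leaves) {
  (R_leaf - R_branch) / (n_leaves - 1)     # alpha* = (R({t}) - R(T_t)) / (|T_t| - 1)
}

## Example: collapsing t to a leaf costs R({t}) = 12, the branch costs R(T_t) = 7
## and has |T_t| = 3 leaves, so the branch is worth keeping only for alpha < 2.5.
critical_alpha(12, 7, 3)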
Weakest-Link Pruning
The weakest-link pruning algorithm.
• Start with T0 = Tmax and α0 = 0.
• For each non-leaf node t, denote by α(t) the maximal price we are willing to pay to keep T_t.
• α_1 = min_t α(t). The corresponding (non-terminal) node t_1 is called the weakest link. Cut the branch at t_1.
• Recompute the maximal price for each remaining non-leaf node, find α_2, and cut the branch at the second weakest link. Keep doing this until only the root node is left.
The steps above generate a solution path
      T_0 ≻ T*(α_1) ≻ T*(α_2) ≻ · · · ≻ {root node},
      0 = α_0 < α_1 < α_2 < · · · ,
and T*(α_i) remains the optimal subtree for all α ∈ [α_i, α_{i+1}) (with α_{m+1} = ∞ for the final interval).
• How to choose α? Cross-validation.
• Why introduce the cost-complexity criterion? Can’t we just apply cross-validation to select a subtree directly? There are far too many subtrees; the cost-complexity criterion narrows the candidates down to the subtrees on the solution path, and CV is then applied to that small set (a sketch using R’s rpart follows).
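A sketch of this procedure using rpart, which implements cost-complexity pruning and runs the cross-validation internally (its complexity parameter cp is, roughly, α rescaled by the root-node cost); the Hitters data are reused here purely as an assumed example.

## Cost-complexity pruning with CV via rpart (sketch; assumes ISLR and rpart)
library(ISLR)
library(rpart)

hitters <- na.omit(Hitters)
big <- rpart(log(Salary) ~ ., data = hitters, method = "anova",
             cp = 0, minsplit = 2)                    # grow a large tree T_max

printcp(big)                                          # CV error along the solution path
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)                   # subtree selected by CV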
Other Issues
• Why binary splits? – Multiway splits fragment the data too quickly.
• Linear combination splits.
• Instability due to the hierarchical process.
• Lack of smoothness.