Decision Trees

Statistical Machine Learning from Data
Decision Trees
Samy Bengio
IDIAP Research Institute, Martigny, Switzerland, and
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
[email protected]
http://www.idiap.ch/~bengio
January 25, 2006
1 Introduction
2 Training a Decision Tree
3 Controlling the Capacity of a Decision Tree
Rule-Based Decisions
Let D_n be a training set of n pairs (x_i, y_i)
Let y be a class label (say −1 or 1)
Let x be a vector of d attributes denoted {x_j | 1 ≤ j ≤ d}.
A decision rule could be:
if x_3 is true AND x_5 ≤ 3.2 then y = 1
How can we construct such a decision?
This is the purpose of the family of decision trees.
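To make the rule above concrete, here is a minimal Python sketch of such a hand-written decision rule. Each example is assumed to be a dict of named attributes; the attribute names x3 and x5 come from the example, everything else is illustrative:

```python
def decision_rule(x):
    """Hand-written rule: if x3 is true AND x5 <= 3.2 then y = 1, else y = -1."""
    if x["x3"] and x["x5"] <= 3.2:
        return 1
    return -1

print(decision_rule({"x3": True, "x5": 2.7}))   # -> 1
print(decision_rule({"x3": True, "x5": 4.0}))   # -> -1
```

A decision tree is a way to learn such nested rules automatically from data instead of writing them by hand.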
A Decision Tree
[Figure: an example decision tree. The root tests x_3 = 1 (yes/no); internal nodes test attributes such as x_4 (red/green/blue), x_2 < 3.2, x_1 > −1.3, x_2 > 3, and x_5; the leaves predict y = −1 or y = 1.]
The Most Informative Feature
At each node i of the tree, we have a subset D_i of training examples.
We would like to select a feature j that splits D_i wisely with respect to the task.
The objective: after splitting D_i, the examples in each resulting node should ideally all be of the same class; this is called purity.
There are several heuristics that push towards purity.
Let us first look at the case of discrete attributes.
The choice made by most decision tree algorithms: maximize the information gain.
Information Gain
maximize the information gain
IG_j = H(Y) − H(Y | X_j)
where H(Y) is the entropy of random variable Y
(the average number of bits needed to transmit Y)
H(Y) = − Σ_k P(Y = k) log P(Y = k)
and H(Y | X_j) is the conditional entropy
H(Y | X_j) = − Σ_m P(X_j = m) Σ_k P(Y = k | X_j = m) log P(Y = k | X_j = m)
We thus select
j* = arg max_j IG_j
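A minimal Python/NumPy sketch of these quantities for discrete attributes, estimating the probabilities by empirical frequencies (the helper names are mine, not from the lecture):

```python
import numpy as np

def entropy(y):
    """H(Y) = -sum_k P(Y=k) log2 P(Y=k), estimated from class frequencies."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(x, y):
    """H(Y|X) = sum_m P(X=m) H(Y|X=m) for one discrete attribute x."""
    values, counts = np.unique(x, return_counts=True)
    p_x = counts / counts.sum()
    return float(sum(p * entropy(y[x == v]) for v, p in zip(values, p_x)))

def information_gain(x, y):
    """IG = H(Y) - H(Y|X)."""
    return entropy(y) - conditional_entropy(x, y)

def best_feature(X, y):
    """j* = argmax_j IG_j over the d columns of X (shape: n examples x d attributes)."""
    gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return int(np.argmax(gains)), gains
```

The base of the logarithm (here base 2, i.e. bits) does not change which feature maximizes the gain.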
Decision Stumps
A stump is simply a one-level decision tree
Hence, you select THE feature that maximizes the information gain over your whole data set
[Figure: a decision stump: a single root test x_3 = 1 with two leaves, y = 1 and y = −1.]
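As a sketch, a stump can be fitted by picking the most informative column and storing the majority class under each of its values. This assumes the `information_gain` helper from the previous sketch is in scope; the function names are mine:

```python
import numpy as np

def fit_stump(X, y):
    """One-level decision tree: split once on the most informative feature,
    then predict the majority class within each value of that feature."""
    gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    j = int(np.argmax(gains))
    leaves = {}
    for v in np.unique(X[:, j]):
        classes, counts = np.unique(y[X[:, j] == v], return_counts=True)
        leaves[v] = classes[np.argmax(counts)]
    return {"feature": j, "leaves": leaves}

def predict_stump(stump, x):
    """Predict for one example x (a 1-D array of attribute values);
    unseen attribute values fall back to an arbitrary leaf."""
    leaves = stump["leaves"]
    return leaves.get(x[stump["feature"]], next(iter(leaves.values())))
```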
Recursion Step
Take the original dataset
Partition it according to the value of the attribute we split on
For each partition, create a decision stump
Do this recursively... until there are no more examples to split
You obtain a decision tree!
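A minimal recursive sketch of this procedure for discrete attributes (ID3-style). It assumes the `information_gain` helper from the earlier sketch is in scope; all names are mine:

```python
import numpy as np

def build_tree(X, y, features=None):
    """Recursively split on the most informative remaining attribute
    until a node is pure or no attribute is left."""
    if features is None:
        features = list(range(X.shape[1]))
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    if len(classes) == 1 or not features:
        return {"leaf": majority}
    gains = [information_gain(X[:, j], y) for j in features]
    j = features[int(np.argmax(gains))]
    children = {}
    for v in np.unique(X[:, j]):
        mask = X[:, j] == v
        children[v] = build_tree(X[mask], y[mask],
                                 [f for f in features if f != j])
    return {"feature": j, "children": children, "default": majority}

def predict(tree, x):
    """Walk down the tree; unseen attribute values fall back to the node's majority class."""
    while "leaf" not in tree:
        tree = tree["children"].get(x[tree["feature"]], {"leaf": tree["default"]})
    return tree["leaf"]
```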
Continuous Input Variables
What do we do with non-discrete input variables?
Instead of partitioning according to X_j = m,
find the threshold t on X_j such that
H(Y | X_{j,t}) = min_t [ H(Y | X_j < t) P(X_j < t) + H(Y | X_j ≥ t) P(X_j ≥ t) ]
(minimizing this weighted conditional entropy maximizes the gain)
Compute the information gain as usual:
IG_j = H(Y) − H(Y | X_{j,t})
If IG_j is the maximum among all features, partition X_j according to X_j < t or not.
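A sketch of this threshold search for one continuous attribute, trying the midpoints between consecutive sorted values as candidate thresholds (assumes the `entropy` helper from the earlier sketch is in scope):

```python
import numpy as np

def best_threshold(x, y):
    """Return the threshold t minimizing
    H(Y | x < t) P(x < t) + H(Y | x >= t) P(x >= t),
    together with that minimal weighted conditional entropy."""
    best_t, best_h = None, np.inf
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2.0:  # midpoints between sorted values
        left, right = y[x < t], y[x >= t]
        p_left = len(left) / len(y)
        h = p_left * entropy(left) + (1.0 - p_left) * entropy(right)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

# The information gain of this attribute at its best threshold is then
# entropy(y) - best_threshold(x, y)[1], compared against the other features.
```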
Controlling the Capacity
If you build a complete tree, it will learn your whole training set by heart
Hence, its capacity is virtually infinite
We need to build trees with controlled capacity
There are various ways to do this:
Don't build the complete tree, but stop growing it early (see the sketch after this list)
Build the complete tree, and then prune it
... in either case according to some validation set or prior information ...
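A minimal sketch of the first option (early stopping): the same recursive construction as before, but with hypothetical max_depth and min_samples limits that would typically be chosen on a validation set (assumes the `information_gain` helper from the earlier sketch is in scope):

```python
import numpy as np

def build_tree_limited(X, y, max_depth=3, min_samples=5, depth=0, features=None):
    """Stop splitting when the node is pure, no attribute is left,
    the depth limit is reached, or too few examples remain."""
    if features is None:
        features = list(range(X.shape[1]))
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    if (len(classes) == 1 or not features
            or depth >= max_depth or len(y) < min_samples):
        return {"leaf": majority}
    gains = [information_gain(X[:, j], y) for j in features]
    j = features[int(np.argmax(gains))]
    children = {v: build_tree_limited(X[X[:, j] == v], y[X[:, j] == v],
                                      max_depth, min_samples, depth + 1,
                                      [f for f in features if f != j])
                for v in np.unique(X[:, j])}
    return {"feature": j, "children": children, "default": majority}
```

Smaller values of max_depth and larger values of min_samples give smaller trees, i.e. lower capacity.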