Learning with Trees
Unit 6
Machine Learning
University of Vienna
Learning with Trees
Quinlan's ID3 (C4.5)
Classification and Regression Trees (CART)
Benefits of the decision tree
The computational cost of making the tree is fairly low.
The cost of using it is even lower: O(log N), where N is the number of data points.
Important for machine learning:
- querying the trained algorithm is fast
- the result is immediately easy to understand
- this makes people trust it more than getting an answer from a black box
E.g., phoning a helpline for computer faults: the phone operators are guided through the decision tree by your answers to their questions.
Idea of a decision tree
We start at the root (base) of the tree and progress down to the
leaves, where we receive the classification decision.
At each stage we choose a question that gives us the most
information given what we know already. Encoding this
mathematically is the task of information theory.
The trees can even be turned into a set of if-then rules, suitable for
use in a rule induction system.
Quick Aside: Entropy in Information Theory
Claude Shannon in 1948 published a paper: A Mathematical
Theory of Communication.
Information entropy describes the amount of impurity in a set of
features.
The entropy H of a set of probabilities p_i is

H(p) = -\sum_i p_i \log_2(p_i),

where the logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define

0 \log 0 = 0,    \log_2 p = \frac{\ln p}{\ln 2}
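As a quick illustration, here is a minimal Python sketch of this formula (an assumption of these notes, not course code), taking the probabilities p_i as a list:

# Minimal sketch of the entropy formula above; terms with p = 0 are skipped,
# which implements the convention 0 log 0 = 0.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)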
Binary Decision Theory
Binary decision: "positive" with probability p, "negative" with probability 1 - p.
The maximal entropy of 1 bit is reached at p = 0.5.
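Using the entropy sketch above for the binary case H(p, 1 - p) shows this maximum:

# Binary entropy for a few values of p; only p = 0.5 reaches the maximum of 1 bit.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(entropy([p, 1 - p]), 3))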
Decision tree
If the feature separates the examples into 50% positive and 50% negative, then the amount of entropy is at a maximum, and knowing about that feature is very useful to us.
For our decision tree, the best feature to pick as the one to classify on now is the one that gives us the most information, i.e., the one with the highest entropy.
After using that feature, we re-evaluate the entropy of each feature
and again pick the one with the highest entropy.
ID3
Information gain:
the entropy of the whole set S minus the entropy when a particular
feature F is chosen.
Gain(S, F) = Entropy(S) - \sum_{f \in values(F)} \frac{|S_f|}{|S|} Entropy(S_f),

where |S_f| is the number of members of S that have value f for feature F.
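A possible Python sketch of this gain computation (an assumption of these notes: the data for one feature F are given as (feature value, label) pairs over the set S):

# Hedged sketch: Gain(S, F) = Entropy(S) - sum_f |S_f|/|S| * Entropy(S_f).
import math
from collections import Counter

def entropy_of_labels(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data):
    """data: list of (feature_value, label) pairs."""
    labels = [label for _, label in data]
    gain = entropy_of_labels(labels)
    for value, count in Counter(v for v, _ in data).items():
        subset = [label for v, label in data if v == value]
        gain -= (count / len(data)) * entropy_of_labels(subset)
    return gain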
Example
Outcomes
S = (s1 = true, s2 = false, s3 = false, s4 = false)
Entropy(S) = -p(true) \log_2 p(true) - p(false) \log_2 p(false)
           = -\frac{1}{4} \log_2 \frac{1}{4} - \frac{3}{4} \log_2 \frac{3}{4} = 0.811
Suppose we have one feature F that can have values f_1, f_2, f_3, and
if s_1 then the feature has value f_2,
if s_2 then the feature has value f_2,
if s_3 then the feature has value f_3,
if s_4 then the feature has value f_1.
Example
Then

\frac{|S_{f_1}|}{|S|} Entropy(S_{f_1}) = 0

\frac{|S_{f_2}|}{|S|} Entropy(S_{f_2}) = \frac{2}{4} \left( -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} \right) = 0.5

\frac{|S_{f_3}|}{|S|} Entropy(S_{f_3}) = 0

Gain(S, F) = 0.811 - 0.5 = 0.311
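This number can be reproduced with the information_gain sketch given earlier, feeding in s_1..s_4 as (value of F, outcome) pairs:

# Prints roughly 0.311, matching the calculation above.
data = [("f2", True), ("f2", False), ("f3", False), ("f1", False)]
print(information_gain(data))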
The ID3 Algorithm
if all examples have the same label:
- return a leaf with that label
else if there are no features left to test:
- return a leaf with the most common label
else:
- choose the feature F̂ that maximises the information gain of S to be the next node
- add a branch from the node for each possible value f of F̂
- for each branch:
  o calculate S_f by removing F̂ from the set of features
  o recursively call the algorithm with S_f, to compute the gain relative to the current set of examples
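A compact Python sketch of this recursion (not the course implementation; it assumes examples are dicts mapping feature names to values, with labels in a parallel list, and returns nested-dict trees as used below):

import math
from collections import Counter

def entropy_of_labels(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, features):
    if len(set(labels)) == 1:                     # all examples share a label
        return labels[0]
    if not features:                              # no features left to test
        return Counter(labels).most_common(1)[0][0]

    def gain(f):                                  # information gain of feature f
        g = entropy_of_labels(labels)
        for value, count in Counter(x[f] for x in examples).items():
            subset = [y for x, y in zip(examples, labels) if x[f] == value]
            g -= (count / len(examples)) * entropy_of_labels(subset)
        return g

    best = max(features, key=gain)                # feature with the highest gain
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    for value in set(x[best] for x in examples):  # one branch per value of best
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx], remaining)
    return tree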
Implementation in Python
For a graph, the key to each dictionary entry is the name of the
node, and its value is a list of the nodes that it is connected to, as
in this example:
graph = {'A': ['B', 'C'], 'B': ['C', 'D'], 'C': ['D'], 'D': ['C'], 'E': ['F'], 'F': ['C']}
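A decision tree can be stored in the same way: each inner key names the feature tested at a node, and its value maps feature values to subtrees or leaf labels. For instance, part of the tree built in the example later in these slides would look like this (illustrative sketch):

# Nested-dict form of a (sub)tree: test Deadline, then Lazy on the Near branch.
subtree = {'Deadline': {'Urgent': 'Study',
                        'None': 'Pub',
                        'Near': {'Lazy': {'Yes': 'TV', 'No': 'Study'}}}}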
ID3 uses an inductive bias:
the next feature added into the tree is the one with the highest
information gain, which biases the algorithm towards smaller trees,
since it tries to minimise the amount of information that is left.
The shortest description of something, i.e., the most compressed
one, is the best description.
This is known as the Minimum Description Length (MDL) principle
(Rissanen, 1989).
ID3
ID3 can deal with noise and with missing data.
Continuous variables: in general, only one split is made, rather than allowing for three-way or higher splits.
Univariate trees: split according to one feature.
Multivariate trees: split according to a combination of features.
To avoid overfitting:
- limit the size of the tree
- prune the tree (as Quinlan does in C4.5)
Naive pruning
- run the decision tree algorithm until all of the features are used
- produce smaller trees by running over the tree, picking each node in turn
- replace the subtree beneath that node with a leaf labelled with the most common classification of the subtree
- evaluate the error of the pruned tree on the validation set
- keep the pruned tree if its error is the same as or less than that of the original tree, and reject it otherwise
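A hedged Python sketch of this pruning loop, assuming the nested-dict trees used above and validation/training data as (feature dict, label) pairs; this is a simplified reduced-error pruning, not the exact course code:

from collections import Counter

def predict(tree, example):
    while isinstance(tree, dict):
        feature = next(iter(tree))
        branches = tree[feature]
        value = example.get(feature)
        if value not in branches:          # unseen value: give up on this path
            return None
        tree = branches[value]
    return tree                            # a leaf, i.e. a class label

def errors(tree, data):
    return sum(predict(tree, x) != y for x, y in data)

def prune(tree, train, val):
    """train/val: lists of (example, label) pairs that reach this node."""
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    branches = tree[feature]
    for value in branches:                 # prune the subtrees bottom-up first
        branch_train = [(x, y) for x, y in train if x.get(feature) == value]
        branch_val = [(x, y) for x, y in val if x.get(feature) == value]
        branches[value] = prune(branches[value], branch_train, branch_val)
    if train:                              # candidate leaf: most common training label here
        leaf = Counter(y for _, y in train).most_common(1)[0][0]
        if errors(leaf, val) <= errors(tree, val):
            return leaf
    return tree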
Post-pruning
Implemented in C4.5.
- take the tree generated by ID3
- convert it to a set of if-then rules
- prune each rule by removing preconditions if the accuracy of the rule increases without them
- sort the rules according to their accuracy on the training set and apply them in order
The advantages of dealing with rules are:
- they are easier to read
- their order in the tree does not matter, only their accuracy in the classification
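As a small illustration of the rule-conversion step (a sketch, not C4.5 itself), a nested-dict tree can be flattened into (preconditions, label) rules; the tree used here is the one the worked example below produces:

def tree_to_rules(tree, conditions=()):
    """Turn a nested-dict tree into a list of (preconditions, label) rules."""
    if not isinstance(tree, dict):                 # a leaf: one finished rule
        return [(list(conditions), tree)]
    feature = next(iter(tree))
    rules = []
    for value, subtree in tree[feature].items():
        rules += tree_to_rules(subtree, conditions + ((feature, value),))
    return rules

tree = {'Party': {'Yes': 'Party',
                  'No': {'Deadline': {'Urgent': 'Study', 'None': 'Pub',
                                      'Near': {'Lazy': {'Yes': 'TV', 'No': 'Study'}}}}}}
for conds, label in tree_to_rules(tree):
    print('if ' + ' and '.join(f'{f} = {v}' for f, v in conds) + ' then ' + label)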
Classification and Regression Trees (CART)
CART can be used for both classification and regression.
CART is constrained to construct binary trees, but a question that has three answers can be split into two questions.
Another information measure is the Gini impurity.
A leaf is pure if all of the training data within it have just one class.
N(i) is the fraction of the datapoints that belong to class i.
For any particular feature k we can compute

G_k = \sum_{i=1}^{c} \sum_{j \ne i} N(i) N(j),

where c is the number of classes.
Gini impurity
There has to be some output class, so \sum_{i=1}^{c} N(i) = 1, and therefore \sum_{j \ne i} N(j) = 1 - N(i).

Another presentation:

G_k = 1 - \sum_{i=1}^{c} N(i)^2
The Gini impurity is equivalent to computing the expected error rate if the classification was picked according to the class distribution.
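A small Python sketch of the Gini impurity in its 1 - \sum_i N(i)^2 form, where the class fractions N(i) are computed from the labels at a node:

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(['a', 'a', 'a']))        # a pure leaf: 0.0
print(gini(['a', 'a', 'b', 'b']))   # an even split: 0.5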
Gini impurity
We can change the information measure by adding weights to the misclassifications.
Risk_{ij} represents the cost of misclassifying i as j.
Modified Gini impurity:

G_k = \sum_{j \ne i} Risk_{ij} N(i) N(j)
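A sketch of this weighted version; the class names and risk values here are purely illustrative assumptions:

# N: class fractions at the node; risk[i][j]: cost of misclassifying i as j.
def weighted_gini(N, risk):
    return sum(risk[i][j] * N[i] * N[j]
               for i in N for j in N if j != i)

N = {'sick': 0.3, 'healthy': 0.7}
risk = {'sick': {'healthy': 10.0},    # missing a sick case is costly
        'healthy': {'sick': 1.0}}
print(weighted_gini(N, risk))         # 10*0.3*0.7 + 1*0.7*0.3 = 2.31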
Regression Trees (CART)
The splitting criterion for regression is

SS_{Total} - (SS_{Left} + SS_{Right}),

where SS_{Total} = \sum_i (y_i - \bar{y})^2 is the sum of squares for the node, and SS_{Left}, SS_{Right} are the sums of squares for the left and right branches, respectively.
This is equivalent to choosing the split that maximises the between-groups sum of squares in a simple analysis of variance (ANOVA).
We then pick the feature whose split point provides the best sum-of-squares error.
The error of a node is the variance of y, as in ANOVA.
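A minimal sketch of this criterion for a single numeric feature, scanning candidate thresholds and keeping the one with the largest reduction in the sum of squares (names here are illustrative, not from the course code):

def sum_of_squares(values):
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, SS reduction) for splitting feature x against targets y."""
    ss_total = sum_of_squares(y)
    best_threshold, best_score = None, 0.0
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        if not left or not right:
            continue
        score = ss_total - (sum_of_squares(left) + sum_of_squares(right))
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score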
Example using ID3
Options: go to the pub, watch TV, go to a party, or study. The choice depends on the assignment deadline, the availability of a party, and how lazy we feel.
Deadline  Party  Lazy  Activity
Urgent    Yes    Yes   Party
Urgent    No     Yes   Study
Near      Yes    Yes   Party
None      Yes    No    Party
None      No     Yes   Pub
None      Yes    No    Party
Near      No     No    Study
Near      No     Yes   TV
Near      Yes    Yes   Party
Urgent    No     No    Study
Example using ID3
To produce a decision tree for this problem:
first: which feature to use as the root node?
Compute the entropy of S:

P(Party) = 0.5, P(Study) = 0.3, P(Pub) = 0.1, P(TV) = 0.1

Entropy(S) = -0.5 \log_2 0.5 - 0.3 \log_2 0.3 - 2 \times 0.1 \log_2 0.1 = 1.685

Find which feature has the maximal information gain:

Gain(S, Deadline) = 1.685 - \frac{3}{10}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)
                          - \frac{4}{10}\left(-2 \times \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{2}\log_2\frac{1}{2}\right)
                          - \frac{3}{10}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)
                  = 1.685 - 1.151 = 0.534
Example using ID3
Gain(S, Party) = 1.685 - \frac{5}{10} \times 0
                       - \frac{5}{10}\left(-2 \times \frac{1}{5}\log_2\frac{1}{5} - \frac{3}{5}\log_2\frac{3}{5}\right)
               = 1.685 - 0.685 = 1.0

Gain(S, Lazy) = 1.685 - \frac{6}{10}\left(-\frac{3}{6}\log_2\frac{3}{6} - 3 \times \frac{1}{6}\log_2\frac{1}{6}\right)
                      - \frac{4}{10}\left(-2 \times \frac{1}{2}\log_2\frac{1}{2}\right)
              = 1.685 - 1.475 = 0.21
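These three gains can be checked with the information_gain sketch from earlier, feeding it each column of the table paired with the Activity outcomes:

deadline = ['Urgent', 'Urgent', 'Near', 'None', 'None', 'None', 'Near', 'Near', 'Near', 'Urgent']
party    = ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No']
lazy     = ['Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No']
activity = ['Party', 'Study', 'Party', 'Party', 'Pub', 'Party', 'Study', 'TV', 'Party', 'Study']

for name, column in [('Deadline', deadline), ('Party', party), ('Lazy', lazy)]:
    print(name, round(information_gain(list(zip(column, activity))), 3))
# prints roughly 0.534, 1.0 and 0.21, so Party is chosen for the root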
Example using ID3
The root node will be the Party feature, and it will have two branches coming out of it (Yes and No).
All five cases on the Yes branch have the outcome Party, so we put a leaf node there: Go to party.
For the No branch there are three different outcomes, so we need to choose another feature.
The same procedure as before!
Example - Solution in Python
Party
  Yes -> Party
  No  -> Deadline
    Urgent -> Study
    None   -> Pub
    Near   -> Lazy
      Yes -> TV
      No  -> Study
Example - Regression Trees (CART)

covariation <- matrix(c(5, 0.6, 0.6, 5), nrow = 2, ncol = 2)
means <- c(2, 7)
a3 <- (a1 - means[1]) * (a2 - means[2])   # a3 combines the two features a1 and a2
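A hedged Python analogue of this data-generation snippet, assuming (this is not stated on the slide) that a1 and a2 are drawn from a bivariate normal with the given means and covariance, and that a3 is the regression target:

import numpy as np

rng = np.random.default_rng(0)
covariation = np.array([[5.0, 0.6], [0.6, 5.0]])
means = np.array([2.0, 7.0])
a1, a2 = rng.multivariate_normal(means, covariation, size=200).T
a3 = (a1 - means[0]) * (a2 - means[1])    # target for the regression tree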
Result - Regression Trees (CART)
Iris - Regression Trees (CART)
Iris - Cross-Validation
Iris - Results
Iris - Partition