Decision Trees Slides

Decision Trees
SDSC Summer Institute 2012
Natasha Balac, Ph.D.
© Copyright 2012, Natasha Balac
DECISION TREE INDUCTION

• Method for approximating discrete-valued functions
• Robust to noisy/missing data
• Can learn non-linear relationships
• Inductive bias towards shorter trees
© Copyright 2012, Natasha Balac
Decision trees
• “Divide-and-conquer” approach
• Nodes involve testing a particular attribute
• The attribute value is compared to:
  - a constant
  - the value of another attribute
  - a function of one or more attributes
• Leaves assign a classification, a set of classifications, or a probability distribution to instances
• An unknown instance is routed down the tree
© Copyright 2012, Natasha Balac
Decision Tree Learning
• Applications:
  - medical diagnosis – e.g., heart disease
  - analysis of complex chemical compounds
  - classifying equipment malfunctions
  - assessing the risk of loan applicants
  - Boston housing data – price prediction
© Copyright 2012, Natasha Balac
DECISION TREE FOR THE CONCEPT “Sunburn”

Name    Hair    Height   Weight   Lotion   Result
Sarah   blonde  average  light    no       sunburned (positive)
Dana    blonde  tall     average  yes      none (negative)
Alex    brown   short    average  yes      none
Annie   blonde  short    average  no       sunburned
Emily   red     average  heavy    no       sunburned
Pete    brown   tall     heavy    no       none
John    brown   average  heavy    no       none
Katie   blonde  short    light    yes      none
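The course builds trees in WEKA; purely as an illustrative aside, here is a minimal sketch of fitting a decision tree to the sunburn table with scikit-learn and pandas (both are assumptions, not tools from the slides).

# Minimal sketch (not from the slides): fit a decision tree to the sunburn data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "hair":   ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"],
    "height": ["average", "tall", "short", "short", "average", "tall", "average", "short"],
    "weight": ["light", "average", "average", "average", "heavy", "heavy", "heavy", "light"],
    "lotion": ["no", "yes", "yes", "no", "no", "no", "no", "yes"],
    "result": ["sunburned", "none", "none", "sunburned", "sunburned", "none", "none", "none"],
})

X = pd.get_dummies(data.drop(columns="result"))   # one-hot encode the nominal attributes
y = data["result"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text rendering of the learned tree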
© Copyright 2012, Natasha Balac
DECISION TREE FOR THE CONCEPT “Sunburn”

Hair colour?
  blonde → Lotion used?
             no  → sunburned
             yes → none
  red    → sunburned
  brown  → none
© Copyright 2011, Natasha Balac
Medical Diagnosis and Prognosis

Minimum systolic blood pressure over the 24-hour period following admission to the hospital?
  <= 91 → Class 2: Early death
  > 91  → Age of patient?
            <= 62.5 → Class 1: Survivors
            > 62.5  → Was there sinus tachycardia?
                        NO  → Class 1: Survivors
                        YES → Class 2: Early death

[Breiman et al., 1984]
© Copyright 2012, Natasha Balac
Occam’s Razor
• “The world is inherently simple. Therefore the smallest decision tree that is consistent with the samples is the one that is most likely to identify unknown objects correctly”
© Copyright 2012, Natasha Balac
Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
© Copyright 2012, Natasha Balac
When to Consider Decision Trees
• Instances describable by attribute-value pairs
  - each attribute takes a small number of disjoint possible values
• Target function has discrete output value
• Possibly noisy training data
  - may contain errors
  - may contain missing attribute values
© Copyright 2012, Natasha Balac
Weather Data Set – Make the Tree!

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
© Copyright 2012, Natasha Balac
DECISION TREE FOR THE CONCEPT “Play Tennis”

Outlook?
  Sunny    → Humidity?
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind?
               Weak   → Yes
               Strong → No

[Mitchell, 1997]
© Copyright 2011, Natasha Balac
Constructing decision trees
• Normal procedure: top down, in a recursive divide-and-conquer fashion
• First: an attribute is selected for the root node and a branch is created for each possible attribute value
• Then: the instances are split into subsets (one for each branch extending from the node)
• Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch
• Process stops if all instances have the same class (see the sketch below)
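A minimal sketch of this divide-and-conquer procedure, assuming instances are dictionaries with a "class" key and that an attribute-selection function (e.g. information gain, covered on later slides) is supplied; all names are illustrative, not from the slides.

# Minimal sketch of top-down, divide-and-conquer tree construction.
from collections import Counter

def build_tree(instances, attributes, choose_attribute):
    classes = [inst["class"] for inst in instances]
    if len(set(classes)) == 1:                 # all instances share one class: leaf
        return classes[0]
    if not attributes:                         # no attributes left: majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    best = choose_attribute(instances, attributes)    # pick attribute for this node
    tree = {best: {}}
    for value in {inst[best] for inst in instances}:  # one branch per attribute value
        subset = [inst for inst in instances if inst[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_attribute)
    return tree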
© Copyright 2012, Natasha Balac
Induction of Decision Trees
• Top-down method
• Main loop:
  - A ← the “best” decision attribute for the next node
  - Assign A as the decision (split) attribute for the node
  - For each value of A, create a new descendant of the node
  - Sort the training examples to the leaf nodes
  - If the training examples are perfectly classified, then STOP
  - Else iterate over the new leaf nodes
© Copyright 2012, Natasha Balac
Which is the best attribute?
© Copyright 2011, Natasha Balac
Attribute selection
• How to choose the best attribute?
  - Smallest tree
  - Heuristic: the attribute that produces the “purest” nodes
• Impurity criterion: information gain
  - Increases with the average purity of the subsets produced by the attribute split
  - Choose the attribute that results in the greatest information gain
© Copyright 2012, Natasha Balac
Computing information
• Information is measured in bits
• Given a probability distribution, the information required to predict an event is the distribution’s entropy
• Entropy gives the required information in bits (this can involve fractions of bits!)
• Formula for computing the entropy:

  entropy(p1, p2, …, pn) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn
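A minimal sketch of this entropy formula in Python, taking class counts rather than probabilities (illustrative, not from the slides):

from math import log2

def entropy(*counts):
    """Entropy in bits of a class distribution given as counts, e.g. entropy(9, 5)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy(9, 5))   # info([9,5]) ≈ 0.940 bits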
© Copyright 2012, Natasha Balac
Expected information for attribute “Outlook”
• “Outlook” = “Sunny”:    info([2,3]) = 0.971 bits
• “Outlook” = “Overcast”: info([4,0]) = 0 bits
• “Outlook” = “Rainy”:    info([3,2]) = 0.971 bits
• Total expected information:
  info([2,3],[4,0],[3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits
© Copyright 2012, Natasha Balac
Computing the information gain
• Information gain = information before splitting − information after splitting
• Gain(“Outlook”) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits
• Information gain for the attributes of the weather data (reproduced by the sketch below):
  - Gain(“Outlook”)  = 0.247 bits
  - Gain(“Temp”)     = 0.029 bits
  - Gain(“Humidity”) = 0.152 bits
  - Gain(“Windy”)    = 0.048 bits
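A sketch that reproduces these gain figures from the weather data table above (illustrative Python, not from the slides):

# Sketch: reproduce the information-gain figures for the weather data.
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

# (outlook, temp, humidity, wind, play) for days D1-D14
data = [
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

def gain(col):
    labels = [row[-1] for row in data]
    before = entropy(labels)                         # info([9,5]) = 0.940 bits
    after = 0.0
    for value in set(row[col] for row in data):      # expected info after the split
        subset = [row[-1] for row in data if row[col] == value]
        after += len(subset) / len(data) * entropy(subset)
    return before - after

for name, col in [("Outlook", 0), ("Temp", 1), ("Humidity", 2), ("Windy", 3)]:
    print(name, round(gain(col), 3))   # 0.247, 0.029, 0.152, 0.048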
© Copyright 2012, Natasha Balac
Which is the best attribute?
© Copyright 2011, Natasha Balac
Further splits
• Within the “Outlook” = “Sunny” branch: Gain(“Temp”) = 0.571 bits, Gain(“Humidity”) = 0.971 bits, Gain(“Windy”) = 0.020 bits
© Copyright 2011, Natasha Balac
Final product
© Copyright 2012, Natasha Balac
Purity Measure
• Desirable properties:
  - Pure node → measure = zero
  - Impurity maximal → measure = maximal
  - Multistage property: decisions can be made in several stages
      measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])
• Entropy is the only function that satisfies all the properties
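A quick numeric check of the multistage property using the entropy-based measure (illustrative, not from the slides):

# Quick check of the multistage property with the entropy-based measure.
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

lhs = info([2, 3, 4])                          # ≈ 1.53 bits
rhs = info([2, 7]) + (7 / 9) * info([3, 4])    # ≈ 0.764 + 0.766
print(round(lhs, 3), round(rhs, 3))            # both ≈ 1.53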
© Copyright 2012, Natasha Balac
Highly-branching attributes
• Attributes with a large number of values
  - example: an ID code
• Subsets are more likely to be pure if there is a large number of values
  - Information gain is biased towards attributes with a large number of values
  - This can lead to overfitting
© Copyright 2012, Natasha Balac
New version of Weather Data
© Copyright 2012, Natasha Balac
ID code attribute split
Splitting on the ID code produces one pure, single-instance leaf per day, so the information at the leaves is 0 and the gain is the maximum possible: info([9,5]) = 0.940 bits
© Copyright 2012, Natasha Balac
Gain ratio
• A modification of information gain that reduces its bias
• Takes the number and size of branches into account when choosing an attribute
  - by taking the intrinsic information of a split into account
• Intrinsic information:
  - the entropy of the distribution of instances into branches
  - i.e., how much information we need to tell which branch an instance belongs to
© Copyright 2012, Natasha Balac
Computing the gain ratio
• Example: intrinsic information for the ID code
  info([1,1,…,1]) = 14 × (−1/14 × log2(1/14)) = 3.807 bits
• The value of an attribute decreases as its intrinsic information gets larger
• Definition of gain ratio:
  gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
• Example:
  gain_ratio(“ID code”) = 0.940 / 3.807 ≈ 0.247
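The same computation as a short sketch (illustrative, not from the slides):

# Sketch: gain ratio for the "ID code" attribute of the 14-instance weather data.
from math import log2

gain = 0.940                                     # info([9,5]); each ID-code leaf is pure
intrinsic_info = 14 * (-(1 / 14) * log2(1 / 14)) # info([1,1,...,1]) = 3.807 bits
gain_ratio = gain / intrinsic_info
print(round(intrinsic_info, 3), round(gain_ratio, 3))   # 3.807, 0.247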
© Copyright 2012, Natasha Balac
Gain ratio example
© Copyright 2012, Natasha Balac
Avoid Overfitting
• How can we avoid overfitting?
  - Stop growing when a data split is not statistically significant
  - Grow the full tree, then post-prune
• How to select the best tree?
  - Measure performance over the training data
  - Measure performance over a separate validation data set
© Copyright 2012, Natasha Balac
Pruning
• Pruning simplifies a decision tree to prevent overfitting to noise in the data
• Post-pruning:
  - takes a fully-grown decision tree and discards unreliable parts
• Pre-pruning:
  - stops growing a branch when the information becomes unreliable
• Post-pruning is preferred in practice because pre-pruning can stop growing too early
© Copyright 2012, Natasha Balac
Summary
• Algorithm for top-down induction of decision trees
• Probably the most extensively studied method of machine learning used in data mining
• Different criteria for attribute/test selection rarely make a large difference
• Different pruning methods mainly change the size of the resulting pruned tree
© Copyright 2012, Natasha Balac
WEKA Tutorial
© Copyright 2012 Natasha Balac
REGRESSION TREE INDUCTION
• Why regression trees?
• Ability to:
  - predict a continuous variable
  - model conditional effects
  - model uncertainty
© Copyright 2012, Natasha Balac
Regression Trees
• Quinlan, 1992
• Continuous goal variables
• Induction by means of an efficient recursive partitioning algorithm
• Uses linear regression to select internal nodes
© Copyright 2012, Natasha Balac
Regression trees
• Differences from decision trees:
  - Splitting: minimizing intra-subset variation
  - Pruning: numeric error measure
  - A leaf node predicts the average class value of the training instances reaching that node
  - Can approximate piecewise constant functions
  - Easy to interpret and understand the structure
  - Special kind: model trees
© Copyright 2012, Natasha Balac
Model trees
• A regression tree with a linear regression function at each leaf
• Linear regression (LR) is applied to the instances that reach a node after the full regression tree has been built
• Only a subset of the attributes is used for LR
• Fast
  - the overhead for LR is minimal, as only a small subset of attributes is used in the tree
© Copyright 2012, Natasha Balac
Building the tree
• Splitting criterion: standard deviation reduction (T is the portion of the data reaching the node; see the sketch below)

  SDR = sd(T) − Σ (|Ti| / |T|) × sd(Ti)

  - where T1, T2, … are the sets that result from splitting the node according to the chosen attribute
• Termination criteria:
  - the standard deviation becomes smaller than a certain fraction of the SD for the full training set (5%)
  - too few instances remain (< 4)
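A minimal sketch of the SDR calculation for one candidate split, assuming numeric class values (illustrative, not from the slides):

# Sketch: standard deviation reduction (SDR) for one candidate split.
from statistics import pstdev

def sdr(values, subsets):
    """SDR = sd(T) - sum(|Ti|/|T| * sd(Ti)) over the subsets produced by a split."""
    return pstdev(values) - sum(len(s) / len(values) * pstdev(s) for s in subsets)

values = [7.0, 6.5, 8.2, 3.1, 2.9, 3.4]       # class values of instances at the node
left, right = values[:3], values[3:]          # e.g. a binary split on some attribute
print(round(sdr(values, [left, right]), 3))   # larger SDR = better split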
© Copyright 2012, Natasha Balac
Nominal attributes
• Nominal attributes are converted into binary attributes (that can be treated as numeric ones)
• Nominal values are sorted using the average class value (see the sketch below)
  - If there are k values, k−1 binary attributes are generated
  - It can be proven that the best split on one of the new attributes is the best binary split on the original
  - M5’ only does the conversion once
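A rough sketch of this conversion, assuming instances stored as dictionaries with a numeric target; the helper name and layout are illustrative, not M5’ code:

# Sketch: convert a nominal attribute into k-1 ordered binary attributes
# by sorting its values on the average class (target) value.
from collections import defaultdict

def nominal_to_binary(rows, attr, target):
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    # order the k nominal values by their average target value
    ordered = sorted(groups, key=lambda v: sum(groups[v]) / len(groups[v]))
    # k-1 binary attributes: "is this value in the upper part of the ordering?"
    for row in rows:
        pos = ordered.index(row[attr])
        for i in range(1, len(ordered)):
            row[f"{attr}>={ordered[i]}"] = int(pos >= i)
    return ordered

rows = [{"color": "red", "y": 2.0}, {"color": "blue", "y": 5.0},
        {"color": "green", "y": 3.0}, {"color": "red", "y": 2.4}]
print(nominal_to_binary(rows, "color", "y"))   # ordering of the nominal values
print(rows[0])                                 # row with the new binary attributes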
© Copyright 2012, Natasha Balac
Pruning Model Trees
• Based on the estimated absolute error of the LR models
• Heuristic estimate used for the smoothing calculation:

  p’ = (n×p + k×q) / (n + k)

  - p’ is the prediction passed up to the next node; p is the prediction passed up from the node below; q is the value predicted by the model at this node; n is the number of training instances in the node; k is a smoothing constant
• Pruned by greedily removing terms to minimize the estimated error
• Heavy pruning is allowed – a single LR model can replace a whole subtree
• Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree
© Copyright 2012, Natasha Balac
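A one-line sketch of the smoothing calculation, with illustrative numbers (the smoothing constant shown is an assumption, not a value from the slides):

# Sketch of the smoothing calculation: p' = (n*p + k*q) / (n + k),
# combining the prediction p passed up from below with this node's model
# prediction q (n = training instances below, k = smoothing constant).
def smooth(p, q, n, k=15.0):   # k value chosen for illustration only
    return (n * p + k * q) / (n + k)

print(smooth(p=23.5, q=25.0, n=48))   # prediction passed up to the next node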
Building the tree
• Splitting criterion: standard deviation reduction
  - T – the portion of the training data reaching the node
  - T1, T2, … – the sets that result from splitting the node on the chosen attribute
  - Treating the SD of the class values in T as a measure of the error at the node, calculate the expected reduction in error from testing each attribute at the node
• Termination criteria:
  - the standard deviation becomes smaller than a certain fraction of the SD for the full training set (5%)
  - too few instances remain (< 4)
© Copyright 2012, Natasha Balac
Pseudo-code for M5’
• Four methods:
  - Main method: MakeModelTree()
  - Method for splitting: split()
  - Method for pruning: prune()
  - Method that computes error: subtreeError()
• We’ll briefly look at each method in turn
• The linear regression method is assumed to perform attribute subset selection based on error
© Copyright 2012, Natasha Balac
MakeModelTree(instances)
  SD = sd(instances)                 // SD of class values over the full training set
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = new Node
  root.instances = instances
  split(root)                        // grow the tree
  prune(root)                        // then prune it bottom-up
  printTree(root)
© Copyright 2012 Natasha Balac
split(node)
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05 * SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute’s SDR
    node.attribute = attribute with maximum SDR
    split node.instances into node.left.instances and node.right.instances
    split(node.left)
    split(node.right)
© Copyright 2012, Natasha Balac
prune(node)
  if node.type = INTERIOR then
    prune(node.left)
    prune(node.right)
    node.model = linearRegression(node)    // LR model over this node’s instances
    if subtreeError(node) > error(node) then
      node.type = LEAF                     // replace the subtree with the LR model
© Copyright 2012, Natasha Balac
subtreeError(node)
  l = node.left; r = node.right
  if node.type = INTERIOR then
    return (sizeof(l.instances) * subtreeError(l)
            + sizeof(r.instances) * subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)
© Copyright 2012, Natasha Balac
MULTI-VARIATE REGRESSION TREES*
• All the characteristics of a regression tree
• Capable of predicting two or more outcomes
• Example:
  - activity and toxicity; monetary gain and time
• Balac, Gaines ICML 2001
© Copyright 2012, Natasha Balac
MULTI-VARIATE REGRESSION TREE INDUCTION

(Tree diagram: internal nodes test combinations of variables Var 1–Var 4 against thresholds; each leaf predicts both outcomes, e.g. an Activity value and a Toxicity value.)
© Copyright 2012, Natasha Balac
CLUSTERING
• Basic idea: group similar things together
• Unsupervised learning – useful when no other info is available
• K-means (see the sketch below):
  - partitions instances into k disjoint clusters
  - based on a measure of similarity
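A minimal k-means sketch over numeric instances (pure Python, illustrative; not from the slides):

# Minimal k-means sketch: alternately assign points to the nearest centroid
# and recompute centroids, until the centroids stop changing.
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # assignment step (squared Euclidean distance)
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]       # update step: mean of each cluster
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3)]
print(kmeans(points, k=2))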
© Copyright 2012 Natasha Balac