IAML: Decision Trees

Victor Lavrenko and Charles Sutton
School of Informatics, Semester 1
Copyright © 2011 Victor Lavrenko

Predict if John will play tennis
•  Training examples: 9 yes / 5 no
•  Hard to guess
•  Divide & conquer:
   –  split into subsets
   –  are they pure? (all yes or all no)
   –  if yes: stop
   –  if not: repeat
•  See which subset new data falls into

Day  Outlook   Humidity  Wind    Play
D1   Sunny     High      Weak    No
D2   Sunny     High      Strong  No
D3   Overcast  High      Weak    Yes
D4   Rain      High      Weak    Yes
D5   Rain      Normal    Weak    Yes
D6   Rain      Normal    Strong  No
D7   Overcast  Normal    Strong  Yes
D8   Sunny     High      Weak    No
D9   Sunny     Normal    Weak    Yes
D10  Rain      Normal    Weak    Yes
D11  Sunny     Normal    Strong  Yes
D12  Overcast  High      Strong  Yes
D13  Overcast  Normal    Weak    Yes
D14  Rain      High      Strong  No

New data:  D15  Rain  High  Weak  ?
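The short code sketches added throughout these notes use Python. As a starting point, the table above can be written as a small list of records (the dictionary field names are chosen here for convenience; they are not part of the original slides):

```python
# The 14 "play tennis" training examples from the table above.
examples = [
    {"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D1
    {"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Strong", "Play": "No"},   # D2
    {"Outlook": "Overcast", "Humidity": "High",   "Wind": "Weak",   "Play": "Yes"},  # D3
    {"Outlook": "Rain",     "Humidity": "High",   "Wind": "Weak",   "Play": "Yes"},  # D4
    {"Outlook": "Rain",     "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D5
    {"Outlook": "Rain",     "Humidity": "Normal", "Wind": "Strong", "Play": "No"},   # D6
    {"Outlook": "Overcast", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},  # D7
    {"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D8
    {"Outlook": "Sunny",    "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D9
    {"Outlook": "Rain",     "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D10
    {"Outlook": "Sunny",    "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},  # D11
    {"Outlook": "Overcast", "Humidity": "High",   "Wind": "Strong", "Play": "Yes"},  # D12
    {"Outlook": "Overcast", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D13
    {"Outlook": "Rain",     "Humidity": "High",   "Wind": "Strong", "Play": "No"},   # D14
]
# The new data point to classify.
new_point = {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}                  # D15, Play = ?
```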
Growing the tree (divide & conquer)
•  Split on Outlook (root: 9 yes / 5 no):
   –  Overcast: D3, D7, D12, D13 (4 yes / 0 no) => pure subset, stop (leaf: Yes)
   –  Sunny: D1, D2, D8, D9, D11 (2 yes / 3 no) => split further
   –  Rain: D4, D5, D6, D10, D14 (3 yes / 2 no) => split further
•  Split the Sunny subset on Humidity:
   –  High: D1, D2, D8 (0 yes / 3 no) => leaf: No
   –  Normal: D9, D11 (2 yes / 0 no) => leaf: Yes
•  Split the Rain subset on Wind:
   –  Weak: D4, D5, D10 (3 yes / 0 no) => leaf: Yes
   –  Strong: D6, D14 (0 yes / 2 no) => leaf: No

Resulting tree:
   Outlook = Overcast => Yes
   Outlook = Sunny:   Humidity = High => No,   Humidity = Normal => Yes
   Outlook = Rain:    Wind = Weak => Yes,      Wind = Strong => No

New data:  D15  Rain  High  Weak:  Outlook = Rain and Wind = Weak  =>  Yes
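Read as code, the finished tree is just nested conditionals. A minimal sketch (the function name is illustrative) that routes the new example D15 through the tree grown above:

```python
def classify(x):
    """Route an example through the tree above: Outlook first, then Humidity or Wind."""
    if x["Outlook"] == "Overcast":
        return "Yes"                                        # pure subset: 4 yes / 0 no
    if x["Outlook"] == "Sunny":
        return "Yes" if x["Humidity"] == "Normal" else "No"
    if x["Outlook"] == "Rain":
        return "Yes" if x["Wind"] == "Weak" else "No"

print(classify({"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))  # D15 -> Yes
```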
ID3 algorithm
•  Ross Quinlan (ID3: 1986), (C4.5: 1993)
•  Breiman et al. (CART: 1984), from statistics

Split(node, {examples}):
1.  A ← the best attribute for splitting the {examples}
2.  Decision attribute for this node ← A
3.  For each value of A, create a new child node
4.  Split the training {examples} to the child nodes
5.  If examples perfectly classified: STOP
    else: iterate over the new child nodes: Split(child_node, {subset of examples})
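A compact Python sketch of the Split procedure above (illustrative, not the lecture's reference code). It picks the best attribute by information gain, which is defined on the next slides, and reuses the `examples` list from the earlier sketch:

```python
import math
from collections import Counter

def entropy(examples):
    """H(S) over the Play labels (the splitting criterion is defined on the next slides)."""
    counts = Counter(x["Play"] for x in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attr):
    """Expected drop in entropy after splitting on attr (information gain)."""
    total = len(examples)
    remainder = 0.0
    for value in set(x[attr] for x in examples):
        subset = [x for x in examples if x[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    """The Split procedure: grow a tree of nested dicts; leaves are class labels."""
    labels = [x["Play"] for x in examples]
    if len(set(labels)) == 1 or not attributes:          # perfectly classified (or nothing left): STOP
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))   # 1. best attribute for splitting
    node = {"attribute": best, "children": {}}                # 2. decision attribute for this node
    for value in set(x[best] for x in examples):              # 3. one child node per value of A
        subset = [x for x in examples if x[best] == value]    # 4. split {examples} to child nodes
        rest = [a for a in attributes if a != best]
        node["children"][value] = id3(subset, rest)           # 5. recurse into the child nodes
    return node

tree = id3(examples, ["Outlook", "Humidity", "Wind"])  # `examples` from the earlier sketch
```

On this dataset the sketch reproduces the Outlook / Humidity / Wind tree grown step by step above.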
Which attribute to split on?
•  At the root: 9 yes / 5 no
   –  Outlook:  Sunny => 2 yes / 3 no,  Overcast => 4 yes / 0 no,  Rain => 3 yes / 2 no
   –  Wind:     Weak => 6 yes / 2 no,   Strong => 3 yes / 3 no
•  Want to measure "purity" of the split
   –  more certain about Yes/No after the split
      •  pure set (4 yes / 0 no) => completely certain (100%)
      •  impure set (3 yes / 3 no) => completely uncertain (50%)
   –  can't just use P("yes" | set): the measure must be symmetric,
      so that 4 yes / 0 no is as pure as 0 yes / 4 no
Entropy
•  Entropy: H(S) = - p(+) log2 p(+) - p(-) log2 p(-)   (bits)
   –  S … subset of training examples
   –  p(+) / p(-) … % of positive / negative examples in S
•  Interpretation: assume item X belongs to S
   –  how many bits are needed to tell whether X is positive or negative
•  impure set (3 yes / 3 no): p(+) = 1/2
   H(S) = - (3/6) log2 (3/6) - (3/6) log2 (3/6) = 1 bit
•  pure set (4 yes / 0 no): p(+) = 1, p(-) = 0
   H(S) = - (4/4) log2 (4/4) - (0/4) log2 (0/4) = 0 bits

Information Gain
•  Want many items in pure sets
•  Expected drop in entropy after the split:
   Gain(S, A) = H(S) - Σ_{V ∈ Values(A)} (|S_V| / |S|) H(S_V)
   –  this is the Mutual Information between attribute A and the class labels of S
   –  V … possible values of A
   –  S … set of examples {X}
   –  S_V … subset of S where X_A = V
•  Example: splitting on Wind
   –  whole set (9 yes / 5 no):   H(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.94
   –  Weak (6 yes / 2 no):        H(S_Weak) = - (6/8) log2 (6/8) - (2/8) log2 (2/8) = 0.81
   –  Strong (3 yes / 3 no):      H(S_Strong) = - (3/6) log2 (3/6) - (3/6) log2 (3/6) = 1.0
   –  Gain(S, Wind) = H(S) - (8/14) H(S_Weak) - (6/14) H(S_Strong)
                    = 0.94 - (8/14) * 0.81 - (6/14) * 1.0 = 0.049
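The worked numbers can be checked directly with a short self-contained snippet (0 * log2 0 is treated as 0):

```python
import math

def H(p_pos, p_neg):
    """Binary entropy in bits; terms with probability 0 contribute 0."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

H_S      = H(9/14, 5/14)   # whole set: 9 yes / 5 no   -> ~0.940
H_weak   = H(6/8, 2/8)     # Wind = Weak: 6 yes / 2 no -> ~0.811
H_strong = H(3/6, 3/6)     # Wind = Strong: 3 yes / 3 no -> 1.0

gain_wind = H_S - (8/14) * H_weak - (6/14) * H_strong
print(round(gain_wind, 3))  # ~0.048 (the slide's 0.049 uses the rounded entropies 0.94 and 0.81)
```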
Overfitting in Decision Trees
•  Can always classify the training examples perfectly
   –  keep splitting until each node contains 1 example
   –  singleton = pure
•  Doesn't work on new data

Avoid overfitting
•  Stop splitting when the split is not statistically significant
•  Grow the full tree, then post-prune
   –  based on a validation set
•  Sub-tree replacement pruning (WF 6.1)
   –  for each node:
      •  pretend to remove the node + all its children from the tree
      •  measure performance on the validation set
   –  remove the node whose removal gives the greatest improvement
   –  repeat until further pruning is harmful
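Below is a simplified Python sketch of sub-tree replacement pruning over the nested-dict trees built by the `id3` sketch earlier. Replaced sub-trees are collapsed to the majority class of the training examples that reach them (one common choice), and a replacement is kept only if it improves accuracy on a held-out validation set, which is not part of this deck's 14 examples:

```python
from collections import Counter

def predict(tree, x, default="Yes"):
    """Follow a nested-dict tree from the id3 sketch down to a leaf label."""
    while isinstance(tree, dict):
        child = tree["children"].get(x[tree["attribute"]])
        if child is None:
            return default                 # attribute value unseen at this node during training
        tree = child
    return tree

def accuracy(tree, dataset):
    return sum(predict(tree, x) == x["Play"] for x in dataset) / len(dataset)

def prune(tree, train, validation):
    """Sub-tree replacement pruning: repeatedly collapse the sub-tree whose replacement by a
    majority-class leaf most improves validation accuracy; stop when no replacement helps."""
    if not isinstance(tree, dict):
        return tree                                       # a single leaf cannot be pruned
    def candidates(node, reached):
        # yield (parent, value, majority label of the training examples reaching that child)
        for value, child in node["children"].items():
            if isinstance(child, dict):
                sub = [x for x in reached if x[node["attribute"]] == value]
                if sub:
                    yield node, value, Counter(x["Play"] for x in sub).most_common(1)[0][0]
                yield from candidates(child, sub)
    while True:
        best_acc, best = accuracy(tree, validation), None
        for parent, value, leaf in list(candidates(tree, train)):
            saved = parent["children"][value]
            parent["children"][value] = leaf              # pretend to remove the node + all children
            acc = accuracy(tree, validation)
            parent["children"][value] = saved             # undo the pretend removal
            if acc > best_acc:
                best_acc, best = acc, (parent, value, leaf)
        if best is None:                                  # further pruning would be harmful
            return tree
        parent, value, leaf = best
        parent["children"][value] = leaf                  # commit the most helpful replacement
```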
General Structure
•  Task: classification, discriminative
•  Model structure: decision tree
•  Score function:
   –  information gain at each node
   –  preference for short trees
   –  preference for high-gain attributes near the root
•  Optimization / search method:
   –  greedy search from simple to complex
   –  guided by information gain

Problems with Information Gain
•  Biased towards attributes with many values
   –  e.g. splitting the 9 yes / 5 no set on Day gives 14 singleton subsets (D1 … D14),
      each perfectly pure (1 / 0 or 0 / 1) => "optimal" split
   –  but this won't work for new data: D15 Rain High Weak
•  Use GainRatio:
   GainRatio(S, A) = Gain(S, A) / SplitEntropy(S, A)
   SplitEntropy(S, A) = - Σ_{V ∈ Values(A)} (|S_V| / |S|) log2 (|S_V| / |S|)
   –  A … candidate attribute
   –  V … possible values of A
   –  S … set of examples {X}
   –  S_V … subset of S where X_A = V
   –  penalizes attributes with many values
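A quick numeric illustration of why the denominator helps, using the 14 examples: Wind splits them 8 / 6, while Day splits them into 14 singletons (the 0.049 gain for Wind is taken from the earlier slide; the rest is computed):

```python
import math

def split_entropy(counts):
    """SplitEntropy(S, A): entropy of the split sizes |S_V| / |S| themselves."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(split_entropy([8, 6]))     # ~0.985  (Wind: 8 Weak / 6 Strong)
print(split_entropy([1] * 14))   # ~3.807  (Day: 14 singletons, so its gain is divided by much more)

# GainRatio(S, Wind) = Gain(S, Wind) / SplitEntropy(S, Wind), with Gain = 0.049 from the slide:
print(0.049 / split_entropy([8, 6]))   # ~0.050
```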
Trees are interpretable
•  Read rules off the tree
   –  concise description of what makes an item positive
•  No "black box"
   –  important for users
•  Rule:
   (Outlook = Overcast) ∨
   (Outlook = Rain ∧ Wind = Weak) ∨
   (Outlook = Sunny ∧ Humidity = Normal)
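The same rule reads directly as a one-line predicate (illustrative):

```python
def plays_tennis(outlook, humidity, wind):
    """The tree's positive-class rule, read straight off its three 'Yes' leaves."""
    return ((outlook == "Overcast") or
            (outlook == "Rain" and wind == "Weak") or
            (outlook == "Sunny" and humidity == "Normal"))

print(plays_tennis("Rain", "High", "Weak"))  # D15 -> True (Yes)
```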
Continuous Attributes
•  Dealing with continuous-valued attributes:
   –  create a binary split, e.g. (Temperature > 72.3) = True / False
   –  the threshold can be optimized (WF 6.1)
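One common way to optimize the threshold is to evaluate the information gain of a split at every midpoint between consecutive distinct values of the sorted attribute. A sketch with made-up temperatures, since this deck's table has no Temperature column:

```python
import math

def entropy(labels):
    """Binary entropy of a list of "Yes"/"No" labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in (labels.count("Yes"), labels.count("No")) if c > 0)

def best_threshold(values, labels):
    """Try (attribute > t) for every midpoint t between consecutive distinct sorted values
    and return the threshold with the largest information gain."""
    pairs = sorted(zip(values, labels))
    base, n = entropy(labels), len(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                      # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        g = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if g > best_gain:
            best_gain, best_t = g, t
    return best_gain, best_t

# Illustrative temperatures and labels (not from this deck's table):
print(best_threshold([64, 69, 72, 75, 81, 85], ["Yes", "Yes", "Yes", "No", "No", "No"]))
# -> (1.0, 73.5): the split (Temperature > 73.5) separates these two classes perfectly
```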
Multi-class and Regression
•  Multi-class classification:
   –  predict the most frequent class in the subset
   –  entropy: H(S) = - Σ_c p(c) log2 p(c)
   –  p(c) … % of examples of class c in S
•  Regression:
   –  predicted output = mean of the training examples in the subset
   –  requires a different definition of entropy
   –  can use linear regression at the leaves (WF 6.5)
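Two tiny illustrative snippets for the generalizations above: the multi-class entropy, and a regression leaf that predicts the mean of its training targets:

```python
import math
from collections import Counter

def entropy_multiclass(labels):
    """H(S) = - sum_c p(c) log2 p(c): the multi-class generalization of entropy."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def regression_leaf(targets):
    """A regression leaf predicts the mean of the training outputs that reach it."""
    return sum(targets) / len(targets)

print(entropy_multiclass(["a", "a", "b", "c"]))   # 1.5 bits
print(regression_leaf([3.0, 5.0, 10.0]))          # 6.0
```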
Summary
•  ID3 grows the decision tree from the root down
   –  greedily selects the next best attribute (by Gain)
•  Searches a complete hypothesis space
   –  prefers smaller trees, with high gain at the root
•  Overfitting is addressed by post-pruning
   –  prune nodes while accuracy on the validation set keeps improving
•  Can handle missing data (see WF 6.1)
•  Easily handles irrelevant variables
   –  Information Gain = 0 => the attribute will not be selected
Random Decision Forest
•  Grow K different decision trees:
   –  pick a random subset S_r of the training examples
   –  grow a full ID3 tree (no pruning):
      •  when splitting: pick from d << D random attributes
      •  compute gain based on S_r instead of the full set
   –  repeat for r = 1 … K
•  Given a new data point X:
   –  classify X using each of the K trees
   –  use a majority vote: the class predicted most often
•  Fast, scalable, state-of-the-art performance
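A sketch of this procedure, reusing `gain`, `predict`, and `examples` from the earlier sketches; the values of K and d are illustrative, and the random subset S_r is taken here to be a bootstrap sample, which is one common choice:

```python
import random
from collections import Counter

def id3_random(examples, attributes, d):
    """Like the id3 sketch above, but each split considers only d randomly chosen attributes."""
    labels = [x["Play"] for x in examples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    pool = random.sample(attributes, min(d, len(attributes)))   # pick from d << D random attributes
    best = max(pool, key=lambda a: gain(examples, a))           # gain() from the ID3 sketch, on S_r
    node = {"attribute": best, "children": {}}
    for value in set(x[best] for x in examples):
        subset = [x for x in examples if x[best] == value]
        node["children"][value] = id3_random(subset, [a for a in attributes if a != best], d)
    return node

def random_forest(train, attributes, K=25, d=2):
    """Grow K full (unpruned) trees, each on its own random subset S_r of the training data."""
    forest = []
    for _ in range(K):
        S_r = [random.choice(train) for _ in train]             # here: a bootstrap sample as S_r
        forest.append(id3_random(S_r, attributes, d))
    return forest

def forest_predict(forest, x):
    """Classify x with each of the K trees and return the majority vote."""
    votes = [predict(tree, x) for tree in forest]               # predict() from the pruning sketch
    return Counter(votes).most_common(1)[0][0]

forest = random_forest(examples, ["Outlook", "Humidity", "Wind"])
print(forest_predict(forest, {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))  # D15
```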