Decision Trees
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Function Approximation
Problem Setting:
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}
Input: Training examples of the unknown target function f
$\{\langle x_i, y_i \rangle\}_{i=1}^{n} = \{\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle\}$
Output: Hypothesis h ∈ H that best approximates f
Based on slide by Tom Mitchell
Sample Dataset
• Columns denote features X_i
• Rows denote labeled instances ⟨x_i, y_i⟩
• Class label denotes whether a tennis game was played
Decision Tree
• A possible decision tree for the data:
• Each internal node: tests one attribute X_i
• Each branch from a node: selects one value for X_i
• Each leaf node: predicts Y (or p(Y | x ∈ leaf))
Based on slide by Tom Mitchell
Decision Tree
• A possible decision tree for the data:
• What prediction would we make for
  <outlook=sunny, temperature=hot, humidity=high, wind=weak>?
Based on slide by Tom Mitchell
Decision Tree
• If features are continuous, internal nodes can test the value of a feature against a threshold
Decision Tree Learning
Problem Setting:
• Set of possible instances X
  – each instance x in X is a feature vector
  – e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
  – Y is discrete-valued
• Set of function hypotheses H = {h | h : X → Y}
  – each hypothesis h is a decision tree
  – a tree sorts x to a leaf, which assigns y
Slide by Tom Mitchell
Stages of (Batch) Machine Learning
Given: labeled training data $X, Y = \{\langle x_i, y_i \rangle\}_{i=1}^{n}$
• Assumes each $x_i \sim \mathcal{D}(\mathcal{X})$ with $y_i = f_{\text{target}}(x_i)$
Train the model: model ← classifier.train(X, Y)
[Diagram: (X, Y) → learner → model; x → model → y_prediction]
Apply the model to new data:
• Given: new unlabeled instance $x \sim \mathcal{D}(\mathcal{X})$
• y_prediction ← model.predict(x)
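As a concrete illustration of this train/predict workflow, here is a minimal sketch assuming scikit-learn's DecisionTreeClassifier and toy data (the library choice and the data are illustrative, not prescribed by the slides):

# Minimal sketch of the batch-learning workflow (assumes scikit-learn is installed)
from sklearn.tree import DecisionTreeClassifier

# Toy labeled training data: rows of X are feature vectors x_i, y holds labels y_i
X = [[0, 1], [1, 1], [1, 0], [0, 0]]
y = [0, 1, 1, 0]

# Train the model: model <- classifier.train(X, Y)
model = DecisionTreeClassifier().fit(X, y)

# Apply the model to a new, unlabeled instance x
y_prediction = model.predict([[1, 1]])
print(y_prediction)  # e.g., [1]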
Example Application: A Tree to Predict Caesarean Section Risk
Based on Example by Tom Mitchell
Decision Tree Induced Partition
[Figure: a decision tree that splits first on Color (green, blue, red), then on Size or Shape, with +/− class labels at the leaves, inducing a partition of the feature space.]
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
  – or a probability distribution over labels
[Figure: 2D feature space partitioned into axis-parallel rectangles, showing the decision boundary.]
Expressiveness
• Decision trees can represent any boolean function of the input attributes
• Truth table row → path to leaf
• In the worst case, the tree will require exponentially many nodes
Expressiveness
Decision trees have a variable-sized hypothesis space
• As the # of nodes (or depth) increases, the hypothesis space grows
  – Depth 1 ("decision stump"): can represent any boolean function of one feature
  – Depth 2: any boolean function of two features; some involving three features
    (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3); see the sketch below)
  – etc.
Based on slide by Pedro Domingos
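To make the depth-2 claim concrete, here is a small illustrative check (my own example, not from the slides): the tree tests x1 at the root and then x2 or x3 at depth 2, and agrees with (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3) on every truth-table row.

from itertools import product

def formula(x1, x2, x3):
    # Target boolean function: (x1 AND x2) OR ((NOT x1) AND (NOT x3))
    return (x1 and x2) or ((not x1) and (not x3))

def depth2_tree(x1, x2, x3):
    # Depth-2 decision tree: the root tests x1, each child tests one more feature
    if x1:
        return x2       # left subtree: test x2
    return not x3       # right subtree: test x3

# The tree matches the formula on all 8 assignments of the three features
assert all(formula(*bits) == bool(depth2_tree(*bits))
           for bits in product([False, True], repeat=3))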
Another Example: Restaurant Domain (Russell & Norvig)
Model a patron's decision of whether to wait for a table at a restaurant
• ~7,000 possible cases
A Decision Tree from Introspection
Is this the best decision tree?
Preference bias: Ockham's Razor
• Principle stated by William of Ockham (1285–1347)
  – "non sunt multiplicanda entia praeter necessitatem"
    (entities are not to be multiplied beyond necessity)
  – AKA Occam's Razor, Law of Economy, or Law of Parsimony
Idea: The simplest consistent explanation is the best
• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small
Basic Algorithm for Top-Down Induction of Decision Trees
[ID3, C4.5 by Quinlan]
node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop.
   Else, recurse over new leaf nodes.
How do we choose which attribute is best?
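A minimal Python sketch of this greedy top-down loop (illustrative only; the examples format and the pluggable best_attribute chooser are my own conventions, with the chooser standing in for the attribute-selection heuristics discussed next):

# Hedged sketch of top-down decision tree induction (ID3-style greedy loop).
# Each example is a (feature_dict, label) pair; `best_attribute` is a placeholder
# for a selection heuristic such as max information gain.
def grow_tree(examples, attributes, best_attribute):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # step 5: perfectly classified -> stop
        return labels[0]
    if not attributes:                        # no attributes left -> majority label
        return max(set(labels), key=labels.count)
    A = best_attribute(examples, attributes)  # step 1: pick the "best" attribute
    node = {"attribute": A, "children": {}}   # step 2: A is this node's decision attribute
    for v in {x[A] for x, _ in examples}:     # step 3: one descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]   # step 4: sort examples to children
        node["children"][v] = grow_tree(subset,               # step 5: recurse over new leaves
                                        [a for a in attributes if a != A],
                                        best_attribute)
    return node

# Tiny usage example with a trivial chooser (take the first remaining attribute)
toy = [({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
       ({"Outlook": "Rain",  "Wind": "Weak"}, "Yes")]
print(grow_tree(toy, ["Outlook", "Wind"], lambda ex, attrs: attrs[0]))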
Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
  – Random: Select any attribute at random
  – Least-Values: Choose the attribute with the smallest number of possible values
  – Most-Values: Choose the attribute with the largest number of possible values
  – Max-Gain: Choose the attribute that has the largest expected information gain
    • i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute
Choosing an Attribute
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
Which split is more informative: Patrons? or Type?
Based on Slide from M. desJardins & T. Finin
ID3-Induced Decision Tree
Based on Slide from M. desJardins & T. Finin
Compare the Two Decision Trees
Based on Slide from M. desJardins & T. Finin
Information Gain
Which test is more informative?
[Figure: two candidate splits of a loan-applicant dataset: one on whether Balance exceeds 50K (branches "Less or equal 50K" / "Over 50K"), and one on whether the applicant is employed (branches "Unemployed" / "Employed").]
Based on slide by Pedro Domingos
Information Gain
Impurity/Entropy (informal):
– Measures the level of impurity in a group of examples
Based on slide by Pedro Domingos
Impurity
[Figure: three example groups ranging from a very impure group (an even mix of the two classes), to a less impure group, to a group with minimum impurity (all examples from one class).]
Based on slide by Pedro Domingos
Entropy: a common way to measure impurity
Entropy H(X) of a random variable X:
$H(X) = -\sum_{i=1}^{n} P(X = i) \log_2 P(X = i)$, where n is the # of possible values for X
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
Why? Information theory:
• The most efficient code assigns $-\log_2 P(X = i)$ bits to encode the message X = i
• So, the expected number of bits to code one random X is $\sum_{i=1}^{n} P(X = i)\left(-\log_2 P(X = i)\right) = H(X)$
Slide by Tom Mitchell
Example: Huffman code
• In 1952, MIT student David Huffman devised, in the course of doing a homework assignment, an elegant coding scheme which is optimal in the case where all symbols' probabilities are integral powers of 1/2.
• A Huffman code can be built in the following manner:
  – Rank all symbols in order of probability of occurrence
  – Successively combine the two symbols of the lowest probability to form a new composite symbol; eventually we will build a binary tree where each node is the probability of all nodes beneath it
  – Trace a path to each leaf, noting the direction taken at each node
Based on Slide from M. desJardins & T. Finin
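A small illustrative sketch of this construction (my own code, not from the slides), using Python's standard heapq for the repeated "combine the two lowest-probability symbols" step:

import heapq

def huffman_code(probs):
    """Build a Huffman code for {symbol: probability} by repeatedly merging the
    two lowest-probability nodes, then reading off the 0/1 path to each leaf."""
    # Each heap entry: (probability, tie-break id, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # lowest-probability node
        p2, _, codes2 = heapq.heappop(heap)   # second-lowest node
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next_id, merged))   # new composite symbol
        next_id += 1
    return heap[0][2]

# The example from the next slide: A=.125, B=.125, C=.25, D=.5
probs = {"A": 0.125, "B": 0.125, "C": 0.25, "D": 0.5}
codes = huffman_code(probs)
avg_bits = sum(p * len(codes[s]) for s, p in probs.items())
print(codes, avg_bits)  # code lengths 3, 3, 2, 1 -> average 1.75 bits/message
# (the particular 0/1 patterns may differ from the slide's tree, but the lengths match)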
Huffman code example
[Figure: Huffman tree for symbols A (p = .125), B (p = .125), C (p = .25), D (p = .5), built by repeatedly merging the two lowest-probability nodes; tracing the 0/1 branches from the root to each leaf yields the codewords in the table below.]
Based on Slide from M. desJardins & T. Finin
M    code   length   prob    length × prob
A    000    3        0.125   0.375
B    001    3        0.125   0.375
C    01     2        0.250   0.500
D    1      1        0.500   0.500
                     average message length: 1.750
If we use this code to send many messages (A, B, C, or D) with this probability distribution, then, over time, the average bits/message should approach 1.75.
2-Class Cases:
Entropy: $H(x) = -\sum_{i=1}^{n} P(x = i) \log_2 P(x = i)$
• What is the entropy of a group in which all examples belong to the same class?
  – entropy = $-1 \cdot \log_2 1 = 0$
  – Minimum impurity: not a good training set for learning
• What is the entropy of a group with 50% in either class?
  – entropy = $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
  – Maximum impurity: a good training set for learning
Based on slide by Pedro Domingos
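A tiny illustrative check (my own code, not from the slides) that computes this entropy and reproduces both cases:

import math

def entropy(probs):
    """H(x) = -sum_i P(x = i) * log2 P(x = i), with the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))       # all examples in one class -> 0.0 (minimum impurity)
print(entropy([0.5, 0.5]))  # 50/50 split               -> 1.0 (maximum impurity)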
Sample Entropy
For a sample S of training examples with a fraction $p_\oplus$ positive and a fraction $p_\ominus$ negative, the sample entropy is $H(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$.
Slide by Tom Mitchell
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Based on slide by Pedro Domingos
From Entropy to Information Gain
Entropy H(X) of a random variable X:
$H(X) = -\sum_i P(X = i) \log_2 P(X = i)$
Specific conditional entropy H(X | Y = v) of X given Y = v:
$H(X \mid Y = v) = -\sum_i P(X = i \mid Y = v) \log_2 P(X = i \mid Y = v)$
Conditional entropy H(X | Y) of X given Y:
$H(X \mid Y) = \sum_v P(Y = v)\, H(X \mid Y = v)$
Mutual information (aka information gain) of X and Y:
$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
Slide by Tom Mitchell
Information Gain
Information Gain is the mutual information between input attribute A and target variable Y.
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
$Gain(S, A) = H_S(Y) - H_S(Y \mid A)$
Slide by Tom Mitchell
Calculating Information Gain
Information Gain = entropy(parent) - [weighted average entropy(children)]
Entire population (30 instances); the split sends 17 instances to one child and 13 to the other.
parent entropy: $-\frac{14}{30}\log_2\frac{14}{30} - \frac{16}{30}\log_2\frac{16}{30} = 0.996$
child entropy (17 instances): $-\frac{13}{17}\log_2\frac{13}{17} - \frac{4}{17}\log_2\frac{4}{17} = 0.787$
child entropy (13 instances): $-\frac{1}{13}\log_2\frac{1}{13} - \frac{12}{13}\log_2\frac{12}{13} = 0.391$
(Weighted) average entropy of children = $\frac{17}{30}\cdot 0.787 + \frac{13}{30}\cdot 0.391 = 0.615$
Information Gain = 0.996 - 0.615 = 0.38
Based on slide by Pedro Domingos
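A short illustrative sketch (my own code; the function names are not from the slide) that reproduces this calculation from the class counts:

import math

def entropy_from_counts(counts):
    """Entropy of a label distribution given per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """entropy(parent) - weighted average entropy of the children."""
    n = sum(parent_counts)
    avg_child = sum(sum(child) / n * entropy_from_counts(child)
                    for child in children_counts)
    return entropy_from_counts(parent_counts) - avg_child

# Worked example: parent has 14 vs. 16 instances of the two classes;
# the split yields children with (13, 4) and (1, 12) instances.
print(information_gain([14, 16], [[13, 4], [1, 12]]))  # ≈ 0.38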
Entropy-Based Automatic Decision Tree Construction
Training Set X:
x1 = (f11, f12, ..., f1m)
x2 = (f21, f22, ..., f2m)
...
xn = (fn1, fn2, ..., fnm)
Node 1: What feature should be used? What values?
Quinlan suggested information gain in his ID3 system, and later the gain ratio, both based on entropy.
Based on slide by Pedro Domingos
Using Information Gain to Construct a Decision Tree
Full Training Set X:
• Choose the attribute A with the highest information gain for the full training set at the root of the tree.
• Construct child nodes for each value v1, v2, ..., vk of A. Each child has an associated subset of vectors in which A has that particular value, e.g., X_{v1} = {x ∈ X | value(A) = v1}.
• Repeat recursively on each child's subset. Till when?
Disadvantage of information gain:
• It prefers attributes with a large number of values, which split the data into small, pure subsets
• Quinlan's gain ratio uses normalization to improve this (see the sketch below)
Based on slide by Pedro Domingos
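A brief, self-contained sketch of the gain ratio under its usual definition (information gain divided by the split information of the attribute); this is illustrative code, not Quinlan's implementation:

import math

def _entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    """Information gain normalized by split information, which penalizes
    attributes that shatter the data into many small branches."""
    n = sum(parent_counts)
    weights = [sum(child) / n for child in children_counts]
    gain = _entropy(parent_counts) - sum(w * _entropy(child)
                                         for w, child in zip(weights, children_counts))
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    return gain / split_info if split_info > 0 else 0.0

# Same split as the worked example: gain ≈ 0.38, split info ≈ 0.99, ratio ≈ 0.39
print(gain_ratio([14, 16], [[13, 4], [1, 12]]))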
Slide by Tom Mitchell
Decision Tree Applet
http://webdocs.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html
Which Tree Should We Output?
• ID3 performs heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?
Occam's razor: prefer the simplest hypothesis that fits the data
Slide by Tom Mitchell
The ID3 algorithm builds a decision tree, given a set of non-categorical attributes C1, C2, ..., Cn, the class attribute C, and a training set T of records.

function ID3(R: input attributes, C: class attribute,
             S: training set) returns decision tree;
   If S is empty, return a single node with value Failure;
   If every example in S has the same value for C, return a
      single node with that value;
   If R is empty, then return a single node with the most
      frequent of the values of C found in the examples of S;
      # this case causes errors -- improperly classified records
   Let D be the attribute with the largest Gain(D, S) among R;
   Let {d_j | j = 1, 2, ..., m} be the values of attribute D;
   Let {S_j | j = 1, 2, ..., m} be the subsets of S consisting of
      records with value d_j for attribute D;
   Return a tree with root labeled D and arcs labeled
      d_1, ..., d_m going to the trees
      ID3(R - {D}, C, S_1), ..., ID3(R - {D}, C, S_m);

Based on Slide from M. desJardins & T. Finin
How well does it work?
Many case studies have shown that decision trees are at least as accurate as human experts.
– A study for diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly
– British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms that replaced an earlier rule-based expert system
– Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example
Based on Slide from M. desJardins & T. Finin