
COM24111: Machine Learning
Decision Trees
Gavin Brown
www.cs.man.ac.uk/~gbrown
Recap: threshold classifiers
[Figure: scatter of height vs weight, with a threshold t on the weight axis]
if ( weight > t ) then "player" else "dancer"
Also known as a "decision stump".

Recall the definition of a decision stump back in subsection 2.1.2, and imagine the following 1-dimensional problem.

[Figure: 1-d data on an axis from 10 to 60; crosses are label y = 1, dots are label y = 0]

The crosses are label y = 1, and the blue dots y = 0. With this non-linearly separable data, we decide to fit a decision stump. Our model has the form:

if x > t then ŷ = 1 else ŷ = 0    (6.1)

Q. Where is a good threshold?

[Diagrams: candidate stumps "x > 25 ?", "x > 16 ?" and "x > 50 ?", each with a "no" branch predicting ŷ = 0 and a "yes" branch predicting ŷ = 1]

SELF-TEST
Where is an optimal decision stump threshold for the data and model above? Draw it into the diagram above.
Hint: you should find that it has a classification error rate of about 0.364.
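The threshold search in the self-test is easy to sketch in code. Below is a minimal Python sketch, not the notes' own implementation: the x values are read off the figure on the later slides, and the labels are my guess, chosen only to be consistent with the error counts quoted on these slides (4 mistakes at x > 25, i.e. about 0.364).

import numpy as np

def fit_stump(x, y):
    """Line search for the threshold t minimising errors of: if x > t then yhat = 1 else yhat = 0."""
    xs = np.sort(x)
    best_t, best_err = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2.0:      # candidate thresholds: midpoints between neighbours
        yhat = (x > t).astype(int)          # the model of Eq. (6.1)
        err = np.mean(yhat != y)            # classification error rate
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# x values as printed on the later slides; the labels are hypothetical, chosen to match
# the error counts quoted here (4 mistakes at x > 25, error rate 4/11 = 0.364).
x = np.array([10.5, 14.1, 17.0, 21.0, 23.2, 27.1, 30.1, 42.0, 47.0, 57.3, 59.9])
y = np.array([1,    1,    0,    0,    0,    1,    1,    1,    1,    0,    0])
t, err = fit_stump(x, y)
print(t, round(err, 3))   # threshold near 25, error about 0.364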
From Decision Stumps, to Decision Trees
- New type of non-linear model
- Copes naturally with continuous and categorical data
- Fast to both train and test (highly parallelizable)
- Generates a set of interpretable rules
Recap: Decision Stumps

[Figure: the same 1-d problem, now with the data values listed: 10.5, 14.1, 17.0, 21.0, 23.2, 27.1, 30.1, 42.0, 47.0, 57.3, 59.9]

if x > t then ŷ = 1 else ŷ = 0    (6.1)

x > 25 ?
  no:  predict ŷ = 0 (subsample 10.5, 14.1, 17.0, 21.0, 23.2)
  yes: predict ŷ = 1 (subsample 27.1, 30.1, 42.0, 47.0, 57.3, 59.9)

The stump "splits" the dataset.
Here we have 4 classification errors.
A modified stump

[Figure: the same 1-d data, with an improved stump at x > 48; here we have 3 classification errors]

Comparing with fig 6.1, this shows a way for us to make an improved stump model. Our improved decision stump model, for a given threshold t, is:

Set y_right to the most common label in the (> t) subsample.
Set y_left to the most common label in the (< t) subsample.

if x > t then
    predict ŷ = y_right
else
    predict ŷ = y_left
endif

The learning algorithm would be the same, simple line-search to find the optimum threshold that minimises the number of mistakes. So, our improved stump model works by thresholding on the training data, and predicting a test data point label as the most common label observed in the training data subsample.

Even with this improved stump, though, we are still making some errors. There is in principle no reason why we can't fit another decision stump (or indeed any other model) to these data sub-samples. On the left branch data sub-sample, we could easily pick an optimal threshold for a stump, and the same on the right.
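A minimal sketch of the improved stump, again using the hypothetical labels from the previous sketch: each side of the threshold predicts its most common training label, and the threshold is found by the same line search.

import numpy as np

def fit_improved_stump(x, y):
    """Line search for t; each side of the threshold predicts its most common training label."""
    xs = np.sort(x)
    best = None
    for t in (xs[:-1] + xs[1:]) / 2.0:
        y_right = np.bincount(y[x > t], minlength=2).argmax()   # most common label in (> t) subsample
        y_left  = np.bincount(y[x <= t], minlength=2).argmax()  # most common label in (< t) subsample
        err = np.mean(np.where(x > t, y_right, y_left) != y)
        if best is None or err < best[1]:
            best = (t, err, y_left, y_right)
    return best

# Same x values and hypothetical labels as in the previous sketch.
x = np.array([10.5, 14.1, 17.0, 21.0, 23.2, 27.1, 30.1, 42.0, 47.0, 57.3, 59.9])
y = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0])
t, err, y_left, y_right = fit_improved_stump(x, y)
print(t, round(err, 3), y_left, y_right)   # 3 mistakes, error 3/11 = 0.273

Any threshold between 47.0 and 57.3 gives the same 3 mistakes on this labelling, so the search may report a slightly different number than the 48 drawn on the slide.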
Recursion…

x > 25 ? splits the data into two subsamples:
  no:  10.5, 14.1, 17.0, 21.0, 23.2
  yes: 27.1, 30.1, 42.0, 47.0, 57.3, 59.9

Each subsample is just another dataset! Build a stump on each:
  left branch:  x > 16 ?  (no: ŷ = 1, yes: ŷ = 0)
  right branch: x > 50 ?  (no: ŷ = 1, yes: ŷ = 0)
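The recursion can be written directly: treat each subsample as just another dataset, fit a stump to it, and stop when a subsample is pure. This is a sketch of the idea only, not the ID3 algorithm mentioned later (it splits on classification error rather than information gain), and it reuses the same hypothetical labels as above.

import numpy as np

def best_split(x, y):
    """Line-search threshold minimising errors when each side predicts its majority label."""
    xs = np.sort(x)
    best_t, best_err = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2.0:
        yhat = np.where(x > t,
                        np.bincount(y[x > t], minlength=2).argmax(),
                        np.bincount(y[x <= t], minlength=2).argmax())
        err = np.mean(yhat != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def build_tree(x, y, min_size=2):
    """Each subsample is 'just another dataset': fit a stump, then recurse on the two halves."""
    if len(np.unique(y)) == 1 or len(y) < min_size:        # pure or tiny node: make a leaf
        return {"leaf": int(np.bincount(y, minlength=2).argmax())}
    t = best_split(x, y)
    left, right = x <= t, x > t
    if left.all() or right.all():                          # degenerate split: make a leaf
        return {"leaf": int(np.bincount(y, minlength=2).argmax())}
    return {"t": t,
            "no":  build_tree(x[left],  y[left],  min_size),
            "yes": build_tree(x[right], y[right], min_size)}

def predict(tree, x_new):
    """Follow the nested rules down to a leaf."""
    while "leaf" not in tree:
        tree = tree["yes"] if x_new > tree["t"] else tree["no"]
    return tree["leaf"]

# Same x values and hypothetical labels as in the earlier sketches.
x = np.array([10.5, 14.1, 17.0, 21.0, 23.2, 27.1, 30.1, 42.0, 47.0, 57.3, 59.9])
y = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0])
tree = build_tree(x, y)
print(tree)                 # nested thresholds: compare with the tree on the slide
print(predict(tree, 33.0))  # 1 for this labelling

Because the splits are chosen greedily, the thresholds may come out in a different order than on the slide (here the root splits near 50 rather than 25), but the predictions on this toy data are the same.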
Decision Trees = nested rules

[Figure: the 1-d data (10 to 60) with the full tree drawn over it]

x > 25 ?
  no:  x > 16 ?  (no: ŷ = 1, yes: ŷ = 0)
  yes: x > 50 ?  (no: ŷ = 1, yes: ŷ = 0)

if x>25 then
   if x>50 then y=0 else y=1; endif
else
   if x>16 then y=0 else y=1; endif
endif
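The nested rules above translate line-for-line into ordinary code, which is why prediction is so cheap. A direct Python transcription, with the thresholds exactly as on the slide:

def predict_tree(x):
    # The tree from the slide, written as nested rules.
    if x > 25:
        return 0 if x > 50 else 1
    else:
        return 0 if x > 16 else 1

print(predict_tree(33.0))   # yes-branch of "x > 25 ?", no-branch of "x > 50 ?": predicts 1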
Trees build “orthogonal” decision boundaries.
Boundary is piecewise, and at 90 degrees to feature axes.
The most important concept in Machine Learning

Looks good so far…
Oh no! Mistakes!
What happened?

We didn't have all the data.
We can never assume that we do.
This is called "OVER-FITTING" to the small dataset.
Number of possible paths down tree tells you the number of rules.
More rules = more complicated.
Could have N rules, where N is the number of examples in the training data.
… the more rules… the more chance of OVERFITTING.
[The same tree as before: x > 25 ?, with x > 16 ? on the no branch and x > 50 ? on the yes branch]
Overfitting…

[Plot: testing error against tree depth / length of rules (1 to 9); the optimal depth is about where the testing error bottoms out, after which the tree is starting to overfit]
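One way to reproduce a curve like this for yourself is to sweep the maximum depth of a tree and compare training and test error. This sketch uses scikit-learn's DecisionTreeClassifier on a small synthetic, noisy dataset; both the library and the data are my choices for illustration, not part of the course code.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 60, size=(400, 2))                     # synthetic 2-d inputs
y = ((X[:, 0] > 25) ^ (X[:, 1] > 30)).astype(int)         # an underlying axis-aligned pattern
y = y ^ (rng.random(400) < 0.15)                          # plus some label noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in range(1, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth,
          round(1 - clf.score(X_tr, y_tr), 3),    # training error keeps falling with depth
          round(1 - clf.score(X_te, y_te), 3))    # test error flattens out, then creeps up again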
Take a short break.
Talk to your neighbours.
Make sure you understand the recursive algorithm.
Decision trees can be seen as nested rules.
Nested rules are FAST, and highly parallelizable.
x,y,z-coordinates per joint, ~60 total
x,y,z-velocities per joint, ~60 total
joint angles (~35 total)
joint angular velocities (~35 total)
if x>25 then
if x>50 then y=0 else y=1; endif
else
if x>16 then y=0 else y=1; endif
endif
Real-time human pose recognition in parts from single depth images
Shotton et al., Microsoft Research. Computer Vision and Pattern Recognition (CVPR) 2011.
Trees are the basis of the Kinect controller.
Features are simple image properties.
Test phase: 200 frames per sec on GPU.
Train phase: more complex, but still parallel (GPU). Training 3 trees to depth 20 from 1 million images takes about a day on a 1000-core cluster.

[Slide shows Figures 3, 4 and 5 from the paper, with excerpts from Sections 3.2 and 3.3]

Figure 3. Depth image features. The yellow crosses indicate the pixel x being classified. The red circles indicate the offset pixels as defined in Eq. 1. In (a), the two example features give a large depth difference response. In (b), the same two features at new image locations give a much smaller response.

Figure 4. Randomized Decision Forests. A forest is an ensemble of trees. Each tree consists of split nodes (blue) and leaf nodes (green). The red arrows indicate the different paths that might be taken by different trees for a particular input.

Figure 5. Example inferences. Synthetic (top row); real (middle); failure modes (bottom). In each example we see the depth image, the inferred most likely body part labels, and the joint proposals (overlaid on a depth point cloud); only the most confident proposal for each joint is shown.

From Sec. 3.2 (Depth image features): we employ simple depth comparison features, inspired by those in [20]. At a given pixel x, a feature computes

    f_θ(I, x) = d_I( x + u / d_I(x) ) − d_I( x + v / d_I(x) ),    (1)

where d_I(x) is the depth at pixel x in image I, and the parameters θ = (u, v) describe offsets u and v.

From Sec. 3.3 (Randomized decision forests): randomized decision trees and forests [35, 30, 2, 8] have proven fast and effective multi-class classifiers for many tasks [20, 23, 36], and can be implemented efficiently on the GPU [34]. As illustrated in Fig. 4, a forest is an ensemble of T decision trees, each consisting of split and leaf nodes. Each split node consists of a feature f_θ and a threshold τ. To classify pixel x in image I, one starts at the root and repeatedly evaluates Eq. 1, branching left or right according to the comparison to threshold τ. At the leaf node reached in tree t, a learned distribution P_t(c | I, x) over body part labels c is stored. The distributions are averaged together for all trees in the forest to give the final classification:

    P(c | I, x) = (1/T) Σ_{t=1..T} P_t(c | I, x).    (2)

Parts for left and right allow the classifier to disambiguate the left and right sides of the body. Of course, the precise definition of these parts could be changed to suit a particular application: in an upper body tracking scenario, all the lower body parts could be merged. Parts should be sufficiently small to accurately localize body joints, but not too numerous as to waste capacity of the classifier. To keep the training times down, a distributed implementation is used.
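A rough numpy paraphrase of Eq. 1 and Eq. 2 above; this is nothing like the paper's optimized GPU implementation, and the depth image, offsets and per-tree leaf distributions are all made up for illustration.

import numpy as np

def depth_feature(d, x, u, v):
    """Eq. 1: difference of the depth at two offsets from x, with offsets scaled by 1/depth at x."""
    def probe(offset):
        p = np.round(np.asarray(x) + np.asarray(offset) / d[tuple(x)]).astype(int)
        p = np.clip(p, 0, np.array(d.shape) - 1)          # keep the probe inside the image
        return d[tuple(p)]
    return probe(u) - probe(v)

# Made-up depth image (in metres) with a nearer "body" region, and made-up offsets.
d = np.full((480, 640), 2.0)
d[200:300, 300:400] = 1.2
x, u, v = (250, 350), (30, 0), (0, 120)
print(depth_feature(d, x, u, v))       # large response: one probe hits the body, one the background

# Eq. 2: average the per-tree leaf distributions P_t(c | I, x) over the forest.
P_t = np.array([[0.7, 0.2, 0.1],       # tree 1's leaf distribution over body parts c
                [0.6, 0.3, 0.1],       # tree 2
                [0.8, 0.1, 0.1]])      # tree 3
P = P_t.mean(axis=0)
print(P, P.argmax())                   # final P(c | I, x) and the most likely part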
We’ve been assuming continuous variables!
The Tennis Problem
[Figure: decision tree for the tennis problem]
Outlook = SUNNY → Humidity:  HIGH → NO,  NORMAL → YES
Outlook = OVERCAST → YES
Outlook = RAIN → Wind:  STRONG → NO,  WEAK → YES

Once again, an answer for any given example is found by following a path down the tree, answering questions as you go. Each path down the tree encodes an if-then rule. The full ruleset for this tree is:

if ( Outlook==sunny AND Humidity==high )   then NO
if ( Outlook==sunny AND Humidity==normal ) then YES
if ( Outlook==overcast )                   then YES
if ( Outlook==rain AND Wind==strong )      then NO
if ( Outlook==rain AND Wind==weak )        then YES
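That ruleset drops straight into code. A small sketch, with feature values written as lowercase strings to match the rules above:

def play_tennis(outlook, humidity, wind):
    # Each path down the tree is one if-then rule.
    if outlook == "sunny":
        return "NO" if humidity == "high" else "YES"
    if outlook == "overcast":
        return "YES"
    return "NO" if wind == "strong" else "YES"   # outlook == "rain"

print(play_tennis("sunny", "normal", "weak"))    # YES
print(play_tennis("rain", "high", "strong"))     # NO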
The Tennis Problem
Note: 9 examples say “YES”, while 5 say “NO”.
Partitioning the data…
Thinking in Probabilities…
The “Information” in a feature
More uncertainty = less information
H(X) = 1
The “Information” in a feature
Less uncertainty = more information
H(X) = 0.72193
Entropy
Calculating Entropy
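For a discrete variable, entropy is H(X) = −Σ_x p(x) log2 p(x), measured in bits. A minimal sketch that reproduces the two values quoted on the previous slides (a 50/50 split gives H = 1; an 80/20 split gives H ≈ 0.72193):

import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy([0.5, 0.5]))   # 1.0     : maximum uncertainty for a two-valued feature
print(entropy([0.8, 0.2]))   # 0.72193 : less uncertainty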
Information Gain, also known as "Mutual Information"

[Chart: the information gain of each feature; the split is chosen to have maximum information gain]
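The information gain of a feature about the label is Gain = H(label) − Σ_v p(v) H(label | feature = v). A sketch using the Outlook counts from the standard 14-example tennis data (assuming the usual counts, sunny 2 YES / 3 NO, overcast 4 / 0, rain 3 / 2, which are consistent with the 9 YES / 5 NO noted earlier):

import numpy as np

def entropy_from_counts(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(counts_per_value):
    """Gain = H(label) - sum over feature values v of p(v) * H(label | feature = v)."""
    counts = np.asarray(counts_per_value, dtype=float)
    total = counts.sum(axis=0)                                  # overall label counts
    conditional = sum(c.sum() / counts.sum() * entropy_from_counts(c) for c in counts)
    return entropy_from_counts(total) - conditional

# [YES, NO] label counts for each value of Outlook (standard play-tennis counts).
outlook = [[2, 3],   # sunny
           [4, 0],   # overcast
           [3, 2]]   # rain
print(information_gain(outlook))   # about 0.247 bits: Outlook has the maximum gain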
[Figures: growing the tennis tree; at each node the candidate splits on Temp, Humidity and Wind are compared, and the feature with the highest gain is chosen]
Decision trees
Trees learnt by recursive algorithm, ID3 (but there are many variants of this)
Trees draw a particular type of decision boundary (look it up in these slides!)
Highly efficient, highly parallelizable, lots of possible “splitting” criteria.
Powerful methodology, used in lots of industry applications.
SELF-TEST
Calculate the entropy of a two-valued feature, i.e. x ∈ {0, 1}, with p(x = 1) = 0.6.