Introduction to Machine Learning

Decision Trees
What is a Decision Tree?
How to build a good one…
Machine Learning Group
University College Dublin
Classifying Apples & Pears
No.   Greenness  Height  Width  Taste  Weight  Height/Width  Class
1     210        60      62     Sweet  186     0.97          Apple
2     220        70      53     Sweet  180     1.32          Pear
3     215        55      50     Tart   152     1.10          Apple
4     180        76      40     Sweet  152     1.90          Pear
5     220        68      45     Sweet  153     1.51          Pear
6     160        65      68     Sour   221     0.96          Apple
7     215        63      45     Sweet  140     1.40          Pear
8     180        55      56     Sweet  154     0.98          Apple
9     220        68      65     Tart   221     1.05          Apple
10    190        60      58     Sour   174     1.03          Apple

[Figure: "Apples & Pears" scatter plot of the ten examples, Height (50-80) on the x-axis against Width (30-70) on the y-axis, with apples and pears marked separately.]
A Decision Tree
[Figure: the same Height vs Width scatter plot of the apples and pears, partitioned into Apple and Pear regions by the tests in the tree below.]

Width?
  >55 → Apple
  <55 → Height?
           <59 → Apple
           >59 → Pear
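Read as a rule, the tree is just nested tests. A minimal Python sketch (an illustration, not part of the original slides; the function name classify_fruit is invented) that hard-codes the tree above and checks it against the ten training examples:

def classify_fruit(height, width):
    # Hand-coded version of the tree above; a width of exactly 55 is sent down the "<55" branch.
    if width > 55:
        return "Apple"
    if height < 59:
        return "Apple"
    return "Pear"

# (height, width, class) for examples No. 1-10 from the table.
data = [(60, 62, "Apple"), (70, 53, "Pear"), (55, 50, "Apple"), (76, 40, "Pear"),
        (68, 45, "Pear"), (65, 68, "Apple"), (63, 45, "Pear"), (55, 56, "Apple"),
        (68, 65, "Apple"), (60, 58, "Apple")]
assert all(classify_fruit(h, w) == c for h, w, c in data)   # the tree fits all ten examples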
A Decision Tree
(The same table of ten apples and pears as on the previous slides.)

An alternative tree that tests the derived Height/Width attribute instead of Height in the <55 branch:

Width?
  >55 → Apple
  <55 → Height/Width?
           <1.2 → Apple
           >1.2 → Pear

Both trees classify all ten training examples correctly.
Decision Trees
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification (see the sketch after this list)
• Cannot readily represent:
    XOR
    (A ∧ B) ∨ (C ∧ D ∧ E)
    M of N
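A minimal Python sketch of that node/branch/leaf structure (an illustration, not from the slides): an internal node stores the attribute it tests and one child per attribute value, a leaf stores the class it assigns.

class Leaf:
    def __init__(self, label):
        self.label = label            # classification assigned at this leaf

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute    # attribute tested at this internal node
        self.children = children      # dict: attribute value -> child Node or Leaf

def classify(tree, example):
    # Follow the branch matching the example's value for each tested attribute.
    while isinstance(tree, Node):
        tree = tree.children[example[tree.attribute]]
    return tree.label

# The apples-and-pears tree above, with the numeric tests recoded as boolean attributes.
fruit_tree = Node("Width>55", {
    True:  Leaf("Apple"),
    False: Node("Height>59", {False: Leaf("Apple"), True: Leaf("Pear")}),
})
print(classify(fruit_tree, {"Width>55": False, "Height>59": True}))   # -> Pear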
When to consider D-Trees
• Instances described by attribute-value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Classification can be done using a few features

Examples:
• Equipment or medical diagnosis
• Credit risk analysis
D-Tree Example
Attribute     Meaning
Alternative   Whether there is a suitable alternative restaurant nearby.
Bar           Is there a comfortable bar area?
Fri/Sat       True on Friday or Saturday nights.
Hungry        How hungry is the subject?
Patrons       How many people in the restaurant?
Price         Price range.
Raining       Is it raining outside?
Reservation   Does the subject have a reservation?
Type          Type of restaurant.
Stay?         The target: Stay or Go.
D-Tree Example
Case  Alt.  Bar  Fri.  Hun  Pat   Price  Rain  Res  Type     Est.   Stay?
X1    Yes   No   No    Yes  Some  $$$    No    Yes  French   0-10   Yes
X2    Yes   No   No    Yes  Full  $      No    No   Thai     30-60  No
X3    No    Yes  No    No   Some  $      No    No   Burger   0-10   Yes
X4    Yes   No   Yes   Yes  Full  $      No    No   Thai     10-30  Yes
X5    Yes   No   Yes   No   Full  $$$    No    Yes  French   >60    No
X6    No    Yes  No    Yes  Some  $$     Yes   Yes  Italian  0-10   Yes
X7    No    Yes  No    No   None  $      Yes   No   Burger   0-10   No
X8    No    No   No    Yes  Some  $$     Yes   Yes  Thai     0-10   Yes
X9    No    Yes  Yes   No   Full  $      Yes   No   Burger   >60    No
X10   Yes   Yes  Yes   Yes  Full  $$$    No    Yes  Italian  10-30  No
X11   No    No   No    No   None  $      No    No   Thai     0-10   No
X12   Yes   Yes  Yes   Yes  Full  $      No    No   Burger   30-60  Yes
D-Tree Example
A very good D-Tree:
• Classifies all examples correctly
• Very few nodes

Patrons?
  None → No
  Some → Yes
  Full → Hungry?
           No  → No
           Yes → Type?
                    French  → Yes
                    Italian → No
                    Thai    → Fri/Sat?
                                No  → No
                                Yes → Yes
                    Burger  → Yes

The objective in building a decision tree is to choose attributes so as to minimise the depth of the tree.
Top-down induction of D-Trees
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort training examples to the leaf nodes
5. If the training examples are perfectly classified, then stop;
   else repeat recursively over the leaf nodes
[Figure: two candidate splits of the same 64 examples [29+,35-].
  A1=?  t → [21+,5-]   f → [8+,30-]
  A2=?  t → [18+,33-]  f → [11+,2-]
Which attribute is best?]
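A minimal Python sketch of this recursive procedure (my illustration, not code from the slides). The choice of the "best" attribute is passed in as a function, since the next slides define it via entropy and information gain.

from collections import Counter

def build_tree(examples, attributes, target, choose_attribute):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:               # training examples perfectly classified: stop
        return labels[0]
    if not attributes:                      # nothing left to test: return the majority class
        return Counter(labels).most_common(1)[0][0]
    A = choose_attribute(examples, attributes, target)          # steps 1-2: pick the "best" attribute
    tree = {A: {}}
    for v in set(ex[A] for ex in examples):                     # step 3: one descendant per value of A
        subset = [ex for ex in examples if ex[A] == v]          # step 4: sort examples to the branches
        rest = [a for a in attributes if a != A]
        tree[A][v] = build_tree(subset, rest, target, choose_attribute)   # step 5: recurse
    return tree

With an information-gain chooser (defined on the next slides), Patrons should come out as the root attribute for the restaurant data above.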
Good and Bad Attributes
• A perfect attribute divides examples into categories of one type (e.g. Patrons).
• A poor attribute produces categories of mixed type (e.g. Type).

Patrons?
  None → 2 No
  Some → 4 Yes
  Full → 4 No, 2 Yes  (→ Hungry?)

Type?
  French  → 1 No, 1 Yes
  Italian → 1 No, 1 Yes
  Thai    → 2 No, 2 Yes
  Burger  → 2 No, 2 Yes

How can we measure this?
Entropy
[Figure: Entropy(S) plotted against p, the proportion of positive examples; it is 0 at p = 0 and p = 1 and peaks at 1 when p = 0.5.]
S is a sample of training examples.
p is the proportion of positive examples in S.
q is the proportion of negative examples in S.
Entropy measures the impurity of S:
Entropy(S) = -p log2(p) - q log2(q)
Entropy
Entropy(S) = the expected number of bits needed to encode the class (the one with probability p or the one with probability q) of a randomly drawn member of S, under the optimal shortest-length code.
Why?
Information theory: an optimal-length code assigns -log2(p) bits to a message of probability p.
So the expected number of bits to encode the classes of random members of S, which occur in the ratio p:q, is
  -p log2(p) - q log2(q)
i.e. Entropy(S) = -p log2(p) - q log2(q)
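As a small illustration (mine, not from the slides), the same formula in Python for a two-class sample described by its counts of positive and negative examples:

import math

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                          # treat 0 * log2(0) as 0
            pr = count / total
            h -= pr * math.log2(pr)
    return h

print(entropy(1, 1))     # 1.0: evenly mixed sample, maximally impure
print(entropy(4, 0))     # 0.0: pure sample
print(entropy(29, 35))   # ~0.99 for the [29+,35-] sample on the next slide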
Information Gain
Gain(S,A) = expected reduction in entropy due to
sorting on A
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} ( |Sv| / |S| ) · Entropy(Sv)
[Figure: the two candidate splits again.
  A1=? on [29+,35-]:  t → [21+,5-]   f → [8+,30-]
  A2=? on [29+,35-]:  t → [18+,33-]  f → [11+,2-]]
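A sketch of the same computation in Python (my illustration, not the slides'): a two-class entropy helper plus the gain formula, applied to the A1/A2 splits above.

import math

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, branches):
    # parent and each branch are (positive, negative) counts; the weights are |Sv| / |S|.
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in branches)

print(round(gain((29, 35), [(21, 5), (8, 30)]), 3))    # A1: 0.266
print(round(gain((29, 35), [(18, 33), (11, 2)]), 3))   # A2: 0.121, so A1 is the better attribute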
D-Tree Example
Name     Height  Hair   Eyes   Clan
Sean     short   blond  blue   McD
Mike     tall    blond  brown  Joyce
Paddy    tall    red    blue   McD
Mike Óg  short   dark   blue   Joyce
Colm     tall    dark   blue   Joyce
Liam     tall    blond  blue   McD
Johnny   tall    dark   brown  Joyce
Cóilín   short   blond  brown  Joyce
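As a worked check (my arithmetic, using the entropy and gain formulas above): the class split is 3 McD to 5 Joyce, so Entropy(S) = -(3/8) log2(3/8) - (5/8) log2(5/8) ≈ 0.954. Splitting on Hair gives blond = {2 McD, 2 Joyce} (entropy 1.0), red = {1 McD} (entropy 0) and dark = {3 Joyce} (entropy 0), so Gain(S, Hair) ≈ 0.954 - (4/8) · 1.0 ≈ 0.45, whereas Gain(S, Eyes) ≈ 0.35 and Gain(S, Height) ≈ 0.003. Hair is therefore chosen first, which yields the minimal tree on the next slide.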
Minimal D-Tree
Hair?
  dark  → Joyce   (short, dark, blue: J;  tall, dark, blue: J;  tall, dark, brown: J)
  red   → McD     (tall, red, blue: McD)
  blond → Eyes?
            blue  → McD     (short, blond, blue: McD;  tall, blond, blue: McD)
            brown → Joyce   (tall, blond, brown: J;  short, blond, brown: J)
Summary
• ML avoids some knowledge-engineering (KE) effort
• Recursive algorithm for building D-Trees
• Using information gain (entropy) to select the discriminating attribute
• Examples

Important People
• Claude Shannon: http://en.wikipedia.org/wiki/Claude_Shannon
• William of Ockham: http://en.wikipedia.org/wiki/William_of_Ockham