Decision Trees: Attribute Selection (cont)

Lecture Notes
CPSC 310 (Fall 2016)
S. Bowers
Today
• Quiz 6
• Attribute Selection (cont)
Assignments
• HW 4 due
• HW 5 out
Example iPhone Purchases (Fake)
standing   job status   credit rating   buys iphone
--------   ----------   -------------   -----------
   1           3           fair             no
   1           3           excellent        no
   2           3           fair             yes
   2           2           fair             yes
   2           1           fair             yes
   2           1           excellent        no
   2           1           excellent        yes
   1           2           fair             no
   1           1           fair             yes
   2           2           fair             yes
   1           2           excellent        yes
   2           2           excellent        yes
   2           3           fair             yes
   2           2           excellent        no
   2           3           fair             yes
• buys iphone is the class (with labels “yes” and “no”)
• standing: 1 = lower division, 2 = upper division
• job status: 1 = none, 2 = < 20 hrs, 3 = ≥ 20 hrs
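For the entropy and information-gain calculations later in these notes, it helps to have the table available in code. Here is a minimal sketch in Python, assuming each instance is a list with the class label last (the name iphone_table is just for illustration):

# One row per instance, in the order (standing, job status, credit rating, buys iphone).
iphone_table = [
    [1, 3, "fair", "no"],
    [1, 3, "excellent", "no"],
    [2, 3, "fair", "yes"],
    [2, 2, "fair", "yes"],
    [2, 1, "fair", "yes"],
    [2, 1, "excellent", "no"],
    [2, 1, "excellent", "yes"],
    [1, 2, "fair", "no"],
    [1, 1, "fair", "yes"],
    [2, 2, "fair", "yes"],
    [1, 2, "excellent", "yes"],
    [2, 2, "excellent", "yes"],
    [2, 3, "fair", "yes"],
    [2, 2, "excellent", "no"],
    [2, 3, "fair", "yes"],
]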
The TDIDT (Top-Down Induction of Decision Trees) Algorithm
Basic Approach:
• Uses recursion!
• At each step, pick an attribute (“attribute selection”)
• Partition data by attribute values
... pairwise disjoint partitions
• Repeat until one of the following occurs (base cases):
1. The partition contains only instances with the same class label
... no “clashes”
2. No more attributes to partition on
... there may be clashes
3. No more instances to partition
... see options below
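A minimal sketch of this recursion in Python. The helpers select_attribute and partition_on are hypothetical (not defined in these notes), and instances are assumed to be lists with the class label in the last position:

def tdidt(instances, available_attributes):
    labels = set(row[-1] for row in instances)
    # Base case 1: every instance has the same class label (no clashes).
    if len(labels) == 1:
        return ("Leaf", labels.pop())
    # Base case 2: no attributes left to partition on (clashes possible);
    # one option is to label the leaf with the majority class.
    if not available_attributes:
        majority = max(labels, key=lambda label: sum(row[-1] == label for row in instances))
        return ("Leaf", majority)
    # Otherwise: select an attribute, partition on its values, and recurse.
    attribute = select_attribute(instances, available_attributes)       # hypothetical helper
    remaining = [a for a in available_attributes if a != attribute]
    subtrees = {}
    for value, partition in partition_on(instances, attribute).items():  # hypothetical helper
        # Base case 3 (an attribute value with no instances) would need
        # special handling here (options not shown).
        subtrees[value] = tdidt(partition, remaining)
    return ("Attribute", attribute, subtrees)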
Selecting Attributes
The resulting decision tree depends on the attribute selection approach
• we want high predictive accuracy with a small number of rules
• in practice, using “information gain” does well (popular approach)
Basic idea:
• select attribute with the largest “information gain”
• typically the attribute whose partitions are most “uneven” (skewed) with respect to the class labels
How it works: use “Shannon Entropy” as a measure of information content
• Q: how many bits are needed to represent numbers between 1 and 64?
– log2 64 = 6 bits
• What if instead we had messages made up of combinations of 64 words?
– we could encode each word as a 6-bit number
– thus a message with 10 words would cost 10 × 6 = 60 bits
• However, if we knew more about the distribution of words, we could (on
average) use fewer bits per message!
– e.g., “the” occurs more than any other word
– use shorter encodings for more frequent words (ideally, a word with probability p gets about −log2(p) bits)
⇒ Entropy gives a precise lower bound for expected number of bits needed to
encode a message (based on a probability distribution)
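To make this concrete, here is a small sketch comparing the fixed 6-bits-per-word cost to the entropy lower bound. The word probabilities below are made up purely for illustration:

from math import log

# Hypothetical distribution over a 64-word vocabulary: one very common word
# (e.g., "the") with probability 0.5, and 63 equally rare words.
probabilities = [0.5] + [0.5 / 63] * 63

fixed_bits = log(64, 2)                                   # 6 bits per word, always
lower_bound = -sum(p * log(p, 2) for p in probabilities)  # entropy of the distribution

print(fixed_bits)   # 6.0
print(lower_bound)  # roughly 4.0 bits per word, so ~40 bits (vs. 60) for a 10-word message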
The details:
• entropy E = 0 implies low content (e.g., all values are the same)
• highest entropy value when all values equally likely (most content)
• entropy formula assumes information encoded in bits ... (thus, log2 )
E = -\sum_{i=1}^{n} p_i \log_2(p_i)
– n for us is the number of class labels
– p_i is the proportion of instances with label i (i.e., P(C = i))
– note p_i is assumed to satisfy 0 < p_i ≤ 1
• what the formula is saying:
– since 0 < p_i ≤ 1, we know that −p_i log2(p_i) ≥ 0 (each term is nonnegative)
– e.g., for log2(0.5) = y, we have 2^y = 1/2, which means y = −1
– if p_i = 1, then −p_i log2(p_i) = 0
– E has its highest value when the labels are equally distributed
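As a small sketch, the formula can be written directly in Python (the function name entropy is just for illustration):

from math import log

def entropy(proportions):
    # proportions: the p_i values for the class labels, each with 0 < p_i <= 1
    return -sum(p * log(p, 2) for p in proportions)

print(entropy([0.5, 0.5]))    # 1.0 (two equally distributed labels: highest entropy)
print(entropy([0.25, 0.75]))  # roughly 0.811 (less even, so lower entropy)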
The function −p_i log2(p_i) for 0 < p_i ≤ 1
[Figure: plot of y = −p_i log2(p_i); x-axis: p_i from 0.0 to 1.0, y-axis: y from 0.0 to 0.6]
The function −(p_i log2(p_i) + (1 − p_i) log2(1 − p_i))
[Figure: plot of y = −(p_i log2(p_i) + (1 − p_i) log2(1 − p_i)); x-axis: p_i from 0.0 to 1.0, y-axis: y from 0.0 to 1.1]
• This mimics having just two class labels, with proportions p_i and 1 − p_i
• As shown, the closer the two label proportions are to each other (i.e., to 0.5), the higher the entropy
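A quick numeric sketch of the curve above (the printed values are approximate):

from math import log

def binary_entropy(p):
    # entropy for two class labels with proportions p and 1 - p, where 0 < p < 1
    return -(p * log(p, 2) + (1 - p) * log(1 - p, 2))

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(p, round(binary_entropy(p), 3))
# 0.1 0.469
# 0.3 0.881
# 0.5 1.0   <-- maximum: the two labels are equally distributed
# 0.7 0.881
# 0.9 0.469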
Pick the attribute that maximizes information gain
• Information Gain = E_start − E_new
– at each partition, pick the attribute with the highest information gain
– that is, split on the attribute with the greatest reduction in entropy
– which means finding the attribute with the smallest E_new
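In code, the selection step might look like the following sketch. The E_new values here are made-up placeholders; computing E_new for an attribute is covered below:

# Made-up E_start and E_new values for three candidate attributes (illustration only).
e_start = 0.954
e_new = {"standing": 0.857, "job status": 0.651, "credit rating": 0.921}

# Information gain = E_start - E_new; maximizing the gain is the same as
# minimizing E_new.
gains = {attribute: e_start - value for attribute, value in e_new.items()}
best = max(gains, key=gains.get)
print(best, gains[best])   # the attribute with the largest gain (smallest E_new)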
Example: Calculating E
• Assume we have: p_yes = 3/8 and p_no = 5/8
Q: What is E for this distribution? ... in Python:
from math import log
p_yes = 3/8.0
p_no = 5/8.0
E = -(p_yes*log(p_yes, 2) + p_no*log(p_no, 2))
# E = 0.954
• Now assume we have: p_yes = 2/8 and p_no = 6/8
Q: Will E for this distribution be larger or smaller?
... should be smaller!
p_yes = 2/8.0
p_no = 6/8.0
E = -(p_yes * log(p_yes, 2) + p_no * log(p_no, 2))
# E = 0.811
Calculating E_new for an Attribute
Assume we want to partition D on an attribute A
• where A has values a_1, a_2, ..., a_v
• creating partitions D_1, D_2, ..., D_v
We’d like each partition to be “pure” (contain instances of the same class label)
• They may not be, however (i.e., they may have “clashes”)
• The amount of additional info still needed for a “pure” classification is:
E_new = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times E_{D_j}
• where E_{D_j} is the entropy of partition D_j
• |D_j| / |D| is the proportion (estimated probability) of instances that fall in the j-th partition
• thus, E_new is a weighted average of the entropies of the attribute’s partitions
We pick the attribute with the smallest E_new value
• this means the smallest amount of additional information is needed to classify an instance (a worked sketch follows)
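Putting the pieces together, here is a sketch that computes E_new and the information gain for one attribute of the hypothetical iphone_table sketched earlier in these notes (the entropy helper is repeated for convenience; column 0 is standing):

from math import log
from collections import defaultdict

def entropy(proportions):
    # same helper as sketched earlier in these notes
    return -sum(p * log(p, 2) for p in proportions)

def class_proportions(rows, class_index=-1):
    labels = [row[class_index] for row in rows]
    return [labels.count(label) / float(len(labels)) for label in set(labels)]

def e_new(table, attribute_index):
    # Partition the table on the attribute's values (D_1, ..., D_v).
    partitions = defaultdict(list)
    for row in table:
        partitions[row[attribute_index]].append(row)
    # Weighted average of the partitions' entropies: sum of |D_j|/|D| * E_Dj.
    return sum((len(rows) / float(len(table))) * entropy(class_proportions(rows))
               for rows in partitions.values())

# E_start is the entropy of the class labels over the whole table.
e_start = entropy(class_proportions(iphone_table))   # roughly 0.918
gain = e_start - e_new(iphone_table, 0)              # attribute 0 = standing
print(gain)                                          # roughly 0.113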