COMP3740 CR32:
Knowledge Management
and Adaptive Systems
Supervised ML to learn Classifiers:
Decision Trees and Classification Rules
Eric Atwell, School of Computing,
University of Leeds
(including re-use of teaching resources from other sources,
esp. Knowledge Management by Stuart Roberts,
School of Computing, University of Leeds)
Reminder:
Objectives of data mining
• Data mining aims to find useful patterns in data.
• For this we need:
– Data mining techniques, algorithms, tools, eg WEKA
– A methodological framework to guide us, in collecting
data and applying the best algorithms, CRISP-DM
• TODAY’S objective: learn how to learn classifiers
• Decision Trees and Classification Rules
• Supervised Machine Learning: training set has the
“answer” (class) for each example (instance)
Reminder:
Concepts that can be “learnt”
The types of concepts we try to ‘learn’ include:
• Clusters or ‘Natural’ partitions;
– Eg we might cluster customers according to their shopping habits.
• Rules for classifying examples into pre-defined classes.
– Eg “Mature students studying information systems with high grade
for General Studies A level are likely to get a 1st class degree”
• General Associations
– Eg “People who buy nappies are in general likely also to buy beer”
• Numerical prediction
– Eg Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
(but are Gender, Programme really numbers???)
Output: decision tree

Outlook?
  sunny → Humidity?
    high → Play = 'no'
    normal → Play = 'yes'
  rainy → Windy?
    true → Play = 'no'
    false → Play = 'yes'
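A tree like this can be represented directly as a nested structure and applied to new instances. The sketch below is our own illustration (not WEKA's internal representation); the dictionary layout and function name are chosen for clarity.

```python
# A minimal sketch: the weather tree above as nested dicts.
# Internal nodes name the attribute to test; leaves are class values.
tree = {"Outlook": {
    "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
    "rainy": {"Windy": {"true": "no", "false": "yes"}},
}}

def classify(tree, instance):
    """Walk from the root, following the branch matching each attribute value."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree  # a leaf: the predicted class

print(classify(tree, {"Outlook": "sunny", "Humidity": "normal"}))  # -> 'yes'
```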
Decision Tree Analysis
• Example instance set:

Shares files | Uses scanner | Infected before | Risk
Yes          | Yes          | No              | High
Yes          | No           | No              | High
No           | No           | Yes             | Medium
Yes          | Yes          | Yes             | Low
Yes          | Yes          | No              | High
No           | Yes          | No              | Low
Yes          | No           | Yes             | High
Can we predict, from the first 3 columns, the risk of
getting a virus?
For convenience later:
F = ‘shares Files’, S = ‘uses Scanner’,
I = ‘Infected before’
Decision tree building method
• Forms a decision tree
– tries for a small tree covering all or most of the training set
– internal nodes represent a test on an attribute value
– branches represent outcome of the test
• Decides which attribute to test at each node
– this is based on a measure of ‘entropy’
• Must avoid ‘over-fitting’
– if the tree is complex enough it might describe the training
set exactly, but be no good for prediction
• May leave some ‘exceptions’
Building a decision tree (DT)
The algorithm is recursive, at any step:
T = set of (remaining) training instances,
{C1, …, Ck} = set of classes
• If all instances in T belong to a single class Ci, then
DT is a leaf node identifying class Ci. (done!)
Building a decision tree (DT) …continued
• If T contains instances belonging to mixed
classes, then choose a test based on a single
attribute that will partition T into subsets
{T1, …, Tn} according to n outcomes of the test.
The DT for T comprises a root node identifying
the test and one branch for each outcome of the
test.
• The branches are formed by applying the rules
above recursively on each of the subsets {T1, …,
Tn} .
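To make the recursion concrete, here is a minimal Python sketch of this method (our own illustration, not WEKA's J48/C4.5 implementation). It picks the attribute whose split leaves the least weighted class entropy, i.e. the largest information gain (see the entropy slides later), and recurses on each subset. Run on the virus-risk table above, it reproduces the tree built step by step in the following slides.

```python
from collections import Counter
from math import log2

def entropy(instances):
    """H(T) = -sum p_i * log2(p_i) over the class distribution of T."""
    counts = Counter(row[-1] for row in instances)
    total = len(instances)
    return -sum(n / total * log2(n / total) for n in counts.values())

def build_tree(instances, attributes):
    classes = [row[-1] for row in instances]
    if len(set(classes)) == 1:             # single class: leaf node (done!)
        return classes[0]
    if not attributes:                     # no tests left: majority-class leaf
        return Counter(classes).most_common(1)[0][0]

    def remainder(attr):                   # weighted entropy after splitting on attr
        subsets = [[r for r in instances if r[attr] == v]
                   for v in {row[attr] for row in instances}]
        return sum(len(s) / len(instances) * entropy(s) for s in subsets)

    best = min(attributes, key=remainder)  # least remainder = largest information gain
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([r for r in instances if r[best] == v], rest)
                   for v in {row[best] for row in instances}}}

# Attribute indices: 0 = F (shares Files), 1 = S (uses Scanner), 2 = I (Infected before)
T = [("Yes", "Yes", "No", "High"), ("Yes", "No", "No", "High"),
     ("No", "No", "Yes", "Medium"), ("Yes", "Yes", "Yes", "Low"),
     ("Yes", "Yes", "No", "High"), ("No", "Yes", "No", "Low"),
     ("Yes", "No", "Yes", "High")]
print(build_tree(T, [0, 1, 2]))
# Root tests F (0); its yes-branch tests I (2), then S (1), matching the slides.
```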
Tree Building example
T =
F   | S   | I   | Risk
Yes | Yes | No  | High
Yes | No  | No  | High
No  | No  | Yes | Medium
Yes | Yes | Yes | Low
Yes | Yes | No  | High
No  | Yes | No  | Low
Yes | No  | Yes | High

Classes = {High, Medium, Low}
Choose a test based on F, number of outcomes, n = 2 (Yes or No)

F = yes gives T1 =
F   | S   | I   | Risk
Yes | Yes | No  | High
Yes | No  | No  | High
Yes | Yes | Yes | Low
Yes | Yes | No  | High
Yes | No  | Yes | High

F = no gives T2 =
F   | S   | I   | Risk
No  | No  | Yes | Medium
No  | Yes | No  | Low
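In code, this first split is just filtering on F; a tiny sketch using the tuple representation from earlier:

```python
# Splitting T on F (index 0), as in the step above.
T = [("Yes", "Yes", "No", "High"), ("Yes", "No", "No", "High"),
     ("No", "No", "Yes", "Medium"), ("Yes", "Yes", "Yes", "Low"),
     ("Yes", "Yes", "No", "High"), ("No", "Yes", "No", "Low"),
     ("Yes", "No", "Yes", "High")]
T1 = [row for row in T if row[0] == "Yes"]   # 5 instances: 4 High, 1 Low
T2 = [row for row in T if row[0] == "No"]    # 2 instances: Medium, Low
```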
Tree Building example
T1 =
F   | S   | I   | Risk
Yes | Yes | No  | High
Yes | No  | No  | High
Yes | Yes | Yes | Low
Yes | Yes | No  | High
Yes | No  | Yes | High

Classes = {High, Medium, Low}
Choose a test based on I, number of outcomes, n = 2 (Yes or No)

I = yes gives T3 =
F   | S   | I   | Risk
Yes | Yes | Yes | Low
Yes | No  | Yes | High

I = no gives T4 =
F   | S   | I   | Risk
Yes | Yes | No  | High
Yes | No  | No  | High
Yes | Yes | No  | High
Tree Building example
T1 = as above (the F = yes subset)
Classes = {High, Medium, Low}
Choose a test based on I, number of outcomes, n = 2 (Yes or No)

I = yes gives T3 =
F   | S   | I   | Risk
Yes | Yes | Yes | Low
Yes | No  | Yes | High

I = no: all three instances in T4 have the same class, so that branch becomes a leaf:
Risk = 'High'
Tree Building example
T3 =
F   | S   | I   | Risk
Yes | Yes | Yes | Low
Yes | No  | Yes | High

Classes = {High, Medium, Low}
Choose a test based on S, number of outcomes, n = 2 (Yes or No)

S = yes gives the single instance (Yes, Yes, Yes, Low)
S = no gives the single instance (Yes, No, Yes, High)
Tree Building example
T3 = as above
Classes = {High, Medium, Low}
Choose a test based on S, number of outcomes, n = 2 (Yes or No)

Both outcomes now give single-class leaves:
S = yes → Risk = 'Low'
S = no → Risk = 'High'

The partial tree so far:
F?
  yes → I?
    yes → S?
      yes → Risk = 'Low'
      no → Risk = 'High'
    no → Risk = 'High'
Tree Building example
T2 =
F   | S   | I   | Risk
No  | No  | Yes | Medium
No  | Yes | No  | Low

Classes = {High, Medium, Low}
Choose a test based on S, number of outcomes, n = 2 (Yes or No)

S = yes gives the single instance (No, Yes, No, Low)
S = no gives the single instance (No, No, Yes, Medium)
Tree Building example
T2 = as above (the F = no subset)
Classes = {High, Medium, Low}
Choose a test based on S, number of outcomes, n = 2 (Yes or No)

Both outcomes give single-class leaves:
S = yes → Risk = 'Low'
S = no → Risk = 'Medium'
Example Decision Tree

Shares files?
  yes → Infected before?
    yes → Uses scanner?
      yes → low
      no → high
    no → high
  no → Uses scanner?
    yes → low
    no → medium
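Reading the finished tree as nested conditionals makes the classification logic explicit. This hand transcription is our own (and previews the conversion of trees to rules later in the lecture):

```python
def risk(shares_files, infected_before, uses_scanner):
    """The example decision tree above, transcribed as nested conditionals."""
    if shares_files == "yes":
        if infected_before == "yes":
            return "low" if uses_scanner == "yes" else "high"
        return "high"                                  # infected before = no
    return "low" if uses_scanner == "yes" else "medium"  # shares files = no

print(risk("yes", "no", "yes"))  # -> 'high'
```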
Which attribute to test?
• The ROOT could be S or I instead of F – leading to a
different Decision Tree
• Best DT is the “smallest”, most concise model
• The search space in general is too large to find the
smallest tree by exhaustive searching (try them all).
• Instead we look for the attribute which splits the
training set into the most homogeneous sets.
• The measure used for ‘homogeneity’ is based on
entropy.
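For reference, the usual definitions behind this measure (the standard ID3/C4.5 formulas; the notation here is ours):

```latex
% Entropy of a set T whose instances fall into classes C_1, ..., C_k,
% where p_i is the proportion of T belonging to class C_i:
H(T) = -\sum_{i=1}^{k} p_i \log_2 p_i

% Information gain of a test X splitting T into subsets T_1, ..., T_n;
% the attribute chosen is the one maximising this:
\mathrm{Gain}(T, X) = H(T) - \sum_{j=1}^{n} \frac{|T_j|}{|T|}\, H(T_j)
```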
Tree Building example (modified)
T =
F   | S   | I   | High Risk?
Yes | Yes | No  | Yes
Yes | No  | No  | Yes
No  | No  | Yes | No
Yes | Yes | Yes | No
Yes | Yes | No  | Yes
No  | Yes | No  | No
Yes | No  | Yes | Yes

Classes = {Yes, No}
Choose a test based on F, number of outcomes, n = 2 (Yes or No)

F = yes gives
F   | S   | I   | High Risk?
Yes | Yes | No  | Yes
Yes | No  | No  | Yes
Yes | Yes | Yes | No
Yes | Yes | No  | Yes
Yes | No  | Yes | Yes

F = no gives
F   | S   | I   | High Risk?
No  | No  | Yes | No
No  | Yes | No  | No
Tree Building example (modified)
T = the same 7 instances as above
Classes = {Yes, No}
Choose a test based on F, number of outcomes, n = 2 (Yes or No)

Labelling each branch with its majority class gives leaves with (instances, errors):
F = yes → High Risk = 'yes' (5, 1)
F = no → High Risk = 'no' (2, 0)
Tree Building example (modified)
T = the same 7 instances as above
Classes = {Yes, No}
Choose a test based on S, number of outcomes, n = 2 (Yes or No)

S = yes gives
F   | S   | I   | High Risk?
Yes | Yes | No  | Yes
Yes | Yes | Yes | No
Yes | Yes | No  | Yes
No  | Yes | No  | No

S = no gives
F   | S   | I   | High Risk?
Yes | No  | No  | Yes
No  | No  | Yes | No
Yes | No  | Yes | Yes
Tree Building example (modified)
T = the same 7 instances as above
Classes = {Yes, No}
Choose a test based on S, number of outcomes, n = 2 (Yes or No)

Labelling each branch with a majority class gives (instances, errors):
S = yes → High Risk = 'no' (4, 2)
S = no → High Risk = 'yes' (3, 1)
Tree Building example (modified)
T = the same 7 instances as above
Classes = {Yes, No}
Choose a test based on I, number of outcomes, n = 2 (Yes or No)

I = yes gives
F   | S   | I   | High Risk?
No  | No  | Yes | No
Yes | Yes | Yes | No
Yes | No  | Yes | Yes

I = no gives
F   | S   | I   | High Risk?
Yes | Yes | No  | Yes
Yes | No  | No  | Yes
Yes | Yes | No  | Yes
No  | Yes | No  | No
Tree Building example (modified)
T = the same 7 instances as above
Classes = {Yes, No}
Choose a test based on I, number of outcomes, n = 2 (Yes or No)

Labelling each branch with its majority class gives (instances, errors):
I = yes → High Risk = 'no' (3, 1)
I = no → High Risk = 'yes' (4, 1)
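Putting the entropy measure to work on these three candidate splits, a small sketch (our own, using the High Risk? column above) confirms that F gives the largest information gain:

```python
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    ps = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * log2(p) for p in ps)

# The 7 instances as (F, S, I, HighRisk) from the table above.
T = [("Yes", "Yes", "No", "Yes"), ("Yes", "No", "No", "Yes"),
     ("No", "No", "Yes", "No"),   ("Yes", "Yes", "Yes", "No"),
     ("Yes", "Yes", "No", "Yes"), ("No", "Yes", "No", "No"),
     ("Yes", "No", "Yes", "Yes")]
target = [row[3] for row in T]
for name, col in [("F", 0), ("S", 1), ("I", 2)]:
    split = [[r[3] for r in T if r[col] == v] for v in ("Yes", "No")]
    gain = H(target) - sum(len(s) / len(T) * H(s) for s in split)
    print(name, round(gain, 3))   # F ≈ 0.47, S ≈ 0.02, I ≈ 0.13: F is chosen first
```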
Decision tree building algorithm
• For each decision point:
– If the remaining examples are all +ve or all -ve, stop.
– Else if there are some +ve and some -ve examples left, and some
attributes left, pick the remaining attribute with the largest
information gain.
– Else if there are no examples left, no such example has been
observed; return a default class.
– Else if there are no attributes left, examples with the same
description have different classifications: noise, insufficient
attributes, or a nondeterministic domain.
Evaluation of decision trees
• At the leaf nodes two numbers are given:
– N: the coverage for that node: how many instances it covers
– E: the errors for that node: how many wrongly classified instances
• The whole tree can be evaluated in terms of its size
(number of nodes) and overall error-rate expressed in
terms of the number and percentage of cases wrongly
classified.
• We seek small trees that have low error rates.
Evaluation of decision trees
• The error rate for the whole tree can also be
displayed in terms of a confusion matrix:
Classified as:       (A)   (B)   (C)
Class (A) = high      35     2     1
Class (B) = medium     4    41     5
Class (C) = low        2     5    68
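The overall error rate falls straight out of such a matrix; a small sketch using the numbers above (rows are the actual classes, columns the predictions):

```python
# Confusion matrix from the slide: rows = actual class, columns = predicted.
matrix = [[35, 2, 1],    # actual (A) = high
          [4, 41, 5],    # actual (B) = medium
          [2, 5, 68]]    # actual (C) = low
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(3))       # the diagonal
print(f"{total - correct} errors out of {total} "
      f"= {(total - correct) / total:.1%} error rate")  # 19 of 163 = 11.7%
```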
Evaluation of decision trees
• The error rates mentioned on previous slides are
normally computed using
a. The training set of instances.
b. A test set of instances – some different examples!
• If the decision tree algorithm has ‘over-fitted’ the
data, then the error rate based on the training set
will be far less than that based on the test set.
Evaluation of decision trees
• 10-fold cross-validation can be used when the training set is limited in size:
– Divide the available data randomly into 10 subsets.
– Build a tree from 9 of the subsets and test using the 10th.
– Repeat the experiment 9 more times, using a different test subset each time.
– Overall error rate is the average over the 10 experiments.
• 10-fold cross-validation will lead to up to 10 different decision
trees being built. The method for selecting or constructing the
best tree is not clear.
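A schematic of the procedure in plain Python; build_model and error_rate are stand-ins for whatever learner and scoring you use (e.g. the build_tree sketch earlier, or WEKA in practice):

```python
import random

def cross_validate(data, build_model, error_rate, k=10):
    """k-fold cross-validation: every instance is tested exactly once."""
    data = data[:]
    random.shuffle(data)                       # divide randomly into k subsets
    folds = [data[i::k] for i in range(k)]
    rates = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = build_model(train)             # build a tree from the other 9 folds
        rates.append(error_rate(model, test))  # fraction misclassified on the fold
    return sum(rates) / k                      # average over the k experiments
```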
From decision trees to rules
• Decision trees may not be easy to interpret:
– tests associated with lower nodes have to be read in the
context of tests further up the tree
– ‘sub-concepts’ may sometimes be split up and distributed
to different parts of the tree (see next slide)
– Computer Scientists may prefer “if … then …” rules!
DT for “F = G = 1 or J = K = 1”
J = K = 1 is split across two subtrees:

F = 0:
  J = 0 → no
  J = 1:
    K = 0 → no
    K = 1 → yes
F = 1:
  G = 1 → yes
  G = 0:
    J = 0 → no
    J = 1:
      K = 0 → no
      K = 1 → yes
Converting DT to rules
• Step 1: Every path from root to leaf represents a rule:

If F = 0 and J = 0 then class no
If F = 0 and J = 1 and K = 0 then class no
If F = 0 and J = 1 and K = 1 then class yes
….
If F = 1 and G = 0 and J = 1 and K = 1 then class yes
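Step 1 is a straightforward tree walk. A sketch over the nested-dict tree representation used earlier (the encoding of this particular tree is our own):

```python
def paths_to_rules(tree, conditions=()):
    """Emit one 'If ... then class ...' rule per root-to-leaf path."""
    if not isinstance(tree, dict):             # a leaf: complete the rule
        return [f"If {' and '.join(conditions)} then class {tree}"]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules += paths_to_rules(subtree, conditions + (f"{attribute} = {value}",))
    return rules

# The F/G/J/K tree from the previous slide, as nested dicts:
tree = {"F": {"0": {"J": {"0": "no", "1": {"K": {"0": "no", "1": "yes"}}}},
              "1": {"G": {"1": "yes",
                          "0": {"J": {"0": "no",
                                      "1": {"K": {"0": "no", "1": "yes"}}}}}}}}
for rule in paths_to_rules(tree):
    print(rule)   # e.g. "If F = 0 and J = 0 then class no"
```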
Generalising rules

The two specific 'yes' rules

If F = 0 and J = 1 and K = 1 then class yes
If F = 1 and G = 0 and J = 1 and K = 1 then class yes

generalise to the single rule

If J = 1 and K = 1 then class yes

and the F = 1, G = 1 path gives

If G = 1 then class yes
Tidying up rule sets
• Generalisation leads to 2 problems:
• Rules are no longer mutually exclusive
– Order the rules and use the first matching rule as the
operative rule.
– Ordering is based on how many false-positive errors each
rule makes.
• Rule set no longer exhaustive
– Choose a default value for the class when no rule applies
– Default class is that which contains the most training
cases not covered by any rule.
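One way to apply such a tidied rule set, sketched under the scheme described above (the rule encoding and the particular rules and default are illustrative):

```python
# Each rule: (list of (attribute, required_value) conditions, class).
# Rules are tried in order; the first match wins; otherwise the default applies.
rules = [([("G", "1")], "yes"),
         ([("J", "1"), ("K", "1")], "yes")]
DEFAULT = "no"   # the class covering most training cases not matched by any rule

def apply_rules(instance, rules, default=DEFAULT):
    for conditions, cls in rules:
        if all(instance.get(attr) == val for attr, val in conditions):
            return cls
    return default

print(apply_rules({"F": "0", "G": "0", "J": "1", "K": "1"}, rules))  # -> 'yes'
```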
Decision Tree - Revision
The decision-tree building algorithm discovers rules for
classifying instances.
At each step, it needs to decide which attribute to
test at that point in the tree; a measure of
‘information gain’ can be used.
The output is a decision tree based on the ‘training’
instances, evaluated with separate “test” instances.
Leaf nodes which have a small coverage may be
pruned if the error rate is small for the pruned tree.
Pruning example (from W & F)
Health plan contribution?
  none → 4 bad, 2 good
  half → 1 bad, 1 good
  full → 4 bad, 2 good

We replace the subtree with a single leaf:
  Bad (14, 5), i.e. 14 instances covered, 5 errors

The subtree's own branches make 2 + 1 + 2 = 5 errors on the same 14
instances, so the single leaf loses no accuracy on the training data.
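The arithmetic behind this pruning decision, sketched with the counts from the figure (the criterion used here, prune when the leaf makes no more training errors than the subtree, is a simplification; C4.5 actually uses a pessimistic error estimate):

```python
# (bad, good) counts at each branch of the 'Health plan contribution' subtree.
branches = [(4, 2), (1, 1), (4, 2)]             # none, half, full
subtree_errors = sum(min(bad, good) for bad, good in branches)  # 2 + 1 + 2 = 5
total_bad = sum(bad for bad, _ in branches)     # 9
total_good = sum(good for _, good in branches)  # 5
leaf_errors = min(total_bad, total_good)        # majority leaf 'Bad': 5 errors
if leaf_errors <= subtree_errors:
    print("prune: leaf 'Bad' (14, 5) does no worse than the subtree")
```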
Decision trees v classification rules
• Decision trees can be used for prediction or
interpretation.
– Prediction: compare an unclassified instance against the
tree and predict what class it is in (with error estimate)
– Interpretation: examine tree and try to understand why
instances end up in the class they are in.
• Rule sets are often better for interpretation.
– ‘Small’, accurate rules can be examined, even if overall
accuracy of the rule set is poor.
Self Check
• You should be able to:
– Describe how the decision-trees are built from a set of
instances.
– Build a decision tree based on a given attribute
– Explain what the ‘training’ and ‘test’ sets are for.
– Explain what “Supervised” means, and why classification
is an example of supervised ML