
IS463 – Introduction to Data Mining
Semester 2, Academic year 2011-2012
Tutorial # 4
Question 1:
What are the two main purposes of a classification model?
Sol:
Descriptive Modeling: A classification model can serve as an explanatory tool
to distinguish between objects of different classes. For example, it would be
useful—for both biologists and others—to have a descriptive model that
summarizes the data and explains what features define a vertebrate as a
mammal, reptile, bird, fish, or amphibian.
Predictive Modeling: A classification model can also be used to predict the
class label of unknown records. A classification model can be treated as a black
box that automatically assigns a class label when presented with the attribute set
of an unknown record.
Question 2:
What kind of attributes are the classification techniques most suited for?
Why?
Sol:
Classification techniques are most suited for predicting or describing data sets
with binary or nominal categories. They are less effective for ordinal categories
(e.g., to classify a person as a member of high-, medium-, or low- income group)
because they do not consider the implicit order among the categories. Other
forms of relationships, such as the subclass–superclass relationships among
categories (e.g., humans and apes are primates, which in turn are a subclass of
mammals), are also ignored.
Question 3:
Consider the training examples shown in the following table for a binary
classification problem.
Customer ID | Gender | Car Type | Shirt Size  | Class
1           | M      | Family   | Small       | C0
2           | M      | Sports   | Medium      | C0
3           | M      | Sports   | Medium      | C0
4           | M      | Sports   | Large       | C0
5           | M      | Sports   | Extra Large | C0
6           | M      | Sports   | Extra Large | C0
7           | F      | Sports   | Small       | C0
8           | F      | Sports   | Small       | C0
9           | F      | Sports   | Medium      | C0
10          | F      | Luxury   | Large       | C0
11          | M      | Family   | Large       | C1
12          | M      | Family   | Extra Large | C1
13          | M      | Family   | Medium      | C1
14          | M      | Luxury   | Extra Large | C1
15          | F      | Luxury   | Small       | C1
16          | F      | Luxury   | Small       | C1
17          | F      | Luxury   | Medium      | C1
18          | F      | Luxury   | Medium      | C1
19          | F      | Luxury   | Medium      | C1
20          | F      | Luxury   | Large       | C1
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using multiway split.
(e) Compute the Gini index for the Shirt Size attribute using multiway split.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
(g) Explain why Customer ID should not be used as the attribute test
condition even though it has the lowest Gini.
Sol:
a) Gini(t) = 1 - Σ_j [p(j|t)]^2
   = 1 - (10/20)^2 - (10/20)^2 = 1 - 1/4 - 1/4 = 0.5

b) Customer ID
Each Customer ID value contains exactly one record, so every child node is pure:
   Gini(t) = 1 - (0/1)^2 - (1/1)^2 = 1 - 0 - 1 = 0
   Weighted average Gini = 0
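As a check on parts (a) and (b), the same numbers can be reproduced with a short Python sketch (the function name gini is illustrative; the class counts come from the table above):

```python
# Minimal sketch: Gini index of a node from its class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# (a) Overall collection: 10 records of C0 and 10 of C1.
print(gini([10, 10]))          # 0.5

# (b) Customer ID: each of the 20 values holds exactly one record,
# so every child node is pure and the weighted Gini is 0.
children = [[1, 0]] * 10 + [[0, 1]] * 10
total = 20
print(sum(sum(c) / total * gini(c) for c in children))   # 0.0
```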

c) Gender

Male (6 C0, 4 C1):
Gini(t) = 1 - (6/10)^2 - (4/10)^2 = 1 - 0.36 - 0.16 = 0.48

Female (4 C0, 6 C1):
Gini(t) = 1 - (4/10)^2 - (6/10)^2 = 1 - 0.16 - 0.36 = 0.48

Weighted average:
Gini_split = Σ_i (n_i/n) Gini(i) = (10/20)(0.48) + (10/20)(0.48) = 0.48
d) Car Type

Family (1 C0, 3 C1):
Gini(t) = 1 - (1/4)^2 - (3/4)^2 = 1 - 0.0625 - 0.5625 = 0.375

Sports (8 C0, 0 C1):
Gini(t) = 1 - (8/8)^2 - (0/8)^2 = 1 - 1 - 0 = 0

Luxury (1 C0, 7 C1):
Gini(t) = 1 - (1/8)^2 - (7/8)^2 = 1 - 0.0156 - 0.7656 = 0.21875

Weighted average:
Gini_split = (4/20)(0.375) + (8/20)(0) + (8/20)(0.21875) = 0.1625 ≈ 0.163
e) Shirt Size

Small (3 C0, 2 C1):
Gini(t) = 1 - (3/5)^2 - (2/5)^2 = 1 - 0.36 - 0.16 = 0.48

Medium (3 C0, 4 C1):
Gini(t) = 1 - (3/7)^2 - (4/7)^2 = 1 - 0.1837 - 0.3265 = 0.4898

Large (2 C0, 2 C1):
Gini(t) = 1 - (2/4)^2 - (2/4)^2 = 1 - 0.25 - 0.25 = 0.5

Extra Large (2 C0, 2 C1):
Gini(t) = 1 - (2/4)^2 - (2/4)^2 = 1 - 0.25 - 0.25 = 0.5

Weighted average:
Gini_split = (5/20)(0.48) + (7/20)(0.4898) + (4/20)(0.5) + (4/20)(0.5) = 0.4914
f) Using the Gini index, Car Type is the best of the three attributes because its
split gives the lowest weighted Gini (0.1625, versus 0.48 for Gender and 0.4914
for Shirt Size); in particular, its Sports child node is completely pure (Gini = 0).
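As a quick cross-check of parts (c)–(f), the weighted Gini of each candidate split can be recomputed from the per-node class counts (a minimal sketch; the helper names gini and split_gini are illustrative):

```python
# Sketch: weighted Gini of a multiway split, given (C0 count, C1 count) per child node.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(children, total=20):
    return sum(sum(c) / total * gini(c) for c in children)

print(split_gini([(6, 4), (4, 6)]))                   # Gender: 0.48
print(split_gini([(1, 3), (8, 0), (1, 7)]))           # Car Type: 0.1625
print(split_gini([(3, 2), (3, 4), (2, 2), (2, 2)]))   # Shirt Size: ~0.4914
```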

g) Customer ID should not be used as the attribute test condition even though it
has the lowest Gini, because its value is unique for every record. Splitting on it
produces pure but single-record nodes, so it fits the training data perfectly yet
cannot generalize to classify new records.
Question 4:
Consider the training examples shown in the following table for a binary
classification problem.
Instance | a1 | a2 | a3  | Target Class
1        | T  | T  | 1.0 | +
2        | T  | T  | 6.0 | +
3        | T  | F  | 5.0 | -
4        | F  | F  | 4.0 | +
5        | F  | T  | 7.0 | -
6        | F  | T  | 3.0 | -
7        | F  | F  | 8.0 | -
8        | T  | F  | 7.0 | +
9        | F  | T  | 5.0 | -
(a) What is the entropy of this collection of training examples with respect
to the positive class?
(b) What are the information gains of a1 and a2 relative to these training
examples?
(c) For a3, which is a continuous attribute, compute the information gain for
every possible split.
(d) What is the best split (among a1, a2, and a3) according to the
information gain?
(e) What is the best split (between a1 and a2) according to the
classification error rate?
(f) What is the best split (between a1 and a2) according to the Gini index?
Sol:
c 1
4
4
5
5 
a) Entropy(t)   p(i | t)log 2 p(i | t)    log 2   log 2  0.99107
9
9 9
9 
i0

b)
For a1 = T (3 positive, 1 negative):
Entropy = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113

For a1 = F (1 positive, 4 negative):
Entropy = -(1/5) log2(1/5) - (4/5) log2(4/5) = 0.7219

For a2 = T (2 positive, 3 negative):
Entropy = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.9710

For a2 = F (2 positive, 2 negative):
Entropy = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1

Information gain:
Gain(a1) = Entropy(parent) - Σ_j (N(j)/N) Entropy(j)
         = 0.9911 - (4/9)(0.8113) - (5/9)(0.7219) = 0.2294
Gain(a2) = 0.9911 - (5/9)(0.9710) - (4/9)(1) = 0.0072
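These entropies and gains can be verified with a small Python sketch (illustrative helper names; the child class counts are the ones listed above):

```python
import math

# Minimal sketch: entropy of a node and information gain of a split,
# computed from (positive, negative) class counts.
def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(children, parent=(4, 5)):
    total = sum(sum(c) for c in children)
    weighted = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

print(entropy([4, 5]))               # (a) ~0.9911
print(info_gain([(3, 1), (1, 4)]))   # (b) gain of a1: ~0.2294
print(info_gain([(2, 3), (2, 2)]))   # (b) gain of a2: ~0.0072
```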


c) Sorting the records by a3 and taking the midpoints between adjacent distinct
values (plus one split point below the smallest and one above the largest value)
gives the candidate splits below. The class counts (+, -) on each side and the
resulting information gain are:

Split point | a3 ≤ split (+, -) | a3 > split (+, -) | Information gain
0.5         | 0, 0              | 4, 5              | 0
2.0         | 1, 0              | 3, 5              | 0.1427
3.5         | 1, 1              | 3, 4              | 0.0026
4.5         | 2, 1              | 2, 4              | 0.0728
5.5         | 2, 3              | 2, 2              | 0.0072
6.5         | 3, 3              | 1, 2              | 0.0183
7.5         | 4, 4              | 0, 1              | 0.1022
8.5         | 4, 5              | 0, 0              | 0

The calculations for each candidate split are shown below.
Split Number 1
4
4 5
5 
Entropy(t)   p(i | t)log 2 p(i | t)    log 2   log 2  0.99107
9
9 9
9 
i0
Information Gain : = 0
c 1


Split Number 2 (a3 ≤ 2.0):
≤ node (1+, 0-): Entropy = -(1/1) log2(1/1) = 0
> node (3+, 5-): Entropy = -(3/8) log2(3/8) - (5/8) log2(5/8) = 0.9544
Weighted average = (1/9)(0) + (8/9)(0.9544) = 0.8484
Information gain = 0.9911 - 0.8484 = 0.1427

Split Number 3 (a3 ≤ 3.5):
≤ node (1+, 1-): Entropy = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
> node (3+, 4-): Entropy = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.9852
Weighted average = (2/9)(1) + (7/9)(0.9852) = 0.9885
Information gain = 0.9911 - 0.9885 = 0.0026
Split Number 4 (a3 ≤ 4.5):
≤ node (2+, 1-): Entropy = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
> node (2+, 4-): Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.9183
Weighted average = (3/9)(0.9183) + (6/9)(0.9183) = 0.9183
Information gain = 0.9911 - 0.9183 = 0.0728
Split Number 5 (a3 ≤ 5.5):
≤ node (2+, 3-): Entropy = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.9710
> node (2+, 2-): Entropy = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
Weighted average = (5/9)(0.9710) + (4/9)(1) = 0.9839
Information gain = 0.9911 - 0.9839 = 0.0072
Split Number 6 (a3 ≤ 6.5):
≤ node (3+, 3-): Entropy = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
> node (1+, 2-): Entropy = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183
Weighted average = (6/9)(1) + (3/9)(0.9183) = 0.9728
Information gain = 0.9911 - 0.9728 = 0.0183

Split Number 7 (a3 ≤ 7.5):
≤ node (4+, 4-): Entropy = -(4/8) log2(4/8) - (4/8) log2(4/8) = 1
> node (0+, 1-): Entropy = -(1/1) log2(1/1) = 0
Weighted average = (8/9)(1) + (1/9)(0) = 0.8889
Information gain = 0.9911 - 0.8889 = 0.1022
Split Number 8 (a3 ≤ 8.5):
≤ node (4+, 5-): Entropy = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
> node: empty, so it contributes 0
Weighted average = (9/9)(0.9911) + (0/9)(0) = 0.9911
Information gain = 0.9911 - 0.9911 = 0
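The whole candidate-split table for a3 can be regenerated with a short sketch (a minimal illustration; the a3 values and class labels are taken from the table in the question, and the helper names are assumptions):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0) if n else 0.0

def class_counts(labels):
    return [labels.count('+'), labels.count('-')]

# (a3 value, class label) pairs for the nine training instances.
data = [(1.0, '+'), (6.0, '+'), (5.0, '-'), (4.0, '+'), (7.0, '-'),
        (3.0, '-'), (8.0, '-'), (7.0, '+'), (5.0, '-')]
parent = entropy(class_counts([c for _, c in data]))   # ~0.9911

values = sorted({v for v, _ in data})
candidates = [values[0] - 0.5] + [(a + b) / 2 for a, b in zip(values, values[1:])] + [values[-1] + 0.5]
for s in candidates:
    left = [c for v, c in data if v <= s]
    right = [c for v, c in data if v > s]
    weighted = sum(len(part) / len(data) * entropy(class_counts(part)) for part in (left, right))
    print(f"split at {s}: gain = {parent - weighted:.4f}")
```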
d) According to the information gain, the best split is on a1, because it has the
highest gain (0.2294, compared with 0.0072 for a2 and 0.1427 for the best split
of a3, at 2.0).
e) Using Error(t) = 1 - max_i p(i|t) at each child node and weighting by node size:

a1: Error = (4/9)(1 - 3/4) + (5/9)(1 - 4/5) = (4/9)(1/4) + (5/9)(1/5) = 2/9 ≈ 0.222
a2: Error = (5/9)(1 - 3/5) + (4/9)(1 - 2/4) = (5/9)(2/5) + (4/9)(2/4) = 4/9 ≈ 0.444

According to the classification error rate, the best split is a1 because it has
the lower classification error.
f) Using the same weighting with the Gini index:

a1: Gini = (4/9)[1 - (3/4)^2 - (1/4)^2] + (5/9)[1 - (1/5)^2 - (4/5)^2]
         = (4/9)(0.375) + (5/9)(0.32) = 0.3444
a2: Gini = (5/9)[1 - (2/5)^2 - (3/5)^2] + (4/9)[1 - (2/4)^2 - (2/4)^2]
         = (5/9)(0.48) + (4/9)(0.5) = 0.4889

According to the Gini index, the best split is a1 because it has the lower
value.
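Parts (e) and (f) can be checked the same way from the child class counts (a minimal sketch with illustrative helper names):

```python
# Sketch: weighted impurity of a split for two impurity measures,
# given (positive, negative) counts per child node.
def weighted_impurity(children, impurity):
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * impurity(c) for c in children)

def error(counts):
    return 1.0 - max(counts) / sum(counts)

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

a1 = [(3, 1), (1, 4)]   # a1 = T node, a1 = F node
a2 = [(2, 3), (2, 2)]   # a2 = T node, a2 = F node

print(weighted_impurity(a1, error), weighted_impurity(a2, error))  # ~0.222, ~0.444
print(weighted_impurity(a1, gini), weighted_impurity(a2, gini))    # ~0.344, ~0.489
```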
Question 5:
Consider the following set of training examples.
X | Y | Z | Number of class C1 examples | Number of class C2 examples
0 | 0 | 0 | 5                           | 40
0 | 0 | 1 | 0                           | 15
0 | 1 | 0 | 10                          | 5
0 | 1 | 1 | 45                          | 0
1 | 0 | 0 | 10                          | 5
1 | 0 | 1 | 25                          | 0
1 | 1 | 0 | 5                           | 20
1 | 1 | 1 | 0                           | 15
(a) Compute a two-level decision tree using the greedy approach
described in this chapter. Use the classification error rate as the criterion
for splitting. What is the overall error rate of the induced tree?
(b) Repeat part (a) using X as the first splitting attribute and then choose
the best remaining attribute for splitting at each of the two successor
nodes. What is the error rate of the induced tree?
(c) Compare the results of parts (a) and (b). Comment on the suitability of
the greedy heuristic used for splitting attribute selection.
Sol:
a)
Summing the table over the other two attributes gives the class distribution at
each candidate first-level split:

Split on X: X = 0 node (60 C1, 60 C2), X = 1 node (40 C1, 40 C2)
Error(X) = (120/200)(1 - 60/120) + (80/200)(1 - 40/80) = 0.5

Split on Y: Y = 0 node (40 C1, 60 C2), Y = 1 node (60 C1, 40 C2)
Error(Y) = (100/200)(1 - 60/100) + (100/200)(1 - 60/100) = 0.4

Split on Z: Z = 0 node (30 C1, 70 C2), Z = 1 node (70 C1, 30 C2)
Error(Z) = (100/200)(1 - 70/100) + (100/200)(1 - 70/100) = 0.3

Z gives the lowest classification error, so Z is chosen as the first splitting
attribute.
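The level-1 comparison can be reproduced directly from the count table (a minimal sketch; each pair gives the C1 and C2 counts of one child node, and the helper name is illustrative):

```python
# Sketch: weighted classification error of a split, given (C1, C2) counts per child.
def weighted_error(children):
    total = sum(sum(c) for c in children)
    return sum(sum(c) - max(c) for c in children) / total

print(weighted_error([(60, 60), (40, 40)]))   # split on X: 0.5
print(weighted_error([(40, 60), (60, 40)]))   # split on Y: 0.4
print(weighted_error([(30, 70), (70, 30)]))   # split on Z: 0.3
```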













15 45 
3 
X  0 Z  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.25
i
60 60 
4 
15 25 
5 
X  1 Z  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.375
i
40 40 
8 
 60 
  40 

Weighted Average :   0.25   0.375 0.3
100 
 100 

15 45 
3 
Y  0 Z  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.25
i
60 60 
4 
15 25 
5 
Y  1 Z  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.375
i
40 40 
8 
 60 
  40 

Weighted Average :   0.25   0.375 0.3
100 
 100 

X and Y give the same weighted classification error for this branch, so either
attribute can be used to split the Z = 0 node.
45 15 
3 
X  0 Z  1: Classification error(t)  1  max p(i | t)  1   ,  1    0.25
i
60 60 
4 
25 15 
5 
X  1 Z  1: Classification error(t)  1  max p(i | t)  1   ,  1    0.375
i
40 40 
8 
 60 
  40 

Weighted Average :   0.25   0.375 0.3
100 
 100 

25 15 
3 
Y  0 Z  1: Classification error(t)  1  max p(i | t)  1   ,  1    0.375
i
40 40 
8 
45 15 
3 
Y  1 Z  1: Classification error(t)  1  max p(i | t)  1   ,  1    0.25
i
60 60 
4 
 40 
  60 

Weighted Average :   0.375   0.25 0.3
100 
 100 

Each branch misclassifies 30 of its 100 records, so the overall error rate of the
induced two-level tree is (30 + 30)/200 = 0.3.
The corresponding two-level decision tree splits on Z at the root and on X (or,
equivalently, Y) at each of the two branches.
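The overall error of this greedily induced tree can be confirmed by counting the misclassified records in its four leaves (a minimal sketch; the leaf counts are read off the calculations above, using X at the second level):

```python
# Each tuple is (C1 count, C2 count) in one leaf of the greedy tree.
leaves = [(15, 45), (15, 25),   # Z = 0 branch: X = 0 and X = 1
          (45, 15), (25, 15)]   # Z = 1 branch: X = 0 and X = 1
errors = sum(sum(leaf) - max(leaf) for leaf in leaves)
print(errors / 200)   # 60 / 200 = 0.3
```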
b)




55 5 
11 
X  0 Y  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.083
i
60 60 
12 
 5 55 
11 
X  0 Y  1: Classification error(t)  1  max p(i | t)  1   ,  1    0.083
i
60 60 
12 
1 
 1 

Weighted Average :   0.083   0.083 0.083
2 
 2 

45 15 
3 
X  0 Z  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.25
i
60 60 
4 
 15 45 
3
X  0 Z  1 : Classifica tion error (t )  1  max  p(i | t )  1   ,   1     0.25
i
 60 60 
4
1 
 1 

Weighted Average :   0.25   0.25 0.25
2 
 2 

The lowest classification error is Y. Therefore the next step is to split Y.





 5 35 
7 
X  1 Y  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.125
i
40 40 
8 
35 5 
7 
X  1 Y  1: Classification error(t)  1  max p(i | t)  1   ,  1    0.125
i
40 40 
8 
1 
 1 

Weighted Average :   0.125   0.125 0.125
2 
 2 

25 15 
5 
X  1 Z  0 : Classification error(t)  1  max p(i | t)  1   ,  1    0.375
i
40 40 
8 
15 25 
5 
X  1 Z  1: Classification error(t)  1  max p(i | t)  1   ,  1    0.375
i
40 40 
8 
1 
 1 

Weighted Average :   0.375   0.375 0.375
2 
 2 



Y also gives the lower classification error for the X = 1 branch, so Y is used at
both second-level nodes. The overall error rate of the induced tree is therefore
(120/200)(0.083) + (80/200)(0.125) = (10 + 10)/200 = 0.1.
The corresponding two-level decision tree splits on X at the root and on Y at
both second-level nodes.
c) Comparing the results of parts (a) and (b), the error rate of the greedily
induced tree in part (a) (0.3) is significantly larger than that of the tree in
part (b) (0.1), even though the greedy heuristic chose the locally best attribute
at every step. This example shows that a greedy heuristic does not always produce
an optimal solution.
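For comparison, the tree from part (b) (X at the root and Y at both second-level nodes) can be scored the same way as the greedy tree, confirming the gap between the two (a minimal sketch; leaf counts come from the calculations above):

```python
# Each tuple is (C1 count, C2 count) in one leaf of the part (b) tree.
leaves_b = [(5, 55), (55, 5),    # X = 0 branch: Y = 0 and Y = 1
            (35, 5), (5, 35)]    # X = 1 branch: Y = 0 and Y = 1
errors_b = sum(sum(leaf) - max(leaf) for leaf in leaves_b)
print(errors_b / 200)   # 20 / 200 = 0.1, versus 0.3 for the greedy tree of part (a)
```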