IS463 – Introduction to Data Mining
Semester 2, Academic Year 2011-2012
Tutorial #4

Question 1: What are the two main purposes of a classification model?

Sol:
Descriptive modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful, for biologists and others, to have a descriptive model that summarizes the data and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.

Predictive modeling: A classification model can also be used to predict the class label of unknown records. It can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.

Question 2: What kinds of attributes are classification techniques most suited for? Why?

Sol:
Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., classifying a person as a member of a high-, medium-, or low-income group) because they do not take into account the implicit order among the categories. Other forms of relationships, such as the subclass–superclass relationships among categories (e.g., humans and apes are primates, which in turn are a subclass of mammals), are also ignored.

Question 3: Consider the training examples shown in the following table for a binary classification problem.

| Customer ID | Gender | Car Type | Shirt Size  | Class |
|-------------|--------|----------|-------------|-------|
| 1           | M      | Family   | Small       | C0    |
| 2           | M      | Sports   | Medium      | C0    |
| 3           | M      | Sports   | Medium      | C0    |
| 4           | M      | Sports   | Large       | C0    |
| 5           | M      | Sports   | Extra Large | C0    |
| 6           | M      | Sports   | Extra Large | C0    |
| 7           | F      | Sports   | Small       | C0    |
| 8           | F      | Sports   | Small       | C0    |
| 9           | F      | Sports   | Medium      | C0    |
| 10          | F      | Luxury   | Large       | C0    |
| 11          | M      | Family   | Large       | C1    |
| 12          | M      | Family   | Extra Large | C1    |
| 13          | M      | Family   | Medium      | C1    |
| 14          | M      | Luxury   | Extra Large | C1    |
| 15          | F      | Luxury   | Small       | C1    |
| 16          | F      | Luxury   | Small       | C1    |
| 17          | F      | Luxury   | Medium      | C1    |
| 18          | F      | Luxury   | Medium      | C1    |
| 19          | F      | Luxury   | Medium      | C1    |
| 20          | F      | Luxury   | Large       | C1    |

(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute.
(d) Compute the Gini index for the Car Type attribute using a multiway split.
(e) Compute the Gini index for the Shirt Size attribute using a multiway split.
(f) Which attribute is better: Gender, Car Type, or Shirt Size?
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini index.
Sol:
a) Gini(t) = 1 - sum_j [p(j|t)]^2. The overall collection contains 10 records of class C0 and 10 of class C1, so
Gini = 1 - (10/20)^2 - (10/20)^2 = 0.5

b) Customer ID: each Customer ID value covers exactly one record, so every child node is pure and has Gini = 1 - 1^2 - 0^2 = 0. The weighted-average Gini for the split,
Gini_split = sum_{i=1..k} (n_i/n) Gini(i),
is therefore 0.

c) Gender:
Female: Gini = 1 - (6/10)^2 - (4/10)^2 = 1 - 0.36 - 0.16 = 0.48
Male:   Gini = 1 - (6/10)^2 - (4/10)^2 = 1 - 0.36 - 0.16 = 0.48
Weighted average: Gini_split = (10/20)(0.48) + (10/20)(0.48) = 0.48

d) Car Type:
Family: Gini = 1 - (1/4)^2 - (3/4)^2 = 1 - 0.0625 - 0.5625 = 0.375
Sports: Gini = 1 - (8/8)^2 - (0/8)^2 = 0
Luxury: Gini = 1 - (1/8)^2 - (7/8)^2 = 1 - 0.0156 - 0.7656 = 0.21875
Weighted average: Gini_split = (4/20)(0.375) + (8/20)(0) + (8/20)(0.21875) = 0.1625 ≈ 0.163

e) Shirt Size:
Small:       Gini = 1 - (3/5)^2 - (2/5)^2 = 1 - 0.36 - 0.16 = 0.48
Medium:      Gini = 1 - (3/7)^2 - (4/7)^2 = 1 - 0.1837 - 0.3265 = 0.4898
Large:       Gini = 1 - (2/4)^2 - (2/4)^2 = 0.5
Extra Large: Gini = 1 - (2/4)^2 - (2/4)^2 = 0.5
Weighted average: Gini_split = (5/20)(0.48) + (7/20)(0.4898) + (4/20)(0.5) + (4/20)(0.5) ≈ 0.4914

f) Among Gender, Car Type, and Shirt Size, Car Type is the best attribute because its split has the lowest Gini index (0.163). A lower Gini index indicates purer child nodes; in particular, the Sports child contains records of a single class (Gini = 0).

g) Customer ID should not be used as the attribute test condition even though it has the lowest Gini index, because its value is unique for every record. Splitting on it produces one pure node per training record, so the attribute carries no predictive information and the resulting model cannot generalize to new records.
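The hand calculations in Question 3 can be cross-checked numerically. Below is a minimal Python sketch under that assumption; the record list mirrors the table above, and the helper names (gini, gini_split) are illustrative, not part of the tutorial:

```python
from collections import Counter

# Question 3 training data: (customer_id, gender, car_type, shirt_size, class)
records = [
    (1, "M", "Family", "Small", "C0"), (2, "M", "Sports", "Medium", "C0"),
    (3, "M", "Sports", "Medium", "C0"), (4, "M", "Sports", "Large", "C0"),
    (5, "M", "Sports", "Extra Large", "C0"), (6, "M", "Sports", "Extra Large", "C0"),
    (7, "F", "Sports", "Small", "C0"), (8, "F", "Sports", "Small", "C0"),
    (9, "F", "Sports", "Medium", "C0"), (10, "F", "Luxury", "Large", "C0"),
    (11, "M", "Family", "Large", "C1"), (12, "M", "Family", "Extra Large", "C1"),
    (13, "M", "Family", "Medium", "C1"), (14, "M", "Luxury", "Extra Large", "C1"),
    (15, "F", "Luxury", "Small", "C1"), (16, "F", "Luxury", "Small", "C1"),
    (17, "F", "Luxury", "Medium", "C1"), (18, "F", "Luxury", "Medium", "C1"),
    (19, "F", "Luxury", "Medium", "C1"), (20, "F", "Luxury", "Large", "C1"),
]

def gini(labels):
    """Gini(t) = 1 - sum_j p(j|t)^2 for a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(records, attr_index):
    """Weighted-average Gini of a multiway split on the given attribute column."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[attr_index], []).append(rec[-1])
    n = len(records)
    return sum(len(labels) / n * gini(labels) for labels in groups.values())

print(gini([r[-1] for r in records]))   # overall collection: 0.5
print(gini_split(records, 0))           # Customer ID: 0.0
print(gini_split(records, 1))           # Gender: 0.48
print(gini_split(records, 2))           # Car Type: ~0.1625
print(gini_split(records, 3))           # Shirt Size: ~0.4914
```

Each printed value should match the corresponding Gini index computed by hand above.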
Question 4: Consider the training examples shown in the following table for a binary classification problem.

| Instance | a1 | a2 | a3  | Target Class |
|----------|----|----|-----|--------------|
| 1        | T  | T  | 1.0 | +            |
| 2        | T  | T  | 6.0 | +            |
| 3        | T  | F  | 5.0 | -            |
| 4        | F  | F  | 4.0 | +            |
| 5        | F  | T  | 7.0 | -            |
| 6        | F  | T  | 3.0 | -            |
| 7        | F  | F  | 8.0 | -            |
| 8        | T  | F  | 7.0 | +            |
| 9        | F  | T  | 5.0 | -            |

(a) What is the entropy of this collection of training examples with respect to the positive class?
(b) What are the information gains of a1 and a2 relative to these training examples?
(c) For a3, which is a continuous attribute, compute the information gain for every possible split.
(d) What is the best split (among a1, a2, and a3) according to the information gain?
(e) What is the best split (between a1 and a2) according to the classification error rate?
(f) What is the best split (between a1 and a2) according to the Gini index?

Sol:
a) The collection contains 4 positive and 5 negative examples, so
Entropy(t) = -sum_i p(i|t) log2 p(i|t) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

b) Entropy of the child nodes:
a1 = T (3 positive, 1 negative): Entropy = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
a1 = F (1 positive, 4 negative): Entropy = -(1/5) log2(1/5) - (4/5) log2(4/5) = 0.7219
a2 = T (2 positive, 3 negative): Entropy = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.9710
a2 = F (2 positive, 2 negative): Entropy = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1

Information gain = Entropy(parent) - sum_j (N(j)/N) Entropy(j):
Gain(a1) = 0.9911 - (4/9)(0.8113) - (5/9)(0.7219) = 0.2294
Gain(a2) = 0.9911 - (5/9)(0.9710) - (4/9)(1) = 0.0072

c) Sorting the records by a3 gives the following candidate split points and class counts:

| Split point | a3 ≤ split (+, -) | a3 > split (+, -) | Information gain |
|-------------|-------------------|-------------------|------------------|
| 0.5         | 0, 0              | 4, 5              | 0                |
| 2.0         | 1, 0              | 3, 5              | 0.1427           |
| 3.5         | 1, 1              | 3, 4              | 0.0026           |
| 4.5         | 2, 1              | 2, 4              | 0.0728           |
| 5.5         | 2, 3              | 2, 2              | 0.0072           |
| 6.5         | 3, 3              | 1, 2              | 0.0183           |
| 7.5         | 4, 4              | 0, 1              | 0.1022           |
| 8.5         | 4, 5              | 0, 0              | 0                |

Split at 0.5: every record falls in the a3 > 0.5 child, whose entropy equals the parent entropy (0.9911), so the gain is 0.

Split at 2.0:
Entropy(a3 ≤ 2.0) = 0 (the single record is positive)
Entropy(a3 > 2.0) = -(3/8) log2(3/8) - (5/8) log2(5/8) = 0.9544
Weighted average = (1/9)(0) + (8/9)(0.9544) = 0.8484
Gain = 0.9911 - 0.8484 = 0.1427

Split at 3.5:
Entropy(a3 ≤ 3.5) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Entropy(a3 > 3.5) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.9852
Weighted average = (2/9)(1) + (7/9)(0.9852) = 0.9885
Gain = 0.9911 - 0.9885 = 0.0026

Split at 4.5:
Entropy(a3 ≤ 4.5) = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.9183
Entropy(a3 > 4.5) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.9183
Weighted average = (3/9)(0.9183) + (6/9)(0.9183) = 0.9183
Gain = 0.9911 - 0.9183 = 0.0728

Split at 5.5:
Entropy(a3 ≤ 5.5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.9710
Entropy(a3 > 5.5) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1
Weighted average = (5/9)(0.9710) + (4/9)(1) = 0.9839
Gain = 0.9911 - 0.9839 = 0.0072

Split at 6.5:
Entropy(a3 ≤ 6.5) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Entropy(a3 > 6.5) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.9183
Weighted average = (6/9)(1) + (3/9)(0.9183) = 0.9728
Gain = 0.9911 - 0.9728 = 0.0183

Split at 7.5:
Entropy(a3 ≤ 7.5) = -(4/8) log2(4/8) - (4/8) log2(4/8) = 1
Entropy(a3 > 7.5) = 0 (the single record is negative)
Weighted average = (8/9)(1) + (1/9)(0) = 0.8889
Gain = 0.9911 - 0.8889 = 0.1022

Split at 8.5: every record falls in the a3 ≤ 8.5 child, so the weighted entropy equals the parent entropy and the gain is 0.

The best split for a3 is therefore at 2.0, with information gain 0.1427.

d) According to the information gain, the best split is on a1, because it has the highest gain (0.2294, compared with 0.1427 for the best a3 split and 0.0072 for a2).

e) Splitting on a1 (predict + for a1 = T, - for a1 = F) classifies 7 of the 9 records correctly:
Error(a1) = 1 - 7/9 = 2/9 ≈ 0.222
Splitting on a2 classifies at most 5 of the 9 records correctly:
Error(a2) = 1 - 5/9 = 4/9 ≈ 0.444
According to the classification error rate, the best split is a1 because it has the lower error.

f) Weighted Gini index of each split:
a1 = T: Gini = 1 - (3/4)^2 - (1/4)^2 = 0.375; a1 = F: Gini = 1 - (1/5)^2 - (4/5)^2 = 0.32
Gini_split(a1) = (4/9)(0.375) + (5/9)(0.32) = 0.3444
a2 = T: Gini = 1 - (2/5)^2 - (3/5)^2 = 0.48; a2 = F: Gini = 1 - (2/4)^2 - (2/4)^2 = 0.5
Gini_split(a2) = (5/9)(0.48) + (4/9)(0.5) = 0.4889
According to the Gini index, the best split is a1 because it has the lower value.
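As in Question 3, the entropy and information-gain figures above can be verified with a short script. The following is a minimal Python sketch; the data layout and the helper names (entropy, info_gain) are illustrative, not part of the tutorial:

```python
from math import log2
from collections import Counter

# Question 4 data: (a1, a2, a3, class)
data = [
    ("T", "T", 1.0, "+"), ("T", "T", 6.0, "+"), ("T", "F", 5.0, "-"),
    ("F", "F", 4.0, "+"), ("F", "T", 7.0, "-"), ("F", "T", 3.0, "-"),
    ("F", "F", 8.0, "-"), ("T", "F", 7.0, "+"), ("F", "T", 5.0, "-"),
]

def entropy(labels):
    """Entropy(t) = -sum_i p(i|t) log2 p(i|t); an empty node contributes 0."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def info_gain(parent_labels, children_label_lists):
    """Information gain = parent entropy minus weighted average child entropy."""
    n = len(parent_labels)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children_label_lists)
    return entropy(parent_labels) - weighted

labels = [r[-1] for r in data]

# (b) gains for the binary attributes a1 and a2
for idx, name in [(0, "a1"), (1, "a2")]:
    children = [[r[-1] for r in data if r[idx] == v] for v in ("T", "F")]
    print(name, round(info_gain(labels, children), 4))   # a1: ~0.2294, a2: ~0.0072

# (c) gains for every candidate split of the continuous attribute a3
for split in (0.5, 2.0, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5):
    left = [r[-1] for r in data if r[2] <= split]
    right = [r[-1] for r in data if r[2] > split]
    print(split, round(info_gain(labels, [left, right]), 4))
```

The printed gains should reproduce the table and the per-split calculations shown above.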
Question 5: Consider the following set of training examples.

| X | Y | Z | Number of class C1 examples | Number of class C2 examples |
|---|---|---|-----------------------------|-----------------------------|
| 0 | 0 | 0 | 5                           | 40                          |
| 0 | 0 | 1 | 0                           | 15                          |
| 0 | 1 | 0 | 10                          | 5                           |
| 0 | 1 | 1 | 45                          | 0                           |
| 1 | 0 | 0 | 10                          | 5                           |
| 1 | 0 | 1 | 25                          | 0                           |
| 1 | 1 | 0 | 5                           | 20                          |
| 1 | 1 | 1 | 0                           | 15                          |

(a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for splitting. What is the overall error rate of the induced tree?
(b) Repeat part (a) using X as the first splitting attribute and then choose the best remaining attribute for splitting at each of the two successor nodes. What is the error rate of the induced tree?
(c) Compare the results of parts (a) and (b). Comment on the suitability of the greedy heuristic used for splitting attribute selection.

Sol:
a) The class distributions for the three candidate splits at the root node are:

| Attribute | Value = 0 (C1, C2) | Value = 1 (C1, C2) |
|-----------|--------------------|--------------------|
| X         | 60, 60             | 40, 40             |
| Y         | 40, 60             | 60, 40             |
| Z         | 30, 70             | 70, 30             |

Classification error of each candidate split:
X = 0: Error = 1 - max(60/120, 60/120) = 0.5; X = 1: Error = 1 - max(40/80, 40/80) = 0.5; weighted error = 0.5
Y = 0: Error = 1 - max(40/100, 60/100) = 0.4; Y = 1: Error = 1 - max(60/100, 40/100) = 0.4; weighted error = 0.4
Z = 0: Error = 1 - max(30/100, 70/100) = 0.3; Z = 1: Error = 1 - max(70/100, 30/100) = 0.3; weighted error = 0.3

Z has the lowest classification error, so Z is chosen as the splitting attribute at the root.

Splitting the Z = 0 node (30 C1, 70 C2; 100 records):
X = 0, Z = 0: (C1, C2) = (15, 45), Error = 1 - max(15/60, 45/60) = 0.25
X = 1, Z = 0: (C1, C2) = (15, 25), Error = 1 - max(15/40, 25/40) = 0.375
Weighted error for splitting on X: (60/100)(0.25) + (40/100)(0.375) = 0.3
Y = 0, Z = 0: (C1, C2) = (15, 45), Error = 0.25
Y = 1, Z = 0: (C1, C2) = (15, 25), Error = 0.375
Weighted error for splitting on Y: (60/100)(0.25) + (40/100)(0.375) = 0.3
X and Y give the same classification error at the Z = 0 node, so either may be chosen.

Splitting the Z = 1 node (70 C1, 30 C2; 100 records):
X = 0, Z = 1: (C1, C2) = (45, 15), Error = 1 - max(45/60, 15/60) = 0.25
X = 1, Z = 1: (C1, C2) = (25, 15), Error = 1 - max(25/40, 15/40) = 0.375
Weighted error for splitting on X: (60/100)(0.25) + (40/100)(0.375) = 0.3
Y = 0, Z = 1: (C1, C2) = (25, 15), Error = 0.375
Y = 1, Z = 1: (C1, C2) = (45, 15), Error = 0.25
Weighted error for splitting on Y: (40/100)(0.375) + (60/100)(0.25) = 0.3
Again X and Y tie, so either may be chosen at the Z = 1 node.

The induced two-level tree therefore splits on Z at the root and on X (or Y) at each of the two child nodes. Each of its four leaves misclassifies 15 records, so the overall error rate of the induced tree is (15 + 15 + 15 + 15)/200 = 0.3.

b) With X fixed as the splitting attribute at the root:

At the X = 0 node (60 C1, 60 C2; 120 records):
X = 0, Y = 0: (C1, C2) = (5, 55), Error = 1 - max(5/60, 55/60) = 1/12 ≈ 0.083
X = 0, Y = 1: (C1, C2) = (55, 5), Error = 1 - max(55/60, 5/60) = 1/12 ≈ 0.083
Weighted error for splitting on Y: (60/120)(0.083) + (60/120)(0.083) = 0.083
X = 0, Z = 0: (C1, C2) = (15, 45), Error = 0.25
X = 0, Z = 1: (C1, C2) = (45, 15), Error = 0.25
Weighted error for splitting on Z: 0.25
Y has the lower classification error, so Y is chosen at the X = 0 node.

At the X = 1 node (40 C1, 40 C2; 80 records):
X = 1, Y = 0: (C1, C2) = (35, 5), Error = 1 - max(35/40, 5/40) = 0.125
X = 1, Y = 1: (C1, C2) = (5, 35), Error = 0.125
Weighted error for splitting on Y: 0.125
X = 1, Z = 0: (C1, C2) = (15, 25), Error = 0.375
X = 1, Z = 1: (C1, C2) = (25, 15), Error = 0.375
Weighted error for splitting on Z: 0.375
Y has the lower classification error, so Y is also chosen at the X = 1 node.

The induced two-level tree splits on X at the root and on Y at each of the two child nodes. Each of its four leaves misclassifies 5 records, so the overall error rate of the induced tree is (5 + 5 + 5 + 5)/200 = 0.1.

c) Comparing the results of parts (a) and (b), the error rate of the tree from part (a), 0.3, is significantly larger than that of the tree from part (b), 0.1, even though the greedy approach chose the locally best attribute (Z) at the root. This example shows that a greedy heuristic does not always produce an optimal solution.
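The error-rate computations behind this greedy construction can also be reproduced programmatically. Below is a minimal Python sketch that evaluates the weighted classification error of candidate splits directly from the class counts; the dictionary layout and helper names are illustrative, not part of the tutorial:

```python
# Question 5 class counts, keyed by (X, Y, Z) -> (C1 count, C2 count)
counts = {
    (0, 0, 0): (5, 40),  (0, 0, 1): (0, 15),
    (0, 1, 0): (10, 5),  (0, 1, 1): (45, 0),
    (1, 0, 0): (10, 5),  (1, 0, 1): (25, 0),
    (1, 1, 0): (5, 20),  (1, 1, 1): (0, 15),
}

def branch_counts(attr, value, fixed=None):
    """Total (C1, C2) counts where attribute `attr` (0=X, 1=Y, 2=Z) equals `value`,
    optionally restricted to records matching a fixed (attribute, value) pair."""
    c1 = c2 = 0
    for key, (n1, n2) in counts.items():
        if key[attr] != value:
            continue
        if fixed is not None and key[fixed[0]] != fixed[1]:
            continue
        c1, c2 = c1 + n1, c2 + n2
    return c1, c2

def split_error(attr, fixed=None):
    """Weighted classification error of splitting on `attr` at the node defined by `fixed`."""
    errors = total = 0
    for value in (0, 1):
        c1, c2 = branch_counts(attr, value, fixed)
        errors += min(c1, c2)          # records misclassified by the majority-class rule
        total += c1 + c2
    return errors / total if total else 0.0

# Level-1 choice: Z has the lowest error, as in part (a)
print([round(split_error(a), 3) for a in (0, 1, 2)])      # [0.5, 0.4, 0.3]

# Part (a): root = Z, then split each child on X (Y gives the same error)
errors_a = sum(min(branch_counts(0, v, fixed=(2, z))) for z in (0, 1) for v in (0, 1))
print(errors_a / 200)                                      # 0.3

# Part (b): root = X, then split each child on Y
errors_b = sum(min(branch_counts(1, v, fixed=(0, x))) for x in (0, 1) for v in (0, 1))
print(errors_b / 200)                                      # 0.1
```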