School of IT Technical Report
DECISION TREES FOR IMBALANCED DATA SETS
TECHNICAL REPORT 630
WEI LIU AND SANJAY CHAWLA
SCHOOL OF INFORMATION TECHNOLOGIES
THE UNIVERSITY OF SYDNEY
SEPTEMBER 2008

Decision Trees for Imbalanced Data Sets

Wei Liu
School of Information Technologies
University of Sydney, Australia
[email protected]

Sanjay Chawla
School of Information Technologies
University of Sydney, Australia
[email protected]

Abstract

We propose a new variant of decision tree for imbalanced classification. Decision trees use a greedy approach based on information gain to select the attribute to split on. We express information gain in terms of confidence and show that, like confidence, information gain is biased towards the majority class. We overcome this bias by embedding a new measure, the ratio of the confidence of the minority and majority classes, into the entropy formula. Furthermore, we prune the decision tree using p-values instead of the conventional error-based approach. Together these two changes yield a decision tree which is robust to imbalanced data. Extensive experiments and comparisons against C4.5, CBA and variants on imbalanced and balanced data confirm our claims.

1. Introduction

While there are several types of classifiers, rule-based classifiers have the distinct advantage of being easily interpretable. This is especially true in a 'data mining' setting, where the high dimensionality of the data often means that a priori very little is known about the underlying mechanism which generated the data.

Decision trees are perhaps the most popular form of rule-based classifiers [8]. Lately, however, classifiers based on association rules have also become popular [7, 10]. These are often called associative classifiers. Associative classifiers use association rule mining to discover interesting and significant rules from the training data, and the set of discovered rules constitutes the classifier. The canonical example of an associative classifier is CBA (Classification Based on Associations) [7], which uses the minimum support and confidence framework to find rules. The accuracy of associative classifiers depends on the quality of their discovered rules.

However, the success of both C4.5 and CBA depends on the assumption that there is an equal amount of information for each class contained in the training data. For example, in binary classification, if there is a similar number of instances for the positive and negative classes, then both C4.5 and CBA generally perform well. On the other hand, if the training data set has an imbalanced class distribution, then both types of classifier are biased towards the majority class. As it happens, it is often the accurate prediction of the minority class that is of more interest.

One way of addressing the class imbalance problem is to modify the class distribution in the training data by oversampling the minority class or undersampling the majority class. SMOTE [4], for example, uses oversampling to increase the number of minority class instances by creating synthetic examples. Further variations on SMOTE [5] have integrated boosting with sampling strategies to better model the minority class by focusing on difficult examples that belong not only to the minority class but also to the majority class.
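To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation; it is not the algorithm of [4] itself, and the function name, the use of NumPy, and the assumption of purely numeric features are ours. Each synthetic minority example is placed on the line segment between a minority instance and one of its k nearest minority neighbours.

import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between each
    chosen minority sample and one of its k nearest minority-class neighbours.
    X_min: (n_min, n_features) array of minority-class instances (numeric)."""
    rng = np.random.default_rng(rng)
    n_min = len(X_min)
    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest minority neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n_min)                      # pick a minority sample
        j = rng.choice(neighbours[i])                # and one of its neighbours
        gap = rng.random()                           # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)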
Nonetheless, data sampling is not the only way to deal with the class imbalance problem, and some specifically designed "imbalanced data oriented" algorithms can perform well on the original, unmodified imbalanced data sets. For example, a variation on associative classifiers called SPARCCC [10] has been shown to outperform CBA [7], CMAR [6] and CCCS [1] on imbalanced data sets. The downside of SPARCCC is that it generates a large number of rules. This seems to be a feature of all associative classifiers and negates many of the advantages of rule-based classification.

Our main objective is to have a rule-based classifier which generates far fewer rules than associative classifiers but can successfully handle both imbalanced and balanced data sets. Towards that end, we first theoretically examine how an imbalanced distribution affects the metrics that are used by both decision trees and associative classifiers. In particular, we study the effect of an imbalanced distribution on confidence as well as on information gain.

One issue specific to decision trees is the use of a pruning strategy to prevent the tree from overfitting the data. Traditional pruning algorithms are based on error estimates: for example, a branch is replaced by a leaf if the predicted error rate for the leaf is lower than that for the branch. But this pruning technique does not always perform well on imbalanced data sets. Chawla [3] has shown that the pruning in C4.5 can have a detrimental effect on learning from imbalanced data sets, since a lower error rate can be achieved by removing the branches that lead to minority class leaves. In our pruning process, we use Fisher's Exact Test (FET) to check whether or not a path in the decision tree is statistically significant, and if it is not, the leaf is pruned. As an added advantage, every rule that we obtain is also statistically significant.

The paper is structured as follows. In Section 2 we analyze the reasons why CBA and C4.5 perform poorly on imbalanced data sets. In Section 3 we present the Class Confidence Ratio (CCR) as the metric of choice for selecting the attributes to split on for imbalanced classification. Section 4 presents the new decision tree algorithm, which integrates CCR and Fisher's Exact Test (FET). Experiments are presented in Section 5, and we conclude in Section 6 with directions for future work.

2. Analysis of rule-based classifiers

We analyze the metrics used by rule-based classifiers in the context of imbalanced data. We show that both associative classifiers, like CBA and its variants, and decision trees use metrics that are not favourable towards the classification of imbalanced data sets. In particular, we first show that the ranking of rules based on confidence is biased towards the majority class. We then express information gain as a function of confidence and show that it too suffers from similar problems.

2.1. CBA

The performance of associative classifiers is dependent on the quality of the rules they discover during the training process. We now show that in an imbalanced setting, confidence is biased towards the majority class. Suppose we have a training data set which consists of n records, and the antecedent (denoted by X and ¬X) and class (denoted by y and ¬y) distributions are as shown in Table 1.
Table 1. An example for CBA analysis

                 X        ¬X       Σ Instances
  y              a        b        a + b
  ¬y             c        d        c + d
  Σ Attributes   a + c    b + d    n

The rule selection strategy in CBA is to find all rule items that have support and confidence above some predefined thresholds. For a rule X → y, the confidence of the rule is defined as:

  Conf(X → y) = Supp(X ∪ y) / Supp(X) = a / (a + c)    (1)

Similarly, we have:

  Conf(X → ¬y) = Supp(X ∪ ¬y) / Supp(X) = c / (a + c)    (2)

From the definition in Equation 1, it is clear that the highest-confidence rules simply select the most frequent class among all the instances that contain the antecedent (i.e. X in this example). However, for an imbalanced data set, since the size of the positive class is always much smaller than that of the negative one, we always have a + b ≪ c + d (supposing y is the positive class). Moreover, assuming that the imbalance does not affect the distribution of antecedents, we can assume that the Xs and ¬Xs are roughly equally distributed. Then, when the data set is imbalanced, a and b are both small, while c and d are both large. Even if y occurs with X more frequently than ¬y does, c is unlikely to be less than a, because the positive class is much smaller than the negative class. Thus it is not surprising that the right-hand side of Equation 2 tends to be lower bounded by the right-hand side of Equation 1. Consequently, even though the rule X → ¬y may not be significant, it can easily attain a high confidence. In these circumstances, it is much harder for a "good" rule X → y to have a confidence significantly larger than that of a "bad" rule X → ¬y. What is more, because of its low confidence, the good rule will be ranked behind rules that have a higher confidence merely because they predict the majority class when the classifier is built. This is a fatal flaw, as in an imbalanced class problem it is often the minority class that is of more interest.
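A small numeric illustration of this bias follows; the counts are invented purely for illustration, with y as the minority class, and the snippet simply evaluates Equations 1 and 2.

# Illustrative contingency table (invented counts); y is the minority class.
#            X      not-X
#   y        a=30   b=20    -> 50 positive instances
#   not-y    c=300  d=650   -> 950 negative instances
a, b, c, d = 30, 20, 300, 650

conf_X_y    = a / (a + c)   # Equation (1): 30/330  ~ 0.09
conf_X_noty = c / (a + c)   # Equation (2): 300/330 ~ 0.91

print(conf_X_y, conf_X_noty)
# Although X co-occurs with y more often than with not-y in relative terms
# (60% of all y's have X, versus ~32% of all not-y's), the rule X -> not-y
# still wins on confidence, simply because not-y is the majority class.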
2.2. Decision Trees

Decision trees such as C4.5 use information gain to decide which attribute to split on [8]. The information gain from splitting a node t is defined as:

  InfoGain_split = Entropy(t) − Σ_{i=1,2} (n_i / n) Entropy(i)    (3)

where Entropy(t) is defined as:

  Entropy(t) = − Σ_j p(j|t) log p(j|t)    (4)

For a fixed training set (or subset of it), the number of instances of each class (the j in Equation 4) is the same for all attributes. So the first term in Equation 3 is fixed, which means that the information gain of each attribute split depends only on the second term; maximizing information gain reduces to maximizing − Σ_{i=1,2} (n_i / n) Entropy(i). If we expand the second term of Equation 3, we get:

  InfoGain_split = Entropy(t) − (n_1 / n) Entropy(subnode_1) − (n_2 / n) Entropy(subnode_2)    (5)

Suppose the node t is split into two sub-nodes along two corresponding paths, X and ¬X, and the instances in each node belong to two classes denoted by y and ¬y. Then Equation 5 can be rewritten as:

  InfoGain_split = Entropy(t)
      − (n_1 / n) [ −p(y|X) log p(y|X) − p(¬y|X) log p(¬y|X) ]
      − (n_2 / n) [ −p(y|¬X) log p(y|¬X) − p(¬y|¬X) log p(¬y|¬X) ]    (6)

Note that the probability of y given X is exactly the confidence of X implies y:

  p(y|X) = p(X ∪ y) / p(X) = supp(X ∪ y) / supp(X) = conf(X → y)

Then, ignoring the "fixed" term Entropy(t) in Equation 6, we obtain the following relationship:

  InfoGain_split = Entropy(t)
      + (n_1 / n) [ conf(X → y) log conf(X → y) + conf(X → ¬y) log conf(X → ¬y) ]
      + (n_2 / n) [ conf(¬X → y) log conf(¬X → y) + conf(¬X → ¬y) log conf(¬X → ¬y) ]
    ∝ (n_1 / n) [ conf(X → y) log conf(X → y) + conf(X → ¬y) log conf(X → ¬y) ]
      + (n_2 / n) [ conf(¬X → y) log conf(¬X → y) + conf(¬X → ¬y) log conf(¬X → ¬y) ]
    = (n_1 / n) log [ conf(X → y)^conf(X → y) · conf(X → ¬y)^conf(X → ¬y) ]
      + (n_2 / n) log [ conf(¬X → y)^conf(¬X → y) · conf(¬X → ¬y)^conf(¬X → ¬y) ]    (7)

If we denote conf(X → y) by p and conf(¬X → y) by q, then Equation 7 can be rewritten as:

  InfoGain_split ∝ (n_1 / n) log [ p^p (1 − p)^(1−p) ] + (n_2 / n) log [ q^q (1 − q)^(1−q) ]
                 ∝ log [ p^p (1 − p)^(1−p) ] + log [ q^q (1 − q)^(1−q) ]
                 ∝ p^p (1 − p)^(1−p) + q^q (1 − q)^(1−q)    (8)

[Figure 1. Proportion of information gain to conf(X → y) and conf(¬X → y).]

The distribution of Equation 8 as a three-dimensional surface is shown in Figure 1. Information gain is maximized when conf(X → y) and conf(¬X → y) are both close to either 0 or 1, and is minimized when conf(X → y) and conf(¬X → y) are both close to 0.5. Note that when conf(X → y) is close to 0, conf(X → ¬y) is close to 1; and when conf(¬X → y) is close to 0, conf(¬X → ¬y) is close to 1. We can therefore conclude: information gain achieves its highest value when either X → y or X → ¬y has a high confidence, and either ¬X → y or ¬X → ¬y also has a high confidence.

Thus decision trees like C4.5 split on an attribute such that the resulting partition provides the highest confidence. This is very similar to the rule ranking mechanism of associative classifiers like CBA. However, as analysed in Section 2.1, for imbalanced data sets high-confidence rules are not always significant rules, and there may be significant rules that do not attain a high confidence. So the splitting criterion of C4.5 is suitable for balanced data sets, but not for imbalanced ones.

3. Class Confidence Ratio and Fisher's Exact Test

Having identified the weakness of the support-confidence framework and the close relationship between information gain and confidence, we are now in a position to propose new measures to remedy the problem.

Recall that a particular class c appearing more frequently together with X does not necessarily mean that X "explains" the class c, because c could be the overwhelming majority class. It is therefore reasonable, instead of focusing on the antecedents (Xs), to focus only on each class and find the most significant antecedents for it. In this way, all instances are partitioned according to the class they contain, and consequently instances that belong to different classes will not have an impact on each other. So we define a new concept, Class Confidence (CC), to find the most interesting antecedents (X) for each of the classes (y):

  CC(X → y) = Supp(X ∪ y) / Supp(y)    (9)

Using the notation of Table 1, we get the following:

  CC(X → y) = a / (a + b)    (10)

  CC(X → ¬y) = c / (c + d)    (11)

Then even if a + b ≪ c + d, Equations 10 and 11 will not be affected by the imbalance. Consequently, rules with high Class Confidence will indeed be the significant ones, regardless of whether they are discovered from balanced or imbalanced data sets.
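Continuing the invented contingency table from the earlier illustration, the following sketch contrasts confidence with Class Confidence; it only restates Equations 1, 2, 10 and 11 on made-up counts.

# Same illustrative table as before (invented counts); y is the minority class.
a, b, c, d = 30, 20, 300, 650

# Confidence (Equations 1-2) conditions on the antecedent X:
conf_X_y    = a / (a + c)          # ~0.09
conf_X_noty = c / (a + c)          # ~0.91  -> the majority class wins

# Class Confidence (Equations 10-11) conditions on the class instead:
cc_X_y    = a / (a + b)            # 30/50   = 0.60
cc_X_noty = c / (c + d)            # 300/950 ~ 0.32

print(cc_X_y, cc_X_noty)
# Under CC the comparison is no longer distorted by the class sizes:
# X "explains" y (0.60) much better than it explains not-y (~0.32).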
The ratio of the significance of a rule to that of its alternative class has been shown in SPARCCC to be an effective way of measuring the rule's interestingness [10]. We use the ratio of the Class Confidence of the class a rule implies to that of the alternative class, and call it the Class Confidence Ratio (CCR):

  CCR(X → y) = CC(X → y) / CC(X → ¬y) = a(c + d) / (c(a + b))    (12)

A rule with a high CCR means that, compared with its alternative class, the class this rule implies has a higher Class Confidence, and consequently is more likely to occur together with the rule's antecedents regardless of the proportion of classes in the data set.

While CCR helps us select rules that are "good" at discriminating between classes, we also want to evaluate the statistical significance of each rule. For this we use Fisher's Exact Test (FET). For a rule X → y, the FET finds the probability (p-value) of obtaining contingency tables in which X and y are at least as positively associated as in the observed one, under the null hypothesis that {X, ¬X} and {y, ¬y} are independent [10]. The p-value of the rule is given by:

  p([a, b; c, d]) = Σ_{i=0}^{min(b,c)} [ (a + b)! (c + d)! (a + c)! (b + d)! ] / [ n! (a + i)! (b − i)! (c − i)! (d + i)! ]    (13)

Then, given a threshold for the p-value, we can find and keep those rules that are statistically significant in the positively associated direction, and discard those that are not. In the next section we introduce the CCR-based decision tree, which combines CCR and FET.

4. CCR-based Decision Trees

The original definition of entropy for decision trees was given in Equation 4. As analyzed in Section 2.2, the frequency term p in Equation 4 is not a good criterion for learning from imbalanced data sets. In order to integrate CCR, we modify the definition of entropy by replacing the frequency term with CCR:

  Entropy_CCR(t) = − Σ_j [ CCR(X → y_j) / MaxCCR ] log [ CCR(X → y_j) / MaxCCR ]    (14)

In Equation 14, CCR(X → y_j) / MaxCCR is the new metric which replaces the former p(j|t) of Equation 4. In order to make sure that the new metric is still between 0 and 1, we divide each CCR by MaxCCR, the largest CCR value among all the partitioned children nodes.

After replacing the new term in the entropy definition, the conclusion made in Section 2.2 can be restated as: in CCR-based decision trees, information gain achieves its highest value when either X → y or X → ¬y has a high Class Confidence (CC), and either ¬X → y or ¬X → ¬y also has a high Class Confidence (CC).

The process of creating decision trees based on CCR is described in Algorithm 1. The major difference between CCRDT and C4.5 is the way of selecting the candidate splitting attribute, as shown in line 9. The process of discovering the attribute with the highest information gain is presented in the subroutine Algorithm 2. In Algorithm 2, line 5 obtains the entropy of an attribute before splitting, lines 6 to 19 compute the new CCR-based entropy after splitting on that attribute, and lines 20 to 23 record the attribute with the highest information gain.
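For concreteness, the three quantities that drive the algorithm (CCR from Equation 12, the FET p-value from Equation 13, and the CCR-based entropy from Equation 14) can be sketched in Python as follows. This is an illustrative sketch only, using the Table 1 notation (a: X and y, b: ¬X and y, c: X and ¬y, d: ¬X and ¬y); the function names are ours, log base 2 is an assumption, and the example counts are the invented ones used earlier.

from math import factorial, log2

def ccr(a, b, c, d):
    """Class Confidence Ratio of X -> y (Equation 12): CC(X->y) / CC(X->not-y)."""
    return (a / (a + b)) / (c / (c + d))

def fet_p_value(a, b, c, d):
    """One-sided Fisher's Exact Test p-value as in Equation 13: probability of
    tables at least as positively associated as the observed one."""
    n = a + b + c + d
    num = factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)
    total = 0.0
    for i in range(min(b, c) + 1):
        den = (factorial(n) * factorial(a + i) * factorial(b - i)
               * factorial(c - i) * factorial(d + i))
        total += num / den
    return total

def ccr_entropy(ccr_values):
    """CCR-based entropy of a node (Equation 14): each child's CCR is normalised
    by the largest CCR among the children before the usual -p*log(p) sum."""
    max_ccr = max(ccr_values)
    ent = 0.0
    for v in ccr_values:
        p = v / max_ccr
        if p > 0:
            ent -= p * log2(p)
    return ent

# Example with the invented counts used above:
print(ccr(30, 20, 300, 650))        # ~1.9: X favours y over not-y
print(fet_p_value(30, 20, 300, 650))
print(ccr_entropy([1.9, 0.5]))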
In our decision tree model, we treat each branch node as the last antecedent of a rule. For example, if there are three branch nodes (BranchA, BranchB and BranchC) on the path from the root to a leaf LeafY, then we assume a rule of the form BranchA ∧ BranchB ∧ BranchC → LeafY. In Algorithm 2, we calculate the CCR for each branch node (lines 7 to 18); under the same example, the CCR of BranchC is the CCR of the above rule, the CCR of BranchB is the CCR of the rule BranchA ∧ BranchB → LeafY, and so on. In this way, it is guaranteed that the attribute we select is the one whose split generates the rules (paths in the tree) with the highest Class Confidence and highest CCR.

After the creation of the decision tree, Fisher's Exact Test is applied to each branch node of the tree. A branch node is replaced by a leaf node if the branch node holds a p-value higher than the desired threshold. This pruning process in CCRDT is described in lines 6 to 12 of Algorithm 3, in which line 6 obtains the p-value of a branch node. Note that the contingency table used for each branch node in the subroutine (Algorithm 4) has already been built in Algorithm 2.

5. Experiments

In our experiments, we compare CCRDT with CBA [7], C4.5 [8] and SPARCCC [10] on both balanced and imbalanced data sets. We use Weka's C4.5 implementation [11] in our experiments (note that it is named J48 in Weka). CCRDT, CBA and SPARCCC were also implemented in Weka.¹

¹ Implementation source code and the data sets used in the experiments can be obtained from http://www.cs.usyd.edu.au/~weiliu/CCRDT/

We chose 16 data sets from the UCI repository [2], of which 10 are imbalanced and 6 are balanced. Detailed information for each data set is shown in Table 2 and Table 3; class distributions are displayed in the minor and major class columns of both tables. Among the imbalanced data sets, three (those with no asterisk after the data set name) are originally imbalanced and have binary classes. The other seven data sets have multiple classes; we select one of the classes and convert all the remaining classes to "rest". For example, the "vehicle" data set originally has 4 classes, so we select one class ("bus") and treat the other three classes as "rest": "bus" then becomes the minority class (25.8%) and "rest" the majority class (74.2%). For the balanced data sets, we modified three data sets ("arrhythmia", "optdigits" and "tae") in a similar way to use them for binary classification.

All the experiments were carried out using 10-fold cross-validation. In CCRDT, we set the p-value threshold to 0.001. In C4.5, we use its default confidence level of 25% for pruning. In SPARCCC, we set the CCR threshold to 1.0 and the p-value threshold to 0.001. In CBA, we set the support threshold to 1% and the confidence threshold to 10%.

Algorithm 1 Creation of the CCR-based decision tree
Input: training data TD
Output: decision tree
CCRDT(TD):
 1: Create a node as the Root of the tree
 2: if all instances are positive then
 3:     Return a decision tree with one node (Root), labeled "positive"
 4: else
 5:     if all instances are negative then
 6:         Return a decision tree with one node (Root), labeled "negative"
 7:     else
 8:         // Find the best splitting attribute (Attri)
 9:         Attri = MaxCCRDTGain(TD)
10:         Assign Attri to the tree root (Root = Attri)
11:         for each value v_i of Attri do
12:             Add a branch for v_i
13:             Find TD(v_i), the instances whose value of attribute Attri is v_i
14:             if TD(v_i) is empty then
15:                 Add a leaf to this branch, with label = most common class in TD
16:             else
17:                 Add the subtree CCRDT(TD(v_i)) to this branch
18:             end if
19:         end for
20:     end if
21: end if
22: Return Root
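For readers who prefer code to pseudocode, the recursive construction of Algorithm 1 can be sketched as below. This is a hypothetical Python rendering, not the authors' implementation (which was done in Weka): records are (attribute-dict, label) pairs, attributes are nominal, and the split-selection subroutine is passed in as a parameter (a CCR-based version is sketched after Algorithm 2 below).

from collections import Counter

def build_ccrdt(records, attributes, select_attribute):
    """Recursive skeleton of Algorithm 1 (a sketch, not the authors' Weka code).
    records: list of (attribute_dict, class_label) pairs with nominal attributes.
    select_attribute: callable(records, attributes) -> name of the best attribute,
    e.g. a CCR-based gain criterion as in Algorithm 2."""
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Lines 2-7 of Algorithm 1: a pure node (or one with no attributes left) becomes a leaf.
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": True, "label": majority}
    attr = select_attribute(records, attributes)          # line 9: best splitting attribute
    node = {"leaf": False, "attribute": attr, "children": {}}
    remaining = [a for a in attributes if a != attr]
    for value in sorted({x[attr] for x, _ in records}):   # lines 11-19: one branch per value
        subset = [(x, y) for x, y in records if x[attr] == value]
        node["children"][value] = build_ccrdt(subset, remaining, select_attribute)
    # Algorithm 1 also maps attribute values absent from the node to the majority class.
    node["default"] = majority
    return node

# Tiny usage example with a trivial selector; attribute and value names are made up.
data = [({"outlook": "sunny", "windy": "no"},  "+"),
        ({"outlook": "sunny", "windy": "yes"}, "-"),
        ({"outlook": "rain",  "windy": "no"},  "-"),
        ({"outlook": "rain",  "windy": "yes"}, "-")]
tree = build_ccrdt(data, ["outlook", "windy"], lambda recs, attrs: attrs[0])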
The performance of each classifier is listed in the right half of Table 2 and Table 3.

Table 2. Methods comparison on imbalanced data sets
Each data set is listed as name (instances, attributes); result columns are Classifier, Leaves/Rules, TP Rate (%), FP Rate (%), ROC Area.

1. breast (286, 9); minor class 34.5% (malignant), major class 65.5% (benign)
   CCRDT      13     92.5    0.5    0.974
   C4.5       14     92.5    4.4    0.955
   SPARCCC    189    93.8    2.8    0.955
   CBA        51     90.4    7.1    0.901
2. car* (1728, 6); minor class 30% (unacc), major class 70% (rest)
   CCRDT      53     95.4    3.7    0.974
   C4.5       79     93.4    4.7    0.914
   SPARCCC    66     79.9    2.8    0.886
   CBA        69     97.4    2.6    0.987
3. cmc* (1473, 9); minor class 34.7% (3), major class 65.3% (rest)
   CCRDT      19     38.2    18.9   0.643
   C4.5       65     39.1    25.2   0.609
   SPARCCC    35     63.6    42.4   0.606
   CBA        34     65.4    38.8   0.633
4. ecoli* (336, 7); minor class 15.5% (pp), major class 84.5% (rest)
   CCRDT      6      75      4.6    0.846
   C4.5       7      75      3.2    0.838
   SPARCCC    13     75      12.7   0.812
   CBA        72     66.7    9.2    0.788
5. hepatitis (155, 79); minor class 20.6% (DIE), major class 79.4% (LIVE)
   CCRDT      16     56.3    12.2   0.79
   C4.5       17     34.4    10.6   0.73
   SPARCCC    42     65.6    16.3   0.747
   CBA        34     15.6    1.6    0.57
6. sick (3772, 29); minor class 6.1% (sick), major class 93.9% (negative)
   CCRDT      24     84      1.3    0.96
   C4.5       20     88.3    5      0.951
   SPARCCC    66     96.1    14.8   0.967
   CBA        78     51.1    0      0.755
7. splice* (3190, 60); minor class 24% (EI), major class 76% (rest)
   CCRDT      56     94.1    2.8    0.971
   C4.5       57     93.6    3.5    0.957
   SPARCCC    173    99.1    6.6    0.951
   CBA        122    54.5    19.8   0.674
8. synthetic* (600, 60); minor class 16.7% (cyclic), major class 83.3% (rest)
   CCRDT      4      84      0      0.975
   C4.5       5      74      4.2    0.883
   SPARCCC    82     100     0      1
   CBA        41     90.9    0      0.955
9. vehicle* (846, 18); minor class 25.8% (bus), major class 74.2% (rest)
   CCRDT      14     93.1    1.6    0.97
   C4.5       14     93.6    1.9    0.958
   SPARCCC    277    84.9    3.8    0.905
   CBA        119    37.3    0      0.686
10. wine* (178, 13); minor class 27% (3), major class 73% (rest)
   CCRDT      3      89.6    6.2    0.944
   C4.5       6      87.5    3.1    0.94
   SPARCCC    45     100     4.6    0.977
   CBA        49     85.4    2.3    0.916
Average
   CCRDT      20.8   80.22   5.18   0.91
   C4.5       28.4   77.14   6.58   0.87
   SPARCCC    98.8   85.80   10.68  0.88
   CBA        66.90  65.47   8.14   0.79

Table 3. Methods comparison on balanced data sets
Each data set is listed as name (instances, attributes); result columns are Classifier, Nodes/Rules, MinorAcc (%), MajorAcc (%), Overall Acc (%).

1. arrhythmia* (452, 280); minor class 45.8% (1), major class 54.2% (rest)
   CCRDT      32      74.4    79.6    77.21
   C4.5       35      72.5    79.6    76.32
   SPARCCC    632     81.2    59.2    69.247
   CBA        174     32.9    86.1    61.72
2. credit-rating (286, 9); minor class 45% (+), major class 55% (-)
   CCRDT      22      88.6    82.2    85.07
   C4.5       51      84.7    12.8    86.08
   SPARCCC    96      91.9    29.5    80
   CBA        210     93.5    60.3    75.7
3. heart-disease (303, 13); minor class 45% (>50 1), major class 55% (<50)
   CCRDT      13      84.8    26.1    78.88
   C4.5       30      71      17      77.58
   SPARCCC    36      85.5    46.1    68.32
   CBA        52      87.5    76.3    81.34
4. kr-vs-kp (3196, 36); minor class 47.2% (nowin), major class 52.2% (win)
   CCRDT      31      99.6    99.3    99.4
   C4.5       63      97.3    96.8    97.03
   SPARCCC    50      97.1    74.1    86.08
   CBA        94      100     6       55.1
5. mushroom (8124, 22); minor class 48.2% (p), major class 51.8% (e)
   CCRDT      26      100     100     100
   C4.5       35      100     100     100
   SPARCCC    340     0.997   100     99.85
   CBA        714     100     85.5    93.03
6. tae* (100, 5); minor class 50% (1), major class 50% (2)
   CCRDT      20      65.3    60      62.63
   C4.5       31      65.3    58      61.62
   SPARCCC    175     85.7    36      60.6
   CBA        124     93.5    52.9    79.17
Average
   CCRDT      24      85.45   74.53   83.87
   C4.5       40.83   81.80   60.70   83.11
   SPARCCC    221.5   73.73   57.48   77.35
   CBA        228     84.57   61.18   74.34

5.1 Imbalanced Data Sets

As accuracy is considered a poor performance measure for classifying imbalanced data sets, we use three other measures to estimate the performance of each classifier: the True Positive Rate (equivalent to hit rate, sensitivity and recall), the False Positive Rate (also known as false alarm rate and fall-out), and the area under the ROC curve [9].²

² Note that the ROC Area can be calculated by Weka itself (version ≥ 3.5.1).

The best value of each measure on each data set is marked in bold in the original tables.
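These measures are straightforward to reproduce outside Weka. The snippet below is an illustrative sketch using scikit-learn, which the paper itself does not use; the labels and scores are made up and only show how the three quantities relate to a contingency matrix and to per-instance scores.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and scores for illustration only; the paper's numbers come from Weka.
y_true  = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])   # 1 = minority ("positive") class
y_pred  = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1, 0])
y_score = np.array([.9, .4, .2, .1, .7, .3, .2, .1, .8, .2])  # e.g. leaf class probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
tp_rate = tp / (tp + fn)          # True Positive Rate (recall / sensitivity)
fp_rate = fp / (fp + tn)          # False Positive Rate (false alarm rate)
auc = roc_auc_score(y_true, y_score)
print(tp_rate, fp_rate, auc)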
For example, on the "breast" data set, CCRDT has the smallest model size (13 leaves), the lowest FP Rate (0.5%) and the largest ROC Area (0.974), so these values are marked in bold, while SPARCCC has the highest TP Rate (93.8%), which is also marked in bold.

As we can see from Table 2, CCRDT has the largest ROC Area for most of the data sets. On average, the lowest False Positive Rate and the largest ROC Area are both generated by CCRDT. Only the TP Rate is higher for SPARCCC, but this relies on a large number of rules: SPARCCC also has the largest number of rules on average.

[Figure 2. Average contribution of leaves or rules to ROC Area for each method on the imbalanced data sets.]

To measure the effect of each rule (leaf) on the accuracy, we divided the ROC Area by the number of leaves (or rules). This captures the average significance of each rule in each method, as shown in Figure 2. The significance of the rules (paths in the tree) of CCRDT is higher than that of the other methods, which indicates that CCRDT is capable of achieving highly accurate predictions on imbalanced data sets with a very small tree size.

Algorithm 2 Discovering the attribute with the highest information gain
Input: training data TD
Output: the index of the attribute to be split: Attri
MaxCCRDTGain(TD):
 1: Let MaxCCR = 0,
 2: Let Attri = 0,
 3: Let InfoGain = 0
 4: for each attribute A_j in TD do
 5:     Get the entropy of this unpartitioned attribute: A_j.oldEnt
 6:     Partition TD into bags according to the distinct values in A_j
 7:     for each bag B_j^i partitioned from value A_j^i do
 8:         Bag class BagCl = most common class in B_j^i
 9:         // Set up the contingency table:
10:         B_j^i.a = number of instances which have value A_j^i and the class BagCl
11:         B_j^i.b = number of instances which have value A_j^i but not the class BagCl
12:         B_j^i.c = number of instances which do not have value A_j^i but have the class BagCl
13:         B_j^i.d = number of instances which do not have value A_j^i and do not have the class BagCl
14:         B_j^i.CCR = ( B_j^i.a * (B_j^i.c + B_j^i.d) ) / ( B_j^i.b * (B_j^i.a + B_j^i.b) )
15:         if MaxCCR < B_j^i.CCR then
16:             MaxCCR = B_j^i.CCR
17:         end if
18:     end for
19:     A_j.newEnt = − Σ_i ( B_j^i.CCR / MaxCCR ) log ( B_j^i.CCR / MaxCCR )
20:     if InfoGain < A_j.oldEnt − A_j.newEnt then
21:         Attri = j,
22:         InfoGain = A_j.oldEnt − A_j.newEnt
23:     end if
24: end for
25: Return Attri
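A compact Python rendering of Algorithm 2 is sketched below; it can be passed as the select_attribute argument of the construction sketch given after Algorithm 1. It is an illustrative sketch only: it follows the contingency-table convention of Table 1 (which differs slightly from the cell labelling printed in Algorithm 2), computes CCR as in Equation 12, assumes log base 2, and handles degenerate tables (c = 0) crudely.

from collections import Counter
from math import log2

def class_entropy(records):
    """Standard class entropy of a node (Equation 4), with log base 2."""
    counts = Counter(y for _, y in records)
    n = len(records)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def max_ccrdt_gain(records, attributes):
    """Sketch of Algorithm 2: choose the attribute whose split maximises
    oldEnt - newEnt, where newEnt is the CCR-based entropy of Equation 14.
    records are (attribute_dict, label) pairs; attributes are nominal."""
    best_attr, best_gain = attributes[0], 0.0   # default to the first attribute
    old_ent = class_entropy(records)
    for attr in attributes:
        bags = {}
        for x, y in records:
            bags.setdefault(x[attr], []).append((x, y))
        ccrs = []
        for value, bag in bags.items():
            bag_class = Counter(y for _, y in bag).most_common(1)[0][0]
            # Contingency table (Table 1 convention): X = "has this value", y = bag_class.
            a = sum(1 for x, y in records if x[attr] == value and y == bag_class)
            b = sum(1 for x, y in records if x[attr] != value and y == bag_class)
            c = sum(1 for x, y in records if x[attr] == value and y != bag_class)
            d = sum(1 for x, y in records if x[attr] != value and y != bag_class)
            # CCR of Equation 12; a pure bag (c == 0) gets an infinite ratio.
            ccrs.append(a * (c + d) / (c * (a + b)) if c > 0 else float("inf"))
        max_ccr = max(ccrs)
        new_ent = 0.0
        for v in ccrs:
            p = v / max_ccr        # inf/inf -> nan and finite/inf -> 0; both contribute 0 below
            if 0 < p <= 1:
                new_ent -= p * log2(p)
        if old_ent - new_ent > best_gain:
            best_attr, best_gain = attr, old_ent - new_ent
    return best_attr

Plugged into the earlier construction sketch as build_ccrdt(data, attributes, max_ccrdt_gain), this reproduces the overall control flow of Algorithms 1 and 2, though of course not the exact behaviour of the authors' Weka implementation.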
Algorithm 3 Pruning based on p-value
Input: unpruned decision tree DT, p-value threshold pVT
Output: pruned decision tree
Prune(DT, pVT):
 1: Get the Root of DT
 2: if Root itself is not a leaf node then
 3:     for each child(i) of Root do
 4:         Prune(child(i), pVT)
 5:     end for
 6:     Root.pValue = GetpValue(Root)
 7:     if Root.pValue > pVT then
 8:         for each child(i) of Root do
 9:             child(i) ← null
10:         end for
11:         Set Root to be a leaf node which predicts the class that has the most instances in DT
12:     end if
13: end if

Algorithm 4 Subroutine for calculating the p-value
Input: a branch node BNode
Output: the p-value of this node
GetpValue(BNode):
 1: a ← BNode.a
 2: b ← BNode.b
 3: c ← BNode.c
 4: d ← BNode.d
 5: n ← a + b + c + d
 6: Return Σ_{i=0}^{min(b,c)} [ (a + b)! (c + d)! (a + c)! (b + d)! ] / [ n! (a + i)! (b − i)! (c − i)! (d + i)! ]

5.2 Balanced Data Sets

In the comparisons on balanced data sets, we compare not only the overall accuracy but also the accuracy on each class, since the two classes in a balanced data set are considered equally important. Following the same approach as in the imbalanced data set analysis, we mark the best value of each measure in bold.

As we can see from Table 3, CCRDT performs the best on all of these measurements on average. Although the difference between CCRDT and C4.5 in average overall accuracy is not very large (83.87% and 83.11% respectively), this performance of CCRDT is based on a much smaller tree (24 leaves on average) compared to C4.5 (40.83 leaves on average).

[Figure 3. Average contribution of leaves or rules to accuracy for each method on the balanced data sets.]

Again, if we divide the overall accuracy by the number of leaves (rules) (Figure 3), it is clear that the accuracy of the rules found by CCRDT is the highest among the compared methods on most of the data sets.

6. Conclusion

This paper presents a new method for learning from imbalanced data sets based on decision trees, called Class Confidence Ratio based Decision Trees (CCRDT), which uses the Class Confidence Ratio (CCR) as the splitting criterion and Fisher's Exact Test (FET) for pruning. CCR overcomes the drawback of previous support-confidence based classifiers, and FET overcomes the detrimental effect of pruning based on error estimation. CCRDT has been shown in this paper to be a better method than C4.5 [8], SPARCCC [10] and CBA [7] on both imbalanced and balanced classification.

References

[1] B. Arunasalam and S. Chawla. CCCS: a top-down associative classifier for imbalanced class distribution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 517–522, 2006.
[2] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[3] N. Chawla. C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In Workshop on Learning from Imbalanced Data Sets II, 2003.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[5] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In Knowledge Discovery in Databases: PKDD 2003, 2003.
[6] W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of IEEE ICDM, pages 369–376, 2001.
[7] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Knowledge Discovery and Data Mining, pages 80–86, 1998.
[8] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California, 1993.
[9] J. Swets. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics. L. Erlbaum Associates, Mahwah, NJ, 1996.
[10] F. Verhein and S. Chawla. Using significant, positively associated and relatively class correlated rules for associative classification of imbalanced datasets. In Seventh IEEE International Conference on Data Mining, pages 679–684, 2007.
[11] I. Witten and E. Frank. Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record, 31(1):76–77, 2002.

School of Information Technologies, J12
The University of Sydney
NSW 2006 AUSTRALIA
T +61 2 9351 3423
F +61 2 9351 3838
www.it.usyd.edu.au
ISBN 978-1-74210-087-6