A Robust Decision Tree Algorithm for Imbalanced Data Sets
Wei Liu (1), Sanjay Chawla (1), David A. Cieslak (2) and Nitesh V. Chawla (2)
(1) School of Information Technologies, The University of Sydney
(2) Computer Science and Engineering Department, University of Notre Dame

C4.5 on Balanced Data
When the data are balanced, C4.5 learns a good boundary between the two classes.

Imbalanced Data
How does C4.5 perform on imbalanced data?

C4.5 on Imbalanced Data
When the data are imbalanced (e.g. +:- = 1:10), information gain is biased towards the majority class.

C4.5 on Data Imbalance (Cont'd)
We plot the sum of the subnodes' entropy as a function of the true positive rate and the false positive rate. The left figure shows the sum of subnodes' entropy learned from balanced data; the right one is learned from imbalanced data (class ratio +:- = 1:10). The entropy surface becomes distorted when the data are imbalanced.

Why is C4.5's performance so poor?
C4.5 uses information gain:
\[
\mathrm{InfoGain}_{\mathrm{split}} = \mathrm{Entropy}(t) - \sum_{i=1,2} \frac{n_i}{n}\,\mathrm{Entropy}(i)
= \mathrm{Entropy}(t)
- \frac{n_1}{n}\bigl[-p(y|X)\log p(y|X) - p(\neg y|X)\log p(\neg y|X)\bigr]
- \frac{n_2}{n}\bigl[-p(y|\neg X)\log p(y|\neg X) - p(\neg y|\neg X)\log p(\neg y|\neg X)\bigr]
\]
The undesirable performance of C4.5 on imbalanced data stems from the probabilities above. Taking p(y|X) as an example, the information gain is maximized when p(y|X) is close to 1 or 0 (equivalently, when p(\neg y|X) is close to 0 or 1).

The relationship with association rules
The probability of y given X is exactly the confidence of the rule X -> y:
\[
p(y|X) = \frac{p(X, y)}{p(X)} = \frac{\mathrm{Support}(X, y)}{\mathrm{Support}(X)} = \mathrm{Confidence}(X \rightarrow y)
\]
For this reason, C4.5 suffers the same bias as association rule mining, which is designed to discover "frequent" items.

Other decision trees
Other decision trees such as CART have the same problem through the Gini index:
\[
\mathrm{Gini}(t) = 1 - \sum_j p(j|t)^2
\]
where p(j|t) plays the same role as p(y|X) in information gain.

Our solution to this problem
Instead of looking at features X, we focus on classes y and look for the most significant antecedents associated with each class. We must ensure that the class implied by a path of the tree is more interesting than the alternative class:
Class Confidence:
\[
CC(X \rightarrow y) = \frac{p(X, y)}{p(y)}
\]
Class Confidence Proportion:
\[
CCP(X \rightarrow y) = \frac{CC(X \rightarrow y)}{CC(X \rightarrow y) + CC(X \rightarrow \neg y)}
\]
CCP replaces p(y|X) in C4.5/CART, and we call the resulting entropy "CCP-embedded entropy".

Robustness of CCP
With respect to tpr and fpr, the information gain computed from CCP-embedded entropy is unaffected by data imbalance.
Figure: the model on the left is trained on balanced data; the one on the right is trained on imbalanced data (class ratio +:- = 1:10).
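To make the CC/CCP definitions above concrete, here is a minimal sketch (not the authors' implementation) of how class confidence, CCP, and a CCP-embedded subnode entropy for one binary split could be computed from 2x2 split counts. The function names (class_confidence, ccp, ccp_split_entropy) and the example counts are illustrative assumptions.

```python
import math

def class_confidence(n_Xy, n_y):
    """CC(X -> y) = p(X, y) / p(y), estimated from counts."""
    return n_Xy / n_y if n_y else 0.0

def ccp(n_Xy, n_X_not_y, n_y, n_not_y):
    """CCP(X -> y) = CC(X -> y) / (CC(X -> y) + CC(X -> not-y))."""
    cc_pos = class_confidence(n_Xy, n_y)
    cc_neg = class_confidence(n_X_not_y, n_not_y)
    denom = cc_pos + cc_neg
    return cc_pos / denom if denom else 0.0

def binary_entropy(p):
    """Entropy of a Bernoulli(p) variable, in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def ccp_split_entropy(tp, fp, fn, tn):
    """Weighted sum of subnode entropies with p(y|X) replaced by CCP.

    tp/fp: positives/negatives reaching the branch that satisfies X;
    fn/tn: positives/negatives reaching the other branch (not X).
    The CCP-embedded information gain is Entropy(t) minus this value.
    """
    n = tp + fp + fn + tn
    n_pos, n_neg = tp + fn, fp + tn
    ccp_left = ccp(tp, fp, n_pos, n_neg)    # branch X
    ccp_right = ccp(fn, tn, n_pos, n_neg)   # branch not-X
    n_left, n_right = tp + fp, fn + tn
    return (n_left / n) * binary_entropy(ccp_left) + \
           (n_right / n) * binary_entropy(ccp_right)

# Example: a heavily imbalanced split (+:- roughly 1:10 overall)
print(ccp_split_entropy(tp=80, fp=100, fn=20, tn=900))
```

Because both CC terms are normalised by their own class priors, the resulting CCP value is driven by how well the branch separates the classes rather than by how frequent each class is, which is the skew-insensitivity illustrated on the previous slide.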
Tree pruning
The pruning of conventional trees is based on "error estimation", which has all the drawbacks of accuracy-based measures when dealing with imbalanced data.
Our solution: we use Fisher's exact test (FET) to evaluate the significance of each branch of the tree. Among all paths of a tree, we keep only those whose p-values from Fisher's exact test are lower than a threshold (e.g. 1e-4); a minimal sketch of this significance check is given after the conclusions.
We design a two-stage pruning algorithm:
- Stage 1 is a bottom-up checking process from each leaf to the root: a node is marked "pruneable" if it and all of its descendants are not significant.
- Stage 2 is a top-down pruning process from the root to the leaves, driven by the "pruneable" status of each node.
The total time complexity of pruning is O(n^2).

Data sets in Experiments
5 x 2 cross-validation; AUC as the evaluation metric; Friedman test for significance checking.
Table: data sets derived from real-world and UCI sources.

Data Set    Instances  Attributes  MinClass %
Boundary    3505       175         3.5%
Breast      569        30          37.3%
Cam         28374      132         5.0%
Covtype     38500      10          7.1%
Estate      5322       12          12.0%
Fourclass   862        2           35.6%
German      1000       24          30.0%
Ism         11180      6           2.3%
Letter      20000      16          3.9%
Oil         937        50          4.4%
Page        5473       10          10.2%
Pendigits   10992      16          10.4%
Phoneme     2700       5           29.3%
PhosS       11411      481         5.4%
Pima        768        8           34.9%
Satimage    6430       37          9.7%
Segment     2310       20          14.3%
Splice      1000       60          4.8%
SVMguide    3089       4           35.3%

Comparisons on splitting criteria
Table: splitting-criterion comparisons (area under ROC) on the imbalanced data sets; all trees are unpruned.

                       C4.5       CCP-C4.5   CART       CCP-CART   HDDT     SPARCCC
Avg. Rank              3.95       1.6        4.15       2.09       2.4      5.65
Friedman Test (C4.5)   X 9.6E-5   Base
Friedman Test (CART)                         X 9.6E-5   Base
Friedman Test (Other)             Base                             0.0896   X 1.3E-5

Pruning comparisons
Table: pruning strategy comparisons on AUC.

                       C4.5                   CCP-C4.5
                       Err.Est    FET         Err.Est    FET
Avg. Rank              3.0        2.15        2.65       1.55
Friedman Test (C4.5)   X 0.0184   Base
Friedman Test (CCP)                           X 0.0076   Base

Table: pruning strategy comparisons on number of leaves.

                       C4.5                   CCP-C4.5
                       Err.Est    FET         Err.Est    FET
Avg. Rank              2.3        2.45        2.25       2.7
Friedman Test (C4.5)   Base       0.4913
Friedman Test (CCP)                           Base       0.2513

Comparison with sampling techniques
Table: AUC comparisons of the original, sampling-based and CCP-based techniques.

                Original              Sampling based         CCP based
                CART       C4.5       W+CART      W+C4.5     CCP-CART   CCP-C4.5
Avg. Rank       5.05       3.85       4.15        2.7        1.75       2.3
Friedman Test   X 9.6E-5   X 0.0076   X 5.8E-4    X 0.0076   Base       0.3173

Conclusions
- Class confidence proportion (CCP) is a measure that is insensitive to class imbalance when evaluating splits on imbalanced data sets.
- The two-stage pruning algorithm based on Fisher's exact test preserves both the correctness and the completeness of pruned trees.
- CCP-based trees significantly outperform sampling methods built on traditional decision trees.
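As the backup sketch referenced on the tree-pruning slide: the snippet below illustrates the FET significance check for a single tree path, assuming a 2x2 contingency table of (covers the path test, class label) counts and the 1e-4 threshold quoted on the slide. The function name and table layout are illustrative assumptions, not the authors' code; only this per-path check is shown, the two-stage bottom-up/top-down traversal is omitted.

```python
# Minimal sketch of the FET significance check used per tree path.
# Assumes scipy is available; names and the example counts are illustrative.
from scipy.stats import fisher_exact

def path_is_significant(tp, fp, fn, tn, threshold=1e-4):
    """Return True if the path (rule body -> predicted class) is significant.

    Contingency table for one path:
                           predicted class   other class
        satisfies path         tp                fp
        does not satisfy       fn                tn
    A one-sided test asks whether the path captures the predicted class
    more often than chance alone would explain.
    """
    _, p_value = fisher_exact([[tp, fp], [fn, tn]], alternative="greater")
    return p_value < threshold

# Example: a leaf covering 30 positives and 2 negatives, out of
# 50 positives and 950 negatives in total.
print(path_is_significant(tp=30, fp=2, fn=20, tn=948))
```

In Stage 1 of the pruning described earlier, a node whose own path and all descendant paths fail this check would be marked "pruneable"; Stage 2 then removes the marked subtrees top-down.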