A Robust Decision Tree Algorithm for Imbalanced Data Sets

Wei Liu¹, Sanjay Chawla¹, David A. Cieslak², and Nitesh V. Chawla²
¹ School of Information Technologies, The University of Sydney
² Computer Science and Engineering Department, University of Notre Dame
C4.5 on Balanced Data
When data is balanced, C4.5 gives a good boundary between the two
classes.
Imbalanced Data
How does C4.5 perform on imbalanced data?
C4.5 on Imbalanced Data
When the data are imbalanced (e.g. +:- = 1:10), information gain is biased towards the majority class.
C4.5 on Data Imbalance (Cont'd)
We plot the sum of the subnodes' entropy as a function of the true positive rate and the false positive rate.
The left figure shows the sum of subnodes' entropy learned from balanced data; the right one is learned from imbalanced data (class ratio +:- = 1:10). We can observe that the pattern of the entropy becomes distorted when the data are imbalanced.
Why does C4.5 perform so badly?
C4.5 uses information gain
\[
\begin{aligned}
\mathrm{InfoGain}_{\mathrm{split}}
  &= \mathrm{Entropy}(t) - \sum_{i=1,2} \frac{n_i}{n}\,\mathrm{Entropy}(i) \\
  &= \mathrm{Entropy}(t)
     - \frac{n_1}{n}\bigl[-p(y|X)\log p(y|X) - p(\neg y|X)\log p(\neg y|X)\bigr] \\
  &\qquad\;
     - \frac{n_2}{n}\bigl[-p(y|\neg X)\log p(y|\neg X) - p(\neg y|\neg X)\log p(\neg y|\neg X)\bigr]
\end{aligned}
\]
C4.5's poor performance on imbalanced data results from the behaviour of the above probabilities.
Taking p(y|X) as an example, the information gain is maximized when p(y|X) is close to 1 or 0 (equivalently, when p(¬y|X) is close to 0 or 1).
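As a rough illustration (a minimal Python sketch; the function names and toy counts are ours, not the paper's code), the gain of a binary split is driven by how close p(y|X) is to 0 or 1, so under a 1:10 class ratio a branch that captures most of the positives can still receive a much lower gain than a branch whose p(y|X) is near 1:

import math

def entropy(pos, neg):
    """Binary entropy of a node containing `pos` positive and `neg` negative examples."""
    n = pos + neg
    if n == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / n
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(pos_X, neg_X, pos_notX, neg_notX):
    """Information gain of splitting on test X, given class counts in each branch."""
    n1, n2 = pos_X + neg_X, pos_notX + neg_notX
    n = n1 + n2
    parent = entropy(pos_X + pos_notX, neg_X + neg_notX)
    children = (n1 / n) * entropy(pos_X, neg_X) + (n2 / n) * entropy(pos_notX, neg_notX)
    return parent - children

# Toy counts (illustrative): with 100 positives and 1000 negatives, a branch
# that covers 90 of the positives but also 300 negatives has p(y|X) ~ 0.23,
# so its gain is much smaller than that of a branch with p(y|X) ~ 0.9.
print(info_gain(pos_X=90, neg_X=300, pos_notX=10, neg_notX=700))
print(info_gain(pos_X=90, neg_X=10, pos_notX=10, neg_notX=990))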
The relationship with association rules
The probability of y given X is equivalent to the confidence of the rule X → y:
\[
p(y \mid X) = \frac{p(X, y)}{p(X)} = \frac{\mathrm{Support}(X, y)}{\mathrm{Support}(X)} = \mathrm{Confidence}(X \to y)
\]
For this reason, C4.5 suffers from the same bias as confidence-based association rule mining, which is designed to discover "frequent" items.
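A small worked example (toy counts, not from the paper): with 100 positive and 900 negative examples, suppose X covers 60 of the positives and 200 of the negatives. Then
\[
p(y \mid X) = \frac{p(X, y)}{p(X)} = \frac{60/1000}{260/1000} \approx 0.23 = \mathrm{Confidence}(X \to y),
\]
so although X captures 60% of the minority class, its confidence stays low simply because the majority class dominates p(X).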
Other decision trees
Other decision trees, such as CART, have the same problem when using the Gini index:
\[
\mathrm{Gini}(t) = 1 - \sum_{j} p(j \mid t)^2
\]
where p(j|t) plays the same role as p(y|X) in information gain.
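A minimal Python sketch (the toy counts are illustrative, not from the paper) showing that Gini impurity has the same dependence on the node's class proportions:

def gini(counts):
    """Gini impurity of a node, given the per-class example counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Like entropy, Gini is minimized only when one class dominates the node,
# i.e. when p(j|t) is close to 0 or 1 -- the same source of bias.
print(gini([90, 300]))   # mixed node: high impurity
print(gini([90, 10]))    # nearly pure node: low impurity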
Our solution to this problem
Instead of looking at features X, we focus on classes y and look for the most significant antecedents associated with that class.
Class Confidence:
\[
CC(X \to y) = \frac{p(X, y)}{p(y)}
\]
It is necessary to ensure that the class implied by each path of the tree is more interesting than the alternative class:
Class Confidence Proportion:
\[
CCP(X \to y) = \frac{CC(X \to y)}{CC(X \to y) + CC(X \to \neg y)}
\]
CCP is used to replace p(y |X ) in C4.5/CART, and we call the new
entropy “CCP-embedded entropy”.
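A minimal sketch of the idea (Python; the function names and toy counts are ours, not the paper's implementation), computing CCP for a branch and the branch entropy with CCP substituted for p(y|X):

import math

def ccp(pos_branch, pos_total, neg_branch, neg_total):
    """Class Confidence Proportion for the rule X -> y.
    CC(X -> y) = p(X, y) / p(y) = pos_branch / pos_total (analogously for not-y)."""
    cc_pos = pos_branch / pos_total        # CC(X -> y)
    cc_neg = neg_branch / neg_total        # CC(X -> not-y)
    return cc_pos / (cc_pos + cc_neg)

def ccp_entropy(pos_branch, pos_total, neg_branch, neg_total):
    """Entropy of a branch with CCP substituted for p(y|X)."""
    p = ccp(pos_branch, pos_total, neg_branch, neg_total)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Branch covering 60 of 100 positives and 200 of 900 negatives:
# plain confidence p(y|X) = 60/260 ~ 0.23, but CCP = 0.6/(0.6 + 0.22) ~ 0.73,
# i.e. the branch is recognised as more strongly associated with the minority
# class than with the majority class, regardless of the class ratio.
print(ccp(60, 100, 200, 900), ccp_entropy(60, 100, 200, 900))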
Robustness of CCP
With respect to the true positive rate (tpr) and false positive rate (fpr), the information gain from CCP-embedded entropy is unaffected by data imbalance.
Figure: The model on the left is trained on balanced data; the one on the right is trained on imbalanced data (class ratio +:- = 1:10).
Tree pruning
The pruning of conventional trees is based on "error estimation", which has all the drawbacks of accuracy-based measures in dealing with imbalanced data.
Our solution:
We use Fisher's exact test to evaluate the significance of each branch of the tree. Among all paths of a tree, we keep only those whose p-values from Fisher's exact test are lower than a threshold (such as 1e-4).
We design a two-stage pruning algorithm:
Stage 1 is a bottom–up checking process from each leaf to the root. A
node is marked “pruneable” if it and all of its descendants are not
significant.
Stage 2 is a top–down pruning process performed according to the
“pruneable” status of each node from root to leaves.
The total time complexity of pruning is O(n²).
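A minimal sketch of the significance check behind the pruning step (Python with SciPy; the 2x2 table layout is the natural one for a path versus the two classes, and the helper name is ours):

from scipy.stats import fisher_exact

def branch_is_significant(pos_covered, neg_covered, pos_total, neg_total, threshold=1e-4):
    """Fisher's exact test on the 2x2 table of (covered by this path / not covered)
    versus (positive / negative). The branch is kept only if the association
    between the path and the class is significant at `threshold`."""
    table = [[pos_covered, neg_covered],
             [pos_total - pos_covered, neg_total - neg_covered]]
    _, p_value = fisher_exact(table)
    return p_value < threshold

# A path covering 60 of 100 positives but only 200 of 900 negatives is a
# significant association and would survive pruning.
print(branch_is_significant(60, 200, 100, 900))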
Data sets in Experiments
5 × 2 CV;
AUC as the evaluation metric;
Friedman test for significance checking;
Table: Data sets are derived from the real world and UCI.

Data Sets     Instances   Attributes   MinClass %
Boundary           3505          175         3.5%
Breast              569           30        37.3%
Cam               28374          132         5.0%
Covtype           38500           10         7.1%
Estate             5322           12        12.0%
Fourclass           862            2        35.6%
German             1000           24        30.0%
Ism               11180            6         2.3%
Letter            20000           16         3.9%
Oil                 937           50         4.4%
Page               5473           10        10.2%
Pendigits         10992           16        10.4%
Phoneme            2700            5        29.3%
PhosS             11411          481         5.4%
Pima                768            8        34.9%
Satimage           6430           37         9.7%
Segment            2310           20        14.3%
Splice             1000           60         4.8%
SVMguide           3089            4        35.3%
Comparisons on splitting criteria
Table: Splitting criteria comparisons (Area Under ROC) on imbalanced data sets where all trees are unpruned.

                        C4.5       CCP-C4.5   CART       CCP-CART   HDDT      SPARCCC
Avg. Rank               3.95       1.6        4.15       2.09       2.4       5.65
Friedman Test (C4.5)    X 9.6E-5   Base
Friedman Test (CART)                          X 9.6E-5   Base
Friedman Test (Other)              Base                             0.0896    X 1.3E-5
Pruning comparisons
Table: Pruning strategy comparisons on AUC
                        C4.5                    CCP-C4.5
                        Err.Est     FET         Err.Est     FET
Avg. Rank               3.0         2.15        2.65        1.55
Friedman Test (C4.5)    X 0.0184    Base
Friedman Test (CCP)                             X 0.0076    Base
Table: Pruning strategy comparisons on number of leaves.
                        C4.5                    CCP-C4.5
                        Err.Est     FET         Err.Est     FET
Avg. Rank               2.3         2.45        2.25        2.7
Friedman Test (C4.5)    Base        0.4913
Friedman Test (CCP)                             Base        0.2513
Comparison with sampling techniques
Table: AUC comparisons of the original, sampling-based, and CCP-based techniques.

                   Original               Sampling based         CCP based
                   CART       C4.5        W+CART      W+C4.5     CCP-CART   CCP-C4.5
Avg. Rank          5.05       3.85        4.15        2.7        1.75       2.3
Friedman Test      X 9.6E-5   X 0.0076    X 5.8E-4    X 0.0076   Base       0.3173
Conclusions
Class confidence proportion (CCP) is a splitting measure that is insensitive to class imbalance when evaluating imbalanced data sets;
The two-stage pruning algorithm based on Fisher's exact test preserves both the correctness and the completeness of the pruned trees;
Sampling methods applied to traditional decision trees are significantly outperformed by CCP-based trees.