School of IT
Technical Report
DECISION TREES
FOR IMBALANCED DATA SETS
TECHNICAL REPORT 630
WEI LIU AND SANJAY CHAWLA
SCHOOL OF INFORMATION TECHNOLOGIES
THE UNIVERSITY OF SYDNEY
SEPTEMBER 2008
Decision Trees for Imbalanced Data Sets

Wei Liu
School of Information Technologies
University of Sydney
Australia
[email protected]

Sanjay Chawla
School of Information Technologies
University of Sydney
Australia
[email protected]
Abstract
We propose a new variant of the decision tree for imbalanced classification. Decision trees use a greedy approach based on information gain to select the attribute on which to split. We express information gain in terms of confidence and show that, like confidence, information gain is biased towards the majority class. We overcome the bias of information gain by embedding a new measure, the ratio of the confidence of the minority and majority class, into the entropy formula. Furthermore, we prune the decision tree using p-values instead of the conventional error-based approach. Together these two changes yield a decision tree which is robust to imbalanced data. Extensive experiments and comparisons against C4.5, CBA and variants on imbalanced and balanced data confirm our claims.
1. Introduction
While there are several types of classifiers, rule-based classifiers have the distinct advantage of being easily interpretable. This is especially true in a 'data mining' setting, where the high dimensionality of data often means that very little is known a priori about the underlying mechanism
Decision trees are perhaps the most popular form of rule-based classifiers [8]. Lately, however, classifiers based on association rules have also become popular [7, 10]. These are often called associative classifiers.
Associative classifiers use association rule mining to discover interesting or significant rules from the training data, and the set of rules discovered constitutes the classifier. The canonical example of an associative classifier is CBA (Classification Based on Associations) [7], which uses the minimum support and confidence framework to find rules. The accuracy of associative classifiers depends on the quality of their discovered rules.
However, the success of both C4.5 and CBA depends on the assumption that the training data contains an equal amount of information for each class. For example, in binary classification, if there is a similar number of instances for the positive and the negative class, then both C4.5 and CBA generally perform well. On the other hand, if the training data has an imbalanced class distribution, then both types of classifier are biased towards the majority class. As it happens, it is often the accurate prediction of the minority class that is of more interest.
One way of addressing the class imbalance problem is to modify the class distribution in the training data by oversampling the minority class or undersampling the majority class. SMOTE [4], for example, uses oversampling to increase the number of minority class instances by creating synthetic examples. Further variations on SMOTE [5] have integrated boosting with sampling strategies to better model the minority class by focusing on difficult examples that belong not only to the minority class but also to the majority class.
Nonetheless, data sampling is not the only way to deal with the class imbalance problem; some specifically designed "imbalanced data oriented" algorithms can perform well on the original, unmodified imbalanced data sets. For example, a variation on associative classifiers called SPARCCC [10] has been shown to outperform CBA [7], CMAR [6] and CCCS [1] on imbalanced data sets. The downside of SPARCCC is that it generates a large number of rules. This seems to be a feature of all associative classifiers and negates many of the advantages of rule-based classification.
Our main objective is a rule-based classifier which generates far fewer rules than associative classifiers but can successfully handle both imbalanced and balanced data sets. Towards that end, we first examine theoretically how an imbalanced distribution affects the metrics used by both decision trees and associative classifiers. In particular, we study the effect of an imbalanced distribution on confidence as well as on information gain.
One issue specific to decision trees is the use of a pruning strategy to prevent the tree from overfitting the data. Traditional pruning algorithms are based on error estimation: a branch is replaced by a leaf if the predicted error rate for the leaf is lower than that for the branch. This pruning technique does not always perform well on imbalanced data sets. Chawla [3] has shown that the pruning in C4.5 can have a detrimental effect on learning from imbalanced data sets, since a lower error rate can be achieved by removing the branches that lead to minority class leaves. In our pruning process, we use Fisher's Exact Test (FET) to check whether a path in the decision tree is statistically significant, and if it is not, the leaf is pruned. As an added advantage, every rule that we obtain is also statistically significant.
The paper is structured as follows. In Section 2 we analyze the reasons why CBA and C4.5 perform poorly on imbalanced data sets. In Section 3 we present the Class Confidence Ratio (CCR) as the metric of choice for selecting attributes to split on in imbalanced classification. Section 4 presents the new decision tree algorithm, which integrates CCR and Fisher's Exact Test (FET). Experiments are presented in Section 5, and we conclude in Section 6 with directions for future work.
2. Analysis of rule-based classifiers
We analyze the metrics used by rule-based classifiers in the context of imbalanced data. We show that both associative classifiers, such as CBA and its variants, and decision trees use metrics that are not favourable towards the classification of imbalanced data sets. In particular, we first show that the ranking of rules based on confidence is biased towards the majority class. We then express information gain as a function of confidence and show that it suffers from similar problems.
2.1. CBA
The performance of associative classifiers depends on the quality of the rules discovered during the training process. We now show that in an imbalanced setting, confidence is biased towards the majority class.
Suppose we have a training data set which consists of n records, and the antecedents (denoted by X and ¬X) and classes (denoted by y and ¬y) are distributed as in Table 1.
The rule selection strategy in CBA is to find all rule items
that have support and confidence above some predefined
thresholds. For a rule X → y, the confidence of this rule is
defined as:
Table 1. An Example for CBA Analysis

                X        ¬X       Σ Instances
  y             a        b        a + b
  ¬y            c        d        c + d
  Σ Attributes  a + c    b + d    n
Conf(X → y) = Supp(X ∪ y) / Supp(X) = a / (a + c)    (1)

Similarly, we have:

Conf(X → ¬y) = Supp(X ∪ ¬y) / Supp(X) = c / (a + c)    (2)
From the definition in Equation 1, it is clear that the highest-confidence rules simply select the most frequent class among all the instances that contain the antecedent (i.e. X in this example). However, for an imbalanced data set, since the size of the positive class is always much smaller than that of the negative one, we always have a + b ≪ c + d (supposing y is the positive class). Moreover, assuming that the imbalance does not affect the distribution of antecedents, we can take X and ¬X to be nearly equally distributed.
Then, when the data set is imbalanced, a and b are both small while c and d are both large. Even if y occurs with X more frequently than ¬y does, c is unlikely to be less than a because the positive class is much smaller than the negative class. Thus it is not surprising that the right-hand side of Equation 2 tends to be lower bounded by the right-hand side of Equation 1. Consequently, even though the rule X → ¬y may not be significant, it can easily attain a high confidence.
In these circumstances, it is much harder for the confidence of a "good" rule X → y to be significantly larger than that of a "bad" rule X → ¬y. What is more, because of its low confidence, during the construction of the classifier it will be ranked behind rules which have a higher confidence merely because they predict the majority class. This is a fatal flaw, as in an imbalanced class problem it is often the minority class that is of more interest.
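To make this bias concrete, the following short Python sketch (an illustration added here, not code from the original study; the counts are hypothetical) computes Conf(X → y) and Conf(X → ¬y) for a contingency table laid out as in Table 1.

# Illustration: confidence is biased towards the majority class.
# Counts follow Table 1: a = X∧y, b = ¬X∧y, c = X∧¬y, d = ¬X∧¬y.
def confidences(a, c):
    """Conf(X -> y) = a/(a+c) and Conf(X -> ¬y) = c/(a+c)."""
    return a / (a + c), c / (a + c)

# Hypothetical imbalanced counts: y is the minority class (a + b << c + d),
# and X covers 80% of the minority class (a = 40 out of a + b = 50).
a, b, c, d = 40, 10, 400, 550

conf_y, conf_not_y = confidences(a, c)
print(f"Conf(X -> y)  = {conf_y:.3f}")      # ~0.091
print(f"Conf(X -> ¬y) = {conf_not_y:.3f}")  # ~0.909
# The rule X -> ¬y wins on confidence simply because ¬y is the majority class.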
2.2. Decision Trees
Decision trees like C4.5 use information gain to decide which attribute to split on [8]. The information gain for splitting a node t is defined as:
InfoGain_split = Entropy(t) − Σ_{i=1,2} (n_i / n) Entropy(i)    (3)

where Entropy(t) is defined as:

Entropy(t) = − Σ_j p(j|t) log p(j|t)    (4)
For a fixed training set (or any of its subsets), the number of instances in each class (i.e. over the j in Equation 4) is the same whichever attribute is considered, so the first term in Equation 3 is fixed. This means the information gain of each attribute split depends only on the second term, and maximizing the information gain reduces to maximizing the second term, − Σ_{i=1,2} (n_i / n) Entropy(i).
If we expand the second term in Equation 3, we get:

InfoGain_split = Entropy(t) − (n_1 / n) Entropy(subnode_1) − (n_2 / n) Entropy(subnode_2)    (5)

[Figure 1. Proportion of Information Gain to conf(X → y) and conf(¬X → y).]
Suppose the node t is split into two sub-nodes with two corresponding paths, X and ¬X, and the instances in each node belong to two classes denoted by y and ¬y. Then Equation 5 can be rewritten as:

InfoGain_split = Entropy(t)
    − (n_1 / n) [−p(y|X) log p(y|X) − p(¬y|X) log p(¬y|X)]
    − (n_2 / n) [−p(y|¬X) log p(y|¬X) − p(¬y|¬X) log p(¬y|¬X)]    (6)
Note that the probability of y given X is precisely the confidence of X implies y:

p(y|X) = p(X ∪ y) / p(X) = supp(X ∪ y) / supp(X) = conf(X → y)
Then if we ignore the "fixed" term (Entropy(t)) in Equation 6, we obtain the following relationship (Equation 7):

InfoGain_split = Entropy(t)
    + (n_1 / n) [conf(X → y) log conf(X → y) + conf(X → ¬y) log conf(X → ¬y)]
    + (n_2 / n) [conf(¬X → y) log conf(¬X → y) + conf(¬X → ¬y) log conf(¬X → ¬y)]

  ∝ (n_1 / n) [conf(X → y) log conf(X → y) + conf(X → ¬y) log conf(X → ¬y)]
    + (n_2 / n) [conf(¬X → y) log conf(¬X → y) + conf(¬X → ¬y) log conf(¬X → ¬y)]

  = (n_1 / n) log [conf(X → y)^conf(X → y) · conf(X → ¬y)^conf(X → ¬y)]
    + (n_2 / n) log [conf(¬X → y)^conf(¬X → y) · conf(¬X → ¬y)^conf(¬X → ¬y)]    (7)
If we denote conf(X → y) by p and conf(¬X → y) by q, then Equation 7 can be rewritten as:

InfoGain_split ∝ (n_1 / n) log [p^p (1 − p)^(1−p)] + (n_2 / n) log [q^q (1 − q)^(1−q)]
             ∝ log [p^p (1 − p)^(1−p)] + log [q^q (1 − q)^(1−q)]
             ∝ p^p (1 − p)^(1−p) + q^q (1 − q)^(1−q)    (8)
The distribution of Equation 8 as a 3-dimensional surface is shown in Figure 1. Information gain is maximized when conf(X → y) and conf(¬X → y) are both close to either 0 or 1, and is minimized when conf(X → y) and conf(¬X → y) are both close to 0.5. Note that when conf(X → y) is close to 0, conf(X → ¬y) is close to 1; and when conf(¬X → y) is close to 0, conf(¬X → ¬y) is close to 1. We can therefore conclude: information gain achieves its highest value when either X → y or X → ¬y has a high confidence, and either ¬X → y or ¬X → ¬y also has a high confidence.
Thus decision trees like C4.5 split on the attribute for which the resulting partition provides the highest confidence. This is very similar to the rule ranking mechanism of associative classifiers like CBA.
However, as we analysed in Section 2.1, for imbalanced data sets high-confidence rules are not always significant rules, and there may be significant rules that do not attain a high confidence. So the splitting criterion in C4.5 is suitable for balanced data sets, but not for imbalanced ones.
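The relationship in Equation 8 is easy to check numerically. The short sketch below (our illustration, not part of the report) evaluates the proxy p^p(1 − p)^(1−p) + q^q(1 − q)^(1−q) for several values of p = conf(X → y) and q = conf(¬X → y).

# Illustration of Equation 8: f(p, q) = p^p (1-p)^(1-p) + q^q (1-q)^(1-q),
# where p = conf(X -> y) and q = conf(¬X -> y).
def term(p):
    # p^p (1-p)^(1-p), using the convention 0^0 = 1
    left = p ** p if p > 0 else 1.0
    right = (1 - p) ** (1 - p) if p < 1 else 1.0
    return left * right

def info_gain_proxy(p, q):
    return term(p) + term(q)

for p, q in [(0.99, 0.01), (0.9, 0.1), (0.6, 0.4), (0.5, 0.5)]:
    print(f"p={p:.2f}, q={q:.2f} -> {info_gain_proxy(p, q):.3f}")
# Values of p and q near 0 or 1 push the proxy towards its maximum of 2,
# while p = q = 0.5 gives the minimum value of 1.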
3. Class Confidence Ratio and Fisher's Exact Test
Having identified the weakness of the support-confidence framework and the close relationship between information gain and confidence, we are now in a position to propose new measures to remedy the problem.
Recall that the fact that a particular class c appears frequently together with X does not necessarily mean that X "explains" the class c, because c could be the overwhelming majority class. It is therefore reasonable, instead of focusing on the antecedents (the Xs), to focus on each class and find the most significant antecedents for it. In this way, all instances are partitioned according to the class they contain, and consequently instances that belong to different classes do not have an impact on each other.
So we define a new concept, Class Confidence (CC), to find the most interesting antecedents (X) for each of the classes (y):

CC(X → y) = Supp(X ∪ y) / Supp(y)    (9)

If we use the notation in Table 1, we get the following:

CC(X → y) = a / (a + b)    (10)

CC(X → ¬y) = c / (c + d)    (11)

Then even if a + b ≪ c + d, Equations 10 and 11 are not affected by the imbalance. Consequently, rules with high Class Confidence will indeed be the significant ones, regardless of whether they are discovered from balanced or imbalanced data sets.
The ratio of the significance of a rule to that of its alternative-class counterpart has been shown in SPARCCC to be an efficient way of measuring the rule's interestingness [10]. We use the ratio of the Class Confidence of the class a rule implies to that of the alternative class, and call it the Class Confidence Ratio (CCR):

CCR(X → y) = CC(X → y) / CC(X → ¬y) = a(c + d) / (c(a + b))    (12)
A rule with high CCR means that, compared with its alternative class, the class this rule implies has higher Class
Confidence, and consequently is more likely to occur together with this rule’s antecedents regardless of the proportion of classes in the data set.
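As a quick numerical illustration (ours, not the report's), the sketch below computes CC and CCR for the same hypothetical contingency table used in Section 2.1: unlike confidence, the minority-class rule now scores higher because X genuinely favours y.

# Illustration of Class Confidence (Eqs. 9-11) and the Class Confidence
# Ratio (Eq. 12) for a contingency table laid out as in Table 1.
def class_confidences(a, b, c, d):
    cc_y = a / (a + b)       # CC(X -> y)  = Supp(X ∪ y)  / Supp(y)
    cc_not_y = c / (c + d)   # CC(X -> ¬y) = Supp(X ∪ ¬y) / Supp(¬y)
    return cc_y, cc_not_y

def ccr(a, b, c, d):
    cc_y, cc_not_y = class_confidences(a, b, c, d)
    return cc_y / cc_not_y   # equals a(c + d) / (c(a + b))

a, b, c, d = 40, 10, 400, 550   # same hypothetical imbalanced counts as before
cc_y, cc_not_y = class_confidences(a, b, c, d)
print(f"CC(X -> y)  = {cc_y:.3f}")             # 0.800
print(f"CC(X -> ¬y) = {cc_not_y:.3f}")         # 0.421
print(f"CCR(X -> y) = {ccr(a, b, c, d):.3f}")  # 1.900: X favours the minority class y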
While CCR helps us select rules that are "good" at discriminating between classes, we also want to evaluate the statistical significance of each rule. For this we use Fisher's Exact Test (FET). For a rule X → y, the FET finds the probability (p-Value) of obtaining a contingency table in which X and y are more positively associated, under the null hypothesis that {X, ¬X} and {y, ¬y} are independent [10]. The p-Value of the rule is given by:

p([a, b; c, d]) = Σ_{i=0}^{min(b,c)} [(a + b)! (c + d)! (a + c)! (b + d)!] / [n! (a + i)! (b − i)! (c − i)! (d + i)!]    (13)

Then, given a threshold for the p-Value, we can keep those rules that are statistically significant in the positively associated direction and discard those that are not. In the next section we introduce the CCR-based decision tree, which combines CCR and FET.
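For reference, a small self-contained Python sketch of the p-Value of Equation 13 and the resulting keep/discard decision is given below (our illustration; the report's actual implementation is in Weka).

from math import factorial

def fisher_p_value(a, b, c, d):
    """One-sided p-Value of Equation 13: the probability, with margins fixed,
    of tables at least as positively associated as [a, b; c, d]."""
    n = a + b + c + d
    fixed = (factorial(a + b) * factorial(c + d)
             * factorial(a + c) * factorial(b + d))
    total = 0.0
    for i in range(min(b, c) + 1):
        total += fixed / (factorial(n) * factorial(a + i) * factorial(b - i)
                          * factorial(c - i) * factorial(d + i))
    return total

a, b, c, d = 40, 10, 400, 550     # hypothetical contingency table of a rule X -> y
p = fisher_p_value(a, b, c, d)
print(f"p-Value = {p:.3e}")
# Keep the rule only if it is significant at the chosen threshold
# (0.001 is the threshold used in the experiments of Section 5).
print("keep" if p <= 0.001 else "discard")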
4. CCR-based Decision Trees
The original definition of entropy in decision trees is given in Equation 4. As analyzed in Section 2.2, the frequency term p in Equation 4 is not a good criterion for learning from imbalanced data sets. In order to integrate CCR, we modify the definition of entropy by replacing the frequency term with CCR:

Entropy_CCR(t) = − Σ_j [CCR(X → y_j) / MaxCCR] log [CCR(X → y_j) / MaxCCR]    (14)

In Equation 14, CCR(X → y_j) / MaxCCR is the new metric which replaces the former p(j|t) of Equation 4. In order to make sure that the new metric still lies between 0 and 1, we divide each CCR by MaxCCR, the largest CCR value among all the partitioned children nodes. After replacing the term in the entropy definition, the conclusion of Section 2.2 can be restated as: in CCR-based decision trees, information gain achieves its highest value when either X → y or X → ¬y has a high Class Confidence (CC), and either ¬X → y or ¬X → ¬y also has a high Class Confidence (CC).
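The fragment below (a simplified illustration under our own assumptions, not the report's Weka code) shows how Equation 14 is evaluated for one candidate split: each child's CCR is divided by MaxCCR, the largest CCR among the children, before entering the usual −r log r sum.

from math import log2

def ccr_entropy(ccr_values):
    """Entropy_CCR of Equation 14 over the children of a split.
    Base-2 logarithm is assumed, as is conventional for entropy."""
    max_ccr = max(ccr_values)
    entropy = 0.0
    for value in ccr_values:
        r = value / max_ccr          # normalised CCR, guaranteed to lie in (0, 1]
        if r > 0:                    # treat 0 * log(0) as 0
            entropy -= r * log2(r)
    return entropy

# Hypothetical CCR values of two children of a candidate split:
print(round(ccr_entropy([1.9, 0.3]), 3))   # ratios 1.00 and ~0.16 -> 0.420
print(round(ccr_entropy([1.0, 0.95]), 3))  # ratios 1.00 and 0.95  -> 0.070

The attribute whose split yields the largest drop from the pre-split entropy to this CCR-based entropy is then selected, as described in Algorithm 2 below.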
The process of creating decision trees based on CCR is described in Algorithm 1. The major difference between CCRDT and C4.5 is the way the candidate splitting attribute is selected, as shown in line 9. The process of discovering the attribute that has the highest information gain is presented in the subroutine Algorithm 2. In Algorithm 2, line 5 computes the entropy of an attribute before its splitting, lines 6 to 19 compute the new CCR-based entropy after splitting on that attribute, and lines 20 to 23 record the attribute that has the highest information gain.
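To make the selection step concrete, here is a compact, hedged sketch of an Algorithm 2-style attribute selection for categorical attributes and a binary class label. The toy records and names such as select_attribute are ours for illustration only (and zero-count contingency cells are not handled); the report's implementation is built on Weka.

from collections import Counter
from math import log2

def bag_ccr(records, attr, value, bag_class):
    """Contingency counts and CCR for the bag records[attr] == value,
    following lines 10-14 of Algorithm 2 (bag_class is the bag's majority class)."""
    a = sum(1 for r in records if r[attr] == value and r["class"] == bag_class)
    b = sum(1 for r in records if r[attr] == value and r["class"] != bag_class)
    c = sum(1 for r in records if r[attr] != value and r["class"] == bag_class)
    d = sum(1 for r in records if r[attr] != value and r["class"] != bag_class)
    return a * (c + d) / (c * (a + b))

def class_entropy(records):
    """Plain class-frequency entropy, used as the pre-split term (oldEnt)."""
    counts = Counter(r["class"] for r in records)
    total = sum(counts.values())
    return -sum((k / total) * log2(k / total) for k in counts.values())

def select_attribute(records, attributes):
    """Return the attribute with the highest CCR-based information gain."""
    best_attr, best_gain = None, 0.0
    old_ent = class_entropy(records)
    for attr in attributes:
        ccrs = []
        for value in {r[attr] for r in records}:
            bag = [r for r in records if r[attr] == value]
            bag_class = Counter(r["class"] for r in bag).most_common(1)[0][0]
            ccrs.append(bag_ccr(records, attr, value, bag_class))
        max_ccr = max(ccrs)
        new_ent = -sum((v / max_ccr) * log2(v / max_ccr) for v in ccrs if v > 0)
        if old_ent - new_ent > best_gain:
            best_attr, best_gain = attr, old_ent - new_ent
    return best_attr

# Toy, hypothetical data set: "rare" is the minority class.
data = [{"wind": "strong", "class": "rare"},   {"wind": "strong", "class": "rare"},
        {"wind": "strong", "class": "common"}, {"wind": "weak",   "class": "rare"},
        {"wind": "weak",   "class": "common"}, {"wind": "weak",   "class": "common"},
        {"wind": "weak",   "class": "common"}, {"wind": "weak",   "class": "common"}]
print(select_attribute(data, ["wind"]))   # -> 'wind' (CCR-based gain ≈ 0.61)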
In our decision tree model, we treat each branch node as the last antecedent of a rule. For example, if there are three branch nodes (BranchA, BranchB and BranchC) from the root to a leaf LeafY, then we assume a rule of the form BranchA ∧ BranchB ∧ BranchC → LeafY. In Algorithm 2, we calculate the CCR for each branch node (lines 7 to 18); under the same example, the CCR of BranchC is the CCR of the previous rule, the CCR of BranchB is the CCR of the rule BranchA ∧ BranchB → LeafY, and so on. In this way, it is guaranteed that the attribute we select is the one whose split generates the rules (paths in the tree) with the highest Class Confidence and the highest CCR.
After the creation of the decision tree, Fisher's Exact Test is applied to each branch node of the tree. A branch node is replaced by a leaf node if its p-Value is higher than the desired threshold, i.e. if the corresponding path is not statistically significant. This pruning process in CCRDT is described in lines 6 to 12 of Algorithm 3, where line 6 obtains the p-Value of a branch node. Note that the contingency table used for each branch node in the subroutine (Algorithm 4) has already been built in Algorithm 2.
5. Experiments
In our experiments, we compare CCRDT with CBA [7], C4.5 [8] and SPARCCC [10] on both balanced and imbalanced data sets. We use Weka's C4.5 implementation [11] (named J48 in Weka). CCRDT, CBA and SPARCCC were also implemented in Weka.¹
We choose 16 data sets from the UCI repository [2], of which 10 are imbalanced and 6 are balanced.
¹ Implementation source code and data sets used in the experiments can be obtained from http://www.cs.usyd.edu.au/~weiliu/CCRDT/
Algorithm 1 Creation of CCR-based Decision Tree
Input: Training Data: TD
Output: Decision Tree
——————————————————————
CCRDT(TD):
 1: Create a node as the Root of the tree
 2: if All instances are positive then
 3:    Return decision tree with one node (Root), labeled "positive"
 4: else
 5:    if All instances are negative then
 6:       Return decision tree with one node (Root), labeled "negative"
 7:    else
 8:       // Find the best splitting attribute (Attri)
 9:       Attri = MaxCCRDTGain(TD),
10:       Assign Attri to the tree root (Root = Attri),
11:       for each value vi of Attri do
12:          Add a branch for vi,
13:          Find TD(vi), the instances whose value of attribute Attri is vi
14:          if TD(vi) is empty then
15:             Add a leaf to this branch, with label = most common class in TD
16:          else
17:             Add a subtree CCRDT(TD(vi)) to this branch
18:          end if
19:       end for
20:    end if
21: end if
22: Return Root
Detailed information for each data set is shown in Table 2 and Table 3.
Class distributions are displayed in columns 3 and 4 of Table 2 and Table 3. Among the imbalanced data sets, three (those without an asterisk after the name) are originally imbalanced and have binary classes. The other seven data sets have multiple classes; for these we select one class and convert all the remaining classes to "rest". For example, the "vehicle" data originally has 4 classes, so we select one class ("bus") and treat the other three classes as "rest". Then "bus" becomes the minority class (25.8%) and "rest" the majority class (74.2%). For the balanced data sets, we modified three data sets ("arrhythmia", "optdigits" and "tae") in a similar way to use them for binary classification.
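As a small illustration of this conversion (ours; the report gives no code for it), a multi-class label column can be relabelled one-vs-rest as follows.

# Illustration: keep one class (e.g. "bus" in the vehicle data) and map
# every other class to "rest", turning a multi-class problem into a
# binary, imbalanced one.
def to_one_vs_rest(labels, target):
    return [label if label == target else "rest" for label in labels]

labels = ["bus", "van", "saab", "opel", "bus", "van"]   # hypothetical labels
print(to_one_vs_rest(labels, "bus"))
# ['bus', 'rest', 'rest', 'rest', 'bus', 'rest']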
All the experiments were carried out using 10-fold cross-validation. In CCRDT, we set the p-Value threshold to 0.001. In C4.5 we use its default confidence level of 25% for pruning. In SPARCCC, we set the CCR threshold to 1.0 and the p-Value threshold to 0.001. In CBA, we set the support threshold to 1% and the confidence threshold to 10%. The performance of each classifier is listed in the right half of Table 2 and Table 3.
Table 2. Methods Comparison on Imbalanced Data Sets

Index  Data Set    #Inst  #Attr  Minor Class         Major Class        Classifier  Leaves/Rules  TP Rate  FP Rate  ROC Area
1      breast      286    9      34.5% (malignant)   65.5% (benign)     CCRDT       13            92.5     0.5      0.974
                                                                        C4.5        14            92.5     4.4      0.955
                                                                        SPARCCC     189           93.8     2.8      0.955
                                                                        CBA         51            90.4     7.1      0.901
2      car*        1728   6      30% (unacc)         70% (rest)         CCRDT       53            95.4     3.7      0.974
                                                                        C4.5        79            93.4     4.7      0.914
                                                                        SPARCCC     66            79.9     2.8      0.886
                                                                        CBA         69            97.4     2.6      0.987
3      cmc*        1473   9      34.7% (3)           65.3% (rest)       CCRDT       19            38.2     18.9     0.643
                                                                        C4.5        65            39.1     25.2     0.609
                                                                        SPARCCC     35            63.6     42.4     0.606
                                                                        CBA         34            65.4     38.8     0.633
4      ecoli*      336    7      15.5% (pp)          84.5% (rest)       CCRDT       6             75       4.6      0.846
                                                                        C4.5        7             75       3.2      0.838
                                                                        SPARCCC     13            75       12.7     0.812
                                                                        CBA         72            66.7     9.2      0.788
5      hepatitis   155    79     20.6% (DIE)         79.4% (LIVE)       CCRDT       16            56.3     12.2     0.79
                                                                        C4.5        17            34.4     10.6     0.73
                                                                        SPARCCC     42            65.6     16.3     0.747
                                                                        CBA         34            15.6     1.6      0.57
6      sick        3772   29     6.1% (sick)         93.9% (negative)   CCRDT       24            84       1.3      0.96
                                                                        C4.5        20            88.3     5        0.951
                                                                        SPARCCC     66            96.1     14.8     0.967
                                                                        CBA         78            51.1     0        0.755
7      splice*     3190   60     24% (EI)            76% (rest)         CCRDT       56            94.1     2.8      0.971
                                                                        C4.5        57            93.6     3.5      0.957
                                                                        SPARCCC     173           99.1     6.6      0.951
                                                                        CBA         122           54.5     19.8     0.674
8      synthetic*  600    60     16.7% (cyclic)      83.3% (rest)       CCRDT       4             84       0        0.975
                                                                        C4.5        5             74       4.2      0.883
                                                                        SPARCCC     82            100      0        1
                                                                        CBA         41            90.9     0        0.955
9      vehicle*    846    18     25.8% (bus)         74.2% (rest)       CCRDT       14            93.1     1.6      0.97
                                                                        C4.5        14            93.6     1.9      0.958
                                                                        SPARCCC     277           84.9     3.8      0.905
                                                                        CBA         119           37.3     0        0.686
10     wine*       178    13     27% (3)             73% (rest)         CCRDT       3             89.6     6.2      0.944
                                                                        C4.5        6             87.5     3.1      0.94
                                                                        SPARCCC     45            100      4.6      0.977
                                                                        CBA         49            85.4     2.3      0.916
       Average                                                          CCRDT       20.8          80.22    5.18     0.91
                                                                        C4.5        28.4          77.14    6.58     0.87
                                                                        SPARCCC     98.8          85.80    10.68    0.88
                                                                        CBA         66.90         65.47    8.14     0.79

5.1. Imbalanced Data Sets
As accuracy is considered a poor performance measure for classifying imbalanced data sets, we use three other measures to estimate the performance of each classifier: the True Positive Rate (equivalent to Hit Rate, Sensitivity and Recall), the False Positive Rate (also known as False Alarm Rate and Fall-out), and the area under the ROC curve [9].² The best value of each measure on each data set is marked in bold. For example, on the "breast" data set, CCRDT has the smallest model size (13 leaves), the lowest FP Rate (0.5%) and the largest ROC Area (0.974), while SPARCCC has the highest TP Rate (93.8%).
² Note that the ROC Area can be calculated by Weka itself (version ≥ 3.5.1).
Table 3. Methods Comparison on Balanced Data Sets

Index  Data Set       #Inst  #Attr  Minor Class     Major Class    Classifier  Nodes/Rules  MinorAcc  MajorAcc  Overall Acc
1      arrhythmia*    452    280    45.8% (1)       54.2% (rest)   CCRDT       32           74.4      79.6      77.21
                                                                   C4.5        35           72.5      79.6      76.32
                                                                   SPARCCC     632          81.2      59.2      69.247
                                                                   CBA         174          32.9      86.1      61.72
2      credit-rating  286    9      45% (+)         55% (-)        CCRDT       22           88.6      82.2      85.07
                                                                   C4.5        51           84.7      12.8      86.08
                                                                   SPARCCC     96           91.9      29.5      80
                                                                   CBA         210          93.5      60.3      75.7
3      heart-disease  303    13     45% (>50 1)     55% (<50)      CCRDT       13           84.8      26.1      78.88
                                                                   C4.5        30           71        17        77.58
                                                                   SPARCCC     36           85.5      46.1      68.32
                                                                   CBA         52           87.5      76.3      81.34
4      kr-vs-kp       3196   36     47.2% (nowin)   52.2% (win)    CCRDT       31           99.6      99.3      99.4
                                                                   C4.5        63           97.3      96.8      97.03
                                                                   SPARCCC     50           97.1      74.1      86.08
                                                                   CBA         94           100       6         55.1
5      mushroom       8124   22     48.2% (p)       51.8% (e)      CCRDT       26           100       100       100
                                                                   C4.5        35           100       100       100
                                                                   SPARCCC     340          0.997     100       99.85
                                                                   CBA         714          100       85.5      93.03
6      tae*           100    5      50% (1)         50% (2)        CCRDT       20           65.3      60        62.63
                                                                   C4.5        31           65.3      58        61.62
                                                                   SPARCCC     175          85.7      36        60.6
                                                                   CBA         124          93.5      52.9      79.17
       Average                                                     CCRDT       24           85.45     74.53     83.87
                                                                   C4.5        40.83        81.80     60.70     83.11
                                                                   SPARCCC     221.5        73.73     57.48     77.35
                                                                   CBA         228          84.57     61.18     74.34

[Figure 2. Average Contribution of Leaves or Rules to ROC Area in Each Method on Imbalanced Data Sets]

[Figure 3. Average Contribution of Leaves or Rules to Accuracy in Each Method on Balanced Data Sets]
As we can see from Table 2, CCRDT has the largest ROC Area for most of the data sets.
Algorithm 2 Discovering the attribute with the highest information gain
Input: Training Data: TD
Output: The index of the attribute to be split: Attri
——————————————————————
MaxCCRDTGain(TD):
 1: Let MaxCCR = 0,
 2: Let Attri = 0,
 3: InfoGain = 0
 4: for Each attribute Aj in TD do
 5:    Get the Entropy of this unpartitioned attribute: Aj.oldEnt,
 6:    Partition TD into bags according to the distinct values in Aj
 7:    for Each bag Bji partitioned from value Aij do
 8:       Bag Class BagCl = most common class in Bji
 9:       // Set up Contingency Table:
10:       Bji.a = number of instances which have value Aij and the class BagCl,
11:       Bji.b = number of instances which have value Aij but not the class BagCl,
12:       Bji.c = number of instances which don't have value Aij but have the class BagCl,
13:       Bji.d = number of instances which don't have value Aij and not the class BagCl,
14:       Bji.CCR = Bji.a * (Bji.c + Bji.d) / (Bji.c * (Bji.a + Bji.b))
15:       if MaxCCR < Bji.CCR then
16:          MaxCCR = Bji.CCR
17:       end if
18:    end for
19:    Aj.newEnt = − Σ_i (Bji.CCR / MaxCCR) log (Bji.CCR / MaxCCR)
20:    if InfoGain < Aj.oldEnt − Aj.newEnt then
21:       Attri = j,
22:       InfoGain = Aj.oldEnt − Aj.newEnt
23:    end if
24: end for
25: Return Attri
On average, the lowest False Positive Rate and the largest ROC Area are both produced by CCRDT. Only the TP Rate is higher for SPARCCC, but this relies on a large number of rules: SPARCCC also has the largest number of rules on average. To measure the effect of each rule (leaf) on accuracy, we divide the ROC Area by the number of leaves (or rules). This captures the average significance of each rule in each method, as shown in Figure 2. The significance of the rules (paths in the tree) of CCRDT is higher than that of the other methods, which indicates that CCRDT is capable of achieving highly accurate predictions on imbalanced data sets with a very small tree.
Algorithm 3 Pruning based on p-Value
Input: Unpruned Decision Tree DT, p-Value threshold (pVT)
Output: Pruned Decision Tree
——————————————————————
Prune(DT, pVT):
 1: Get the Root of DT
 2: if Root itself is not a leaf node then
 3:    for Each child(i) of Root do
 4:       Prune(child(i), pVT)
 5:    end for
 6:    Root.pValue = GetpValue(Root)
 7:    if Root.pValue > pVT then
 8:       for Each child(i) of Root do
 9:          child(i) ← null
10:       end for
11:       Set Root to be a leaf node which predicts the class that has the most instances in DT
12:    end if
13: end if
Algorithm 4 Subroutine for calculating pValue
Input: A branch node: BN ode
Output: The p-Value of this node
——————————————————————
GetpValue(BNode):
1: a ← BN ode.a
2: b ← BN ode.b
3: c ← BN ode.c
4: d ← BN ode.d
5: n ← a + b + c + d
 6: Return Σ_{i=0}^{min(b,c)} [(a + b)!(c + d)!(a + c)!(b + d)!] / [n!(a + i)!(b − i)!(c − i)!(d + i)!]
5.2. Balanced Data Sets
In the comparisons on balanced data sets, we compare not only the overall accuracy but also the accuracy on each class, since the two classes in a balanced data set are considered equally important. Following the same approach as in the analysis of the imbalanced data sets, we mark the best value of each measure in bold. As we can see from Table 3, CCRDT performs the best on all of these measures. Although the difference between CCRDT and C4.5 in average overall accuracy is not very large (83.87% and 83.11% respectively), CCRDT achieves this performance with a much smaller tree (24 leaves on average) compared to C4.5 (40.83 leaves on average).
Again, if we divide the overall accuracy by the number of leaves (rules) (Figure 3), it is clear that the accuracy per rule of CCRDT is the highest among all the methods on most of the data sets.
6. Conclusion
This paper presents a new method for learning from imbalanced data sets based on decision trees, called Class Confidence Ratio based Decision Trees (CCRDT), which uses the Class Confidence Ratio (CCR) as the splitting criterion and Fisher's Exact Test (FET) for pruning. CCR overcomes the drawback of previous support-confidence based classifiers, and FET overcomes the detrimental effect of pruning based on error estimation. CCRDT has been shown in this paper to be a better method than C4.5 [8], SPARCCC [10] and CBA [7] on both imbalanced and balanced classification.
References
[1] B. Arunasalam and S. Chawla. CCCS: a top-down associative classifier for imbalanced class distribution. Proceedings of the 12th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 517–522,
2006.
[2] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[3] N. Chawla. C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In Workshop on Learning from Imbalanced Data Sets II, 2003.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[5] N. V. Chawla, A. Lazarevic, L. O. Hall, and K. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 107–119, 2003.
[6] W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of IEEE ICDM, pages 369–376, 2001.
[7] B. Liu, W. Hsu, and Y. Ma. Integrating classification and
association rule mining. In Knowledge Discovery and Data
Mining, pages 80–86, 1998.
[8] J. Quinlan. C4.5: Programs For Machine Learning. Morgan
Kaufmann Publishers, San Mateo, California, 1993.
[9] J. Swets. Signal detection theory and ROC analysis in psychology and diagnostics. L. Erlbaum Associates Mahwah,
NJ, 1996.
[10] F. Verhein and S. Chawla. Using significant, positively associated and relatively class correlated rules for association
classification of imbalanced datasets. In Seventh IEEE International Conference on Data Mining, pages 679–684, 2007.
[11] I. Witten and E. Frank. Data mining: practical machine
learning tools and techniques with Java implementations.
ACM SIGMOD Record, 31(1):76–77, 2002.
School of Information Technologies, J12
The University of Sydney
NSW 2006 AUSTRALIA
T +61 2 9351 3423
F +61 2 9351 3838
www.it.usyd.edu.au
ISBN 978-1-74210-087-6