A Backward Adjusting Strategy for the C4.5
Decision Tree Classifier
Jason R. Beck, Maria E. Garcia, Mingyu Zhong, Michael Georgiopoulos, Georgios Anagnostopoulos
Abstract—In machine learning, decision trees are employed
extensively in solving classification problems. In order to produce
a decision tree classifier two main steps need to be followed. The
first step is to grow the tree using a set of data, referred to as the
training set. The second step is to prune the tree; this step
produces a smaller tree with better generalization (smaller error
on unseen data). The goal of this project is to incorporate an
additional adjustment phase interjected between the growing and
pruning phases of a well known decision tree classifier, called the
C4.5 decision tree. This additional step reduces the error rate on unseen data (improves the generalization of the tree) by adjusting the non-optimal splits created in the growing phase of the C4.5 classifier. As a byproduct of our work, we also discuss how the decision tree produced by C4.5 is affected by changes to the C4.5 default parameters, such as CF (confidence factor) and MS (minimum number of split-off cases), and emphasize that CF and MS values different from the defaults lead to C4.5 trees of much smaller size and smaller error.
Index Terms—C4.5, Decision Trees
I. INTRODUCTION
A decision tree is a methodology used for classification and
regression. Its advantage lies in the fact that it is easy to
understand; also, it can be used to predict patterns with missing
values and categorical attributes. This research will focus only
on classification problems. There are many approaches that can
be used to induce a decision tree. This work focuses on the C4.5
decision tree classifier [1], which is a method that evolved from
the simpler ID3 classifier [2]. Typically, in order to obtain an
optimal tree, there are two main steps to follow. First, the tree
has to be grown; to do this a collection of training data is utilized
to grow the tree by discovering (using some criterion of merit) a
series of splits that divide the training data into progressively
smaller subsets of data that have purer class labels than their
predecessor data. The growing of the tree stops when we end up
with datasets that are pure (i.e., contain data from the same class) or the split yields no improvement to the criterion of merit. At the end of the growing phase, an over-trained tree has usually been produced. The second step in the design of an optimal tree classifier is to prune the tree; pruning results in less complex and more accurate (on unseen data) classifiers. Pruning removes parts of the tree that do not contribute to the tree's classification accuracy.

Manuscript received July 13, 2007. This work was supported in part by the National Science Foundation grants IIS-REU-0647018 and IIS-REU-0647120.
Jason Robert Beck is with the School of EECS at the University of Central Florida, Orlando, FL 32816 (e-mail: [email protected]).
Maria E. Garcia is with the University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico 00981-9018 (e-mail: [email protected]).
Mingyu Zhong is with the School of EECS at the University of Central Florida, Orlando, FL 32816 (e-mail: [email protected]).
Michael Georgiopoulos is with the School of EECS at the University of Central Florida, Orlando, FL 32816 (corresponding author; phone: 407-823-5338; fax: 407-823-5835; e-mail: [email protected]).
Georgios C. Anagnostopoulos is with the Florida Institute of Technology (FIT), Melbourne, FL 32792 (e-mail: [email protected]).
During the growing phase, if minimum error rate splitting is employed to grow the tree, there will be many "optimal" splits and it is unlikely that the true boundaries will be selected; even
worse, in many cases none of these “optimal” splits finds the
real boundary between the classes. Therefore, it is widely
accepted that another measure of merit, usually referred to as
gain (decrease) of impurity, should be used to evaluate the splits
during the growing phase. During the pruning phase the
estimated error rate is used in order to decide if a branch must be
pruned or not. The problem encountered is that the estimated
error rate may not be optimized because the splits were decided
using the gain ratio and not the minimal error rate. For this
reason we propose adding a third step, the backward adjustment
phase, to the decision tree design process. In the backward
adjustment phase the grown tree is used; a re-evaluation of the
possible splits is done bottom-up, minimizing the estimated
error rate.
The paper is organized in the following manner: In section II
we present a literature review related to our work. In section III
we describe the basic algorithms for growing and pruning a tree.
In section IV we explain the backward adjustment phase. In
section V we explain the experiments conducted, and describe
the databases that were used during the research. At the end of
this section we show our experimental results and make
appropriate observations. In section VI, we summarize our
work and reiterate our conclusions.
II. LITERATURE REVIEW
Hyafil and Rivest proved that getting the optimal tree is
NP-complete [3], and thus most algorithms employ the greedy
search and the divide-and-conquer approach to grow a tree (see
Section III for more details). In particular, the training data set
continues to be split in smaller datasets until each dataset is pure
or contains too few examples; each such dataset belongs to a
tree node, referred to as leaf, usually with a uniform class label.
The splitting rules apply to data that reside in tree nodes,
referred to as decision tree nodes.
Clearly, if all the leaves in a tree have the same class label
among the training examples they cover, the tree has 100%
accuracy on the training set. Like other approaches that design
their models from data, a decision tree classifier that is
over-adapted to the training set tends to generalize poorly, when
it is confronted with unseen instances. Unfortunately, it is not
easy to set a stopping criterion to terminate the growing process
before leaves with pure labels are discovered, because it is
difficult to foresee the usefulness of future splits until these
splits are employed in the growing process. Therefore, it has
been widely accepted in the decision tree literature that the tree
should be grown to the maximum size and then pruned back.
A number of pruning algorithms have been proposed for
decision tree classifiers, such as the Minimal Cost-Complexity
Pruning (CCP) in CART (Classification and Regression Trees;
see [4], page 66), Error Based Pruning (EBP) in C4.5 (see [1],
page 37), Minimum Error Pruning (MEP) [5, 6], Reduced Error
Pruning (REP), Pessimistic Error Pruning (PEP) (see [7] for
both REP and PEP), the MDL-Based Pruning [8],
Classifiability Based Pruning [9], the pruning using
Backpropagation Neural Networks [10], etc. Some of the
pruning algorithms are briefly analyzed and empirically
compared in [11].
Other recent research on decision trees can be roughly
divided into three categories: the extensions of the conventional
model, including the extension from axis-parallel splits to
oblique splits (see [12-14]), and multi-splits per node (see [15]);
the combination of trees and other machine learning
algorithms, such as the evolution of trees in [16], hybrid trees in
[17], and tree ensembles in [18, 19]; and the optimization of tree
implementation for large databases, usually by parallel
processing, as in [20-23].
In this project we focus on the improvement of an induced
tree that is produced by an existing, popular decision tree
algorithm, the C4.5 algorithm. It is important to note that in the
existing approaches to pruning a tree, the only option for
pruning a branch is to remove the decision node that is the root
of the branch. EBP of C4.5 allows a flexible option to “graft” a
branch, by replacing it with the most frequently selected
sub-branch under the current decision node. Nevertheless, even
in this approach no split is re-evaluated or changed without
removal. In Section IV, we argue that this option by itself may
not yield the optimal tree.
III. DECISION TREE INDUCTION
This research is based on the C4.5 algorithm developed by J. Ross Quinlan (see [1]). As with most other approaches to inducing a
tree, C4.5 employs two steps: the growing phase and the pruning
phase. During the first phase (growing), a tree is grown to a
sufficiently large size. In most cases this tree has a large number
of redundant nodes. During the second phase (pruning) the tree
is pruned and the redundant nodes are removed.
A. C4.5 Growing Phase
The pseudo-code of the algorithm for growing a C4.5 tree is
presented below:
algorithm Grow(root, S)
input: a root node, a set S of training cases
output: a trained decision tree, appended to root
    if S is pure then
        Assign the class of the pure cases to root;
        return;
    end
    if no split can yield the minimum number of cases split off then
        Assign the class of the node as the majority class of S;
        return;
    end
    Find an optimal split that splits S into subsets Sk (k = 1..K);
    foreach k = 1 to K do
        Create a node tk under root;
        Grow(tk, Sk);
    end
Fig. 1. The pseudo code for the C4.5 growing algorithm.
In order to obtain the optimal split while growing the tree (see
part of the pseudo-code above) the gain ratio should be
calculated. The following steps will describe the meaning of the
gain ratio, why it is used, and how it is derived.
The first step is to find the information known before splitting.
The following equation measures the average amount of
information needed to identify the class of a case in S, where S
represents the set of training cases and Cj represents the class
with class index j. |S| is simply the count of the number of cases
in the set S. When there are missing values in the set of training
cases, |S| is the sum of the weights of each class in the set S. The
way missing values are handled will be explained shortly.
Info(S) = - \sum_{j=1}^{k} \frac{freq(C_j, S)}{|S|} \cdot \log_2 \frac{freq(C_j, S)}{|S|}        (1)
The information for the set of training cases is simply the sum
of the information for each class multiplied by the proportion
according to which this class is represented in S.
Now that the information before splitting is known, the
information after making a split needs to be determined. This
corresponds to the average information of the resulting subsets,
after S has been divided in accordance with the n outcomes of a
test X. Note that X represents an explored split which divides the
cases in set S into n subsets, each indexed as Si. The expected
(average) information is found by taking the weighted sum of
the information of each subset, as shown below
Info_X(S) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \cdot Info(S_i)        (2)
The gain in information made by the split is simply the
difference between the amount of information needed to classify
a case before and after making the split.
Gain(X) = F \cdot (Info(S) - Info_X(S))        (3)
C4.5 handles missing values by adding a multiplier F in front of
the gain calculation. The multiplier F is the ratio of the sum of
the weights of the cases with known outcomes on the split
attribute divided by the sum of the weights of all of the cases.
The gain measure alone, however, is not a sufficient criterion, because it favors splits with very many outcomes. To overcome this problem the gain ratio is used, defined as follows:
GainRatio(X) = \frac{Gain(X)}{SplitInfo(X)}        (4)
The gain ratio divides the gain by the evaluated split
information. This penalizes splits with many outcomes.
SplitInfo(X) = - \sum_{i=1}^{n+1} \frac{|S_i|}{|S|} \cdot \log_2 \frac{|S_i|}{|S|}        (5)
The split information is the weighted average calculation of the
information using the proportion of cases which are passed to
each child. When there are cases with unknown outcomes on the
split attribute, the split information treats this as an additional
split direction. This is done to penalize splits which are made
using cases with missing values.
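To make equations (1)-(5) concrete, the following Python sketch computes the information, gain, split information, and gain ratio of a candidate split from class-label counts. It is an illustrative sketch rather than C4.5's code: it assumes no missing values (so the multiplier F defaults to 1 and the extra (n+1)-th term of equation (5) is omitted), and the function names and data layout are our own.

from collections import Counter
from math import log2

def info(labels):
    # Equation (1): average information (entropy) of a set of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, subsets, known_fraction=1.0):
    # Equations (2)-(5): gain ratio of a candidate split.
    # parent_labels:  class labels of the cases reaching the node.
    # subsets:        list of label lists, one per split outcome.
    # known_fraction: the multiplier F, i.e. the fraction of case weight with a
    #                 known outcome on the split attribute (1.0 if none missing).
    n = len(parent_labels)
    info_x = sum(len(s) / n * info(s) for s in subsets)            # Eq. (2)
    gain = known_fraction * (info(parent_labels) - info_x)         # Eq. (3)
    split_info = -sum(len(s) / n * log2(len(s) / n)
                      for s in subsets if s)                       # Eq. (5), no missing-value term
    return gain / split_info if split_info > 0 else 0.0            # Eq. (4)

# Example: a binary split of 10 cases into a pure subset and a mixed one.
parent = ['a'] * 5 + ['b'] * 5
left, right = ['a', 'a', 'a', 'a'], ['a', 'b', 'b', 'b', 'b', 'b']
print(gain_ratio(parent, [left, right]))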
After finding the best split, the tree continues to be grown
recursively using the same process. The following example
shows how a tree is grown using the Iris database. The Iris
database comes from the UCI repository [24]. This database
contains three classes of Iris flowers: iris-setosa, iris-virginica
and iris-versicolor. There are 150 training examples, 50 from
each class. Each case contains four attributes: petal length, petal
width, sepal length, and sepal width. Two of the attributes, petal
length versus petal width, are plotted below for all of the cases.
Fig. 2 shows where the first best split will be made. Fig. 3 shows
why that split was chosen by plotting the gain ratio for each of
the explored split possibilities.
Fig. 3. Diagram showing the gain ratio for each considered first split in the iris
dataset. The position with the highest gain ratio is chosen as the first split.
This split separates the iris-setosa cases from the rest of the
iris cases. There will be no more splitting of the tree node that
contains the iris-setosa cases (since this node contains data of a
pure label). The remaining iris-plant cases (residing in the other
tree node resulting from the split) will continue to be split until
there is little or no more information being gained from
additional splits, or when pure nodes result.
The graphical version of the Iris tree growing phase is shown
in Fig. 4.
Fig. 4. Graphical display of a C4.5 tree grown using the iris database. The iris
database contains 150 cases which can be used to predict between three classes:
iris-setosa, iris-versicolor, and iris-virginica. Each case contains 4 continuous
attributes measuring the iris’ petal width/length, and sepal width/length.
Fig. 2. Diagram showing the first best tree split for the Iris dataset. The first best split separates all of the iris-setosa cases from the rest of the classes, and creates a pure tree node. The remaining cases go to another tree node, which is not pure and as a result will be split even further.
B. Pruning Phase of C4.5
The following algorithm (pseudo-code) outlines the pruning
process:
algorithm Prune(node)
input: a node with an attached subtree
output: the pruned subtree; returns its estimated error
    leafError = estimated error if node were made a leaf;
    if node is a leaf then
        return leafError;
    else
        subtreeError = \sum_{t_i \in children(node)} Prune(t_i);
        branchError = estimated error if node were replaced with its most frequented branch;
        if leafError is less than branchError and subtreeError then
            make this node a leaf;
            error = leafError;
        elseif branchError is less than leafError and subtreeError then
            replace this node with its most frequented branch;
            error = branchError;
        else
            error = subtreeError;
        end
        return error;
    end
Fig. 5. The pseudo code for the C4.5 pruning algorithm. The algorithm is
recursively called so that it works from the bottom of the tree upward, removing
or replacing branches to minimize the predicted error.
Because trees that have only been grown are usually very
cumbersome, the resultant tree after pruning is easier to
understand and usually provides smaller misclassification error
on unseen data. The problem is that the estimated error rate may
not be optimized. As a remedy to this problem the idea of the
backward adjustment phase is introduced in the next section.
The following example shows how a tree is pruned. This
example uses the congressional voting database from the UCI
repository [24]. This database contains 300 cases; each case is
the voting record of a U.S. congressman taken during the second
session of congress in 1984. For this example, each case has 16
categorical attributes which represent 16 key votes made during
that year. The possible outcomes for each of these attributes
are: yea, nay, and unknown disposition. It has two classes,
democrat and republican. The following is the textual output of
the fully grown tree:
physician fee freeze = n:
| adoption of the budget resolution = y: democrat (151.0)
| adoption of the budget resolution = u: democrat (1.0)
| adoption of the budget resolution = n:
| | education spending = n: democrat (6.0)
| | education spending = y: democrat (9.0)
| | education spending = u: republican (1.0)
physician fee freeze = y:
| synfuels corporation cutback = n: republican (97.0/3.0)
| synfuels corporation cutback = u: republican (4.0)
| synfuels corporation cutback = y:
| | duty free exports = y: democrat (2.0)
| | duty free exports = u: republican (1.0)
| | duty free exports = n:
| | | education spending = n: democrat (5.0/2.0)
| | | education spending = y: republican (13.0/2.0)
| | | education spending = u: democrat (1.0)
physician fee freeze = u:
| water project cost sharing = n: democrat (0.0)
| water project cost sharing = y: democrat (4.0)
| water project cost sharing = u:
| | mx missile = n: republican (0.0)
| | mx missile = y: democrat (3.0/1.0)
| | mx missile = u: republican (2.0)
Fig. 6. The C4.5 fully grown congressional vote database tree. The numbers in
the parentheses after the display of the node represent: the sum of the weights in
that node / error at the leaf (if any). The congressional voting database contains
300 cases. Every case has 16 categorical attributes, each with possible
outcomes y, n, and u.
The tree now needs to be pruned. To explain how this is done,
only a part of the tree will be considered. The node examined is
the one containing 16 cases, and depicted below.
adoption of the budget resolution = n:
| education spending = n: democrat (6.0)
| education spending = y: democrat (9.0)
| education spending = u: republican (1.0)
Fig. 7. Subtree of the fully grown congressional voting database tree.
The cases which made it here voted nay on the physician fee
freeze and nay on the adoption of the budget resolution. The
children of this node were created using the education spending
attribute, and the number of cases in each child node is shown in
the parenthesis after the node. There are no errors in the training
set at the children nodes. Observing this part of the tree one may
notice that if the education spending answer is yea or nay the
congressman would be a democrat, while if the answer is
unknown the congressman will be a republican. Hence, only one
case (republican) is in the unknown disposition, while the
remaining 15 of the total 16 cases will be a democrat. This
means that since 15 of the 16 cases are democrat, there is a
strong indication that this node should become a leaf, predicting
that if a voting record makes it to this leaf, the designated
congressman is a democrat. To decide if the node should
become a leaf or not we need to estimate the error rates using a
confidence level (see the appendix for more details about the
confidence level used in C4.5); this will penalize small leaves.
Check the predicted error of the subtree shown in Fig. 7:

Predicted Error = 6 × 0.206 + 9 × 0.143 + 1 × 0.750 = 3.273

Check the predicted error if we prune this node to a leaf:

Predicted Error = 16 × 0.157 = 2.512

The predicted error is smaller if we prune this node to a leaf. The result after pruning this part of the tree will be the following:

| adoption of the budget resolution = n: democrat
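The short Python sketch below repeats this arithmetic with the pessimistic error rates quoted above (how those rates are obtained from the confidence factor is explained in the Appendix); the variable names are our own illustrative choices.

# Pessimistic error estimate of the subtree of Fig. 7 versus collapsing it to a leaf.
children = [(6, 0.206), (9, 0.143), (1, 0.750)]        # (cases, pessimistic error rate) per leaf
subtree_error = sum(n * rate for n, rate in children)  # 3.273
leaf_error = 16 * 0.157                                # 2.512 for the merged 16-case leaf
print(subtree_error, leaf_error)
print("prune to a leaf" if leaf_error < subtree_error else "keep the subtree")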
IV. BACKWARD ADJUST DESCRIPTION
A. Motivation
At this point the two traditional decision tree phases (growing
and pruning) have been explained. The main problem is that
during the growing process the splits that were chosen may not
have been the error optimal splits. This happens because during
the growing process the data is separated using the maximized
gain ratio to reduce entropy. This split finds a good boundary
and future splits may further separate these classes. For
example, consider Fig. 8 which shows two class distributions
that overlap. When C4.5 makes the first split, the maximized
gain ratio split is chosen to separate part of class 1 from class 2.
Nevertheless, if there is no future split under this node or all
future splits are pruned, it is clear that this split is not the Bayes
optimal split (the one minimizing the probability of error; see
Fig. 8 for an illustration of this fact).
[Figure 8: two overlapping class-conditional distributions (Class 1 and Class 2) along an attribute, with the maximum gain point and the minimum error rate (Bayes optimal) point marked.]
Fig. 8. Finding the minimum error rate point on two Gaussians. The point
which may maximize the gain may not minimize the error.
By observing Fig. 8 one may notice that the optimal split is
the one that passes through the intersection of both distributions.
In this paper, in order to reduce the error of the split, we propose
an adjustment process to be added to decision tree induction,
described in the next subsection.
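As a small numeric illustration of Fig. 8 (our own sketch; the means, variances, and priors below are arbitrary choices, not values used in the paper), the following Python snippet scans thresholds between two overlapping one-dimensional Gaussian class densities and locates the minimum error rate point, which coincides with the point where the two weighted densities intersect.

import numpy as np
from scipy.stats import norm

# Two overlapping 1-D class-conditional densities with equal priors (cf. Fig. 8).
class1 = norm(loc=0.0, scale=1.0)
class2 = norm(loc=2.0, scale=1.5)

thresholds = np.linspace(-3.0, 5.0, 2001)
# Error of the rule "predict class 1 if x < t, else class 2" under equal priors:
# class-1 cases falling above t plus class-2 cases falling below t.
error = 0.5 * class1.sf(thresholds) + 0.5 * class2.cdf(thresholds)

t_star = thresholds[np.argmin(error)]        # minimum error rate point
print(f"Bayes optimal threshold ~ {t_star:.3f}, error ~ {error.min():.3f}")
# The minimum occurs where the two weighted densities cross, which is generally
# not the point that maximizes the gain ratio on a finite training sample.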
B. Adjustment of a Tree
For the decision node pertaining to Fig. 8, it is clear that it is
beneficial to adjust the split. In general, after a decision node is
adjusted, its ancestor nodes may also be improved by the same
adjusting method. Therefore, this adjusting process attempts to
improve the accuracy of the decision nodes in a bottom-up
order. This adjustment process currently examines and adjusts
only binary splits for simplicity. The algorithm for this
backwards adjusting phase is shown below.
algorithm AdjustTree(t, S)
input: a tree rooted at node t, training set S
output: the adjusted tree (modified from the input)
    if t is a decision node then
        Divide S into subsets {Si} using the current split;
        foreach child node ti of t do
            Recursively call AdjustTree(ti, Si);
        end
        foreach possible split sk in t do
            Set the split of t to sk;
            Ek = EstimateError(t, S);
        end
        Set the split of t to sk, where k = arg min_k Ek;
        RemoveEmptyNodes(t, S);
    end
Fig. 9. The pseudo code for the backwards adjusting algorithm.
There are two options to adjust the tree: before pruning or
after pruning. In both cases, we expect that the resulting tree will
have better accuracy and be smaller than the pruned tree without
adjustment. We prefer to adjust the tree before pruning it,
because in the fully grown tree, we have more nodes to adjust
and thus the chance to improve the tree, through backwards
adjusting, is higher. Since the tree will be pruned later, the
function EstimateError in our adjusting algorithm must return
the error of the pruned result of the sub-tree rather than the
current sub-tree. This can be implemented in the same way as
C4.5’s pruning, as shown in Fig. 10.
algorithm EstimateError(t, S)
input: a tree rooted at node t, training set S
output: estimated error
    leafError = error if t were made a leaf, using the binomial estimate;
    if t is a leaf node then
        return leafError;
    else
        Divide S into subsets {Si} using the current split;
        sumError = 0;
        foreach child node ti of t do
            sumError = sumError + EstimateError(ti, Si);
        end
        if sumError < leafError then
            return sumError;
        else
            return leafError;
        end
    end
Fig. 10. The pseudo code for the estimate error function.
When estimating the error of a sub-tree, we return the smaller
value between the total error of its branches (involving recursive
calls) and the error of the root treated as a leaf. The difference is
that we do not actually prune this sub-tree in EstimateError
routine, but simply evaluate the potential best pruned sub-tree.
After adjusting, the pruning algorithm can serve as a realization
process, converting each sub-tree to its best pruned version (for
C4.5, the result may not be exactly the same as the potential best
pruned sub-tree as we evaluated before, due to the grafting
option in the final pruning process).
Strictly speaking, after one pass of the adjusting algorithm,
the leaves may cover different training examples because the
splits in their ancestor nodes may have changed. We could
repeat the same adjusting process again from the bottom
decision nodes, based on the updated leaf coverage.
Nevertheless, our preliminary experiments show that a second
run of the adjusting process does not have significant
improvements compared to the first pass, and therefore in our
experiments we apply only one pass of the adjusting algorithm
to the tree.
C. Examples
Now we present two examples to show how the backward adjustment phase optimizes the estimated error rate. In the first example we design a 2-class artificial database, as shown in Fig. 11. The following square is given with the observed class distribution.

[Figure 11: scatter of the cross database on the x-y plane, with the origin (0,0) at its center; the class of each pattern is indicated by its color (red or blue).]
Fig. 11. Distribution of the cross database. Each pattern has two attributes, which are plotted on the x and y axes shown above. The class of each pattern is colored accordingly.

The optimal split is the one that assigns 50% of the red distribution and 50% of the blue distribution to both children, as shown in Fig. 12.

[Figure 12: the optimal tree, with root split x<0 and a y<0 split in each branch, producing four leaves labeled Red and Blue.]
Fig. 12. Optimal tree for the cross dataset. This tree is optimal because the number of leaves in the tree equals the number of class regions in the dataset under consideration (in both cases this number is equal to 4).

In practice, the optimal tree is not easy to find. Most tree induction algorithms, including C4.5, consider only axis-parallel splits (vertical or horizontal in this 2-D case). For this database, after the first split, regardless of its position, each class still occupies 50% of the area in both branches. Therefore, ideally every split position has the same gain ratio or impurity. In reality, only a finite number of examples are given, so there is a position that has slightly better gain than others due to random sampling, but it is unlikely that this "best" split turns out to be the optimal one.

Fig. 13 shows a typical tree that could result after the growing phase alone. Unless grafting (replacing a sub-tree with one of its branches) is allowed in the pruning process, the root split will not be changed and the optimal tree cannot be obtained.

[Figure 13: a non-optimal grown tree whose root split is x<0.5, followed by further y<0 and x<0 splits leading to Red and Blue leaves.]
Fig. 13. Non-optimal grown tree for the cross database. The higher level splits in this tree do not find the ideal split positions.

Another example showing where the backwards adjusting phase would be beneficial is shown in Fig. 14. In this figure the gain ratio and error rate are plotted for each respective split. The two classes are represented as circles and crosses. The circles have a uniform distribution and the crosses increase in frequency across the axis. C4.5 would choose Split A, because it maximizes the gain ratio; the error-minimizing split is actually Split C. Assuming no other divisions occur below this split in the final tree, this split would benefit from the backwards adjusting phase.

[Figure 14: two panels, "Gain ratio for each split" (roughly 0 to 0.25) and "Error rate for each split" (roughly 0.3 to 0.45), evaluated at Splits A through E; the maximized gain ratio split and the error optimal split are annotated.]
Fig. 14. Gain ratio and minimum error rate split. The top graph shows the gain ratio plotted for each split. The bottom graph shows the minimum error rate position. The division C4.5 makes is not at the minimum error position, but the backwards adjusting phase can correct this issue.
V. EXPERIMENTS
In this section we will describe the software design, the
experimental design, as well as the databases used and the
results of the experiments.
A. Software Design
Quinlan’s original software was not ideal to use for our
experimentation and addition of the backwards adjusting phase.
His version was implemented in C and lacked any object
oriented design. In particular, it is very difficult to port his code
into our platform for empirical comparison. Because of this
limitation, we decided to implement C4.5 using C++ and
standard object oriented principles. This would allow for easy
experimentation with the algorithm, along with the ability to add
additional modules. We also wanted to compile the produced
code into a MEX function for use in MATLAB so that we could
use all of the tools available within MATLAB.
Our design methodology was to first develop sets of abstract
classes for interfacing. We could then provide various
implementations from these classes and specify the traditional
C4.5 phases (growing and pruning) and split types (continuous,
categorical, and grouped categorical) as well as leave the
possibility to create new phases and split types. To accomplish
this, any phase of tree development which modifies the structure
of a tree was considered to be a tree manipulator. The tree
manipulator became the abstract class from which growing,
pruning, and backwards adjusting were defined. Split types
were also defined abstractly, and continuous, categorical, and
grouped categorical splits were implemented. Split factories
were also created to produce splits for the growing and
backwards adjusting phases. A specific factory was developed
for each type of split, therefore keeping the possibility open of
developing a new type of split or a new method for choosing the
best split. A class that could evaluate a subtree and return its
error using a set of cases was also developed. This class could
be extended to measure the error using any type of error
measurement. We implemented versions to measure the
resubstitution error (the number of errors made with the training
examples), as well as the binomial estimated error (the
estimated error in EBP; see the appendix for more details).
Throughout the development, the important considerations
were reusability and expandability. More details about the
specific software design components, mentioned above, can be
found at [25].
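As an illustration of the kind of interface layering described above, the sketch below shows one possible shape of these abstractions, written here in Python for brevity (the actual implementation is in C++); the class and method names are illustrative only and are not taken from our code.

from abc import ABC, abstractmethod

class Split(ABC):
    # Abstract split; continuous, categorical, and grouped-categorical splits derive from it.
    @abstractmethod
    def branch_of(self, case):
        """Return the index of the child branch that the given case follows."""

class TreeManipulator(ABC):
    # Any phase that modifies the structure of a tree (growing, pruning, backwards adjusting).
    @abstractmethod
    def apply(self, tree, cases):
        """Modify the tree in place using the supplied training cases."""

class ErrorEvaluator(ABC):
    # Evaluates a subtree and returns its error on a set of cases,
    # e.g. the resubstitution error or the binomial estimate used by EBP.
    @abstractmethod
    def error(self, subtree, cases): ...

class SplitFactory(ABC):
    # Produces candidate splits of one type for the growing and adjusting phases.
    @abstractmethod
    def candidate_splits(self, cases, attribute): ...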
B. Experimental Design
In order to perform our experiments each database is
separated in two groups; a training set and a test set. To reduce
the randomness that results when partitioning the data into a
training set and a test set the experiments are performed for
twenty iterations with a different partitioning each time.
One problem that can be encountered if random partitioning is merely repeated is that some of the examples might appear more often in the training set than in the test set, or vice versa. V-fold cross-validation [26] can address this, but the training ratio cannot be controlled when V is fixed, and the training ratio is large when V is high.
For this reason we slightly generalize cross-validation in the
following way: we divide each database into 20 subsets of equal
size. In the ith (i = 0,1,2,…,19) test, we use subsets i mod 20, (i +
1) mod 20,…, ( i+ m -1) mod 20 for training and the others for
testing, where m is predefined (for example, if the training ratio
is 5%, m=1; if the training ratio is 50%, m=10). This approach
guarantees that each example appears exactly m times in the
training set and exactly (20 - m) times in the test set.
Furthermore, it allows the training ratio to vary. The algorithm
that accomplishes this goal is shown in Fig. 15, and pictorially
illustrated in Fig. 16:
foreach database D do
    Shuffle the examples in D;
    Partition D into 20 subsets D0, D1, ..., D19;
    for i = 0 to 19 do
        Set Training Set = \bigcup_{j=0}^{m-1} D_{(i+j) mod 20};
        Set Test Set = D - Training Set;
        Run the experiment using the Training Set;
        Evaluate the results using the Test Set;
    end
end
Fig. 15. The pseudo code for separating the training and test sets of cases.
[Figure 16: a grid of the 20 subsets, indexed 0 through 19, for the first four iterations.]
Fig. 16. Representation showing the training set (red) and the test set (green)
windowing through the subsets of cases, for iterations 0 through 3 (each row
represents one iteration).
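A minimal Python sketch of this partitioning scheme, mirroring the pseudo code of Fig. 15, is given below; the function name and the use of a fixed seed are our own choices.

import random

def windowed_partitions(examples, m, folds=20, seed=0):
    # Yield (training set, test set) pairs following the scheme of Fig. 15.
    # Each example appears in exactly m training sets and (folds - m) test sets;
    # m controls the training ratio (e.g. m = 10 gives a 50% training ratio).
    examples = list(examples)
    random.Random(seed).shuffle(examples)                  # Shuffle the examples in D
    subsets = [examples[i::folds] for i in range(folds)]   # Partition D into 20 subsets
    for i in range(folds):                                 # (sizes may differ by at most one)
        train_ids = {(i + j) % folds for j in range(m)}
        training = [x for k in sorted(train_ids) for x in subsets[k]]
        testing = [x for k in range(folds) if k not in train_ids for x in subsets[k]]
        yield training, testing

# Example: 20 iterations with a 50% training ratio (m = 10).
for training_set, test_set in windowed_partitions(range(100), m=10):
    pass  # grow, adjust, and prune a tree on training_set; evaluate on test_set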
The experimentation aims at comparing the accuracy and size of the trees generated with the backwards adjusting phase to those of trees produced using the traditional growing and pruning phases alone. The experiment
will involve growing and pruning a tree using all of the default
options and comparing that to growing, then backwards
adjusting, and finally pruning a tree using the default options.
The error rate and tree sizes will be compared between the two
approaches of designing trees. To make a fair comparison, both
approaches are based on the same grown tree in each iteration.
The comparison will also be conducted when the C4.5 default
options are not used. Specifically, two C4.5 option parameters will be independently adjusted: the minimum number of cases split off during growing, and the confidence factor used while pruning. The first is the minimum number of cases split off by a split when growing the C4.5 tree. This number defaults to two, and in Quinlan's C4.5 it is capped at 25. In our implementation of C4.5 we did not put an upper limit on the minimum number of cases split off. Instead, in our experiments this option starts at the default value of 2 and increases, in steps of 10, up to a number equal to 15% of the number of training cases. The tree is then grown using the set value of the minimum number of cases split off and then pruned. This is compared to a tree grown with the same minimum number of split-off cases, backward adjusted, and then pruned.
The other option parameter which will be adjusted is the confidence factor. This value affects how pessimistically the tree is pruned back. The confidence factor will be given values of 75%, 50%, 25%, 10%, 5%, 1%, .1%, .01%, and .001% (smaller confidence factors result in further tree pruning; a detailed explanation of the role of the confidence factor is provided in the appendix). The tree will be grown and then pruned with the set confidence factor. This tree will be compared to a tree adjusted and pruned based on the same fully grown tree and using the same set value of the confidence factor. The trees designed with these two approaches will be compared in terms of accuracy and size.

C. Databases
The experimentation uses a combination of artificial and real datasets. The real datasets are available from the University of California at Irvine (UCI) machine learning repository [24]. The databases and their individual properties are shown in Table I.

TABLE I
DATABASES USED FOR EXPERIMENTATION

Database     Examples    Attributes (Numerical / Categorical)    Classes    Majority Class %
g6c15        5004        2 / 0                                   6          16.7
g6c25        5004        2 / 0                                   6          16.7
abalone      4177        7 / 1                                   3          34.6
satellite    6435        36 / 0                                  6          23.8
segment      2310        19 / 0                                  7          14.3

The artificial Gaussian datasets g6c15 (see Fig. 17) and g6c25 are both two-dimensional, six-class datasets with 15% and 25% overlap, respectively. The overlap percentage means that the optimal Bayesian classifier would have a 15% and 25% misclassification rate on the respective datasets.

[Figure 17: scatter plot of the g6c15 artificial Gaussian dataset, X axis versus Y axis, each ranging from 0 to 1, with each class shown in a different color.]
Fig. 17. g6c15 database (6 classes and 2 continuous attributes) with 15% overlap. Each colored dot corresponds to a different class.
In the abalone dataset, each case contains 7 continuous
attributes and 1 categorical attribute. The class is the age of the
abalone. Although the original dataset used a class for each age
number, this dataset has clustered groups of ages so that the
result is a three class database. The classes are 8 and lower
(class 1), 9-10 (class 2), 11 and greater (class 3).
The satellite dataset gives the multi-spectral values of pixels
within 3x3 neighborhoods in a satellite image, and the
classification associated with the central pixel in each
neighborhood. The class is the type of land present in the
image: red soil, cotton crop, etc.
The segment dataset contains 19 numerical attributes and is
used to predict 7 classes. The classes represent seven types of
outdoor images: brick face, sky, foliage, cement, window, path,
and grass.
The waveform dataset is a three-class dataset. Each case
contains 21 numerical attributes which are used to predict three
classes of waveforms. The method of creating this dataset is
described in [4].
D. Results
As described in the experimental design section, three configurations are tested. The first compares the resulting tree after growing, adjusting, and pruning using all of the default options to a tree that is only grown and pruned using the default options. The results of this experiment are shown below in
options. The results of this experiment are shown below in
Table II. These configurations will be referred to as C4.5_DP
(default parameters) and C4.5A_DP (backwards adjusting and
default parameters).
TABLE II
EXPERIMENTAL RESULTS USING DEFAULT OPTIONS

                  Grow → Prune (C4.5_DP)                 Grow → Adjust → Prune (C4.5A_DP)
Database Names    Error Rate %     Tree Size             Error Rate %     Tree Size
                  (test set)       (# nodes)             (test set)       (# nodes)
g6c15             17.56            67.6                  17.34            63.3
g6c25             27.67            99                    26.95            94.8
abalone           39.39            397.2                 39.15            376.75
satellite         15.18            333.2                 15.06            328.5
segment           5                61.6                  4.89             59.1
waveform          24.63            328.2                 24.45            323.6
The results show a consistent improvement in both tree
accuracy and tree size across all of the examined databases.
Although the difference in error is at most .72% (for the g6c25
database), there is a slight improvement for all the sets.
The other experimental configurations that are examined
correspond to the trees produced when modifying the minimum
number of cases split off (C4.5_MS and C4.5A_MS), or the
confidence factor during pruning (C4.5_CF and C4.5A_CF).
For each dataset in each configuration the tree size versus error
rate plot was created. The plots for g6c15 are shown below
when the confidence factor (CF) is modified (Fig. 18) and when
the minimum number of cases split off (MS) is modified (Fig. 19 and 20).
[Figure 18 plot: Number of Nodes versus Error Rate for C4.5 CF (blue) and C4.5A CF (red) on g6c15, with the CF = 25% (default), CF = 10%, and CF = 5% points and the CF paired points marked.]
Fig. 18. Number of tree nodes versus tree error rate when modifying the confidence factor (CF) during C4.5 pruning for the g6c15 Gaussian dataset. The results corresponding to two configurations are illustrated: growing-pruning with the confidence factor modified (C4.5 CF) and growing-adjusting-pruning with the confidence factor modified (C4.5A CF). The pair of points for C4.5_CF and C4.5A_CF when CF = 0.25 (the default value) is highlighted in green, and pairs of points for other values of the confidence factor are also shown.
[Figure 19 plot: Number of Nodes versus Error Rate for C4.5 MS and C4.5A MS on g6c15, with the MS = 2 (default) points and the MS paired points marked.]
Fig. 19. Number of tree nodes versus tree error rate when modifying the minimum number of cases split-off (MS) during C4.5 growing for the g6c15 Gaussian dataset. The results corresponding to two configurations are illustrated: growing-pruning with the minimum number of split-off cases modified (C4.5 MS) and growing-adjusting-pruning with the minimum number of cases split-off modified (C4.5A MS). The pair of points for C4.5_MS and C4.5A_MS when MS = 2 (the default value) is highlighted in green, and pairs of points for other MS values are also shown.
Each graph (Fig. 18 and Fig. 19) shows the pairs of outcomes which correspond to the same parameter values. For each of these pairs, it is evident that the backwards adjusting phase improves or maintains the accuracy. Although the difference is minimal for the confidence factor adjustment (Fig. 18), there is an interesting side effect which has appeared as a result of these experiments. Besides the error rate decrease provided by the backwards adjusting phase, there is a slight error rate decrease and a significant drop in tree size at the knee of the curve of Fig. 18. This reduction is seen when comparing the results at the default CF value of 25% (the green-highlighted point on the blue curve) to the results at the knee of the curve, where a tree of 50 nodes (CF = 10%) achieves a better misclassification rate than the tree of 68 nodes produced with the default C4.5 CF value of 25%. This behavior was seen in the other datasets that we experimented with and will be pointed out in the rest of the paper. A similar knee-of-the-curve behavior occurs for the red curve of Fig. 18 (corresponding to C4.5A_CF), where a reduction of tree size from 62 leaves to 45 leaves without any reduction of the tree error rate is observed (CF = 5%).
Although the improvement when using the backwards adjusting phase appears to be minimal in Fig. 19, there is a section of interest along the knee of the plot in which there is a high density of outcome pairs. At this location, the error rate
difference between the paired points of C4.5 MS and C4.5A MS
is the greatest. Fig. 20 shows a closer look at this knee and plots
more pairs of outcomes between these two configurations. A set
of pairs at the knee of the curve is also shown in Table III. The
greatest difference between the error rates is for the minimum
number of cases split off value of 322. For this parameter value,
the error rate difference between the corresponding C4.5_MS
and C4.5A_MS is 5.2%. This point pair also shares
approximately the same tree size of 11 nodes. The knee of this
curve is also of interest because of the ‘fanning effect’ that can
be seen by the lines plotted between the pairs of corresponding
points (see Fig. 20). On the plot for C4.5A MS, the points at the
knee of the curve are highly concentrated at approximately the
same error rate and number of nodes position. On the contrary,
the corresponding points on the C4.5 MS plot quickly start to
exhibit an increased error rate within this window. The result is
an appreciable difference in error rate between the C4.5_MS
and the associated C4.5A_MS points (in favor of the
C4.5A_MS points; see Fig. 20 and Table III for more details).
[Figure 20 plot: zoomed Number of Nodes versus Error Rate view (roughly 4 to 20 nodes, error rate 0.19 to 0.25) for C4.5 MS and C4.5A MS on g6c15, with the MS paired points marked.]
Fig. 20. Number of tree nodes versus tree error rate for a zoomed portion of the
knee present in Fig. 19. At the zoomed location the C4.5A_MS points are
highly concentrated at the knee of the C4.5A MS plot. On the contrary, their
corresponding C4.5_MS points on the C4.5 MS plot show a consistent increase
in error rate. It is along this section of the plot where the greatest differences in
error rate between C4.5_MS and C4.5A_MS are seen.
TABLE III
g6c15 MODIFYING MINIMUM NUMBER OF CASES SPLIT OFF

                         Grow → Prune (C4.5_MS)           Grow → Adjust → Prune (C4.5A_MS)
Minimum Number of        Error Rate %    Tree Size        Error Rate %    Tree Size
Cases Split Off          (test set)      (# nodes)        (test set)      (# nodes)
282                      22.95           12.5             19.16           11
292                      23.52           11.8             19.15           11
302                      23.82           11.4             19.15           11
312                      24.10           11.3             19.15           11
322                      24.85           11.0             19.65           10.9
332                      27.03           10.2             23.30           10.2
To show that the results of Table III are statistically
significant, a t-test was performed for each of the values of
minimum split off shown in the table using the error rates
calculated for each of the twenty iterations. The t-test used a
significance level of 5%. The null hypothesis for the test is that
the means are equal.
TABLE IV
g6c25 T-TEST RESULTS (5% SIGNIFICANCE LEVEL)
NULL HYPOTHESIS → "THE MEANS ARE EQUAL"

Minimum Number of    p-value (probability that the    Reject Null Hypothesis
Cases Split Off      Null Hypothesis is true)         (Yes / No)
282                  0                                Yes
292                  0                                Yes
302                  0                                Yes
312                  0                                Yes
322                  0                                Yes
332                  .0191                            Yes
For the values of the minimum number of cases split off shown in
Table III, the results are statistically significant. The null
hypothesis (corresponding to the statement that the two mean
error rates for C4.5_MS and C4.5A_MS are equal) was
rejected.
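For completeness, the sketch below shows how such a comparison can be carried out in Python with SciPy; the per-iteration error rates are synthetic placeholders (not our measured values), and the use of a paired t-test is our assumption about the exact form of the test.

import numpy as np
from scipy import stats

# Synthetic per-iteration error rates for the two configurations (20 iterations each).
rng = np.random.default_rng(0)
err_c45  = rng.normal(loc=0.24, scale=0.01, size=20)   # C4.5_MS (placeholder)
err_c45a = rng.normal(loc=0.20, scale=0.01, size=20)   # C4.5A_MS (placeholder)

# Paired t-test; the null hypothesis is that the mean error rates are equal.
t_stat, p_value = stats.ttest_rel(err_c45, err_c45a)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, reject at 5%: {p_value < 0.05}")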
Hence, although reduction in error rate on the test set between
C4.5 and C4.5A was found to be minimal when all of the default
C4.5 settings were used (see results of Table II), a significant
reduction can be gained when C4.5 parameters are varied (see
results of Table III). We argue that varying these parameters is
usually beneficial as it can yield a tree with better accuracy
and/or size, and thus our algorithm’s improvement is practical.
The magnitude of the reduction is apparently database
dependent. In the case of the Gaussian datasets, the benefit is
seen when increasing the minimum number of cases split off.
This in turn inhibits the size of the grown tree and pruning does
not have the ability to recover from the loss in tree accuracy.
The backwards adjusting is able to reduce the error because an
ample statistical sample of cases resides at each node for higher
(than the default) MS values. The backward adjustment uses
these cases to minimize the predicted error and therefore
reduces the overall error of the entire tree while keeping the
same (or smaller) number of nodes than the original grown tree.
It could be argued that the tree could be pruned back
significantly to the same number of nodes but with greater
accuracy by simply modifying C4.5's confidence factor
value. However, this is not the case because even when the
confidence factor is set to a very small value (=.001%) the
number of nodes in the tree is still greater than 20. On the
contrary, when the minimum number of split-off cases is
modified the tree size is reduced to less than 10 nodes. In
review, the backwards adjusting phase helps reduce the error
rate which has increased as a result of initially creating such a
small tree. The end result is a smaller tree than could have been
produced by pruning alone but without the loss in accuracy that
would occur if backwards adjusting was not used.
There is a final observation that can be obtained by looking at
the results depicted by the curves (blue and red) of Fig. 19.
Although the knees of the curves are not as pronounced, we also
see that without reducing the error rate we can attain a smaller
C4.5 and C4.5A tree by simply changing the value of the
minimum number of split-off (MS) cases parameter.
Similar trends are found, and similar observations can be
made regarding the other datasets that we experimented with. In
particular, in all cases, the backwards adjusting phase produces
trees with a lower error rate than the traditional phases using
the same parameters. Furthermore, there is merit in using
parameter values other than the default parameter values for MS
and CF that C4.5 typically uses (smaller trees of the same error
rate can be produced). The following figures show the number
of nodes in the tree versus error rate plots for each of the
remaining datasets when the minimum number of cases split-off
and the confidence factor are independently adjusted.
Fig. 21 and Fig. 22 show interesting results when modifying the parameters MS and CF for the abalone dataset. In Fig. 21, increasing MS decreases both the tree size and the error rate. The knee of the curve is the location where the error rate is lowest and the tree size is significantly reduced; this position corresponds to an MS value of 52. In addition to the reductions in error and size provided by modifying MS, the backwards adjusting phase provides approximately an additional .5% reduction in error rate while keeping the tree size about the same.

[Figure 21 plot: Number of Nodes versus Error Rate for C4.5 MS and C4.5A MS on the abalone dataset, with the MS = 2 (default) points and the MS paired points marked.]
Fig. 21. Number of tree nodes versus tree error rate for the abalone dataset, when modifying the minimum number of cases split-off (MS) during C4.5 growing. When the value of MS increases, the number of nodes in the tree drops significantly along with the error rate. The knee of the curve shows the ideal parameter value for the dataset (MS = 52); the number of tree nodes produced at this MS value is significantly reduced, from around 400 when the default MS parameter is used to less than 50 when MS equals 52.

[Figure 22 plot: Number of Nodes versus Error Rate for C4.5 CF and C4.5A CF on the abalone dataset, with the CF = 25% (default) and CF = 1% points marked.]
Fig. 22. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the abalone dataset. Although the backwards adjusting has minimal (less than 1%) improvement on the error rate, modifying CF from the default of 25% to 1% results in fewer nodes and a lower error rate. The knees of the curves correspond to the CF parameter value of 1%. Here we see a significant reduction of tree sizes (from around 400, when the default CF parameter is used, to around 50, when CF = 1% is used), while the error rate is also decreased.
The g6c25 database is different from the already examined
g6c15 database in the amount of overlap of the associated
Gaussian distributions. The plots when adjusting MS and CF for
this database are very similar to the plots shown for the g6c15
Gaussian database, and show the same overall trends for both
configurations.
[Figure 23 plot: Number of Nodes versus Error Rate for C4.5 MS and C4.5A MS on the g6c25 dataset, with the MS = 2 (default) points and the MS paired points marked.]
Fig. 23. Number of tree nodes versus tree error rate for the g6c25 dataset, when modifying the minimum number of cases split-off (MS) during C4.5 growing. Similar to Fig. 19, this plot reveals the same ideal knee where the error rate differences are at the maximum. Also, when MS is set to 42, there is a tree size reduction of about 70 nodes. This corresponds to the difference between the pair of points using the default MS value of 2, and the pair of points at around 30 nodes which corresponds to an MS value of 42.

[Figure 24 plot: Number of Nodes versus Error Rate for C4.5 CF and C4.5A CF on the g6c25 dataset, with the CF = 25% (default) and CF = 5% points marked.]
Fig. 24. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the g6c25 dataset. Although minimal error reductions are observed for the backwards adjusting phase, a 5% value for CF provides a lower error rate and tree size (tree sizes of approximately 50 nodes are seen for both the red and blue curves). Backwards adjusting reduces the error at this point even further. Furthermore, although the CF value of 5% (corresponding to the most prominent knee in the curve) has the greatest reduction in error rate compared to the default parameters, the CF values of 10% and 1% also show a reduction in tree size and error rate compared to the tree sizes and error rates at the default setting of CF = 25%.

The satellite database results shown below provide similar observations as well. Although the backwards adjusting does produce some additional error reduction (albeit very minimal), decreasing the CF and increasing MS allows one, once more, to significantly decrease the tree size without affecting the error rate.

[Figure 25 plot: Number of Nodes versus Error Rate for C4.5 MS and C4.5A MS on the satellite dataset, with the MS = 2 (default) points and the MS paired points marked.]
Fig. 25. Number of tree nodes versus tree error rate for the satellite dataset, when modifying the minimum number of cases split-off (MS) during C4.5 growing. The difference in error is not easily seen because of the complete range of MS plotted. A zoomed in view of the knee is shown in Fig. 26. What is evident in this graph is the knee in the curve at approximately 100 nodes. The significant reduction in tree size from the default value is caused by increasing MS from 2 to 12. A size reduction of approximately 225 nodes is achieved with no apparent penalty in error rate.

[Figure 26 plot: zoomed Number of Nodes versus Error Rate view of the satellite MS results, with the MS paired points marked.]
Fig. 26. Number of tree nodes versus tree error rate for a zoomed portion of Fig. 25. The pairs of points are for MS values of 42 and 72, respectively. Each point in a pair has approximately the same number of nodes; however, backwards adjusting improves the error rate by approximately 1% (MS = 42 has a .8% reduction and MS = 62 has a 1.21% reduction).
[Figure 27 plot: Number of Nodes versus Error Rate for C4.5 CF and C4.5A CF on the satellite dataset, with the CF = 25% (default), CF = 5%, and CF = 1% points marked.]
Fig. 27. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the satellite dataset. Modifying the confidence factor from the default of 25% to 5% produces a tree on the blue curve (shown at the knee of the curve) with approximately 225 nodes. On the red curve the knee is for a CF value of 1% and has approximately 150 nodes. Interestingly, CF values of 10%, 5%, 1%, .1% and .01% all provide error rates and tree sizes less than the ones observed for the default parameter value.

The segment database is shown in Fig. 28 when modifying MS and in Fig. 30 when modifying CF. At the knee of Fig. 28, the pairs show a 2% difference in error rate (Fig. 29) as a result of the backwards adjusting. Modifying MS decreases the overall tree size but slightly increases the error rate. No special trend can be concluded from Fig. 30 because the difference in error rate is insignificantly small. All that can be said is that the backwards adjusting does not cause worse performance, and that decreasing the CF values increases the error rate minutely.

[Figure 28 plot: Number of Nodes versus Error Rate for C4.5 MS and C4.5A MS on the segment dataset, with the MS = 2 (default) points and the MS paired points marked.]
Fig. 28. Number of tree nodes versus tree error rate for the segment dataset, when modifying the minimum number of cases split-off (MS) during C4.5 growing. The results are similar to those found for the satellite database when using the same configuration; however, the magnitude of the decrease in nodes is less, and there is a slight gain in error when increasing MS. A zoomed in version of the knee is shown in Fig. 29.

[Figure 29 plot: zoomed Number of Nodes versus Error Rate view of the segment MS results, with the MS paired points marked.]
Fig. 29. Number of tree nodes versus tree error rate for a zoomed portion of Fig. 28. The pairs of values correspond to MS values of 42, 72, 102 and 132. The largest difference in error rate between the C4.5_MS and C4.5A_MS points is observed for the MS value of 72 (2%). Also, for each pair of values shown in the graph, C4.5_MS and C4.5A_MS create trees of the same size.

[Figure 30 plot: Number of Nodes versus Error Rate for C4.5 CF and C4.5A CF on the segment dataset, with the CF = 25% (default) and CF = 1% points marked.]
Fig. 30. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the segment dataset. The knees of both of these curves occur at the pair at approximately 47 nodes (a CF value of 1%). At this location the increase in error rate compared to the default value position is on the order of tenths of a percent, whereas the decrease in the number of nodes is approximately 10 nodes. The difference in performance provided by the backwards adjusting phase is inconsequential.
The final dataset to be examined is the waveform dataset. It is shown in Fig. 31 and Fig. 32 when modifying MS and CF, respectively. Modifying the values of MS and CF from the default settings shows a reduction in the size of the tree and an improvement in error rate in both cases.

[Figure 31 plot: Number of Nodes versus Error Rate for C4.5 MS and C4.5A MS on the waveform dataset, with the MS = 2 (default) points and the MS paired points marked.]
Fig. 31. Number of tree nodes versus tree error rate for the waveform dataset, when modifying the minimum number of cases split off (MS) during C4.5 growing. Increasing MS decreases the number of nodes as well as the error rate slightly. The knees occur at about 50 nodes for both curves and correspond to an MS of 32. At this point the error rate is also .5% less than the default parameter pair for both the blue and red curves. Backwards adjusting provides a minimal amount of error rate reduction.

[Figure 32 plot: Number of Nodes versus Error Rate for C4.5 CF and C4.5A CF on the waveform dataset, with the CF = 25% (default) and CF = 1% points marked.]
Fig. 32. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the waveform dataset. Modifying the value of CF from 25% to 1% produces a knee in the curve which yields a 1% reduction in error rate as well as a tree that is approximately half the size (160 nodes compared to 325 nodes). Interestingly, each of the other values of CF plotted (CF of 10%, 5%, .1% and .01%) also produces a knee in the curve with a lower error rate and even fewer tree nodes, compared to the error rate and tree size observed for the default CF value.

VI. SUMMARY AND CONCLUSIONS
In this work we have focused on C4.5, one of the classical,
and most often used, decision tree classifiers. We developed a
software methodology to implement the C4.5 classifier that is
applicable to other tree classifiers and a variety of other
machine learning algorithms. We have also introduced an
innovative pruning strategy referred to as the “backwards
adjusting algorithm”, explained the motivation for its
introduction, and compared its effect on the growing and
pruning phases of the C4.5 algorithm. More specifically, the
backwards adjusting algorithm was interjected in between the
growing and the pruning phases of the C4.5 algorithm.
We conducted a thorough experimentation comparing C4.5
and C4.5A (C4.5 with the backwards adjusting algorithm). In
this experimentation we used a number of simulated and real
datasets and we compared the two algorithms by using two
figures of merit: the tree size and the tree error rate. In the
process of this experimentation we changed some of the default
parameter values that the C4.5 classifier uses, namely the CF (confidence factor) value of 0.25 and the MS (minimum number of split-off cases) value of 2. Our experiments illustrated that C4.5A almost
always improves the tree error rate, compared to C4.5, and in a
few cases it significantly improved the tree error rate (e.g.,
Gaussian, segment, and satellite datasets).
An important byproduct of our work was the realization that
by changing the default parameter values in C4.5 (CF and MS)
we are able to design trees that are of much smaller size than the
ones produced by the default C4.5 algorithm. The reduction in
tree size was attained without affecting the default C4.5 error
rate. Other researchers have observed that reductions in the
C4.5 tree sizes could be attained by changing the CF value, but
to the best of our knowledge, nobody has experimented with the
MS value, or produced so thorough and illustrative examples
and associated explanations for the significant effect that the CF
and MS parameter values have on the produced C4.5 tree size.
APPENDIX
A. Binomial Estimate and Confidence Factor
In this appendix we explain the meaning of the CF value. The
definition of this parameter is not explicitly given in [1], but the
expression can be found in [27] as shown below.
\mathrm{CF} = \sum_{x=0}^{E} \binom{N}{x} \, p^{x} (1-p)^{N-x}     (6)
Equation (6) indicates that the estimated probability of error p is
a Bernoulli one-sided confidence bound. For example, if
CF=25%, the 75% confidence interval of the error rate on
unseen cases is [0, p] where p is the solution to (6) and is used as
a pessimistic error rate estimate. CF can be interpreted as the
probability that there will be E or fewer errors in N unseen cases,
after we observe E errors in N training examples from a node.
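As a quick check of this interpretation (a worked special case added here for illustration; it is not part of the original derivation), note that when no errors are observed (E = 0) the sum in (6) collapses to a single term and p has a closed form:

\mathrm{CF} = (1-p)^{N} \;\Longrightarrow\; p = 1 - \mathrm{CF}^{1/N},
\qquad \text{e.g. } \mathrm{CF}=0.25,\ N=6 \;\Rightarrow\; p \approx 0.206 .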
C4.5 sets the confidence level to a default of 25% although
this can be modified by the user via an option. The pruning
phase calculates the number of errors in the node as well as the
total number of cases from the training set passed to the node.
The cases in the node from the training set are considered to be
a statistical sample from which an error probability can be
calculated. The only unknown remaining in the above equation
is then the probability p. Hence, with known CF, N and E values
at a node, equation (6) is solved for the parameter p. This
misclassification rate p can then be multiplied by the number of
cases in a node to predict how many errors will occur. The
lower the value of CF, the more pessimistic the pruning will be
because the error rate p will increase. For example, when CF=0,
the 100% confidence interval of the error rate is clearly [0,1],
and thus the estimated error rate is 1.
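For concreteness, the following minimal Python sketch (our illustration, not code from [1] or [25]) solves (6) for p numerically by bisection, exploiting the fact that the binomial tail decreases monotonically as p grows; all function and variable names are our own.

from math import comb

def binomial_tail(E: int, N: int, p: float) -> float:
    """Right-hand side of equation (6): P(X <= E) for X ~ Binomial(N, p)."""
    return sum(comb(N, x) * p ** x * (1.0 - p) ** (N - x) for x in range(E + 1))

def pessimistic_error_rate(cf: float, N: int, E: int, tol: float = 1e-9) -> float:
    """Solve binomial_tail(E, N, p) = cf for p by bisection on [0, 1].

    The tail probability decreases monotonically as p grows, so bisection
    converges; lower CF values yield a larger (more pessimistic) p.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binomial_tail(E, N, mid) > cf:
            lo = mid  # tail still above CF: the bound p lies further right
        else:
            hi = mid  # tail fell below CF: the bound p lies further left
    return 0.5 * (lo + hi)

# Example: 0 errors in 6 training cases at the default CF of 25% gives
# p ~= 0.206, matching the closed form 1 - 0.25**(1/6); the predicted
# number of errors at such a node is then 6 * p ~= 1.24.
print(round(pessimistic_error_rate(0.25, 6, 0), 3))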
REFERENCES
[1] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[2] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[3] L. Hyafil and R. L. Rivest, "Constructing Optimal Binary Decision Trees is NP-Complete," Inf. Process. Lett., vol. 5, pp. 15-17, 1976.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth, 1984.
[5] B. Cestnik, "Estimating Probabilities: A Crucial Task in Machine Learning," 1990.
[6] T. Niblett and I. Bratko, "Learning decision rules in noisy domains," Brighton, United Kingdom, 1986.
[7] J. R. Quinlan, "Simplifying decision trees," Int. J. Hum.-Comput. Stud., vol. 51, pp. 497-510, 1999.
[8] M. Mehta, J. Rissanen, and R. Agrawal, "MDL-Based Decision Tree Pruning," 1995.
[9] M. Dong and R. Kothari, "Classifiability Based Pruning of Decision Trees," 2001.
[10] B. Kijsirikul and K. Chongkasemwongse, "Decision tree pruning using backpropagation neural networks," 2001.
[11] F. Esposito, D. Malerba, and G. Semeraro, "A Comparative Analysis of Methods for Pruning Decision Trees," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, pp. 476-491, 1997.
[12] X. Li, J. R. Sweigart, J. T. C. Teng, J. M. Donohue, L. A. Thombs, and S. M. Wang, "Multivariate decision trees using linear discriminants and tabu search," IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 33, pp. 194-205, 2003.
[13] S. Murthy, S. Kasif, and S. Salzberg, "A system for induction of oblique decision trees," Journal of Artificial Intelligence Research, vol. 2, pp. 1-32, 1994.
[14] C. T. Yildiz and E. Alpaydin, "Omnivariate Decision Trees," IEEE Trans. Neural Networks, vol. 12, pp. 1539-1547, 2001.
[15] T. Elomaa and J. Rousu, "General and Efficient Multisplitting of Numerical Attributes," Machine Learning, vol. 36, pp. 201-244, 1999.
[16] E. Cantú-Paz and C. Kamath, "Inducing oblique decision trees with evolutionary algorithms," IEEE Trans. Evol. Comput., vol. 7, pp. 54-68, 2003.
[17] M. Zorman and P. Kokol, "Hybrid NN-DT cascade method for generating decision trees from backpropagation neural networks," presented at Proc. 9th International Conference on Neural Information Processing (ICONIP '02), 2002.
[18] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization," Machine Learning, vol. 40, pp. 139-158, 2000.
[19] L. Todorovski and S. Džeroski, "Combining classifiers with meta decision trees," Machine Learning, vol. 50, pp. 223-249, 2003.
[20] R. Jin and G. Agrawal, "Communication and Memory Efficient Parallel Decision Tree Construction," presented at 3rd SIAM International Conference on Data Mining, San Francisco, CA, 2003.
[21] M. V. Joshi, G. Karypis, and V. Kumar, "ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets," presented at 11th International Parallel Processing Symposium, 1998.
[22] M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A fast scalable classifier for data mining," presented at 5th International Conference on Extending Database Technology, Avignon, France, 1996.
[23] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining," presented at 22nd International Conference on Very Large Databases, Mumbai, India, 1996.
[24] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI Repository of machine learning databases," University of California, Irvine, CA, 1998.
[25] J. R. Beck, "Implementation and Experimentation with C4.5 Decision Trees," in Electrical and Computer Engineering. Orlando: University of Central Florida, 2007.
[26] M. Stone, "Cross-validation: A review," Mathematics, Operations and Statistics, vol. 9, pp. 127-140, 1978.
[27] T. Windeatt and G. Ardeshir, "An Empirical Comparison of Pruning Methods for Ensemble Classifiers," Proc. of Advances in Intelligent Data Analysis (IDA2001), vol. LNCS2189, pp. 208-217, 2001.