A Backward Adjusting Strategy for the C4.5 Decision Tree Classifier

Jason R. Beck, Maria E. Garcia, Mingyu Zhong, Michael Georgiopoulos, Georgios Anagnostopoulos

Abstract—In machine learning, decision trees are employed extensively in solving classification problems. In order to produce a decision tree classifier, two main steps need to be followed. The first step is to grow the tree using a set of data, referred to as the training set. The second step is to prune the tree; this step produces a smaller tree with better generalization (smaller error on unseen data). The goal of this project is to incorporate an additional adjustment phase, interjected between the growing and pruning phases of a well-known decision tree classifier, the C4.5 decision tree. This additional step reduces the error rate (improves the generalization of the tree) by making adjustments to the non-optimal splits created in the growing phase of the C4.5 classifier. As a byproduct of our work, we also discuss how the decision tree produced by C4.5 is affected by changes to the C4.5 default parameters, such as CF (confidence factor) and MS (minimum number of split-off cases), and we emphasize the fact that CF and MS values different from the defaults lead to C4.5 trees of much smaller size and smaller error.

Index Terms—C4.5, Decision Trees

Manuscript received July 13, 2007. This work was supported in part by the National Science Foundation grants IIS-REU-0647018 and IIS-REU-0647120.
Jason Robert Beck is with the School of EECS at the University of Central Florida, Orlando, FL 32816 (e-mail: [email protected]).
Maria E. Garcia is with the University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico, 00981-9018 (e-mail: [email protected]).
Mingyu Zhong is with the School of EECS at the University of Central Florida, Orlando, FL 32816 (e-mail: [email protected]).
Michael Georgiopoulos is with the School of EECS at the University of Central Florida, Orlando, FL 32816 (corresponding author, 407-823-5338, fax: 407-823-5835, e-mail: [email protected]).
Georgios C. Anagnostopoulos is with the Florida Institute of Technology (FIT), Melbourne, FL 32792 (e-mail: [email protected]).

I. INTRODUCTION

Decision trees are a methodology used for classification and regression. Their advantage lies in the fact that they are easy to understand; they can also be used to predict patterns with missing values and categorical attributes. This research focuses only on classification problems. There are many approaches that can be used to induce a decision tree. This work focuses on the C4.5 decision tree classifier [1], which is a method that evolved from the simpler ID3 classifier [2]. Typically, in order to obtain an optimal tree, there are two main steps to follow. First, the tree has to be grown; to do this, a collection of training data is utilized to grow the tree by discovering (using some criterion of merit) a series of splits that divide the training data into progressively smaller subsets that have purer class labels than their predecessor sets. The growing of the tree stops when we end up with datasets that are pure (i.e., contain data from the same class) or when the split yields no improvement to the criterion of merit. At the end of the growing phase, an over-trained tree has usually been produced. The second step in the design of an optimal tree classifier is to prune the tree; pruning results in less complex and more accurate (on unseen data) classifiers.
Pruning removes parts of the tree that do not contribute to the tree's classification accuracy. During the growing phase, if minimum-error-rate splitting is employed to grow the tree, there will be many "optimal" splits and it is unlikely that the true boundaries will be selected; even worse, in many cases none of these "optimal" splits finds the real boundary between the classes. Therefore, it is widely accepted that another measure of merit, usually referred to as gain (decrease) of impurity, should be used to evaluate the splits during the growing phase. During the pruning phase, the estimated error rate is used in order to decide whether a branch must be pruned or not. The problem encountered is that the estimated error rate may not be optimized, because the splits were decided using the gain ratio and not the minimal error rate. For this reason we propose adding a third step, the backward adjustment phase, to the decision tree design process. In the backward adjustment phase the grown tree is used; a re-evaluation of the possible splits is done bottom-up, minimizing the estimated error rate.

The paper is organized in the following manner: In Section II we present a literature review related to our work. In Section III we describe the basic algorithms for growing and pruning a tree. In Section IV we explain the backward adjustment phase. In Section V we explain the experiments conducted and describe the databases that were used during the research; at the end of that section we show our experimental results and make appropriate observations. In Section VI we summarize our work and reiterate our conclusions.

II. LITERATURE REVIEW

Hyafil and Rivest proved that finding the optimal tree is NP-complete [3], and thus most algorithms employ a greedy search and the divide-and-conquer approach to grow a tree (see Section III for more details). In particular, the training data set continues to be split into smaller datasets until each dataset is pure or contains too few examples; each such dataset belongs to a tree node, referred to as a leaf, usually with a uniform class label. The splitting rules apply to data that reside in tree nodes, referred to as decision nodes. Clearly, if all the leaves in a tree have the same class label among the training examples they cover, the tree has 100% accuracy on the training set. Like other approaches that design their models from data, a decision tree classifier that is over-adapted to the training set tends to generalize poorly when it is confronted with unseen instances. Unfortunately, it is not easy to set a stopping criterion that terminates the growing process before leaves with pure labels are discovered, because it is difficult to foresee the usefulness of future splits until these splits are employed in the growing process. Therefore, it has been widely accepted in the decision tree literature that the tree should be grown to the maximum size and then pruned back. A number of pruning algorithms have been proposed for decision tree classifiers, such as Minimal Cost-Complexity Pruning (CCP) in CART (Classification and Regression Trees; see [4], page 66), Error Based Pruning (EBP) in C4.5 (see [1], page 37), Minimum Error Pruning (MEP) [5, 6], Reduced Error Pruning (REP), Pessimistic Error Pruning (PEP) (see [7] for both REP and PEP), MDL-Based Pruning [8], Classifiability Based Pruning [9], pruning using Backpropagation Neural Networks [10], etc.
Some of the pruning algorithms are briefly analyzed and empirically compared in [11]. Other recent research on decision trees can be roughly divided into three categories: extensions of the conventional model, including the extension from axis-parallel splits to oblique splits (see [12-14]) and multi-splits per node (see [15]); the combination of trees and other machine learning algorithms, such as the evolution of trees in [16], hybrid trees in [17], and tree ensembles in [18, 19]; and the optimization of tree implementations for large databases, usually by parallel processing, as in [20-23]. In this project we focus on the improvement of an induced tree that is produced by an existing, popular decision tree algorithm, the C4.5 algorithm. It is important to note that in the existing approaches to pruning a tree, the only option for pruning a branch is to remove the decision node that is the root of the branch. EBP of C4.5 allows a more flexible option to "graft" a branch, by replacing it with the most frequently selected sub-branch under the current decision node. Nevertheless, even in this approach no split is re-evaluated or changed without removal. In Section IV, we argue that this option by itself may not yield the optimal tree.

III. DECISION TREE INDUCTION

This research is based on the C4.5 algorithm developed by J. Ross Quinlan (see [1]). As with most other approaches to inducing a tree, C4.5 employs two steps: the growing phase and the pruning phase. During the first phase (growing), a tree is grown to a sufficiently large size. In most cases this tree has a large number of redundant nodes. During the second phase (pruning), the tree is pruned and the redundant nodes are removed.

A. C4.5 Growing Phase

The pseudo-code of the algorithm for growing a C4.5 tree is presented below:

    algorithm Grow(root, S)
    input:  a root node, a set S of training cases
    output: a trained decision tree, appended to root
    if S is pure then
        Assign the class of the pure cases to root;
        return;
    end
    if no split can yield the minimum number of cases split off then
        Assign the class of the node as the majority class of S;
        return;
    end
    Find an optimal split that splits S into subsets Sk (k = 1..K);
    foreach k = 1 to K do
        Create a node tk under root;
        Grow(tk, Sk);
    end

Fig. 1. The pseudo-code for the C4.5 growing algorithm.

In order to obtain the optimal split while growing the tree (see the pseudo-code above), the gain ratio is calculated. The following steps describe the meaning of the gain ratio, why it is used, and how it is derived. The first step is to find the information known before splitting. The following equation measures the average amount of information needed to identify the class of a case in S, where S represents the set of training cases and C_j represents the class with class index j. |S| is simply the count of the number of cases in the set S. When there are missing values in the set of training cases, |S| is the sum of the weights of the cases in the set S. The way missing values are handled will be explained shortly.

    Info(S) = - \sum_{j=1}^{k} \frac{freq(C_j, S)}{|S|} \cdot \log_2 \frac{freq(C_j, S)}{|S|}    (1)

The information for the set of training cases is simply the sum of the information for each class multiplied by the proportion according to which this class is represented in S. Now that the information before splitting is known, the information after making a split needs to be determined.
This corresponds to the average information of the resulting subsets after S has been divided in accordance with the n outcomes of a test X. Note that X represents an explored split which divides the cases in set S into n subsets, each indexed as S_i. The expected (average) information is found by taking the weighted sum of the information of each subset, as shown below:

    Info_X(S) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \cdot Info(S_i)    (2)

The gain in information made by the split is simply the difference between the amount of information needed to classify a case before and after making the split:

    Gain(X) = F \cdot (Info(S) - Info_X(S))    (3)

C4.5 handles missing values by adding a multiplier F in front of the gain calculation. The multiplier F is the ratio of the sum of the weights of the cases with known outcomes on the split attribute divided by the sum of the weights of all of the cases. The gain measure alone is not enough to build a tree, because it favors splits with very many outcomes. To overcome this problem the gain ratio is used, defined as follows:

    GainRatio(X) = \frac{Gain(X)}{SplitInfo(X)}    (4)

The gain ratio divides the gain by the split information of the evaluated split, which penalizes splits with many outcomes:

    SplitInfo(X) = - \sum_{i=1}^{n+1} \frac{|S_i|}{|S|} \cdot \log_2 \frac{|S_i|}{|S|}    (5)

The split information is the weighted average of the information computed from the proportion of cases which are passed to each child. When there are cases with unknown outcomes on the split attribute, the split information treats these cases as an additional split direction. This is done to penalize splits which are made using cases with missing values. After finding the best split, the tree continues to be grown recursively using the same process.

The following example shows how a tree is grown using the Iris database. The Iris database comes from the UCI repository [24]. This database contains three classes of Iris flowers: iris-setosa, iris-virginica and iris-versicolor. There are 150 training examples, 50 from each class. Each case contains four attributes: petal length, petal width, sepal length, and sepal width. Two of the attributes, petal length versus petal width, are plotted for all of the cases in Fig. 2, which shows where the first best split will be made. Fig. 3 shows why that split was chosen by plotting the gain ratio for each of the explored split possibilities.

Fig. 2. Diagram showing the first best tree split for the Iris dataset. The first best split separates all of the iris-setosa cases from the rest of the classes and creates a pure tree node. The remaining cases go to another tree node, which is not pure and as a result will be split even further.

Fig. 3. Diagram showing the gain ratio for each considered first split in the Iris dataset.

The position with the highest gain ratio is chosen as the first split. This split separates the iris-setosa cases from the rest of the iris cases. There will be no more splitting of the tree node that contains the iris-setosa cases (since this node contains data of a pure label). The remaining iris-plant cases (residing in the other tree node resulting from the split) will continue to be split until there is little or no more information being gained from additional splits, or until pure nodes result. The graphical version of the Iris tree growing phase is shown in Fig. 4.

Fig. 4. Graphical display of a C4.5 tree grown using the Iris database. The Iris database contains 150 cases which can be used to predict between three classes: iris-setosa, iris-versicolor, and iris-virginica. Each case contains 4 continuous attributes measuring the iris' petal width/length and sepal width/length.
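To make the quantities in (1)-(5) concrete, the short Python sketch below evaluates one candidate split directly from class counts. It is only an illustration of the formulas under the simplifying assumption of no missing values (so F = 1 in (3) and the extra "unknown" branch in (5) is empty); it is not the authors' C++ implementation.

    from math import log2

    def info(counts):
        # Equation (1): entropy of a node from its class frequencies.
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    def gain_ratio(parent_counts, child_counts):
        # child_counts: one class-count list per child produced by the candidate split X.
        n_parent = sum(parent_counts)
        # Equation (2): weighted average information of the children.
        info_x = sum(sum(ch) / n_parent * info(ch) for ch in child_counts)
        gain = info(parent_counts) - info_x          # Equation (3) with F = 1
        split_info = -sum(sum(ch) / n_parent * log2(sum(ch) / n_parent)
                          for ch in child_counts)    # Equation (5), no unknown branch
        return gain / split_info                     # Equation (4)

    # Example: 10 cases of class A and 10 of class B, split into (9 A, 1 B) and (1 A, 9 B).
    print(gain_ratio([10, 10], [[9, 1], [1, 9]]))    # roughly 0.53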
B. Pruning Phase of C4.5

The following algorithm (pseudo-code) outlines the pruning process:

    algorithm Prune(node)
    input:  node with an attached subtree
    output: pruned tree
    leafError = estimated leaf error;
    if node is a leaf then
        return leafError;
    else
        subtreeError = sum of Prune(ti) over all children ti of node;
        branchError = error if node is replaced with its most frequented branch;
        if leafError is less than branchError and subtreeError then
            make this node a leaf;
            error = leafError;
        elseif branchError is less than leafError and subtreeError then
            replace this node with the most frequented branch;
            error = branchError;
        else
            error = subtreeError;
        end
        return error;
    end

Fig. 5. The pseudo-code for the C4.5 pruning algorithm.

The algorithm is called recursively so that it works from the bottom of the tree upward, removing or replacing branches to minimize the predicted error. Because trees that have only been grown are usually very cumbersome, the resulting tree after pruning is easier to understand and usually provides a smaller misclassification error on unseen data. The problem is that the estimated error rate may not be optimized. As a remedy to this problem, the idea of the backward adjustment phase is introduced in the next section.

The following example shows how a tree is pruned. This example uses the congressional voting database from the UCI repository [24]. This database contains 300 cases; each case is the voting record of a U.S. congressman taken during the second session of congress in 1984. For this example, each case has 16 categorical attributes which represent 16 key votes made during that year. The possible outcomes for each of these attributes are: yea, nay, and unknown disposition. There are two classes, democrat and republican. The following is the textual output of the fully grown tree:

    physician fee freeze = n:
    |   adoption of the budget resolution = y: democrat (151.0)
    |   adoption of the budget resolution = u: democrat (1.0)
    |   adoption of the budget resolution = n:
    |   |   education spending = n: democrat (6.0)
    |   |   education spending = y: democrat (9.0)
    |   |   education spending = u: republican (1.0)
    physician fee freeze = y:
    |   synfuels corporation cutback = n: republican (97.0/3.0)
    |   synfuels corporation cutback = u: republican (4.0)
    |   synfuels corporation cutback = y:
    |   |   duty free exports = y: democrat (2.0)
    |   |   duty free exports = u: republican (1.0)
    |   |   duty free exports = n:
    |   |   |   education spending = n: democrat (5.0/2.0)
    |   |   |   education spending = y: republican (13.0/2.0)
    |   |   |   education spending = u: democrat (1.0)
    physician fee freeze = u:
    |   water project cost sharing = n: democrat (0.0)
    |   water project cost sharing = y: democrat (4.0)
    |   water project cost sharing = u:
    |   |   mx missile = n: republican (0.0)
    |   |   mx missile = y: democrat (3.0/1.0)
    |   |   mx missile = u: republican (2.0)

Fig. 6. The fully grown C4.5 tree for the congressional voting database. The numbers in the parentheses after each node represent the sum of the case weights in that node and, after the slash, the number of errors at the leaf (if any). The congressional voting database contains 300 cases; every case has 16 categorical attributes, each with possible outcomes y, n, and u.

The tree now needs to be pruned. To explain how this is done, only a part of the tree will be considered. The node examined is the one containing 16 cases, depicted below.
    adoption of the budget resolution = n:
    |   education spending = n: democrat (6.0)
    |   education spending = y: democrat (9.0)
    |   education spending = u: republican (1.0)

Fig. 7. Subtree of the fully grown congressional voting database tree. The cases which made it here voted nay on the physician fee freeze and nay on the adoption of the budget resolution. The children of this node were created using the education spending attribute, and the number of cases in each child node is shown in the parentheses after the node. There are no errors on the training set at the children nodes.

Observing this part of the tree, one may notice that if the education spending answer is yea or nay the congressman would be a democrat, while if the answer is unknown the congressman would be a republican. Hence, only one case (republican) has the unknown disposition, while the remaining 15 of the total 16 cases are democrat. Since 15 of the 16 cases are democrat, there is a strong indication that this node should become a leaf, predicting that if a voting record makes it to this leaf, the designated congressman is a democrat. To decide whether the node should become a leaf or not we need to estimate the error rates using a confidence level (see the appendix for more details about the confidence level used in C4.5); this penalizes small leaves.

Consider the predicted error of the subtree shown in Fig. 7:

    Predicted Error = 6 x 0.206 + 9 x 0.143 + 1 x 0.750 = 3.273

Now consider the predicted error if we prune this node to a leaf:

    Predicted Error = 16 x 0.157 = 2.512

The predicted error is smaller if we prune this node to a leaf. The result after pruning this part of the tree is the following:

    adoption of the budget resolution = n: democrat

IV. BACKWARD ADJUSTMENT DESCRIPTION

A. Motivation

At this point the two traditional decision tree phases (growing and pruning) have been explained. The main problem is that during the growing process the splits that were chosen may not have been the error-optimal splits. This happens because during the growing process the data is separated using the maximized gain ratio to reduce entropy. Such a split finds a good boundary, and future splits may further separate the classes. For example, consider Fig. 8, which shows two class distributions that overlap. When C4.5 makes the first split, the maximized gain ratio split is chosen to separate part of class 1 from class 2. Nevertheless, if there is no future split under this node, or all future splits are pruned, it is clear that this split is not the Bayes optimal split (the one minimizing the probability of error; see Fig. 8 for an illustration of this fact).

Fig. 8. Finding the minimum error rate point on two Gaussians (class 1 and class 2 densities plotted against the attribute). The point that maximizes the gain may not minimize the error; the minimum error rate (Bayes optimal) point lies at the intersection of the two distributions, away from the maximum gain point.

By observing Fig. 8 one may notice that the optimal split is the one that passes through the intersection of both distributions. In this paper, in order to reduce the error of the split, we propose an adjustment process to be added to decision tree induction, described in the next subsection.
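The situation depicted in Fig. 8 is easy to reproduce numerically. The Python sketch below (an illustration with arbitrarily chosen Gaussian parameters, not the data used in the paper) scans every candidate threshold on a one-dimensional two-class sample and reports both the threshold with the highest information gain and the one with the lowest training error; for overlapping classes of unequal spread the two typically do not coincide, which is exactly the gap the backward adjustment phase is meant to close.

    import random
    from math import log2

    def entropy(pos, neg):
        total = pos + neg
        if total == 0 or pos == 0 or neg == 0:
            return 0.0
        p = pos / total
        return -p * log2(p) - (1 - p) * log2(1 - p)

    random.seed(0)
    # Hypothetical 1-D data: class 1 is narrow, class 2 is wide, and they overlap.
    data = [(random.gauss(0.0, 0.5), 1) for _ in range(500)] + \
           [(random.gauss(1.5, 1.5), 2) for _ in range(500)]
    data.sort()

    n = len(data)
    n1 = sum(1 for _, c in data if c == 1)
    root_entropy = entropy(n1, n - n1)
    best_gain, best_err = (-1.0, None), (2.0, None)
    left = left1 = 0
    for x, c in data[:-1]:
        left += 1
        left1 += (c == 1)
        right1 = n1 - left1
        # Information gain of splitting just after this point (no missing values).
        gain = root_entropy - (left / n) * entropy(left1, left - left1) \
                            - ((n - left) / n) * entropy(right1, n - left - right1)
        # Training error if each side predicts its majority class.
        err = (min(left1, left - left1) + min(right1, n - left - right1)) / n
        if gain > best_gain[0]:
            best_gain = (gain, x)
        if err < best_err[0]:
            best_err = (err, x)

    print("max-gain threshold:", best_gain[1], " min-error threshold:", best_err[1])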
B. Adjustment of a Tree

For the decision node pertaining to Fig. 8, it is clear that it is beneficial to adjust the split. In general, after a decision node is adjusted, its ancestor nodes may also be improved by the same adjusting method. Therefore, the adjusting process attempts to improve the accuracy of the decision nodes in a bottom-up order. The adjustment process currently examines and adjusts only binary splits, for simplicity. The algorithm for this backwards adjusting phase is shown below.

    algorithm AdjustTree(t, S)
    input:  a tree rooted at node t, training set S
    output: the adjusted tree (modified from the input)
    if t is a decision node then
        Divide S into subsets {Si} using the current split;
        foreach child node ti of t do
            Recursively call AdjustTree(ti, Si);
        end
        foreach possible split sk of t do
            Set the split of t to sk;
            Ek = EstimateError(t, S);
        end
        Set the split of t to sk*, where k* = arg min_k Ek;
        RemoveEmptyNodes(t, S);
    end

Fig. 9. The pseudo-code for the backwards adjusting algorithm.

There are two options for when to adjust the tree: before pruning or after pruning. In both cases, we expect that the resulting tree will have better accuracy and be smaller than the pruned tree without adjustment. We prefer to adjust the tree before pruning it, because the fully grown tree has more nodes to adjust, and thus the chance to improve the tree through backwards adjusting is higher. Since the tree will be pruned later, the function EstimateError in our adjusting algorithm must return the error of the pruned result of the sub-tree rather than that of the current sub-tree. This can be implemented in the same way as C4.5's pruning, as shown in Fig. 10.

    algorithm EstimateError(t, S)
    input:  a tree rooted at node t, training set S
    output: estimated error
    leafError = error if the node were made a leaf, using the binomial estimate;
    if t is a leaf node then
        return leafError;
    else
        Divide S into subsets {Si} using the current split;
        foreach child node ti of t do
            sumError = sumError + EstimateError(ti, Si);
        end
        if sumError < leafError then
            return sumError;
        else
            return leafError;
    end

Fig. 10. The pseudo-code for the EstimateError function. When estimating the error of a sub-tree, we return the smaller value between the total error of its branches (involving recursive calls) and the error of the root treated as a leaf.

The difference is that we do not actually prune the sub-tree in the EstimateError routine, but simply evaluate the potential best pruned sub-tree. After adjusting, the pruning algorithm can serve as a realization process, converting each sub-tree to its best pruned version (for C4.5, the result may not be exactly the same as the potential best pruned sub-tree evaluated here, due to the grafting option in the final pruning process).

Strictly speaking, after one pass of the adjusting algorithm, the leaves may cover different training examples because the splits in their ancestor nodes may have changed. We could repeat the same adjusting process again from the bottom decision nodes, based on the updated leaf coverage. Nevertheless, our preliminary experiments show that a second run of the adjusting process does not yield significant improvements compared to the first pass, and therefore in our experiments we apply only one pass of the adjusting algorithm to the tree.
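A compact Python rendering of the AdjustTree/EstimateError pair of Figs. 9 and 10 is sketched below for trees with binary numeric-threshold splits. To stay short it makes two simplifications that are ours, not the paper's: it re-evaluates only thresholds on the attribute already used at each node, and it uses the resubstitution error of the best pruned subtree instead of the binomial (EBP) estimate. Cases are tuples whose last element is the class label, and leaves are assumed to predict the majority class of the cases that reach them.

    class Node:
        def __init__(self, attr=None, thr=None, left=None, right=None, label=None):
            self.attr, self.thr, self.left, self.right, self.label = attr, thr, left, right, label
        def is_leaf(self):
            return self.label is not None

    def majority(cases):
        labels = [c[-1] for c in cases]
        return max(set(labels), key=labels.count) if cases else None

    def estimate_error(node, cases):
        # Error of the best pruned version of this subtree (resubstitution error here;
        # the paper's EstimateError uses the EBP binomial estimate instead).
        leaf_err = sum(1 for c in cases if c[-1] != majority(cases))
        if node.is_leaf() or not cases:
            return leaf_err
        left = [c for c in cases if c[node.attr] <= node.thr]
        right = [c for c in cases if c[node.attr] > node.thr]
        sub_err = estimate_error(node.left, left) + estimate_error(node.right, right)
        return min(leaf_err, sub_err)

    def adjust_tree(node, cases):
        # Bottom-up re-evaluation of every binary threshold split (AdjustTree of Fig. 9).
        if node.is_leaf() or not cases:
            return
        left = [c for c in cases if c[node.attr] <= node.thr]
        right = [c for c in cases if c[node.attr] > node.thr]
        adjust_tree(node.left, left)
        adjust_tree(node.right, right)
        best_thr, best_err = node.thr, estimate_error(node, cases)
        values = sorted(set(c[node.attr] for c in cases))
        for lo, hi in zip(values, values[1:]):        # candidate thresholds: midpoints
            node.thr = (lo + hi) / 2.0
            err = estimate_error(node, cases)
            if err < best_err:
                best_thr, best_err = node.thr, err
        node.thr = best_thr                           # keep the error-minimizing split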
C. Examples

Now we present two examples to show how the backward adjustment phase can optimize the estimated error rate. In the first example we design a 2-class artificial database, as shown in Fig. 11, in which the observed class distribution forms a cross inside a square.

Fig. 11. Distribution of the cross database. Each pattern has two attributes, which are plotted on the x-y axes. The class of each pattern is colored accordingly.

The optimal split is the one that assigns 50% of the red distribution and 50% of the blue distribution to both children, as shown in Fig. 12.

Fig. 12. Optimal tree for the cross dataset. This tree is optimal because the number of leaves of the tree equals the number of class regions in the dataset under consideration (in both cases this number is equal to 4).

In practice, the optimal tree is not easy to find. Most tree induction algorithms, including C4.5, consider only axis-parallel splits (vertical or horizontal for this 2-D case). For this database, after the first split, regardless of its position, each class still occupies 50% of the area in both branches. Therefore, ideally each split position has the same gain ratio or impurity. In reality, only a finite number of examples are given, so some position has slightly better gain than others due to random sampling, but it is unlikely that this "best" split turns out to be the optimal one. Fig. 13 shows a typical tree that could result after the growing phase alone. Unless grafting (replacing a sub-tree with one of its branches) is allowed in the pruning process, the root split will not be changed and the optimal tree cannot be obtained.

Fig. 13. Non-optimal grown tree for the cross database. The higher-level splits in this tree do not find the ideal split positions.

Another example showing where the backwards adjusting phase would be beneficial is shown in Fig. 14. In this figure the gain ratio and the error rate are plotted for each respective split. The two classes are represented as circles and crosses. The circles have a uniform distribution and the crosses increase in frequency across the axis. C4.5 would choose Split A, because it maximizes the gain ratio. The error-minimizing split is actually Split C. Assuming no other divisions occur below this split in the final tree, this split would benefit from the backwards adjusting phase.

Fig. 14. Gain ratio and minimum error rate split. The top graph shows the gain ratio plotted for each split (A through E); the bottom graph shows the error rate for each split and the minimum error rate position. This shows that the division C4.5 makes (the maximized gain ratio split) is not at the minimum error position, but the backwards adjusting phase can correct this issue.

V. EXPERIMENTS

In this section we describe the software design, the experimental design, the databases used, and the results of the experiments.

A. Software Design

Quinlan's original software was not ideal to use for our experimentation and for the addition of the backwards adjusting phase. His version was implemented in C and lacked any object oriented design. In particular, it is very difficult to port his code into our platform for empirical comparison. Because of this limitation, we decided to implement C4.5 using C++ and standard object oriented principles. This allows for easy experimentation with the algorithm, along with the ability to add additional modules. We also wanted to compile the produced code into a MEX function for use in MATLAB, so that we could use all of the tools available within MATLAB.
Our design methodology was to first develop sets of abstract classes for interfacing. We could then provide various implementations from these classes and specify the traditional C4.5 phases (growing and pruning) and split types (continuous, categorical, and grouped categorical), while leaving open the possibility of creating new phases and split types. To accomplish this, any phase of tree development which modifies the structure of a tree was considered to be a tree manipulator. The tree manipulator became the abstract class from which growing, pruning, and backwards adjusting were defined. Split types were also defined abstractly, and continuous, categorical, and grouped categorical splits were implemented. Split factories were also created to produce splits for the growing and backwards adjusting phases. A specific factory was developed for each type of split, thereby keeping open the possibility of developing a new type of split or a new method for choosing the best split. A class that could evaluate a subtree and return its error using a set of cases was also developed. This class can be extended to measure the error using any type of error measurement. We implemented versions that measure the resubstitution error (the number of errors made on the training examples), as well as the binomial estimated error (the estimated error in EBP; see the appendix for more details). Throughout the development, the important considerations were reusability and expandability. More details about the specific software design components mentioned above can be found in [25].

B. Experimental Design

In order to perform our experiments, each database is separated into two groups: a training set and a test set. To reduce the randomness that results when partitioning the data into a training set and a test set, the experiments are performed for twenty iterations with a different partitioning each time. One problem that can be encountered if random partitioning is merely repeated is that some of the examples might appear more often in the training set than in the test set, or vice versa. V-fold cross-validation [26] can address this, but the training ratio cannot be controlled when V is fixed, and the training ratio is large when V is high. For this reason we slightly generalize cross-validation in the following way: we divide each database into 20 subsets of equal size. In the ith (i = 0, 1, 2, ..., 19) test, we use subsets i mod 20, (i+1) mod 20, ..., (i+m-1) mod 20 for training and the others for testing, where m is predefined (for example, if the training ratio is 5%, m = 1; if the training ratio is 50%, m = 10). This approach guarantees that each example appears exactly m times in the training set and exactly (20 - m) times in the test set. Furthermore, it allows the training ratio to vary. The algorithm that accomplishes this goal is shown in Fig. 15 and pictorially illustrated in Fig. 16.

    foreach Database D do
        Shuffle the examples in D;
        Partition D into 20 subsets D0, D1, ..., D19;
        for i = 0 to 19 do
            Training Set = union of D_{(i+j) mod 20} for j = 0, ..., m-1;
            Test Set = D - Training Set;
            Run experiment using the Training Set;
            Evaluate results using the Test Set;
        end
    end

Fig. 15. The pseudo-code for separating the training and test sets of cases.

Fig. 16. Representation showing the training set (red) and the test set (green) windowing through the 20 subsets of cases, for iterations 0 through 3 (each row represents one iteration).
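The partitioning in Fig. 15 can be expressed compactly in Python. The sketch below is our own illustration (the function name and the striding used to form the 20 near-equal subsets are arbitrary choices); it yields, for each of the 20 iterations, a training set made of m consecutive subsets (modulo 20) and a test set made of the rest, so every example is used for training exactly m times.

    import random

    def rotated_partitions(examples, m, folds=20, seed=0):
        # Generalized cross-validation of Fig. 15: each example appears in exactly m
        # training sets and (folds - m) test sets; m controls the training ratio (m/folds).
        rng = random.Random(seed)
        shuffled = examples[:]
        rng.shuffle(shuffled)
        subsets = [shuffled[k::folds] for k in range(folds)]   # 20 near-equal subsets
        for i in range(folds):
            train_ids = {(i + j) % folds for j in range(m)}
            train = [x for k in train_ids for x in subsets[k]]
            test = [x for k in range(folds) if k not in train_ids for x in subsets[k]]
            yield train, test

    # Example with a 5% training ratio (m = 1): each of the 20 iterations trains on one
    # subset and tests on the remaining nineteen.
    for train, test in rotated_partitions(list(range(100)), m=1):
        pass  # grow/adjust/prune a tree on `train`, evaluate on `test`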
The experimentation aims at comparing the accuracy and size of the trees generated with the backwards adjusting phase to those of the trees produced using the traditional growing and pruning phases alone. The experiment involves growing and pruning a tree using all of the default options and comparing that to growing, then backwards adjusting, and finally pruning a tree using the default options. The error rates and tree sizes are compared between the two approaches of designing trees. To make a fair comparison, both approaches are based on the same grown tree in each iteration.

The comparison is also conducted when the C4.5 default options are not used. Specifically, two C4.5 option parameters are independently adjusted: the minimum number of cases split off during growing, and the confidence factor used while pruning. The first is the minimum number of cases split off by a split when growing the C4.5 tree. This number defaults to two, and in Quinlan's C4.5 it can be set to a maximum of 25. In our implementation of C4.5 we did not put an upper limit on the minimum number of cases split off. Instead, in our experiments this option starts from the default value of 2 and increases, in steps of 10, up to a number equal to 15% of the number of training cases. The tree is grown using the set value of the minimum number of cases split off and then pruned. This is compared to a tree grown with the same minimum number of split-off cases, backward adjusted, and then pruned. The other option parameter which is adjusted is the confidence factor. This value affects how pessimistically the tree is pruned back. The confidence factor is given values of 75%, 50%, 25%, 10%, 5%, 1%, .1%, .01%, and .001% (smaller confidence factors result in further tree pruning; a detailed explanation of the role of the confidence factor is provided in the appendix). The tree is grown and then pruned with the set confidence factor. This tree is compared to a tree adjusted and pruned based on the same fully grown tree and using the same confidence factor value. The trees designed with these two approaches are compared in terms of accuracy and size.

C. Databases

The experimentation uses a combination of artificial and real datasets. The real datasets are available from the University of California at Irvine's (UCI) machine learning repository [24]. The databases and their individual properties are shown in Table I.

TABLE I
DATABASES USED FOR EXPERIMENTATION

    Database    Examples   Attributes (Numerical / Categorical)   Classes   Majority Class %
    g6c15       5004       2 / 0                                  6         16.7
    g6c25       5004       2 / 0                                  6         16.7
    abalone     4177       7 / 1                                  3         34.6
    satellite   6435       36 / 0                                 6         23.8
    segment     2310       19 / 0                                 7         14.3

The artificial Gaussian datasets g6c15 (see Fig. 17) and g6c25 are both two-dimensional, six-class datasets with 15% and 25% overlap, respectively. The overlap percentage means that the optimal Bayesian classifier would have a 15% and 25% misclassification rate on each of these datasets.

Fig. 17. The g6c15 database (6 classes and 2 continuous attributes) with 15% overlap. Each colored dot corresponds to a different class.

In the abalone dataset, each case contains 7 continuous attributes and 1 categorical attribute. The class is the age of the abalone.
Although the original dataset used a class for each age number, this dataset has clustered groups of ages so that the result is a three-class database. The classes are 8 and lower (class 1), 9-10 (class 2), and 11 and greater (class 3). The satellite dataset gives the multi-spectral values of pixels within 3x3 neighborhoods in a satellite image, and the classification associated with the central pixel in each neighborhood. The class is the type of land present in the image: red soil, cotton crop, etc. The segment dataset contains 19 numerical attributes and is used to predict 7 classes. The classes represent seven types of outdoor images: brick face, sky, foliage, cement, window, path, and grass. The waveform dataset is a three-class dataset. Each case contains 21 numerical attributes which are used to predict three classes of waveforms. The method of creating this dataset is described in [4].

D. Results

As described in the experimental design section, there are three configurations tested. The first compares the tree resulting from growing, adjusting, and pruning using all of the default options to a tree just grown and pruned using the default options. The results of this experiment are shown in Table II. These configurations will be referred to as C4.5_DP (default parameters) and C4.5A_DP (backwards adjusting and default parameters).

TABLE II
EXPERIMENTAL RESULTS USING DEFAULT OPTIONS

                 C4.5_DP (Grow -> Prune)             C4.5A_DP (Grow -> Adjust -> Prune)
    Database     Error Rate % (test)   Tree Size     Error Rate % (test)   Tree Size
                                       (# nodes)                           (# nodes)
    g6c15        17.56                 67.6          17.34                 63.3
    g6c25        27.67                 99            26.95                 94.8
    abalone      39.39                 397.2         39.15                 376.75
    satellite    15.18                 333.2         15.06                 328.5
    segment      5                     61.6          4.89                  59.1
    waveform     24.63                 328.2         24.45                 323.6

The results show a consistent improvement in both tree accuracy and tree size across all of the examined databases. Although the difference in error is at most .72% (for the g6c25 database), there is a slight improvement for all the sets. The other experimental configurations that are examined correspond to the trees produced when modifying the minimum number of cases split off (C4.5_MS and C4.5A_MS), or the confidence factor during pruning (C4.5_CF and C4.5A_CF). For each dataset in each configuration, a tree size versus error rate plot was created. The plots for g6c15 are shown below when the confidence factor (CF) is modified (Fig. 18) and when the minimum number of cases split off (MS) is modified (Figs. 19 and 20).

Fig. 18. Number of tree nodes versus tree error rate when modifying the confidence factor (CF) during C4.5 pruning for the g6c15 Gaussian dataset. The results corresponding to two configurations are illustrated: growing-pruning with the confidence factor modified (C4.5 CF) and growing-adjusting-pruning with the confidence factor modified (C4.5A CF). The pair of points for C4.5_CF and C4.5A_CF when CF = 0.25 (default value) is shown (green highlighted points), along with pairs of points for other values of the confidence factor.
Fig. 19. Number of tree nodes versus tree error rate when modifying the minimum number of cases split off (MS) during C4.5 growing for the g6c15 Gaussian dataset. The results corresponding to two configurations are illustrated: growing-pruning with the minimum number of split-off cases modified (C4.5 MS) and growing-adjusting-pruning with the minimum number of split-off cases modified (C4.5A MS). The pair of points for C4.5_MS and C4.5A_MS when MS = 2 (default value) is shown (green highlighted points), along with pairs of points for other MS values.

Each graph (Fig. 18 and Fig. 19) shows the pairs of outcomes which correspond to the same parameter values. For each of these pairs, it is evident that the backwards adjusting phase improves or maintains the accuracy. Although the difference is minimal for the confidence factor adjustment (Fig. 18), there is an interesting side effect which has appeared as a result of these experiments. Besides the error rate decrease provided by the backwards adjusting phase, there is a slight error rate decrease and a significant drop in tree size at the knee of the curve of Fig. 18. This reduction is seen when comparing the results at the default CF value of 25% (the green highlighted point on the blue curve) to the results at the knee of the curve, where a tree of 50 nodes (CF = 10%) achieves a better misclassification rate than the 68-node tree obtained with the default CF value of 25%. This behavior was seen in the other datasets that we experimented with and will be pointed out in the rest of the paper. A similar knee-of-the-curve behavior occurs for the red curve of Fig. 18 (corresponding to C4.5A_CF), where a reduction of tree size from 62 nodes to 45 nodes is observed without any change in the tree error rate (CF = 5%).

Although the improvement when using the backwards adjusting phase appears to be minimal in Fig. 19, there is a section of interest along the knee of the plot in which there is a high density of outcome pairs. At this location, the error rate difference between the paired points of C4.5 MS and C4.5A MS is the greatest. Fig. 20 shows a closer look at this knee and plots more pairs of outcomes for these two configurations. A set of pairs at the knee of the curve is also shown in Table III. The greatest difference between the error rates occurs at a minimum number of cases split off of 322. For this parameter value, the error rate difference between the corresponding C4.5_MS and C4.5A_MS is 5.2%. This point pair also shares approximately the same tree size of 11 nodes. The knee of this curve is also of interest because of the "fanning effect" that can be seen in the lines plotted between the pairs of corresponding points (see Fig. 20). On the C4.5A MS plot, the points at the knee of the curve are highly concentrated at approximately the same error rate and number of nodes. On the contrary, the corresponding points on the C4.5 MS plot quickly start to exhibit an increased error rate within this window. The result is an appreciable difference in error rate between the C4.5_MS and the associated C4.5A_MS points (in favor of the C4.5A_MS points; see Fig. 20 and Table III for more details).

Fig. 20. Number of tree nodes versus tree error rate for a zoomed portion of the knee present in Fig. 19. At the zoomed location the C4.5A_MS points are highly concentrated at the knee of the C4.5A MS plot. On the contrary, their corresponding C4.5_MS points on the C4.5 MS plot show a consistent increase in error rate. It is along this section of the plot that the greatest differences in error rate between C4.5_MS and C4.5A_MS are seen.

TABLE III
g6c15: MODIFYING THE MINIMUM NUMBER OF CASES SPLIT OFF

    Minimum Number       C4.5_MS (Grow -> Prune)            C4.5A_MS (Grow -> Adjust -> Prune)
    of Cases Split Off   Error Rate % (test)   Tree Size    Error Rate % (test)   Tree Size
                                               (# nodes)                          (# nodes)
    282                  22.95                 12.5         19.16                 11
    292                  23.52                 11.8         19.15                 11
    302                  23.82                 11.4         19.15                 11
    312                  24.10                 11.3         19.15                 11
    322                  24.85                 11.0         19.65                 10.9
    332                  27.03                 10.2         23.30                 10.2

To show that the results of Table III are statistically significant, a t-test was performed for each of the values of minimum split off shown in the table, using the error rates calculated in each of the twenty iterations. The t-test used a significance level of 5%. The null hypothesis for the test is that the means are equal.

TABLE IV
g6c15 T-TEST RESULTS (5% SIGNIFICANCE LEVEL); NULL HYPOTHESIS: "THE MEANS ARE EQUAL"

    Minimum Number       p-value (probability of the observed     Reject Null Hypothesis
    of Cases Split Off   difference under the Null Hypothesis)    (Yes / No)
    282                  0                                        Yes
    292                  0                                        Yes
    302                  0                                        Yes
    312                  0                                        Yes
    322                  0                                        Yes
    332                  .0191                                    Yes

For the values of the minimum number of cases split off shown in Table III, the results are statistically significant. The null hypothesis (corresponding to the statement that the two mean error rates for C4.5_MS and C4.5A_MS are equal) was rejected.
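Assuming the twenty per-iteration error rates for C4.5_MS and C4.5A_MS are available as two aligned lists, the comparison behind Table IV can be carried out in a few lines. The Python sketch below uses SciPy's paired t-test (the paper does not state whether a paired or a two-sample test was used; scipy.stats.ttest_ind would be the unpaired alternative) and entirely synthetic placeholder numbers rather than the measured error rates.

    import random
    from scipy import stats

    random.seed(1)
    # Hypothetical per-iteration test error rates (20 iterations each); placeholders
    # for illustration only, not the values measured in the paper.
    c45_ms  = [0.25 + random.gauss(0.0, 0.004) for _ in range(20)]
    c45a_ms = [0.20 + random.gauss(0.0, 0.004) for _ in range(20)]

    t_stat, p_value = stats.ttest_rel(c45_ms, c45a_ms)   # paired t-test over iterations
    print(f"t = {t_stat:.2f}, p = {p_value:.4g}, reject at the 5% level: {p_value < 0.05}")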
Hence, although the reduction in error rate on the test set between C4.5 and C4.5A was found to be minimal when all of the default C4.5 settings were used (see the results of Table II), a significant reduction can be gained when the C4.5 parameters are varied (see the results of Table III). We argue that varying these parameters is usually beneficial, as it can yield a tree with better accuracy and/or size, and thus our algorithm's improvement is practical. The magnitude of the reduction is apparently database dependent.

In the case of the Gaussian datasets, the benefit is seen when increasing the minimum number of cases split off. This in turn inhibits the size of the grown tree, and pruning does not have the ability to recover from the loss in tree accuracy. The backwards adjusting is able to reduce the error because an ample statistical sample of cases resides at each node for higher (than the default) MS values. The backward adjustment uses these cases to minimize the predicted error and therefore reduces the overall error of the entire tree, while keeping the same (or a smaller) number of nodes than the original grown tree. It could be argued that the tree could be pruned back significantly to the same number of nodes but with greater accuracy by simply modifying C4.5's confidence factor value. However, this is not the case, because even when the confidence factor is set to a very small value (0.001%) the number of nodes in the tree is still greater than 20. On the contrary, when the minimum number of split-off cases is modified, the tree size is reduced to fewer than 10 nodes. In review, the backwards adjusting phase helps reduce the error rate which has increased as a result of initially creating such a small tree. The end result is a smaller tree than could have been
produced by pruning alone, but without the loss in accuracy that would occur if backwards adjusting were not used.

There is a final observation that can be obtained by looking at the results depicted by the curves (blue and red) of Fig. 19. Although the knees of the curves are not as pronounced, we also see that, without affecting the error rate, we can attain a smaller C4.5 and C4.5A tree by simply changing the value of the minimum number of split-off (MS) cases parameter. Similar trends are found, and similar observations can be made, regarding the other datasets that we experimented with. In particular, in all cases, the backwards adjusting phase produces trees with an error rate no larger than that of the traditional phases using the same parameters. Furthermore, there is merit in using parameter values other than the default values for MS and CF that C4.5 typically uses (smaller trees of the same error rate can be produced).

The following figures show the number-of-nodes versus error-rate plots for each of the remaining datasets when the minimum number of cases split off and the confidence factor are independently adjusted. Fig. 21 and Fig. 22 show interesting results when modifying the parameters MS and CF for the abalone dataset. In Fig. 21, increasing MS decreases both the tree size and the error rate. The knee of the curve is the location where the error rate is the least and the tree size is significantly reduced. This position corresponds to the MS value of 52. In addition to the reductions in error and in size provided by modifying MS, the backwards adjusting phase provides approximately an additional reduction of .5% in error rate and keeps the tree size about the same.

Fig. 21. Number of tree nodes versus tree error rate for the abalone dataset, when modifying the minimum number of cases split off (MS) during C4.5 growing. When the value of MS increases, the number of nodes in the tree drops significantly along with the error rate. The knee of the curve shows the ideal parameter value for the dataset (MS = 52), and the number of tree nodes produced at this MS value is significantly reduced (from around 400 when the default MS parameter is used, to less than 50 when MS equals 52).

Fig. 22. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the abalone dataset. Although the backwards adjusting has a minimal (less than 1%) improvement on the error rate, modifying CF from the default of 25% to 1% results in fewer nodes and a lower error rate. The knees of the curves correspond to the CF parameter value of 1%. Here we see a significant reduction of tree sizes (from around 400 when the default CF parameter is used, to around 50 when CF = 1% is used), while the error rate is also decreased.

The g6c25 database differs from the already examined g6c15 database in the amount of overlap of the associated Gaussian distributions. The plots obtained when adjusting MS and CF for this database are very similar to the plots shown for the g6c15 Gaussian database, and show the same overall trends for both configurations.
Fig. 23. Number of tree nodes versus tree error rate for the g6c25 dataset, when modifying the minimum number of cases split off (MS) during C4.5 growing. Similar to Fig. 19, this plot reveals the same ideal knee where the error rate differences are at their maximum. Also, when MS is set to 42, there is a tree size reduction of about 70 nodes. This corresponds to the difference between the pair of points using the default MS value of 2 and the pair of points at around 30 nodes, which corresponds to an MS value of 42.

Fig. 24. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the g6c25 dataset. Although minimal error reductions are observed from the backwards adjusting phase, a 5% value for CF provides a lower error rate and tree size (tree sizes of approximately 50 nodes are seen for both the red and blue curves). Backwards adjusting reduces the error at this point even further. Furthermore, although the CF value of 5% (corresponding to the most prominent knee in the curve) has the greatest reduction in error rate compared to the default parameters, the CF values of 10% and 1% each also show a reduction in tree size and error rate compared to the tree sizes and error rates at the default setting of CF = 25%.

The satellite database results shown below provide similar observations as well. Although the backwards adjusting does produce some additional error reduction (albeit very minimal), decreasing the CF and increasing MS allows one, once more, to significantly decrease the tree size without affecting the error rate.

Fig. 25. Number of tree nodes versus tree error rate for the satellite dataset, when modifying the minimum number of cases split off (MS) during C4.5 growing. The difference in error is not easily seen because of the complete range of MS plotted; a zoomed-in view of the knee is shown in Fig. 26. What is evident in this graph is the knee in the curve at approximately 100 nodes. The significant reduction in tree size from the default value is caused by increasing MS from 2 to 12. A size reduction of approximately 225 nodes is achieved with no apparent penalty in error rate.

Fig. 26. Number of tree nodes versus tree error rate for a zoomed portion of Fig. 25. The pairs of points are for MS values of 42 and 72, respectively. Each point in a pair has approximately the same number of nodes; however, backwards adjusting improves the error rate by approximately 1% (MS = 42 has a .8% reduction and MS = 62 has a 1.21% reduction).
Fig. 27. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the satellite dataset. Modifying the confidence factor from the default of 25% to 5% produces a tree on the blue curve (shown at the knee of the curve) with approximately 225 nodes. On the red curve the knee occurs at a CF value of 1% and has approximately 150 nodes. Interestingly, CF values of 10%, 5%, 1%, .1% and .01% all provide error rates and tree sizes less than the ones observed for the default parameter value.

The segment database is shown in Fig. 28 when modifying MS and in Fig. 30 when modifying CF. At the knee of Fig. 28, the pairs show a 2% difference in error rate (Fig. 29) as a result of the backwards adjusting. Modifying MS decreases the overall tree size but slightly increases the error rate. No special trend can be concluded from Fig. 30, because the difference in error rate is insignificantly small. All that can be said is that the backwards adjusting does not cause worse performance, and that decreasing the CF value increases the error rate minutely.

Fig. 28. Number of tree nodes versus tree error rate for the segment dataset, when modifying the minimum number of cases split off (MS) during C4.5 growing. The results are similar to those found for the satellite database when using the same configuration; however, the magnitude of the decrease in nodes is smaller, and there is a slight gain in error when increasing MS. A zoomed-in version of the knee is shown in Fig. 29.

Fig. 29. Number of tree nodes versus tree error rate for a zoomed portion of Fig. 28. The pairs of values correspond to MS values of 42, 72, 102 and 132. The largest difference in error rate between the C4.5_MS and C4.5A_MS points is observed for the MS value of 72 (2%). Also, for each pair of values shown in the graph, C4.5_MS and C4.5A_MS create trees of the same size.

Fig. 30. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the segment dataset. The knees of both curves occur at the pair at approximately 47 nodes (CF value of 1%). At this location the increase in error rate compared to the default value position is on the order of tenths of a percent, whereas the decrease in the number of nodes is approximately 10 nodes. The difference in performance provided by the backwards adjusting phase is inconsequential.

The final dataset to be examined is the waveform dataset. It is shown in Fig. 31 and Fig. 32 when modifying MS and CF, respectively.
Modifying the values of MS and CF from the default settings shows a reduction in the size of the tree and an improvement in error rate in both cases.

Fig. 31. Number of tree nodes versus tree error rate for the waveform dataset, when modifying the minimum number of cases split off (MS) during C4.5 growing. Increasing MS decreases the number of nodes as well as the error rate slightly. The knees occur at about 50 nodes for both curves and correspond to an MS of 32. At this point the error rate is also .5% less than at the default parameter pair for both the blue and red curves. Backwards adjusting provides a minimal amount of error rate reduction.

Fig. 32. Number of tree nodes versus tree error rate, when modifying the confidence factor (CF) during C4.5 pruning for the waveform dataset. Modifying the value of CF from 25% to 1% produces a knee in the curve which yields a 1% reduction in error rate as well as a tree that is approximately half the size (160 nodes compared to 325 nodes). Interestingly, each of the other values of CF plotted (CF of 10%, 5%, .1% and .01%) also produces a knee in the curve which exhibits a lower error rate and even fewer tree nodes, compared to the error rate and tree size observed for the default CF value.

VI. SUMMARY AND CONCLUSIONS

In this work we have focused on C4.5, one of the classical, and most often used, decision tree classifiers. We developed a software methodology to implement the C4.5 classifier that is applicable to other tree classifiers and a variety of other machine learning algorithms. We have also introduced an innovative pruning strategy referred to as the "backwards adjusting algorithm", explained the motivation for its introduction, and compared its effect on the growing and pruning phases of the C4.5 algorithm. More specifically, the backwards adjusting algorithm was interjected between the growing and the pruning phases of the C4.5 algorithm. We conducted a thorough experimentation comparing C4.5 and C4.5A (C4.5 with the backwards adjusting algorithm). In this experimentation we used a number of simulated and real datasets and compared the two algorithms using two figures of merit: the tree size and the tree error rate. In the process of this experimentation we also changed some of the default parameter values that the C4.5 classifier uses (a CF (confidence factor) value of 0.25 and an MS (minimum number of split-off cases) value of 2). Our experiments illustrated that C4.5A almost always improves the tree error rate compared to C4.5, and in a few cases it significantly improved the tree error rate (e.g., the Gaussian, segment, and satellite datasets). An important byproduct of our work was the realization that by changing the default parameter values in C4.5 (CF and MS) we are able to design trees that are of much smaller size than the ones produced by the default C4.5 algorithm. The reduction in tree size was attained without affecting the default C4.5 error rate.
Other researchers have observed that reductions in C4.5 tree size can be attained by changing the CF value but, to the best of our knowledge, nobody has experimented with the MS value, or produced such thorough and illustrative examples, and associated explanations, of the significant effect that the CF and MS parameter values have on the size of the produced C4.5 tree.

APPENDIX

A. Binomial Estimate and Confidence Factor

In this appendix we explain the meaning of the CF value. The definition of this parameter is not given explicitly in [1], but the expression can be found in [27] and is shown below:

CF = \sum_{x=0}^{E} \binom{N}{x} \, p^{x} (1-p)^{N-x}    (6)

Equation (6) indicates that the estimated probability of error p is a Bernoulli (binomial) one-sided confidence bound. For example, if CF = 25%, the 75% confidence interval of the error rate on unseen cases is [0, p], where p is the solution of (6) and is used as a pessimistic error rate estimate. CF can be interpreted as the probability that there will be E or fewer errors in N unseen cases, after E errors have been observed in the N training examples reaching a node. C4.5 sets the confidence level to a default of 25%, although this can be modified by the user via an option. The pruning phase counts the number of errors E in a node as well as the total number of cases N from the training set passed to the node; the training cases in the node are treated as a statistical sample from which an error probability can be estimated. The only unknown remaining in (6) is then the probability p, so with CF, N, and E known at a node, equation (6) is solved for p. This misclassification rate p can then be multiplied by the number of cases in the node to predict how many errors will occur there. The lower the value of CF, the more pessimistic the pruning, because the solved error rate p increases. For instance, when E = 0, (6) reduces to CF = (1-p)^N, so p = 1 - CF^{1/N}, which grows as CF shrinks. In the limiting case CF = 0, the 100% confidence interval of the error rate is clearly [0, 1], and thus the estimated error rate is 1.
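To make the computation concrete, the following Python sketch solves (6) numerically for p given CF, N, and E, by bisection on the binomial cumulative distribution function (scipy.stats.binom.cdf). It is only an illustration of the relationship described above; C4.5's own code may compute this bound differently (for example, through approximations), so the sketch should not be taken as its exact implementation.

# Sketch: solve equation (6) for the pessimistic error rate p, given CF, N, and E.
# Not C4.5's internal routine; this simply inverts the binomial CDF by bisection.
from scipy.stats import binom

def pessimistic_error_rate(cf, n, e, tol=1e-9):
    """Return p such that P(X <= e | n trials, error probability p) equals cf."""
    if cf <= 0.0:
        return 1.0          # CF = 0: the bound degenerates to p = 1, as noted above
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # binom.cdf(e, n, p) decreases as p grows, so move the bracket accordingly.
        if binom.cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: a pure leaf (E = 0) covering N = 10 training cases with the default CF = 25%.
# The closed form 1 - CF**(1/N) gives about 0.129, and the numerical solution agrees.
print(pessimistic_error_rate(0.25, 10, 0))   # ~0.129
print(1.0 - 0.25 ** (1.0 / 10.0))            # ~0.129

Multiplying the returned p by the number of cases in the node then gives the pessimistic number of errors used during pruning, as described above.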
REFERENCES

[1] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[2] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[3] L. Hyafil and R. L. Rivest, "Constructing optimal binary decision trees is NP-complete," Inf. Process. Lett., vol. 5, pp. 15-17, 1976.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth, 1984.
[5] B. Cestnik, "Estimating probabilities: A crucial task in machine learning," 1990.
[6] T. Niblett and I. Bratko, "Learning decision rules in noisy domains," Brighton, United Kingdom, 1986.
[7] J. R. Quinlan, "Simplifying decision trees," Int. J. Hum.-Comput. Stud., vol. 51, pp. 497-510, 1999.
[8] M. Mehta, J. Rissanen, and R. Agrawal, "MDL-based decision tree pruning," 1995.
[9] M. Dong and R. Kothari, "Classifiability based pruning of decision trees," 2001.
[10] B. Kijsirikul and K. Chongkasemwongse, "Decision tree pruning using backpropagation neural networks," 2001.
[11] F. Esposito, D. Malerba, and G. Semeraro, "A comparative analysis of methods for pruning decision trees," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, pp. 476-491, 1997.
[12] X. Li, J. R. Sweigart, J. T. C. Teng, J. M. Donohue, L. A. Thombs, and S. M. Wang, "Multivariate decision trees using linear discriminants and tabu search," IEEE Trans. Systems, Man, and Cybernetics, Part A, vol. 33, pp. 194-205, 2003.
[13] S. Murthy, S. Kasif, and S. Salzberg, "A system for induction of oblique decision trees," Journal of Artificial Intelligence Research, vol. 2, pp. 1-32, 1994.
[14] C. T. Yildiz and E. Alpaydin, "Omnivariate decision trees," IEEE Trans. Neural Networks, vol. 12, pp. 1539-1547, 2001.
[15] T. Elomaa and J. Rousu, "General and efficient multisplitting of numerical attributes," Machine Learning, vol. 36, pp. 201-244, 1999.
[16] E. Cantú-Paz and C. Kamath, "Inducing oblique decision trees with evolutionary algorithms," IEEE Trans. Evol. Comput., vol. 7, pp. 54-68, 2003.
[17] M. Zorman and P. Kokol, "Hybrid NN-DT cascade method for generating decision trees from backpropagation neural networks," presented at the 9th International Conference on Neural Information Processing (ICONIP '02), 2002.
[18] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization," Machine Learning, vol. 40, pp. 139-158, 2000.
[19] L. Todorovski and S. Džeroski, "Combining classifiers with meta decision trees," Machine Learning, vol. 50, pp. 223-249, 2003.
[20] R. Jin and G. Agrawal, "Communication and memory efficient parallel decision tree construction," presented at the 3rd SIAM International Conference on Data Mining, San Francisco, CA, 2003.
[21] M. V. Joshi, G. Karypis, and V. Kumar, "ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets," presented at the 11th International Parallel Processing Symposium, 1998.
[22] M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A fast scalable classifier for data mining," presented at the 5th International Conference on Extending Database Technology, Avignon, France, 1996.
[23] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining," presented at the 22nd International Conference on Very Large Databases, Mumbai, India, 1996.
[24] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI Repository of machine learning databases," University of California, Irvine, CA, 1998.
[25] J. R. Beck, "Implementation and experimentation with C4.5 decision trees," in Electrical and Computer Engineering. Orlando: University of Central Florida, 2007.
[26] M. Stone, "Cross-validation: A review," Mathematics, Operations and Statistics, vol. 9, pp. 127-140, 1978.
[27] T. Windeatt and G. Ardeshir, "An empirical comparison of pruning methods for ensemble classifiers," Proc. of Advances in Intelligent Data Analysis (IDA 2001), vol. LNCS 2189, pp. 208-217, 2001.