
On a Unified Framework for Sampling With and Without Replacement in Decision Tree Ensembles
J.M. Martı́nez-Otzeta, B. Sierra, E. Lazkano, and E. Jauregi
Department of Computer Science and Artificial Intelligence
University of the Basque Country
P. Manuel Lardizabal 1, 20018 Donostia-San Sebastián
Basque Country, Spain
[email protected]
http://www.sc.ehu.es/ccwrobot
Abstract. Classifier ensembles are an active area of research within the machine learning community. One of the most successful techniques is bagging, where an algorithm (typically a decision tree inducer) is applied over several different training sets, obtained by applying sampling with replacement to the original database. In this paper we define a framework in which sampling with and without replacement can be viewed as the extreme cases of a more general process, and analyze the performance of the extension of bagging to this framework.
Keywords: Ensemble Methods, Decision Trees.
1 Introduction
One of the most active areas of research in the machine learning community is the study of classifier ensembles.
Combining the predictions of a set of component classifiers has been shown to yield accuracy higher than that of the most accurate component on a wide variety of supervised classification problems. To achieve this goal, various strategies for combining these classifiers are possible [Xu et al., 1992] [Lu, 1996] [Dietterich, 1997] [Bauer and Kohavi, 1999] [Sierra et al., 2001]. Good introductions to the area can be found in [Gama, 2000] and [Gunes et al., 2003]. For a comprehensive work on the issue see [Kuncheva, 2004].
The combination, mixture, or ensemble of classification models can be performed mainly by means of two approaches:
– Concurrent execution of several paradigms with a posterior combination of the individual decisions each model has given for the case to classify [Wolpert, 1992]; the combination can be done by voting or by means of more complex approaches [Ho et al., 1994].
– Hybrid approaches, in which two or more different classification systems are implemented together in one classifier [Kohavi, 1996].
When implementing a model belonging to the first approach, a necessary condition is that the ensemble classifiers be diverse. One way to achieve this consists of using several base classifiers, applying them to the database, and then combining their predictions into a single one. But even with a unique base classifier it is still possible to build an ensemble, by applying it to different training sets in order to generate several different models.
One way to obtain several training sets from a given dataset is bootstrap sampling, where sampling with replacement is performed, obtaining samples with the same cardinality as the original dataset but with different composition (some instances from the original set may be missing, while others may appear more than once).
This is the method that bagging [Breiman, 1996] uses to obtain several training databases from a unique dataset. In this work we present a sampling method that makes sampling with and without replacement appear as the two extreme cases of a more general continuous process. It is then possible to analyze the performance of bagging, or of any other algorithm that makes use of sampling with or without replacement, over the continuum that spans between the two extremes.
Typically, the base classifier in a given implementation of bagging is a decision tree, due to the fact that small changes in the training data tend to lead to proportionally big changes in the built tree.
A decision tree consists of nodes and branches that partition a set of samples into a set of covering decision rules. In each node a single test or decision is made to obtain a partition, the main task being to select the attribute that best separates the classes of the samples in the training set. In our experiments we use the well-known decision tree induction algorithm C4.5 [Quinlan, 1993].
The rest of the paper is organized as follows. Section 2 presents the proposed framework, with a brief description of the bagging algorithm. Section 3 describes the experimental setup in which the experiments were carried out. The obtained results are shown in Section 4, and Section 5 is devoted to conclusions and further work.
2 Unified Framework
In this section we define a sampling process of which sampling with replacement and sampling without replacement are the two extreme cases, with a continuous range of intermediate possibilities between them.
To define the general case, let us first take a glance at sampling with and without replacement:
– Sampling with replacement: an instance is sampled according to some probability function, it is recorded, and then returned to the original database
– Sampling without replacement: an instance is sampled according to some
probability function, it is recorded, and then discarded
Let us define now the following sampling method:
– Sampling with probability p of replacement: an instance is sampled according
to some probability function, it is recorded, and then, with a probability p,
returned to the original database
It is clear that the last definition is more general than the other two, and includes them.
The differences among the three processes are depicted in Figure 1.
[Figure 1 shows three diagrams, each with a Dataset and a Sample connected by extract/return arrows: a) sampling with replacement, b) sampling without replacement, c) generalized sampling, where the extracted instance is returned with probability p.]
Fig. 1. Graphical representation of the three different sampling methods
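To make the definition concrete, here is a minimal Python sketch of the generalized procedure (the function name and the list-based representation are ours, not from the paper); it draws instances uniformly, the usual special case of the probability function mentioned above:

```python
import random

def sample_with_replacement_prob(dataset, n, p, rng=random):
    """Draw n instances; each drawn instance is returned to the pool
    with probability p. p = 1 gives sampling with replacement,
    p = 0 gives sampling without replacement."""
    pool = list(dataset)           # working copy of the original database
    sample = []
    for _ in range(n):
        if not pool:               # the pool can empty out when p < 1
            break
        i = rng.randrange(len(pool))
        sample.append(pool[i])
        if rng.random() >= p:      # with probability 1 - p, discard the instance
            pool.pop(i)
    return sample
```

With p = 1 the pool never shrinks and each draw is the usual bootstrap draw; with p = 0 the pool loses one instance per draw, giving sampling without replacement.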
As we have already noted, sampling with replacement is one of the extreme cases of the above definition: when p is 1, every instance sampled is returned to the database. Sampling without replacement is the opposite case, where p is 0, so a sampled instance is discarded and never returns to the database.
Some questions arise: is it possible to apply this method to some problem where sampling with replacement is used to overcome the limitations of sampling without replacement? Will the best results be obtained when p is 0 or 1, or maybe at some intermediate point?
We have tested this generalization in one well-known algorithm: bagging, where sampling with replacement is performed.
2.1 Bagging
Bagging (bootstrap aggregating) is a method that combines the predictions of several classifiers by means of voting. These classifiers are built from several training sets obtained from a unique database through sampling with replacement.
Leo Breiman described this technique in the early 90's and it has been widely used to improve the results of single classifiers, especially decision trees.
Each individual model is created from an instance set with the same number of elements as the original one, but obtained through sampling with replacement. Therefore, if the original database contains m elements, every element has a probability 1 − (1 − 1/m)^m of being selected at least once in the m times sampling is performed. The limit of this expression for big values of m is 1 − 1/e ≈ 0.632. Therefore, on average, only about 63.2% of the original cases will be present in the new set, some of them appearing several times.
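The 63.2% figure is easy to check numerically; a quick sketch:

```python
import math

# P(a given element is selected at least once in m draws
# with replacement) = 1 - (1 - 1/m)^m
for m in (10, 100, 1000, 1_000_000):
    print(m, 1 - (1 - 1 / m) ** m)

print("limit 1 - 1/e =", 1 - 1 / math.e)   # ~ 0.63212
```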
Bagging Algorithm
– Initialize parameters
  • Initialize the set of classifiers D = ∅
  • Initialize N, the number of classifiers
– For n = 1, ..., N:
  • Extract a sample Bn through sampling with replacement from the original database
  • Build a classifier Dn taking Bn as training set
  • Add the classifier obtained in the previous step to the set of classifiers: D = D ∪ {Dn}
– Return D
It is straightforward to apply the previously introduced approach to bagging: the only modification to the algorithm consists in replacing the standard sampling procedure by the generalization described above.
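Below is a sketch of this modified bagging, assuming NumPy arrays with integer class labels; scikit-learn's DecisionTreeClassifier (a CART-style inducer) stands in for C4.5, and the function names are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # CART stands in for C4.5 here

def generalized_bagging(X, y, n_classifiers=10, p=1.0, seed=0):
    """Bagging in which each sampled index is returned to the pool with
    probability p (p = 1 recovers standard bagging; p = 0 reduces to a
    permutation of the original data)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    ensemble = []
    for _ in range(n_classifiers):
        pool = list(range(m))
        idx = []
        for _ in range(m):
            if not pool:
                break
            j = int(rng.integers(len(pool)))
            idx.append(pool[j])
            if rng.random() >= p:        # discard with probability 1 - p
                pool.pop(j)
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def vote(ensemble, X):
    """Combine the component predictions by plain majority voting."""
    preds = np.stack([clf.predict(X) for clf in ensemble]).astype(int)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```

Note that with p = 0 every component tree sees a permutation of the same data, so the ensemble behaves like the base classifier alone, which is exactly the reference case used in Section 4.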
3 Experimental Setup
In order to evaluate the performance of the proposed sampling procedure, we have carried out an experiment over a large number of databases from the well-known UCI repository [Newman et al., 1998]. To do so, we have selected all the databases of medium size (between 100 and 1000 instances) among those converted to the MLC++ [Kohavi et al., 1997] format and located in the public repository at http://www.sgi.com/tech/mlc/db/. This amounts to 59 databases.
Table 1. Characteristics of the 41 databases used in this experiment

Database         #Instances  #Attributes  #Classes
Anneal                  798           38         6
Audiology               226           69        24
Australian              690           14         2
Auto                    205           26         7
Balance-Scale           625            4         3
Banding                 238           29         2
Breast                  699           10         2
Breast-cancer           286            9         2
Cars                    392            8         3
Cars1                   392            7         3
Cleve                   303           14         2
Corral                  129            6         2
Crx                     690           15         2
Diabetes                768            8         2
Echocardiogram          132            7         2
German                 1000           20         2
Glass                   214            9         7
Glass2                  163            9         2
Hayes-Roth              162            4         3
Heart                   270           13         2
Hepatitis               155           19         2
Horse-colic             368           28         2
Hungarian               294           13         2
Ionosphere              351           34         2
Iris                    150            4         3
Liver-disorder          345            6         2
Lymphography            148           19         4
Monk1                   432            6         2
Monk2                   432            6         2
Monk3                   432            6         2
Pima                    768            8         2
Primary-org             339           17        22
Solar                   323           11         6
Sonar                   208           59         2
ThreeOf9                512            9         2
Tic-tac-toe             958            9         2
Tokyo1                  963           44         2
Vehicle                 846           18         4
Vote                    435           16         2
Wine                    178           13         3
Zoo                     101           16         7
From these 59 databases we have selected one of each family of problems; for example, we have chosen monk1 and not monk1-cross, monk1-full or monk1-org. After this final selection, we were left with the 41 databases shown in Table 1.
begin Generalized sampling testing
  Input: 41 databases from UCI repository
  For every database in the input
    For every p in the range 0..1 in steps of 0.00625
      For every fold in a 10-fold cross-validation
        Construct 10 training sets, sampling according to parameter p
        Induce models from those sets
        Present the test set to every model
        Vote among the individual predictions
        Return the ensemble prediction and accuracy
      end For
    end For
  end For
end Generalized sampling testing
Fig. 2. Description of the testing process of the generalized sampling algorithm
The sampling generalization described in the previous section makes use of a continuous parameter p. In the experiments carried out we have tested the performance of every value of p between 0 and 1 in steps of width 0.00625, which amounts to a total of 161 discrete values. For every value of p a 10-fold cross-validation has been carried out.
Figure 2 depicts the algorithm used for the evaluation.
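A sketch of this evaluation loop for a single database follows; it reuses the generalized_bagging and vote sketches given earlier, and the scikit-learn cross-validation utilities are our choice, not necessarily the paper's exact tooling:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def evaluate_database(X, y, n_classifiers=10, step=0.00625, seed=0):
    """Estimate ensemble accuracy for each value of p on a 0..1 grid
    (161 values for step 0.00625), via 10-fold cross-validation
    as in Figure 2."""
    results = {}
    for p in np.arange(0.0, 1.0 + step / 2, step):
        folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        accs = []
        for train, test in folds.split(X, y):
            ens = generalized_bagging(X[train], y[train], n_classifiers, p, seed)
            accs.append(accuracy_score(y[test], vote(ens, X[test])))
        results[round(float(p), 5)] = float(np.mean(accs))
    return results
```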
4 Experimental Results
In this section we present the experimental results obtained from an experiment following the methodology described in the previous section.
We were interested in analyzing the performance of the modification of bagging when sampling with replacement was replaced by our approach, and in finding out for which values of p better results are obtained. Therefore, to analyze the data of all the databases, we have normalized the performances, taking as unit the case where p is zero. This case is equivalent to sampling without replacement, so every set obtained in this way from a given dataset is equivalent to the original one; it corresponds to applying the base classifier (in our case C4.5) without any modification at all, and will be our reference when comparing performances with other p values.
For example, if over dataset A we obtain an accuracy of 50% with p = 0, and the performance with p = 0.6 is 53%, the normalized values would be 1 and 1.06 respectively. In other words, the accuracy in the second case is six percent better than in the first one. This normalization will permit us to analyze which values of p yield better accuracy with respect to the base classifier.
Standard bagging is the case where p takes the value 1. The obtained databases are diverse, and this is one of the causes of the expected better performance. But,
is this gain in performance uniform over the whole interval? Our results show that it is not: beyond p = 0.5 there are no noticeable gains, the most important shifts occurring around p = 0.15 and p = 0.25. This means that a small diversity between samplings could lead to results similar to those obtained with the big diversity that bagging produces.
After normalization, each database has an associated performance (typically between 0.9 and 1.1) for every value of p; this performance is the result of the 10-fold cross-validation explained in the previous section. After applying linear regression, we obtained the results shown below.
Fig. 3. Regression line for values of p between 0 and 0.25: y = 0.0763x + 1.0068
In every figure, the X axis is p, while the Y axis is the normalized performance. Figures 3, 4, 5 and 6 show the results in the intervals (0, 0.25), (0.25, 0.5), (0.5, 0.75) and (0.75, 1), respectively. In every interval, the normalization has been carried out with respect to the lower limit of the interval, in order to make clear the gains within that interval.
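A minimal sketch of this per-interval fit is given below (names and the pooling of all databases' points are ours; the paper normalizes each database separately, so this is a simplification). np.polyfit with degree 1 corresponds to the lines of Figures 3-6, and degree 6 to the polynomial of Figure 7:

```python
import numpy as np

def interval_fit(p_vals, perf, lo, hi, degree=1):
    """Fit a regression of the given degree to the (p, performance)
    points whose p value lies in [lo, hi], renormalizing by the
    performance at the interval's lower limit."""
    mask = (p_vals >= lo) & (p_vals <= hi)
    x, y = p_vals[mask], perf[mask]
    y = y / y[np.argmin(x)]            # normalize w.r.t. the lower limit
    return np.polyfit(x, y, degree)    # highest-degree coefficient first

# e.g. slope and intercept of the line in the (0, 0.25) interval:
# slope, intercept = interval_fit(p_vals, perf, 0.0, 0.25)
```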
Observing the slope of the regression line, we note that the biggest gains are in the first interval. In the second one there are some gains, but not of the same magnitude as in the first. In the two last intervals the gains are very small, if any.
Let us note too that this means that the cloud of points in Figure 3 is skewed towards bigger performance values than the clouds depicted in Figures 4, 5 and 6. The extreme lower values in Figure 3 are around 0.95, while in the other intervals some values below that limit appear. This means the chances of an important drop in performance are much smaller than the opposite.
With respect to Figure 4, where some perceptible gains are still achieved, it can be observed that performances below 0.92 are extremely rare, while in Figure 5 a higher number of them appear. In Figure 6, apart from a unique extreme case below 0.90, performances below 0.92 are very rare too. More detailed analysis is needed to distinguish true patterns behind these data from statistical fluctuations.
From these data it looks as if with little diversity it is possible to achieve the same results as with bagging. Figure 7 shows the result of a polynomial regression of sixth degree; values close to the maximum are reached around p = 0.15.
Fig. 4. Regression line for values of p between 0.25 and 0.50: y = 0.0141x + 0.9951
Fig. 5. Regression line for values of p between 0.50 and 0.75: y = 0.0008x + 0.9969
Fig. 6. Regression line for values of p between 0.75 and 1: y = −0.0004x + 1.0017
Fig. 7. Sixth degree polynomial regression for values of p between 0 and 0.25
5 Conclusions and Further Work
In this paper we have defined a generalization of sampling that includes sampling with and without replacement as extreme cases. This sampling has been applied to the bagging algorithm in order to analyze its behavior. The results suggest that ensembles with less diversity than those obtained by applying bagging could achieve similar performances.
The analysis carried out in the previous sections has been made over the accumulated data of all the 41 databases, so another line of research could consist of a detailed analysis of performance over any given database. In this way, it could be possible to characterize the databases for which improvements in the interval (0, 0.25) are more noticeable and, on the other hand, the databases for which improvements are achieved in intervals other than (0, 0.25). Let us note that the above results have been obtained putting together the 41 databases, so it is expected that some databases will behave differently from the main trend; in some cases they will go against the main behavior, and in others their results will follow the same line, but much more markedly.
As further work, a better analysis of the interval (0, 0.25), where the most
dramatic changes occur, would be of interest.
A study of similarity measures applied over the ensembles obtained with different p values would also be desirable, along with theoretical work.
Acknowledgments
This work has been supported by the Ministerio de Ciencia y Tecnologı́a under
grant TSI2005-00390 and by the Gipuzkoako Foru Aldundia OF-838/2004.
References

[Bauer and Kohavi, 1999] Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1-2):105–142.
[Breiman, 1996] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
[Dietterich, 1997] Dietterich, T. G. (1997). Machine learning research: four current directions. AI Magazine, 18(4):97–136.
[Gama, 2000] Gama, J. (2000). Combining Classification Algorithms. PhD thesis, University of Porto.
[Gunes et al., 2003] Gunes, V., Ménard, M., and Loonis, P. (2003). Combination, cooperation and selection of classifiers: A state of the art. International Journal of Pattern Recognition, 17(8):1303–1324.
[Ho et al., 1994] Ho, T. K., Hull, J. J., and Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:66–75.
[Kohavi, 1996] Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
[Kohavi et al., 1997] Kohavi, R., Sommerfield, D., and Dougherty, J. (1997). Data mining using MLC++, a machine learning library in C++. International Journal of Artificial Intelligence Tools, 6(4):537–566.
[Kuncheva, 2004] Kuncheva, L. I. (2004). Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, Hoboken, New Jersey.
[Lu, 1996] Lu, Y. (1996). Knowledge integration in a multiple classifier system. Applied Intelligence, 6(2):75–86.
[Newman et al., 1998] Newman, D., Hettich, S., Blake, C., and Merz, C. (1998). UCI repository of machine learning databases.
[Quinlan, 1993] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
[Sierra et al., 2001] Sierra, B., Serrano, N., Larrañaga, P., Plasencia, E. J., Inza, I., Jiménez, J. J., Revuelta, P., and Mora, M. L. (2001). Using Bayesian networks in the construction of a bi-level multi-classifier. Artificial Intelligence in Medicine, 22:233–248.
[Wolpert, 1992] Wolpert, D. (1992). Stacked generalization. Neural Networks, 5:241–259.
[Xu et al., 1992] Xu, L., Kryzak, A., and Suen, C. Y. (1992). Methods for combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on SMC, 22:418–435.