Ensemble Learning with Local Diversity*

Ricardo Ñanculef1, Carlos Valle1, Héctor Allende1, and Claudio Moraga2,3

1 Universidad Técnica Federico Santa María, Departamento de Informática,
CP 110-V Valparaíso, Chile
{jnancu, cvalle, hallende}@inf.utfsm.cl
2 European Centre for Soft Computing, 33600 Mieres, Asturias, Spain
3 Dortmund University, 44221 Dortmund, Germany
[email protected]
Abstract. The concept of diversity is now recognized as a key characteristic of successful ensembles of predictors. In this paper we investigate an algorithm to generate diversity locally in regression ensembles of neural networks, based on the idea of imposing a neighborhood relation over the set of learners. In this algorithm each predictor iteratively improves its state considering only information about the performance of its neighbors, generating a sort of local negative correlation. We assess our technique on two real data sets and compare it with Negative Correlation Learning, an effective technique for obtaining diverse ensembles. We show that the local approach achieves results better than or comparable to those of the global one.
1 Introduction
Ensemble methods offer a simple and flexible way to build powerful learning machines for a great variety of problems including classification, regression and clustering [7] [3]. An ensemble algorithm to learn a set of examples D = {(x_i, y_i); i = 1, ..., m} selects a set of predictors S = {h_0, h_1, ..., h_{n-1}} from some base hypothesis space H and builds a decision function f as a composition f = ⊕S, where ⊕ is an aggregation operator such as voting for categorical outputs or a linear combination for continuous outputs.
To be useful, the set S has to have some degree of heterogeneity or diversity that allows the group to compensate individual errors and reach a better expected performance. The characterization of methods to generate diversity has matured in recent years [5] [6] and the concept is now recognized as a central element for obtaining significant performance improvements with the ensemble. Negative Correlation Learning [9] [11], for example, has proved to be an effective method to obtain diversity in regression ensembles.
* This work was supported in part by Research Grant Fondecyt (Chile) 1040365 and 7050205, and in part by Research Grant DGIP-UTFSM (Chile). Partial support was also received from Research Grant BMBF (Germany) CHL 03-Z13.
In this paper we propose to introduce a neighborhood relation over the ensemble to generate diversity locally. Each predictor iteratively improves its state considering only information about the performance of its neighbors, generating a sort of local negative correlation. Different neighborhood sizes control the visibility each learner has of the ensemble, and allow us to cover the gap between a single learning machine and completely independent learners. We will show that a desired global behavior can be generated considering only local learning rules with small neighborhood sizes. Clearly, the more local the learning rules, the more efficient the algorithm.
The remainder of this paper is organized as follows. In the next section we introduce the concept of diversity in regression ensembles and the Negative Correlation Learning algorithm. Our proposal is introduced in Section 3. In Section 4 we provide a set of experimental results on two real data sets to assess the proposed algorithm and compare it with Negative Correlation Learning. Conclusions and future work close the article.
2 Diversity and Negative Correlation Learning
The problem in designing successful ensemble learning algorithms is how to select an appropriate set of predictors from the base hypothesis space H and how to aggregate them. Studies tend to show that diversity in the errors of the predictors is a key characteristic for better generalization performance, although the ways of approaching the concept are highly heterogeneous. Please refer to [3] for an exhaustive taxonomy of diversity creation methods in classification and regression scenarios.
In the case of regression estimation, the so-called Ambiguity Decomposition [1] is well known for the quadratic loss of an ensemble F obtained as a convex combination of a set of n predictors f_0, f_1, ..., f_{n-1},

\bar{e} = (F - y)^2 = \sum_{i=0}^{n-1} w_i (y - f_i)^2 - \sum_{i=0}^{n-1} w_i (f_i - F)^2        (1)
where (and for the remainder of this paper)

F(x) = \sum_{i=0}^{n-1} w_i f_i(x)        (2)
This decomposition states that the error of the ensemble can be decomposed into two terms: the first is the weighted aggregation of the individual errors (y - f_i)^2, and the second (called ambiguity) measures the deviations of the individual predictions around the ensemble prediction. Clearly, the higher the second term, the lower the ensemble error, so the ambiguity seems an appropriate way to quantify the concept of diversity. If the ambiguity is positive, the ensemble loss is guaranteed to be less than the averaged individual errors.
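To make the decomposition concrete, the following small numerical check (a sketch with invented predictions and uniform weights, not data from the paper) verifies equation (1) at a single input point.

```python
import numpy as np

# Invented predictions of n = 3 ensemble members at one input point,
# with a target y; the numbers are for illustration only.
f = np.array([2.1, 1.8, 2.5])    # individual predictions f_i
w = np.ones_like(f) / len(f)     # uniform convex weights w_i
y = 2.0                          # target value

F = np.sum(w * f)                # ensemble prediction, eq. (2)

ensemble_error = (F - y) ** 2                 # left-hand side of eq. (1)
weighted_error = np.sum(w * (y - f) ** 2)     # weighted individual errors
ambiguity      = np.sum(w * (f - F) ** 2)     # spread around the ensemble

# Both sides of the ambiguity decomposition agree up to rounding error.
assert np.isclose(ensemble_error, weighted_error - ambiguity)
```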
The Ambiguity Decomposition suggests that individual learners should be trained considering information about their deviations around the ensemble. For example, the i-th learner could be trained considering the objective function

e_i = (y - f_i)^2 - \lambda (f_i - F)^2        (3)
where the parameter λ weights the importance of the ambiguity component. As noted in [3], if the ensemble is uniformly weighted we obtain

e_i = (y - f_i)^2 - \lambda \sum_{j \neq i} (f_i - F)(f_j - F)        (4)
which is the learning rule considered in the Negative Correlation Learning algorithm described in [11] [9] for neural network ensembles and strongly related to the constructive algorithm proposed in [4]. In this approach the set of learners is trained in parallel and synchronously: at time t, learner i is trained to minimize (4) with the ensemble F computed from the states of the learners at the previous time t - 1.
If λ = 1, the ensemble is in effect trained as one large learning machine whose components f_i are joined by non-trainable connections. This can be verified by taking the derivatives of e_i and \bar{e} with respect to f_i and noting that they are proportional. On the other hand, if λ = 0, each learner is trained independently. The parameter λ ∈ [0, 1] covers the range between these two extremes. In [3] it is argued that increasing λ can force a wider basin of attraction for the objective function to be minimized, increasing in this way the set of configurations of the learners that reach the minimum. In practice, and depending on the optimization landscape induced by the data, better solutions can be achieved using λ between 0 and 1.
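As an illustration of how rule (3) can be used in practice, the sketch below computes the per-example NC objective and its gradient with respect to the learner's output; this gradient would then be backpropagated through the network. The function name, the vectorized numpy formulation and the example values are ours, not the exact implementation of [11] or [9].

```python
import numpy as np

def nc_loss_and_grad(f_i, f_ens, y, lam):
    """Negative-correlation objective of eq. (3) and its gradient with
    respect to the learner's own output f_i.  The ensemble output f_ens
    is treated as a constant taken from the previous synchronous step,
    as in the parallel training scheme described in the text."""
    loss = (y - f_i) ** 2 - lam * (f_i - f_ens) ** 2
    grad = -2.0 * (y - f_i) - 2.0 * lam * (f_i - f_ens)
    return loss.mean(), grad

# Invented outputs on two examples, only to show the call signature.
f_i = np.array([0.4, 1.2])    # learner i's predictions
F   = np.array([0.5, 1.0])    # ensemble predictions at time t - 1
y   = np.array([0.3, 1.1])    # targets
loss, grad = nc_loss_and_grad(f_i, F, y, lam=0.5)
```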
3 Generating Diversity Locally
In Negative Correlation Learning each member of the ensemble is trained with information about the group behavior, and the weight each predictor gives to this information is controlled by λ. With λ < 1, each predictor sees the ensemble performance only partially. From this observation it seems natural to restrict the visibility each predictor has of the group, allowing it to learn considering only the performance of a subset of the set of learners. To accomplish this we introduce a neighborhood relation on the ensemble.
Let S = {f_0, f_1, ..., f_{l-1}} be a set of learners. A one-dimensional linear neighborhood relation of order ν on S is a function ψ : S × S → {0, 1} such that

\psi(f_i, f_j) = 1 \iff (i - j) \bmod l \leq \nu \ \text{ or } \ (j - i) \bmod l \leq \nu        (5)

where i mod l denotes the remainder of the division of i by l. The neighborhood V_i of a learner f_i is the set of learners f_j, j ≠ i, for which ψ(f_i, f_j) = 1. Geometrically we obtain a ring, where two learners are neighbors if they are contiguous up to ν steps.
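The relation (5) is easy to realize in code. Below is a small helper (our own illustration, using modular index arithmetic matching the definition above) that returns the indices of the neighborhood V_i on the ring.

```python
def ring_neighborhood(i, n_learners, nu):
    """Indices of the order-nu neighborhood V_i of learner i on a ring of
    n_learners cells, following eq. (5): j is a neighbor of i when the
    circular distance between them, in either direction, is at most nu."""
    return [j for j in range(n_learners)
            if j != i and ((i - j) % n_learners <= nu
                           or (j - i) % n_learners <= nu)]

# With 6 learners and nu = 1, learner 0 sees learners 1 and 5,
# and learner 3 sees learners 2 and 4.
assert ring_neighborhood(0, 6, 1) == [1, 5]
assert ring_neighborhood(3, 6, 1) == [2, 4]
```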
In [8] a bi-dimensional neighborhood relation is considered to obtain a cellular automaton of learning machines and then to select a subset of machines to build an ensemble. However, the learners are trained independently before the construction of the bi-dimensional arrangement, and the neighborhood relation is only used to determine which predictors will "survive" from the original pool of trained learners. We want to use the neighborhood relation to define the learning process itself for each individual learner, in such a way that each predictor adapts its state based on the behavior of its neighbors. Since diversity is desirable for successful ensembles, we want to define local learning rules encouraging the diversity of the set of learners.
A way to accomplish this would be to restrict the negative correlation learning rule for the i-th learner in equation (4) so that it takes into account only the deviations of the neighbors of f_i around the ensemble. The learning function for this learner would be

e_i^{vic} = (y - f_i)^2 - \lambda \sum_{j \in V_i} (f_i - F)(f_j - F)        (6)
It should be noted that this learning rule is not completely local because it depends on the overall ensemble function F. However, we can obtain a decomposition similar to the Ambiguity Decomposition in which the diversity component does not depend directly on F. In fact, the ambiguity component of (1) can be written as

\sum_{i=0}^{n-1} w_i (f_i - F)^2 = \sum_{i=0}^{n-1} w_i ((f_i - y) - (F - y))^2
                               = \sum_{i=0}^{n-1} w_i (f_i - y)^2 - 2 \sum_{i=0}^{n-1} w_i (f_i - y)(F - y) + (F - y)^2        (7)
Replacing this in (1) we obtain

(F - y)^2 = 2 \sum_{i=0}^{n-1} w_i (f_i - y)(F - y) - (F - y)^2

and hence

(F - y)^2 = \sum_{i=0}^{n-1} w_i (f_i - y)(F - y)        (8)
Expanding F as in equation (2) one finally obtains

(F - y)^2 = \sum_{i=0}^{n-1} w_i^2 (y - f_i)^2 + \sum_{i=0}^{n-1} \sum_{j \neq i} w_i w_j (f_i - y)(f_j - y)        (9)
As in the Ambiguity Decomposition, the first term measures the individual performance of the estimators while the second measures a sort of error correlation between the different predictors. Clearly, negative error correlation is desirable to benefit from combining the predictors.
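Decomposition (9) is an algebraic identity and can be checked numerically; the sketch below uses invented predictions and uniform weights (an assumption for illustration only) and confirms that the squared ensemble error equals the sum of the two terms.

```python
import numpy as np

# Invented predictions, uniform weights and a target, for illustration.
f = np.array([2.1, 1.8, 2.5])
w = np.ones_like(f) / len(f)
y = 2.0
F = np.sum(w * f)

e = f - y                                   # individual errors f_i - y
squared_term = np.sum(w ** 2 * e ** 2)      # first term of eq. (9)
cross_term = np.sum(np.outer(w * e, w * e)) - squared_term  # pairs i != j

assert np.isclose((F - y) ** 2, squared_term + cross_term)
```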
From this decomposition, and neglecting the effect of the coupling coefficients w_j, it seems natural to train each learner with the training function

\tilde{e}_i = (y - f_i)^2 + \lambda \sum_{j \neq i} (f_i - y)(f_j - y)        (10)
where λ > 0 controls the importance of the group information versus the information about the individual performance. Now, we can restrict this training function to get a rule which depends only on the neighborhood of f_i,

l_i = (y - f_i)^2 + \lambda \sum_{j \in V_i} (f_i - y)(f_j - y)        (11)
As in the Negative Correlation Learning algorithm, we proceed iteratively, correcting the state of f_i at time t based on the states of its neighbors at time t - 1. Using f_i^t to denote the state of the learner f_i at time t, the learning function at time t can be explicitly expressed as

l_i^t = (y - f_i^t)^2 + \lambda \sum_{j \in V_i} (f_i^t - y)(f_j^{t-1} - y)        (12)
This procedure can be viewed as a synchronous cellular automaton whose cells correspond to the learners, whose states correspond to functions in the hypothesis space, and whose (stochastic) transition rules correspond, for all i, to

f_i^t = y - \lambda \sum_{j \in V_i} (f_j^{t-1} - y) + \epsilon_i^t        (13)
where \epsilon_i^t is the estimation error in the minimization of (12). Aggregating equation (13) over i = 0, ..., n - 1, the global ensemble transition rule is obtained:

F^t = y - \kappa (F^{t-1} - y) + \bar{\epsilon}^t        (14)

where κ = 2λν, \bar{\epsilon}^t = \sum_i w_i \epsilon_i^t is the aggregated estimation error at time t, and F^t is the state of the ensemble (2) at this time. If we think of equation (14) as applied to a fixed point (x, y), the gradient of C = (y - F^{t-1})^2 at F^{t-1} := F^{t-1}(x) is, up to a constant factor, ∇C = (F^{t-1} - y). Then, equation (14) can be written as

F^t = F^{t-1} - (1 + \kappa) \nabla C + \bar{\epsilon}^t        (15)
which shows that the global transition rule is in essence a gradient descent rule with step size 1 + κ. Now, let us suppose that at time t = 0 each learner approximates y with an additive error \epsilon_i^0, such that F^0 = y + \bar{\epsilon}^0. Then, the expected values S_t = E[F^t - y] are governed by the recurrence

S_t = E[\bar{\epsilon}^t] - \kappa S_{t-1}, \quad S_0 = E[\bar{\epsilon}^0]        (16)
whose solution for |κ| < 1 is

S_t = \sum_{i=0}^{t} (-\kappa B)^i E[\bar{\epsilon}^t] = \frac{1}{1 + \kappa B} E[\bar{\epsilon}^t]        (17)
where B is the backward operator, B^s x_t = x_{t-s}. If the sequence of expected estimation errors E[\bar{\epsilon}^t] does not depend on t, we have

F^t - y \ \longrightarrow_{t \to \infty} \ \frac{1 + (-\kappa)^{\infty}}{1 + \kappa} E[\bar{\epsilon}^0]        (18)
This property suggests choosing κ ∈ [0, 1), since κ > 1 yields a divergent sequence of biases around the target. Moreover, to minimize the limiting value of the biases S_t, we should choose κ close to 1.
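The behavior predicted by (16)-(18) can be visualized with a few lines of code. The sketch below iterates the bias recurrence under the simplifying assumption of a constant expected estimation error (set here to an arbitrary unit value); it settles near E[\bar{\epsilon}]/(1 + κ) for κ < 1 and diverges for κ > 1.

```python
def bias_recurrence(kappa, eps_mean=1.0, steps=100):
    """Iterate S_t = E[eps] - kappa * S_{t-1} of eq. (16) with a constant
    expected estimation error E[eps].  For |kappa| < 1 the sequence
    converges to E[eps] / (1 + kappa); for kappa > 1 it diverges."""
    s = eps_mean                    # S_0 = E[eps]
    trajectory = [s]
    for _ in range(steps):
        s = eps_mean - kappa * s
        trajectory.append(s)
    return trajectory

# kappa = 0.9 settles near 1 / 1.9 ~ 0.526; kappa = 1.1 blows up.
print(bias_recurrence(0.9)[-1], bias_recurrence(1.1)[-1])
```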
Algorithm 1. The Proposed Algorithm

1: Let n be the number of learners and ν the order of the neighborhood.
2: Let f_i^t, i = 0, ..., n - 1, be the function implemented by the learner f_i at time t = 0, ..., T, and V_i = {f_{i-ν}, ..., f_{i-1}, f_{i+1}, ..., f_{i+ν}} the neighborhood of this learner.
3: for t = 1 to T do
4:    Train each learner f_i for a number of epochs p to achieve the state

          f_i^t = y - \lambda \sum_{j \in V_i} (f_j^{t-1} - y)

      which corresponds to using the learning function

          e_i^t = (y - f_i)^2 + \lambda \sum_{j \in V_i} (f_i - y)(f_j^{t-1} - y)

      on the set of examples D = {(x_k, y_k); k = 1, ..., m}.
5: end for
6: Set the ensemble at time t to be F(x) = (1/n) \sum_{i=0}^{n-1} f_i(x).
The algorithm implementing the proposed approach is presented as Algorithm 1. It should be noted that there are two types of learning iterations: the loop over t in step 3 corresponds to group iterations, where the learners in the ensemble share information about their behavior, while the implicit loop of p training epochs in step 4 corresponds to individual iterations, where each learner modifies its state based on the group behavior information. In practice it can be observed that a single individual epoch can be enough to achieve good results.
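A possible realization of Algorithm 1 is sketched below. It trains each learner towards the pseudo-target y - λ Σ_{j∈V_i} (f_j^{t-1} - y) of step 4, with λ derived from κ through κ = 2λν. The use of scikit-learn's MLPRegressor with partial_fit, the logistic hidden units and all hyperparameter values are assumptions of this sketch, not the authors' exact implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_local_ensemble(X, y, n_learners=20, nu=1, kappa=0.9,
                         T=50, epochs_per_step=1, seed=0):
    """Sketch of Algorithm 1: at every group iteration t each learner is
    trained for a few epochs towards its local pseudo-target, which only
    involves the states of its ring neighbors at time t - 1."""
    rng = np.random.RandomState(seed)
    lam = kappa / (2.0 * nu)              # from the relation kappa = 2 * lam * nu
    nets = [MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                         random_state=rng.randint(1 << 30))
            for _ in range(n_learners)]
    for net in nets:                      # initial states f_i^0
        net.partial_fit(X, y)
    for t in range(1, T + 1):
        prev = np.array([net.predict(X) for net in nets])   # states at t - 1
        for i, net in enumerate(nets):
            neigh = [j for j in range(n_learners) if j != i and
                     ((i - j) % n_learners <= nu or
                      (j - i) % n_learners <= nu)]
            pseudo_target = y - lam * np.sum(prev[neigh] - y, axis=0)
            for _ in range(epochs_per_step):                # individual epochs
                net.partial_fit(X, pseudo_target)
    # Step 6: uniformly weighted ensemble prediction.
    return lambda X_new: np.mean([net.predict(X_new) for net in nets], axis=0)
```

Setting epochs_per_step = 1 corresponds to the observation above that a single individual epoch can already be enough.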
4 Experimental Results
In this section we present the results of empirical studies analyzing different aspects of the proposed approach and comparing it with Negative Correlation Learning. In all experiments, two real data sets were used, namely Boston and NO2. Detailed descriptions of these data sets can be obtained from [2] and [10] respectively. Neural networks with five sigmoidal hidden units, trained with standard backpropagation, were employed as base learners. For each experiment, the figures show error bars corresponding to Student's t confidence intervals at a significance level of 0.02. Mean values are plotted with the symbol 'o'.
4.1 Effect of the Neighborhood Order
In this experiment we analyze the effect of the neighborhood order on the results obtained with the proposed algorithm. We keep the number of machines in the ensemble constant at M = 20 and the parameter κ at 0.9, while the neighborhood order is varied as ν = 1, 2, ..., 9. Figure 1 summarizes the results on the testing set. Results on the training set are omitted due to space limitations, but the behavior is analogous.
Fig. 1. MSE (vertical axis) versus neighborhood order (horizontal axis) for the Boston data set (left) and the NO2 data set (right).
It can be observed that there is no strong difference between the results obtained with different neighborhood sizes, both in terms of precision and dispersion, and the difference is even smaller on the testing set. Moreover, for both data sets the minimum error occurs at small neighborhood orders (ν = 4 or ν = 2). This result is very attractive because it shows that the desired global behavior can be generated considering only local learning rules. In addition, it should be clear that the smaller the neighborhood, the more efficient the training process.
4.2 Effect of the Parameter κ
In this experiment we analyze the effect of the parameter κ in the proposed algorithm. We keep the number of machines in the ensemble constant at M = 20 and the neighborhood order at ν = 1, while κ is varied uniformly in [0, 1.1] with steps of size 0.1 and additionally in [0.9, 1.02] with steps of size 0.01. Figures 2 and 3 summarize the results for the Boston and NO2 data sets respectively.
Fig. 2. MSE (vertical axis) versus value of the parameter κ (horizontal axis) on the training set (left) and the testing set (right), for the Boston data.
Fig. 3. MSE (vertical axis) versus value of the parameter κ (horizontal axis) on the training set (left) and the testing set (right), for the NO2 data.
This experiment supports the theoretical analysis of the previous section regarding the effect of the parameter κ: both precision and variability tend to improve as κ increases from 0.0 to 1.0. Near the critical value κ = 1 numerical instabilities appear, and above this threshold the learning process diverges.
4.3 Comparing with NC and Different Ensemble Sizes
In this experiment we compare our algorithm with Negative Correlation Learning (NC). Since the key characteristic of the proposed approach is the introduction of local learning rules instead of global ones, we fix the neighborhood order at the smallest value ν = 1. Based on the previous experiment we select the best parameter κ for our algorithm. Note that this "best" parameter is really the best for an ensemble of size M = 20; nevertheless, the number of machines in the ensemble will be varied as M = 3, 4, 5, 6, 7, 8, 9, 20.
Prior to this comparison, we conducted experiments with NC using an ensemble of size M = 20 to determine the best parameter λ, testing values between 0 and 1 with steps of size 0.1. On the Boston data set the best testing result is achieved at λ = 0.5, as reported in [3]. On the NO2 data set, the minimum testing error corresponds to λ = 0.4. Values of λ greater than 0.5 cause divergence of the algorithm on both data sets. Figures 4 and 5 summarize the results. The circle-solid curves correspond to our algorithm and the diamond-dotted curves to Negative Correlation. For better visualization the latter curve was shifted slightly in the horizontal (but not the vertical) direction.
Fig. 4. MSE (vertical axis) versus number of machines (horizontal axis) on the training set (left) and the testing set (right), for the Boston data.
Fig. 5. MSE (vertical axis) versus number of machines (horizontal axis) on the training set (left) and the testing set (right), for the NO2 data.
From this experiment we can conclude that a local approach to generating diversity exhibits results better than or comparable to those of a global approach such as Negative Correlation (NC), both in terms of accuracy and precision. Generalization is somewhat weaker for our algorithm, but it is comparable to typical results reported for these data sets. It is important to remark that the complexity of our algorithm grows proportionally to 2νM in the ensemble size M, because each learner adapts its state based only on the behavior of 2ν fixed neighbors (we used ν = 1). On the other hand, NC has complexity proportional to M^2, because at each training iteration each machine needs to know the whole ensemble performance. This advantage was verified in practice, but the results are omitted due to space limitations.
5 Future Work
In future work we plan to work with more complex neighborhoods, such as bi-dimensional or asymmetric arrangements. We also want to experiment with a distributed version of the algorithm, taking advantage of its parallel nature. Since recent results have shown that ensembles capable of balancing the influence of data points in training can achieve better generalization, we are also interested in designing an algorithm capable of controlling the influence of an example on a machine based on information about the influence of this point on its neighborhood.
References
1. A. Krogh and J. Vedelsby, Neural network ensembles, cross-validation and active learning, Neural Information Processing Systems 7 (1995), 231–238.
2. C.L. Blake and C.J. Merz, UCI repository of machine learning databases, 1998.
3. G. Brown, Diversity in neural network ensembles, Ph.D. thesis, School of Computer Science, University of Birmingham, 2003.
4. S. Fahlman and C. Lebiere, The cascade-correlation learning architecture, Advances in Neural Information Processing Systems, vol. 2, Morgan Kaufmann, San Mateo, 1990, pp. 524–532.
5. G. Brown, J. Wyatt, R. Harris and X. Yao, Diversity creation methods: A survey and categorisation, Information Fusion Journal (Special issue on Diversity in Multiple Classifier Systems) 6 (2004), no. 1, 5–20.
6. L. Kuncheva and C. Whitaker, Measures of diversity in classifier ensembles, Machine Learning 51 (2003), 181–207.
7. N.C. Oza, R. Polikar, J. Kittler and F. Roli (eds.), Multiple Classifier Systems, 6th International Workshop, MCS 2005, Seaside, CA, USA, June 13-15, 2005, Proceedings, Lecture Notes in Computer Science, vol. 3541, Springer, 2005.
8. P. Povalej, P. Kokol, T.W. Druovec and B. Stiglic, Machine-learning with cellular automata, Advances in Intelligent Data Analysis VI, vol. 1, Springer-Verlag, 2005, pp. 305–315.
9. B. Rosen, Ensemble learning using decorrelated neural networks, Connection Science (Special Issue on Combining Artificial Neural Networks: Ensemble Approaches) 8 (1996), no. 3-4, 373–384.
10. P. Vlachos, StatLib datasets archive, 2005.
11. Y. Liu and X. Yao, Ensemble learning via negative correlation, Neural Networks 12 (1999), no. 10, 1399–1404.