A Boosting Algorithm for Regression⋆

A. Bertoni, P. Campadelli, M. Parodi

Dipartimento di Scienze dell'Informazione
Università degli Studi di Milano
via Comelico 39/40, I-20135 Milano (Italy)
Abstract. A new boosting algorithm ADABOOST-R for regression problems is presented and an upper bound on its error is obtained. Experimental results comparing ADABOOST-R with other learning algorithms are given.
1 Introduction
Boosting refers to the general problem of producing a very accurate prediction algorithm by appropriately combining rough, moderately inaccurate ones. It works by repeatedly calling a given "weak" learning algorithm on various distributions over the training set and combining the hypotheses obtained with a linearly separable boolean function.
The boosting algorithm ADABOOST, proposed by Freund and Schapire [1] and presented in Section 2, has been successfully applied to improve the performance of different learning algorithms used for classification problems, both binary and multiclass [2]. The first extension to regression problems f : X → [0,1] is the algorithm ADABOOST.R [1]. In this paper we present ADABOOST-R, a different and more general extension for problems f : X → [0,1]^m. The main theoretical result is an upper bound on the error.
To analyse the performance of ADABOOST-R we have done experiments using backpropagation as the "weak" learning algorithm. Preliminary results show good convergence properties of ADABOOST-R; notably, it is able to lower both the mean and the maximum error on both the training and the test set. For regression problems f : X → [0,1] it works better than ADABOOST.R.
2 Preliminary definitions and results
Given a set X and Y = {0,1}, let P be a probability distribution on X × Y. An N-sample is a sequence ⟨(x_1, y_1), ..., (x_N, y_N)⟩ with (x_k, y_k) ∈ X × Y; we call Samp_N the set of all the N-samples and Samp = ∪_N Samp_N.
Given a class of functions H ⊆ {f | f : X → {0,1}}, a learning algorithm A on H is a function A : Samp → H.
⋆ Supported by grants CT 9305230.ST 74, CT 90.021.56.74.
Let S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩ be an N-sample and let h = A_S be the hypothesis output by A on input S; the empirical error ε̂ of A on S is

  ε̂ = #{(x_k, y_k) | y_k ≠ h(x_k)} / N,

while the generalization error is

  ε_g = P{y ≠ h(x)}.
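As a concrete illustration, a minimal Python sketch of the empirical error, under the assumption that a hypothesis is a callable acting on numpy arrays (the helper name empirical_error is ours):

    import numpy as np

    def empirical_error(h, X, y):
        """#{(x_k, y_k) : y_k != h(x_k)} / N for a binary hypothesis h and labels y in {0, 1}."""
        return float(np.mean(h(X) != y))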
Under weak conditions on H (namely, that the Vapnik-Chervonenkis dimension [3] of H is finite), choosing elements of X × Y randomly and independently according to P, for sufficiently large samples the empirical error is close to the generalization error with high probability [4]. For this reason a good learning algorithm should minimize the empirical error. Often this is a difficult task because of the large amount of computational resources required, and computationally efficient algorithms are usually only moderately accurate.
Boosting is a general method for improving the accuracy of a learning algorithm. In particular we refer to the algorithm ADABOOST presented in [1] and described below. It takes as input a learning algorithm A, an N-sample S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, a distribution D on the elements of S and an integer T, and outputs a final hypothesis h_f = HS(Σ_{k=1}^T w_k h_k − θ₀), a thresholded weighted combination of hypotheses h_k ∈ H (k = 1, ..., T); HS denotes the function HS(x) = 1 if x ≥ 0, 0 otherwise. Even when A is moderately inaccurate, for a sufficiently large T the error made by the final hypothesis h_f on the sample S can be made close to 0. Moreover, the generalization ability of h_f is good, since the Vapnik-Chervonenkis dimension of the family of final hypotheses does not grow too much [1].
Algorithm ADABOOST
Input: an N-sample S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, a distribution D on S, a learning algorithm A, an integer T.
Initialize the weight vector: w_i^(1) = D(i) for i = 1, ..., N.
Do for t = 1, 2, ..., T:
1. Set p_i^(t) = w_i^(t) / Σ_{j=1}^N w_j^(t).
2. Choose randomly with distribution p^(t) the sample S^(t) from S; call the learning algorithm A and get the hypothesis h_t = A_{S^(t)}.
3. Calculate the error ε_t = Σ_{i=1}^N p_i^(t) |h_t(x_i) − y_i|.
4. Calculate β_t = ε_t / (1 − ε_t).
5. Set the new weight vector to be w_i^(t+1) = w_i^(t) β_t^{1 − |h_t(x_i) − y_i|}.
Output the hypothesis h_f(x) = HS( Σ_{t=1}^T (log 1/β_t) h_t(x) − 1/2 Σ_{t=1}^T (log 1/β_t) ).
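The following Python sketch illustrates the loop above. It is not the authors' implementation; the resampling interface (a weak_learner function that returns the trained hypothesis as a callable) is our assumption.

    import numpy as np

    def adaboost(X, y, D, weak_learner, T, rng=None):
        """Sketch of ADABOOST for binary labels y in {0, 1}.

        weak_learner(X_sub, y_sub) is assumed to return a callable h with
        h(X) giving predictions in {0, 1}."""
        rng = rng if rng is not None else np.random.default_rng()
        N = len(y)
        w = np.asarray(D, dtype=float).copy()        # w_i^(1) = D(i)
        hypotheses, betas = [], []
        for t in range(T):
            p = w / w.sum()                          # step 1: normalise the weights
            idx = rng.choice(N, size=N, p=p)         # step 2: draw S^(t) according to p^(t)
            h = weak_learner(X[idx], y[idx])
            loss = np.abs(h(X) - y)                  # |h_t(x_i) - y_i|
            eps = max(float(np.sum(p * loss)), 1e-12)  # step 3: weighted error eps_t (guarded)
            beta = eps / (1.0 - eps)                 # step 4: beta_t
            w = w * beta ** (1.0 - loss)             # step 5: weight update
            hypotheses.append(h)
            betas.append(beta)
        alpha = np.log(1.0 / np.array(betas))        # voting weights log(1/beta_t)
        def h_f(Xq):
            # final hypothesis: HS(sum_t alpha_t h_t(x) - 1/2 sum_t alpha_t)
            vote = sum(a * h(Xq) for a, h in zip(alpha, hypotheses))
            return (vote >= 0.5 * alpha.sum()).astype(int)
        return h_f

Note that the update in step 5 leaves the weight of a misclassified point unchanged and multiplies the weight of a correctly classified point by β_t, which is smaller than 1 whenever ε_t < 1/2, so later rounds concentrate on the points the previous hypotheses got wrong.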
An upper bound on the error ε = Σ_{k=1}^N D(k) |h_f(x_k) − y_k| is given by the following

Theorem 1 (Freund-Schapire). Suppose the learning algorithm A, when called by ADABOOST, generates hypotheses with errors ε_1, ..., ε_T. Then the error ε = Σ_{k=1}^N D(k) |h_f(x_k) − y_k| of the final hypothesis h_f output by ADABOOST is bounded by

  ε ≤ 2^T ∏_{t=1}^T √(ε_t (1 − ε_t)).
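For instance, if every weak hypothesis achieved error ε_t = 0.2, each factor of the product would equal 2√(0.2 · 0.8) = 0.8, so the bound on ε would decay geometrically as 0.8^T and fall below 1% after T = 21 rounds.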
ADABOOST can be applied to classification problems with 2 classes. It has been generalized to multiclass problems (Y = {1, ..., K}) and to regression problems (Y = [0,1]) [1]. Roughly speaking, the main idea of the algorithm ADABOOST.R, designed by Freund and Schapire for regression problems, is to transform the regression problem into a classification one using the total order relation on the real numbers. For example, every hypothesis h : X → [0,1] is transformed into the boolean function b_h : X × [0,1] → {0,1} with

  b_h(x, y) = 1 if y ≤ h(x), 0 otherwise.
In the next section we present a different way of transforming a regression problem, for functions with values in [0,1]^m, into a classification problem. It develops an idea presented in [5] and is based on the notion of norm in R^m.
3 The algorithm ADABOOST-R
In this section we describe the boosting algorithm ADABOOST-R for regression problems. In this setting Y is [0,1]^m instead of {0,1}; as before, a sample S, chosen at random according to a probability distribution P on X × Y, is given to a learning algorithm A, which outputs a hypothesis h : X → [0,1]^m.
Given a norm ‖·‖ on R^m and a fixed θ > 0, we say that x and x̃ "θ-agree" if ‖x − x̃‖ ≤ θ. We consider as generalization error ε_g of a hypothesis h the probability that y and h(x) do not "θ-agree", that is ε_g = ∫ HS(‖h(x) − y‖ − θ) dP. Analogously, the empirical error on a sample S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩ is

  ε̂ = (1/N) Σ_{k=1}^N HS(‖h(x_k) − y_k‖ − θ).
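For instance, with hypotheses represented as callables on numpy arrays and the Euclidean norm chosen as ‖·‖ (an assumption of ours; the definition above allows any norm on R^m), the empirical θ-error can be sketched as:

    import numpy as np

    def theta_error(h, X, Y, theta):
        """Fraction of sample points (x_k, y_k) whose prediction and target do not
        theta-agree, i.e. HS(||h(x_k) - y_k|| - theta) = 1."""
        dist = np.linalg.norm(h(X) - Y, axis=1)   # here: Euclidean norm on R^m
        return float(np.mean(dist >= theta))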
Algorithm ADABOOST-R
Input: an N-sample S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, a distribution D on S, a learning algorithm A, an integer T, a real number θ.
Initialize the weight vector: w_i^(1) = D(i) for i = 1, ..., N.
Do for t = 1, 2, ..., T:
1. Set p_i^(t) = w_i^(t) / Σ_{j=1}^N w_j^(t).
2. Choose randomly with distribution p^(t) the sample S^(t) from S; call the learning algorithm A and get the hypothesis h_t = A_{S^(t)}.
3. Calculate the error ε_t = Σ_{i=1}^N p_i^(t) HS(‖h_t(x_i) − y_i‖ − θ); if ε_t > 1/2 then set T = t − 1 and abort loop.
4. Calculate β_t = ε_t / (1 − ε_t).
5. Set the new weight vector to be w_i^(t+1) = w_i^(t) β_t^{1 − HS(‖h_t(x_i) − y_i‖ − θ)}.
Output the hypothesis h_f(x) = argmax_{y ∈ [0,1]^m} Σ_t α_t HS(θ − ‖h_t(x) − y‖), where α_t = log 1/β_t.
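A minimal Python sketch of this loop is given below. It is not the authors' implementation: the weak-learner interface, the Euclidean norm, and in particular the restriction of the final argmax to the candidate set {h_t(x)} are our simplifying assumptions (the argmax in the output step ranges over all of [0,1]^m).

    import numpy as np

    def adaboost_r(X, Y, D, weak_learner, T, theta, rng=None):
        """Sketch of ADABOOST-R for targets Y of shape (N, m) with values in [0, 1]^m.

        weak_learner(X_sub, Y_sub) is assumed to return a callable h with
        h(X) of shape (N, m)."""
        rng = rng if rng is not None else np.random.default_rng()
        N = len(Y)
        w = np.asarray(D, dtype=float).copy()            # w_i^(1) = D(i)
        hypotheses, alphas = [], []
        for t in range(T):
            p = w / w.sum()                              # step 1
            idx = rng.choice(N, size=N, p=p)             # step 2: draw S^(t) according to p^(t)
            h = weak_learner(X[idx], Y[idx])
            miss = (np.linalg.norm(h(X) - Y, axis=1) >= theta).astype(float)  # HS(||h_t(x_i) - y_i|| - theta)
            eps = float(np.sum(p * miss))                # step 3
            if eps > 0.5:                                # abort condition
                break
            beta = max(eps, 1e-12) / (1.0 - eps)         # step 4 (guarded against eps = 0)
            w = w * beta ** (1.0 - miss)                 # step 5
            hypotheses.append(h)
            alphas.append(np.log(1.0 / beta))
        def h_f(x):
            # Final hypothesis: argmax over y of sum_t alpha_t HS(theta - ||h_t(x) - y||).
            # The argmax is approximated over the candidate set {h_t(x)} (our choice).
            preds = np.array([h(np.atleast_2d(x))[0] for h in hypotheses])
            scores = [sum(a for a, q in zip(alphas, preds)
                          if np.linalg.norm(q - c) <= theta) for c in preds]
            return preds[int(np.argmax(scores))]
        return h_f

Restricting the argmax to the weak hypotheses' own predictions is only a heuristic: the true maximiser may lie in the intersection of several θ-balls around those predictions rather than at one of their centres.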
An upper bound on the error ε_{2θ} = Σ_{k=1}^N D(k) HS(‖h_f(x_k) − y_k‖ − 2θ) is given by the following

Theorem 2. Suppose the learning algorithm A, when called by ADABOOST-R, generates hypotheses h_t with errors ε_1, ..., ε_T < 1/2, and let h_f be the final hypothesis output by ADABOOST-R. Then

  Σ_{k : ‖h_f(x_k) − y_k‖ > 2θ} D(k) ≤ 2^T ∏_{t=1}^T √(ε_t (1 − ε_t)).
Proof. (Outline) We transform the regression problem X → [0,1]^m into a classification problem X × [0,1]^m → {0,1}:
– the sample S = ⟨(x_i, y_i) | i = 1, ..., N⟩ is transformed into the sample Ŝ = ⟨((x_i, y_i), 0) | i = 1, ..., N⟩;
– the distribution D(i) is transformed into D̂(i) = D(i) on the sample Ŝ;
– the algorithm A with input S is transformed into the algorithm Â with input Ŝ by the rule Â_Ŝ(x_i, y_i) = HS(‖A_S(x_i) − y_i‖ − θ).
Let w^(t) and ε^(t) be respectively the weight vector and the error at step t of the algorithm ADABOOST-R, and let ŵ^(t) and ε̂^(t) be respectively the weight vector and the error at step t of the algorithm ADABOOST applied to the associated classification problem. By induction it can be proved that

  ŵ^(t) = w^(t),  ε̂^(t) = ε^(t)  (1 ≤ t ≤ T).
Let ĥ_f(x, y) be the final hypothesis given by ADABOOST and h_f(x) the final hypothesis given by ADABOOST-R. If ‖h_f(x_i) − y_i‖ ≥ 2θ then ĥ_f(x_i, y_i) = 1. Let us suppose, on the contrary, that ĥ_f(x_i, y_i) = 0; then

  Σ_t α_t HS(‖h_t(x_i) − y_i‖ − θ) < 1/2 Σ_t α_t.   (1)

Let I = {t | ‖h_t(x_i) − y_i‖ < θ}; the inequality (1) becomes Σ_{t∉I} α_t < 1/2 Σ_t α_t, and the following relation holds:

  Σ_{t∈I} α_t > Σ_{t∉I} α_t.   (2)

Since h_f = argmax_{y ∈ [0,1]^m} Σ_t α_t HS(θ − ‖h_t(x_i) − y‖), then

  Σ_t α_t HS(θ − ‖h_t(x_i) − h_f(x_i)‖) ≥ Σ_t α_t HS(θ − ‖h_t(x_i) − y_i‖).   (3)

From (2) and (3) it follows that

  Σ_{t∈I} α_t HS(θ − ‖h_t(x_i) − h_f(x_i)‖) > Σ_{t∉I} α_t (1 − HS(θ − ‖h_t(x_i) − h_f(x_i)‖)) ≥ 0.   (4)

From the inequality (4) one can assert that there exists t̃ ∈ I such that HS(θ − ‖h_{t̃}(x_i) − h_f(x_i)‖) = 1; for such t̃ it holds that ‖h_f(x_i) − h_{t̃}(x_i)‖ ≤ θ and ‖h_{t̃}(x_i) − y_i‖ < θ. Hence by the triangle inequality we obtain ‖h_f(x_i) − y_i‖ < 2θ, which contradicts the hypothesis.
Since ‖h_f(x_i) − y_i‖ > 2θ implies ĥ_f(x_i, y_i) = 1, using Theorem 1 we conclude

  Σ_{i : ‖h_f(x_i) − y_i‖ > 2θ} D(i) ≤ Σ_{i : ĥ_f(x_i, y_i) = 1} D(i) ≤ 2^T ∏_{t=1}^T √(ε_t (1 − ε_t)).  □
4 Experimental results
In this section we present some preliminary results to evaluate the performance of ADABOOST-R in terms of learning accuracy and computational efficiency. The experiments have been done as follows:
– the functions to be learned are functions f : R → R or g : R² → R²;
– the "weak" algorithm A given as input to ADABOOST-R is backpropagation on neural networks of fixed architecture;
– the parameter θ has been set to 1.5·ε̂, where ε̂ is the error on the training set made by backpropagation, computed in a preliminary run (a sketch of this setting is given after this list).
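Under the assumption that the noisy samples of g are stored in arrays X_train and Y_train, and with sklearn's MLPRegressor standing in for the backpropagation networks actually used (a substitution of ours), the setting above could be sketched as follows, reusing the adaboost_r sketch of Section 3:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def make_weak_learner(epochs, hidden=(10, 10)):
        """Backpropagation weak learner with a fixed architecture (here 2-10-10-2)."""
        def weak_learner(X_sub, Y_sub):
            net = MLPRegressor(hidden_layer_sizes=hidden, max_iter=epochs)
            net.fit(X_sub, Y_sub)
            return lambda X: np.clip(net.predict(X), 0.0, 1.0)   # keep outputs inside [0, 1]^m
        return weak_learner

    # X_train, Y_train: assumed arrays holding the noisy samples of g : R^2 -> R^2.
    weak = make_weak_learner(epochs=1000)
    h0 = weak(X_train, Y_train)                                   # preliminary backpropagation run
    eps_hat = np.mean(np.linalg.norm(h0(X_train) - Y_train, axis=1))
    theta = 1.5 * eps_hat                                         # theta = 1.5 * preliminary error
    D = np.full(len(Y_train), 1.0 / len(Y_train))                 # uniform initial distribution
    h_f = adaboost_r(X_train, Y_train, D, weak, T=5, theta=theta)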
The experiments show that ADABOOST-R exhibits better convergence properties than backpropagation: after the same number of epochs the accuracy on both the training and the test sets is higher. A qualitative example of this behaviour is shown in Figure 1. Figures 1a and 1b show respectively the interpolation of the function f(x) = (sin(10πx) + 2)/4 + sin(50π(x + 0.5)²)/15 + 0.1·N(0, 0.1) made by backpropagation and by ADABOOST-R.
Quantitative results for a function g : R² → R², described by the expression g(x, y) = sin(6π/(5(x + 0.5))) · sin(6π/(5(y + 0.5)))/3 + 0.5 with Gaussian noise 0.2·N(0, 0.2) added, are given in Table 1 (left). Two different network architectures (a denotes the architecture 2-10-10-2, b denotes the architecture 2-15-15-2) have been trained for different numbers of epochs.
For functions f : R → R we have also compared the performance of ADABOOST-R with the algorithm ADABOOST.R. Table 1 (right) shows the results relative to the function f(x) = sin(π(0.031·x))/5 + 0.2·N(0, 0.2).
[Figure 1 shows two panels over the same axes: (a) the data and the back-propagation fit; (b) the data and the ADABOOST-R (boosting) fit.]
Fig. 1. Comparison between back-propagation and ADABOOST-R.
str  data  epochs  T  mean err.  max err.
a    tr     1000   5   0.0428     0.1607
a    test   1000   5   0.0433     0.1545
a    tr     5000   1   0.0471     0.1739
a    test   5000   1   0.0475     0.1664
b    tr     1500   4   0.0417     0.1687
b    test   1500   4   0.0425     0.2101
b    tr     6000   1   0.0434     0.2054
b    test   6000   1   0.0440     0.2404

alg.        data  epochs  T  mean err.  max err.
F.S.        tr     1000   6   0.021      0.111
F.S.        test   1000   6   0.020      0.128
ADABOOST-R  tr     1000   5   0.017      0.058
ADABOOST-R  test   1000   5   0.018      0.073
b.pr.       tr     6000   1   0.031      0.204
b.pr.       test   6000   1   0.036      0.228

Table 1. Left: results of ADABOOST-R on the function g for the two network architectures (a = 2-10-10-2, b = 2-15-15-2). Right: comparison on the function f between ADABOOST.R by Freund and Schapire (F.S.), ADABOOST-R, and plain backpropagation (b.pr.).
References
[1] Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Internal Report, AT&T, September 1995.
[2] Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference (1996) 148-156.
[3] Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16(2) (1971) 264-280.
[4] Vapnik, V.N.: Estimation of Dependences Based on Empirical Data. Springer-Verlag (1982).
[5] Freund, Y.: Boosting a weak learning algorithm by majority. Information and Computation 121(2) (1995) 256-285.