
ENSEMBLE LEARNING: ADABOOST
Jianping Fan
Dept of Computer Science
UNC-Charlotte
ENSEMBLE LEARNING
A machine learning paradigm in which multiple learners are combined to solve a problem.
Previously: one problem, one learner.
Ensemble: one problem, several learners whose outputs are combined.
The generalization ability of the ensemble is usually significantly better than that of an individual learner.
Boosting is one of the most important families of ensemble methods.
A BRIEF HISTORY
Resampling for estimating a statistic:
• Bootstrapping
Resampling for classifier design:
• Bagging
• Boosting (Schapire 1989)
• AdaBoost (Schapire 1995)
BOOTSTRAP ESTIMATION
• Repeatedly draw n samples from D (with replacement)
• For each set of samples, estimate a statistic
• The bootstrap estimate is the mean of the individual estimates
• Used to estimate a statistic (parameter) and its variance
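A minimal Python sketch of this procedure (not from the slides; the function name and arguments are illustrative):

import numpy as np

def bootstrap_estimate(data, statistic, B=1000, seed=None):
    # Repeatedly draw n samples from the data with replacement,
    # compute the statistic on each resample, and average the results.
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([statistic(rng.choice(data, size=n, replace=True))
                          for _ in range(B)])
    # Bootstrap estimate of the statistic and of its variance.
    return estimates.mean(), estimates.var(ddof=1)

# Example: bootstrap estimate of the mean of a small sample.
# print(bootstrap_estimate(np.array([2.1, 3.5, 4.0, 5.2, 6.3]), np.mean))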
BAGGING - AGGREGATE BOOTSTRAPPING
• For i = 1 .. M
• Draw n* < n samples from D with replacement
• Learn classifier Ci
• Final classifier is a vote of C1 .. CM
• Increases classifier stability / reduces variance
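A corresponding Python sketch (illustrative only; `train` stands for any base-learner training routine, and labels are assumed to be in {-1, +1}):

import numpy as np

def bagging(X, y, train, M, n_sub, seed=None):
    # Train M classifiers, each on a bootstrap sample of n_sub points.
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(M):
        idx = rng.choice(n, size=n_sub, replace=True)    # draw with replacement
        models.append(train(X[idx], y[idx]))             # learn classifier C_i
    def predict(X_query):
        votes = sum(model(X_query) for model in models)  # vote of C_1 .. C_M
        return np.where(votes >= 0, 1, -1)
    return predict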
BAGGING
[Diagram: each learner is trained independently, producing f1, f2, ..., fT, which are combined into f]
BOOSTING
[Diagram: f1 is trained on the original training sample; each later learner f2, ..., fT is trained on a re-weighted sample, and all are combined into f]
REVISIT BAGGING
BOOSTING CLASSIFIER
BAGGING VS BOOSTING
• Bagging: the construction of complementary base-learners is left to chance and to the instability of the learning method.
• Boosting: actively seeks to generate complementary base-learners, training the next base-learner on the mistakes of the previous learners.
BOOSTING (SCHAPIRE 1989)
• Randomly select n1 < n samples from D without replacement to obtain D1
• Train weak learner C1
• Select n2 < n samples from D with half of the samples misclassified by C1 to obtain D2
• Train weak learner C2
• Select all samples from D on which C1 and C2 disagree
• Train weak learner C3
• Final classifier is a vote of the weak learners
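A rough Python sketch of this three-classifier scheme (my reading of the slide, with illustrative names; `train_weak(X, y)` returns a prediction function, labels are in {-1, +1}, and edge cases such as an empty disagreement set are ignored):

import numpy as np

def schapire_boost(X, y, train_weak, n1, n2, seed=None):
    rng = np.random.default_rng(seed)
    n = len(y)
    # 1) Train C1 on n1 < n points drawn without replacement.
    idx1 = rng.choice(n, size=n1, replace=False)
    c1 = train_weak(X[idx1], y[idx1])
    # 2) Build D2 so that half of its points are misclassified by C1.
    wrong = np.flatnonzero(c1(X) != y)
    right = np.flatnonzero(c1(X) == y)
    half = n2 // 2
    idx2 = np.concatenate([
        rng.choice(wrong, size=min(half, len(wrong)), replace=False),
        rng.choice(right, size=min(n2 - half, len(right)), replace=False)])
    c2 = train_weak(X[idx2], y[idx2])
    # 3) Train C3 on all points where C1 and C2 disagree.
    idx3 = np.flatnonzero(c1(X) != c2(X))
    c3 = train_weak(X[idx3], y[idx3])
    # Final classifier: majority vote of the three weak learners.
    def predict(X_query):
        votes = c1(X_query) + c2(X_query) + c3(X_query)
        return np.where(votes >= 0, 1, -1)
    return predict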
ADABOOST (SCHAPIRE 1995)
• Instead of sampling, re-weight
• The previous weak learner has only 50% accuracy over the new distribution
• Can be used to learn weak classifiers
• Final classification based on a weighted vote of the weak classifiers
ADABOOST TERMS
• Learner = Hypothesis = Classifier
• Weak Learner: < 50% error over any distribution
• Strong Classifier: thresholded linear combination of weak learner outputs
AdaBoost
Adaptive Boosting
A learning algorithm
Building a strong classifier from a lot of weaker ones
ADABOOST CONCEPT
Weak classifiers (each slightly better than random):
$h_1(x) \in \{-1,+1\},\; h_2(x) \in \{-1,+1\},\; \dots,\; h_T(x) \in \{-1,+1\}$
Strong classifier:
$H_T(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
WEAKER CLASSIFIERS
Weak classifiers (each slightly better than random):
$h_1(x) \in \{-1,+1\},\; h_2(x) \in \{-1,+1\},\; \dots,\; h_T(x) \in \{-1,+1\}$
Each weak classifier learns by considering one simple feature.
The T most beneficial features for classification should be selected.
How to
– define features?
– select beneficial features?
– train weak classifiers?
– manage (weight) training samples?
– associate a weight with each weak classifier?
Strong classifier:
$H_T(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
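One common concrete choice for such single-feature weak classifiers is a decision stump: threshold one feature and output ±1. A Python sketch (not from the slides; the name `fit_stump` and the brute-force search are illustrative):

import numpy as np

def fit_stump(X, y, D):
    # Weighted decision stump: pick the (feature, threshold, polarity)
    # with the smallest weighted error under the distribution D.
    n, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):                         # one simple feature at a time
        for thr in np.unique(X[:, j]):
            for s in (1, -1):                  # polarity of the threshold test
                pred = np.where(s * (X[:, j] - thr) >= 0, 1, -1)
                err = D[pred != y].sum()       # error w.r.t. the sample weights
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    j, thr, s = best
    def h(X_query):
        return np.where(s * (X_query[:, j] - thr) >= 0, 1, -1)
    return h, best_err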
THE STRONG CLASSIFIERS
Weak classifiers (each slightly better than random):
$h_1(x) \in \{-1,+1\},\; h_2(x) \in \{-1,+1\},\; \dots,\; h_T(x) \in \{-1,+1\}$
Strong classifier:
$H_T(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
THE ADABOOST ALGORITHM
Given: $(x_1, y_1), \dots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$
$D_t(i)$: probability distribution over the $x_i$'s at time $t$
Initialization: $D_1(i) = \frac{1}{m},\; i = 1, \dots, m$
For $t = 1, \dots, T$:
• Find the classifier $h_t : X \to \{-1,+1\}$ which minimizes the error w.r.t. $D_t$, i.e.,
$h_t = \arg\min_{h_j} \varepsilon_j$ where $\varepsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \ne h_j(x_i)]$   (minimize weighted error)
• Weight the classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$   (chosen to minimize the exponential loss)
• Update the distribution: $D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor
This gives misclassified patterns more chance to be learned.
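The loop above, written out as a short Python sketch (illustrative, not the original author's code; `fit_weak(X, y, D)` is any routine returning a weak classifier and its weighted error, e.g. the stump sketch earlier):

import numpy as np

def adaboost(X, y, T, fit_weak):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    classifiers, alphas = [], []
    for t in range(T):
        h, eps = fit_weak(X, y, D)                # h_t minimizes error w.r.t. D_t
        eps = min(max(eps, 1e-12), 1 - 1e-12)     # guard against eps = 0 or 1
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * h(X))         # increase weight of misclassified points
        D = D / D.sum()                           # divide by Z_t (normalization)
        classifiers.append(h)
        alphas.append(alpha)
    def H(X_query):                               # strong classifier: weighted vote
        score = sum(a * h(X_query) for a, h in zip(alphas, classifiers))
        return np.sign(score)
    return H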
THE ADABOOST ALGORITHM
Given: $(x_1, y_1), \dots, (x_m, y_m)$ where $x_i \in X$, $y_i \in \{-1, +1\}$
Initialization: $D_1(i) = \frac{1}{m},\; i = 1, \dots, m$
For $t = 1, \dots, T$:
• Find the classifier $h_t : X \to \{-1,+1\}$ which minimizes the error w.r.t. $D_t$, i.e.,
$h_t = \arg\min_{h_j} \varepsilon_j$ where $\varepsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \ne h_j(x_i)]$
• Weight the classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$
• Update the distribution: $D_{t+1}(i) = \frac{D_t(i)\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor
Output the final classifier: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
BOOSTING ILLUSTRATION
[Figure: weak classifier 1 is trained; the weights of the examples it misclassifies are increased]
THE ADABOOST ALGORITHM
The weights of incorrectly classified examples are increased, so that the base learner is forced to focus on the hard examples in the training set.
BOOSTING ILLUSTRATION
[Figure: weak classifier 2 is trained; weights are increased again; weak classifier 3 is trained; the final classifier is a combination of the weak classifiers]
What goal does AdaBoost want to reach?
THE ADABOOST ALGORITHM
[The algorithm from the previous slide, repeated: find $h_t$, weight it with $\alpha_t$, update $D_{t+1}$, output $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$]
What goal does AdaBoost want to reach?
THE ADABOOST ALGORITHM
They are goal dependent.
[The algorithm from the previous slide, repeated]
GOAL
Final classifier: $\operatorname{sign}(H(x))$, where $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
Minimize the exponential loss:
$\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$
GOAL
Final classifier: $\operatorname{sign}(H(x))$, where $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
Minimize the exponential loss:
$\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$
Maximize the margin $yH(x)$
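A standard one-line justification, not spelled out on the slide: the exponential loss upper-bounds the 0/1 training error, so driving it down both reduces the error and pushes the margins $yH(x)$ to be large and positive.
\[
\mathbf{1}\bigl[y \ne \operatorname{sign}(H(x))\bigr] \;\le\; e^{-yH(x)}
\quad\Longrightarrow\quad
\Pr_{x,y}\bigl[y \ne \operatorname{sign}(H(x))\bigr] \;\le\; E_{x,y}\bigl[e^{-yH(x)}\bigr] = \mathrm{loss}_{\exp}(H).
\]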
GOAL
Minimize $\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$
Final classifier: $\operatorname{sign}(H(x))$, $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
Define $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$ with $H_0(x) = 0$. Then $H(x) = H_T(x)$.
$E_{x,y}\left[e^{-yH_t(x)}\right] = E_x\left[E_y\left[e^{-yH_t(x)} \mid x\right]\right]$
$= E_x\left[E_y\left[e^{-y[H_{t-1}(x)+\alpha_t h_t(x)]} \mid x\right]\right]$
$= E_x\left[E_y\left[e^{-yH_{t-1}(x)}\, e^{-y\alpha_t h_t(x)} \mid x\right]\right]$
$= E_x\left[e^{-yH_{t-1}(x)}\left(e^{-\alpha_t}P(y = h_t(x)) + e^{\alpha_t}P(y \ne h_t(x))\right)\right]$
$\alpha_t = ?$
Minimize $\mathrm{loss}_{\exp}(H(x)) = E_{x,y}\left[e^{-yH(x)}\right]$, with $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$, $H_0(x) = 0$, $H(x) = H_T(x)$.
$E_{x,y}\left[e^{-yH_t(x)}\right] = E_x\left[e^{-yH_{t-1}(x)}\left(e^{-\alpha_t}P(y = h_t(x)) + e^{\alpha_t}P(y \ne h_t(x))\right)\right]$
Set $\dfrac{\partial\, E_{x,y}\left[e^{-yH_t(x)}\right]}{\partial \alpha_t} = 0$:
$E_x\left[e^{-yH_{t-1}(x)}\left(-e^{-\alpha_t}P(y = h_t(x)) + e^{\alpha_t}P(y \ne h_t(x))\right)\right] = 0$
$\alpha_t = ?$ (continued)
From $E_x\left[e^{-yH_{t-1}(x)}\left(-e^{-\alpha_t}P(y = h_t(x)) + e^{\alpha_t}P(y \ne h_t(x))\right)\right] = 0$:
$\alpha_t = \frac{1}{2}\ln\frac{P(y = h_t(x))}{P(y \ne h_t(x))} = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$
where, with $P(x_i, y_i) \equiv D_t(i)$,
$\varepsilon_t = P(\text{error}) = \sum_{i=1}^{m} D_t(i)\,[y_i \ne h_t(x_i)]$
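As a quick numeric check of the formula (numbers chosen only for illustration):
\[
\varepsilon_t = 0.3 \;\Rightarrow\; \alpha_t = \tfrac{1}{2}\ln\tfrac{0.7}{0.3} \approx 0.42,
\qquad
\varepsilon_t = 0.5 \;\Rightarrow\; \alpha_t = 0,
\qquad
\varepsilon_t \to 0 \;\Rightarrow\; \alpha_t \to \infty,
\]
so better weak learners receive larger votes, and a learner no better than chance receives none.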
THE ADABOOST ALGORITHM (ANNOTATED)
[The algorithm repeated from above, with the classifier-weight step highlighted]
Minimizing $\mathrm{loss}_{\exp}(H(x)) = E_{x \sim D,\, y}\left[e^{-yH(x)}\right]$ gives the weight
$\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t} = \frac{1}{2}\ln\frac{P(y = h_t(x))}{P(y \ne h_t(x))}$,
with $P(x_i, y_i) \equiv D_t(i)$ and $\varepsilon_t = P(\text{error}) = \sum_{i=1}^{m} D_t(i)\,[y_i \ne h_t(x_i)]$.
THE ADABOOST ALGORITHM (ANNOTATED)
[The algorithm repeated from above, with the distribution-update step highlighted]
$D_{t+1} = ?$ The next question is which distribution $D_{t+1}$ the next weak learner should be trained on, again obtained by minimizing $\mathrm{loss}_{\exp}(H(x)) = E_{x \sim D,\, y}\left[e^{-yH(x)}\right]$.
$D_{t+1} = ?$
Minimize $\mathrm{loss}_{\exp}(H(x)) = E_{x \sim D,\, y}\left[e^{-yH(x)}\right]$, with $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$, $H_0(x) = 0$, $H(x) = H_T(x)$.
$E_{x,y}\left[e^{-yH_t}\right] = E_{x,y}\left[e^{-yH_{t-1}}\, e^{-y\alpha_t h_t}\right] \approx E_{x,y}\left[e^{-yH_{t-1}}\left(1 - y\alpha_t h_t + \tfrac{1}{2}\alpha_t^2 y^2 h_t^2\right)\right]$
$\Rightarrow h_t = \arg\min_h E_{x,y}\left[e^{-yH_{t-1}}\left(1 - y\alpha_t h + \tfrac{1}{2}\alpha_t^2 y^2 h^2\right)\right]$
$= \arg\min_h E_{x,y}\left[e^{-yH_{t-1}}\left(1 - y\alpha_t h + \tfrac{1}{2}\alpha_t^2\right)\right]$   (using $y^2 h^2 = 1$)
$= \arg\min_h E_x\left[E_y\left[e^{-yH_{t-1}}\left(1 - y\alpha_t h + \tfrac{1}{2}\alpha_t^2\right) \mid x\right]\right]$
$D_{t+1} = ?$ (continued)
$h_t = \arg\min_h E_x\left[E_y\left[e^{-yH_{t-1}}\left(1 - y\alpha_t h + \tfrac{1}{2}\alpha_t^2\right) \mid x\right]\right]$
$= \arg\min_h E_x\left[E_y\left[e^{-yH_{t-1}}\left(-y\alpha_t h\right) \mid x\right]\right]$   (dropping terms that do not depend on $h$)
$= \arg\max_h E_x\left[E_y\left[e^{-yH_{t-1}}\, y\, h \mid x\right]\right]$
$= \arg\max_h E_x\left[h(x)\, e^{-H_{t-1}(x)}\, P(y=1 \mid x) + (-1)\, h(x)\, e^{H_{t-1}(x)}\, P(y=-1 \mid x)\right]$
$D_{t+1} = ?$ (continued)
$h_t = \arg\max_h E_x\left[h(x)\, e^{-H_{t-1}(x)}\, P(y=1 \mid x) + (-1)\, h(x)\, e^{H_{t-1}(x)}\, P(y=-1 \mid x)\right]$
$= \arg\max_h E_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}\left[y\, h(x)\right]$
This is maximized when, for each $x$,
$h_t(x) = \operatorname{sign}\left(E_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}\left[y \mid x\right]\right)$
$= \operatorname{sign}\left(P_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}(y = 1 \mid x) - P_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}(y = -1 \mid x)\right)$
$D_{t+1} = ?$ (continued)
At time $t$, the weak learner is therefore trained on examples drawn as $x, y \sim e^{-yH_{t-1}(x)}\, P(y \mid x)$, and
$h_t(x) = \operatorname{sign}\left(P_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}(y = 1 \mid x) - P_{x,\, y \sim e^{-yH_{t-1}(x)} P(y \mid x)}(y = -1 \mid x)\right)$
THE ADABOOST ALGORITHM (ANNOTATED)
[The algorithm repeated from above, with the sample-weighting interpretation of each step]
At time 1: $x, y \sim P(y \mid x)$; since $e^{-yH_0(x)} = 1$, this gives $D_1(i) = \frac{1}{Z_1} = \frac{1}{m}$.
At time $t$: $x, y \sim e^{-yH_{t-1}(x)}\, P(y \mid x)$.
At time $t+1$: $x, y \sim e^{-yH_t(x)}\, P(y \mid x) = e^{-yH_{t-1}(x)}\, e^{-\alpha_t y h_t(x)}\, P(y \mid x)$,
so $D_{t+1}(i) = \dfrac{D_t(i)\,\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor.
PROS AND CONS OF ADABOOST
Advantages
• Very simple to implement
• Does feature selection, resulting in a relatively simple classifier
• Fairly good generalization
Disadvantages
• Suboptimal solution
• Sensitive to noisy data and outliers
INTUITION
• Train a set of weak hypotheses: h1, …, hT.
• The combined hypothesis H is a weighted majority vote of the T weak hypotheses.
• Each hypothesis ht has a weight αt.
• During training, focus on the examples that are misclassified.
• At round t, example xi has the weight Dt(i).
BASIC SETTING
• Binary classification problem
• Training data: $(x_1, y_1), \dots, (x_m, y_m)$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
• Dt(i): the weight of xi at round t; D1(i) = 1/m.
• A learner L that finds a weak hypothesis ht: X → Y given the training set and Dt
• The error of a weak hypothesis ht: $\varepsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \ne y_i\right] = \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)$
THE BASIC ADABOOST ALGORITHM
For t = 1, …, T
• Train the weak learner using the training data and Dt
• Get ht: X → {-1, +1} with error $\varepsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)$
• Choose $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$
• Update
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases} \;=\; \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$
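A tiny numeric check of one round of this update in Python (toy labels and predictions, chosen only for illustration):

import numpy as np

y  = np.array([ 1,  1, -1, -1,  1])        # true labels
h1 = np.array([ 1,  1, -1,  1,  1])        # weak hypothesis: example 4 is wrong
D  = np.full(5, 0.2)                       # D_1(i) = 1/m

eps   = D[h1 != y].sum()                   # weighted error = 0.2
alpha = 0.5 * np.log((1 - eps) / eps)      # = 0.5 * ln 4, approx. 0.693
D = D * np.exp(-alpha * y * h1)            # shrink correct, grow incorrect weights
D = D / D.sum()                            # normalize by Z_1
print(D)                                   # [0.125 0.125 0.125 0.5   0.125]

After the update the misclassified example carries half of the total weight, which matches the earlier remark that the previous weak learner has only 50% accuracy over the new distribution.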
THE GENERAL ADABOOST ALGORITHM