
Machine learning: Boosting
1. Basic definition:
Assume again we have a classification task (e.g. cancer
classification) with data
$D = \{(\mathbf{x}_i, y_i)\}_i$
and $y_i = \pm 1$.
Boosting
Assume we have a classifier $p(\mathbf{x})$ which takes a feature
vector $\mathbf{x}$ and classifies
$$y = \begin{cases} 1 & \text{if } p(\mathbf{x}) \ge 0 \\ -1 & \text{if } p(\mathbf{x}) < 0. \end{cases}$$
Assume $p(\mathbf{x})$ is 'weak', i.e., its predictions are not always
correct.
Now assume a family $\{p_j(\mathbf{x})\}_{j=1}^{m}$ of different weak
classifiers.
Boosting
A boosting classifier takes a linear combination of these
classifiers to form a better one, for example,
$$f(\mathbf{x}) = \sum_{i=1}^{m} a_i\, p_i(\mathbf{x}),$$
where, e.g., $f(\mathbf{x}) \ge 0$ selects the $+$ class, and
otherwise the $-$ class.
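To make the linear-combination idea concrete, here is a minimal sketch in Python; the two threshold classifiers and the weights $a_i$ are placeholder assumptions, not from the slides:

import numpy as np

# Hypothetical weak classifiers: each maps a feature vector x to +1 or -1.
weak_classifiers = [
    lambda x: 1 if x[0] > 0 else -1,   # threshold on the first feature
    lambda x: 1 if x[1] > 0 else -1,   # threshold on the second feature
]
a = [0.7, 0.3]  # assumed combination weights

def f(x):
    # Weighted vote f(x) = sum_i a_i p_i(x); the sign selects the class.
    s = sum(a_i * p(x) for a_i, p in zip(a, weak_classifiers))
    return 1 if s >= 0 else -1

print(f(np.array([0.5, -1.0])))  # -> +1 (first classifier dominates via its weight)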
2. AdaBoost (adaptive boosting) algorithm (Freund &
Schapire, 1997)
Consider data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. Weigh each data point
$(\mathbf{x}_i, y_i)$ equally with weight $W_1(i) = 1/n$.
Assume $\mathbf{x}_i \in F$ = feature space.
Goal: build a classifier $f(\mathbf{x})$ which generalizes the data set
$D$ to predict the class $y$ of $\mathbf{x}$.
Let $\mathcal{H}$ = space of allowed classifier functions $h$.
AdaBoost: introduction
Idea: We will take random samples from the data set $D$
(with replacement) according to the weight distribution $W_1$,
and later some distributions $W_2, W_3, \ldots$ which
emphasize the examples $\mathbf{x}_i$ we have misclassified
previously.
Specifically: train an initial classifier $h_1(\mathbf{x}): \mathbf{x} \to \{\pm 1\}$
which minimizes error with respect to the weight
distribution $W_1$.
That is:
AdaBoost: introduction
$h_1$ = the function $h$ which minimizes the weighted number of
errors
$$h_1 = \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{n} W_1(i)\, I(y_i \ne h(\mathbf{x}_i)), \tag{1}$$
where
$$I(y_i \ne h(\mathbf{x}_i)) = \begin{cases} 1 & \text{if } y_i \ne h(\mathbf{x}_i) \\ 0 & \text{otherwise.} \end{cases}$$
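As a quick illustration (not from the slides), the weighted error in (1) is a couple of lines in Python; the labels and predictions below are hypothetical:

import numpy as np

def weighted_error(W, y, y_pred):
    # Sum of weights W(i) over the examples the classifier got wrong,
    # i.e. sum_i W(i) * I(y_i != h(x_i)) from equation (1).
    return np.sum(W * (y != y_pred))

W = np.full(5, 1/5)                      # W_1(i) = 1/n
y = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, -1])   # two mistakes
print(weighted_error(W, y, y_pred))      # -> 0.4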
3. The algorithm:
1. Define
$\varepsilon_1$ = minimal value of the error in (1)
(i.e. the smallest value of the sum (1)).
Require: $\varepsilon_1 < 1/2$; otherwise stop (then we have a bad family
of classifiers $h$; we need better than 50% accuracy).
AdaBoost algorithm
2. Choose $\alpha_1$; typically
$$\alpha_1 = \frac{1}{2} \ln \frac{1 - \varepsilon_1}{\varepsilon_1}. \tag{2}$$
3. Update the weights $W$:
$$W_2(i) = \frac{W_1(i)\, e^{-\alpha_1 y_i h_1(\mathbf{x}_i)}}{Z_1(\alpha_1)},$$
where $Z_1$ = normalizing constant which makes $\{W_2(i)\}_i$ a
probability distribution (over $i$) adding up to 1.
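A sketch of this update in Python, continuing the toy numbers from the error example above (the arrays are hypothetical):

import numpy as np

def update_weights(W, alpha, y, y_pred):
    # W_2(i) = W_1(i) * exp(-alpha_1 * y_i * h_1(x_i)) / Z_1
    W_new = W * np.exp(-alpha * y * y_pred)
    return W_new / W_new.sum()   # dividing by Z_1 renormalizes to sum 1

W = np.full(5, 1/5)
y = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, -1])        # weighted error eps = 0.4
alpha = 0.5 * np.log((1 - 0.4) / 0.4)         # equation (2)
W2 = update_weights(W, alpha, y, y_pred)
print(W2)   # misclassified points (indices 1 and 4) now carry weight 1/4 each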
AdaBoost algorithm
Now $\{W_2(i)\}_i$ forms a family of weights which we can use
to 're-sample' the data set $D$ and form a new
classifier $h_2$.
Specifically,
$$h_2 = \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{n} W_2(i)\, I(y_i \ne h(\mathbf{x}_i)).$$
(Note that if the weight $W_2(i) > 1/n$ this is equivalent to
'oversampling' data point $\mathbf{x}_i$; otherwise to
'undersampling' $\mathbf{x}_i$.)
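Resampling according to these weights is one line with NumPy (a sketch; the weight values are the hypothetical ones from the update sketch above):

import numpy as np

W2 = np.array([1/6, 1/4, 1/6, 1/6, 1/4])   # hypothetical weights summing to 1
rng = np.random.default_rng(0)
# Draw n indices with replacement, index i chosen with probability W2[i]:
idx = rng.choice(len(W2), size=len(W2), replace=True, p=W2)
print(idx)   # high-weight (previously misclassified) points recur more often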
AdaBoost algorithm
4. Generally, define
$$h_t = \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^{n} W_t(i)\, I(y_i \ne h(\mathbf{x}_i)). \tag{2a}$$
Again let
$\varepsilon_t$ = smallest value of the sum in (2a)
and
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}. \tag{3}$$
AdaBoost algorithm
Then define
$$W_{t+1}(i) = \frac{W_t(i)\, e^{-\alpha_t y_i h_t(\mathbf{x}_i)}}{Z_t}, \tag{3a}$$
where
$Z_t$ = normalizing constant, as earlier.
AdaBoost algorithm
Then after $T$ steps form the final classifier
$$f(\mathbf{x}) = \operatorname{sgn} \sum_{t=1}^{T} \alpha_t\, h_t(\mathbf{x}).$$
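Putting steps 1-4 together, here is a compact sketch of the whole loop in Python. The decision-stump weak learner and the toy data are assumptions for illustration; the slides only require some family $\mathcal{H}$ of weak classifiers:

import numpy as np

def fit_stump(X, y, W):
    # Weak learner (an assumption): pick the decision stump
    # (feature j, threshold, sign) minimizing the weighted error in (2a).
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.sum(W * (pred != y))
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=40):
    n = len(y)
    W = np.full(n, 1.0 / n)                    # W_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):
        eps, j, thr, sign = fit_stump(X, y, W)
        if eps >= 0.5:                         # need better than 50% accuracy
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # equation (3)
        pred = np.where(X[:, j] >= thr, sign, -sign)
        W = W * np.exp(-alpha * y * pred)      # equation (3a), numerator
        W /= W.sum()                           # dividing by the sum is Z_t
        stumps.append((j, thr, sign))
        alphas.append(alpha)

    def f(Xnew):
        # Final classifier f(x) = sgn sum_t alpha_t h_t(x)
        s = sum(a * np.where(Xnew[:, j] >= thr, sg, -sg)
                for a, (j, thr, sg) in zip(alphas, stumps))
        return np.sign(s)
    return f

# Toy usage on synthetic two-dimensional data:
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
f = adaboost(X, y)
print("training accuracy:", np.mean(f(X) == y))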
4. Observations about AdaBoost
Note that the updating equation (3a) has the property
$$e^{-\alpha_{t-1} y_i h_{t-1}(\mathbf{x}_i)} = \begin{cases} e^{-\alpha_{t-1}} < 1 & \text{if } y_i = h_{t-1}(\mathbf{x}_i) \\ e^{\alpha_{t-1}} > 1 & \text{if } y_i \ne h_{t-1}(\mathbf{x}_i). \end{cases}$$
Thus after we find the best classifier $h_{t-1}$ for the sample
distribution $W_{t-1}$, the examples $\mathbf{x}_i$ which $h_{t-1}$ identified
incorrectly are weighted more for the selection of $h_t$.
Thus the classifier based on distribution $W_t$ will better
identify examples that the previous weak classifier
missed.
AdaBoost: additional observations
Recall the normalization constant
$$Z_t(\alpha_t) = \sum_i W_t(i)\, e^{-\alpha_t y_i h_t(\mathbf{x}_i)}. \tag{4}$$
This is a weighted measure of the total error at the
previous step, since
$$y_i h_t(\mathbf{x}_i) = \begin{cases} 1 & \text{if } h_t(\mathbf{x}_i) = y_i \text{ (correct prediction)} \\ -1 & \text{if } h_t(\mathbf{x}_i) \ne y_i \text{ (incorrect prediction)}, \end{cases}$$
which means $e^{-\alpha_t y_i h_t(\mathbf{x}_i)}$ is large if $h_t(\mathbf{x}_i) \ne y_i$.
AdaBoost: additional observations
Can show: our definition (3) for $\alpha_t$ minimizes $Z_t(\alpha_t)$ at
each step (just use calculus to minimize (4) above with
respect to the variable $\alpha_t$).
Note: can also show
$$Z_t = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$$
for the above choice of $\alpha_t$.
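Writing out the calculus step (splitting the sum (4) into correct and incorrect predictions, with $\sum_{\text{correct}} W_t(i) = 1 - \varepsilon_t$ and $\sum_{\text{incorrect}} W_t(i) = \varepsilon_t$):
\begin{align*}
Z_t(\alpha) &= \sum_{i:\, h_t(\mathbf{x}_i) = y_i} W_t(i)\, e^{-\alpha} + \sum_{i:\, h_t(\mathbf{x}_i) \ne y_i} W_t(i)\, e^{\alpha}
= (1 - \varepsilon_t)\, e^{-\alpha} + \varepsilon_t\, e^{\alpha}, \\
\frac{dZ_t}{d\alpha} &= -(1 - \varepsilon_t)\, e^{-\alpha} + \varepsilon_t\, e^{\alpha} = 0
\;\Longrightarrow\; \alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}, \\
Z_t(\alpha_t) &= (1 - \varepsilon_t)\sqrt{\tfrac{\varepsilon_t}{1 - \varepsilon_t}} + \varepsilon_t\sqrt{\tfrac{1 - \varepsilon_t}{\varepsilon_t}} = 2\sqrt{\varepsilon_t(1 - \varepsilon_t)}.
\end{align*}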
Recall reweighting formula:
AdaBoost: additional observations
$$W_{t+1}(i) = \frac{W_t(i)\, e^{-\alpha_t y_i h_t(\mathbf{x}_i)}}{Z_t}
\;\overset{\text{recursion}}{=}\;
\frac{e^{-y_i \sum_{q=1}^{t} \alpha_q h_q(\mathbf{x}_i)}}{n \prod_{q=1}^{t} Z_q}$$
(recall the initial value $W_1(i) = 1/n$).
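The second equality follows by unrolling (3a) down to $t = 1$ and using $W_1(i) = 1/n$:
$$W_{t+1}(i) = W_1(i) \prod_{q=1}^{t} \frac{e^{-\alpha_q y_i h_q(\mathbf{x}_i)}}{Z_q}
= \frac{1}{n}\,\frac{e^{-y_i \sum_{q=1}^{t} \alpha_q h_q(\mathbf{x}_i)}}{\prod_{q=1}^{t} Z_q}.$$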
AdaBoost: additional observations
Can also show: the fact that the $\alpha_t$ are chosen to minimize $Z_t$
implies that
$$\sum_{i:\, h_t(\mathbf{x}_i) = y_i} W_{t+1}(i) = \sum_{i:\, h_t(\mathbf{x}_i) \ne y_i} W_{t+1}(i);$$
this shows that half the new weight is focused on the
examples misclassified by the previous classifier $h_t$
(where $h_t(\mathbf{x}_i) \ne y_i$).
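A one-line check, using $Z_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}$ and $e^{\alpha_t} = \sqrt{(1-\varepsilon_t)/\varepsilon_t}$: the total weight on the misclassified examples is
$$\sum_{i:\, h_t(\mathbf{x}_i) \ne y_i} W_{t+1}(i) = \frac{\varepsilon_t\, e^{\alpha_t}}{Z_t}
= \frac{\sqrt{\varepsilon_t(1-\varepsilon_t)}}{2\sqrt{\varepsilon_t(1-\varepsilon_t)}} = \frac{1}{2}.$$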
AdaBoost: examples
Example 1:
Training set:
AdaBoost: examples
Blue: $N(0, 1)$ (two-dimensional Gaussian)
Red: $\dfrac{1}{r\sqrt{8\pi^2}}\, e^{-\frac{1}{2}(r-4)^2}$
(J. Matas, J. Sochman)
AdaBoost: examples
(distribution is radial with uniform angular
distribution)
Select the weak (always linear) classifier with the smallest
weighted error (all points weighted equally at this stage):
$W_1(i) = 1/n$.
AdaBoost: examples
Re-weighting formula: find [# Ð3Ñ. Give more weight to
misclassified examples;
AdaBoost: examples
New classifier (focuses more on the higher-weight examples):
Reweigh again based on previously misclassified
examples, etc.
AdaBoost: examples
Summary: first classifier at $t = 1$ and its error rate:
Fig. 1: Here and below the shaded region is currently classified red (figures
due to J. Matas, J. Sochman)
AdaBoost: examples
Second classifier: $t = 2$ and its error rate:
AdaBoost: examples
$t = 3$:
AdaBoost: examples
$t = 4$:
AdaBoost: examples
$t = 5$:
AdaBoost: examples
$t = 6$:
AdaBoost: examples
$t = 7$:
AdaBoost: examples
$t = 40$:
AdaBoost: examples
Note that the misclassified examples closest to the decision
boundary are the ones which end up with the highest weight:
AdaBoost: examples
Baseline: AdaBoost with RBF-network weak classifiers,
from (Rätsch, 2000).
AdaBoost: examples
Blue: discrete AdaBoost
Green: continuous AdaBoost
Red: AdaBoost with totally corrective step (TCA)
Cyan: continuous AdaBoost with totally corrective step
Dashed lines: training (leave-one-out) error; solid lines: test
error
AdaBoost: examples
Above: the dual-distribution example from earlier.
AdaBoost: examples
(Serum antibodies as predictors of thyroiditis)
AdaBoost: examples
Breast cancer classification
AdaBoost: examples
Diabetes from blood markers
AdaBoost: examples
Heart disease from blood markers
All figures: J. Matas and J. Sochman (2005)
AdaBoost: examples
Long and Vega (2003):
AdaBoost vs. other classifiers in microarray cancer
diagnosis (cross-validation error rates)
AdaBoost: examples
HCC = Hepatocellular carcinoma
ER = estrogen receptor status in breast cancer
LN = lymph node spread of cancer