Introduction to Boosting

February 2, 2012
Outline
1. Boosting - Intuitively
2. Adaboost - Overview
3. Adaboost - Demo
4. Adaboost - Details
Boosting - Introduction
Train one base learner at a time.
Focus it on the mistakes of its predecessors.
Weight it based on how ‘useful’ it is in the ensemble (not on its training error).
Boosting - Introduction
Under very mild assumptions on the base learners (“Weak learning” - they must perform better than random guessing):
1. Training error is eventually reduced to zero (we can fit the training data).
2. Generalisation error continues to be reduced even after training error is zero (there is no overfitting).
Boosting - Introduction
The most popular ensemble algorithm is a boosting algorithm called “Adaboost”.
Adaboost is short for “Adaptive Boosting”, because the algorithm adapts weights on the base learners and training examples.
There are many explanations of precisely what Adaboost does and why it is so successful - but the basic idea is simple!
Adaboost - Pseudo-code
Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for
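Before looking at the details, here is a hedged sketch of what this loop looks like in practice, assuming scikit-learn is available (its AdaBoostClassifier implements a variant of this procedure, with a depth-1 decision tree as the default base learner). The synthetic data is purely illustrative and not from the lecture:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Purely illustrative synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 50 boosting rounds; the default base learner is a depth-1 decision tree (a stump).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))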
Adaboost - Loss Function
To see precisely how Adaboost is implemented, we need to understand its Loss Function:

L(H) = Σ_{i=1}^N e^{−m_i}    (1)

m_i = y_i Σ_{k=1}^K α_k h_k(x_i)    (2)

m_i is the voting margin.
There are N examples and K base learners.
α_k is the weight of the k-th learner, h_k(x_i) is its prediction on x_i.
Adaboost - Loss Function
Why Exponential Loss?
Loss is higher when a prediction is wrong.
Loss is steeper when a prediction is wrong.
Precise reasons later...
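To make “higher and steeper” concrete, here is a minimal check (not from the lecture) comparing the exponential loss e^{−m} with the 0/1 error indicator δ[m ≤ 0] at a few margins:

import math

# Compare the 0/1 error indicator with the exponential loss at a few margins.
for m in (-2.0, -1.0, -0.1, 0.0, 0.1, 1.0, 2.0):
    zero_one = 1 if m <= 0 else 0       # classification error indicator
    exp_loss = math.exp(-m)             # exponential loss e^{-m}
    print(f"m = {m:+.1f}   0/1 loss = {zero_one}   exp loss = {exp_loss:.3f}")

# The exponential loss always lies on or above the indicator, and it grows
# rapidly as the margin becomes more negative (confident wrong predictions).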
Adaboost - Example Weights
After k learners, the example weights used to train the (k+1)-th learner are:

D_k(i) ∝ e^{−m_i}    (3)
       = e^{−y_i Σ_{j=1}^k α_j h_j(x_i)}    (4)
       = e^{−y_i Σ_{j=1}^{k−1} α_j h_j(x_i)} e^{−y_i α_k h_k(x_i)}    (5)
       ∝ D_{k−1}(i) e^{−y_i α_k h_k(x_i)}    (6)

D_k(i) = D_{k−1}(i) e^{−y_i α_k h_k(x_i)} / Z_k    (7)

Z_k normalises D_k(i) such that Σ_{i=1}^N D_k(i) = 1.
D_k(i) is larger when m_i is negative (the current ensemble misclassifies example i).
D_k(i) is directly proportional to example i's contribution to the loss function.
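As a minimal sketch (not the lecturer's code), the update in equation (7) written with NumPy; D_prev, alpha_k, y and h_k are hypothetical array names:

import numpy as np

def update_example_weights(D_prev, alpha_k, y, h_k):
    """One example-weight update, equation (7).

    D_prev : weights D_{k-1}(i), shape (N,)
    alpha_k: weight of the k-th base learner
    y, h_k : true labels and learner predictions in {-1, +1}, shape (N,)
    """
    D_unnorm = D_prev * np.exp(-alpha_k * y * h_k)  # D_{k-1}(i) e^{-y_i alpha_k h_k(x_i)}
    Z_k = D_unnorm.sum()                            # normalisation constant Z_k
    return D_unnorm / Z_k, Z_k

For α_k > 0 (i.e. ε_k < 0.5), misclassified examples have their weight multiplied by e^{α_k} > 1 and correctly classified ones by e^{−α_k} < 1.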
Adaboost - Learner Weights
Once the k-th learner is trained, we evaluate its predictions on the training data h_k(x_i) and assign weight:

α_k = (1/2) log((1 − ε_k)/ε_k)    (8)

ε_k = Σ_{i=1}^N D_{k−1}(i) δ[h_k(x_i) ≠ y_i]    (9)

ε_k is the weighted error of the k-th learner.
The log in the learner weight is related to the exponential in the loss function.
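A matching sketch (again illustrative, with the same hypothetical array names) for equations (8) and (9):

import numpy as np

def learner_weight(D_prev, y, h_k):
    """Weighted error (9) and learner weight (8) for the k-th learner."""
    eps_k = np.sum(D_prev * (h_k != y))          # weighted error epsilon_k
    # note: eps_k = 0 would give an infinite alpha_k; implementations add a small constant
    alpha_k = 0.5 * np.log((1 - eps_k) / eps_k)  # larger weight for more accurate learners
    return eps_k, alpha_k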
Adaboost - Pseudo-code

D_k(i): Example i weight after learner k
α_k: Learner k weight

∀i: D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}.
    h_k ← base learner trained on D
    ε_k ← Σ_{i=1}^N D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    α_k ← (1/2) log((1 − ε_k)/ε_k)
    D_k(i) ← D_{k−1}(i) e^{−α_k y_i h_k(x_i)} / Z_k
end for
Adaboost - Pseudo-code
Decision rule:

H(x) = sign(Σ_{k=1}^K α_k h_k(x))    (10)
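Putting the pieces together, here is a runnable from-scratch sketch of the full procedure (illustrative, not the lecturer's code) using decision stumps as base learners. All function names are made up, and weighted training is done via per-example weights rather than resampling, a common variant of the “sample with D_{k−1}” step:

import numpy as np

def fit_stump(X, y, D):
    """Pick the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):                     # each feature
        for thr in np.unique(X[:, j]):              # each candidate threshold
            for sign in (1, -1):                    # predict +sign above the threshold
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(D * (pred != y))       # weighted error of this rule
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best                                     # (error, feature, threshold, sign)

def stump_predict(stump, X):
    _, j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost_train(X, y, K=20):
    """AdaBoost with decision stumps, following the pseudo-code above."""
    N = X.shape[0]
    D = np.full(N, 1.0 / N)                         # D_0(i) = 1/N for all i
    stumps, alphas = [], []
    for _ in range(K):
        stump = fit_stump(X, y, D)                  # train base learner on weighted data
        h = stump_predict(stump, X)                 # test base learner on all data
        eps = np.sum(D * (h != y))                  # weighted error eps_k
        if eps >= 0.5:                              # weak learning assumption violated: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # learner weight alpha_k
        D = D * np.exp(-alpha * y * h)              # re-weight examples, equation (7)
        D = D / D.sum()                             # normalise by Z_k
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Decision rule (10): H(x) = sign(sum_k alpha_k h_k(x))."""
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

# Tiny usage example on synthetic data (labels in {-1, +1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_train(X, y, K=20)
print("training error:", np.mean(adaboost_predict(stumps, alphas, X) != y))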
Adaboost - Demo
[Demo]
Adaboost - Training Error
So the training error rate of Adaboost reaches 0 for large enough K.
This follows from combining the exponential loss with the weak learning assumption.
If the base learners' weighted error is always ≤ 1/2 − γ, then:

e_tr ≤ (√(1 − 4γ²))^K    (11)
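A quick illustration (the value γ = 0.1 is an arbitrary choice, not from the slides) of how fast the bound (11) shrinks:

import math

gamma = 0.1  # assume every weak learner beats random guessing by this margin
for K in (10, 50, 100, 200):
    bound = math.sqrt(1 - 4 * gamma**2) ** K   # right-hand side of (11)
    print(f"K = {K:3d}   bound on e_tr = {bound:.4f}")

# The bound decays geometrically in K; once it drops below 1/N the training
# error must be exactly zero, since e_tr is a multiple of 1/N.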
Adaboost - Training Error
Let’s derive the bound!
1. Show that the exponential loss is an upper bound on the classification error.
2. Show that the loss can be written as a product of the normalisation constants.
3. Minimise the normalisation constant at each step (this derives the learner weight rule).
4. Write this constant using the weighted error at each step.
Adaboost - Training Error
Show that the exponential loss is an upper bound on the classification error.

e_tr = (1/N) Σ_{i=1}^N δ[H(x_i) ≠ y_i]    (12)
     = (1/N) Σ_{i=1}^N δ[m_i ≤ 0]    (13)
     ≤ (1/N) Σ_{i=1}^N e^{−m_i}    (14)

H(x_i) ≠ y_i exactly when the margin m_i is non-positive, and because x ≤ 0 → e^{−x} ≥ 1 and x > 0 → e^{−x} ≥ 0, we have e^{−m_i} ≥ δ[m_i ≤ 0] for every example.
Adaboost - Training Error
Show that the loss can be written as a product of the normalisation constants.

D_k(i) = D_{k−1}(i) e^{−y_i α_k h_k(x_i)} / Z_k    (15)
       = D_0(i) Π_{j=1}^k [ e^{−y_i α_j h_j(x_i)} / Z_j ]    (16)
       = (1/N) e^{−y_i Σ_{j=1}^k α_j h_j(x_i)} / Π_{j=1}^k Z_j    (17)

e^{−m_i} = N D_k(i) Π_{j=1}^k Z_j    (18)
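A small numerical sanity check (illustrative; the learners and weights are random stand-ins) that the telescoped identity (18) holds when the example weights follow update (7):

import numpy as np

rng = np.random.default_rng(0)
N, k = 5, 4
y = rng.choice([-1, 1], size=N)                 # labels
H = rng.choice([-1, 1], size=(k, N))            # h_j(x_i) for j = 1..k (random stand-ins)
alphas = rng.uniform(0.1, 1.0, size=k)          # arbitrary learner weights

D = np.full(N, 1.0 / N)                         # D_0(i) = 1/N
Zs = []
for j in range(k):
    D_unnorm = D * np.exp(-alphas[j] * y * H[j])   # update (7), before normalising
    Zs.append(D_unnorm.sum())                      # Z_j
    D = D_unnorm / Zs[-1]

m = y * (alphas[:, None] * H).sum(axis=0)       # margins m_i after k learners
print(np.allclose(np.exp(-m), N * D * np.prod(Zs)))   # equation (18): prints True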
Adaboost - Training Error
Show that the loss can be written as a product of the normalisation constants.

e^{−m_i} = N D_k(i) Π_{j=1}^k Z_j    (19)

e_tr ≤ (1/N) Σ_{i=1}^N e^{−m_i}    (20)
     = (1/N) Σ_{i=1}^N N D_k(i) Π_{j=1}^k Z_j    (21)
     = [ Σ_{i=1}^N D_k(i) ] Π_{j=1}^k Z_j    (22)
     = Π_{j=1}^k Z_j    (23)

The last step uses Σ_{i=1}^N D_k(i) = 1.
Adaboost - Training Error
Minimise the normalisation constant at each step.

Z_k = Σ_{i=1}^N D_{k−1}(i) e^{−y_i α_k h_k(x_i)}    (24)
    = Σ_{h_k(x_i) ≠ y_i} D_{k−1}(i) e^{α_k} + Σ_{h_k(x_i) = y_i} D_{k−1}(i) e^{−α_k}    (25)
    = ε_k e^{α_k} + (1 − ε_k) e^{−α_k}    (26)
Adaboost - Training Error
Minimise the normalisation constant at each step.

Z_k = ε_k e^{α_k} + (1 − ε_k) e^{−α_k}    (27)

∂Z_k/∂α_k = ε_k e^{α_k} − (1 − ε_k) e^{−α_k}    (28)

0 = ε_k e^{2α_k} − (1 − ε_k)    (29)

e^{2α_k} = (1 − ε_k)/ε_k    (30)

α_k = (1/2) log((1 − ε_k)/ε_k)    (31)

Minimising Z_k gives us the learner weight rule.
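A numeric check (ε_k = 0.3 is an arbitrary value, not from the slides) that the closed form (31) is indeed where Z_k is smallest:

import numpy as np

eps = 0.3   # an arbitrary weighted error below 0.5

def Z(alpha):
    """Z_k as a function of the learner weight, equation (27)."""
    return eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)

alpha_star = 0.5 * np.log((1 - eps) / eps)           # closed form from (31)
grid = np.linspace(-2.0, 2.0, 4001)
print("grid minimiser:", grid[np.argmin(Z(grid))])   # approximately 0.42
print("closed form:   ", alpha_star)                 # approximately 0.42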
Adaboost - Training Error
Write the normalisation constant using the weighted error.

Z_k = ε_k e^{α_k} + (1 − ε_k) e^{−α_k}    (32)

α_k = (1/2) log((1 − ε_k)/ε_k)    (33)

Z_k = ε_k e^{(1/2) log((1−ε_k)/ε_k)} + (1 − ε_k) e^{−(1/2) log((1−ε_k)/ε_k)}    (34)
    = 2 √(ε_k(1 − ε_k))    (35)

So Z_k depends only on ε_k, and the weak learning assumption says that ε_k < 0.5.
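Continuing the check above with the same arbitrary ε, substituting α_k back into Z_k reproduces the closed form (35):

import math

eps = 0.3
alpha = 0.5 * math.log((1 - eps) / eps)
Z_direct = eps * math.exp(alpha) + (1 - eps) * math.exp(-alpha)  # equation (34)
Z_closed = 2 * math.sqrt(eps * (1 - eps))                        # equation (35)
print(Z_direct, Z_closed)   # both approximately 0.9165, and below 1 since eps < 0.5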
Adaboost - Training Error
We bound the training error using: e_tr ≤ Π_{j=1}^k Z_j
We've found that Z_j is a function of the weighted error: Z_j = 2 √(ε_j(1 − ε_j))
So we can bound the training error: e_tr ≤ Π_{j=1}^k 2 √(ε_j(1 − ε_j))
Let ε_j ≤ 1/2 − γ; since 2 √((1/2 − γ)(1/2 + γ)) = √(1 − 4γ²), we can write e_tr ≤ (√(1 − 4γ²))^k
If γ > 0, then this bound will shrink to 0 when k is large enough.
Adaboost - Training Error Summary
We have seen an upper bound on the training error which decreases as K increases.
We can prove the bound by considering the loss function as a bound on the error rate.
Examining the role of the normalisation constant in this bound reveals the learner weight rule.
Examining the role of γ in the bound shows why we can achieve 0 training error even with weak learners.