
Ensemble Learning
Reading:
R. Schapire, A brief introduction to boosting
Ensemble learning

[Diagram] Training sets S1, S2, ..., SN are each used to learn a hypothesis (h1, h2, ..., hN); the hypotheses are then combined into a single ensemble hypothesis H.
Advantages of ensemble learning
• Can be very effective at reducing generalization error!
(E.g., by voting.)
• Ideal case: the hi have independent errors
Example
Given three hypotheses, h1, h2, h3, with hi(x) ∈ {−1, 1}.
Suppose each hi has 60% generalization accuracy, and assume errors are independent.
Now suppose H(x) is the majority vote of h1, h2, and h3.
What is the probability that H is correct?
h1    h2    h3    H     probability
C     C     C     C     .216
C     C     I     C     .144
C     I     I     I     .096
C     I     C     C     .144
I     C     C     C     .144
I     I     C     I     .096
I     C     I     I     .096
I     I     I     I     .064

(C = correct, I = incorrect)

Total probability correct: .648
Another Example
Again, given three hypotheses, h1, h2, h3.
Suppose each hi has 40% generalization accuracy, and assume errors are independent.
Now suppose we classify x by the majority vote of h1, h2, and h3.
What is the probability that the classification is correct?
h1    h2    h3    H     probability
C     C     C     C     .064
C     C     I     C     .096
C     I     I     I     .144
C     I     C     C     .096
I     C     C     C     .096
I     I     C     I     .144
I     C     I     I     .144
I     I     I     I     .216

Total probability correct: .352
General case
In general, if hypotheses h1, ..., hM all have generalization accuracy A, what is the probability that a majority vote will be correct?
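With independent errors this has a closed form: a majority vote of M classifiers (M odd, so there are no ties) is correct exactly when more than M/2 of them are correct, which is a binomial tail probability. A minimal sketch of the calculation:

```python
from math import comb

def majority_vote_accuracy(M, A):
    """Probability that a majority of M independent classifiers,
    each correct with probability A, votes for the correct class.
    Assumes M is odd so ties cannot occur."""
    return sum(comb(M, k) * A**k * (1 - A)**(M - k)
               for k in range(M // 2 + 1, M + 1))

print(majority_vote_accuracy(3, 0.6))  # 0.648, the first example above
print(majority_vote_accuracy(3, 0.4))  # 0.352, the second example above
```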
Possible problems with ensemble learning
• Errors are typically not independent
• Training time and classification time are increased by a
factor of M.
• Hard to explain how ensemble hypothesis does
classification.
• How to get enough data to create M separate data sets,
S1, ..., SM?
• Three popular methods:
– Voting:
• Train classifier on M different training sets Si to
obtain M different classifiers hi.
• For a new instance x, define H(x) as:
H(x) = Σ_{i=1}^M αi hi(x)

where αi is a confidence measure for classifier hi.
– Bagging (Breiman, 1990s):
• To create Si, create “bootstrap replicates” of the original training set S by sampling |S| examples from S uniformly at random with replacement (see the sketch after this list).
– Boosting (Schapire & Freund, 1990s)
• To create Si, reweight examples in original training
set S as a function of whether or not they were
misclassified on the previous round.
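For the bagging method above, a minimal sketch of building bootstrap replicates (as referenced in that bullet); `train` stands in for an arbitrary learning algorithm L and is a placeholder, not a specific library call:

```python
import random

def bootstrap_replicate(S):
    """Sample len(S) examples from S uniformly at random, with replacement,
    so some examples repeat and others are left out."""
    return [random.choice(S) for _ in range(len(S))]

def bagging(S, M, train):
    """Train M classifiers, one per bootstrap replicate S_i of S."""
    return [train(bootstrap_replicate(S)) for _ in range(M)]
```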
Adaptive Boosting (Adaboost)
A method for combining different weak hypotheses
(training error close to but less than 50%) to produce a
strong hypothesis (training error close to 0%)
Sketch of algorithm
Given examples S and learning algorithm L, with | S | = N
• Initialize probability distribution over examples w1(i) = 1/N .
• Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hK.
– At each step, derive St from S by choosing examples
probabilistically according to probability distribution wt . Use St
to learn ht.
• At each step, derive wt + 1 by giving more probability to examples that
were misclassified at step t.
• The final ensemble classifier H is a weighted sum of the ht’s, with each
weight being a function of the corresponding ht’s error on its training
set.
Adaboost algorithm
• Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1}
• Initialize w1(i) = 1/N. (Uniform distribution over data)
• For t = 1, ..., K:
– Select new training set St from S with replacement, according to wt
– Train L on St to obtain hypothesis ht
– Compute the training error εt of ht on S:

      εt = Σ_{j=1}^N wt(j) δ(yj ≠ ht(xj)),  where δ(yj ≠ ht(xj)) = 1 if yj ≠ ht(xj), and 0 otherwise

– Compute coefficient:

      αt = ½ ln( (1 − εt) / εt )

– Compute new weights on data:

  For i = 1 to N:

      wt+1(i) = wt(i) exp(−αt yi ht(xi)) / Zt

  where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:

      Zt = Σ_{i=1}^N wt(i) exp(−αt yi ht(xi))
• At the end of K iterations of this algorithm, we have h1, h2, . . . , hK.
  We also have α1, α2, . . . , αK, where

      αt = ½ ln( (1 − εt) / εt )

• Ensemble classifier:

      H(x) = sgn( Σ_{t=1}^K αt ht(x) )
• Note that hypotheses with higher accuracy on their training sets are
weighted more strongly.
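A minimal sketch of the algorithm above, assuming a generic weak learner `train(X, y)` that returns a callable hypothesis; the names and resampling details are illustrative, not a reference implementation:

```python
import numpy as np

def adaboost(X, y, train, K, seed=0):
    """X: (N, d) array of examples; y: labels in {+1, -1}.
    train(Xs, ys) returns a hypothesis h, where h(X) gives predictions in {+1, -1}.
    Returns hypotheses h_1..h_K and coefficients alpha_1..alpha_K."""
    rng = np.random.default_rng(seed)
    N = len(y)
    w = np.full(N, 1.0 / N)                 # w_1(i) = 1/N
    hypotheses, alphas = [], []
    for t in range(K):
        idx = rng.choice(N, size=N, replace=True, p=w)   # S_t drawn according to w_t
        h = train(X[idx], y[idx])
        pred = h(X)                          # run h_t on all of S
        eps = np.sum(w * (pred != y))        # training error of h_t on S
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)    # misclassified examples gain weight
        w = w / w.sum()                      # divide by Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def ensemble_predict(hypotheses, alphas, X):
    """H(x) = sgn( sum_t alpha_t h_t(x) )."""
    return np.sign(sum(a * h(X) for h, a in zip(hypotheses, alphas)))
```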
A Hypothetical Example

S = {x1, x2, x3, x4, x5, x6, x7, x8}
where {x1, x2, x3, x4} are class +1
      {x5, x6, x7, x8} are class −1

t = 1:
w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8}
S1 = {x1, x2, x2, x5, x5, x6, x7, x8} (notice some repeats)

Train classifier on S1 to get h1.
Run h1 on S. Suppose the classifications are: {1, −1, −1, −1, −1, −1, −1, −1}

• Calculate error:

      ε1 = Σ_{j=1}^N w1(j) δ(yj ≠ h1(xj)) = (1/8)(3) = .375
• Calculate α1:

      α1 = ½ ln( (1 − ε1) / ε1 ) = .255

• Calculate new w's:

      wt+1(i) = wt(i) exp(−αt yi ht(xi)) / Zt

      ŵ2(1) = (.125) exp(−.255(1)(1))   = 0.1
      ŵ2(2) = (.125) exp(−.255(1)(−1))  = 0.16
      ŵ2(3) = (.125) exp(−.255(1)(−1))  = 0.16
      ŵ2(4) = (.125) exp(−.255(1)(−1))  = 0.16
      ŵ2(5) = (.125) exp(−.255(−1)(−1)) = 0.1
      ŵ2(6) = (.125) exp(−.255(−1)(−1)) = 0.1
      ŵ2(7) = (.125) exp(−.255(−1)(−1)) = 0.1
      ŵ2(8) = (.125) exp(−.255(−1)(−1)) = 0.1

      Z1 = Σ_i ŵ2(i) = .98

      w2(1) = 0.1 / .98 = 0.102
      w2(2) = 0.163
      w2(3) = 0.163
      w2(4) = 0.163
      w2(5) = 0.102
      w2(6) = 0.102
      w2(7) = 0.102
      w2(8) = 0.102
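The t = 1 numbers above can be checked with a few lines of NumPy; the exact values agree with the slides up to the intermediate rounding used there:

```python
import numpy as np

y  = np.array([1, 1, 1, 1, -1, -1, -1, -1])     # true labels of x1..x8
h1 = np.array([1, -1, -1, -1, -1, -1, -1, -1])  # h1's classifications on S
w1 = np.full(8, 1/8)

eps1   = np.sum(w1 * (h1 != y))                 # 0.375
alpha1 = 0.5 * np.log((1 - eps1) / eps1)        # about 0.255
w_hat2 = w1 * np.exp(-alpha1 * y * h1)          # unnormalized weights
w2     = w_hat2 / w_hat2.sum()                  # correctly classified -> 0.10,
                                                # misclassified -> about 0.167
                                                # (the slides' 0.102 / 0.163 reflect rounding)
```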
t=2
w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102}
S2 = {x1, x2, x2, x3, x4, x4, x7, x8}
Learn classifier on S2 to get h2
Run h2 on S. Suppose classifications are: {1, 1, 1, 1, 1, 1, 1, 1}
Calculate error:

      ε2 = Σ_{j=1}^N w2(j) δ(yj ≠ h2(xj)) = (.102) × 4 = 0.408
• Calculate α2:

      α2 = ½ ln( (1 − ε2) / ε2 ) = .186

• Calculate new w's:

      wt+1(i) = wt(i) exp(−αt yi ht(xi)) / Zt

      ŵ3(1) = (.102) exp(−.186(1)(1))  = 0.08
      ŵ3(2) = (.163) exp(−.186(1)(1))  = 0.135
      ŵ3(3) = (.163) exp(−.186(1)(1))  = 0.135
      ŵ3(4) = (.163) exp(−.186(1)(1))  = 0.135
      ŵ3(5) = (.102) exp(−.186(−1)(1)) = 0.122
      ŵ3(6) = (.102) exp(−.186(−1)(1)) = 0.122
      ŵ3(7) = (.102) exp(−.186(−1)(1)) = 0.122
      ŵ3(8) = (.102) exp(−.186(−1)(1)) = 0.122

      Z2 = Σ_i ŵ3(i) = .973

      w3(1) = 0.08 / .973 = 0.082
      w3(2) = 0.139
      w3(3) = 0.139
      w3(4) = 0.139
      w3(5) = 0.125
      w3(6) = 0.125
      w3(7) = 0.125
      w3(8) = 0.125
t = 3:
w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125}
S3 = {x2, x3, x3, x3, x5, x6, x7, x8}

Train classifier on S3 to get h3.
Run h3 on S. Suppose the classifications are: {1, 1, −1, 1, −1, −1, 1, −1}

• Calculate error:

      ε3 = Σ_{j=1}^N w3(j) δ(yj ≠ h3(xj)) = (.139) + (.125) = 0.264

• Calculate α3:

      α3 = ½ ln( (1 − ε3) / ε3 ) = .512
• Ensemble classifier:

      H(x) = sgn( Σ_{t=1}^K αt ht(x) )
           = sgn( .255 h1(x) + .186 h2(x) + .512 h3(x) )
Example

      Actual class    h1    h2    h3
x1          1          1     1     1
x2          1         −1     1     1
x3          1         −1     1    −1
x4          1          1     1     1
x5         −1         −1     1    −1
x6         −1         −1     1    −1
x7         −1          1     1     1
x8         −1         −1     1    −1
H(x) = sgn( Σ_{t=1}^T αt ht(x) ) = sgn( .255 h1(x) + .186 h2(x) + .512 h3(x) )

Recall the training set:
S = {x1, x2, x3, x4, x5, x6, x7, x8}
where {x1, x2, x3, x4} are class +1
      {x5, x6, x7, x8} are class −1

What is the accuracy of H on the training data?
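If you want to check your answer to this question, here is a short sketch that hard-codes the table above and evaluates H on the training set:

```python
import numpy as np

actual = np.array([1,  1,  1, 1, -1, -1, -1, -1])
h1     = np.array([1, -1, -1, 1, -1, -1,  1, -1])
h2     = np.array([1,  1,  1, 1,  1,  1,  1,  1])
h3     = np.array([1,  1, -1, 1, -1, -1,  1, -1])

H = np.sign(0.255 * h1 + 0.186 * h2 + 0.512 * h3)
print(np.mean(H == actual))   # accuracy of H on the training data
```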
Adaboost seems to reduce both bias and variance.
Adaboost does not seem to overfit as K increases.
Optional: read about the “margin theory” explanation of the success of boosting.
Recap of Adaboost algorithm
• Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1}

• Initialize w1(i) = 1/N. (Uniform distribution over data)

• For t = 1, ..., K:

  1. Select new training set St from S with replacement, according to wt.

  2. Train L on St to obtain hypothesis ht.

  3. Compute the training error εt of ht on S:

         εt = Σ_{j=1}^N wt(j) δ(yj ≠ ht(xj)),
         where δ(yj ≠ ht(xj)) = 1 if yj ≠ ht(xj), and 0 otherwise

     If εt > 0.5, abandon ht and go to step 1.

  4. Compute coefficient:

         αt = ½ ln( (1 − εt) / εt )

  5. Compute new weights on data:

     For i = 1 to N:

         wt+1(i) = wt(i) exp(−αt yi ht(xi)) / Zt

     where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:

         Zt = Σ_{i=1}^N wt(i) exp(−αt yi ht(xi))

• At the end of K iterations of this algorithm, we have h1, h2, . . . , hK and α1, α2, . . . , αK.

• Ensemble classifier:

      H(x) = sgn( Σ_{t=1}^K αt ht(x) )
In-class exercises
Problems with Boosting
• Sensitive to noise, outliers: Boosting focuses on those
examples that are hard to classify
• But this can be a way to identify outliers or noise in a
dataset
Boosting Decision Stumps
Decision Stumps
Let x = (x1, x2, …, xn)

Decision stump hi,t:

      if xi ≥ t then class = 1
      else class = −1
Example of Training a Decision Stump
Training data:
      x1    x2    Class
      .9    .5      1
      .7    .2     −1
      .3    .4     −1
      .1    .5      1
      .1    .6     −1
Training a decision stump for feature xi:

1. Sort the xi values; remove duplicates.
2. Construct candidate thresholds t: one below the min value, one above the max value, and one midway between each pair of successive values.
3. For each hi,t, compute its error on the training set.
4. Return the hi,t that maximizes ½ − error(hi,t).

(A code sketch of this procedure follows the worked example below.)
Example of Training a Decision Stump
Training data:

      x1    x2    Class
      .9    .5      1
      .7    .2      1
      .3    .1     −1
      .3    .4     −1
      .1    .5      1
      .1    .6     −1

Training decision stump for x1:

Sort, remove duplicates: .9  .7  .3  .1
Candidate thresholds: 1  .8  .5  .2  0

Err(h1,1) = 3/6    Err(h1,.8) = 2/6    Err(h1,.5) = 1/6    Err(h1,.2) = 3/6    Err(h1,0) = 3/6

h1,.5 maximizes ½ − error(hi,t)
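As referenced above, a minimal sketch of this stump-training procedure; placing the "below min" and "above max" thresholds a small offset outside the observed range is an illustrative choice:

```python
import numpy as np

def train_stump(x, y):
    """Train a decision stump h_{i,t} on one feature column x with labels y in {+1, -1}.
    The stump predicts +1 when x >= t, else -1; returns the best threshold t."""
    values = np.unique(x)                         # sort values, remove duplicates
    mids = (values[:-1] + values[1:]) / 2         # midway between successive values
    thresholds = np.concatenate(([values[0] - 0.1], mids, [values[-1] + 0.1]))
    best_t, best_score = None, -np.inf
    for t in thresholds:
        pred = np.where(x >= t, 1, -1)
        score = 0.5 - np.mean(pred != y)          # maximize 1/2 - error
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# The worked example above: feature x1
x1 = np.array([.9, .7, .3, .3, .1, .1])
y  = np.array([ 1,  1, -1, -1,  1, -1])
print(train_stump(x1, y))   # 0.5, where the error is 1/6
```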
Adaboost on Decision Stumps

Run Adaboost for M iterations, with L being the decision-stump learning algorithm just described. At each iteration, choose a feature i for learning the decision stump. (Choose i at random without replacement, or use a feature-selection method.)

Decision stump ht is learned using training data selected from the current distribution.

Coefficient αt is calculated by running ht on all training data.

After M iterations, run the ensemble classifier H on test data.
Case Study of Adaboost:
Viola-Jones Face Detection Algorithm
P. Viola and M. J. Jones, Robust real-time face detection.
International Journal of Computer Vision, 2004.
First face-detection algorithm to work well in real-time (e.g.,
on digital cameras)
Training Data
• Positive: 500 faces, scaled and aligned to a base resolution of 24 by 24 pixels.
• Negative: 300 million non-faces.
Features
Use rectangle features at multiple sizes and locations in an image subwindow (candidate face).
From http://makematics.com/research/viola-jones/
For each feature fj:

      fj = Σ_{b ∈ black pixels} intensity(pixel b) − Σ_{w ∈ white pixels} intensity(pixel w)
Possible number of features per 24 x 24 pixel subwindow > 180,000.
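A minimal sketch of evaluating one two-rectangle feature on a subwindow; the rectangle placement and which half counts as "black" are illustrative choices (Viola and Jones compute these sums far faster using the preprocessing step described below):

```python
import numpy as np

def rect_sum(window, top, left, height, width):
    """Sum of pixel intensities inside one rectangle of the subwindow."""
    return window[top:top + height, left:left + width].sum()

def two_rect_feature(window, top, left, height, width):
    """Horizontal two-rectangle feature: intensity sum of the left ("black")
    half minus intensity sum of the right ("white") half."""
    half = width // 2
    black = rect_sum(window, top, left, height, half)
    white = rect_sum(window, top, left + half, height, half)
    return black - white

# Example: one feature of size 8 x 12 placed at (4, 4) in a 24 x 24 subwindow
window = np.random.rand(24, 24)
print(two_rect_feature(window, 4, 4, 8, 12))
```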
Detecting faces
Given a new image:
• Scan image using subwindows at all locations and at
different scales
• For each subwindow, compute features and send them to
an ensemble classifier (learned via boosting). If classifier
is positive (“face”), then detect a face at this location and
scale.
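A minimal sketch of that scanning loop; the step size and scale list are illustrative, and `classify` stands in for the learned ensemble classifier:

```python
def detect_faces(image, classify, base=24, scales=(1.0, 1.5, 2.0), step=4):
    """Slide a square subwindow across `image` (a 2-D array) at several scales
    and return (row, col, size) for every window classified as a face (+1)."""
    detections = []
    for s in scales:
        size = int(base * s)
        for r in range(0, image.shape[0] - size + 1, step):
            for c in range(0, image.shape[1] - size + 1, step):
                if classify(image[r:r + size, c:c + size]) == 1:
                    detections.append((r, c, size))
    return detections
```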
Preprocessing
Preprocessing: Viola & Jones use a clever pre-processing step that allows the rectangular features to be computed very quickly. (See their paper for a description.)
They use a variant of AdaBoost to both select a small set of
features and train the classifier.
Base (“weak”) classifiers
For each feature fj ,
ìï 1 if p f (x) < p q
j j
j j
hj = í
ïî -1 otherwise
üï
ý
ïþ
where x is a 24 x 24-pixel subwindow of an image, θj is the
threshold that best separates the data using feature fj , and pj is
either -1 or 1.
Such features are called “decision stumps”.
Running Adaboost
For t = 1 to T:

  Select training examples for this iteration according to wt.

  Learning:
    Evaluate each rectangle feature on each training example.
    For each feature, use the selected training examples to find the best threshold, creating a decision stump.
    Run each decision stump on all training examples to find its error ε.
    Select the decision stump ht with the lowest value of ε.
    Remove ft (the feature used in ht) from the pool of features to be used (i.e., it can't be selected twice).

  Boosting:
    Compute αt for ht.
    Update the weights to get wt+1.

Final ensemble classifier:

On a test image x:
Run the ensemble classifier C(x) in many windows of different scales across the image. If the classification is 1 in a window, detect a face in this window.
https://www.youtube.com/watch?v=k3bJUP0ct08
https://www.youtube.com/watch?v=pZi9o-3ddq4
Viola-Jones “attentional cascade” algorithm
• Increases performance while reducing computation time
• Main idea:
– The vast majority of sub-windows are negative.
– Simple classifiers are used to reject the majority of sub-windows before more complex classifiers (with more features) are applied. (See the sketch below.)

[Cascade diagram from Wikipedia: stage 1 FP rate 50%, FN rate 0%; stage 2 FP rate 40%, FN rate 0%; stage 3 FP rate 10%, FN rate 0%.]
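A minimal sketch of the cascade idea; each element of `stages` stands for a boosted classifier tuned for a near-zero false-negative rate, ordered from cheapest to most complex:

```python
def cascade_classify(window, stages):
    """`stages` is a list of classifiers, each returning +1 ("maybe a face") or -1.
    A window is accepted only if every stage accepts it; most negatives are
    rejected by the cheap early stages, saving computation."""
    for stage in stages:
        if stage(window) == -1:
            return -1      # rejected early: no further features are computed
    return 1               # survived all stages: detect a face here
```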
Optional HW Assignment
Run Adaboost with decision stumps on the Spam dataset.

For t = 1 to T:

  Select training examples for this iteration according to wt.

  Learning:
    Evaluate each feature on each training example.
    For each feature, use the selected training examples to find the best threshold, creating a decision stump.
    Run each decision stump on all training examples to find its error ε.
    Select the decision stump ht with the lowest value of ε.
    Remove ft (the feature used in ht) from the pool of features to be used (i.e., it can't be selected twice).

  Boosting:
    Compute αt for ht.
    Update the weights to get wt+1.

Final ensemble classifier:

Compare the performance of Adaboost with that of SVM feature selection, using the same number of features.