Boosting and Bagging

Decision Trees
ID      Hair    Height   Weight   Lotion   Result
Sarah   Blonde  Average  Light    No       Sunburn
Dana    Blonde  Tall     Average  Yes      None
Alex    Brown   Tall     Average  Yes      None
Annie   Blonde  Short    Average  No       Sunburn
Emily   Red     Average  Heavy    No       Sunburn
Pete    Brown   Tall     Heavy    No       None
John    Brown   Average  Heavy    No       None
Katie   Blonde  Short    Light    Yes      None
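To make the entropy-minimizing choice of tree nodes concrete, here is a minimal sketch that computes the information gain of each attribute on the sunburn table above; the Python encoding of the rows is just for illustration.

```python
from collections import Counter
from math import log2

# The sunburn table above, encoded as (hair, height, weight, lotion) -> result.
DATA = [
    (("Blonde", "Average", "Light", "No"), "Sunburn"),
    (("Blonde", "Tall", "Average", "Yes"), "None"),
    (("Brown", "Tall", "Average", "Yes"), "None"),
    (("Blonde", "Short", "Average", "No"), "Sunburn"),
    (("Red", "Average", "Heavy", "No"), "Sunburn"),
    (("Brown", "Tall", "Heavy", "No"), "None"),
    (("Brown", "Average", "Heavy", "No"), "None"),
    (("Blonde", "Short", "Light", "Yes"), "None"),
]
ATTRS = ["Hair", "Height", "Weight", "Lotion"]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(attr_index):
    """Entropy reduction from splitting the table on one attribute."""
    labels = [result for _, result in DATA]
    groups = {}
    for features, result in DATA:
        groups.setdefault(features[attr_index], []).append(result)
    remainder = sum(len(g) / len(DATA) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

for i, name in enumerate(ATTRS):
    print(f"{name}: gain = {information_gain(i):.3f}")
```

An entropy-minimizing tree would split first on the attribute with the largest gain and recurse on each branch.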
Ensemble methods

- Use multiple models to obtain better predictive performance
- Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis
- Combine multiple weak learners to produce a strong learner (see the voting sketch after this list)
- Typically much more computation, since multiple learners are trained
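As a concrete illustration of combining hypotheses by an equal-weight vote, here is a minimal sketch using scikit-learn's VotingClassifier; the synthetic data and the particular base learners are arbitrary choices, not anything prescribed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hard majority vote over three fast, individually weak-ish models.
ensemble = VotingClassifier([
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("nb", GaussianNB()),
    ("lr", LogisticRegression(max_iter=1000)),
])

print("ensemble accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```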
Ensemble learners

- Typically combine multiple fast learners (like decision trees)
- Tend to overfit
- Tend to get better results, since significant diversity is deliberately introduced among the models
  - Diversity does not mean reduced performance
- Note that empirical studies have shown that random forests do better than an ensemble of decision trees
  - A random forest is an ensemble of decision trees that do not minimize entropy to choose tree nodes
- The Bayes optimal classifier is an ensemble learner
Bagging: Bootstrap aggregating

- Each model in the ensemble votes with equal weight
- Train each model on a random training set
- Random forests do better than bagged entropy-reducing DTs (compared in the sketch below)
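A hedged illustration of that comparison: the sketch below fits bagged entropy-minimizing trees and a random forest (which additionally restricts each split to a random feature subset via max_features) on synthetic data. It only shows the knob being compared, not a reproduction of the empirical studies mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)

# Bagged trees: each tree sees a bootstrap sample but still picks the
# entropy-minimizing split over *all* features.
bagged = BaggingClassifier(DecisionTreeClassifier(criterion="entropy"),
                           n_estimators=100, random_state=0)

# Random forest: bootstrap samples *plus* a random feature subset at each
# split, so trees no longer purely minimize entropy over all features.
forest = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                max_features="sqrt", random_state=0)

for name, model in [("bagged trees", bagged), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```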
Bootstrap estimation

- Repeatedly draw n samples from D (with replacement)
- For each set of samples, estimate a statistic
- The bootstrap estimate is the mean of the individual estimates
- Used to estimate a statistic (parameter) and its variance
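A minimal sketch of the procedure, using NumPy, a synthetic data set, and the sample median as the statistic (all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=5.0, scale=2.0, size=200)   # stand-in data set D

B = 1000                                        # number of bootstrap rounds
estimates = np.empty(B)
for b in range(B):
    # Draw n samples from D with replacement and estimate the statistic.
    sample = rng.choice(D, size=len(D), replace=True)
    estimates[b] = np.median(sample)

# Bootstrap estimate of the statistic and of its variance.
print("bootstrap estimate:", estimates.mean())
print("estimated variance:", estimates.var(ddof=1))
```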
Bagging

- For i = 1 .. M
  - Draw n* < n samples from D with replacement
  - Learn classifier Ci
- Final classifier is a vote of C1 .. CM
- Increases classifier stability / reduces variance
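A minimal sketch of that loop, with decision trees as the classifiers Ci and a hard majority vote; the base learner, the choice of n*, and M are illustrative assumptions, and X, y are assumed to be NumPy arrays with integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=25, n_star=None, seed=0):
    """Train M classifiers, each on n* < n samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_star = n_star or (2 * n) // 3          # illustrative choice of n* < n
    classifiers = []
    for _ in range(M):
        idx = rng.choice(n, size=n_star, replace=True)
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    """Final classifier: equal-weight majority vote of C1 .. CM."""
    votes = np.stack([c.predict(X) for c in classifiers])   # (M, n_samples)
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```

Usage would be `bagging_predict(bagging_fit(X_train, y_train), X_test)`.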
Boosting

- Incremental
- Build new models that try to do better on the previous model's misclassifications
- Can get better accuracy
- Tends to overfit
- Adaboost is the canonical boosting algorithm
Boosting (Schapire 1989)

- Randomly select n1 < n samples from D without replacement to obtain D1
  - Train weak learner C1
- Select n2 < n samples from D, with half of the samples misclassified by C1, to obtain D2
  - Train weak learner C2
- Select all samples from D that C1 and C2 disagree on
  - Train weak learner C3
- Final classifier is a vote of the three weak learners
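A minimal sketch of the three-classifier scheme, assuming integer class labels, that C1 makes some mistakes, and that C1 and C2 disagree on at least one sample; depth-1 trees stand in for the weak learners.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def schapire_boost(X, y, n1, seed=0):
    """Sketch of Schapire's three-classifier boosting.

    Assumes X, y are NumPy arrays with integer class labels, that C1
    misclassifies at least one sample, and that C1 and C2 disagree
    somewhere; depth-1 trees are an illustrative weak learner.
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    # D1: n1 < n samples drawn without replacement; train C1 on it.
    idx1 = rng.choice(n, size=n1, replace=False)
    c1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])

    # D2: half of the samples misclassified by C1, half classified correctly.
    pred1 = c1.predict(X)
    wrong = np.flatnonzero(pred1 != y)
    right = np.flatnonzero(pred1 == y)
    k = min(len(wrong), len(right))
    idx2 = np.concatenate([rng.choice(wrong, size=k, replace=False),
                           rng.choice(right, size=k, replace=False)])
    c2 = DecisionTreeClassifier(max_depth=1).fit(X[idx2], y[idx2])

    # D3: all samples from D on which C1 and C2 disagree; train C3 on it.
    idx3 = np.flatnonzero(c1.predict(X) != c2.predict(X))
    c3 = DecisionTreeClassifier(max_depth=1).fit(X[idx3], y[idx3])

    def predict(X_new):
        # Final classifier: majority vote of C1, C2, C3.
        votes = np.stack([c.predict(X_new) for c in (c1, c2, c3)])
        return np.apply_along_axis(
            lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

    return predict
```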
Adaboost

- Learner = Hypothesis = Classifier
- Weak learner: < 50% error over any distribution
- Strong classifier: thresholded linear combination of weak learner outputs
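A minimal sketch of the discrete variant, assuming labels in {-1, +1} and decision stumps as the weak learners; the number of rounds M is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=20):
    """Discrete AdaBoost sketch; y must contain labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform initial weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)   # weighted error
        if err >= 0.5:                           # no longer a weak learner
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)           # up-weight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)

    def predict(X_new):
        # Strong classifier: sign of the weighted combination of weak outputs.
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(scores)

    return predict
```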
Discrete Adaboost
Real Adaboost
Comparison