Introduction to Boosting
February 2, 2012

Outline
1. Boosting - Intuitively
2. Adaboost - Overview
3. Adaboost - Demo
4. Adaboost - Details

Boosting - Introduction

Train one base learner at a time.
Focus it on the mistakes of its predecessors.
Weight it based on how 'useful' it is in the ensemble (not on its training error).

Under very mild assumptions on the base learners ("weak learning" - they must perform better than random guessing):
1. Training error is eventually reduced to zero (we can fit the training data).
2. Generalisation error continues to be reduced even after the training error is zero (there is no overfitting).

The most popular ensemble algorithm is a boosting algorithm called "Adaboost".
Adaboost is short for "Adaptive Boosting", because the algorithm adapts weights on the base learners and on the training examples.
There are many explanations of precisely what Adaboost does and why it is so successful - but the basic idea is simple!

Adaboost - Pseudo-code

Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Adaboost - Loss Function

To see precisely how Adaboost is implemented, we need to understand its loss function:

L(H) = \sum_{i=1}^{N} e^{-m_i}    (1)
m_i = y_i \sum_{k=1}^{K} \alpha_k h_k(x_i)    (2)

m_i is the voting margin. There are N examples and K base learners. \alpha_k is the weight of the k-th learner, and h_k(x_i) is its prediction on x_i.

Why exponential loss?
Loss is higher when a prediction is wrong.
Loss is steeper when a prediction is wrong.
Precise reasons later...

Adaboost - Example Weights

After k learners, the example weights used to train the (k+1)-th learner are:

D_k(i) \propto e^{-m_i}    (3)
       = e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)}    (4)
       = e^{-y_i \sum_{j=1}^{k-1} \alpha_j h_j(x_i)} \, e^{-y_i \alpha_k h_k(x_i)}    (5)
       \propto D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (6)
D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (7)

Z_k normalises D_k(i) such that \sum_{i=1}^{N} D_k(i) = 1.
The weight is larger when m_i is negative, and directly proportional to the loss function.

Adaboost - Learner Weights

Once the k-th learner is trained, we evaluate its predictions h_k(x_i) on the training data and assign the weight:

\alpha_k = \frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}    (8)
\epsilon_k = \sum_{i=1}^{N} D_{k-1}(i) \, \delta[h_k(x_i) \neq y_i]    (9)

\epsilon_k is the weighted error of the k-th learner.
The log in the learner weight is related to the exponential in the loss function.
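As a concrete illustration of the learner-weight and example-weight rules above, here is a minimal numerical sketch in Python (NumPy assumed; the labels, predictions, and weights are made-up toy values, not taken from the lecture):

import numpy as np

# Toy setup: 5 examples with labels in {-1, +1} and the predictions of one
# weak learner h_k on those examples (made-up values for illustration).
y      = np.array([+1, +1, -1, -1, +1])
h_k    = np.array([+1, -1, -1, -1, +1])   # wrong on the second example
D_prev = np.full(5, 1.0 / 5)              # D_{k-1}: uniform example weights

# Weighted error (equation 9): sum of the weights of misclassified examples.
eps_k = np.sum(D_prev * (h_k != y))

# Learner weight (equation 8): larger when the weighted error is small.
alpha_k = 0.5 * np.log((1 - eps_k) / eps_k)

# Example-weight update (equation 7): up-weight mistakes, down-weight correct
# predictions, then renormalise by Z_k so the weights sum to one.
D_unnorm = D_prev * np.exp(-alpha_k * y * h_k)
Z_k = D_unnorm.sum()
D_k = D_unnorm / Z_k

print(eps_k, alpha_k, D_k)   # eps_k = 0.2, alpha_k ~ 0.693, mistake weight rises to 0.5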
Adaboost - Pseudo-code

D_k(i): weight of example i after learner k.
\alpha_k: weight of learner k.

\forall i : D_0(i) \leftarrow 1/N
for k = 1 to K do
    D \leftarrow data sampled with D_{k-1}
    h_k \leftarrow base learner trained on D
    \epsilon_k \leftarrow \sum_{i=1}^{N} D_{k-1}(i) \, \delta[h_k(x_i) \neq y_i]
    \alpha_k \leftarrow \frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}
    D_k(i) \leftarrow \frac{D_{k-1}(i) \, e^{-\alpha_k y_i h_k(x_i)}}{Z_k}
end for

Decision rule:

H(x) = \mathrm{sign}\Big( \sum_{k=1}^{K} \alpha_k h_k(x) \Big)    (10)

Adaboost - Demo

[Demo]
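The pseudo-code above leaves the base learner unspecified. Below is a minimal runnable sketch of the same loop in Python (NumPy assumed), using axis-aligned decision stumps as the weak learners and resampling with D_{k-1}, as in the pseudo-code. The stump learner, the toy data, K = 10, and the small clip on \epsilon_k (to keep the log finite) are illustrative choices, not part of the lecture.

import numpy as np

def train_stump(X, y):
    """Pick the (feature, threshold, sign) stump with the lowest error on the
    (resampled) training set by exhaustive search."""
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, f] >= thr, 1, -1)
                err = np.mean(pred != y)
                if err < best_err:
                    best_err, best = err, (f, thr, sign)
    return best

def stump_predict(stump, X):
    f, thr, sign = stump
    return sign * np.where(X[:, f] >= thr, 1, -1)

def adaboost(X, y, K, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    D = np.full(N, 1.0 / N)                    # D_0(i) <- 1/N
    stumps, alphas = [], []
    for k in range(K):
        # Train the base learner on data resampled according to D_{k-1},
        # as in the pseudo-code (weighted training would also work).
        idx = rng.choice(N, size=N, p=D)
        h = train_stump(X[idx], y[idx])
        pred = stump_predict(h, X)             # test base learner on all data
        eps = np.sum(D * (pred != y))          # weighted error eps_k
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # avoid log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # learner weight
        D = D * np.exp(-alpha * y * pred)      # example-weight update
        D /= D.sum()                           # normalise by Z_k
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # Decision rule (equation 10): sign of the weighted vote.
    votes = sum(a * stump_predict(h, X) for h, a in zip(stumps, alphas))
    return np.sign(votes)

# Toy usage: a 1-D data set that no single stump can separate.
X = np.array([[0.], [1.], [2.], [3.], [4.], [5.]])
y = np.array([+1, +1, -1, -1, +1, -1])
stumps, alphas = adaboost(X, y, K=10)
print(np.mean(predict(stumps, alphas, X) != y))   # training error (often 0 after a few rounds)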
Adaboost - Training Error

The training error rate of Adaboost reaches 0 for large enough K.
This follows from a combination of the exponential loss and the weak learning assumption.

If the base learner weighted error always satisfies \epsilon_k \le \frac{1}{2} - \gamma, then:

e_{tr} \le \big( \sqrt{1 - 4\gamma^2} \big)^K    (11)

Let's derive the bound!
1. Show that the exponential loss is an upper bound on the classification error.
2. Show that the loss can be written as a product of the normalisation constants.
3. Minimise the normalisation constant at each step (deriving the learner weight rule).
4. Write this constant using the weighted error at each step.

Step 1: the exponential loss is an upper bound on the classification error.

e_{tr} = \frac{1}{N} \sum_{i=1}^{N} \delta[H(x_i) \neq y_i]    (12)
       = \frac{1}{N} \sum_{i=1}^{N} \delta[m_i \le 0]    (13)
       \le \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (14)

because e^{-x} \ge 1 when x \le 0, and e^{-x} \ge 0 for all x.

Step 2: the loss can be written as a product of the normalisation constants.

D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (15)
       = D_0(i) \prod_{j=1}^{k} \frac{e^{-y_i \alpha_j h_j(x_i)}}{Z_j}    (16)
       = \frac{1}{N} \, e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)} \, \frac{1}{\prod_{j=1}^{k} Z_j}    (17)
e^{-m_i} = N D_k(i) \prod_{j=1}^{k} Z_j    (18)

Substituting this into the bound from step 1:

e_{tr} \le \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (19)
       = \frac{1}{N} \sum_{i=1}^{N} N D_k(i) \prod_{j=1}^{k} Z_j    (20)
       = \Big[ \sum_{i=1}^{N} D_k(i) \Big] \prod_{j=1}^{k} Z_j    (21)
       = \prod_{j=1}^{k} Z_j    (22)

Step 3: minimise the normalisation constant at each step.

Z_k = \sum_{i=1}^{N} D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (23)
    = \sum_{h_k(x_i) \neq y_i} D_{k-1}(i) \, e^{\alpha_k} + \sum_{h_k(x_i) = y_i} D_{k-1}(i) \, e^{-\alpha_k}    (24)
    = \epsilon_k e^{\alpha_k} + (1 - \epsilon_k) e^{-\alpha_k}    (25)

Setting the derivative with respect to \alpha_k to zero:

\frac{\partial Z_k}{\partial \alpha_k} = \epsilon_k e^{\alpha_k} - (1 - \epsilon_k) e^{-\alpha_k}    (26)
0 = \epsilon_k e^{2\alpha_k} - (1 - \epsilon_k)    (27)
e^{2\alpha_k} = \frac{1 - \epsilon_k}{\epsilon_k}    (28)
\alpha_k = \frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}    (29)

Minimising Z_k gives us the learner weight rule.

Step 4: write the normalisation constant using the weighted error.

Z_k = \epsilon_k e^{\alpha_k} + (1 - \epsilon_k) e^{-\alpha_k}
    = \epsilon_k e^{\frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}} + (1 - \epsilon_k) e^{-\frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}}    (30)
    = 2 \sqrt{\epsilon_k (1 - \epsilon_k)}    (31)

So Z_k depends only on \epsilon_k, and the weak learning assumption says that \epsilon_k < 0.5.

Putting it together: we bound the training error using e_{tr} \le \prod_{j=1}^{K} Z_j, and we have found that each Z_j is a function of the weighted error, Z_j = 2\sqrt{\epsilon_j (1 - \epsilon_j)}. So we can bound the training error by e_{tr} \le \prod_{j=1}^{K} 2\sqrt{\epsilon_j (1 - \epsilon_j)}. Letting \epsilon_j \le \frac{1}{2} - \gamma, we can write e_{tr} \le \big( \sqrt{1 - 4\gamma^2} \big)^K. If \gamma > 0, this bound shrinks to 0 when K is large enough.
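As a quick numerical sanity check of steps 3 and 4, the short Python script below (NumPy assumed; the values of \epsilon and \gamma are arbitrary illustrations) verifies that the learner-weight rule minimises Z_k, that the minimum equals 2\sqrt{\epsilon_k(1-\epsilon_k)}, and that the resulting product bound decays geometrically:

import numpy as np

def Z(alpha, eps):
    # Normalisation constant as a function of the learner weight (equation 25).
    return eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)

for eps in (0.1, 0.3, 0.45):                    # arbitrary weighted errors < 0.5
    alpha_star = 0.5 * np.log((1 - eps) / eps)  # the learner-weight rule
    grid = np.linspace(-5, 5, 10001)
    # The rule should minimise Z over alpha, and the minimum should equal
    # 2*sqrt(eps*(1-eps)) as in step 4 above.
    assert abs(Z(alpha_star, eps) - Z(grid, eps).min()) < 1e-6
    assert abs(Z(alpha_star, eps) - 2 * np.sqrt(eps * (1 - eps))) < 1e-9

# With eps_k <= 0.5 - gamma, each factor Z_k <= sqrt(1 - 4*gamma**2) < 1,
# so the product over K rounds (the training-error bound) decays geometrically.
gamma = 0.1
print([(1 - 4 * gamma**2) ** (K / 2) for K in (10, 50, 100)])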
Adaboost - Training Error: Summary

We have seen an upper bound on the training error which decreases as K increases.
We can prove the bound by considering the loss function as a bound on the error rate.
Examining the role of the normalisation constant in this bound reveals the learner weight rule.
Examining the role of γ in the bound shows why we can achieve 0 training error even with weak learners.
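To make the role of γ concrete (a worked example with illustrative numbers, not from the lecture): since 1 - x \le e^{-x}, the bound gives e_{tr} \le (1 - 4\gamma^2)^{K/2} \le e^{-2K\gamma^2}. The training error is a multiple of 1/N, so it must be exactly zero once the bound falls below 1/N, i.e. once K > \ln(N) / (2\gamma^2). With, say, N = 1000 examples and base learners only slightly better than chance (γ = 0.05), roughly K > \ln(1000)/(2 \cdot 0.05^2) \approx 1382 rounds are enough to guarantee zero training error.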