
CIS 520, Machine Learning, Fall 2015: Assignment 4
Due: Thursday, October 15th, 11:59pm, PDF to Canvas
Instructions. Please write up your responses to the following problems clearly and concisely. We encourage
you to write up your responses using LaTeX; we have provided a LaTeX template, available on Canvas, to
make this easier. Submit your answers in PDF form to Canvas. We will not accept paper copies
of the homework.
Collaboration. You are allowed and encouraged to work together. You may discuss the homework in order to
understand the problem and reach a solution in groups of up to two students. However, each student
must write down the solution independently, and without referring to written notes from the joint session.
In addition, each student must write on the problem set the names of the people with whom
they collaborated. You must understand the solution well enough to reconstruct it by yourself.
(This is for your own benefit: you have to take the exams alone.)
Boosting and MDL
[100 points]
1. Analyzing the training error of boosting
[60 points]
Consider the AdaBoost algorithm you saw in class. In this question we will try to analyze the training error
of boosting.
1. [8 points] Given a set of m examples, (xi , yi ) (yi is the class label of xi ), i = 1, . . . , m, let ht (x) be
the weak classifier obtained at step t, and let αt be its weight. Recall that the final classifier is
H(x) = sign(f(x)), where f(x) = Σ_{t=1}^{T} α_t h_t(x).
Show that the training error of the final classifier can be bounded from above by an exponential loss
function:
(1/m) Σ_{i=1}^{m} I(H(x_i) ≠ y_i) ≤ (1/m) Σ_{i=1}^{m} exp(−f(x_i) y_i),
where I(a = b) is the indicator function: it is equal to 1 if a = b, and 0 otherwise.
Hint: e^{−x} ≥ 1 ⇔ x ≤ 0.
F SOLUTION: The table for different combinations of f(x_i) and y_i is given below:

f(x_i)   y_i   H(x_i)   I(H(x_i) ≠ y_i)   exp(−f(x_i) y_i)
 > 0      1      1             0                ≥ 0
 > 0     −1      1             1                ≥ 1
 < 0      1     −1             1                ≥ 1
 < 0     −1     −1             0                ≥ 0

The last column is obtained using the fact that exp(−f(x_i) y_i) is always ≥ 0 and exp(−f(x_i) y_i) ≥ 1
when f(x_i) y_i ≤ 0. From the table, we can see that no matter what the prediction f(x_i) and true class
y_i are, we have I(H(x_i) ≠ y_i) ≤ exp(−f(x_i) y_i).
Summing them up, we have
(1/m) Σ_{i=1}^{m} I(H(x_i) ≠ y_i) ≤ (1/m) Σ_{i=1}^{m} exp(−f(x_i) y_i).
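As a quick numerical sanity check of the pointwise bound, here is a minimal MATLAB sketch (the margin values below are arbitrary illustrative numbers, not assignment data):

% Check I(H(x) ~= y) <= exp(-f(x)*y) on a few arbitrary margins f(x)*y.
margins   = [-2.5 -0.3 0 0.3 2.5];   % values of f(x_i)*y_i (illustrative)
indicator = (margins <= 0);          % H(x_i) ~= y_i exactly when f(x_i)*y_i <= 0 (sign(0) counted as an error)
bound     = exp(-margins);           % exponential loss
assert(all(indicator <= bound));     % the bound holds pointwise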
2. [16 points] Remember that
D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t.
Use this recursive definition to prove the following:
(1/m) Σ_{i=1}^{m} exp(−f(x_i) y_i) = Π_{t=1}^{T} Z_t,        (1)
where Z_t is the normalization factor for distribution D_{t+1}:
Z_t = Σ_{i=1}^{m} D_t(i) exp(−α_t y_i h_t(x_i)).        (2)
Hint: Remember that e^{Σ_i g_i} = Π_i e^{g_i}, that D_1(i) = 1/m, and that Σ_i D_{t+1}(i) = 1.
F SOLUTION: By definition,
D_{t+1}(i) = D_t(i) exp(−α_t h_t(x_i) y_i) / Z_t.
Now we use this recursive definition of D_t repeatedly to obtain
D_{T+1}(i) = D_{T−1}(i) exp(−α_T h_T(x_i) y_i) exp(−α_{T−1} h_{T−1}(x_i) y_i) / (Z_T Z_{T−1})
           = ... = D_1(i) exp(−Σ_{t=1}^{T} α_t h_t(x_i) y_i) / Π_t Z_t
           = D_1(i) exp(−f(x_i) y_i) / Π_t Z_t
           = exp(−f(x_i) y_i) / (m Π_t Z_t).
The last step uses the fact that D_1(i) = 1/m. Remember now that D_{T+1}(i) is a distribution over i. Hence,
summing both sides over i from 1 to m, we get
1 = Σ_{i=1}^{m} exp(−f(x_i) y_i) / (m Π_t Z_t),
and therefore
Π_{t=1}^{T} Z_t = (1/m) Σ_{i=1}^{m} exp(−f(x_i) y_i).
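Equation (1) holds for any sequence of ±1 predictions and weights, so it is easy to check numerically. A minimal MATLAB sketch (the h_t columns and α_t values below are arbitrary, hypothetical choices):

% Numerical check of Equation (1) with arbitrary predictions and weights.
m = 5; T = 3;
y     = [1; -1; 1; 1; -1];
H     = [1 1 -1; -1 1 -1; 1 -1 1; -1 -1 1; 1 1 1];   % column t holds h_t(x_i)
alpha = [0.7; 0.2; 0.4];                              % arbitrary alpha_t values
D  = ones(m, 1) / m;                                  % D_1(i) = 1/m
Zs = zeros(T, 1);
for t = 1:T
    w     = D .* exp(-alpha(t) * y .* H(:, t));
    Zs(t) = sum(w);                                   % Z_t, the normalizer
    D     = w / Zs(t);                                % D_{t+1}
end
f   = H * alpha;                                      % f(x_i) = sum_t alpha_t h_t(x_i)
lhs = mean(exp(-f .* y));                             % (1/m) sum_i exp(-f(x_i) y_i)
rhs = prod(Zs);                                       % prod_t Z_t
fprintf('lhs = %.6f, rhs = %.6f\n', lhs, rhs);        % the two sides agree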
3. [10 points] Equation 1 suggests that the training error can be reduced rapidly by greedily optimizing
Zt at each step. You have shown that the error is bounded from above:
ε_training ≤ Π_{t=1}^{T} Z_t.
Observe that Z1 , . . . , Zt−1 are determined by the first (t − 1) rounds of boosting, and we cannot change
them on round t. A greedy step we can take to minimize the training error bound on round t is to
minimize Zt .
In this question, you will prove that for binary weak classifiers, Z_t from Equation 2 is minimized by
picking α_t as:
α_t^* = (1/2) log((1 − ε_t)/ε_t),        (3)
where ε_t is the training error of weak classifier h_t for the weighted dataset:
ε_t = Σ_{i=1}^{m} D_t(i) I(h_t(x_i) ≠ y_i),
where I is the indicator function. For this proof, only consider the simplest case of binary classifiers,
i.e. the output of h_t(x) is binary, {−1, +1}.
For this special class of classifiers, first show that the normalizer Z_t can be written as:
Z_t = (1 − ε_t) exp(−α_t) + ε_t exp(α_t).
Hint: Consider the sums over correctly and incorrectly classified examples separately.
F SOLUTION:
Z_t = Σ_{i=1}^{m} D_t(i) exp(−α_t h_t(x_i) y_i)
    = Σ_{i=1}^{m} D_t(i) exp(−α_t) I(h_t(x_i) = y_i) + Σ_{i=1}^{m} D_t(i) exp(α_t) I(h_t(x_i) ≠ y_i)
    = Σ_{i=1}^{m} D_t(i) exp(−α_t) (1 − I(h_t(x_i) ≠ y_i)) + (Σ_{i=1}^{m} D_t(i) I(h_t(x_i) ≠ y_i)) exp(α_t)
    = (Σ_{i=1}^{m} D_t(i) − Σ_{i=1}^{m} D_t(i) I(h_t(x_i) ≠ y_i)) exp(−α_t) + (Σ_{i=1}^{m} D_t(i) I(h_t(x_i) ≠ y_i)) exp(α_t)
    = (1 − ε_t) exp(−α_t) + ε_t exp(α_t).
4. [8 points] Now, prove that the value of αt that minimizes this definition of Zt is given by Equation 3.
F SOLUTION: To minimize, set the derivative to 0:
∂Z_t/∂α_t = 0
−(1 − ε_t) exp(−α_t) + ε_t exp(α_t) = 0
ε_t exp(α_t) = (1 − ε_t) exp(−α_t)
(exp(α_t))^2 = (1 − ε_t)/ε_t
α_t^* = (1/2) log((1 − ε_t)/ε_t).
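A quick numerical check of this minimizer, as a minimal MATLAB sketch (the value of ε_t below is an arbitrary choice in (0, 0.5)):

% Verify that alpha* = 0.5*log((1-eps)/eps) minimizes Z(alpha) = (1-eps)*exp(-alpha) + eps*exp(alpha).
eps_t  = 0.3;                                           % arbitrary weighted error
Z      = @(a) (1 - eps_t) .* exp(-a) + eps_t .* exp(a); % normalizer as a function of alpha
a_star = 0.5 * log((1 - eps_t) / eps_t);                % closed-form minimizer (Equation 3)
a_grid = linspace(-3, 3, 10001);
[~, k] = min(Z(a_grid));
fprintf('closed form: %.4f   grid minimum: %.4f\n', a_star, a_grid(k));
fprintf('Z(a_star) = %.4f   2*sqrt(eps*(1-eps)) = %.4f\n', Z(a_star), 2*sqrt(eps_t*(1 - eps_t)));

The last printed line also previews the result of part 5: at the optimum, Z_t equals 2√(ε_t(1 − ε_t)).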
5. [4 points] Prove that for the above value of α_t we have Z_t = 2√(ε_t (1 − ε_t)).
F SOLUTION: Just plug the optimal α_t into the expression for Z_t. Since exp(α_t^*) = √((1 − ε_t)/ε_t),
Z_t = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t) = 2√(ε_t (1 − ε_t)).
6. [8 points] Furthermore, let ε_t = 1/2 − γ_t and prove that Z_t ≤ exp(−2γ_t^2). Therefore, we will know
ε_training ≤ Π_t Z_t ≤ exp(−2 Σ_t γ_t^2).
Hint: log(1 − x) ≤ −x for 0 < x ≤ 1.
F SOLUTION: Letting ε_t = 1/2 − γ_t, we have Z_t = 2√(ε_t (1 − ε_t)) = √(1 − 4γ_t^2). Taking the log on both sides
of this equation, we have log Z_t = (1/2) log(1 − 4γ_t^2) ≤ (1/2) · (−4γ_t^2) = −2γ_t^2. Therefore, Z_t ≤ exp(−2γ_t^2), and
ε_training ≤ Π_t Z_t ≤ exp(−2 Σ_t γ_t^2).
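The same inequality can be checked numerically over a grid of edges γ_t, as a minimal MATLAB sketch:

% Check sqrt(1 - 4*gamma^2) <= exp(-2*gamma^2) for 0 < gamma < 0.5.
gamma = linspace(1e-4, 0.4999, 1000);
Zt    = sqrt(1 - 4 * gamma.^2);      % Z_t at the optimal alpha_t, with eps_t = 1/2 - gamma
bound = exp(-2 * gamma.^2);
assert(all(Zt <= bound + 1e-12));    % holds everywhere on the grid (tolerance for round-off)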
7. [4 points] If each weak classifier is slightly better than random, so that γ_t ≥ γ for some γ > 0, then
the training error drops exponentially fast in T, i.e.
ε_training ≤ exp(−2Tγ^2).
Argue that in each round of boosting, there always exists a weak classifier h_t such that its training
error on the weighted dataset satisfies ε_t ≤ 0.5.
F SOLUTION: Suppose there exists a weak classifier h_t with weighted error ≥ 0.5. Then there also exists
another weak classifier h′_t which simply flips the predictions of h_t, and h′_t has error ≤ 0.5. Since at iteration t
we pick the weak classifier with the minimum error, we will never pick h_t over h′_t. Hence ε_t ≤ 0.5.
8. [2 points] Show that for ε_t = 0.5 the training error can get “stuck” above zero. Hint: the D_t(i)'s may get
stuck.
F SOLUTION: If ε_t = 0.5, then α_t = 0, Z_t = 1, and D_{t+1}(i) = D_t(i) for all i. All the parameters
and the predictions from the weak classifiers will remain the same after this iteration, so we will not learn
anything new anymore. Thus, the training error can be stuck above 0.
COMMON MISTAKE 1: Many solutions used the fact that γ_t ≥ 0, or α_t ≥ 0, to argue that ε_t ≤ 0.5.
But note that both of these facts are consequences of ε_t ≤ 0.5, and not the other way around.
2. Adaboost on a toy dataset
[20 points]
Now we will apply Adaboost to classify a toy dataset. Consider the dataset shown in Figure 1a. The dataset
consists of 4 points: (X1 : 0, −1, −), (X2 : 1, 0, +), (X3 : −1, 0, +) and (X4 : 0, 1, −). For this part, you may
find it helpful to use MATLAB as a calculator rather than doing the computations by hand. (You do not
need to submit any code though.)
1. [8 points] For T = 4, show how Adaboost works for this dataset, using simple decision stumps (depth-1
decision trees that simply split on a single variable once) as weak classifiers. For each timestep compute
the following:
ε_t, α_t, Z_t, D_t(i) ∀i.
Also, for each timestep draw your weak classifier. For example, h_1 can be as shown in Figure 1b.
[Figure 1: a) Toy data in Question 4.2. b) h_1 in Question 4.2.]

F SOLUTION: The table of values of the parameters at each timestep is given below:

t    ε_t      α_t      Z_t      D_t(1)   D_t(2)   D_t(3)   D_t(4)   h_t(x_1), h_t(x_2), h_t(x_3), h_t(x_4)
1    0.25     0.5493   0.8660   0.25     0.25     0.25     0.25     −1, −1, 1, −1
2    0.1667   0.8047   0.7454   0.1667   0.5000   0.1667   0.1667   1, 1, 1, −1
3    0.1000   1.0986   0.6000   0.5000   0.3000   0.1000   0.1000   −1, 1, −1, −1
4    0.0556   1.4166   0.4581   0.2778   0.1667   0.5000   0.0556   −1, 1, 1, −1

The weak classifier for each timestep is drawn on the figure below.
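For reference, these values can be reproduced with a short MATLAB script such as the sketch below (one possible implementation with hypothetical variable names, not required for the assignment). Ties between equally good stumps may be broken differently than in the table, so the selected stumps and example weights can differ from round 2 onward, but in this sketch the ε_t, α_t, Z_t sequence and the final predictions match.

% AdaBoost with depth-1 decision stumps on the 4-point toy dataset.
X = [0 -1; 1 0; -1 0; 0 1];               % rows are X1..X4
y = [-1; 1; 1; -1];
m = size(X, 1);
D = ones(m, 1) / m;                       % D_1(i) = 1/m
thresholds = [-0.5, 0.5];                 % midpoints between the distinct feature values
f = zeros(m, 1);                          % running weighted vote, f(x_i)
for t = 1:4
    best_err = inf;
    for j = 1:2                           % feature to split on
        for th = thresholds               % split threshold
            for s = [1, -1]               % polarity of the stump
                pred = s * sign(X(:, j) - th);
                err  = sum(D .* (pred ~= y));
                if err < best_err
                    best_err = err; h = pred;
                end
            end
        end
    end
    eps_t = best_err;
    alpha = 0.5 * log((1 - eps_t) / eps_t);
    Z     = sum(D .* exp(-alpha * y .* h));
    fprintf('t=%d  eps=%.4f  alpha=%.4f  Z=%.4f  D=[%s]  h=[%s]\n', ...
            t, eps_t, alpha, Z, sprintf('%.4f ', D), sprintf('%g ', h));
    f = f + alpha * h;                    % accumulate alpha_t * h_t(x_i)
    D = D .* exp(-alpha * y .* h) / Z;    % reweight for the next round
end
fprintf('f(x_i) = [%s]  H(x_i) = [%s]\n', sprintf('%.3f ', f), sprintf('%g ', sign(f)));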
2. [4 points] What is the training error of Adaboost for this toy dataset?
F SOLUTION: H(x_i) = sign(Σ_t α_t h_t(x_i)). The values of Σ_t α_t h_t(x_i) after 4 iterations are −2.260, 2.771,
1.672, and −3.869, so the predictions are −1, 1, 1, −1. Thus, it is a perfect fit and the training error is 0.
3. [8 points] Is the above dataset linearly separable? Explain why Adaboost does better than a decision
stump on the above dataset.
F SOLUTION: This dataset is not linearly separable. There is no simple decision stump which classifies
the data perfectly. However, Adaboost combines the information from several weak learners (decision stumps
in this case), and gives a nonlinear classification boundary.
3. MDL on a toy dataset
[20 points]
Consider the following problem. We provide a data set generated from a particular model with N = 64,
where we want to estimate
ŷ = w_1 x_1 + w_2 x_2 + w_3 x_3.
We want to use MDL to find the ’optimal’ L_0-penalized model.
1. [10 points] Estimate the three linear regressions (we could actually try all possible subsets here, but
instead we’ll just try three):
y_1 = w_1 x_1
y_2 = w_1 x_1 + w_2 x_2
y_3 = w_1 x_1 + w_2 x_2 + w_3 x_3
For each of the three cases, what is
(a) the sum of squared errors
i) Err1 =
ii) Err2 =
iii) Err3 =
F SOLUTION: a) i. 1.2779e+03 ii. 835.0558 iii. 834.7418
(b) 2 times the estimated bits to code the residual (n log(Error/n))
i) ERR bits1 =
ii) ERR bits2 =
iii) ERR bits3 =
F SOLUTION: b) i. 276.4546 ii. 237.1666 iii. 237.1319
(c) 2 times the estimated bits to code each residual plus model under AIC (2 ∗ 1 bit to code each
feature)
i) AIC bits1 =
ii) AIC bits2 =
iii) AIC bits3 =
F SOLUTION: c) i. 278.4546 ii. 241.1666 iii. 243.1319
(d) 2 times the estimated bits to code each residual plus model under BIC (2 ∗ (1/2)log(n) bits to
code each feature)
i) BIC bits1 =
ii) BIC bits2 =
iii) BIC bits3 =
F SOLUTION: d) i. 282.4546 ii. 249.1666 iii. 255.1319
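These description-length numbers can be reproduced with a few lines of MATLAB; the sketch below assumes base-2 logarithms and reuses the residual sums of squares from part (a), which is consistent with the reported values:

% Recompute the description-length quantities from the residual sums of squares.
n   = 64;
Err = [1277.9, 835.0558, 834.7418];            % SSE for models 1..3 (from part (a))
k   = [1, 2, 3];                               % number of features in each model
err_bits = n * log2(Err / n);                  % 2 * bits to code the residuals
aic_bits = err_bits + 2 * k;                   % + 2 * (1 bit per feature)
bic_bits = err_bits + 2 * (0.5 * log2(n)) * k; % + 2 * ((1/2) log2(n) bits per feature)
disp([err_bits; aic_bits; bic_bits]);          % rows: ERR, AIC, BIC; columns: models 1..3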
2. [5 points] Which model has the smallest minimum description length?
a) for AIC
b) for BIC
F SOLUTION: Model 2 for both (2 features)
3. [5 points] Included in the kit is a test data set; does the error on the test set for the three models
correspond to what is expected from MDL?
F SOLUTION:
Model 1 test error: 1,778
Model 2 test error: 1,167
Model 3 test error: 1,173
Yes: Model 2, which had the smallest description length, also has the lowest test error.