CS6501: Advanced Machine Learning
2017
Problem Set 1
Handed Out: Feb 11, 2017
Due: Feb 18, 2017
• Feel free to discuss the homework with other members of the class. You should,
however, write up your solutions yourself. Please keep your solutions brief and
clear.
• Please use Piazza first if you have questions about the homework. Also feel free to
come to office hours.
• Please, no handwritten solutions. You will submit your solution manuscript as a single
PDF file. Please use LaTeX to typeset your solutions.
• The homework is due at 11:59 PM on the due date. We will be using Collab for
collecting the homework assignments. Please submit your solution manuscript as a
pdf file and your code as a zip file. Please do NOT hand in a hard copy of your
write-up. Contact the TAs if you are having technical difficulties in submitting the
assignment.
1 Binary Classification
A linear program can be stated as follows:
Definition 1. Let A be an m × n real-valued matrix, b ∈ R^m, and c ∈ R^n. Find a t ∈ R^n
that minimizes the linear function

    z(t) = c^T t

subject to

    A t ≥ b.

In linear programming terminology, c is often referred to as the cost vector and z(t)
is referred to as the objective function.[1] We can use this framework to define the problem of
learning a linear discriminant function.[2]
[1] Note that you don't really need to know how to solve a linear program, since you can use Matlab or
SciPy to obtain the solution (although you may wish to brush up on linear algebra).
[2] This discussion closely parallels the linear programming representation found in Pattern Classification
by Duda, Hart, and Stork.
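As footnote [1] notes, an off-the-shelf solver can handle programs of this form. Below is a minimal sketch, assuming SciPy is available, of how a generic instance of Definition 1 might be solved; the toy values of A, b, and c are made up for illustration and are not part of the assignment.

    # Minimal sketch: solving "minimize c^T t subject to A t >= b" with SciPy.
    # The matrices below are illustrative toy values, not assignment data.
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 2.0],
                  [3.0, 1.0]])   # m x n constraint matrix
    b = np.array([4.0, 5.0])     # right-hand side, length m
    c = np.array([1.0, 1.0])     # cost vector, length n

    # linprog expects "<=" constraints (A_ub t <= b_ub), so rewrite A t >= b as -A t <= -b.
    # bounds=(None, None) makes every component of t a free variable.
    res = linprog(c, A_ub=-A, b_ub=-b, bounds=(None, None))
    print("optimal t:", res.x, "objective z(t):", res.fun)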
The Learning Problem:[3] Let x_1, x_2, . . . , x_m represent m samples, where each sample
x_i ∈ R^n is an n-dimensional vector, and y ∈ {−1, 1}^m is an m × 1 vector representing the
respective labels of the m samples. Let w ∈ R^n be an n × 1 vector representing the
weights of the linear discriminant function, and let θ be the threshold value.
We predict x_i to be a positive example if w^T x_i + θ ≥ 0. On the other hand, we predict x_i
to be a negative example if w^T x_i + θ < 0.
We hope that the learned linear function can separate the data set. That is, for the true
labels we hope

    y_i =  1    if w^T x_i + θ ≥ 0,
          −1    if w^T x_i + θ < 0.                                  (1)
In order to find a good linear separator, we propose the following linear program:

    min_{w, θ, δ}   δ                                                (2)
    subject to   y_i (w^T x_i + θ) ≥ 1 − δ,   ∀(x_i, y_i) ∈ D        (3)
                 δ ≥ 0                                               (4)

where D is the data set of all training examples.
a. [10 points] A data set D = {(x_i, y_i)}_{i=1}^m that satisfies condition (1) above is called
linearly separable. Show that the data set D is linearly separable if and only if
there exists a hyperplane (w_0, θ_0) that satisfies condition (3) with δ = 0 (need
a hint?[4]). Conclude that D is linearly separable iff the optimal solution to the linear
program [(2) to (4)] attains δ = 0. What can we say about the linear separability
of the data set if there exists a hyperplane that satisfies condition (3) with
δ > 0?
b. [5 points] Show that there is a trivial optimal solution for the following linear
program:

    min_{w, θ, δ}   δ
    subject to   y_i (w^T x_i + θ) ≥ −δ,   ∀(x_i, y_i) ∈ D
                 δ ≥ 0

Show the optimal solution and use it to (very briefly) explain why we chose
to formulate the linear program as [(2) to (4)] instead.
[3] Note that the notation used in the Learning Problem is unrelated to the one used in the Linear Program
definition. You may want to figure out the correspondence.
[4] Hint: Assume that D is linearly separable. D is a finite set, so it has a positive example that is closest
to the hyperplane among all positive examples. Similarly, there is a negative example that is closest to the
hyperplane among the negative examples. Consider their distances and use them to show that condition (3)
holds. Then show the other direction.
c. [10 points] Let x_1 ∈ R^n with x_1^T = (1, 1, . . . , 1) and y_1 = 1. Let x_2 ∈ R^n with
x_2^T = (−1, −1, . . . , −1) and y_2 = −1. The data set D is defined as

    D = {(x_1, y_1), (x_2, y_2)}.

Consider the formulation in [(2) to (4)] applied to D. Show the set of all possible
optimal solutions (solve this problem by hand).
2 Multiclass Classification
In this problem set, we will derive, implement, and test an SGD method for a multi-class
SVM classification model, and then we will compare it with the one-vs-all approach.
Given a data set D = {(x_i, y_i)}_{i=1}^N with K classes, the multi-class SVM can be written as
the following unconstrained optimization problem:

    min_{w = {w_k}}   (1/2) Σ_{k=1}^K w_k^T w_k + C Σ_{i=1}^N max_{k=1...K} [ ∆(y_i, k) + w_k^T x_i − w_{y_i}^T x_i ],    (5)

where w_k is the weight vector for class k, (x_i, y_i) is the i-th instance, and

    ∆(y, k) =  1    if y ≠ k,
               0    otherwise.
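The following is a small sketch, assuming NumPy and made-up toy data, of how the objective in Eq. (5) could be evaluated for a given set of weight vectors; a check of this kind is also what footnote [6] suggests for validating an implementation.

    # Sketch: evaluate the multi-class SVM objective of Eq. (5) for given weights.
    # W (K x n), X (N x n), y (N,), and C below are made-up illustration values.
    import numpy as np

    def multiclass_svm_objective(W, X, y, C):
        reg = 0.5 * np.sum(W * W)              # (1/2) * sum_k w_k^T w_k
        scores = X @ W.T                       # scores[i, k] = w_k^T x_i
        margins = scores - scores[np.arange(len(y)), y][:, None]  # w_k^T x_i - w_{y_i}^T x_i
        delta = np.ones_like(margins)
        delta[np.arange(len(y)), y] = 0.0      # Delta(y_i, k) = 1 iff k != y_i
        loss = np.max(delta + margins, axis=1).sum()   # max over k, summed over examples
        return reg + C * loss

    W = np.zeros((3, 4))                       # K = 3 classes, n = 4 features
    X = np.random.randn(5, 4)                  # N = 5 toy examples
    y = np.array([0, 2, 1, 0, 2])              # labels assumed to be in {0, ..., K-1}
    print(multiclass_svm_objective(W, X, y, C=1.0))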
To derive the SGD algorithm, we first need to rewrite the objective function in Eq. (5)
into the form Σ_i g_i(w):

    min_{w = {w_k}}   Σ_{i=1}^N [ (1/(2N)) Σ_{k=1}^K w_k^T w_k + C max_{k=1...K} ( ∆(y_i, k) + w_k^T x_i − w_{y_i}^T x_i ) ],

so that

    g_i(w) = (1/(2N)) Σ_{k=1}^K w_k^T w_k + C max_{k=1...K} ( ∆(y_i, k) + w_k^T x_i − w_{y_i}^T x_i ).
Now let’s derive the sub-gradient of g_i(w). Consider ∂g_i(w)/∂w_k: the partial sub-gradient
of the first term of g_i(w) is w_k/N. Regarding the second term, we have to consider several
cases. Let

    ỹ = arg max_k [ ∆(y_i, k) + w_k^T x_i ].

If ỹ = k and y_i = k, the partial sub-gradient of the second term is C(x_i − x_i) = 0. Therefore,
∂g_i(w)/∂w_k = w_k/N when ỹ = k and y_i = k.
How about the other cases?
a. [15 points] Please complete the following formulation:

    ∂g_i(w)/∂w_k =  w_k/N   if ỹ = k and y_i = k
                    _____   if ỹ ≠ k and y_i = k
                    _____   if ỹ = k and y_i ≠ k                     (6)
                    _____   if ỹ ≠ k and y_i ≠ k
b. [15 points] The SGD algorithm runs as follows:

    for epoch = 0 . . . T do
      for (x_i, y_i) in D do
        for k = 1 . . . K do
          w_k ← w_k − η ∂g_i(w)/∂w_k
        end for
      end for
    end for

η is the step size, a user-specified parameter; here we can set it to a default value of 0.01.

Plugging Eq. (6) in, we have:

    for epoch = 0 . . . T do
      for (x_i, y_i) in D do
        ỹ = arg max_k [ ∆(y_i, k) + w_k^T x_i ]
        for k = 1 . . . K do
          w_k ← w_k − η · _____
        end for
        if ỹ ≠ y_i then
          w_{y_i} ← w_{y_i} − η · _____
          w_{ỹ} ← w_{ỹ} − η · _____
        end if
      end for
    end for

Please complete the above algorithm.
c. [20 points] Let’s implement a baseline one-vs-all multiclass algorithm and conduct
experiments on the MNIST dataset.[5] We provide a basic code framework at
https://github.com/uvanlp/HW1 to facilitate the implementation; however, you’re welcome
to write your own code. If you want to use the code we provide, implement the one-vs-all
algorithm in the main function of one vs all with data reader.py. Then, conduct
cross-validation to find the best C for the underlying binary classifiers (assume the same
C is used for all of them). Note: you can use the LinearSVC class from scikit-learn,
http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html,
to implement the underlying classifiers.
Report the best C and the final accuracy.

[5] http://yann.lecun.com/exdb/mnist/
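As one illustration of the scikit-learn pieces involved, here is a minimal sketch, on made-up synthetic data rather than MNIST, of fitting a single binary LinearSVC and using cross-validation to compare values of C; the one-vs-all wiring and the actual experiments are left to the assignment.

    # Sketch: cross-validating C for one binary LinearSVC on synthetic data.
    # The data, the candidate C values, and cv=5 are illustrative choices only.
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.randn(200, 20)
    y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)   # binary labels {0, 1}

    best_C, best_score = None, -np.inf
    for C in [0.01, 0.1, 1.0, 10.0]:
        scores = cross_val_score(LinearSVC(C=C), X, y, cv=5)
        if scores.mean() > best_score:
            best_C, best_score = C, scores.mean()
    print("best C:", best_C, "cv accuracy:", best_score)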
d. [25 points] Implement the SGD algorithm for multi-class SVM. We provide
‘multiclass svm.py’ as the code framework; however, you have to implement the update
rules. Conduct cross-validation to find the best C and report the result on the test
set.[6]
e. [0-10 points] We will give bonus points for additional experiments.
[6] To test whether your implementation is correct, you can check if the objective function value of
the SGD algorithm matches the score from liblinear (https://www.csie.ntu.edu.tw/~cjlin/liblinear). Use
the following command line to train a multi-class SVM model with C = 1: “./train -s 4 -B -1 -e 0.01 -C 1 train.data”.
There is a function in the sample code that allows you to print data in liblinear format. Check if the
objective function value of your implementation is the same as that of liblinear on a small subset (e.g., run
on only the first 100 samples).
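If the sample code's export function is not convenient, scikit-learn can also write data in the liblinear/svmlight text format; the snippet below is a sketch with made-up data and an assumed output file name (note that liblinear expects 1-based feature indices, hence zero_based=False).

    # Sketch: writing a small dataset in liblinear/svmlight format for the sanity check.
    # X, y, and the output file name are illustrative values only.
    import numpy as np
    from sklearn.datasets import dump_svmlight_file

    X = np.random.randn(100, 20)              # e.g. the first 100 samples
    y = np.random.randint(0, 10, size=100)    # class labels
    dump_svmlight_file(X, y, "train.data", zero_based=False)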