
Machine Learning 1 — WS2014 — Module IN2064
Machine Learning Worksheet 9
Support Vector Machines
1 Optimization and convexity
In the lecture you learned the concept of a convex function. By tightening the condition we obtain the
concept of a strictly convex function.
A function f : X → R defined on a convex set X is called strictly convex if, for any two points x, y ∈ X
with x ≠ y and any t ∈ ]0, 1[,

f(tx + (1 − t)y) < t f(x) + (1 − t) f(y) .
It can be shown that for differentiable functions this is equivalent to the condition that for every x, y ∈ X
with x ≠ y we have

f(y) > f(x) + (y − x)T ∇f(x) .
Problem 1: Consider the graph of a strictly convex function. How can you define strict convexity
geometrically? How does it compare to (non-strict) convexity?
Take any two points on the graph of the function and connect them by a straight line. The function
is strictly convex if every line constructed in this way lies strictly above the graph between its
endpoints. If the function is convex but not strictly convex, such a line may touch the graph (but
never cross it).
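As a purely numerical illustration of this picture (a minimal sketch, not part of the worksheet; the helper name, sampling range, number of trials and tolerance below are arbitrary choices):

import numpy as np

def strictly_convex_on_samples(f, lo=-5.0, hi=5.0, trials=10000, tol=1e-12, seed=0):
    # Sample random chord endpoints x, y and an interior weight t, then check
    # the strict chord condition f(t*x + (1-t)*y) < t*f(x) + (1-t)*f(y).
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi, size=2)
        if abs(x - y) < 1e-3:          # skip numerically degenerate chords
            continue
        t = rng.uniform(0.01, 0.99)
        if not f(t * x + (1 - t) * y) < t * f(x) + (1 - t) * f(y) - tol:
            return False               # found a chord that is not strictly above the graph
    return True                        # no counterexample found (necessary, not sufficient)

print(strictly_convex_on_samples(lambda x: x**2))    # True: x^2 is strictly convex
print(strictly_convex_on_samples(lambda x: abs(x)))  # False: |x| is convex, but not strictly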
Problem 2: If a function f(x) is convex, at most how many points x∗ can satisfy the equation f(x∗) =
min_x f(x)? Prove your answer.
The function f(x) = 0 is convex and every x∗ ∈ R satisfies f(x∗) = 0 = min_x f(x). Thus a convex
function can have infinitely many minimizers; there is no finite upper bound.
Problem 3: If a function f(x) is strictly convex, at most how many points x∗ can satisfy the equation
f(x∗) = min_x f(x)? Prove your answer.
Let us assume that f(x) has a minimum and that it is attained at x = x̃. We know from the lecture that
∇f(x̃) = 0 at the minimizer x̃, which means the tangent of the function is horizontal at this point.
Thus, by the inequality above, we have for every x∗ ≠ x̃

f(x∗) > f(x̃) + (x∗ − x̃)T 0 = f(x̃) .

Hence no other point attains the minimum, i.e. at most one point can satisfy the equation.
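A small numerical companion to this argument (an illustration, not a proof; the test function and starting points are arbitrary choices): minimizing a strictly convex function from several starting points should always return the same, unique minimizer up to solver tolerance.

import numpy as np
from scipy.optimize import minimize

# A strictly convex test function of one variable (sum of two strictly convex terms).
f = lambda x: (x[0] - 3.0)**2 + np.exp(x[0])

# Run a local optimizer from several starting points; since a strictly convex
# function has at most one minimizer, all runs should agree.
minimizers = [minimize(f, x0=np.array([start])).x[0] for start in (-10.0, 0.0, 7.0)]
print(minimizers)                      # three (numerically) identical values
print(np.ptp(minimizers) < 1e-4)       # True: the spread across runs is negligible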
Problem 4: Give examples of a non-strictly convex function and a strictly convex function. Provide
definitions and graphs of your functions.
f(x) = x^2 is a strictly convex function, whereas

f(x) = 0              if |x| ≤ 5
       (|x| − 5)^2    otherwise

is convex but not strictly convex: every chord whose endpoints lie in [−5, 5] coincides with the graph
instead of lying strictly above it. (Note that a convex function defined on all of R is necessarily
continuous, so a discontinuous example is not possible.)
[Graphs of the two one-dimensional functions omitted.]
A more interesting case is found in two dimensions: f(x, y) = x^2 + y^2 (on the left) is strictly convex
whereas f(x, y) = x^2 (on the right) is not; the former looks like a well, the latter like a valley
with infinitely many minima along its bottom.
[Surface plots of f(x, y) = x^2 + y^2 (left) and f(x, y) = x^2 (right) omitted.]
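For readers who want to reproduce the omitted figures, a minimal matplotlib sketch could look as follows (plot ranges and grid resolution are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

# One-dimensional examples: strictly convex x^2 vs. the convex plateau function above.
x = np.linspace(-8, 8, 400)
f_strict = x**2
f_plateau = np.where(np.abs(x) <= 5, 0.0, (np.abs(x) - 5)**2)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(x, f_strict)
axes[0].set_title("x^2 (strictly convex)")
axes[1].plot(x, f_plateau)
axes[1].set_title("plateau function (convex, not strictly)")

# Two-dimensional examples: a "well" vs. a "valley".
u = np.linspace(-2, 2, 50)
X, Y = np.meshgrid(u, u)
fig3d = plt.figure(figsize=(8, 3))
ax1 = fig3d.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(X, Y, X**2 + Y**2)   # strictly convex: a single minimum at the origin
ax2 = fig3d.add_subplot(1, 2, 2, projection="3d")
ax2.plot_surface(X, Y, X**2)          # convex: a whole line of minima along x = 0
plt.show()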
2 The Gaussian kernel for classification
Problem 5: Let us define a feature transformation φn : R → R^n as follows:

φn(x) = ( e^{−x^2/(2σ^2)},  e^{−x^2/(2σ^2)} x/σ,  e^{−x^2/(2σ^2)} x^2/(σ^2 √(2!)),  . . . ,  e^{−x^2/(2σ^2)} x^i/(σ^i √(i!)),  . . . ,  e^{−x^2/(2σ^2)} x^n/(σ^n √(n!)) )
Suppose we let n → ∞ and define a new feature transformation:
φ∞(x) = ( e^{−x^2/(2σ^2)},  e^{−x^2/(2σ^2)} x/σ,  e^{−x^2/(2σ^2)} x^2/(σ^2 √(2!)),  . . . ,  e^{−x^2/(2σ^2)} x^i/(σ^i √(i!)),  . . . )
Can we directly apply this feature transformation to data? Explain why or why not!
No. A computer has only finite time and memory, so an infinite-dimensional feature vector can never
be computed or stored explicitly; the same holds for any other infinite feature transformation. (If
one interprets "apply" differently, e.g. in terms of lazy evaluation, the answer may differ.)
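As an illustration of what can be computed in practice, here is a minimal sketch of the truncated map φn (the helper name phi_n, the truncation order n and the value of σ below are arbitrary choices):

import numpy as np
from math import factorial

def phi_n(x, n=10, sigma=1.0):
    # Truncated Gaussian feature map: the i-th component (i = 0, ..., n) is
    # exp(-x^2 / (2 sigma^2)) * x^i / (sigma^i * sqrt(i!)).
    powers = np.arange(n + 1)
    facts = np.array([factorial(k) for k in range(n + 1)], dtype=float)
    return np.exp(-x**2 / (2 * sigma**2)) * (x / sigma)**powers / np.sqrt(facts)

print(phi_n(0.5).shape)   # (11,): only such finite-dimensional vectors are representable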
Problem 6: From the lecture, we know that we can express a linear classifier using only inner products
of input vectors in the transformed feature space. It would be great if we could somehow use the feature
space obtained by the feature transformation φ∞ . However, to do this, we must be able to compute the
inner product of samples in this infinite feature space. We define the inner product between two infinite
vectors φ∞ (x) and φ∞ (y) as the infinite sum given in the following equation:
K(x, y) = Σ_{i=1}^∞ φ∞,i(x) φ∞,i(y)
What is the explicit form of K(x, y)? (Hint: think of the Taylor series of e^x.) With such a high-dimensional
feature space, should we be concerned about overfitting?
Using the definition of K(x, y) we get
K(x, y) = e^{−(x^2 + y^2)/(2σ^2)} Σ_{k=0}^∞ (xy/σ^2)^k / k! = e^{−(x^2 + y^2)/(2σ^2)} e^{xy/σ^2} = e^{−(x − y)^2/(2σ^2)} .
When using a linear classifier (without the maximum-margin property), overfitting is a problem in
an infinite-dimensional feature space. This can be seen from results in learning theory which show
that bounds on the risk (expected loss) grow with the dimensionality of the feature space.

For linear maximum-margin classifiers (SVMs), learning theory shows that the generalization
performance does not depend on the dimensionality of the feature space. However, this result does
not take noise in the input into account, and thus overfitting might still be a problem. Note that the
maximum-margin property can be considered a regularizer, since it limits the acceptable hypotheses
for a training set.
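A quick numerical sanity check of the closed form derived above (a sketch, reusing the hypothetical phi_n from Problem 5 with a higher truncation order; the test points and σ are arbitrary choices):

import numpy as np
from math import factorial

def phi_n(x, n=30, sigma=1.0):
    # Truncated Gaussian feature map, as sketched for Problem 5.
    powers = np.arange(n + 1)
    facts = np.array([factorial(k) for k in range(n + 1)], dtype=float)
    return np.exp(-x**2 / (2 * sigma**2)) * (x / sigma)**powers / np.sqrt(facts)

def gauss_kernel(x, y, sigma=1.0):
    # Closed form derived above: K(x, y) = exp(-(x - y)^2 / (2 sigma^2)).
    return np.exp(-(x - y)**2 / (2 * sigma**2))

x, y = 0.7, -1.3
approx = phi_n(x) @ phi_n(y)   # inner product of truncated feature vectors
exact = gauss_kernel(x, y)
print(approx, exact)           # the two values agree to many decimal places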
Problem 7: Can any finite set of points be linearly separated in the feature space defined by φ∞ if σ
can be chosen freely?
Consider the limit σ → 0. The kernel function becomes

K(x, y) = 1 if x = y
          0 if x ≠ y .
Thus, for any finite set of distinct points, the kernel matrix is the identity matrix, K = I.
All training samples are correctly classified if
yi (wT φ(xi ) + b) > 0 for all i .
By substituting the dual representation w = Σ_j αj φ(xj) into this expression and replacing the scalar
product φ(xi)T φ(xj) by the kernel function K(xi, xj), this condition translates to

yi ( Σ_j αj K(xi, xj) + b ) > 0 .
With the above kernel function in particular the condition becomes
yi (αi + b) > 0 .
By choosing αi = yi and b = 0 we see that the resulting condition

yi^2 > 0

is fulfilled for all training samples, since yi^2 = 1.
Hence any finite set of distinct points can be linearly separated in the feature space of the Gaussian
kernel if the kernel width σ is chosen small enough.
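As a numerical illustration of this construction (a sketch; the sample points, labels and the tiny value of σ are arbitrary choices), one can check that with αi = yi and b = 0 every training point is classified correctly once σ is small:

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20)            # 20 distinct one-dimensional sample points
y = rng.choice([-1.0, 1.0], size=20)     # arbitrary labels
sigma = 1e-3                             # very small kernel width

# Gaussian kernel matrix K_ij = exp(-(x_i - x_j)^2 / (2 sigma^2))
K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * sigma**2))
print(np.allclose(K, np.eye(20)))        # True: K is numerically the identity matrix

alpha, b = y, 0.0                        # the choice made in the argument above
decision = K @ alpha + b                 # sum_j alpha_j K(x_i, x_j) + b for every i
print(np.all(y * decision > 0))          # True: all samples are classified correctly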
3 The SVM optimisation problem
Problem 8: Consider the SVM optimisation problem with slack variables (1-norm soft-margin) from
the lecture. Show that the duality gap is zero for this problem. Why is this important? What would
happen if this were not the case?
From the optimisation lecture we know that Slater's constraint qualification guarantees that the
duality gap is zero if the problem is feasible, the objective function is convex, and the constraints
are affine. (All affine functions are also convex.)
First we show that the objective function
f0(w, b, ξ) = (1/2) wT w + C Σ_{i=1}^m ξi
with C ≥ 0 is convex. We prove that the parabola f(x) = x^2 is convex by showing that for x, y ∈ R
the inequality

f(y) ≥ f(x) + (y − x)T ∇f(x)
holds. We have
f(x) + (y − x) df/dx = x^2 + (y − x) · 2x = −x^2 + 2xy
= −x^2 + 2xy − y^2 + y^2 = −(x − y)^2 + y^2 ≤ y^2 = f(y) .
The linear function f (x) = x is convex because
f ((1 − t)x + ty) = (1 − t)x + ty = (1 − t)f (x) + tf (y) .
Now, the objective function f0(w, b, ξ) is convex because it is a sum of convex functions scaled by
non-negative factors (this is where C ≥ 0 is needed).
The constraints

fi(w, b, ξ) = yi (wT xi + b) − 1 + ξi ≥ 0,   i = 1, . . . , m

are affine in the optimisation variables (w, b, ξ) because, for fixed xi and yi, we can write them in the form

fi(w, b, ξ) = aT w + β b + ξi + c

with a = yi xi, β = yi and c = −1.
The constraints

gi(ξi) = ξi ≥ 0,   i = 1, . . . , m

are affine because we can also write them in the form

gi(ξi) = a ξi + c

with a = 1 and c = 0.
The problem is also feasible: for any w and b the slack variables ξi can simply be chosen large enough
to satisfy all constraints. Thus Slater's constraint qualification applies and the duality gap is zero
for the SVM optimisation problem.

If the duality gap were not zero, the optimal value of the dual problem would be strictly smaller than
the primal minimum (weak duality only provides a lower bound), so solving the dual would not recover
the primal optimum. For the SVM this means that the hyperplane reconstructed from the dual variables
could have a smaller than optimal margin.
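The zero duality gap can also be observed numerically. The following is only a sketch (it assumes the cvxpy package is available and uses an arbitrary two-dimensional toy data set with C = 1): it solves the primal 1-norm soft-margin problem and its dual and compares the optimal values.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, C = 40, 1.0
X = np.vstack([rng.normal(loc=[-1.0, -1.0], size=(m // 2, 2)),
               rng.normal(loc=[+1.0, +1.0], size=(m // 2, 2))])
y = np.hstack([-np.ones(m // 2), np.ones(m // 2)])

# Primal problem: minimize 1/2 w^T w + C * sum_i xi_i
# subject to y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0.
w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(m, nonneg=True)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                    [cp.multiply(y, X @ w + b) >= 1 - xi])
primal.solve()

# Dual problem: maximize sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
# subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
A = y[:, None] * X                       # rows are y_i x_i, so the quadratic term is ||A^T alpha||^2
alpha = cp.Variable(m)
dual = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(A.T @ alpha)),
                  [alpha >= 0, alpha <= C, y @ alpha == 0])
dual.solve()

print(primal.value, dual.value)          # the two optimal values coincide: zero duality gap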
Submit to [email protected] with subject line homework sheet 9 by 2014/12/08, 23:59 CET