Machine Learning 1 — WS2014 — Module IN2064
Machine Learning Worksheet 9: Support Vector Machines

Submit to [email protected] with subject line "homework sheet 9" by 2014/12/08, 23:59 CET.

1  Optimization and convexity

In the lecture you learned the concept of a convex function. By tightening the condition we obtain the concept of a strictly convex function. A function f : X → R defined on a convex set X is called strictly convex if, for any two points x, y ∈ X with x ≠ y and any t ∈ ]0, 1[,

    f(t x + (1 − t) y) < t f(x) + (1 − t) f(y) .

It can be shown that for differentiable functions this is equivalent to the condition that for every x, y ∈ X with x ≠ y we have

    f(y) > f(x) + (y − x)^T ∇f(x) .

Problem 1: Consider the graph of a strictly convex function. How can you define strict convexity geometrically? How does it compare to (non-strict) convexity?

Take any two points on the graph of the function and connect them by a straight line segment. The function is strictly convex if every segment constructed in this way lies strictly above the graph between its endpoints. If the function is only (non-strictly) convex, the segments may touch the graph, but they never cross below it.

Problem 2: If a function f(x) is convex, how many points x* at most can satisfy the equation f(x*) = min_x f(x)? Prove your answer.

The constant function f(x) = 0 is convex, and every x* ∈ R satisfies f(x*) = 0 = min_x f(x). Thus infinitely many points can satisfy this equation.

Problem 3: If a function f(x) is strictly convex, how many points x* at most can satisfy the equation f(x*) = min_x f(x)? Prove your answer.

Assume that f has a minimum and that it is attained at x = x̃. From the lecture we know that ∇f(x̃) = 0 at the minimizer x̃, i.e. the tangent of the function is horizontal at this point. The inequality above then gives, for every x* ≠ x̃,

    f(x*) > f(x̃) + (x* − x̃)^T 0 = f(x̃) .

Hence at most one point can satisfy the equation.

Problem 4: Give examples of a non-strictly convex function and a strictly convex function. Provide definitions and graphs of your functions.

f(x) = x² is strictly convex, whereas

    f(x) = 0            if |x| ≤ 5,
    f(x) = (|x| − 5)²   otherwise,

is convex but not strictly convex: on the flat part every chord lies on the graph. (A convex function on all of R is necessarily continuous, so the flat and the quadratic piece have to be joined continuously.)

[Figure: graphs of the two one-dimensional example functions.]

A more interesting case is found in two dimensions: f(x, y) = x² + y² (on the left) is strictly convex whereas f(x, y) = x² (on the right) is not — the former looks like a well, the latter looks like a valley with infinitely many minima along its bottom.

[Figure: surface plots of f(x, y) = x² + y² (left) and f(x, y) = x² (right).]
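The defining inequality can also be probed numerically. The following is a minimal Python sketch, not part of the original worksheet: it assumes NumPy is available, the function and variable names are our own, and a random search like this can only disprove strict convexity, never prove it. For the two one-dimensional examples from Problem 4 it flags the flat-bottomed function as not strictly convex.

import numpy as np

def strictly_convex_on_samples(f, xs, trials=1000, seed=0):
    """Randomly probe f(t x + (1 - t) y) < t f(x) + (1 - t) f(y) for x != y, t in (0, 1).
    Returns False as soon as one probe violates the strict inequality."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.choice(xs, size=2, replace=False)
        if x == y:
            continue
        t = rng.uniform(1e-3, 1 - 1e-3)
        if not f(t * x + (1 - t) * y) < t * f(x) + (1 - t) * f(y):
            return False
    return True

def parabola(x):
    # strictly convex
    return x ** 2

def flat_bottom(x):
    # convex but not strictly convex: flat on [-5, 5]
    return 0.0 if abs(x) <= 5 else (abs(x) - 5) ** 2

xs = np.linspace(-10, 10, 201)
print(strictly_convex_on_samples(parabola, xs))     # expected: True (no violation found)
print(strictly_convex_on_samples(flat_bottom, xs))  # expected: False (chords lie on the flat part)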
2  The Gaussian kernel for classification

Problem 5: Let us define a feature transformation φ_n : R → R^{n+1} whose components are

    φ_{n,i}(x) = e^{−x²/(2σ²)} x^i / (σ^i √(i!)) ,   i = 0, 1, …, n ,

that is,

    φ_n(x) = ( e^{−x²/(2σ²)},  e^{−x²/(2σ²)} x/σ,  e^{−x²/(2σ²)} x²/(σ² √(2!)),  …,  e^{−x²/(2σ²)} x^n/(σ^n √(n!)) ) .

Suppose we let n → ∞ and define a new feature transformation

    φ_∞(x) = ( e^{−x²/(2σ²)},  e^{−x²/(2σ²)} x/σ,  e^{−x²/(2σ²)} x²/(σ² √(2!)),  …,  e^{−x²/(2σ²)} x^i/(σ^i √(i!)),  … ) .

Can we directly apply this feature transformation to data? Explain why or why not!

No. A computing machine has only finite time and memory, so an infinite-dimensional feature vector can never be computed or stored explicitly; this (or any other infinite) feature transformation cannot be applied to data directly. (If "apply" is interpreted differently, e.g. in terms of lazy evaluation, the answer may differ.)

Problem 6: From the lecture we know that we can express a linear classifier using only inner products of input vectors in the transformed feature space. It would be great if we could somehow use the feature space obtained by the feature transformation φ_∞. However, to do this, we must be able to compute the inner product of samples in this infinite feature space. We define the inner product between two infinite vectors φ_∞(x) and φ_∞(y) as the infinite sum

    K(x, y) = ∑_{i=0}^{∞} φ_{∞,i}(x) φ_{∞,i}(y) .

What is the explicit form of K(x, y)? (Hint: think of the Taylor series of e^x.) With such a high-dimensional feature space, should we be concerned about overfitting?

Using the definition of K(x, y) and the Taylor series of the exponential function we get

    K(x, y) = e^{−(x² + y²)/(2σ²)} ∑_{k=0}^{∞} (xy/σ²)^k / k!  =  e^{−(x² + y²)/(2σ²)} e^{xy/σ²}  =  e^{−(x − y)²/(2σ²)} .

When using a linear classifier without the maximum-margin property, overfitting is a problem in an infinite-dimensional feature space. This can be seen from results in learning theory which show that the risk (expected loss) grows with the dimensionality of the feature space. For linear maximum-margin classifiers (SVMs), learning theory shows that the generalization performance does not depend on the dimensionality of the feature space. However, this result does not take noise in the input into account, so overfitting might still be a problem. Note that the maximum-margin property can be regarded as a regularizer, since it limits the acceptable hypotheses for a given training set.

Problem 7: Can any finite set of points be linearly separated in the feature space defined by φ_∞ if σ can be chosen freely?

Consider the limit σ → 0. The kernel function becomes

    K(x, y) = 1  if x = y,   and   K(x, y) = 0  if x ≠ y.

Thus the kernel matrix of any finite set of distinct points is the identity matrix, K = I. All training samples are correctly classified if

    y_i (w^T φ(x_i) + b) > 0   for all i .

Substituting the dual representation w = ∑_j α_j φ(x_j) into this expression and replacing the scalar product φ(x_i)^T φ(x_j) by the kernel function K(x_i, x_j), the condition becomes

    y_i ( ∑_j α_j K(x_i, x_j) + b ) > 0 .

With the above kernel function in particular the condition reduces to

    y_i (α_i + b) > 0 .

Choosing α_i = y_i and b = 0 yields the condition y_i² > 0, which is fulfilled for all training samples. Hence every finite set of distinct points can be linearly separated using the Gaussian kernel if the variance σ is chosen small enough.
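The derivations of Problems 5–7 can be checked numerically. Below is a minimal Python sketch, not part of the original worksheet; it assumes NumPy is available, and the names phi_n and gauss_kernel are our own. It shows that inner products of the truncated feature maps converge to the closed-form Gaussian kernel as n grows, and that the kernel matrix of distinct points approaches the identity as σ → 0.

from math import factorial

import numpy as np

def phi_n(x, n, sigma):
    # truncated feature map from Problem 5: component i is exp(-x^2/(2 sigma^2)) x^i / (sigma^i sqrt(i!))
    i = np.arange(n + 1)
    norm = sigma ** i * np.sqrt([factorial(int(k)) for k in i])
    return np.exp(-x ** 2 / (2 * sigma ** 2)) * x ** i / norm

def gauss_kernel(x, y, sigma):
    # closed form derived in Problem 6
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

x, y, sigma = 1.3, -0.4, 1.0
for n in (2, 5, 10, 20):
    print(n, phi_n(x, n, sigma) @ phi_n(y, n, sigma))  # converges to the kernel value below
print(gauss_kernel(x, y, sigma))

# Problem 7: for distinct points the kernel matrix tends to the identity as sigma -> 0
pts = np.array([0.0, 0.5, 1.0, 2.0])
for sigma in (1.0, 0.1, 0.01):
    K = gauss_kernel(pts[:, None], pts[None, :], sigma)
    print(np.round(K, 3))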
3  The SVM optimisation problem

Problem 8: Consider the SVM optimisation problem with slack variables (1-norm soft margin) from the lecture. Show that the duality gap is zero for this problem. Why is this important? What would happen if this was not the case?

From the optimisation lecture we know that Slater's constraint qualification guarantees a zero duality gap if the objective function is convex, the constraints are affine, and the problem is feasible. (All affine functions are also convex; feasibility holds here because the slack variables ξ_i can always be chosen large enough.)

First we show that the objective function

    f_0(w, b, ξ) = (1/2) w^T w + C ∑_{i=1}^{m} ξ_i

with C ≥ 0 is convex.

We prove that the parabola f(x) = x² is convex by showing that for all x, y ∈ R the inequality

    f(y) ≥ f(x) + (y − x) f'(x)

holds. We have

    f(x) + (y − x) f'(x) = x² + (y − x) · 2x = −x² + 2xy = −x² + 2xy − y² + y² = −(x − y)² + y² ≤ y² = f(y) .

The linear function f(x) = x is convex because

    f((1 − t)x + t y) = (1 − t)x + t y = (1 − t) f(x) + t f(y) .

Now, the objective function f_0(w, b, ξ) is convex because it is a sum of (non-negatively scaled) convex functions: w^T w is a sum of parabolas in the components of w, and each slack term ξ_i is linear.

The constraints

    f_i(w, b, ξ) = y_i (w^T x_i + b) − 1 + ξ_i ≥ 0 ,   i = 1, …, m ,

are affine in the optimisation variables (w, b, ξ): for fixed data (x_i, y_i) each f_i is linear in w (with coefficient vector y_i x_i), linear in b and linear in ξ_i, plus the constant −1. The constraints

    g_i(ξ_i) = ξ_i ≥ 0 ,   i = 1, …, m ,

are affine as well, since g_i(ξ_i) = a ξ_i + c with a = 1 and c = 0.

Thus Slater's constraint qualification applies and the duality gap is zero for the SVM optimisation problem.

This matters because the SVM is trained by solving the dual problem. If the duality gap were not zero, the optimal value of the dual would lie strictly below the primal minimum, and the solution recovered from the dual would in general not minimise the primal objective. For the SVM this would mean obtaining a separating hyperplane with a smaller than optimal margin.
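The zero duality gap can also be observed numerically. Below is a minimal Python sketch, not part of the original worksheet; it assumes NumPy and the convex-optimisation package cvxpy are installed, and the toy data set is made up purely for illustration. It solves the 1-norm soft-margin primal and the corresponding dual on the same data and compares the two optimal values, which agree up to solver tolerance.

import numpy as np
import cvxpy as cp

# toy two-class data, made up for illustration (not from the worksheet)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.2, 1.0, -1.0)
m, d = X.shape
C = 1.0

# primal: minimize 0.5 ||w||^2 + C sum_i xi_i  s.t.  y_i (w^T x_i + b) >= 1 - xi_i,  xi_i >= 0
w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(m)
primal = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
    [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0],
)
p_star = primal.solve()

# dual: maximize sum_i a_i - 0.5 || sum_i a_i y_i x_i ||^2  s.t.  0 <= a_i <= C,  sum_i a_i y_i = 0
a = cp.Variable(m)
G = y[:, None] * X  # row i is y_i x_i
dual = cp.Problem(
    cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(G.T @ a)),
    [a >= 0, a <= C, y @ a == 0],
)
d_star = dual.solve()

print(p_star, d_star)  # equal up to solver tolerance, i.e. the duality gap is (numerically) zero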