Support vector machines
Usman Roshan

Separating hyperplanes
• For two sets of points there are many separating hyperplanes
• Which one should we choose for classification?
• In other words, which one is most likely to produce the least error?

Theoretical foundation
• Margin error bound theorem (Theorem 7.3 in Learning with Kernels)

Separating hyperplanes
• The best hyperplane is the one that maximizes the minimum distance of all training points to the plane (Learning with Kernels, Schölkopf and Smola, 2002)
• Its expected error is at most the fraction of misclassified points plus a complexity term (Learning with Kernels, Schölkopf and Smola, 2002)

Margin of a plane
• We define the margin as the minimum distance of the plane to the training points (the distance to the closest point)
• The optimally separating plane is the one with the maximum margin

Optimally separating hyperplane
• How do we find the optimally separating hyperplane?
• Recall the distance of a point to the plane defined earlier

Hyperplane separators
(figure: a point x, its projection x_p onto the plane, the distance r between them, and the plane's normal vector w)

Distance of a point to the separating plane
• The distance r of a point x to the plane is given by
  $r = \frac{w^T x + w_0}{\|w\|}$  or  $r = \frac{y\,(w^T x + w_0)}{\|w\|}$
  where y is -1 if the point is on the left side of the plane and +1 otherwise.

Support vector machine: optimally separating hyperplane
The distance of a point x (with label y) to the hyperplane is given by
  $\frac{y\,(w^T x + w_0)}{\|w\|}$
We want this to be at least some value r:
  $\frac{y\,(w^T x + w_0)}{\|w\|} \ge r$
By scaling w we can obtain infinitely many solutions, so we require that
  $r\,\|w\| = 1$
We therefore minimize ||w|| to maximize the distance, which gives us the SVM optimization problem.

Support vector machine: optimally separating hyperplane
SVM optimization criterion (primal form):
  $\min_{w} \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + w_0) \ge 1$ for all $i$
We can solve this with Lagrange multipliers, which tells us that
  $w = \sum_i \alpha_i y_i x_i$
The $x_i$ for which $\alpha_i$ is non-zero are called support vectors.

SVM dual problem
Let L be the Lagrangian:
  $L = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right)$
Setting $dL/dw = 0$ and $dL/dw_0 = 0$ gives us the dual form
  $\arg\max_{\alpha_i} L_d = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
  subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$

Support vector machine: optimally separating hyperplane
(figure: the two classes lie at distance 1/||w|| on either side of the plane, so the margin has width 2/||w||)

Another look at the SVM objective
• Consider the objective:
  $\arg\min_{w, w_0} \sum_i \max\left(0,\, 1 - \frac{y_i (w^T x_i + w_0)}{\|w\|}\right)$
• This is non-convex and harder to solve than the convex SVM objective
• Roughly, it measures the total sum of distances of points to the plane
• Misclassified points have a negative signed distance and so incur a penalty, whereas correctly classified points incur zero loss

SVM objective
• The SVM objective is
  $\min_{w} \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + w_0) \ge 1$ for all $i$
• Compare this with
  $\arg\min_{w, w_0} \sum_i \max\left(0,\, 1 - \frac{y_i (w^T x_i + w_0)}{\|w\|}\right)$
• The SVM objective can be viewed as a convex approximation of this, obtained by separating the numerator and the denominator
• The SVM objective is equivalent to minimizing the regularized hinge loss:
  $\arg\min_{w, w_0} \sum_{i=0}^{n-1} \max\left(0,\, 1 - y_i (w^T x_i + w_0)\right) + \frac{\|w\|^2}{2}$

Hinge loss optimization
• Hinge loss:
  $\arg\min_{w, w_0} \sum_{i=0}^{n-1} \max\left(0,\, 1 - y_i (w^T x_i + w_0)\right)$
• But the max function is non-differentiable, so we use the sub-gradient (a small sketch in code follows below):
  $\frac{\partial f}{\partial w_j} = \begin{cases} -x_{ij}\, y_i & \text{if } y_i (w^T x_i + w_0) < 1 \\ 0 & \text{if } y_i (w^T x_i + w_0) \ge 1 \end{cases}$
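A minimal NumPy sketch of this sub-gradient update, assuming a small toy dataset with labels in {-1, +1}; the function name hinge_subgradient_descent, the learning rate, the epoch count, and the data are illustrative choices, not part of the slides.

import numpy as np

def hinge_subgradient_descent(X, y, lr=0.01, epochs=200):
    """Minimize sum_i max(0, 1 - y_i (w^T x_i + w0)) by sub-gradient descent."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        active = margins < 1            # points with y_i (w^T x_i + w0) < 1
        # Sub-gradient from the slide: -y_i x_ij on active points, 0 elsewhere
        grad_w = -(y[active, None] * X[active]).sum(axis=0)
        grad_w0 = -y[active].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, w0 = hinge_subgradient_descent(X, y)
print(np.sign(X @ w + w0))              # matches y on this toy set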
Inseparable case
• What if there is no separating hyperplane? Consider, for example, the XOR function.
• One solution: consider all hyperplanes and select the one with the minimum number of misclassified points
• Unfortunately this is NP-complete (see the paper by Ben-David, Eiron, and Long on the course website)
• It is even NP-complete to approximate polynomially (Learning with Kernels, Schölkopf and Smola, and the paper on the website)

Inseparable case
• But if we measure the error as the sum of the distances of misclassified points to the plane, then we can solve for a support vector machine in polynomial time
• Roughly speaking, the margin error bound theorem applies (Theorem 7.3, Schölkopf and Smola)
• Note that the total distance error can be considerably larger than the number of misclassified points

0/1 loss vs distance-based loss
(figure comparing the two error measures)

Optimally separating hyperplane with errors
(figure: a separating hyperplane with normal w and some points on the wrong side of their margin)

Support vector machine: optimally separating hyperplane
In practice we allow for error terms in case there is no separating hyperplane:
  $\min_{w, w_0, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w^T x_i + w_0) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$

SVM software
• Plenty of SVM software is available. Two popular packages:
  – SVM-light
  – LIBSVM

Kernels
• What if no separating hyperplane exists?
• Consider the XOR function.
• In a higher-dimensional space we can find a separating hyperplane
• Example with SVM-light (a similar example using LIBSVM is sketched below)

Kernels
• The solution to the SVM is obtained by applying the KKT conditions (a generalization of Lagrange multipliers). The problem to solve becomes
  $L_d = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$
  subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$

Kernels
• The previous problem can in turn be solved with the KKT conditions.
• The dot product $x_i^T x_j$ can be replaced by a kernel matrix $K(i, j) = x_i^T x_j$, or more generally by any positive definite matrix K:
  $L_d = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
  subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$

Kernels
• With the kernel approach we can avoid explicit calculation of features in high dimensions
• How do we find the best kernel?
• Multiple kernel learning (MKL) addresses this by learning K as a linear combination of base kernels (a fixed-weight combination is sketched below)
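A short sketch of the kernel idea on the XOR data using scikit-learn's SVC, which wraps LIBSVM (one of the packages listed above); the RBF kernel and the values of gamma and C are illustrative choices rather than anything prescribed here.

import numpy as np
from sklearn.svm import SVC

# XOR: no separating hyperplane exists in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# In the dual, the dot product x_i^T x_j is replaced by K(x_i, x_j);
# an RBF kernel implicitly maps the points to a space where they separate.
clf = SVC(kernel="rbf", gamma=1.0, C=10.0)
clf.fit(X, y)

print(clf.predict(X))   # recovers the XOR labels
print(clf.support_)     # indices of the support vectors (non-zero alpha_i)
print(clf.dual_coef_)   # y_i * alpha_i for each support vector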
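To make the MKL idea concrete, here is a rough sketch that mixes two base kernels with fixed weights and trains an SVM on the combined Gram matrix; a real MKL method would also learn the weights beta, which this sketch does not do. The data, base kernels, and weights are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

# Toy data (illustrative): the XOR points again
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Two base kernels and fixed (assumed) combination weights beta
K1 = linear_kernel(X, X)           # x_i^T x_j
K2 = rbf_kernel(X, X, gamma=1.0)   # exp(-gamma * ||x_i - x_j||^2)
beta = [0.3, 0.7]
K = beta[0] * K1 + beta[1] * K2    # still positive semi-definite

# Train on the combined Gram matrix; at prediction time pass K(test, train)
clf = SVC(kernel="precomputed", C=10.0)
clf.fit(K, y)
print(clf.predict(K))              # predictions on the training points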