Foundations of Machine Learning
Support Vector Machines

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Binary Classification Problem

Training data: sample drawn i.i.d. from a set $X \subseteq \mathbb{R}^N$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\})^m.$$
Problem: find a hypothesis $h\colon X \to \{-1, +1\}$ in $H$ (classifier) with small generalization error $R(h)$.
• choice of the hypothesis set $H$: learning guarantees of the previous lecture.
• linear classification (hyperplanes) if the dimension $N$ is not too large.

This Lecture

• Support Vector Machines - separable case
• Support Vector Machines - non-separable case
• Margin guarantees

Linear Separation

(Figure: several separating hyperplanes $w \cdot x + b = 0$ and the margin of one of them.)
• classifiers: $H = \{x \mapsto \operatorname{sgn}(w \cdot x + b) : w \in \mathbb{R}^N, b \in \mathbb{R}\}$.
• geometric margin: $\rho = \min_{i \in [1,m]} \dfrac{|w \cdot x_i + b|}{\|w\|}$.
• which separating hyperplane should be chosen?

Optimal Hyperplane: Maximum Margin
(Vapnik and Chervonenkis, 1965)

(Figure: hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = \pm 1$.)
$$\rho = \max_{w, b\,:\, y_i(w \cdot x_i + b) \ge 0}\; \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|}.$$

Maximum Margin

$$
\rho = \max_{w, b\,:\, y_i(w \cdot x_i + b) \ge 0}\; \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|}
= \max_{\substack{w, b\,:\, y_i(w \cdot x_i + b) \ge 0 \\ \min_{i \in [1,m]} |w \cdot x_i + b| = 1}} \frac{1}{\|w\|} \quad\text{(scale-invariance)}
= \max_{w, b\,:\, y_i(w \cdot x_i + b) \ge 1} \frac{1}{\|w\|} \quad\text{(min. reached).}
$$

Optimization Problem

Constrained optimization:
$$\min_{w, b}\; \frac{1}{2}\|w\|^2 \quad\text{subject to } y_i(w \cdot x_i + b) \ge 1,\; i \in [1,m].$$
Properties:
• Convex optimization.
• Unique solution for a linearly separable sample.

Optimal Hyperplane Equations

Lagrangian: for all $w$, $b$, $\alpha_i \ge 0$,
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i\,[y_i(w \cdot x_i + b) - 1].$$
KKT conditions:
$$\nabla_w L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i,$$
$$\nabla_b L = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Longrightarrow\; \sum_{i=1}^m \alpha_i y_i = 0,$$
$$\forall i \in [1,m],\quad \alpha_i\,[y_i(w \cdot x_i + b) - 1] = 0.$$

Support Vectors

Complementarity conditions:
$$\alpha_i\,[y_i(w \cdot x_i + b) - 1] = 0 \;\Longrightarrow\; \alpha_i = 0 \;\lor\; y_i(w \cdot x_i + b) = 1.$$
Support vectors: vectors $x_i$ such that $\alpha_i \ne 0$ and $y_i(w \cdot x_i + b) = 1$.
• Note: support vectors are not unique.

Moving to the Dual

Plugging the expression of $w$ into $L$ gives:
$$L = \underbrace{\frac{1}{2}\Big\|\sum_{i=1}^m \alpha_i y_i x_i\Big\|^2}_{\frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} \;-\; \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \;-\; \underbrace{\sum_{i=1}^m \alpha_i y_i b}_{0} \;+\; \sum_{i=1}^m \alpha_i.$$
Thus,
$$L = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).$$

Equivalent Dual Optimization Problem

Constrained optimization:
$$\max_{\alpha}\; \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\quad\text{subject to: } \alpha_i \ge 0 \,\wedge\, \sum_{i=1}^m \alpha_i y_i = 0,\; i \in [1,m].$$
Solution:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + b\Big), \quad\text{with } b = y_i - \sum_{j=1}^m \alpha_j y_j (x_j \cdot x_i) \text{ for any SV } x_i.$$
(A short numerical sketch of this dual problem is given after the leave-one-out discussion below.)

Leave-One-Out Error

Definition: let $h_S$ be the hypothesis output by learning algorithm $L$ after receiving a sample $S$ of size $m$. Then, the leave-one-out error of $L$ over $S$ is:
$$\widehat{R}_{\mathrm{loo}}(L) = \frac{1}{m}\sum_{i=1}^m 1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)}.$$
Property: unbiased estimate of the expected error of a hypothesis trained on a sample of size $m - 1$:
$$\operatorname*{\mathbb{E}}_{S \sim D^m}[\widehat{R}_{\mathrm{loo}}(L)]
= \frac{1}{m}\sum_{i=1}^m \operatorname*{\mathbb{E}}_{S \sim D^m}\big[1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)}\big]
= \operatorname*{\mathbb{E}}_{S \sim D^m}\big[1_{h_{S - \{x\}}(x) \ne f(x)}\big]
= \operatorname*{\mathbb{E}}_{S' \sim D^{m-1}}\Big[\operatorname*{\mathbb{E}}_{x \sim D}\big[1_{h_{S'}(x) \ne f(x)}\big]\Big]
= \operatorname*{\mathbb{E}}_{S' \sim D^{m-1}}[R(h_{S'})].$$
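As an illustration of the separable case, the following is a minimal sketch, not part of the original lecture, that solves the hard-margin dual above with scipy's general-purpose SLSQP solver on a hypothetical two-dimensional toy sample; the data, the solver choice, and the support-vector threshold of 1e-6 are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable sample in R^2 (illustrative assumption).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Matrix of the dual quadratic form: Q_ij = y_i y_j (x_i . x_j).
Q = (y[:, None] * X) @ (y[:, None] * X).T

# Dual objective, negated for minimization: (1/2) a^T Q a - sum_i a_i.
dual = lambda a: 0.5 * a @ Q @ a - a.sum()
grad = lambda a: Q @ a - np.ones(m)

# Constraints of the dual: a_i >= 0 (bounds) and sum_i a_i y_i = 0 (equality).
cons = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
res = minimize(dual, np.zeros(m), jac=grad, bounds=[(0, None)] * m,
               constraints=[cons], method="SLSQP")
alpha = res.x

# Recover the primal solution: w = sum_i a_i y_i x_i, b = y_i - w . x_i for any SV.
w = (alpha * y) @ X
sv = alpha > 1e-6                      # support vectors: alpha_i != 0
b = y[sv][0] - w @ X[sv][0]
print("alpha =", np.round(alpha, 4))
print("w =", w, " b =", float(b))
print("margins y_i(w.x_i + b) =", y * (X @ w + b))
```

On such a tiny sample only a few $\alpha_i$ are nonzero, which connects to the leave-one-out analysis that follows: the fewer support vectors, the smaller the bound on the expected error.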
Leave-One-Out Analysis

Theorem: let $h_S$ be the optimal hyperplane for a sample $S$ and let $N_{SV}(S)$ be the number of support vectors defining $h_S$. Then,
$$\operatorname*{\mathbb{E}}_{S \sim D^m}[R(h_S)] \le \operatorname*{\mathbb{E}}_{S \sim D^{m+1}}\Big[\frac{N_{SV}(S)}{m+1}\Big].$$
Proof: Let $S \sim D^{m+1}$ be a linearly separable sample and let $x \in S$. If $h_{S - \{x\}}$ misclassifies $x$, then $x$ must be a support vector for $h_S$. Thus,
$$\widehat{R}_{\mathrm{loo}}(\text{opt.-hyp.}) \le \frac{N_{SV}(S)}{m+1}.$$

Notes

• Bound on the expectation of the error only, not on the probability of error.
• Argument based on sparsity (number of support vectors). We will see later other arguments in support of the optimal hyperplane based on the concept of margin.

This Lecture

• Support Vector Machines - separable case
• Support Vector Machines - non-separable case
• Margin guarantees

Support Vector Machines
(Cortes and Vapnik, 1995)

Problem: data are often not linearly separable in practice. For any hyperplane, there exists $x_i$ such that
$$y_i\,[w \cdot x_i + b] \not\ge 1.$$
Idea: relax the constraints using slack variables $\xi_i \ge 0$:
$$y_i\,[w \cdot x_i + b] \ge 1 - \xi_i.$$

Soft-Margin Hyperplanes

(Figure: hyperplanes $w \cdot x + b = 0, \pm 1$ with slack variables $\xi_i, \xi_j$.)
• Support vectors: points along the margin or outliers.
• Soft margin: $\rho = 1/\|w\|$.

Optimization Problem
(Cortes and Vapnik, 1995)

Constrained optimization:
$$\min_{w, b, \xi}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i
\quad\text{subject to } y_i(w \cdot x_i + b) \ge 1 - \xi_i \,\wedge\, \xi_i \ge 0,\; i \in [1,m].$$
Properties:
• $C \ge 0$ trade-off parameter.
• Convex optimization.
• Unique solution.

Notes

• Parameter $C$: trade-off between maximizing the margin and minimizing the training error. How do we determine $C$?
• The general problem of determining a hyperplane minimizing the error on the training set is NP-complete (as a function of the dimension).
• Other convex functions of the slack variables could be used: this choice and a similar one with squared slack variables lead to a convenient formulation and solution.

SVM - Equivalent Problem

Optimization:
$$\min_{w, b}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \big(1 - y_i(w \cdot x_i + b)\big)_+.$$
Loss functions:
• hinge loss: $L(h(x), y) = (1 - y\,h(x))_+$.
• quadratic hinge loss: $L(h(x), y) = (1 - y\,h(x))_+^2$.
(A short sketch of this unconstrained formulation is given below, after the dual derivation.)

Hinge Loss

(Figure: the 0/1 loss, the hinge loss, and the 'quadratic' hinge loss as functions of $y\,h(x)$.)

SVMs Equations

Lagrangian: for all $w$, $b$, $\alpha_i \ge 0$, $\beta_i \ge 0$,
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i\,[y_i(w \cdot x_i + b) - 1 + \xi_i] - \sum_{i=1}^m \beta_i \xi_i.$$
KKT conditions:
$$\nabla_w L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i,$$
$$\nabla_b L = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Longrightarrow\; \sum_{i=1}^m \alpha_i y_i = 0,$$
$$\nabla_{\xi_i} L = C - \alpha_i - \beta_i = 0 \;\Longrightarrow\; \alpha_i + \beta_i = C,$$
$$\forall i \in [1,m],\quad \alpha_i\,[y_i(w \cdot x_i + b) - 1 + \xi_i] = 0 \;\wedge\; \beta_i \xi_i = 0.$$

Support Vectors

Complementarity conditions:
$$\alpha_i\,[y_i(w \cdot x_i + b) - 1 + \xi_i] = 0 \;\Longrightarrow\; \alpha_i = 0 \;\lor\; y_i(w \cdot x_i + b) = 1 - \xi_i.$$
Support vectors: vectors $x_i$ such that $\alpha_i \ne 0$ and $y_i(w \cdot x_i + b) = 1 - \xi_i$.
• Note: support vectors are not unique.

Moving to the Dual

Plugging the expression of $w$ into $L$ gives the same expression as in the separable case (the terms in $\xi_i$ cancel since $\alpha_i + \beta_i = C$):
$$L = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).$$
The condition $\beta_i \ge 0$ is equivalent to $\alpha_i \le C$.
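The equivalent unconstrained formulation above suggests a very simple training procedure. The sketch below is not taken from the lecture (which proceeds via the dual): it minimizes $\frac{1}{2}\|w\|^2 + C\sum_i (1 - y_i(w \cdot x_i + b))_+$ by plain subgradient descent on a hypothetical toy sample; the learning rate, iteration count, data, and value of $C$ are illustrative assumptions.

```python
import numpy as np

def svm_primal_subgradient(X, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i (1 - y_i(w.x_i + b))_+."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                                   # points with nonzero hinge loss
        gw = w - C * (y[active, None] * X[active]).sum(axis=0)   # subgradient w.r.t. w
        gb = -C * y[active].sum()                                # subgradient w.r.t. b
        w, b = w - lr * gw, b - lr * gb
    return w, b

# The two loss functions from the slide, as functions of the score y h(x).
hinge = lambda t: np.maximum(0.0, 1.0 - t)          # (1 - y h(x))_+
quad_hinge = lambda t: np.maximum(0.0, 1.0 - t) ** 2

# Toy sample (illustrative assumption).
X = np.array([[2.0, 2.0], [0.5, 0.4], [-1.0, -1.0], [-0.3, -0.6], [0.2, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
w, b = svm_primal_subgradient(X, y, C=10.0)
print("w =", w, " b =", b)
print("hinge losses =", hinge(y * (X @ w + b)))
```

The objective is convex but only piecewise differentiable, which is why a subgradient rather than a gradient is used at points with $y_i(w \cdot x_i + b) = 1$.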
Dual Optimization Problem

Constrained optimization:
$$\max_{\alpha}\; \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\quad\text{subject to: } 0 \le \alpha_i \le C \,\wedge\, \sum_{i=1}^m \alpha_i y_i = 0,\; i \in [1,m].$$
Solution:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + b\Big), \quad\text{with } b = y_i - \sum_{j=1}^m \alpha_j y_j (x_j \cdot x_i) \text{ for any } x_i \text{ with } 0 < \alpha_i < C.$$

This Lecture

• Support Vector Machines - separable case
• Support Vector Machines - non-separable case
• Margin guarantees

High-Dimension

Learning guarantees: for hyperplanes in dimension $N$, with probability at least $1 - \delta$,
$$R(h) \le \widehat{R}(h) + \sqrt{\frac{2(N+1)\log\frac{em}{N+1}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
• bound is uninformative for $N \gg m$.
• but SVMs have been remarkably successful in high dimension.
• can we provide a theoretical justification?
• analysis of the underlying scoring function.

Rademacher Complexity of Linear Hypotheses

Theorem: Let $S \subseteq \{x : \|x\| \le R\}$ be a sample of size $m$ and let $H = \{x \mapsto w \cdot x : \|w\| \le \Lambda\}$. Then,
$$\widehat{\mathfrak{R}}_S(H) \le \sqrt{\frac{R^2 \Lambda^2}{m}}.$$
Proof:
$$\widehat{\mathfrak{R}}_S(H) = \frac{1}{m}\operatorname*{\mathbb{E}}_{\sigma}\Big[\sup_{\|w\| \le \Lambda} \sum_{i=1}^m \sigma_i\, w \cdot x_i\Big]
= \frac{1}{m}\operatorname*{\mathbb{E}}_{\sigma}\Big[\sup_{\|w\| \le \Lambda} w \cdot \sum_{i=1}^m \sigma_i x_i\Big]
\le \frac{\Lambda}{m}\operatorname*{\mathbb{E}}_{\sigma}\Big[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|\Big]
\le \frac{\Lambda}{m}\Big[\operatorname*{\mathbb{E}}_{\sigma}\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|^2\Big]^{1/2}
= \frac{\Lambda}{m}\Big[\sum_{i=1}^m \|x_i\|^2\Big]^{1/2}
\le \frac{\Lambda\sqrt{m R^2}}{m} = \sqrt{\frac{R^2 \Lambda^2}{m}}.$$
(A Monte Carlo illustration of this bound is given at the end of this part.)

Confidence Margin

Definition: the confidence margin of a real-valued function $h$ at $(x, y) \in X \times Y$ is $\rho_h(x, y) = y\,h(x)$.
• interpreted as the hypothesis' confidence in its prediction.
• if the point is correctly classified, it coincides with $|h(x)|$.
• relationship with the geometric margin for linear functions $h\colon x \mapsto w \cdot x + b$: for $x$ in the sample, $|\rho_h(x, y)| \ge \rho_{\mathrm{geom}}\,\|w\|$.

Confidence Margin Loss

Definition: for any confidence margin parameter $\rho > 0$, the $\rho$-margin loss function $\Phi_\rho$ is defined by
(Figure: $\Phi_\rho(y h(x))$ equals $1$ for $y h(x) \le 0$, decreases linearly to $0$ on $[0, \rho]$, and equals $0$ for $y h(x) \ge \rho$.)
For a sample $S = (x_1, \ldots, x_m)$ and a real-valued hypothesis $h$, the empirical margin loss is
$$\widehat{R}_\rho(h) = \frac{1}{m}\sum_{i=1}^m \Phi_\rho(y_i h(x_i)) \le \frac{1}{m}\sum_{i=1}^m 1_{y_i h(x_i) < \rho}.$$

General Margin Bound

Theorem: Let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Proof: Let $\widetilde{H} = \{z = (x, y) \mapsto y\,h(x) : h \in H\}$. Consider the family of functions taking values in $[0, 1]$:
$$\overline{H} = \{\Phi_\rho \circ f : f \in \widetilde{H}\}.$$
• By the theorem of Lecture 3, with probability at least $1 - \delta$, for all $g \in \overline{H}$,
$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(\overline{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
• Thus,
$$\mathbb{E}[\Phi_\rho(y\,h(x))] \le \widehat{R}_\rho(h) + 2\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
• Since $\Phi_\rho$ is $\frac{1}{\rho}$-Lipschitz, by Talagrand's lemma, $\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) \le \frac{1}{\rho}\mathfrak{R}_m(\widetilde{H})$, and
$$\mathfrak{R}_m(\widetilde{H}) = \frac{1}{m}\operatorname*{\mathbb{E}}_{\sigma, S}\Big[\sup_{h \in H}\sum_{i=1}^m \sigma_i y_i h(x_i)\Big] = \mathfrak{R}_m(H).$$
• Since $1_{y h(x) < 0} \le \Phi_\rho(y h(x))$, this shows the first statement, and similarly the second one.

Margin Bound - Linear Classifiers

Corollary: Let $\rho > 0$ and $H = \{x \mapsto w \cdot x : \|w\| \le \Lambda\}$. Assume that $X \subseteq \{x : \|x\| \le R\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H$,
$$R(h) \le \widehat{R}_\rho(h) + 2\sqrt{\frac{R^2 \Lambda^2/\rho^2}{m}} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Proof: follows directly from the general margin bound and the bound on $\widehat{\mathfrak{R}}_S(H)$ for linear classifiers.

High-Dimensional Feature Space

Observations:
• the generalization bound does not depend on the dimension but on the margin.
• this suggests seeking a large-margin separating hyperplane in a higher-dimensional feature space.
Computational problems:
• taking dot products in a high-dimensional feature space can be very costly.
• solution based on kernels (next lecture).
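To illustrate the Rademacher complexity theorem for bounded linear hypotheses stated above, here is a small Monte Carlo sketch, not from the lecture: since the supremum over $\|w\| \le \Lambda$ is attained in closed form, $\widehat{\mathfrak{R}}_S(H) = \frac{\Lambda}{m}\mathbb{E}_\sigma\|\sum_i \sigma_i x_i\|$, which the sketch estimates on a hypothetical random sample and compares with $\sqrt{R^2\Lambda^2/m}$. The sample size, dimension, number of trials, and the constants $R$ and $\Lambda$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, R, Lam = 200, 50, 1.0, 2.0

# Sample of m points in R^N rescaled so that ||x_i|| <= R (illustrative data).
X = rng.normal(size=(m, N))
X *= R * rng.uniform(size=(m, 1)) / np.linalg.norm(X, axis=1, keepdims=True)

# Monte Carlo estimate of (Lam/m) E_sigma ||sum_i sigma_i x_i||.
trials = 2000
sigma = rng.choice([-1.0, 1.0], size=(trials, m))
rad_hat = Lam / m * np.linalg.norm(sigma @ X, axis=1).mean()

bound = np.sqrt(R**2 * Lam**2 / m)
print(f"estimated R_S(H) = {rad_hat:.4f}  <=  sqrt(R^2 Lam^2 / m) = {bound:.4f}")
```

The same quantity, divided by $\rho$, is exactly the complexity term in the margin bound for linear classifiers, which is what makes that bound independent of the dimension $N$.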
References

• Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.
• Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.
• Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer, New York, 1991.
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Appendix

Saddle Point

Let $(w^*, b^*, \alpha^*)$ be the saddle point of the Lagrangian. Multiplying both sides of the equation giving $b^*$ by $\alpha_i^* y_i$ and taking the sum leads to:
$$\sum_{i=1}^m \alpha_i^* y_i\, b^* = \sum_{i=1}^m \alpha_i^* y_i^2 - \sum_{i,j=1}^m \alpha_i^* \alpha_j^* y_i y_j (x_i \cdot x_j).$$
Using $y_i^2 = 1$, $\sum_{i=1}^m \alpha_i^* y_i = 0$, and $w^* = \sum_{i=1}^m \alpha_i^* y_i x_i$ yields
$$0 = \sum_{i=1}^m \alpha_i^* - \|w^*\|^2.$$
Thus, the margin is also given by:
$$\rho^2 = \frac{1}{\|w^*\|_2^2} = \frac{1}{\|\alpha^*\|_1}.$$

Talagrand's Contraction Lemma
(Ledoux and Talagrand, 1991; pp. 112-114)

Theorem: Let $\Phi\colon \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz function. Then, for any hypothesis set $H$ of real-valued functions,
$$\widehat{\mathfrak{R}}_S(\Phi \circ H) \le L\,\widehat{\mathfrak{R}}_S(H).$$
Proof: fix a sample $S = (x_1, \ldots, x_m)$. By definition,
$$\widehat{\mathfrak{R}}_S(\Phi \circ H) = \frac{1}{m}\operatorname*{\mathbb{E}}_{\sigma}\Big[\sup_{h \in H}\sum_{i=1}^m \sigma_i (\Phi \circ h)(x_i)\Big]
= \frac{1}{m}\operatorname*{\mathbb{E}}_{\sigma_1, \ldots, \sigma_{m-1}}\Big[\operatorname*{\mathbb{E}}_{\sigma_m}\big[\sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m)\big]\Big],$$
with $u_{m-1}(h) = \sum_{i=1}^{m-1} \sigma_i (\Phi \circ h)(x_i)$.

Now, assuming that the suprema are reached, there exist $h_1, h_2 \in H$ such that
$$\operatorname*{\mathbb{E}}_{\sigma_m}\big[\sup_{h \in H} u_{m-1}(h) + \sigma_m(\Phi \circ h)(x_m)\big]
= \frac{1}{2}\big[u_{m-1}(h_1) + (\Phi \circ h_1)(x_m)\big] + \frac{1}{2}\big[u_{m-1}(h_2) - (\Phi \circ h_2)(x_m)\big]$$
$$\le \frac{1}{2}\big[u_{m-1}(h_1) + u_{m-1}(h_2) + sL\,(h_1(x_m) - h_2(x_m))\big]
= \frac{1}{2}\big[u_{m-1}(h_1) + sL\,h_1(x_m)\big] + \frac{1}{2}\big[u_{m-1}(h_2) - sL\,h_2(x_m)\big]$$
$$\le \operatorname*{\mathbb{E}}_{\sigma_m}\big[\sup_{h \in H} u_{m-1}(h) + \sigma_m L\,h(x_m)\big],$$
where $s = \operatorname{sgn}(h_1(x_m) - h_2(x_m))$.

When the suprema are not reached, the same can be shown modulo $\epsilon$, followed by $\epsilon \to 0$. Proceeding similarly for the other $\sigma_i$'s directly leads to the result. (A numerical illustration of the lemma is given at the end of the appendix.)

VC Dimension of Canonical Hyperplanes

Theorem: Let $S \subseteq \{x : \|x\| \le R\}$. Then, the VC dimension $d$ of the set of canonical hyperplanes
$$\{x \mapsto \operatorname{sgn}(w \cdot x) : \min_{x \in S}|w \cdot x| = 1 \,\wedge\, \|w\| \le \Lambda\}$$
verifies $d \le R^2 \Lambda^2$.
Proof: Let $\{x_1, \ldots, x_d\}$ be a set fully shattered. Then, for all $y = (y_1, \ldots, y_d) \in \{-1, +1\}^d$, there exists $w$ such that
$$\forall i \in [1, d],\quad 1 \le y_i(w \cdot x_i).$$
• Summing up these inequalities gives
$$d \le w \cdot \sum_{i=1}^d y_i x_i \le \|w\|\,\Big\|\sum_{i=1}^d y_i x_i\Big\| \le \Lambda\,\Big\|\sum_{i=1}^d y_i x_i\Big\|.$$
• Taking the expectation over $y \sim U$ (uniform) yields
$$d \le \Lambda\,\operatorname*{\mathbb{E}}_{y \sim U}\Big[\Big\|\sum_{i=1}^d y_i x_i\Big\|\Big]
\le \Lambda\,\Big[\operatorname*{\mathbb{E}}_{y \sim U}\Big\|\sum_{i=1}^d y_i x_i\Big\|^2\Big]^{1/2} \quad\text{(Jensen's ineq.)}$$
$$= \Lambda\,\Big[\sum_{i,j=1}^d \mathbb{E}[y_i y_j]\,(x_i \cdot x_j)\Big]^{1/2}
= \Lambda\,\Big[\sum_{i=1}^d (x_i \cdot x_i)\Big]^{1/2}
\le \Lambda\,\big[d R^2\big]^{1/2} = \Lambda R \sqrt{d}.$$
• Thus, $\sqrt{d} \le \Lambda R$, that is $d \le R^2 \Lambda^2$.
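As a complement to the appendix, the following sketch, not part of the lecture, numerically checks Talagrand's contraction lemma on a hypothetical finite set of linear hypotheses, with $\Phi$ taken to be the $\frac{1}{\rho}$-Lipschitz margin loss $\Phi_\rho$ used earlier. The data, the hypothesis set, $\rho$, and the number of Monte Carlo trials are illustrative assumptions, and the inequality is only checked approximately since the expectations over $\sigma$ are estimated by sampling.

```python
import numpy as np

rng = np.random.default_rng(1)
m, rho = 50, 0.5
X = rng.normal(size=(m, 3))                          # hypothetical sample in R^3
H = [rng.normal(size=3) for _ in range(20)]          # finite set of linear hypotheses
scores = np.array([X @ w for w in H])                # h(x_i) for each h, shape (|H|, m)

# Margin loss Phi_rho: 1 for t <= 0, 1 - t/rho on [0, rho], 0 for t >= rho (1/rho-Lipschitz).
phi = lambda t: np.clip(1.0 - t / rho, 0.0, 1.0)

# Monte Carlo estimate of the empirical Rademacher complexity of a finite class,
# given the matrix of its values on the sample: E_sigma sup_h (1/m) sum_i sigma_i h(x_i).
trials = 5000
sigma = rng.choice([-1.0, 1.0], size=(trials, m))
rad = lambda vals: (sigma @ vals.T).max(axis=1).mean() / m

print("R_S(Phi o H)   ~", rad(phi(scores)))
print("(1/rho) R_S(H) ~", rad(scores) / rho)         # contraction lemma: first <= second
```

This contraction step is exactly where the $\frac{1}{\rho}$ factor in the general margin bound comes from.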