
Foundations of Machine Learning
Support Vector Machines
Mehryar Mohri
Courant Institute and Google Research
[email protected]
Binary Classification Problem
Training data: sample drawn i.i.d. from a set $X \subseteq \mathbb{R}^N$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\})^m.$$
Problem: find a hypothesis $h\colon X \to \{-1, +1\}$ in $H$ (classifier) with small generalization error $R(h)$.
• choice of hypothesis set $H$: learning guarantees of the previous lecture.
• linear classification (hyperplanes) if the dimension $N$ is not too large.

This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees
Linear Separation
[Figure: linearly separable sample with a separating hyperplane $w \cdot x + b = 0$ and its margin.]

• classifiers: $H = \{x \mapsto \mathrm{sgn}(w \cdot x + b) : w \in \mathbb{R}^N,\ b \in \mathbb{R}\}$.
• geometric margin: $\rho = \min_{i \in [1, m]} \dfrac{|w \cdot x_i + b|}{\|w\|}$.
• which separating hyperplane?

Optimal Hyperplane: Max. Margin
(Vapnik and Chervonenkis, 1965)
[Figure: maximum-margin hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$.]

$$\rho = \max_{w, b \,:\, y_i (w \cdot x_i + b) \ge 0} \; \min_{i \in [1, m]} \frac{|w \cdot x_i + b|}{\|w\|}.$$

Maximum Margin
$$\begin{aligned}
\rho &= \max_{w, b \,:\, y_i (w \cdot x_i + b) \ge 0} \; \min_{i \in [1, m]} \frac{|w \cdot x_i + b|}{\|w\|} \\
&= \max_{\substack{w, b \,:\, y_i (w \cdot x_i + b) \ge 0 \\ \min_{i \in [1, m]} |w \cdot x_i + b| = 1}} \frac{1}{\|w\|} && \text{(scale-invariance)} \\
&= \max_{w, b \,:\, y_i (w \cdot x_i + b) \ge 1} \frac{1}{\|w\|}. && \text{(min. reached)}
\end{aligned}$$

Optimization Problem
Constrained optimization:
$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1, \; i \in [1, m].$$
Properties:
• Convex optimization.
• Unique solution for a linearly separable sample.

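As a concrete illustration, here is a minimal sketch of this primal problem written as a quadratic program. It assumes a linearly separable sample and uses the cvxpy package; neither is prescribed by the lecture, and any QP solver would do.

```python
# Hard-margin SVM primal: minimize (1/2)||w||^2 s.t. y_i (w . x_i + b) >= 1.
# Sketch only; assumes the sample is linearly separable and cvxpy is installed.
import cvxpy as cp
import numpy as np

def hard_margin_svm(X: np.ndarray, y: np.ndarray):
    """X has shape (m, N); y has entries in {-1, +1}."""
    m, N = X.shape
    w = cp.Variable(N)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))      # (1/2)||w||^2
    constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (w . x_i + b) >= 1
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```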
Optimal Hyperplane Equations
Lagrangian: for all $w$, $b$, $\alpha_i \ge 0$,
$$\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big].$$
KKT conditions:
$$\nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^{m} \alpha_i y_i x_i.$$
$$\nabla_b \mathcal{L} = -\sum_{i=1}^{m} \alpha_i y_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
$$\forall i \in [1, m], \quad \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big] = 0.$$

Support Vectors
Complementarity conditions:
$$\alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big] = 0 \quad \Longrightarrow \quad \alpha_i = 0 \;\vee\; y_i (w \cdot x_i + b) = 1.$$
Support vectors: vectors $x_i$ such that
$$\alpha_i \ne 0 \;\wedge\; y_i (w \cdot x_i + b) = 1.$$
• Note: support vectors are not unique.

Moving to The Dual
Plugging in the expression of w in L gives:
$$\mathcal{L} = \underbrace{\frac{1}{2} \Big\| \sum_{i=1}^{m} \alpha_i y_i x_i \Big\|^2}_{\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} \;-\; \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \;-\; \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{0} \;+\; \sum_{i=1}^{m} \alpha_i.$$
Thus,
$$\mathcal{L} = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).$$

Equivalent Dual Opt. Problem
Constrained optimization:
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{subject to: } \alpha_i \ge 0 \;\wedge\; \sum_{i=1}^{m} \alpha_i y_i = 0, \; i \in [1, m].$$
Solution:
$$h(x) = \mathrm{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b \Big),$$
with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i)$ for any SV $x_i$.

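A sketch of this dual as a QP, again with cvxpy (my choice, not the lecture's). Writing the quadratic term as $\frac{1}{2}\|\sum_i \alpha_i y_i x_i\|^2$ keeps the program convex by construction, and $w$ and $b$ are then recovered from the KKT conditions above.

```python
# Dual of the hard-margin SVM: max_a sum_i a_i - (1/2)||sum_i a_i y_i x_i||^2
# s.t. a_i >= 0 and sum_i a_i y_i = 0. Sketch assuming a separable sample.
import cvxpy as cp
import numpy as np

def svm_dual(X: np.ndarray, y: np.ndarray):
    m = X.shape[0]
    Z = y[:, None] * X                                   # row i is y_i x_i
    alpha = cp.Variable(m, nonneg=True)                  # alpha_i >= 0
    obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha))
    cp.Problem(obj, [y @ alpha == 0]).solve()            # sum_i alpha_i y_i = 0
    a = alpha.value
    w = Z.T @ a                                          # w = sum_i alpha_i y_i x_i
    sv = int(np.argmax(a))                               # index of one support vector
    b = y[sv] - X[sv] @ w                                # b = y_i - sum_j alpha_j y_j (x_j . x_i)
    return w, b, a
```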
Leave-One-Out Error
Definition: let $h_S$ be the hypothesis output by learning algorithm $L$ after receiving a sample $S$ of size $m$. Then, the leave-one-out error of $L$ over $S$ is:
$$R_{\mathrm{loo}}(L) = \frac{1}{m} \sum_{i=1}^{m} 1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)}.$$
Property: unbiased estimate of the expected error of a hypothesis trained on a sample of size $m - 1$:
$$\mathop{\mathbb{E}}_{S \sim D^m}\big[ R_{\mathrm{loo}}(L) \big]
= \frac{1}{m} \sum_{i=1}^{m} \mathop{\mathbb{E}}_{S}\big[ 1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)} \big]
= \mathop{\mathbb{E}}_{S}\big[ 1_{h_{S - \{x\}}(x) \ne f(x)} \big]
= \mathop{\mathbb{E}}_{S \sim D^{m-1}}\Big[ \mathop{\mathbb{E}}_{x \sim D}\big[ 1_{h_{S}(x) \ne f(x)} \big] \Big]
= \mathop{\mathbb{E}}_{S \sim D^{m-1}}\big[ R(h_S) \big].$$

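The definition translates directly into code. In this sketch, `train` is a hypothetical helper returning a classifier given a sample, and the labels $y_i$ play the role of $f(x_i)$.

```python
# Leave-one-out error: train on S - {x_i}, test on x_i, average over i.
import numpy as np

def leave_one_out_error(train, X: np.ndarray, y: np.ndarray) -> float:
    m = len(y)
    errors = 0
    for i in range(m):
        keep = np.arange(m) != i                 # indices of S - {x_i}
        h = train(X[keep], y[keep])              # h_{S - {x_i}}
        errors += int(h(X[i]) != y[i])           # 1_{h(x_i) != f(x_i)}
    return errors / m
```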
Leave-One-Out Analysis
Theorem: let $h_S$ be the optimal hyperplane for a sample $S$ and let $N_{SV}(S)$ be the number of support vectors defining $h_S$. Then,
$$\mathop{\mathbb{E}}_{S \sim D^m}\big[ R(h_S) \big] \le \mathop{\mathbb{E}}_{S \sim D^{m+1}}\Big[ \frac{N_{SV}(S)}{m+1} \Big].$$
Proof: let $S \sim D^{m+1}$ be a linearly separable sample and let $x \in S$. If $h_{S - \{x\}}$ misclassifies $x$, then $x$ must be a SV for $h_S$. Thus,
$$R_{\mathrm{loo}}(\text{opt.-hyp.}) \le \frac{N_{SV}(S)}{m+1}.$$

Notes
• Bound on the expectation of the error only, not on the probability of error.
• Argument based on sparsity (number of support vectors). We will see later other arguments in support of the optimal hyperplanes, based on the concept of margin.

This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees
Support Vector Machines
(Cortes and Vapnik, 1995)
Problem: data are often not linearly separable in practice. For any hyperplane, there exists $x_i$ such that
$$y_i \, [w \cdot x_i + b] \ngeq 1.$$
Idea: relax the constraints using slack variables $\xi_i \ge 0$:
$$y_i \, [w \cdot x_i + b] \ge 1 - \xi_i.$$

Soft-Margin Hyperplanes
[Figure: soft-margin hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = \pm 1$; slack variables $\xi_i$, $\xi_j$ measure the margin violations.]

Support vectors: points along the margin or outliers.
Soft margin: $\rho = 1 / \|w\|$.

Optimization Problem
(Cortes and Vapnik, 1995)
Constrained optimization:
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1 - \xi_i \;\wedge\; \xi_i \ge 0, \; i \in [1, m].$$
Properties:
• $C \ge 0$: trade-off parameter.
• Convex optimization.
• Unique solution.

Notes
• Parameter $C$: trade-off between maximizing the margin and minimizing the training error. How do we determine $C$?
• The general problem of determining a hyperplane minimizing the error on the training set is NP-complete (as a function of the dimension).
• Other convex functions of the slack variables could be used: this choice and a similar one with squared slack variables lead to a convenient formulation and solution.

SVM - Equivalent Problem
Optimization:
$$\min_{w, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \big( 1 - y_i (w \cdot x_i + b) \big)_+.$$
Loss functions:
• hinge loss: $L(h(x), y) = (1 - y h(x))_+$.
• quadratic hinge loss: $L(h(x), y) = (1 - y h(x))_+^2$.

Hinge Loss
[Figure: the 0/1 loss, the hinge loss, and the ‘quadratic’ hinge loss plotted as functions of $y h(x)$.]

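For reference, the three losses compared in the figure can be written as one-liners on the margin value $y h(x)$; the helper names below are mine, not the lecture's.

```python
import numpy as np

def zero_one_loss(margin):                 # 1_{y h(x) <= 0}
    return (margin <= 0).astype(float)

def hinge_loss(margin):                    # (1 - y h(x))_+
    return np.maximum(0.0, 1.0 - margin)

def quadratic_hinge_loss(margin):          # (1 - y h(x))_+^2
    return np.maximum(0.0, 1.0 - margin) ** 2
```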
SVMs Equations
Lagrangian: for all $w$, $b$, $\alpha_i \ge 0$, $\beta_i \ge 0$,
$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \big[ y_i (w \cdot x_i + b) - 1 + \xi_i \big] - \sum_{i=1}^{m} \beta_i \xi_i.$$
KKT conditions:
$$\nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^{m} \alpha_i y_i x_i.$$
$$\nabla_b \mathcal{L} = -\sum_{i=1}^{m} \alpha_i y_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$
$$\nabla_{\xi_i} \mathcal{L} = C - \alpha_i - \beta_i = 0 \quad \Longrightarrow \quad \alpha_i + \beta_i = C.$$
$$\forall i \in [1, m], \quad \alpha_i \big[ y_i (w \cdot x_i + b) - 1 + \xi_i \big] = 0 \;\wedge\; \beta_i \xi_i = 0.$$

Support Vectors
Complementarity conditions:
$$\alpha_i \big[ y_i (w \cdot x_i + b) - 1 + \xi_i \big] = 0 \quad \Longrightarrow \quad \alpha_i = 0 \;\vee\; y_i (w \cdot x_i + b) = 1 - \xi_i.$$
Support vectors: vectors $x_i$ such that
$$\alpha_i \ne 0 \;\wedge\; y_i (w \cdot x_i + b) = 1 - \xi_i.$$
• Note: support vectors are not unique.

Moving to The Dual
Plugging in the expression of w in L gives:
$$\mathcal{L} = \underbrace{\frac{1}{2} \Big\| \sum_{i=1}^{m} \alpha_i y_i x_i \Big\|^2}_{\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} \;-\; \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \;-\; \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{0} \;+\; \sum_{i=1}^{m} \alpha_i.$$
Thus,
$$\mathcal{L} = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).$$
The condition $\beta_i \ge 0$ is equivalent to $\alpha_i \le C$.

Dual Optimization Problem
Constrained optimization:
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{subject to: } 0 \le \alpha_i \le C \;\wedge\; \sum_{i=1}^{m} \alpha_i y_i = 0, \; i \in [1, m].$$
Solution:
$$h(x) = \mathrm{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b \Big),$$
with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i)$ for any $x_i$ with $0 < \alpha_i < C$.

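In practice one rarely writes this QP by hand; as a quick sanity check, scikit-learn's SVC with a linear kernel solves exactly this box-constrained dual. The library choice and the toy data below are my own, not part of the lecture.

```python
# Soft-margin SVM via scikit-learn: dual variables satisfy 0 <= alpha_i <= C.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 1.], [1., 0.], [2., 2.], [2., 3.], [3., 2.]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print(clf.coef_, clf.intercept_)     # w and b
print(clf.support_)                  # indices of the support vectors
```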
This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees
High-Dimension
Learning guarantees: for hyperplanes in dimension $N$, with probability at least $1 - \delta$,
$$R(h) \le \widehat{R}(h) + \sqrt{\frac{2(N+1) \log \frac{em}{N+1}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
• the bound is uninformative for $N \gg m$.
• but SVMs have been remarkably successful in high dimension.
• can we provide a theoretical justification?
• analysis of the underlying scoring function.

Rademacher Complexity of Linear Hypotheses
Theorem: let $S \subseteq \{x : \|x\| \le R\}$ be a sample of size $m$ and let $H = \{x \mapsto w \cdot x : \|w\| \le \Lambda\}$. Then,
$$\widehat{\mathfrak{R}}_S(H) \le \sqrt{\frac{R^2 \Lambda^2}{m}}.$$
Proof:
$$\widehat{\mathfrak{R}}_S(H)
= \frac{1}{m} \mathop{\mathbb{E}}_{\sigma}\Big[ \sup_{\|w\| \le \Lambda} w \cdot \sum_{i=1}^{m} \sigma_i x_i \Big]
\le \frac{\Lambda}{m} \mathop{\mathbb{E}}_{\sigma}\Big[ \Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\| \Big]
\le \frac{\Lambda}{m} \Big[ \mathop{\mathbb{E}}_{\sigma} \Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\|^2 \Big]^{1/2}
\le \frac{\Lambda}{m} \Big[ \sum_{i=1}^{m} \|x_i\|^2 \Big]^{1/2}
\le \frac{\Lambda \sqrt{m R^2}}{m}
= \sqrt{\frac{R^2 \Lambda^2}{m}}.$$

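Since the supremum over $\|w\| \le \Lambda$ is attained at $w = \Lambda v / \|v\|$ with $v = \sum_i \sigma_i x_i$, the empirical Rademacher complexity equals $\frac{\Lambda}{m} \mathbb{E}_\sigma \|\sum_i \sigma_i x_i\|$ and can be estimated by Monte Carlo. The sketch below (synthetic data and parameter choices are mine) compares the estimate to the bound $\sqrt{R^2 \Lambda^2 / m}$.

```python
# Monte Carlo estimate of hat{R}_S(H) for H = {x -> w.x : ||w|| <= Lam},
# compared to the bound sqrt(R^2 Lam^2 / m). Synthetic data on the unit sphere.
import numpy as np

rng = np.random.default_rng(0)
m, N, Lam = 200, 50, 1.0
X = rng.normal(size=(m, N))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # ||x_i|| = 1, so R = 1
R = np.linalg.norm(X, axis=1).max()

sigma = rng.choice([-1.0, 1.0], size=(10_000, m))    # Rademacher draws
estimate = (Lam / m) * np.linalg.norm(sigma @ X, axis=1).mean()
bound = np.sqrt(R**2 * Lam**2 / m)
print(estimate, bound)                                # estimate <= bound
```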
Confidence Margin
Definition: the confidence margin of a real-valued function $h$ at $(x, y) \in X \times Y$ is $\rho_h(x, y) = y h(x)$.
• interpreted as the hypothesis' confidence in its prediction.
• if the point is correctly classified, it coincides with $|h(x)|$.
• relationship with the geometric margin for linear functions $h\colon x \mapsto w \cdot x + b$: for $x$ in the sample, $|\rho_h(x, y)| \ge \rho_{\mathrm{geom}} \|w\|$.

Confidence Margin Loss
Definition: for any confidence margin parameter $\rho > 0$, the $\rho$-margin loss function $\Phi_\rho$ is defined by
$$\Phi_\rho(y h(x)) =
\begin{cases}
1 & \text{if } y h(x) \le 0, \\
1 - \frac{y h(x)}{\rho} & \text{if } 0 \le y h(x) \le \rho, \\
0 & \text{if } y h(x) \ge \rho.
\end{cases}$$
[Figure: $\Phi_\rho$ as a function of $y h(x)$: equal to 1 for $y h(x) \le 0$, decreasing linearly on $[0, \rho]$, and 0 beyond $\rho$.]
For a sample $S = (x_1, \ldots, x_m)$ and a real-valued hypothesis $h$, the empirical margin loss is
$$\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho\big( y_i h(x_i) \big) \le \frac{1}{m} \sum_{i=1}^{m} 1_{y_i h(x_i) < \rho}.$$

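A short sketch of $\Phi_\rho$ and of the empirical margin loss; the piecewise-linear expression used below is the one depicted in the figure, and the function names are mine.

```python
# rho-margin loss Phi_rho and empirical margin loss hat{R}_rho(h).
import numpy as np

def margin_loss(t, rho):
    # 1 for t <= 0, linear 1 - t/rho on [0, rho], 0 for t >= rho
    return np.clip(1.0 - t / rho, 0.0, 1.0)

def empirical_margin_loss(h, X, y, rho):
    margins = y * h(X)                           # confidence margins y_i h(x_i)
    return margin_loss(margins, rho).mean()      # <= fraction with y_i h(x_i) < rho
```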
General Margin Bound
Theorem: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \, \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$$
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \, \widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Proof: let $\widetilde{H} = \{z = (x, y) \mapsto y h(x) : h \in H\}$. Consider the family of functions taking values in $[0, 1]$:
$$\overline{H} = \{\Phi_\rho \circ f : f \in \widetilde{H}\}.$$
• By the theorem of Lecture 3, with probability at least $1 - \delta$, for all $g \in \overline{H}$,
$$\mathbb{E}[g(z)] \le \frac{1}{m} \sum_{i=1}^{m} g(z_i) + 2 \, \mathfrak{R}_m(\overline{H}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
• Thus,
$$\mathbb{E}\big[ \Phi_\rho(y h(x)) \big] \le \widehat{R}_\rho(h) + 2 \, \mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
• Since $\Phi_\rho$ is $\frac{1}{\rho}$-Lipschitz, by Talagrand's lemma,
$$\mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) \le \frac{1}{\rho} \, \mathfrak{R}_m(\widetilde{H}) = \frac{1}{\rho} \cdot \frac{1}{m} \mathop{\mathbb{E}}_{\sigma, S}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i y_i h(x_i) \Big] = \frac{1}{\rho} \, \mathfrak{R}_m(H).$$
• Since $1_{y h(x) < 0} \le \Phi_\rho(y h(x))$, this shows the first statement, and similarly the second one.

Margin Bound - Linear Classifiers
Corollary: let $\rho > 0$ and $H = \{x \mapsto w \cdot x : \|w\| \le \Lambda\}$. Assume that $X \subseteq \{x : \|x\| \le R\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H$,
$$R(h) \le \widehat{R}_\rho(h) + 2 \sqrt{\frac{R^2 \Lambda^2 / \rho^2}{m}} + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Proof: follows directly from the general margin bound and the bound on $\widehat{\mathfrak{R}}_S(H)$ for linear classifiers.

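To get a feel for the corollary, one can plug in numbers; the values below are arbitrary and only illustrate how the three terms trade off.

```python
# Evaluate R_rho(h) + 2*sqrt(R^2 Lam^2 / rho^2 / m) + 3*sqrt(log(2/delta)/(2m)).
import numpy as np

def margin_bound(emp_margin_loss, R, Lam, rho, m, delta):
    complexity = 2.0 * np.sqrt((R**2 * Lam**2 / rho**2) / m)
    confidence = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * m))
    return emp_margin_loss + complexity + confidence

print(margin_bound(0.05, R=1.0, Lam=1.0, rho=0.5, m=10_000, delta=0.05))
```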
High-Dimensional Feature Space
Observations:
• the generalization bound does not depend on the dimension but on the margin.
• this suggests seeking a large-margin separating hyperplane in a higher-dimensional feature space.

Computational problems:
• taking dot products in a high-dimensional feature space can be very costly.
• solution based on kernels (next lecture).

References
• Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.
• Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.
• Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer, New York, 1991.
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Appendix
Saddle Point
Let $(w^*, b^*, \alpha^*)$ be the saddle point of the Lagrangian. Multiplying both sides of the equation giving $b^*$ by $\alpha_i^* y_i$ and taking the sum leads to:
$$\sum_{i=1}^{m} \alpha_i^* y_i b^* = \sum_{i=1}^{m} \alpha_i^* y_i^2 - \sum_{i,j=1}^{m} \alpha_i^* \alpha_j^* y_i y_j (x_i \cdot x_j).$$
Using $y_i^2 = 1$, $\sum_{i=1}^{m} \alpha_i^* y_i = 0$, and $w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i$ yields
$$0 = \sum_{i=1}^{m} \alpha_i^* - \|w^*\|^2.$$
Thus, the margin is also given by:
$$\rho^2 = \frac{1}{\|w^*\|^2} = \frac{1}{\|\alpha^*\|_1}.$$

Talagrand’s Contraction Lemma
(Ledoux and Talagrand, 1991; pp. 112-114)
Theorem: let $\Phi\colon \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz function. Then, for any hypothesis set $H$ of real-valued functions,
$$\widehat{\mathfrak{R}}_S(\Phi \circ H) \le L \, \widehat{\mathfrak{R}}_S(H).$$
Proof: fix a sample $S = (x_1, \ldots, x_m)$. By definition,
$$\widehat{\mathfrak{R}}_S(\Phi \circ H)
= \frac{1}{m} \mathop{\mathbb{E}}_{\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i (\Phi \circ h)(x_i) \Big]
= \frac{1}{m} \mathop{\mathbb{E}}_{\sigma_1, \ldots, \sigma_{m-1}}\Big[ \mathop{\mathbb{E}}_{\sigma_m}\big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m) \big] \Big],$$
with $u_{m-1}(h) = \sum_{i=1}^{m-1} \sigma_i (\Phi \circ h)(x_i)$.

Talagrand’s Contraction Lemma
Now, assuming that the suprema are reached, there exist $h_1, h_2 \in H$ such that
$$\begin{aligned}
\mathop{\mathbb{E}}_{\sigma_m}\big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m) \big]
&= \tfrac{1}{2} \big[ u_{m-1}(h_1) + (\Phi \circ h_1)(x_m) \big] + \tfrac{1}{2} \big[ u_{m-1}(h_2) - (\Phi \circ h_2)(x_m) \big] \\
&\le \tfrac{1}{2} \big[ u_{m-1}(h_1) + u_{m-1}(h_2) + s L \big( h_1(x_m) - h_2(x_m) \big) \big] \\
&= \tfrac{1}{2} \big[ u_{m-1}(h_1) + s L h_1(x_m) \big] + \tfrac{1}{2} \big[ u_{m-1}(h_2) - s L h_2(x_m) \big] \\
&\le \mathop{\mathbb{E}}_{\sigma_m}\big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m L h(x_m) \big],
\end{aligned}$$
where $s = \mathrm{sgn}\big( h_1(x_m) - h_2(x_m) \big)$.

Talagrand’s Contraction Lemma
When the suprema are not reached, the same inequality can be shown modulo $\epsilon$, followed by $\epsilon \to 0$. Proceeding similarly for the other $\sigma_i$'s directly leads to the result.

VC Dimension of Canonical Hyperplanes
Theorem: let $S \subseteq \{x : \|x\| \le R\}$. Then, the VC dimension $d$ of the set of canonical hyperplanes
$$\{x \mapsto \mathrm{sgn}(w \cdot x) : \min_{x \in S} |w \cdot x| = 1 \;\wedge\; \|w\| \le \Lambda\}$$
verifies
$$d \le R^2 \Lambda^2.$$
Proof: let $\{x_1, \ldots, x_d\}$ be a set that is fully shattered. Then, for all $y = (y_1, \ldots, y_d) \in \{-1, +1\}^d$, there exists $w$ such that
$$\forall i \in [1, d], \quad 1 \le y_i (w \cdot x_i).$$

• Summing up the inequalities gives
$$d \le w \cdot \sum_{i=1}^{d} y_i x_i \le \|w\| \, \Big\| \sum_{i=1}^{d} y_i x_i \Big\| \le \Lambda \, \Big\| \sum_{i=1}^{d} y_i x_i \Big\|.$$
• Taking the expectation over $y \sim U$ (uniform) yields
$$d \le \Lambda \mathop{\mathbb{E}}_{y \sim U}\Big[ \Big\| \sum_{i=1}^{d} y_i x_i \Big\| \Big]
\le \Lambda \Big[ \mathop{\mathbb{E}}_{y \sim U} \Big\| \sum_{i=1}^{d} y_i x_i \Big\|^2 \Big]^{1/2} \qquad \text{(Jensen's ineq.)}$$
$$= \Lambda \Big[ \sum_{i,j=1}^{d} \mathbb{E}[y_i y_j] \, (x_i \cdot x_j) \Big]^{1/2}
= \Lambda \Big[ \sum_{i=1}^{d} (x_i \cdot x_i) \Big]^{1/2}
\le \Lambda \, (d R^2)^{1/2} = \Lambda R \sqrt{d}.$$
• Thus, $\sqrt{d} \le \Lambda R$, that is, $d \le R^2 \Lambda^2$.