Lecture 9: Support Vector Machines
Advanced Topics in Machine Learning: COMPGI13
Arthur Gretton
Gatsby Unit, CSML, UCL
March 15, 2016
Overview
The representer theorem
Review of convex optimization
Support vector classification, the C-SV and ν-SV machines
Representer theorem
Learning problem: setting
Given a set of paired observations (x_1, y_1), …, (x_n, y_n) (regression or classification).
Find the function f* in the RKHS H which satisfies

    f* = argmin_{f ∈ H} J(f),                                          (1)

where

    J(f) = L_y(f(x_1), …, f(x_n)) + Ω(‖f‖²_H),

Ω is non-decreasing, y is the vector of the y_i, and the loss L depends on the x_i only via f(x_i).

Classification: L_y(f(x_1), …, f(x_n)) = Σ_{i=1}^n I[y_i f(x_i) ≤ 0]
Regression: L_y(f(x_1), …, f(x_n)) = Σ_{i=1}^n (y_i − f(x_i))²
Representer theorem
The representer theorem: a solution to

    min_{f ∈ H} [ L_y(f(x_1), …, f(x_n)) + Ω(‖f‖²_H) ]

takes the form

    f* = Σ_{i=1}^n α_i k(x_i, ·).

If Ω is strictly increasing, the solution must have this form.
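As a concrete illustration of the theorem (an addition, not from the slides): for the regression loss above with Ω(t) = λt, the representer theorem reduces the search over all of H to the n coefficients α, which for the squared loss have the closed form α = (K + λI)⁻¹ y (kernel ridge regression). The sketch below assumes a Gaussian kernel and synthetic data; the function names and parameter values are placeholders.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def fit_representer_coeffs(X, y, lam=0.1, sigma=1.0):
    # Squared loss + lam * ||f||_H^2  =>  alpha = (K + lam I)^{-1} y
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_test, sigma=1.0):
    # f*(x) = sum_i alpha_i k(x_i, x)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_representer_coeffs(X, y, lam=0.1)
print(predict(X, alpha, X[:5]))
```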
Representer theorem: proof
Proof: Denote by f_s the projection of f onto the subspace

    span{k(x_i, ·) : 1 ≤ i ≤ n},

such that

    f = f_s + f_⊥,

where f_s = Σ_{i=1}^n α_i k(x_i, ·).
Regularizer:

    ‖f‖²_H = ‖f_s‖²_H + ‖f_⊥‖²_H ≥ ‖f_s‖²_H,                           (2)

then

    Ω(‖f‖²_H) ≥ Ω(‖f_s‖²_H),

so this term is minimized for f = f_s.
Representer theorem: proof
Proof (cont.): Individual terms f(x_i) in the loss:

    f(x_i) = ⟨f, k(x_i, ·)⟩_H = ⟨f_s + f_⊥, k(x_i, ·)⟩_H = ⟨f_s, k(x_i, ·)⟩_H,

so

    L_y(f(x_1), …, f(x_n)) = L_y(f_s(x_1), …, f_s(x_n)).

Hence the loss L(…) depends only on the component of f in the data subspace, while the regularizer Ω(…) is minimized when f = f_s.
If Ω is non-decreasing, then ‖f_⊥‖_H = 0 is a minimum. If Ω is strictly increasing, the minimum is unique.
Short overview of convex optimization
Why we need optimization: SVM idea

Classify two clouds of points, where there exists a hyperplane which linearly separates one cloud from the other without error.
The smallest distance from each class to the separating hyperplane w^⊤x + b is called the margin.
Why we need optimization: SVM idea
This problem can be expressed as follows:

    max_{w,b} (margin) = max_{w,b} 2/‖w‖    or    min_{w,b} ‖w‖²      (3)

subject to

    w^⊤x_i + b ≥ +1    for i : y_i = +1,
    w^⊤x_i + b ≤ −1    for i : y_i = −1.                               (4)

This is a convex optimization problem.
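One way to see this problem in action (an addition, not part of the slides): scikit-learn's SVC with a linear kernel and a very large C approximates the hard-margin problem (3)–(4). The dataset and the value C = 1e6 below are arbitrary placeholder choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two linearly separable clouds of points.
X_pos = rng.normal(loc=[+2, +2], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

# Large C approximates the hard-margin problem min ||w||^2 s.t. y_i(w^T x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```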
Convex set
(Figure from Boyd and Vandenberghe)
Leftmost set is convex, remaining two are not.
Every point in the set can be seen from any other point in the set,
along a straight line that never leaves the set.
Definition
C is convex if for all x_1, x_2 ∈ C and any 0 ≤ θ ≤ 1 we have θx_1 + (1 − θ)x_2 ∈ C, i.e. every point on the line between x_1 and x_2 lies in C.
Convex function: no local optima
(Figure from Boyd and Vandenberghe)
Definition (Convex function)
A function f is convex if its domain dom f is a convex set and if for all x, y ∈ dom f and any 0 ≤ θ ≤ 1,

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y).

The function is strictly convex if the inequality is strict for x ≠ y.
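As a quick numerical illustration (an addition, not from the slides), the sketch below checks the defining inequality on random pairs of points for f(x) = ‖x‖², which is convex; the sample sizes and tolerance are arbitrary choices, and such a check can only falsify convexity, not prove it.

```python
import numpy as np

def is_convex_on_samples(f, dim=3, trials=10000, seed=0, tol=1e-9):
    # Empirically check f(theta x + (1-theta) y) <= theta f(x) + (1-theta) f(y).
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        theta = rng.uniform()
        if f(theta * x + (1 - theta) * y) > theta * f(x) + (1 - theta) * f(y) + tol:
            return False
    return True

print(is_convex_on_samples(lambda z: np.dot(z, z)))    # True: ||z||^2 is convex
print(is_convex_on_samples(lambda z: -np.dot(z, z)))   # False: -||z||^2 is concave
```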
Optimization and the Lagrangian

(Generic) optimization problem on x ∈ R^n:

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,   i = 1, …, m                              (5)
                h_i(x) = 0,   i = 1, …, p.

p* is the optimal value of (5); the domain D is assumed nonempty, where
D := ∩_{i=0}^m dom f_i ∩ ∩_{i=1}^p dom h_i (dom f_i = subset of R^n where f_i is defined).

Ideally we would want an unconstrained problem,

    minimize    f_0(x) + Σ_{i=1}^m I_−(f_i(x)) + Σ_{i=1}^p I_0(h_i(x)),

where I_−(u) = 0 for u ≤ 0, I_−(u) = ∞ for u > 0, and I_0(u) is the indicator of 0.
Why is this hard to solve?
Lower bound interpretation of Lagrangian
The Lagrangian L : R^n × R^m × R^p → R is a lower bound on the original problem:

    L(x, λ, ν) := f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x),

where λ_i f_i(x) ≤ I_−(f_i(x)) and ν_i h_i(x) ≤ I_0(h_i(x)), and L has domain dom L := D × R^m × R^p.
The vectors λ and ν are called Lagrange multipliers or dual variables.
To ensure a lower bound, we require λ ⪰ 0 (i.e. λ_i ≥ 0 for each i).
Lagrange dual: lower bound on optimum p*

The Lagrange dual function: minimize the Lagrangian over x.
When λ ⪰ 0, the Lagrange dual function is

    g(λ, ν) := inf_{x ∈ D} L(x, λ, ν).                                 (6)

A dual feasible pair (λ, ν) is a pair for which λ ⪰ 0 and (λ, ν) ∈ dom(g).
We will show (next slide): for any λ ⪰ 0 and ν,

    g(λ, ν) ≤ f_0(x)    wherever    f_i(x) ≤ 0, h_i(x) = 0

(including at f_0(x*) = p*).
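As a worked illustration (an addition, not from the slides), take the one-dimensional problem: minimize f_0(x) = x² subject to f_1(x) = 1 − x ≤ 0, with optimum p* = 1 at x* = 1. Then

    L(x, λ) = x² + λ(1 − x),
    g(λ) = inf_x L(x, λ) = λ − λ²/4    (the infimum is attained at x = λ/2).

For every λ ≥ 0 we have g(λ) = λ − λ²/4 ≤ 1 = p*, as claimed above, and maximizing g over λ ≥ 0 gives λ* = 2 with g(2) = 1 = p*, so for this convex problem the bound is tight.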
Lagrange dual: lower bound on optimum p*

Simplest example: minimize over x the function

    L(x, λ) = f_0(x) + λ f_1(x).

(Figure from Boyd and Vandenberghe)
Reminders: f_0 is the function to be minimized; f_1 ≤ 0 is the inequality constraint; λ ≥ 0 is the Lagrange multiplier; p* is the minimum of f_0 in the constraint set.
Lagrange dual: lower bound on optimum p*

When λ ⪰ 0, then for all ν we have

    g(λ, ν) ≤ p*.                                                      (7)

A dual feasible pair (λ, ν) is a pair for which λ ⪰ 0 and (λ, ν) ∈ dom(g).
(Figure from Boyd and Vandenberghe)
Reminders: g(λ, ν) := inf_{x ∈ D} L(x, λ, ν); λ ≥ 0 is the Lagrange multiplier; p* is the minimum of f_0 in the constraint set.
Lagrange dual is lower bound on p* (proof)

We now give a formal proof that the Lagrange dual function g(λ, ν) lower bounds p*.
Proof: Assume x̃ is feasible, i.e. f_i(x̃) ≤ 0, h_i(x̃) = 0, x̃ ∈ D, and λ ⪰ 0. Then

    Σ_{i=1}^m λ_i f_i(x̃) + Σ_{i=1}^p ν_i h_i(x̃) ≤ 0.

Thus

    g(λ, ν) := inf_{x ∈ D} ( f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{i=1}^p ν_i h_i(x) )
             ≤ f_0(x̃) + Σ_{i=1}^m λ_i f_i(x̃) + Σ_{i=1}^p ν_i h_i(x̃)
             ≤ f_0(x̃).

This holds for every feasible x̃, hence the lower bound holds.
Best lower bound: maximize the dual

Best lower bound g(λ, ν) on the optimal solution p* of (5):

Lagrange dual problem:

    maximize    g(λ, ν)
    subject to  λ ⪰ 0.                                                 (8)

Dual feasible: (λ, ν) with λ ⪰ 0 and g(λ, ν) > −∞.
Dual optimal: solutions (λ*, ν*) to the dual problem; d* is the optimal value (the dual is always easy to maximize: next slide).
Weak duality always holds:

    d* ≤ p*.

...but what is the point of finding a best (largest) lower bound on a minimization problem?
Strong duality (does not always hold; conditions given later):

    d* = p*.

If strong duality holds: solve the easy (concave) dual problem to find p*.
Maximizing the dual is always easy
The Lagrange dual function: minimize the Lagrangian (lower bound),

    g(λ, ν) = inf_{x ∈ D} L(x, λ, ν).

The dual function is a pointwise infimum of affine functions of (λ, ν), hence concave in (λ, ν), with convex constraint set λ ⪰ 0.

Example: one inequality constraint,

    L(x, λ) = f_0(x) + λ f_1(x),

and assume there are only four possible values for x. Each line represents a different x.
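A minimal numerical version of this four-point example (an addition, not from the slides): evaluate the pointwise minimum over the four affine functions on a grid of λ values and check concavity via second differences. The values of f_0 and f_1 at the four candidate points are made up.

```python
import numpy as np

# Toy version of the slide's example: four candidate x values, one inequality constraint.
# Each candidate x gives an affine function of lambda: L(x, lam) = f0(x) + lam * f1(x).
f0_vals = np.array([3.0, 2.0, 1.5, 4.0])    # f0(x) at the four candidates (made up)
f1_vals = np.array([-1.0, 0.5, 1.0, -2.0])  # f1(x) at the same candidates (made up)

lams = np.linspace(0.0, 5.0, 501)
# g(lam) = min over candidates of the affine functions -> pointwise infimum.
g = np.min(f0_vals[None, :] + lams[:, None] * f1_vals[None, :], axis=1)

# A pointwise minimum of affine functions is concave: second differences are <= 0.
second_diff = np.diff(g, 2)
print("concave (numerically):", np.all(second_diff <= 1e-9))
```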
How do we know if strong duality holds?

Conditions under which strong duality holds are called constraint qualifications (they are sufficient, but not necessary).
(Probably) the best known sufficient condition: strong duality holds if
the primal problem is convex, i.e. of the form

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,   i = 1, …, m                              (10)
                Ax = b

for convex f_0, …, f_m, and
Slater's condition holds: there exists some strictly feasible point [1] x̃ ∈ relint(D) such that

    f_i(x̃) < 0,   i = 1, …, m,        Ax̃ = b.

Slater's condition for the case of affine f_i is trivial (the inequality constraints are no longer required to be strict, so it reduces to the original inequality constraints):

    f_i(x̃) ≤ 0,   i = 1, …, m,        Ax̃ = b.

[1] We denote by relint(D) the relative interior of the set D. This looks like the interior of the set, but is non-empty even when the set is a subspace of a larger space.
A consequence of strong duality...

Assume the primal optimum is equal to the dual optimum. What are the consequences?
x* is a solution of the original problem (minimum of f_0 under the constraints), and (λ*, ν*) are solutions to the dual. Then

    f_0(x*) = g(λ*, ν*)                                                                   (assumed)
            = inf_{x ∈ D} ( f_0(x) + Σ_{i=1}^m λ*_i f_i(x) + Σ_{i=1}^p ν*_i h_i(x) )      (g definition)
            ≤ f_0(x*) + Σ_{i=1}^m λ*_i f_i(x*) + Σ_{i=1}^p ν*_i h_i(x*)                   (inf definition)
            ≤ f_0(x*).                                                                    (feasibility)

(feasibility): (x*, λ*, ν*) satisfies λ* ⪰ 0, f_i(x*) ≤ 0, and h_i(x*) = 0.
...is complementary slackness
From the previous slide,

    Σ_{i=1}^m λ*_i f_i(x*) = 0,                                        (11)

which is the condition of complementary slackness. This means

    λ*_i > 0      ⟹   f_i(x*) = 0,
    f_i(x*) < 0   ⟹   λ*_i = 0.

From the λ_i, read off which inequality constraints are strict.
KKT conditions for global optimum

Assume the functions f_i, h_i are differentiable and that strong duality holds. Since x* minimizes L(x, λ*, ν*), the derivative at x* is zero,

    ∇f_0(x*) + Σ_{i=1}^m λ*_i ∇f_i(x*) + Σ_{i=1}^p ν*_i ∇h_i(x*) = 0.

KKT conditions definition: we are at the global optimum, (x, λ, ν) = (x*, λ*, ν*), when (a) strong duality holds, and (b)

    f_i(x) ≤ 0,      i = 1, …, m
    h_i(x) = 0,      i = 1, …, p
    λ_i ≥ 0,         i = 1, …, m
    λ_i f_i(x) = 0,  i = 1, …, m
    ∇f_0(x) + Σ_{i=1}^m λ_i ∇f_i(x) + Σ_{i=1}^p ν_i ∇h_i(x) = 0.
KKT conditions for global optimum
In summary: if
the primal problem is convex, and
the constraint functions satisfy Slater's condition,
then strong duality holds. If in addition
the functions f_i, h_i are differentiable,
then the KKT conditions are necessary and sufficient for optimality.
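Tying this back to the worked example given after the dual function definition (an addition, not from the slides): for minimize x² subject to 1 − x ≤ 0, the point x* = 1 with multiplier λ* = 2 can be checked against the KKT conditions directly.

```python
# KKT check for: minimize x^2 subject to f1(x) = 1 - x <= 0 (the earlier worked example).
x_star, lam_star = 1.0, 2.0

f1 = 1.0 - x_star                 # constraint value at x*
grad_f0 = 2.0 * x_star            # gradient of f0(x) = x^2
grad_f1 = -1.0                    # gradient of f1(x) = 1 - x

assert f1 <= 0.0                                    # primal feasibility
assert lam_star >= 0.0                              # dual feasibility
assert abs(lam_star * f1) < 1e-12                   # complementary slackness
assert abs(grad_f0 + lam_star * grad_f1) < 1e-12    # stationarity
print("KKT conditions hold at x* = 1, lambda* = 2")
```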
Support vector classification
Linearly separable points
Classify two clouds of points, where there exists a hyperplane which
linearly separates one cloud from the other without error.
The smallest distance from each class to the separating hyperplane w^⊤x + b is called the margin.
Maximum margin classifier, linearly separable case

This problem can be expressed as follows:

    max_{w,b} (margin) = max_{w,b} 2/‖w‖                               (12)

subject to

    min_{i : y_i = +1} (w^⊤x_i + b) = +1,
    max_{i : y_i = −1} (w^⊤x_i + b) = −1.                              (13)

The resulting classifier is

    y = sign(w^⊤x + b).

We can rewrite to obtain

    max_{w,b} 1/‖w‖    or    min_{w,b} ‖w‖²
    subject to  y_i(w^⊤x_i + b) ≥ 1.                                   (14)
Maximum margin classifier: with errors allowed

Allow "errors": points within the margin, or even on the wrong side of the decision boundary. Ideally:

    min_{w,b} ( (1/2)‖w‖² + C Σ_{i=1}^n I[y_i(w^⊤x_i + b) < 0] ),

where C controls the tradeoff between maximum margin and loss.
Replace with a convex upper bound:

    min_{w,b} ( (1/2)‖w‖² + C Σ_{i=1}^n θ(y_i(w^⊤x_i + b)) ),

with the hinge loss

    θ(α) = (1 − α)_+ = { 1 − α   if 1 − α > 0,
                       { 0       otherwise.
Hinge loss
Hinge loss:

    θ(α) = (1 − α)_+ = { 1 − α   if 1 − α > 0,
                       { 0       otherwise.
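A one-line numpy version of the hinge loss (an addition, not from the slides), with made-up example scores standing in for w^⊤x_i + b:

```python
import numpy as np

def hinge_loss(y, scores):
    # theta(alpha) = max(1 - alpha, 0), with alpha = y * (w^T x + b)
    return np.maximum(1.0 - y * scores, 0.0)

y = np.array([+1, +1, -1, -1])
scores = np.array([2.0, 0.3, -0.5, 0.8])   # example values of w^T x + b
print(hinge_loss(y, scores))                # [0.  0.7 0.5 1.8]
```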
Support vector classification

Substituting in the hinge loss, we get

    min_{w,b} ( (1/2)‖w‖² + C Σ_{i=1}^n θ(y_i(w^⊤x_i + b)) ).

How do you implement the hinge loss with simple inequality constraints (for optimization)?

    min_{w,b,ξ} ( (1/2)‖w‖² + C Σ_{i=1}^n ξ_i )                        (15)

subject to [2]

    ξ_i ≥ 0,
    y_i(w^⊤x_i + b) ≥ 1 − ξ_i.

[2] Either y_i(w^⊤x_i + b) ≥ 1 and ξ_i = 0 as before, or y_i(w^⊤x_i + b) < 1, and then ξ_i > 0 takes the value satisfying y_i(w^⊤x_i + b) = 1 − ξ_i.
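Before turning to the dual, here is a minimal sketch (an addition, not the lecture's algorithm) that minimizes the unconstrained hinge-loss form above directly by subgradient descent; the constrained form (15) is what the Lagrangian treatment below works with. Data, step size and iteration count are arbitrary placeholder choices.

```python
import numpy as np

def svm_primal_subgradient(X, y, C=1.0, n_iters=2000, lr=1e-3):
    # Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)) by subgradient descent.
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        active = margins < 1                       # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 1.0, (30, 2)), rng.normal([-2, -2], 1.0, (30, 2))])
y = np.array([+1] * 30 + [-1] * 30)
w, b = svm_primal_subgradient(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```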
Does strong duality hold?

1. Is the optimization problem convex with respect to the variables w, b, ξ?

    minimize    f_0(w, b, ξ) := (1/2)‖w‖² + C Σ_{i=1}^n ξ_i
    subject to  f_i(w, b, ξ) := 1 − ξ_i − y_i(w^⊤x_i + b) ≤ 0,   i = 1, …, n
                Ax = b    (absent)

Each of f_0, f_1, …, f_n is convex.

2. Does Slater's condition hold? Trivially, since the inequality constraints are affine, and there always exists some (w, b, ξ) with

    ξ_i ≥ 0,    y_i(w^⊤x_i + b) ≥ 1 − ξ_i

(take each ξ_i large enough).

Thus strong duality holds; the problem is differentiable, hence the KKT conditions hold at the global optimum.
Support vector classification: Lagrangian
The Lagrangian:

    L(w, b, ξ, α, λ) = (1/2)‖w‖² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n α_i (1 − y_i(w^⊤x_i + b) − ξ_i) + Σ_{i=1}^n λ_i (−ξ_i),

with dual variable constraints

    α_i ≥ 0,    λ_i ≥ 0.

Minimize with respect to the primal variables w, b, and ξ.
Support vector classification: Lagrangian
Derivative with respect to w:

    ∂L/∂w = w − Σ_{i=1}^n α_i y_i x_i = 0    ⟹    w = Σ_{i=1}^n α_i y_i x_i.              (16)

Derivative with respect to b:

    ∂L/∂b = Σ_i y_i α_i = 0.                                           (17)

Derivative with respect to ξ_i:

    ∂L/∂ξ_i = C − α_i − λ_i = 0    ⟹    α_i = C − λ_i.                 (18)

Noting that λ_i ≥ 0,

    α_i ≤ C.
Support vector classification: Lagrangian

Now use complementary slackness:

Non-margin SVs: α_i = C ≠ 0:
1. We immediately have 1 − ξ_i = y_i(w^⊤x_i + b).
2. Also, from the condition α_i = C − λ_i, we have λ_i = 0 (hence we can have ξ_i > 0).

Margin SVs: 0 < α_i < C:
1. We again have 1 − ξ_i = y_i(w^⊤x_i + b).
2. This time, from α_i = C − λ_i, we have λ_i ≠ 0, hence ξ_i = 0.

Non-SVs: α_i = 0:
1. This time we can have y_i(w^⊤x_i + b) > 1 − ξ_i.
2. From α_i = C − λ_i, we have λ_i ≠ 0, hence ξ_i = 0.
The support vectors
We observe:
1. The solution is sparse: points which are neither on the margin nor "margin errors" have α_i = 0.
2. The support vectors: only those points on the decision boundary, or which are margin errors, contribute.
3. The influence of the non-margin SVs is bounded, since their weight cannot exceed C.
Support vector classification: dual function
Thus, our goal is to maximize the dual. Substituting w = Σ_{i=1}^n α_i y_i x_i (16), Σ_{i=1}^n y_i α_i = 0 (17), and α_i = C − λ_i (18) into the Lagrangian,

    g(α, λ) = (1/2)‖w‖² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n α_i (1 − y_i(w^⊤x_i + b) − ξ_i) + Σ_{i=1}^n λ_i (−ξ_i)

            = (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j − Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j + Σ_{i=1}^n α_i
              − b Σ_{i=1}^n α_i y_i + Σ_{i=1}^n (C − α_i − λ_i) ξ_i
              (the last two sums are zero, by (17) and (18))

            = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j.
Support vector classification: dual function
Maximize the dual,

    g(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j,

subject to the constraints

    0 ≤ α_i ≤ C,    Σ_{i=1}^n y_i α_i = 0.

This is a quadratic program.
Offset b: for the margin SVs, we have 1 = y_i(w^⊤x_i + b). Obtain b from any one of these, or take an average.
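As an illustration (an addition, not from the lecture), this quadratic program can be handed to a generic solver; the sketch below uses scipy's SLSQP method to maximize the dual on a toy linear dataset, recovers w from (16), and sets b from the margin SVs. The data, C, and tolerances are placeholder assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.8, (20, 2)), rng.normal([-2, -2], 0.8, (20, 2))])
y = np.array([+1.0] * 20 + [-1.0] * 20)
n, C = len(y), 1.0

Q = (y[:, None] * y[None, :]) * (X @ X.T)    # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                              # minimize the negative dual
    return 0.5 * a @ Q @ a - a.sum()

def neg_dual_grad(a):
    return Q @ a - np.ones(n)

res = minimize(neg_dual, np.zeros(n), jac=neg_dual_grad, method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                           # equation (16)
margin_sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
b = np.mean(y[margin_sv] - X[margin_sv] @ w)  # average over the margin SVs
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```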
Support vector classification: kernel version
[Figure 7.9, taken from Schoelkopf and Smola (2002): toy problem (task: separate circles from disks) solved using ν-SV classification, with parameter values ranging from ν = 0.1 (top left) to ν = 0.8 (bottom right). The larger we make ν, the more points are allowed to lie inside the margin (depicted by dotted lines). Results are shown for a Gaussian kernel, k(x, x′) = exp(−‖x − x′‖²).]
Support vector classification: kernel version
Maximum margin classifier in an RKHS: write the hinge loss formulation

    min_w ( (1/2)‖w‖²_H + C Σ_{i=1}^n θ(y_i ⟨w, k(x_i, ·)⟩_H) )

for the RKHS H with kernel k(x, ·). Use the result of the representer theorem,

    w(·) = Σ_{i=1}^n β_i k(x_i, ·).

Maximizing the margin is equivalent to minimizing ‖w‖²_H: for many RKHSs this is a smoothness constraint (e.g. Gaussian kernel).
Support vector classification: kernel version

Substituting and introducing the ξ_i variables, we get

    min_{β,ξ} ( (1/2) β^⊤Kβ + C Σ_{i=1}^n ξ_i ),                       (19)

where the matrix K has (i, j)th entry K_ij = k(x_i, x_j), subject to

    ξ_i ≥ 0,
    y_i Σ_{j=1}^n β_j k(x_i, x_j) ≥ 1 − ξ_i.

This is convex in β, ξ since K is positive definite: hence strong duality holds.
Dual:

    g(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j k(x_i, x_j),

subject to

    w(·) = Σ_{i=1}^n y_i α_i k(x_i, ·),    0 ≤ α_i ≤ C.
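A minimal sketch of solving this kernel dual (an addition, not the lecture's algorithm): since the bias term is absent here, the only constraints are the box constraints 0 ≤ α_i ≤ C, so simple projected gradient ascent suffices. The Gaussian kernel, data, learning rate and iteration count are placeholder choices.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_svm_dual(K, y, C=1.0, n_iters=5000, lr=1e-3):
    # Projected gradient ascent on g(alpha) = sum(alpha) - 0.5 alpha^T (yy^T * K) alpha,
    # with box constraints 0 <= alpha_i <= C (no bias term, as in the kernel dual above).
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        grad = np.ones_like(alpha) - Q @ alpha   # gradient of the dual
        alpha = np.clip(alpha + lr * grad, 0.0, C)
    return alpha

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1.5, 1.5], 0.6, (30, 2)), rng.normal([-1.5, -1.5], 0.6, (30, 2))])
y = np.array([+1.0] * 30 + [-1.0] * 30)

K = gaussian_kernel(X, X)
alpha = kernel_svm_dual(K, y, C=1.0)
# Decision function f(x) = sum_i y_i alpha_i k(x_i, x); evaluate on the training points.
f_train = (y * alpha) @ K
print("train accuracy:", np.mean(np.sign(f_train) == y))
print("number of support vectors:", int(np.sum(alpha > 1e-6)))
```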
Support vector classification: the ν-SVM
Another kind of SVM: the ν-SVM.
It is hard to interpret C. Modify the formulation to get a more intuitive parameter ν.
Again, we drop b for simplicity. Solve

    min_{w,ρ,ξ} ( (1/2)‖w‖² − νρ + (1/n) Σ_{i=1}^n ξ_i )

subject to

    ρ ≥ 0,    ξ_i ≥ 0,    y_i w^⊤x_i ≥ ρ − ξ_i

(we now directly adjust the margin width ρ).
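For practical use (an addition, not from the slides), ν-SV classification is available in standard libraries; the sketch below uses scikit-learn's NuSVC, whose nu parameter plays the role of ν above, and reports the fraction of support vectors, which can be compared with the bounds derived on the following slides. The data and parameter values are placeholders, and some values of nu can be infeasible for a given dataset.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([1, 1], 1.0, (100, 2)), rng.normal([-1, -1], 1.0, (100, 2))])
y = np.array([+1] * 100 + [-1] * 100)

for nu in [0.1, 0.3, 0.5]:
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, y)
    frac_sv = clf.support_.size / len(y)    # fraction of training points that are SVs
    print(f"nu = {nu:.1f}: fraction of support vectors = {frac_sv:.2f}")
```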
The ν-SVM: Lagrangian

    L = (1/2)‖w‖²_H + (1/n) Σ_{i=1}^n ξ_i − νρ + Σ_{i=1}^n α_i (ρ − y_i w^⊤x_i − ξ_i) + Σ_{i=1}^n β_i (−ξ_i) + γ(−ρ)

for dual variables α_i ≥ 0, β_i ≥ 0, and γ ≥ 0.
Differentiating and setting to zero for each of the primal variables w, ξ, ρ:

    w = Σ_{i=1}^n α_i y_i x_i,
    α_i + β_i = 1/n,    ν = Σ_{i=1}^n α_i − γ.                         (20)

From β_i ≥ 0, equation (20) implies

    0 ≤ α_i ≤ n⁻¹.                                                     (21)
Complementary slackness (1)

Complementary slackness conditions:
Assume ρ > 0 at the global solution, hence γ = 0, and

    Σ_{i=1}^n α_i = ν.                                                 (22)

Case of ξ_i > 0: complementary slackness states β_i = 0, hence from (20) we have α_i = n⁻¹. Denote this set by N(α). Then

    |N(α)|/n = Σ_{i ∈ N(α)} 1/n = Σ_{i ∈ N(α)} α_i ≤ Σ_{i=1}^n α_i = ν,

so

    |N(α)|/n ≤ ν,

and ν is an upper bound on the fraction of non-margin SVs (margin errors).
Complementary slackness (2)
Case of ξ_i = 0: β_i > 0 and so α_i < n⁻¹. Denote by M(α) the set of points with n⁻¹ > α_i > 0. Then from (22),

    ν = Σ_{i=1}^n α_i = Σ_{i ∈ N(α)} 1/n + Σ_{i ∈ M(α)} α_i ≤ Σ_{i ∈ M(α) ∪ N(α)} 1/n,

thus

    ν ≤ (|N(α)| + |M(α)|)/n,

and ν is a lower bound on the fraction of support vectors with non-zero weight (both on the margin, and "margin errors").
Dual for ν-SVM

Substituting w = Σ_{i=1}^n α_i y_i x_i, β_i = 1/n − α_i, and γ = Σ_{i=1}^n α_i − ν into the Lagrangian, we get

    (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j + (1/n) Σ_{i=1}^n ξ_i − ρν
      − Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j + ρ Σ_{i=1}^n α_i − Σ_{i=1}^n α_i ξ_i
      − Σ_{i=1}^n (1/n − α_i) ξ_i − ρ ( Σ_{i=1}^n α_i − ν )
    = −(1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j.

Maximize:

    g(α) = −(1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^⊤x_j,

subject to

    Σ_{i=1}^n α_i ≥ ν,    0 ≤ α_i ≤ 1/n.
Questions?