How to Solve A Support Vector Machine Problem
Hao Zhang
Colorado State University
Outline
• Preliminaries
• Support Vector Machine (SVM)
• References
Preliminaries
Convex function: a local minimum is a global minimum.
Concave function: a local maximum is a global maximum.
[Figure: definition of a concave function. http://en.wikipedia.org/wiki/File:ConcaveDef.png]
Primal problem:
    minimize    f_0(x)
    subject to  f_i(x) \le 0,  i = 1, \dots, m
                h_i(x) = 0,    i = 1, \dots, p
We denote p^* as the optimal value of this problem.
Lagrange function:
    L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)
\lambda_i and \nu_i are called Lagrange multipliers (\lambda_i \ge 0).
Lagrange dual function
    g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)
                    = \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right)
Note: g(\lambda, \nu) can be -\infty for certain \lambda and \nu.
Since the dual function g(\lambda, \nu) is the pointwise infimum of a family of affine functions of (\lambda, \nu), it is concave, regardless of the convexity of the f_i(x) or h_i(x).
Lagrange dual function (continued)
    g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)
                    = \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right)
If x is primal feasible (so f_0(x) is finite) and \lambda \ge 0, then
    g(\lambda, \nu) \le f_0(x).
(quite easy to prove; a short derivation follows)
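The short derivation, following the definitions above: for any primal feasible \tilde{x} and any \lambda \ge 0, each term \lambda_i f_i(\tilde{x}) \le 0 and each term \nu_i h_i(\tilde{x}) = 0, so
    L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{i=1}^{p} \nu_i h_i(\tilde{x}) \le f_0(\tilde{x}),
and therefore
    g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f_0(\tilde{x}).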
What does this mean if g(\lambda, \nu) > -\infty (called dual feasible)?
Since g(\lambda, \nu) \le f_0(x) for every primal feasible x, g(\lambda, \nu) is a lower bound on the optimal value p^*.
Can we do better? Sure! Just maximize g(\lambda, \nu):
    maximize    g(\lambda, \nu)
    subject to  \lambda \ge 0
This is called the Lagrange dual problem associated with the primal problem. We denote d^* as the optimal value of this dual problem.
This is a convex optimization problem, since the objective to be maximized is concave and the constraint set is convex. This remains true even if the primal problem is not convex.
Since g(\lambda, \nu) \le f_0(x) for every primal feasible x,
    d^* \le p^*   (weak duality).
p^* - d^* is defined as the optimal duality gap.
If the primal problem is convex, we usually (but not always) have strong duality:
    d^* = p^*.
One simple constraint qualification is Slater's condition: if there exists an x \in \operatorname{relint}(D) such that
    f_i(x) < 0,  i = 1, \dots, m
    h_i(x) = 0,  i = 1, \dots, p,
then we have strong duality.
* \operatorname{relint}(D) = \{ x \in D : \exists r > 0, \; B_r(x) \cap \operatorname{aff}(D) \subseteq D \}
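As a small illustration (not part of the original slides), consider the one-dimensional problem
    minimize  x^2   subject to   1 - x \le 0.
The Lagrangian is L(x, \lambda) = x^2 + \lambda (1 - x), minimized over x at x = \lambda / 2, giving
    g(\lambda) = \lambda - \lambda^2 / 4,
which is concave in \lambda. Its maximum over \lambda \ge 0 is d^* = 1 at \lambda^* = 2, while the primal optimum is x^* = 1 with p^* = 1. Slater's condition holds (e.g. x = 2 gives 1 - x = -1 < 0), and indeed d^* = p^*.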
Duality implemented in algorithms
At the k-th iteration:
    compute g(\lambda^{(k)}, \nu^{(k)}) and f_0(x^{(k)})
    g(\lambda^{(k)}, \nu^{(k)}) \le p^* \le f_0(x^{(k)})
    stop if the bound is tight enough
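A minimal sketch of this stopping rule in Python. The primal_value and dual_value callables are hypothetical stand-ins for whatever iterative algorithm produces f_0(x^(k)) and g(\lambda^{(k)}, \nu^{(k)}); the demo iterates at the bottom are made up purely for illustration.

# Minimal sketch of a duality-gap stopping criterion.
def solve_with_gap_stopping(primal_value, dual_value, tol=1e-6, max_iter=1000):
    for k in range(max_iter):
        p_k = primal_value(k)      # upper bound on p*
        d_k = dual_value(k)        # lower bound on p*
        gap = p_k - d_k            # p* is trapped in the interval [d_k, p_k]
        if gap <= tol:             # bound is tight enough: stop
            return k, p_k, d_k
    return max_iter - 1, p_k, d_k

# Made-up iterates converging toward p* = 1 from both sides, for illustration only.
k, p_k, d_k = solve_with_gap_stopping(lambda k: 1 + 2.0 ** (-k), lambda k: 1 - 2.0 ** (-k))
print(k, p_k, d_k)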
Complementary slackness
Suppose that the primal and dual optimal values are attained and equal (strong duality holds). Let x^* be a primal optimal point and (\lambda^*, \nu^*) a dual optimal point. Then
    f_0(x^*) = g(\lambda^*, \nu^*)
             = \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda^*_i f_i(x) + \sum_{i=1}^{p} \nu^*_i h_i(x) \right)
             \le f_0(x^*) + \sum_{i=1}^{m} \lambda^*_i f_i(x^*) + \sum_{i=1}^{p} \nu^*_i h_i(x^*)
             \le f_0(x^*)
So the two inequalities in this chain hold with equality. Hence
    \sum_{i=1}^{m} \lambda^*_i f_i(x^*) = 0.
Since each term \lambda^*_i f_i(x^*) is non-positive, we conclude that
    \lambda^*_i f_i(x^*) = 0,   i = 1, \dots, m.
This is called complementary slackness: \lambda^*_i > 0 implies f_i(x^*) = 0 (the constraint is active), and f_i(x^*) < 0 implies \lambda^*_i = 0.
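Continuing the small illustrative example added earlier (minimize x^2 subject to 1 - x \le 0): there x^* = 1 and \lambda^* = 2, so
    f_1(x^*) = 1 - x^* = 0   and   \lambda^*_1 f_1(x^*) = 0.
The constraint is active and carries a strictly positive multiplier, exactly as complementary slackness requires.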
Karush-Kuhn-Tucker conditions
Assume f_0, \dots, f_m, h_1, \dots, h_p are differentiable (and therefore have open domains); no convexity is assumed. Since x^* minimizes L(x, \lambda^*, \nu^*) over x, its gradient must vanish at x^*:
    \nabla f_0(x^*) + \sum_{i=1}^{m} \lambda^*_i \nabla f_i(x^*) + \sum_{i=1}^{p} \nu^*_i \nabla h_i(x^*) = 0
Karush-Kuhn-Tucker (KKT) conditions
Putting all the conditions and conclusions together, we get the KKT conditions:
    f_i(x^*) \le 0,             i = 1, \dots, m
    h_i(x^*) = 0,               i = 1, \dots, p
    \lambda^*_i \ge 0,          i = 1, \dots, m
    \lambda^*_i f_i(x^*) = 0,   i = 1, \dots, m
    \nabla f_0(x^*) + \sum_{i=1}^{m} \lambda^*_i \nabla f_i(x^*) + \sum_{i=1}^{p} \nu^*_i \nabla h_i(x^*) = 0
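As a quick check of these conditions on the same added example (minimize x^2 subject to 1 - x \le 0, with x^* = 1, \lambda^* = 2):
    f_1(x^*) = 0 \le 0,   \lambda^* = 2 \ge 0,   \lambda^* f_1(x^*) = 0,
    \nabla f_0(x^*) + \lambda^* \nabla f_1(x^*) = 2 x^* + \lambda^* \cdot (-1) = 2 - 2 = 0,
so all KKT conditions hold at (x^*, \lambda^*) = (1, 2).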
Summary: for any optimization problem with differentiable objective
and constraint functions for which strong duality obtains, any pair of
primal and dual optimal points must satisfy the KKT conditions
The KKT conditions play an important role in optimization.
In general, many algorithms for convex optimization are
conceived as, or can be interpreted as, methods for solving
the KKT conditions.
Example: SVM!
Fletcher proved that for all SVMs, the KKT conditions are necessary and sufficient for a solution.
Support Vector Machine (SVM)
Linear SVM
We are given some training data D, a set of n points of the form
    D = \{ (x_i, y_i) : x_i \in \mathbb{R}^p, \; y_i \in \{-1, 1\} \}_{i=1}^{n}
Our goal is to find the maximum-margin hyperplane that divides the points from these two classes.
[Figure: H3 (green) doesn't separate the two classes. H1 (blue) does, with a small margin, and H2 (red) with the maximum margin. http://en.wikipedia.org/wiki/File:Svm_separating_hyperplanes.png]
A hyperplane can be written as
    w \cdot x - b = 0.
We want to choose w and b to maximize the margin, i.e. the distance between two parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations
    w \cdot x - b = 1
and
    w \cdot x - b = -1.
(Fixing the right-hand sides to \pm 1 removes the scaling freedom in (w, b), so the parameters are unique.)
The distance between these two hyperplanes is
    \frac{2}{\| w \|},
since the distance between the parallel hyperplanes w \cdot x - b = 1 and w \cdot x - b = -1 is |1 - (-1)| / \| w \|.
[Figure: maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors. http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png]
As we also have to prevent data points from falling into the margin, we add the following constraint: for each x_i, either
    w \cdot x_i - b \ge 1    for y_i = 1
or
    w \cdot x_i - b \le -1   for y_i = -1.
This can be written as
    y_i (w \cdot x_i - b) \ge 1   for all i.
The primal form of this problem is
    minimize    \frac{1}{2} \| w \|^2
    subject to  y_i (w \cdot x_i - b) \ge 1,   i = 1, \dots, n.
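A minimal sketch of solving this primal problem numerically, assuming the cvxpy modeling package and a small hypothetical toy data set (neither is part of the original slides):

import numpy as np
import cvxpy as cp  # generic convex-optimization modeling package

# Hypothetical, linearly separable toy data: 4 points in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2   subject to   y_i (w . x_i - b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w - b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)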
Its dual form is
    \max_{\alpha} \min_{w, b} L(\alpha, w, b)
    subject to  \alpha_i \ge 0,
where
    L(\alpha, w, b) = \frac{1}{2} \| w \|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right].
To evaluate \min_{w, b} L(\alpha, w, b), set the derivatives with respect to w and b to zero:
    \partial L(\alpha, w, b) / \partial w = w - \sum_i \alpha_i y_i x_i = 0
    \partial L(\alpha, w, b) / \partial b = \sum_i \alpha_i y_i = 0
[Figure: geometric view]
Therefore
    \min_{w, b} L(\alpha, w, b) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j = \tilde{L}(\alpha).
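The substitution behind this last equality (a standard step, filled in here for completeness): plugging w = \sum_i \alpha_i y_i x_i back into L and using \sum_i \alpha_i y_i = 0,
    L = \frac{1}{2} w \cdot w - \sum_i \alpha_i y_i \, w \cdot x_i + b \sum_i \alpha_i y_i + \sum_i \alpha_i
      = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j.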
Our goal is therefore
    \max_{\alpha} \tilde{L}(\alpha)
    subject to  \alpha_i \ge 0  and  \sum_i \alpha_i y_i = 0.
How to compute w?
    w = \sum_i \alpha_i y_i x_i
How to compute b? By complementary slackness,
    \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] = 0,
so for any i with \alpha_i > 0 (a support vector), y_i (w \cdot x_i - b) = 1 and hence b = w \cdot x_i - y_i.
(In practice, it is safer to compute the mean of these values over all support vectors.)
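A minimal sketch of solving this dual QP and then recovering w and b, again assuming cvxpy and the same hypothetical toy data as in the primal sketch above:

import numpy as np
import cvxpy as cp

# Same hypothetical toy data as in the primal sketch above.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)
# (1/2) sum_{i,j} a_i a_j y_i y_j x_i^T x_j  ==  (1/2) || sum_i a_i y_i x_i ||^2
quad = 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
objective = cp.Maximize(cp.sum(alpha) - quad)
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                     # w = sum_i alpha_i y_i x_i
sv = a > 1e-6                       # support vectors: alpha_i numerically > 0
b = np.mean(X[sv] @ w - y[sv])      # average of b = w . x_i - y_i over support vectors
print("w =", w, "b =", b)

The support-vector mask uses a small tolerance because solvers return multipliers that are only numerically zero.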
Kernel SVM
Revisit \tilde{L}(\alpha):
    \tilde{L}(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j
The training data enter only through the inner products x_i^T x_j, so each x_i^T x_j can be replaced by a kernel K(x_i, x_j).
Some examples of kernels from Wikipedia: the polynomial kernel K(x_i, x_j) = (x_i^T x_j + c)^d, the Gaussian radial basis function kernel K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2), and the hyperbolic tangent kernel K(x_i, x_j) = \tanh(\kappa \, x_i^T x_j + c).
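A minimal sketch of the kernel substitution, assuming a Gaussian RBF kernel with a hypothetical bandwidth gamma; only the Gram matrix changes relative to the linear dual above:

import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    # Gaussian RBF kernel matrix K[i, j] = exp(-gamma * ||x1_i - x2_j||^2).
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)

# In the dual, x_i^T x_j is replaced by K(x_i, x_j):
#   L~(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)
# and a new point x is classified by sign(sum_i alpha_i y_i K(x_i, x) - b).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
K = rbf_kernel(X, X)
print(K.shape)  # (4, 4) Gram matrix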
What we have discussed so far focuses on separable cases. What if the data are not separable? Relax the constraints by introducing non-negative slack variables \xi_i:
    y_i (w \cdot x_i - b) \ge 1    is relaxed to    y_i (w \cdot x_i - b) \ge 1 - \xi_i,  \xi_i \ge 0.
The primal problem becomes:
    \min_{w, b, \xi}  \frac{1}{2} \| w \|^2 + C \sum_i \xi_i
    subject to  y_i (w \cdot x_i - b) \ge 1 - \xi_i,  \xi_i \ge 0.
C is a trade-off between a large margin and a small error penalty.
Convert this primal form to its dual form (same idea as before!).
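In practice this soft-margin, kernelized problem is usually handed to an existing solver. As an illustration (the data and parameter values here are hypothetical), scikit-learn's SVC fits it directly:

import numpy as np
from sklearn.svm import SVC

# Hypothetical, slightly overlapping toy data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=1.0, size=(20, 2)), rng.normal(loc=-1.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# C controls the margin / slack trade-off; kernel='rbf' uses the Gaussian kernel.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)
clf.fit(X, y)
print(clf.support_)              # indices of the support vectors (alpha_i > 0)
print(clf.predict([[0.5, 0.5]])) # predicted label for a new point

Larger C penalizes margin violations more heavily (closer to the hard-margin SVM), while smaller C tolerates more violations in exchange for a wider margin.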
References
[1] http://en.wikipedia.org/wiki/Support_vector_machine
[2] Christopher J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition".
[3] Stephen Boyd and Lieven Vandenberghe, "Convex Optimization".
[4] Shuguang Cui, "ECEN 629 Lecture 5: Duality and KKT Conditions".
[5] http://en.wikipedia.org/wiki/Relint
[6] http://en.wikipedia.org/wiki/Affine_hull
Thanks!