How to Solve A Support Vector Machine Problem

Hao Zhang
Colorado State University
Outline
• Preliminaries
• Support Vector Machine (SVM)
• References
Preliminaries
Convex function: a local minimum is a global minimum.
Concave function: a local maximum is a global maximum.
[Figure: convex vs. concave functions. Source: http://en.wikipedia.org/wiki/File:ConcaveDef.png]
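For reference, the standard definitions behind these two bullets (they are not spelled out on the slide): a function $f$ is convex if its domain is a convex set and, for all $x, y$ in the domain and all $\theta \in [0, 1]$,
\[ f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y), \]
and $f$ is concave if $-f$ is convex (the inequality is reversed).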
Primal problem:
\[
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \dots, m \\
& h_i(x) = 0, \quad i = 1, \dots, p
\end{aligned}
\]
We denote $p^\star$ as the optimal value of this problem.
Lagrange function:
\[
L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)
\]
$\lambda_i$ and $\nu_i$ are called Lagrange multipliers (with $\lambda_i \ge 0$).
Lagrange dual function:
\[
g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right)
\]
Note: $g(\lambda, \nu)$ can be $-\infty$ for certain $\lambda$ and $\nu$.
Since the dual function $g(\lambda, \nu)$ is the pointwise infimum of a family of affine functions of $(\lambda, \nu)$, it is concave, regardless of whether $f_i(x)$ or $h_i(x)$ are convex.
Lagrange dual function (continued)
Recall $g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)$.
If π‘₯ is primal feasible (𝑓0 (π‘₯) won’t approach infinity),
then
𝑔(πœ†, 𝜈) ≀ 𝑓0 (π‘₯)
(quite easy to prove)
What does this mean if 𝑔(πœ†, 𝜈) >βˆ’βˆž (called dual feasible)?
𝑔(πœ†, 𝜈) ≀ 𝑓0 (π‘₯)
𝑔(πœ†, 𝜈) is one lower bound on optimal value π‘βˆ—
Can we do better?
Sure, just maximize $g(\lambda, \nu)$:
\[
\begin{aligned}
\text{maximize} \quad & g(\lambda, \nu) \\
\text{subject to} \quad & \lambda_i \ge 0, \quad i = 1, \dots, m
\end{aligned}
\]
This is called the Lagrange dual problem associated with the primal problem. We denote $d^\star$ as the optimal value of this dual problem.
This is a convex optimization problem, since the objective to be maximized is concave and the constraint set is convex. This remains true even if the primal problem is not convex.
𝑔(πœ†, 𝜈) ≀ 𝑓0 (π‘₯)
𝑑 βˆ— ≀ π‘βˆ— (weak duality)
π‘βˆ— βˆ’ 𝑑 βˆ— is defined as optimal duality gap
If the primal problem is convex, we usually (but not always)
have strong duality:
𝑑 βˆ— = π‘βˆ—
One simple constraint qualification is Slater’s condition:
There exists an π‘₯ ∈ π‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿ(𝐷) such that
𝑓𝑖 (π‘₯) < 0, 𝑖 = 1, … , π‘š
β„Žπ‘– (π‘₯) = 0, 𝑖 = 1, … , 𝑝
then we have strong duality
* π‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿπ‘Ÿ 𝐷 = {π‘₯ ∈ 𝐷: βˆƒπœ– > 0, π‘πœ– (π‘₯) ∩ π‘Žπ‘Žπ‘Ž(𝐷) βŠ† 𝐷}
Duality implemented in algorithms
At the $k$-th iteration, compute $g(\lambda^{(k)}, \nu^{(k)})$ and $f_0(x^{(k)})$; then
\[ g(\lambda^{(k)}, \nu^{(k)}) \le p^\star \le f_0(x^{(k)}) \]
Stop when this bound (the duality gap at iteration $k$) is tight enough.
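As an illustration, here is a minimal Python sketch of this stopping rule. The callables update_step, primal_objective ($f_0$), and dual_objective ($g$) are hypothetical placeholders for whatever primal-dual method is actually being run; they are not taken from any particular library.

def solve_with_gap_check(x, lam, nu, update_step,
                         primal_objective, dual_objective,
                         tol=1e-6, max_iter=1000):
    # update_step, primal_objective, dual_objective are placeholder callables.
    for k in range(max_iter):
        x, lam, nu = update_step(x, lam, nu)     # one iteration of the method
        upper = primal_objective(x)              # f0(x^(k)) >= p*
        lower = dual_objective(lam, nu)          # g(lam^(k), nu^(k)) <= p*
        if upper - lower <= tol:                 # p* is bracketed within tol
            break
    return x, lam, nu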
Complementary slackness
Suppose that the primal and dual optimal values are attained and equal (strong duality holds). Denote $x^\star$ as a primal optimal point and $(\lambda^\star, \nu^\star)$ as a dual optimal point. Then
\[
\begin{aligned}
f_0(x^\star) &= g(\lambda^\star, \nu^\star) \\
&= \inf_x \left( f_0(x) + \sum_{i=1}^{m} \lambda^\star_i f_i(x) + \sum_{i=1}^{p} \nu^\star_i h_i(x) \right) \\
&\le f_0(x^\star) + \sum_{i=1}^{m} \lambda^\star_i f_i(x^\star) + \sum_{i=1}^{p} \nu^\star_i h_i(x^\star) \\
&\le f_0(x^\star)
\end{aligned}
\]
The first inequality holds because the infimum over $x$ is at most the value at $x = x^\star$; the second holds because $\lambda^\star_i f_i(x^\star) \le 0$ and $h_i(x^\star) = 0$.
So the two inequalities in this chain hold with equality. Hence (the $\nu^\star_i h_i(x^\star)$ terms already vanish),
\[ \sum_{i=1}^{m} \lambda^\star_i f_i(x^\star) = 0 \]
Since each term is non-positive, we conclude that
\[ \lambda^\star_i f_i(x^\star) = 0, \quad i = 1, \dots, m \]
This is called complementary slackness: $\lambda^\star_i > 0$ implies $f_i(x^\star) = 0$, and $f_i(x^\star) < 0$ implies $\lambda^\star_i = 0$.
Karush-Kuhn-Tucker conditions
Assume $f_0, \dots, f_m, h_1, \dots, h_p$ are differentiable (and therefore have open domains); no convexity is assumed. Since $x^\star$ minimizes $L(x, \lambda^\star, \nu^\star)$ over $x$, the gradient of the Lagrangian must vanish at $x^\star$:
\[ \nabla f_0(x^\star) + \sum_{i=1}^{m} \lambda^\star_i \nabla f_i(x^\star) + \sum_{i=1}^{p} \nu^\star_i \nabla h_i(x^\star) = 0 \]
Karush-Kuhn-Tucker (KKT) conditions
Putting all the conditions and conclusions together, we get the KKT conditions:
\[
\begin{aligned}
f_i(x^\star) &\le 0, \quad i = 1, \dots, m \\
h_i(x^\star) &= 0, \quad i = 1, \dots, p \\
\lambda^\star_i &\ge 0, \quad i = 1, \dots, m \\
\lambda^\star_i f_i(x^\star) &= 0, \quad i = 1, \dots, m \\
\nabla f_0(x^\star) + \sum_{i=1}^{m} \lambda^\star_i \nabla f_i(x^\star) + \sum_{i=1}^{p} \nu^\star_i \nabla h_i(x^\star) &= 0
\end{aligned}
\]
Summary: for any optimization problem with differentiable objective and constraint functions for which strong duality obtains, any pair of primal and dual optimal points must satisfy the KKT conditions.
The KKT conditions play an important role in optimization.
In general, many algorithms for convex optimization are
conceived as, or can be interpreted as, methods for solving
the KKT conditions.
Example: SVM!
Fletcher proved that, for all SVMs, the KKT conditions are necessary and sufficient for a solution.
Support Vector Machine (SVM)
Linear SVM
We are given some training data $D$, a set of $n$ points of the form
\[ D = \{ (x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\} \}_{i=1}^{n} \]
Our goal is to find the maximum-margin hyperplane that divides the
points from these two classes.
[Figure: H3 (green) doesn't separate the two classes. H1 (blue) does, with a small margin, and H2 (red) with the maximum margin. Source: http://en.wikipedia.org/wiki/File:Svm_separating_hyperplanes.png]
A hyperplane can be written as
\[ w \cdot x - b = 0 \]
We want to choose $w$ and $b$ to maximize the margin, i.e., the distance between the two parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations
\[ w \cdot x - b = 1 \qquad \text{and} \qquad w \cdot x - b = -1 \]
(this choice of scale makes the parameters $w$ and $b$ unique). The distance between these two hyperplanes is
\[ \frac{2}{\|w\|} \]
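A one-line justification of this formula (not shown explicitly on the slide): the two hyperplanes share the normal direction $w / \|w\|$ and differ only in their offsets, so the distance between $w \cdot x - b = 1$ and $w \cdot x - b = -1$ is
\[ \frac{|1 - (-1)|}{\|w\|} = \frac{2}{\|w\|}. \]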
[Figure: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors. Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png]
As we also have to prevent data points from falling into the margin, we add the following constraint: for each $x_i$, either
\[ w \cdot x_i - b \ge 1 \quad \text{for } y_i = 1 \]
or
\[ w \cdot x_i - b \le -1 \quad \text{for } y_i = -1. \]
This can be written as
\[ y_i (w \cdot x_i - b) \ge 1 \quad \text{for all } i. \]
The primal form of this problem is
\[
\begin{aligned}
\underset{w, b}{\text{minimize}} \quad & \frac{1}{2} \|w\|^2 \\
\text{subject to} \quad & y_i (w \cdot x_i - b) \ge 1
\end{aligned}
\]
Its dual form is
\[ \max_{\alpha} \min_{w, b} L(\alpha, w, b) \quad \text{subject to } \alpha_i \ge 0, \]
where
\[ L(\alpha, w, b) = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right]. \]
Minimizing $L(\alpha, w, b)$ over $w$ and $b$:
\[
\frac{\partial L(\alpha, w, b)}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0,
\qquad
\frac{\partial L(\alpha, w, b)}{\partial b} = \sum_i \alpha_i y_i = 0
\]
Substituting these back gives
\[
\min_{w, b} L(\alpha, w, b) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j = \tilde{L}(\alpha)
\]
[Geometric view: figure omitted]
Our goal is
\[ \max_{\alpha} \tilde{L}(\alpha) \quad \text{subject to } \alpha_i \ge 0 \text{ and } \sum_i \alpha_i y_i = 0. \]
How to compute $w$?
\[ w = \sum_i \alpha_i y_i x_i \]
How to compute $b$? Complementary slackness gives
\[ \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right] = 0, \]
so for any $i$ with $\alpha_i > 0$ (a support vector), $y_i (w \cdot x_i - b) = 1$, and since $y_i \in \{-1, 1\}$ this gives $b = w \cdot x_i - y_i$.
(In practice, it is safer to average this estimate over all support vectors.)
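As a concrete illustration, here is a minimal sketch that solves the hard-margin dual numerically and then recovers $w$ and $b$ exactly as above. It uses SciPy's general-purpose SLSQP solver rather than a dedicated QP/SMO solver, and the tiny dataset is made up for the example; this is not the recommended way to train an SVM in practice.

import numpy as np
from scipy.optimize import minimize

def fit_hard_margin_svm(X, y):
    n = len(y)
    K = (X @ X.T) * np.outer(y, y)               # K_ij = y_i y_j x_i^T x_j

    def neg_dual(alpha):                          # maximize L~(alpha) <=> minimize -L~(alpha)
        return 0.5 * alpha @ K @ alpha - alpha.sum()

    def neg_dual_grad(alpha):
        return K @ alpha - np.ones(n)

    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
    bounds = [(0, None)] * n                                  # alpha_i >= 0
    res = minimize(neg_dual, np.zeros(n), jac=neg_dual_grad,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    alpha = res.x

    w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
    sv = alpha > 1e-6                             # support vectors: alpha_i > 0
    b = np.mean(X[sv] @ w - y[sv])                # average of b = w . x_i - y_i over SVs
    return w, b, alpha

# Toy separable data (made up for illustration); decision rule is sign(w . x - b).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = fit_hard_margin_svm(X, y)
print(w, b)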
Kernel SVM
Revisit $\tilde{L}(\alpha)$:
\[ \tilde{L}(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \]
The data enter only through the inner products $x_i^T x_j$, which can be replaced by a kernel $k(x_i, x_j)$.
Some examples of kernels from Wikipedia:
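The list itself did not survive in the extracted text; for reference, the kernels given on that Wikipedia page include, for example:
\[
\begin{aligned}
\text{polynomial (homogeneous):} \quad & k(x_i, x_j) = (x_i \cdot x_j)^d \\
\text{polynomial (inhomogeneous):} \quad & k(x_i, x_j) = (x_i \cdot x_j + 1)^d \\
\text{Gaussian RBF:} \quad & k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0 \\
\text{hyperbolic tangent:} \quad & k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c)
\end{aligned}
\]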
What we have discussed so far covers the separable case. What if the data are not separable?
Relax the constraints:
\[ y_i (w \cdot x_i - b) \ge 1 \quad \longrightarrow \quad y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \]
The primal problem becomes
\[
\min_{w, \xi, b} \ \frac{1}{2} \|w\|^2 + C \sum_i \xi_i
\quad \text{subject to } y_i (w \cdot x_i - b) \ge 1 - \xi_i, \ \xi_i \ge 0,
\]
where $C$ is a trade-off between a large margin and a small error penalty.
Convert this primal form to its dual form in the same way as before.
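In practice, the soft-margin dual is solved by a library implementation. Here is a minimal sketch assuming scikit-learn is available; the XOR-style toy data is made up for illustration and is not linearly separable, so both the slack variables and a kernel matter.

import numpy as np
from sklearn.svm import SVC

# XOR-style toy data: not linearly separable.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(C=1.0, kernel="rbf", gamma="scale")   # C trades margin size against slack penalty
clf.fit(X, y)
print(clf.support_)                             # indices of the support vectors (alpha_i > 0)
print(clf.predict([[0.9, 0.9]]))                # point near (1, 1), expected to get label 1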
References
[1] http://en.wikipedia.org/wiki/Support_vector_machine
[2] Christopher J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition".
[3] Stephen Boyd and Lieven Vandenberghe, "Convex Optimization".
[4] Shuguang Cui, "ECEN 629 Lecture 5: Duality and KKT Conditions".
[5] http://en.wikipedia.org/wiki/Relint
[6] http://en.wikipedia.org/wiki/Affine_hull
Thanks!