
Constrained Optimization and Support Vector Machines
Man-Wai MAK
Dept. of Electronic and Information Engineering,
The Hong Kong Polytechnic University
[email protected]
http://www.eie.polyu.edu.hk/~mwmak
References:
C. Bishop, "Pattern Recognition and Machine Learning", Appendix E, Springer, 2006.
S.Y. Kung, M.W. Mak and S.H. Lin, Biometric Authentication: A Machine Learning
Approach, Prentice Hall, 2005, Chapter 4.
March 22, 2017
Overview
1. Motivations: Why Study Constrained Optimization and SVM
   - Why Study Constrained Optimization
   - Why Study SVM
2. Constrained Optimization
   - Hard-Constrained Optimization
   - Lagrangian Function
   - Inequality Constraint
   - Multiple Constraints
   - Software Tools
3. Support Vector Machines
   - Linear SVM: Separable Case
   - Linear SVM: Fuzzy Separation (Optional)
   - Nonlinear SVM
   - SVM for Pattern Classification
   - Software Tools
Why Study Constrained Optimization?
Constrained optimization is used in almost every discipline:
Power Electronics: “Design of a boost power factor correction
converter using optimization techniques,” IEEE Transactions on Power
Electronics, vol. 19, no. 6, pp. 1388-1396, Nov. 2004.
Wireless Communication: “Energy-constrained modulation
optimization,” IEEE Transactions on Wireless Communications, vol. 4,
no. 5, pp. 2349-2360, Sept. 2005
Photonics: “Module Placement Based on Resistive Network
Optimization,” in IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 3, no. 3, pp. 218-225, July 1984.
Multimedia: “Nonlinear total variation based noise removal
algorithms.” Physica D: Nonlinear Phenomena 60.1-4 (1992): 259-268.
Why Study SVM?
SVM is a typical application of constrained optimization.
SVMs are used everywhere:
Power Electronics: “Support Vector Machines Used to Estimate the
Battery State of Charge,” IEEE Transactions on Power Electronics, vol.
28, no. 12, pp. 5919-5926, Dec. 2013.
Wireless Communication: “Localization In Wireless Sensor Networks
Based on Support Vector Machines,” IEEE Transactions on Parallel
and Distributed Systems, vol. 19, no. 7, pp. 981-994, July 2008.
Photonics: “Development of robust calibration models using support
vector machines for spectroscopic monitoring of blood glucose.”
Analytical chemistry 82.23 (2010): 9719-9726.
Multimedia: “Support vector machines using GMM supervectors for
speaker verification,” IEEE Signal Processing Letters, vol. 13, no. 5,
pp. 308-311, May 2006.
Bioinformatics: “Gene selection for cancer classification using support
vector machines.” Machine learning, 46.1-3 (2002): 389-422.
Constrained Optimization
Constrained optimization is the process of optimizing an objective
function with respect to some variables in the presence of constraints
on those variables.
The objective function is either a cost function or energy function
which is to be minimized, or a reward function or utility function,
which is to be maximized.
Constraints can be either hard constraints, which set conditions on the variables that must be satisfied, or soft constraints, which penalize the objective function when the conditions on the variables are not satisfied.
Constrained Optimization
A general constrained minimization problem:
    min_x f(x)
    subject to gi(x) = ci for i = 1, . . . , n   (equality constraints)
               hj(x) ≥ dj for j = 1, . . . , m   (inequality constraints)        (1)

where gi(x) = ci and hj(x) ≥ dj are called hard constraints.
If the constrained problem has only equality constraints, the method
of Lagrange multipliers can be used to convert it into an
unconstrained problem whose number of variables is the original
number of variables plus the original number of equality constraints.
Constrained Optimization
Example: Maximization of a function of two variables with equality
constraints:
    max_{x,y} f(x, y)
    subject to g(x, y) = 0        (2)
At the optimal point (x*, y*), the gradients of f(x, y) and g(x, y) are anti-parallel, i.e., ∇f(x*, y*) = −λ∇g(x*, y*), where λ is called the Lagrange multiplier. (See Tutorial for explanation.)
Constrained Optimization
Extension to a function of D variables:

    max_x f(x)
    subject to g(x) = 0        (3)

where x ∈ ℝ^D. The optimum occurs when

    ∇f(x) + λ∇g(x) = 0.        (4)
Note that the constraint surface g(x) = 0 is of dimension D − 1.
Lagrangian Function
Define the Lagrangian function as

    L(x, λ) ≡ f(x) + λg(x)        (5)

where λ ≠ 0 is the Lagrange multiplier.
The optimal condition (Eq. 4) will be satisfied when ∇x L = 0.
Note that ∂L/∂λ = 0 leads to the constraint equation g(x) = 0.
The constrained maximization can be written as:

    max_{x,λ} L(x, λ) = f(x) + λg(x)
    subject to λ ≠ 0, g(x) = 0        (6)
Lagrangian Function: 2D Example
Find the stationary point of the function f(x1, x2):

    max f(x1, x2) = 1 − x1² − x2²
    subject to g(x1, x2) = x1 + x2 − 1 = 0        (7)

Lagrangian function:

    L(x, λ) = 1 − x1² − x2² + λ(x1 + x2 − 1)
Lagrangian Function: 2D Example
Differentiating L(x, λ) w.r.t. x1, x2, and λ and setting the results to 0, we obtain

    −2x1 + λ = 0
    −2x2 + λ = 0
    x1 + x2 − 1 = 0

The solution is (x1*, x2*) = (1/2, 1/2), and the corresponding λ = 1.
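This small system can be checked symbolically. The sketch below (not part of the original slides; it assumes SymPy is installed) reproduces the solution:

# Minimal sketch verifying the worked example: maximize f = 1 - x1^2 - x2^2
# subject to g = x1 + x2 - 1 = 0 via the Lagrangian L = f + lambda*g.
from sympy import symbols, diff, solve

x1, x2, lam = symbols('x1 x2 lambda', real=True)
f = 1 - x1**2 - x2**2            # objective
g = x1 + x2 - 1                  # equality constraint g(x) = 0
L = f + lam * g                  # Lagrangian

# Stationary point: all partial derivatives of L vanish.
solution = solve([diff(L, x1), diff(L, x2), diff(L, lam)], [x1, x2, lam], dict=True)
print(solution)                  # [{x1: 1/2, x2: 1/2, lambda: 1}]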
Inequality Constraint
Maximization with inequality constraint:

    max_x f(x)
    subject to g(x) ≥ 0        (8)

Two possible solutions for the max of L(x, µ) = f(x) + µg(x):

    Inactive constraint: g(x) > 0, µ = 0, ∇f(x) = 0
    Active constraint:   g(x) = 0, µ > 0, ∇f(x) = −µ∇g(x)        (9)

Therefore, the maximization can be rewritten as

    max_x L(x, µ) = f(x) + µg(x)
    subject to g(x) ≥ 0, µ ≥ 0, µg(x) = 0        (10)

which is known as the Karush-Kuhn-Tucker (KKT) condition.
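To make the two cases concrete, here is a small numerical sketch (not from the slides; the toy problem and the tolerance 1e-6 are made up). It maximizes f(x) = 1 − x² subject to x − a ≥ 0 for one value of a where the constraint is inactive and one where it is active, and prints the KKT quantities:

# Minimal sketch: max f(x) = 1 - x^2 subject to g(x) = x - a >= 0,
# illustrating the inactive (mu = 0) and active (mu > 0) cases of Eq. 9.
from scipy.optimize import minimize

for a in [-1.0, 0.5]:
    res = minimize(lambda x: -(1.0 - x[0]**2),            # negate for maximization
                   x0=[2.0], method='SLSQP',
                   constraints=[{'type': 'ineq', 'fun': lambda x: x[0] - a}])
    x_opt = res.x[0]
    g_val = x_opt - a
    mu = 0.0 if g_val > 1e-6 else 2.0 * x_opt             # from grad f = -mu * grad g
    print(f"a={a:5.2f}  x*={x_opt:.3f}  g(x*)={g_val:.3f}  mu={mu:.3f}  mu*g(x*)={mu*g_val:.3f}")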
Inequality Constraint
For minimization,

    min_x f(x)
    subject to g(x) ≥ 0        (11)

We can also express the minimization as

    min_x L(x, µ) = f(x) − µg(x)
    subject to g(x) ≥ 0, µ ≥ 0, µg(x) = 0        (12)
Multiple Constraints
Maximization with multiple equality and inequality constraints:

    max_x f(x)
    subject to gj(x) = 0 for j = 1, . . . , J
               hk(x) ≥ 0 for k = 1, . . . , K.        (13)

This maximization can be written as

    max_x L(x, {λj}, {µk}) = f(x) + ∑_{j=1}^{J} λj gj(x) + ∑_{k=1}^{K} µk hk(x)
    subject to λj ≠ 0, gj(x) = 0 for j = 1, . . . , J and
               µk ≥ 0, hk(x) ≥ 0, µk hk(x) = 0 for k = 1, . . . , K.        (14)
Software Tools for Constrained Optimization
Matlab Optimization Toolbox: fmincon can find the minimum of
a function subject to nonlinear multivariable constraints.
Python: scipy.optimize.minimize provides a common interface to unconstrained and constrained minimization algorithms for multivariate scalar functions; a minimal example is sketched below.
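For instance, the 2-D example of Eq. 7 can be solved numerically (a sketch only; SciPy minimizes by default, so the objective is negated):

# Minimal sketch: solve  max 1 - x1^2 - x2^2  s.t.  x1 + x2 - 1 = 0  with SciPy.
import numpy as np
from scipy.optimize import minimize

objective = lambda x: -(1.0 - x[0]**2 - x[1]**2)                  # negated for maximization
constraints = [{'type': 'eq', 'fun': lambda x: x[0] + x[1] - 1.0}]

result = minimize(objective, x0=np.array([0.0, 0.0]),
                  method='SLSQP', constraints=constraints)
print(result.x)                                                   # approximately [0.5, 0.5]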
Linear SVM: Separable Case

Consider a training set {xi, yi; i = 1, . . . , N} ∈ X × {+1, −1} shown below, where X is the set of input data in ℝ^D and yi are the labels.

[Figure: Linear SVM on 2-D space. The two classes (yi = +1, marked ×; yi = −1, marked ◦) are separated by the decision plane w · x + b = 0. The points x1 and x2 lie on the hyperplanes w · x + b = +1 and w · x + b = −1, respectively, and the margin of separation is d.]
Linear SVM: Separable Case
A linear support vector machine (SVM) aims to find a decision plane (a line in the case of 2-D data)

    x · w + b = 0

that maximizes the margin of separation (see Fig. 1).

Assume that all data points satisfy the constraints:

    xi · w + b ≥ +1 for i ∈ {1, . . . , N} where yi = +1        (15)
    xi · w + b ≤ −1 for i ∈ {1, . . . , N} where yi = −1        (16)

Data points x1 and x2 on the previous page satisfy the equality constraints:

    x1 · w + b = +1
    x2 · w + b = −1        (17)
Linear SVM: Separable Case
Using Eq. 17 and Fig. 1, the distance between the two separating hyperplanes (also called the margin of separation) can be computed:

    d(w) = (x1 − x2) · w/‖w‖ = 2/‖w‖

Maximizing d(w) is equivalent to minimizing ‖w‖². So, the constrained optimization problem in SVM is

    min_w (1/2)‖w‖²
    subject to yi(xi · w + b) ≥ 1, ∀i = 1, . . . , N        (18)

Equivalently, minimizing a Lagrangian function:

    min L(w, b, {αi}) = (1/2)‖w‖² − ∑_{i=1}^{N} αi [yi(xi · w + b) − 1]
    subject to αi ≥ 0, yi(xi · w + b) − 1 ≥ 0,
               αi [yi(xi · w + b) − 1] = 0, ∀i = 1, . . . , N        (19)
Linear SVM: Separable Case
Setting

    ∂L(w, b, {αi})/∂b = 0  and  ∂L(w, b, {αi})/∂w = 0,        (20)

subject to the constraint αi ≥ 0, results in

    ∑_{i=1}^{N} αi yi = 0  and  w = ∑_{i=1}^{N} αi yi xi.        (21)

Substituting these results back into the Lagrangian function:

    L(w, b, {αi}) = (1/2)(w · w) − ∑_{i=1}^{N} αi yi (xi · w) − ∑_{i=1}^{N} αi yi b + ∑_{i=1}^{N} αi

                  = (1/2)(∑_{i=1}^{N} αi yi xi) · (∑_{j=1}^{N} αj yj xj) − (∑_{i=1}^{N} αi yi xi) · (∑_{j=1}^{N} αj yj xj) + ∑_{i=1}^{N} αi

                  = ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} αi αj yi yj (xi · xj).
Linear SVM: Separable Case
This results in the following Wolfe dual formulation:
    max_α  ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} αi αj yi yj (xi · xj)        (22)

    subject to ∑_{i=1}^{N} αi yi = 0 and αi ≥ 0, i = 1, . . . , N.

The solution contains two kinds of Lagrange multiplier:
  1. αi = 0: The corresponding xi are irrelevant
  2. αi > 0: The corresponding xi are critical
The xk for which αk > 0 are called support vectors.
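Since the dual is a quadratic program in α, it can be solved directly with a generic solver. The sketch below is an illustration only (the toy data, the 1e-6 tolerance, and the use of SLSQP instead of a dedicated QP solver are all assumptions): it solves Eq. 22 for a tiny separable 2-D dataset and reads off the support vectors.

# Minimal sketch: solve the Wolfe dual (Eq. 22) for a small separable dataset.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 3.0],       # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.5, -2.0]])  # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T                # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                                     # SciPy minimizes, so negate
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(neg_dual, x0=np.zeros(N), method='SLSQP',
               bounds=[(0.0, None)] * N,                                 # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])     # sum alpha_i y_i = 0
alpha = res.x
print("alphas:", np.round(alpha, 4))
print("support vectors (alpha > 0):\n", X[alpha > 1e-6])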
Linear SVM: Separable Case
The SVM output is given by

    f(x) = w · x + b
         = ∑_{k∈S} αk yk (xk · x) + b

where S is the set of indexes for which αk > 0.

b can be computed by using the KKT condition, i.e., for any k such that yk = 1 and αk > 0, we have

    αk [yk (xk · w + b) − 1] = 0  ⟹  b = 1 − xk · w.
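The same quantities can be read off a trained model. A hedged sketch, assuming scikit-learn, reusing the toy data from the previous sketch, and using a very large C to approximate the separable case:

# Minimal sketch: fit a (nearly) hard-margin linear SVM and recover w, b,
# and the support vectors from the trained model.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ separable case

w = clf.coef_[0]                              # w = sum_k alpha_k y_k x_k
b = clf.intercept_[0]
print("w =", w, " b =", b)
print("support vectors:\n", clf.support_vectors_)

# Check b = 1 - x_k . w using a support vector with y_k = +1
# (dual_coef_ stores alpha_k * y_k, so positive entries have y_k = +1).
xk = clf.support_vectors_[clf.dual_coef_[0] > 0][0]
print("b from the KKT condition:", 1.0 - xk @ w)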
Linear SVM: Fuzzy Separation (Optional)
If the data patterns are not separable by a linear hyperplane, a set of slack variables {ξ1, . . . , ξN} is introduced with ξi ≥ 0 such that the inequality constraints in SVM become

    yi(xi · w + b) ≥ 1 − ξi,  ∀i = 1, . . . , N.        (23)

The slack variables {ξi}_{i=1}^{N} allow some data to violate the constraints in Eq. 18.
The value of ξi indicates the degree of violation of the constraint.
The minimization problem becomes

    min (1/2)‖w‖² + C ∑_i ξi
    subject to yi(xi · w + b) ≥ 1 − ξi,        (24)

where C is a user-defined penalty parameter to penalize any violation of the safety margin for all training data.
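The following sketch (an illustration; the synthetic data and the particular C values are arbitrary choices) shows how C trades off margin violations against the number of support vectors:

# Minimal sketch: effect of the penalty parameter C on a soft-margin linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margins = clf.decision_function(X) * np.where(y == 1, 1, -1)   # y_i * f(x_i)
    n_violations = int(np.sum(margins < 1))                        # points with xi_i > 0
    print(f"C={C:7.2f}  #SV={len(clf.support_)}  margin violations={n_violations}")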
Linear SVM: Fuzzy Separation (Optional)
The new Lagrangian is

    L(w, b, α) = (1/2)‖w‖² + C ∑_i ξi − ∑_{i=1}^{N} αi [yi(xi · w + b) − 1 + ξi] − ∑_{i=1}^{N} βi ξi,        (25)

where αi ≥ 0 and βi ≥ 0 are, respectively, the Lagrange multipliers to ensure that yi(xi · w + b) ≥ 1 − ξi and that ξi ≥ 0.

Differentiating L(w, b, α) w.r.t. w, b, and ξi, we obtain the Wolfe dual:

    max_α  ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} αi αj yi yj (xi · xj)        (26)

    subject to 0 ≤ αi ≤ C, i = 1, . . . , N,  and  ∑_{i=1}^{N} αi yi = 0.
Linear SVM: Fuzzy Separation (Optional)
Three types of support vectors: on the margin (ξi = 0), inside the margin but correctly classified (0 < ξi ≤ 1), and misclassified (ξi > 1).
Nonlinear SVM
Assume that we have a nonlinear function φ(x) that maps x from the input space to a much higher (possibly infinite) dimensional space called the feature space.

While the data are not linearly separable in the input space, they will become linearly separable in the feature space.

[Figure: Use of a kernel function K(x, xi) to map nonlinearly separable data in the input space to a feature space in which the decision boundary is linear.]
Nonlinear SVM
The SVM output becomes

    f(x) = ∑_{i=1}^{N} αi yi φ(xi) · φ(x) + b

However, the dimension of φ(x) is very high and could be infinite in some cases, meaning that this function may not be implementable.

Fortunately, the dot product φ(xi) · φ(x) can be replaced by a kernel function:

    φ(xi) · φ(x) = φ(xi)ᵀφ(x) = K(xi, x)
which can be efficiently implemented.
Nonlinear SVM
Common kernel functions include

    Polynomial kernel:  K(x, xi) = (1 + (x · xi)/σ²)^p,  p > 0        (27)
    RBF kernel:         K(x, xi) = exp(−‖x − xi‖²/(2σ²))        (28)
    Sigmoidal kernel:   K(x, xi) = 1/(1 + exp(−(x · xi + b)/σ²))        (29)
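Written out in code (a minimal sketch; the values of σ, p, and b below are arbitrary hyperparameter choices), the three kernels are:

# Minimal sketch: the kernels of Eqs. 27-29 implemented with NumPy.
import numpy as np

def polynomial_kernel(x, xi, sigma=1.0, p=2):
    return (1.0 + np.dot(x, xi) / sigma**2) ** p

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi)**2) / (2.0 * sigma**2))

def sigmoidal_kernel(x, xi, sigma=1.0, b=0.0):
    return 1.0 / (1.0 + np.exp(-(np.dot(x, xi) + b) / sigma**2))

x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, xi), rbf_kernel(x, xi), sigmoidal_kernel(x, xi))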
Nonlinear SVM
Comparing kernels:

[Figure: decision boundaries of three SVMs trained on the same 2-D data.
 Linear SVM: C=1000.0, #SV=7, acc=95.00%, ‖w‖=0.94
 RBF SVM: 2σ²=8.0, C=1000.0, #SV=7, acc=100.00%
 Polynomial SVM: degree=2, C=10.0, #SV=7, acc=90.00%]
SVM for Pattern Classification
SVM is good for binary classification:

    f(x) > 0 ⇒ x ∈ Class 1;  f(x) ≤ 0 ⇒ x ∈ Class 2

To classify multiple classes, we can use the one-vs-rest approach to convert K binary classifications to a K-class classification (a code sketch follows below):

    k* = argmax_k f^(k)(x)

[Figure: one-vs-rest classification of digits. SVM0, . . . , SVM9 are trained to separate digit '0', . . . , digit '9' from the rest, with scores
    f^(k)(x) = ∑_{i∈SVk} yi^(k) αi^(k) (xiᵀx) + b^(k),  k = 0, . . . , 9,
and the class with the maximum score is picked.]
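A minimal sketch of the one-vs-rest scheme (assuming scikit-learn; the dataset and C value are illustrative choices, not from the slides):

# Minimal sketch: one-vs-rest linear SVMs for 10-class digit classification.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train 10 binary SVMs (digit k vs the rest); prediction picks the class with max score.
clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X_train, y_train)

print("number of binary SVMs:", len(clf.estimators_))   # 10
print("test accuracy:", clf.score(X_test, y_test))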
Software Tools for SVM
Matlab: fitcsvm trains an SVM for two-class classification.
Python: svm from the sklearn package provides a set of supervised learning methods used for classification, regression, and outlier detection.
C/C++: LibSVM is a library for SVMs. It also has Java, Perl, Python, CUDA, and Matlab interfaces.
Java: SVM-JAVA implements sequential minimal optimization for
training SVM in Java.
Javascript: http://cs.stanford.edu/people/karpathy/svmjs/demo/