Support Vector Machines
Lecturer: 虞台文
Content
Introduction
The VC Dimension & Structural Risk Minimization
Linear SVM: The Separable Case
Linear SVM: The Non-Separable Case
Lagrange Multipliers
Support Vector Machines
Introduction
Learning Machines
A machine that learns the mapping $x_i \mapsto y_i$.
It is defined as a family of functions $x \mapsto f(x, \alpha)$.
Learning amounts to adjusting the parameter $\alpha$.
Generalization vs. Learning
How does a machine learn?
– By adjusting its parameters so as to partition the pattern (feature) space for classification.
How are the parameters adjusted?
– By minimizing the empirical risk (the traditional approach).
What has the machine learned?
– Did it memorize the patterns it has seen, or
– did it learn the rules that distinguish the classes?
What does the machine actually learn if it minimizes the empirical risk only?
Is $R(\alpha) \approx R_{\mathrm{emp}}(\alpha)$?
Risks
Expected risk (test error):
$$R(\alpha) = \int \tfrac{1}{2}\,\bigl|y - f(x, \alpha)\bigr|\; dP(x, y)$$
Empirical risk (training error):
$$R_{\mathrm{emp}}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} \bigl|y_i - f(x_i, \alpha)\bigr|$$
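The formulas above can be evaluated directly once a decision function is fixed. Below is a minimal NumPy sketch of the empirical-risk formula; the toy data and the placeholder decision rule $f(x, \alpha) = \operatorname{sgn}(\alpha^T x)$ are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def empirical_risk(X, y, alpha):
    """R_emp(alpha) = (1 / 2l) * sum_i |y_i - f(x_i, alpha)| with labels in {-1, +1}."""
    f = np.sign(X @ alpha)                      # placeholder decision rule f(x, alpha)
    return np.abs(y - f).sum() / (2 * len(y))

# Toy check: four points in R^2, one of which the chosen alpha misclassifies.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(empirical_risk(X, y, alpha=np.array([0.0, 1.0])))   # 0.25 (1 error out of 4)
```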
More on Empirical Risk
How can we make the empirical risk arbitrarily small?
– By giving the machine a very large memorization capacity.
Does a machine with small empirical risk also achieve small expected risk?
How do we keep the machine from merely memorizing the training patterns instead of generalizing?
How do we deal with the memorization capacity of a machine?
What should the new criterion be?
Structural Risk Minimization
Goal: learn both the right "structure" and the right "rules" for classification.
Right structure: e.g., the right amount and the right forms of components or parameters participating in the learning machine.
Right rules: the empirical risk is also reduced when the right rules are learned.
New Criterion
Total risk = empirical risk + risk due to the structure of the learning machine
Support Vector Machines
The VC Dimension & Structural Risk Minimization
VC: Vapnik–Chervonenkis
The VC Dimension
Consider a set of functions $f(x, \alpha) \in \{-1, +1\}$.
A given set of $l$ points can be labeled in $2^l$ ways.
If a member of $\{f(\alpha)\}$ can be found which correctly assigns the labels for every such labeling, the set of points is said to be shattered by that set of functions.
The VC dimension of $\{f(\alpha)\}$ is the maximum number of training points that can be shattered by $\{f(\alpha)\}$.
The VC Dimension for Oriented Lines in $\mathbb{R}^2$
(Figure: three points in the plane shattered by oriented lines; the VC dimension is 3.)
More on VC Dimension
In general, the VC dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is $n + 1$.
The VC dimension is a measure of memorization capability.
The VC dimension is not directly related to the number of parameters: Vapnik (1995) gives an example with one parameter and infinite VC dimension.
Bound on Expected Risk
Expected risk: $R(\alpha) = \int \tfrac{1}{2}\,|y - f(x, \alpha)|\; dP(x, y)$
Empirical risk: $R_{\mathrm{emp}}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$
With probability $1 - \eta$,
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\bigl(\log(2l/h) + 1\bigr) - \log(\eta/4)}{l}}$$
The square-root term is called the VC confidence.
Bound on Expected Risk
Consider a small $\eta$ (e.g., $\eta = 0.05$). Then, with high probability,
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\bigl(\log(2l/h) + 1\bigr) - \log(\eta/4)}{l}}$$
Traditional approaches minimize the empirical risk only.
Structural risk minimization aims to minimize the whole bound:
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \underbrace{\sqrt{\frac{h\bigl(\log(2l/h) + 1\bigr) - \log(\eta/4)}{l}}}_{\text{VC confidence}}$$
Amongst machines with zero empirical risk, choose the one with the smallest VC dimension.
How do we evaluate the VC dimension?
(Figure: the VC confidence as a function of $h/l$, for $\eta = 0.05$ and $l = 10{,}000$.)
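As a quick illustration of how the bound behaves, here is a small Python sketch of the VC-confidence term for the setting above ($\eta = 0.05$, $l = 10{,}000$); the specific values of $h$ tried are arbitrary.

```python
import math

def vc_confidence(h, l, eta=0.05):
    """sqrt( (h*(log(2l/h) + 1) - log(eta/4)) / l ), the VC confidence term."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

l = 10_000
for h in (10, 100, 1000, 5000):
    print(h, round(vc_confidence(h, l), 3))
# The term grows with h: a larger memorization capacity inflates the bound.
```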
Structural Risk Minimization
(Figure: nested subsets of functions with VC dimensions $h_1 \le h_2 \le h_3 \le h_4$.)
Nested subsets of functions with different VC dimensions.
Support Vector Machines
The Linear SVM
The Separable Case
The Linear Separability
(Figure: a linearly separable data set vs. a data set that is not linearly separable.)
The Linear Separability
The data are linearly separable if there exist $w$ and $b$ such that
$w \cdot x_i + b \ge +1$ for $y_i = +1$
$w \cdot x_i + b \le -1$ for $y_i = -1$,
or, equivalently, $y_i (w \cdot x_i + b) - 1 \ge 0 \;\; \forall i$.
The separating hyperplane is $w \cdot x + b = 0$; the two margin hyperplanes are $w \cdot x + b = \pm 1$.
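A tiny sketch of the condition above: given a candidate $(w, b)$, check whether $y_i (w \cdot x_i + b) - 1 \ge 0$ holds for every training point. The data set and the candidate hyperplane are illustrative assumptions.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), 0.0              # assumed candidate hyperplane

margins = y * (X @ w + b) - 1.0               # y_i (w . x_i + b) - 1
print(margins, bool(np.all(margins >= 0)))    # True -> (w, b) separates with margin >= 1
```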
Margin Width
The margin hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$ lie at distances $|1 - b| / \|w\|$ and $|{-1} - b| / \|w\|$ from the origin, so the margin width is
$$d = \frac{2}{\|w\|}$$
The constraints remain $y_i (w \cdot x_i + b) - 1 \ge 0 \;\; \forall i$.
What if we maximize the margin? What is the relation between the margin width and the VC dimension?
Maximum Margin Classifier
Maximizing the margin width $d = 2 / \|w\|$ subject to $y_i (w \cdot x_i + b) - 1 \ge 0 \;\; \forall i$ yields the maximum margin classifier.
The training points lying on the margin hyperplanes are the supporters (support vectors).
What is the relation between the margin width and the VC dimension?
Building SVM
Minimize $\;\tfrac{1}{2}\|w\|^2$
Subject to $\;y_i (w^T x_i + b) - 1 \ge 0 \;\; \forall i$
This requires knowledge of Lagrange multipliers.
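Before developing the Lagrangian machinery, the primal problem can also be handed to a generic constrained solver. The sketch below uses scipy's SLSQP on an assumed toy data set; it only illustrates the optimization problem stated above, not the method developed on the following slides.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                  # v = (w1, w2, b)
    w = v[:2]
    return 0.5 * w @ w             # (1/2) ||w||^2

constraints = [{'type': 'ineq',    # y_i (w . x_i + b) - 1 >= 0
                'fun': lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, y * (X @ w + b))       # at the optimum all margins should be >= 1
```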
The Method of Lagrange
Minimize $\;\tfrac{1}{2}\|w\|^2$
Subject to $\;y_i (w^T x_i + b) - 1 \ge 0 \;\; \forall i$
The Lagrangian:
$$L(w, b; \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \bigl[ y_i (w^T x_i + b) - 1 \bigr], \qquad \alpha_i \ge 0$$
Minimize it w.r.t. $(w, b)$ while maximizing it w.r.t. $\alpha$.
The Method of Lagrange
Minimize $\;\tfrac{1}{2}\|w\|^2$ subject to $\;y_i (w^T x_i + b) - 1 \ge 0 \;\; \forall i$.
What value should $\alpha_i$ take when its constraint is satisfied with strict inequality? What if the constraint is active (equal to zero)?
The Lagrangian:
$$L(w, b; \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \bigl[ y_i (w^T x_i + b) - 1 \bigr], \qquad \alpha_i \ge 0$$
Expanding,
$$L(w, b; \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w^T x_i + b) + \sum_{i=1}^{l} \alpha_i$$
Minimize it w.r.t. $(w, b)$ while maximizing it w.r.t. $\alpha$.
Duality
The primal: minimize $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + b) - 1 \ge 0 \;\; \forall i$.
The dual: maximize $L(w^*, b^*; \alpha)$ subject to $\nabla_{w, b}\, L(w, b; \alpha) = 0$ and $\alpha_i \ge 0,\; i = 1, \dots, l$, where
$$L(w, b; \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (w^T x_i + b) + \sum_{i=1}^{l} \alpha_i$$
Duality
Setting the gradients to zero:
$$\nabla_w L(w, b; \alpha) = w - \sum_{i=1}^{l} \alpha_i y_i x_i = 0 \;\;\Rightarrow\;\; w^* = \sum_{i=1}^{l} \alpha_i y_i x_i$$
$$\frac{\partial}{\partial b} L(w, b; \alpha) = -\sum_{i=1}^{l} \alpha_i y_i = 0 \;\;\Rightarrow\;\; \sum_{i=1}^{l} \alpha_i y_i = 0$$
The dual: maximize $L(w^*, b^*; \alpha)$ subject to these conditions and $\alpha_i \ge 0,\; i = 1, \dots, l$.
Duality
Substituting $w^* = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ into the Lagrangian:
$$L(w^*, b^*; \alpha) = \frac{1}{2}\Bigl(\sum_{i=1}^{l} \alpha_i y_i x_i\Bigr)^{\!T} \Bigl(\sum_{j=1}^{l} \alpha_j y_j x_j\Bigr) - \sum_{i=1}^{l} \alpha_i y_i x_i^T \Bigl(\sum_{j=1}^{l} \alpha_j y_j x_j\Bigr) - b \sum_{i=1}^{l} \alpha_i y_i + \sum_{i=1}^{l} \alpha_i$$
$$= \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \;\equiv\; F(\alpha)$$
Maximize $F(\alpha)$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.
Duality
In matrix form, with $D_{ij} = y_i y_j \langle x_i, x_j \rangle$,
$$F(\alpha) = \mathbf{1}^T \alpha - \tfrac{1}{2}\, \alpha^T D\, \alpha$$
Maximize $F(\alpha)$ subject to $\alpha^T y = 0$ and $\alpha \ge 0$.
Duality
The primal: minimize $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + b) - 1 \ge 0 \;\; \forall i$.
The dual: maximize $F(\alpha) = \mathbf{1}^T \alpha - \tfrac{1}{2}\, \alpha^T D\, \alpha$ subject to $\alpha^T y = 0$ and $\alpha \ge 0$.
The dual is a quadratic programming problem.
The Solution
Find $\alpha^*$ by solving the dual: maximize $F(\alpha) = \mathbf{1}^T \alpha - \tfrac{1}{2}\, \alpha^T D\, \alpha$ subject to $\alpha^T y = 0$, $\alpha \ge 0$.
Then $w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i$. But what is $b^*$?
The Solution
Find $\alpha^*$ by solving the dual. Then
$$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i$$
Call $x_i$ a support vector if $\alpha_i^* > 0$.
$$b^* = y_i - w^{*T} x_i \quad \text{for any } i \text{ with } \alpha_i^* > 0$$
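A hedged sketch of this procedure: solve the dual with scipy's SLSQP (any QP solver would do), then recover $w^*$ and $b^*$ as above. The toy data set is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
D = (y[:, None] * X) @ (y[:, None] * X).T        # D_ij = y_i y_j <x_i, x_j>

neg_F = lambda a: 0.5 * a @ D @ a - a.sum()      # minimize -F(alpha)
cons = [{'type': 'eq', 'fun': lambda a: a @ y}]  # alpha^T y = 0
bounds = [(0.0, None)] * len(y)                  # alpha_i >= 0

res = minimize(neg_F, x0=np.zeros(len(y)), method='SLSQP',
               bounds=bounds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                              # w* = sum_i alpha_i* y_i x_i
sv = alpha > 1e-6                                # support vectors: alpha_i* > 0
b = np.mean(y[sv] - X[sv] @ w)                   # b* = y_i - w*^T x_i on support vectors
print(alpha, w, b)
```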
The Karush-Kuhn-Tucker Conditions
The Lagrangian:
$$L(w, b; \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \bigl[ y_i (w^T x_i + b) - 1 \bigr]$$
The KKT conditions:
$$\nabla_w L(w, b; \alpha) = w - \sum_{i=1}^{l} \alpha_i y_i x_i = 0$$
$$\frac{\partial}{\partial b} L(w, b; \alpha) = -\sum_{i=1}^{l} \alpha_i y_i = 0$$
$$y_i (w^T x_i + b) - 1 \ge 0, \quad i = 1, \dots, l$$
$$\alpha_i \ge 0, \quad i = 1, \dots, l$$
$$\alpha_i \bigl[ y_i (w^T x_i + b) - 1 \bigr] = 0, \quad i = 1, \dots, l$$
Classification
With $w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i$,
$$f(x) = \operatorname{sgn}\bigl( w^{*T} x + b^* \bigr) = \operatorname{sgn}\Bigl( \sum_{i=1}^{l} \alpha_i^* y_i \langle x_i, x \rangle + b^* \Bigr) = \operatorname{sgn}\Bigl( \sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle x_i, x \rangle + b^* \Bigr)$$
Classification Using Supporters
$$f(x) = \operatorname{sgn}\Bigl( \sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle x_i, x \rangle + b^* \Bigr)$$
Here $\alpha_i^*$ is the weight of the $i$th support vector, $\langle x_i, x \rangle$ is the similarity measure between the input and the $i$th support vector, and $b^*$ is the bias.
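A minimal sketch of this decision rule; the stored support vectors, multipliers, and bias below are placeholder values for illustration, not the output of an actual training run.

```python
import numpy as np

support_x = np.array([[2.0, 2.0], [-2.0, -2.0]])   # support vectors x_i (assumed)
support_y = np.array([1.0, -1.0])                  # their labels y_i
alpha = np.array([0.125, 0.125])                   # multipliers alpha_i* (assumed)
b = 0.0                                            # bias b* (assumed)

def classify(x):
    # each term alpha_i y_i <x_i, x> weights the similarity between x and support vector i
    return np.sign(np.sum(alpha * support_y * (support_x @ x)) + b)

print(classify(np.array([1.0, 3.0])), classify(np.array([-3.0, 0.0])))   # 1.0, -1.0
```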
Demonstration
Support Vector Machines
The Linear SVM
The Non-Separable Case
The Non-Separable Case
We now require only that
$w \cdot x_i + b \ge +1 - \xi_i$ for $y_i = +1$
$w \cdot x_i + b \le -1 + \xi_i$ for $y_i = -1$,
with slack variables $\xi_i \ge 0$; equivalently, $y_i (w \cdot x_i + b) - 1 + \xi_i \ge 0 \;\; \forall i$.
Mathematical Formulation
Minimize $\;\tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i^k$
Subject to $\;y_i (w^T x_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \;\; \forall i$
For simplicity, we consider $k = 1$.
The Lagrangian
Minimize $\;\tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w^T x_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \;\; \forall i$.
$$L(w, b, \xi; \alpha, \mu) = \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \bigl[ y_i (w^T x_i + b) - 1 + \xi_i \bigr] - \sum_i \mu_i \xi_i, \qquad \alpha_i \ge 0,\; \mu_i \ge 0$$
Duality
The primal: minimize $\tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w^T x_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \;\; \forall i$.
The dual: maximize $L(w^*, b^*, \xi^*; \alpha, \mu)$ subject to $\nabla_{w, b, \xi}\, L(w, b, \xi; \alpha, \mu) = 0$, $\alpha \ge 0$, $\mu \ge 0$.
Duality
Setting the gradients to zero:
$$\nabla_w L = w - \sum_i \alpha_i y_i x_i = 0 \;\;\Rightarrow\;\; w^* = \sum_i \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\;\Rightarrow\;\; \sum_i \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \;\;\Rightarrow\;\; \mu_i = C - \alpha_i \;\;\Rightarrow\;\; 0 \le \alpha_i \le C$$
The dual: maximize $L(w^*, b^*, \xi^*; \alpha, \mu)$ subject to these conditions, with $\alpha \ge 0$ and $\mu \ge 0$.
Duality
Substituting these conditions into the Lagrangian eliminates $\xi$ and $\mu$:
$$F(\alpha) = L(w^*, b^*, \xi^*; \alpha, \mu) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
Maximize this subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.
2
Duality
i 0, i 0
w L(w, b, ; , ) w i i yi xi 0
b L(w, b, ; , ) i i yi 0
i L(w, b, ; , ) C i i 0
w* i i yi xi
y
i
i
i
0
T y 0
i C i
0 C 0 i C
F (, ) L(w*, b*, *; , ) i 1 i j yi y j xi , x j
i
2 i j
Maximize this
1 T
F ( ) 1 D
2
Duality
The primal: minimize $\tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i^k$ subject to $y_i (w^T x_i + b) - 1 + \xi_i \ge 0$ and $\xi_i \ge 0 \;\; \forall i$.
The dual: maximize $F(\alpha) = \mathbf{1}^T \alpha - \tfrac{1}{2}\, \alpha^T D\, \alpha$ subject to $\alpha^T y = 0$ and $0 \le \alpha \le C\mathbf{1}$.
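The only change relative to the separable dual is the box constraint $0 \le \alpha_i \le C$. A brief sketch, with an assumed overlapping toy data set and an assumed value of $C$:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [-0.5, -0.5], [-2.0, -2.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])       # not linearly separable
C = 1.0
D = (y[:, None] * X) @ (y[:, None] * X).T

res = minimize(lambda a: 0.5 * a @ D @ a - a.sum(),
               x0=np.zeros(len(y)), method='SLSQP',
               bounds=[(0.0, C)] * len(y),                        # 0 <= alpha_i <= C
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X
print(alpha, w)   # alpha_i at the upper bound C flags points inside the margin or misclassified
```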
The Karush-Kuhn-Tucker Conditions
$$\nabla_w L = w - \sum_i \alpha_i y_i x_i = 0$$
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$$
$$y_i (w^T x_i + b) - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0, \qquad \alpha_i \ge 0, \qquad \mu_i \ge 0$$
$$\alpha_i \bigl[ y_i (w^T x_i + b) - 1 + \xi_i \bigr] = 0, \qquad \mu_i \xi_i = 0$$
Quadratic Programming
The Solution
Find $\alpha^*$ by solving the dual: maximize $F(\alpha) = \mathbf{1}^T \alpha - \tfrac{1}{2}\, \alpha^T D\, \alpha$ subject to $\alpha^T y = 0$, $0 \le \alpha \le C\mathbf{1}$.
Then $w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i$. But what are $b^*$ and $\xi^*$?
The Solution
Find $\alpha^*$ by solving the dual. Then
$$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i$$
Call $x_i$ a support vector if $0 < \alpha_i^* < C$.
$$b^* = y_i - w^{*T} x_i \quad \text{for any } i \text{ with } 0 < \alpha_i^* < C$$
The Solution
Find $\alpha^*$ by solving the dual. Then
$$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i, \qquad b^* = y_i - w^{*T} x_i \;\text{ for any } i \text{ with } 0 < \alpha_i^* < C$$
$$\xi_i = \max\bigl(0,\; 1 - y_i (w^* \cdot x_i + b^*)\bigr)$$
A pattern with $\xi_i > 1$ is misclassified; for $0 < \xi_i \le 1$ it lies inside the margin but on the correct side.
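A small sketch of this slack recovery; $w$ and $b$ below are placeholder values standing in for the result of an actual solve.

```python
import numpy as np

X = np.array([[2.0, 2.0], [0.2, 0.1], [-2.0, -2.0], [-0.3, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), 0.0              # assumed trained parameters

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i = max(0, 1 - y_i (w . x_i + b))
print(xi)                                     # xi_i > 1 would indicate a misclassified point
print(xi > 1.0)
```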
Classification
With $w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i$,
$$f(x) = \operatorname{sgn}\bigl( w^{*T} x + b^* \bigr) = \operatorname{sgn}\Bigl( \sum_{i=1}^{l} \alpha_i^* y_i \langle x_i, x \rangle + b^* \Bigr) = \operatorname{sgn}\Bigl( \sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle x_i, x \rangle + b^* \Bigr)$$
Classification Using Supporters
$$f(x) = \operatorname{sgn}\Bigl( \sum_{\alpha_i^* > 0} \alpha_i^* y_i \langle x_i, x \rangle + b^* \Bigr)$$
Here $\alpha_i^*$ is the weight of the $i$th support vector, $\langle x_i, x \rangle$ is the similarity measure between the input and the $i$th support vector, and $b^*$ is the bias.
Support Vector Machines
Lagrange Multipliers
Lagrange Multiplier
The Lagrange multiplier method is a powerful tool for constrained optimization.
It was introduced by Joseph-Louis Lagrange.
Milkmaid Problem
The point $(x, y)$ must lie on the curve $g(x, y) = 0$ (the river bank); $(x_1, y_1)$ and $(x_2, y_2)$ are the fixed positions to be visited. Minimize the total squared distance
$$f(x, y) = \sum_{i=1}^{2} \bigl[ (x - x_i)^2 + (y - y_i)^2 \bigr]$$
Goal: minimize $f(x, y)$ subject to $g(x, y) = 0$.
Observation
Goal: minimize $f(x, y)$ subject to $g(x, y) = 0$.
At an extreme point $(x^*, y^*)$, the gradients are parallel:
$$\nabla f(x^*, y^*) = \lambda\, \nabla g(x^*, y^*)$$
Optimization with Equality Constraints
Goal: min/max $f(x)$ subject to $g(x) = 0$.
Lemma: at an extreme point $x^*$,
$$\nabla f(x^*) = \lambda\, \nabla g(x^*) \quad \text{if } \nabla g(x^*) \ne 0,$$
where $\lambda$ is the Lagrange multiplier.
Proof
Let $x^*$ be an extreme point, and let $r(t)$ be any differentiable path on the surface $g(x) = 0$ such that $r(t_0) = x^*$.
Then $\dot r(t_0)$ is a vector tangent to the surface $g(x) = 0$ at $x^*$.
Along the path, $\frac{d}{dt} f(r(t)) = \nabla f(r(t)) \cdot \dot r(t)$.
Since $x^*$ is an extreme point,
$$0 = \nabla f(r(t_0)) \cdot \dot r(t_0) = \nabla f(x^*) \cdot \dot r(t_0).$$
Proof (continued)
$\nabla f(x^*) \cdot \dot r(t_0) = 0$ holds for every path $r$ through $x^*$ on the surface $g(x) = 0$.
It follows that $\nabla f(x^*) \perp \Pi$, where $\Pi$ is the tangent plane of the surface $g(x) = 0$ at $x^*$.
Since $\nabla g(x^*)$ is also normal to $\Pi$, the two gradients are parallel: $\nabla f(x^*) = \lambda\, \nabla g(x^*)$.
Optimization with Equality Constraints
Goal: min/max $f(x)$ subject to $g(x) = 0$, where $x$ has dimension $n$.
Lemma: at an extreme point $x^*$, $\;\nabla f(x^*) = \lambda\, \nabla g(x^*)\;$ for some Lagrange multiplier $\lambda$, provided $\nabla g(x^*) \ne 0$.
The Method of Lagrange
Goal: min/max $f(x)$ subject to $g(x) = 0$.
Find the extreme points by solving
$$\nabla f(x) = \lambda\, \nabla g(x), \qquad g(x) = 0$$
This is a system of $n + 1$ equations in $n + 1$ variables ($x$ and $\lambda$).
Lagrangian
Goal: min/max $f(x)$ subject to $g(x) = 0$ (a constrained optimization).
Define $\;L(x; \lambda) = f(x) - \lambda\, g(x)$.
Solve the unconstrained optimization
$$\nabla_x L(x; \lambda) = 0, \qquad \frac{\partial}{\partial \lambda} L(x; \lambda) = 0$$
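A worked instance of this recipe, solved symbolically with sympy for an illustrative problem chosen here: minimize $f(x, y) = x^2 + y^2$ subject to $g(x, y) = x + y - 1 = 0$.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 + y**2
g = x + y - 1

# n + 1 equations in n + 1 unknowns: grad f = lambda * grad g, together with g = 0
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
print(sp.solve(eqs, [x, y, lam]))   # {x: 1/2, y: 1/2, lam: 1} -> extreme point (1/2, 1/2)
```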
Optimization with Multiple Equality Constraints
Min/max $f(x)$ subject to $g_i(x) = 0,\; i = 1, \dots, m$.
Define the Lagrangian, with $\lambda = (\lambda_1, \dots, \lambda_m)^T$,
$$L(x; \lambda) = f(x) - \sum_{i=1}^{m} \lambda_i\, g_i(x)$$
Solve $\;\nabla_{x, \lambda}\, L(x; \lambda) = 0$.
Optimization with Inequality Constraints
Minimize $f(x)$
Subject to $\;g_i(x) = 0,\; i = 1, \dots, m\;$ and $\;h_j(x) \le 0,\; j = 1, \dots, n$.
You can always reformulate your problem into the above form.
Multipliers: $\lambda = (\lambda_1, \dots, \lambda_m)^T$ and $\mu = (\mu_1, \dots, \mu_n)^T$ with $\mu_j \ge 0$.
Lagrange Multipliers
Minimize $f(x)$ subject to $g_i(x) = 0,\; i = 1, \dots, m$ and $h_j(x) \le 0,\; j = 1, \dots, n$.
Lagrangian, with $\lambda = (\lambda_1, \dots, \lambda_m)^T$, $\mu = (\mu_1, \dots, \mu_n)^T$, and $\mu_j \ge 0$:
$$L(x; \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x) + \sum_{j=1}^{n} \mu_j\, h_j(x)$$
Lagrange Multipliers
Minimize $f(x)$ subject to $g_i(x) = 0$ (exactly zero for feasible solutions) and $h_j(x) \le 0$ (non-positive for feasible solutions), $i = 1, \dots, m$, $j = 1, \dots, n$.
Lagrangian:
$$L(x; \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x) + \sum_{j=1}^{n} \mu_j\, h_j(x), \qquad \mu_j \ge 0$$
Hence, for any feasible $x$, the constraint terms are non-positive and $L(x; \lambda, \mu) \le f(x)$.
Duality
Let $x^*$ be a local extreme, so $\;\nabla_x L(x; \lambda, \mu)\big|_{x = x^*} = 0$.
Define
$$D(\lambda, \mu) = L(x^*; \lambda, \mu) = f(x^*) + \sum_{i=1}^{m} \lambda_i\, g_i(x^*) + \sum_{j=1}^{n} \mu_j\, h_j(x^*) \;\le\; f(x^*)$$
Maximize it w.r.t. $\lambda, \mu$.
Duality
Minimize the Lagrangian w.r.t. $x$ while maximizing it w.r.t. $\lambda$ and $\mu$:
let $x^*$ be a local extreme, so $\;\nabla_x L(x; \lambda, \mu)\big|_{x = x^*} = 0$, and maximize
$$D(\lambda, \mu) = L(x^*; \lambda, \mu) = f(x^*) + \sum_{i=1}^{m} \lambda_i\, g_i(x^*) + \sum_{j=1}^{n} \mu_j\, h_j(x^*) \;\le\; f(x^*)$$
w.r.t. $\lambda$ and $\mu$.
Saddle Point Determination
$$L(x; \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x) + \sum_{j=1}^{n} \mu_j\, h_j(x)$$
The solution is a saddle point of $L$: a minimum over $x$ and a maximum over $\lambda$ and $\mu$.
Duality
The primal: minimize $f(x)$ subject to $g_i(x) = 0,\; i = 1, \dots, m$ and $h_j(x) \le 0,\; j = 1, \dots, n$.
The dual: maximize $L(x^*; \lambda, \mu)$ subject to $\;\nabla_x L(x; \lambda, \mu) = 0\;$ and $\;\mu \ge 0$, where
$$L(x; \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i\, g_i(x) + \sum_{j=1}^{n} \mu_j\, h_j(x)$$
The Karush-Kuhn-Tucker Conditions
$$\nabla_x L(x; \lambda, \mu) = \nabla_x f(x) + \sum_{i=1}^{m} \lambda_i\, \nabla_x g_i(x) + \sum_{j=1}^{n} \mu_j\, \nabla_x h_j(x) = 0$$
$$g_i(x) = 0, \quad i = 1, \dots, m$$
$$h_j(x) \le 0, \quad j = 1, \dots, n$$
$$\mu_j \ge 0, \quad j = 1, \dots, n$$
$$\mu_j\, h_j(x) = 0, \quad j = 1, \dots, n$$
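A tiny numerical check of these conditions for an illustrative problem chosen here: minimize $f(x) = x^2$ subject to $h(x) = 1 - x \le 0$, whose solution is $x^* = 1$ with multiplier $\mu^* = 2$.

```python
def check_kkt(x, mu):
    grad_f = 2 * x            # d/dx x^2
    grad_h = -1.0             # d/dx (1 - x)
    stationarity = abs(grad_f + mu * grad_h) < 1e-9
    primal_feasible = (1 - x) <= 1e-9
    dual_feasible = mu >= 0
    complementary = abs(mu * (1 - x)) < 1e-9
    return stationarity and primal_feasible and dual_feasible and complementary

print(check_kkt(1.0, 2.0))    # True: all four KKT conditions hold at (x*, mu*)
print(check_kkt(0.5, 0.0))    # False: x = 0.5 violates the constraint 1 - x <= 0
```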