Lecture Slides - UTK-EECS

New Horizon in Machine Learning —
Support Vector Machine
for Non-Parametric Learning
Zhao Lu, Ph.D.
Associate Professor
Department of Electrical Engineering, Tuskegee University
1
Introduction
As an innovative non-parametric learning strategy, the
Support Vector Machine (SVM) gained increasing
popularity in the late 1990s. It is currently among the best
performers for tasks such as pattern recognition,
regression, and signal processing.
Support vector learning algorithms
Support vector classification for nonlinear pattern
recognition;
Support vector regression for highly nonlinear
function approximation;
2
Part I. Support Vector Learning
for Classification
3
Overfitting in linearly separable classification
4
What is a good Decision Boundary?
Consider a two-class, linearly separable classification problem.
Construct the hyperplane
    w^T x + b = 0, \quad x \in \mathbb{R}^n
with decision function f(x) = sign(w^T x + b), such that
    w^T x_i + b > 0 \quad \text{for } y_i = +1
    w^T x_i + b < 0 \quad \text{for } y_i = -1
Figure. Two linearly separable classes (Class 1 and Class 2) with several candidate separating hyperplanes.
Many decision boundaries! Are all decision boundaries equally good?
5
Examples of Bad Decision Boundaries
Figure. Two decision boundaries f(x) = sign(w^T x + b) that pass very close to the training points of Class 1 or of Class 2.
For linearly separable classes, new data from a class are expected to lie
close to the training data of that class, so a boundary that nearly touches
the training points of either class generalizes poorly.
6
Optimal separating hyperplane
The optimal separating hyperplane (OSH) is defined as
    \max_{w,b} \min_i \left\{ \|x - x_i\| : x \in \mathbb{R}^n, \; w^T x + b = 0, \; i = 1, 2, \dots, \ell \right\}
It can be proved that the OSH is unique and lies halfway
between the margin hyperplanes.
Figure. The OSH w^T x + b = 0 lies midway between the margin hyperplanes w^T x + b = 1 and w^T x + b = -1, at margin m from the closest points of Class 1 and Class 2.
7
Canonical separating hyperplane
A hyperplane w^T x + b = 0 is in canonical form with
respect to all training data x_i ∈ X if:
    \min_{x_i \in X} \left| w^T x_i + b \right| = 1
Margin hyperplanes:
    H_1: w^T x + b = +1
    H_2: w^T x + b = -1
A canonical hyperplane having a maximal margin
    m = \min_i \left\{ \|x - x_i\| : x \in \mathbb{R}^n, \; w^T x + b = 0, \; i = 1, 2, \dots, \ell \right\}
is the ultimate learning goal, i.e. the optimal separating
hyperplane.
8
Margin in terms of the norm of w
According to statistical learning theory, a large-margin
decision boundary has excellent generalization capability.
For the canonical hyperplane, it can be proved that the
margin is
    m = \frac{1}{\|w\|}
(equivalently, the distance between the two margin hyperplanes is 2/‖w‖).
Hence, maximizing the margin is equivalent to minimizing the
square of the norm of w.
9
Finding the optimal decision boundary
Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the
class label of x_i.
The optimal decision boundary should classify all
points correctly:
    y_i (w^T x_i + b) \ge 1, \quad \forall i
The decision boundary can be found by solving the
following constrained optimization problem:
    \text{minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i (w^T x_i + b) \ge 1, \; \forall i
This is a quadratic optimization problem with linear
inequality constraints.
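As a rough, hedged illustration (not part of the original slides), this primal QP can be handed directly to a general-purpose solver on a tiny toy set; the use of SciPy's SLSQP method and the toy data below are my own choices, not the lecture's.

```python
# Minimal sketch: hard-margin primal QP on a toy 2-D set (assumed example data).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])          # labels in {+1, -1}

def objective(theta):                          # theta = [w1, w2, b]
    w = theta[:2]
    return 0.5 * w @ w                         # (1/2) ||w||^2

# one inequality constraint per point: y_i (w^T x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:2] + theta[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b)
print("y_i (w^T x_i + b) =", y * (X @ w + b))  # every value should be >= 1
```

On this toy set the constraints that end up active (value equal to 1) mark the points that reappear as support vectors in the dual view developed on the following slides.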
10
Generalized Lagrangian Function
Consider the general (primal) optimization problem
    \text{minimize } f(w)
    \text{subject to } g_i(w) \le 0, \; i = 1, \dots, k
    \qquad\qquad\quad h_j(w) = 0, \; j = 1, \dots, m
where the functions f, g_i (i = 1, ..., k) and h_j (j = 1, ..., m)
are defined on a domain Ω. The generalized Lagrangian
is defined as
    L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{m} \beta_j h_j(w) = f(w) + \alpha^T g(w) + \beta^T h(w)
11
Dual Problem and Strong Duality Theorem
Given the primal optimization problem, its dual problem
is defined as
    \text{maximize } \theta(\alpha, \beta) = \inf_{w} L(w, \alpha, \beta)
    \text{subject to } \alpha \ge 0
Strong Duality Theorem: Given the primal optimization
problem, where the domain Ω is convex and the
constraints g_i and h_j are affine functions, the
optimum of the primal problem occurs at the same value
as the optimum of the dual problem.
12
Karush-Kuhn-Tucker Conditions
Given the primal optimization problem with the objective
function f ∈ C^1 convex and g_i, h_j affine, necessary and
sufficient conditions for w* to be an optimum are the
existence of α*, β* such that
    \frac{\partial}{\partial w} L(w^*, \alpha^*, \beta^*) = 0
    \frac{\partial}{\partial \beta} L(w^*, \alpha^*, \beta^*) = 0
    \alpha_i^* g_i(w^*) = 0, \; i = 1, \dots, k \quad \text{(KKT complementarity condition)}
    g_i(w^*) \le 0, \; i = 1, \dots, k
    \alpha_i^* \ge 0, \; i = 1, \dots, k
13
Lagrangian of the optimization problem
    \text{minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } 1 - y_i (w^T x_i + b) \le 0, \; \forall i
The Lagrangian is
    L = \frac{1}{2} w^T w + \sum_{i=1}^{n} \alpha_i \left( 1 - y_i (w^T x_i + b) \right)
Setting the gradient of L w.r.t. w and b to zero, we have
    \nabla_w L = w + \sum_{i=1}^{n} \alpha_i (-y_i) x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad \text{(parametric → nonparametric)}
    \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0
14
The Dual Problem
If we substitute w = Σ_{i=1}^n α_i y_i x_i into the Lagrangian L, we have
    L = \frac{1}{2} \Big( \sum_{i=1}^{n} \alpha_i y_i x_i \Big)^T \Big( \sum_{j=1}^{n} \alpha_j y_j x_j \Big) + \sum_{i=1}^{n} \alpha_i \Big( 1 - y_i \big( \sum_{j=1}^{n} \alpha_j y_j x_j^T x_i + b \big) \Big)
      = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i y_i \sum_{j=1}^{n} \alpha_j y_j x_j^T x_i - b \sum_{i=1}^{n} \alpha_i y_i
      = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i
Note that Σ_{i=1}^n α_i y_i = 0 (so the term involving b vanishes), and the data points appear
only in terms of their inner products; this is a quadratic function of the α_i only.
15
The Dual Problem
The new objective function is in terms of the α_i only
The original problem is known as the primal problem
The objective function of the dual problem needs to be
maximized!
The dual problem is therefore:
    \text{maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
    \text{subject to } \alpha_i \ge 0   (a property of the Lagrange multipliers we introduced)
    \qquad\qquad\quad \sum_{i=1}^{n} \alpha_i y_i = 0   (the result of differentiating the original Lagrangian w.r.t. b)
16
The Dual Problem
    \text{minimize } W(\alpha) = \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{n} \alpha_i
    \text{subject to } \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
This is a quadratic programming (QP) problem, and therefore
a global minimum over the α_i can always be found.
w can be recovered by w = Σ_{i=1}^n α_i y_i x_i, so the decision function
can be written in the following non-parametric form
    f(x) = \mathrm{sign} \Big( \sum_{i=1}^{n} \alpha_i y_i x_i^T x + b \Big)
17
Concept of Support Vectors (SVs)
According to the Karush-Kuhn-Tucker (KKT)
complementarity condition, the solution must satisfy
    \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0
Thus, α_i ≠ 0 only for those points x_i that are closest to the
classifying hyperplane. These points are called support
vectors.
From the KKT complementarity condition, the bias term b
can be calculated by using the support vectors:
    w^T x_k + b = y_k \;\Rightarrow\; b = y_k - \sum_{i=1}^{n} \alpha_i y_i x_i^T x_k \quad \text{for any } \alpha_k > 0
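A hedged sketch of the same machinery (my own code, with SciPy assumed, not the lecture's software): maximize the dual with SLSQP, recover w from the nonzero α_i, and compute b from the support vectors as on this slide.

```python
# Minimal sketch: hard-margin dual QP, support vectors, and the bias b.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i^T x_j

def neg_dual(a):                               # negative of W(alpha), to minimize
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                               # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])   # sum_i alpha_i y_i = 0
alpha = res.x
w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                              # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                 # b = y_k - w^T x_k, averaged over the SVs
print("alpha =", alpha.round(3), "\nw =", w, " b =", b)
```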
18
Sparseness of the solution
Figure. Sparseness of the solution: most training points have α_i = 0 (here α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0); only the support vectors lying on the margin hyperplanes carry nonzero multipliers (α_1 = 0.8, α_6 = 1.4, α_8 = 0.6). The decision boundary is w^T x + b = 0 with margin hyperplanes w^T x + b = ±1.
19
The use of slack variables
We allow "errors" ξ_i in classification for noisy data.
Figure. Two noisy points x_i and x_j lie on the wrong side of their margin hyperplane; their slacks ξ_i and ξ_j measure the violation relative to w^T x + b = ±1, with the decision boundary w^T x + b = 0 in between.
20
Soft Margin Hyperplane
The use of slack variables ξ_i enables the soft-margin classifier:
    w^T x_i + b \ge 1 - \xi_i \quad \text{for } y_i = +1
    w^T x_i + b \le -1 + \xi_i \quad \text{for } y_i = -1
    \xi_i \ge 0, \; \forall i
The ξ_i are "slack variables" in the optimization; note that ξ_i = 0 if there is no error for x_i.
The objective function becomes
    \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
where C is the tradeoff parameter between error and margin.
The primal optimization problem becomes
    \text{minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \text{subject to } y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
21
Dual Soft-Margin Optimization Problem
The dual of this new constrained optimization problem is
    \text{maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
    \text{subject to } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
w can be recovered as
    w = \sum_{i=1}^{n} \alpha_i y_i x_i
This is very similar to the optimization problem in the
hard-margin case, except that there is now an upper bound C on
the α_i.
Once again, a QP solver can be used to find the α_i.
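A hedged sketch (scikit-learn is my own choice here, not the MATLAB toolbox cited at the end of these slides): a linear soft-margin classifier, checking that the dual coefficients respect the box constraint 0 ≤ α_i ≤ C.

```python
# Minimal sketch: linear soft-margin SVC; dual_coef_ holds y_i * alpha_i for the SVs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(40, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(40, 2))])
y = np.hstack([np.ones(40), -np.ones(40)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("number of support vectors:", clf.support_.size)
print("max |y_i alpha_i| =", np.abs(clf.dual_coef_).max(), "<= C =", C)
```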
22
Nonlinearly separable problems
23
Extension to Non-linear Decision Boundary
How to extend the linear large-margin classifier to the
nonlinear case?
Cover’s theorem
Consider a space made up of nonlinearly separable
patterns.
Cover’s theorem states that such a multi-dimensional
space can be transformed into a new feature space
where the patterns are linearly separable with a high
probability, provided two conditions are satisfied:
(1) The transform is nonlinear;
(2) The dimensionality of the feature space is high enough;
24
Non-linear SVMs: Feature spaces
General idea: the data in the original input space can always be
mapped into some higher-dimensional feature space, where
the training data become linearly separable, by using a
nonlinear transformation:
    \Phi: x \mapsto \phi(x)
kernel visualization: http://www.youtube.com/watch?v=9NrALgHFwTo
25
Transforming the data
Key idea: transform xi to a higher dimensional space by
using a nonlinear transformation
Input space: the space where the points x_i are located
Feature space: the space of the φ(x_i) after the transformation
Curse of dimensionality: Computation in the feature space
can be very costly because it is high dimensional, and the
feature space is typically infinite-dimensional!
This 'curse of dimensionality' can be surmounted by
means of the kernel function, because the inner product of
two feature vectors is just a scalar; this is one of the most
appealing characteristics of SVM.
26
Kernel trick
Recall the SVM dual optimization problem
    \text{maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
    \text{subject to } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
The data points only appear as inner products.
With the aid of the inner-product representation in the feature
space, the nonlinear mapping φ(x_i) can be used implicitly
by defining the kernel function K by
    K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
27
What functions can be used as kernels?
Mercer’s theorem in operator theory:
Every positive semi-definite symmetric function is a kernel.
Positive semi-definite symmetric functions correspond to a
positive semi-definite symmetric Gram matrix on the data points:
    K = \begin{pmatrix}
        K(x_1, x_1) & K(x_1, x_2) & K(x_1, x_3) & \cdots & K(x_1, x_n) \\
        K(x_2, x_1) & K(x_2, x_2) & K(x_2, x_3) & \cdots & K(x_2, x_n) \\
        \vdots      & \vdots      & \vdots      & \ddots & \vdots      \\
        K(x_n, x_1) & K(x_n, x_2) & K(x_n, x_3) & \cdots & K(x_n, x_n)
    \end{pmatrix}
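As a quick numerical illustration (my own sketch, with NumPy assumed): build the Gram matrix of an RBF kernel on random points and check that its smallest eigenvalue is non-negative, as positive semi-definiteness requires.

```python
# Minimal sketch: an RBF Gram matrix is (numerically) positive semi-definite.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sigma = 1.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2.0 * sigma**2))                     # K_ij = k(x_i, x_j)
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```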
28
An Example of φ(·) and K(·,·)
Suppose the nonlinear mapping φ(·): R^2 → R^6 is as follows:
    \phi\!\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) = \left( 1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2 \right)^T
An inner product in the feature space is
    \left\langle \phi\!\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right), \phi\!\left( \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \right) \right\rangle = (1 + x_1 y_1 + x_2 y_2)^2
So, if we define the kernel function as follows, there is no
need to carry out φ(·) explicitly:
    K(x, y) = (1 + x_1 y_1 + x_2 y_2)^2
This use of a kernel function to avoid carrying out φ(·)
explicitly is known as the kernel trick.
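A quick numerical check of this example (my own sketch): the explicit 6-dimensional map φ and the kernel K give the same inner product.

```python
# Minimal sketch: phi(x)^T phi(y) equals (1 + x1*y1 + x2*y2)^2.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, y):
    return (1.0 + x[0] * y[0] + x[1] * y[1]) ** 2

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(y), K(x, y))   # identical up to floating-point error
```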
29
Kernel functions
In practical use of SVM, the user specifies the kernel
function; the transformation φ(·) is not explicitly stated.
Given a kernel function K(x_i, x_j), the transformation φ(·) is
given by its eigenfunctions (a concept in functional
analysis):
    \int_X k(x, z)\, \phi(z)\, dz = \lambda\, \phi(x)
Eigenfunctions can be difficult to construct explicitly.
This is why people only specify the kernel function
without worrying about the exact transformation φ.
30
Examples of kernel functions
Polynomial kernel with degree d
    K(x, y) = (x^T y + 1)^d
Radial basis function kernel with width s
    K(x, y) = \exp\!\left( -\|x - y\|^2 / (2 s^2) \right)
Closely related to radial basis function neural networks
The feature space induced is infinite-dimensional
Sigmoid function with parameters k and q
    K(x, y) = \tanh(k\, x^T y + q)
It does not satisfy the Mercer condition for all k and q
Closely related to feedforward neural networks
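For concreteness, a small sketch of these three kernels as plain functions (the function names and default parameter values are my own choices).

```python
# Minimal sketch: the polynomial, RBF, and sigmoid kernels from this slide.
import numpy as np

def poly_kernel(x, y, d=3):
    return (x @ y + 1.0) ** d                           # (x^T y + 1)^d

def rbf_kernel(x, y, s=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * s**2))

def sigmoid_kernel(x, y, k=1.0, q=-1.0):
    return np.tanh(k * (x @ y) + q)                     # Mercer holds only for some k, q

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```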
31
Kernel: Bridge from linear to nonlinear
Change all inner products to kernel functions
For training, the optimization problem is
Linear:
    \text{maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
    \text{subject to } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
Nonlinear:
    \text{maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
    \text{subject to } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
32
Kernel expansion for decision function
For classifying a new datum z: it belongs to class 1 if f(z) ≥ 0,
and to class 2 if f(z) < 0.
Linear:
    w = \sum_{i=1}^{n} \alpha_i y_i x_i
    f(x) = w^T x + b = \sum_{i=1}^{n} \alpha_i y_i x_i^T x + b
Nonlinear:
    w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i)
    f(x) = w^T \phi(x) + b = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b
33
Compared to neural networks
SVMs are explicitly based on a theoretical model of
learning rather than on loose analogies with natural
learning systems or other heuristics.
Modularity: Any kernel-based learning algorithm is
composed of two modules:
A general purpose learning machine
A problem specific kernel function
SVMs are not affected by the problem of local minima
because their training amounts to convex optimization.
34
Key features in SV classifier
All of these features were already present and had been used in
machine learning since the 1960s:
Maximum (large) margin;
Kernel method;
Duality in nonlinear programming;
Sparseness of the solution;
Slack variables;
However, it was not until 1995 that all of these features were
combined, and it is striking how naturally and elegantly
they fit together and complement each other in the SVM.
35
SVM classification for 2D data
Figure. Visualization of SVM classification.
36
Part II. Support Vector Learning
for Regression
37
Overfitting in nonlinear regression
38
The linear regression problem
    y = f(x) = \langle w \cdot x \rangle + b
39
Linear regression
The problem of linear regression is much older than the
classification one. Least squares linear interpolation was
first used by Gauss in the 18th century for astronomical
problems.
Given a training set S, with x_i ∈ X ⊆ R^n and y_i ∈ Y ⊆ R, the
problem of linear regression is to find a linear function f
that models the data:
    y = f(x) = \langle w \cdot x \rangle + b
40
Least squares
The least squares approach prescribes choosing the
parameters (w, b) to minimize the sum of the squared
deviations of the data,
    J(w, b) = \sum_{i=1}^{\ell} \left( y_i - \langle w \cdot x_i \rangle - b \right)^2
Setting \hat{w} = (w^T, b)^T and
    \hat{X} = \begin{pmatrix} \hat{x}_1^T \\ \hat{x}_2^T \\ \vdots \\ \hat{x}_\ell^T \end{pmatrix}, \qquad \text{where } \hat{x}_i = (x_i^T, 1)^T,
41
Least squares
the square loss function can be written as
    J(\hat{w}) = (y - \hat{X}\hat{w})^T (y - \hat{X}\hat{w})
Taking derivatives of the loss and setting them equal to
zero,
    \frac{\partial J}{\partial \hat{w}} = -2 \hat{X}^T y + 2 \hat{X}^T \hat{X} \hat{w} = 0,
yields the well-known 'normal equations'
    \hat{X}^T \hat{X} \hat{w} = \hat{X}^T y
and, if the inverse of \hat{X}^T \hat{X} exists, the solution is:
    \hat{w} = (\hat{X}^T \hat{X})^{-1} \hat{X}^T y
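A minimal sketch of the normal equations (my own toy data), using the augmented matrix X̂ with rows (x_i^T, 1) defined on the previous slide.

```python
# Minimal sketch: least squares via the normal equations X^T X w = X^T y.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=40)
y = 1.7 * x - 0.5 + rng.normal(scale=0.3, size=40)      # noisy line

X_hat = np.column_stack([x, np.ones_like(x)])           # rows (x_i, 1)
w_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)   # (X^T X)^{-1} X^T y
print("slope w =", w_hat[0], " bias b =", w_hat[1])
```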
42
Ridge regression
If the matrix \hat{X}^T \hat{X} in the least squares problem is not of
full rank, or in other situations where numerical stability
problems occur, one can use the following solution instead:
    \hat{w} = (\hat{X}^T \hat{X} + \lambda I_n)^{-1} \hat{X}^T y
where λ > 0 and I_n is the identity matrix with the (n+1, n+1) entry
set to zero (so that the bias b is not penalized). This solution is called ridge regression.
Ridge regression minimizes the penalized loss function
    J(w, b) = \lambda \langle w \cdot w \rangle + \sum_{i=1}^{\ell} \left( \langle w \cdot x_i \rangle + b - y_i \right)^2
in which the first term acts as the regularizer.
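A minimal sketch of this ridge solution (λ is a value I chose); the last diagonal entry of the modified identity is zero so the bias b is not penalized.

```python
# Minimal sketch: ridge regression with the bias left unregularized.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -2.0]) + 0.3 + rng.normal(scale=0.1, size=30)
X_hat = np.column_stack([X, np.ones(len(X))])           # rows (x_i^T, 1)

lam = 0.1                                               # regularization strength (my choice)
I = np.eye(X_hat.shape[1])
I[-1, -1] = 0.0                                         # (n+1, n+1) entry set to zero
w_hat = np.linalg.solve(X_hat.T @ X_hat + lam * I, X_hat.T @ y)
print("w =", w_hat[:-1], " b =", w_hat[-1])
```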
43
ε-insensitive loss function
Instead of using the square loss function, the ε-insensitive
loss function is used in SV regression:
    |x|_\varepsilon = \begin{cases} 0 & \text{if } |x| \le \varepsilon \\ |x| - \varepsilon & \text{otherwise} \end{cases}
which leads to sparsity of the solution.
Figure. The ε-insensitive loss (zero penalty inside ±ε) compared with the square loss; horizontal axis: value off target, vertical axis: penalty.
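A small sketch (my own values) comparing the ε-insensitive loss with the square loss for a few residuals.

```python
# Minimal sketch: |r|_eps = max(0, |r| - eps) is zero inside the eps-tube.
import numpy as np

def eps_insensitive(residual, eps=0.5):
    return np.maximum(0.0, np.abs(residual) - eps)

r = np.array([-1.2, -0.3, 0.0, 0.4, 2.0])
print(eps_insensitive(r))      # [0.7 0.  0.  0.  1.5]
print(r ** 2)                  # the square loss penalizes every nonzero residual
```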
44
The linear regression problem
    y = f(x) = \langle w \cdot x \rangle + b
    \left| y_i - (\langle w \cdot x_i \rangle + b) \right|_\varepsilon =
        \begin{cases} 0 & \text{if } |y_i - (\langle w \cdot x_i \rangle + b)| \le \varepsilon \\ |y_i - (\langle w \cdot x_i \rangle + b)| - \varepsilon & \text{otherwise} \end{cases}
      = \max\!\left( 0, \; |y_i - (\langle w \cdot x_i \rangle + b)| - \varepsilon \right)
45
Primal problem in ε-SVR
Given a data set {x_1, x_2, ..., x_ℓ} with target values {y_1, y_2, ..., y_ℓ},
the ε-SVR was formulated as the following (primal)
convex optimization problem:
    \text{minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*)
    \text{subject to } y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i
    \qquad\qquad\quad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*
    \qquad\qquad\quad \xi_i, \xi_i^* \ge 0
The constant C > 0 determines the trade-off between the
flatness of f and the amount up to which deviations larger
than ε are tolerated.
46
Lagrangian
Construct the Lagrange function from the objective
function and the corresponding constraints:
    L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) - \sum_{i=1}^{\ell} (\eta_i \xi_i + \eta_i^* \xi_i^*)
        - \sum_{i=1}^{\ell} \alpha_i (\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b)
        - \sum_{i=1}^{\ell} \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b)
where the Lagrange multipliers satisfy the positivity constraints
    \alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0
47
Karush-Kuhn-Tucker Conditions
It follows from the saddle point condition that the partial
derivatives of L with respect to the primal variables (w, b, ξ_i, ξ_i*)
have to vanish for optimality:
    \partial_b L = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) = 0
    \partial_w L = w - \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) x_i = 0 \quad \text{(parametric → nonparametric)}
    \partial_{\xi_i} L = C - \alpha_i - \eta_i = 0
    \partial_{\xi_i^*} L = C - \alpha_i^* - \eta_i^* = 0
The second equation indicates that w can be written as a
linear combination of the training patterns x_i.
48
Dual problem
Substituting the equations above into the Lagrangian yields
the following dual problem:
    \text{maximize } -\frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*)
    \text{subject to } \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0 \;\text{ and }\; \alpha_i, \alpha_i^* \in [0, C]
The function f can be written in a non-parametric form by
substituting w = Σ_{i=1}^ℓ (α_i − α_i*) x_i into f(x) = ⟨w·x⟩ + b:
    f(x) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b
49
KKT complementarity conditions
At the optimal solution the following Karush-Kuhn-Tucker
complementarity conditions must be fulfilled:
    \alpha_i (\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b) = 0
    \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b) = 0
    \eta_i \xi_i = (C - \alpha_i)\, \xi_i = 0
    \eta_i^* \xi_i^* = (C - \alpha_i^*)\, \xi_i^* = 0
Obviously, for 0 < α_i < C we have ξ_i = 0, and similarly
for 0 < α_i* < C we have ξ_i* = 0.
50
Unbounded support vectors
Hence, for 0 < α_i, α_i* < C, it follows that
    \langle w, x_i \rangle + b - y_i + \varepsilon = 0
    -\langle w, x_i \rangle - b + y_i + \varepsilon = 0
Thus, the data points fulfilling y − f(x) = ε have dual
variables 0 < α_i < C, and the ones satisfying y − f(x) = −ε
have dual variables 0 < α_i* < C. These data points are called
the unbounded support vectors.
51
Computing bias term b
The unbounded support vectors allow computing the value of
the bias term b as given below:
    b = y_i - \langle w, x_i \rangle - \varepsilon, \quad \text{for } 0 < \alpha_i < C
    b = y_i - \langle w, x_i \rangle + \varepsilon, \quad \text{for } 0 < \alpha_i^* < C
The calculation of the bias term b is numerically very
sensitive, and it is better to compute b by
averaging over all the unbounded support vector data
points.
52
Bounded support vectors
The bounded support vectors are outside the ε-tube
53
Sparsity of SV expansion
For all data points inside the ε-tube, i.e.
    \left| y_i - (\langle w \cdot x_i \rangle + b) \right| < \varepsilon,
the KKT complementarity conditions
    \alpha_i (\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b) = 0
    \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b) = 0
imply that α_i and α_i* have to be zero. Therefore, we have a sparse
expansion of w in terms of the x_i:
    w = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, x_i = \sum_{i \in SV} (\alpha_i - \alpha_i^*)\, x_i
54
SV Nonlinear Regression
Key idea: map data xi to a higher dimensional space by
using a nonlinear transformation, and perform linear
regression in feature (embedded) space
Input space: the space where the points x_i are located
Feature space: the space of the φ(x_i) after the transformation
Computation in the feature space can be costly because it
is high dimensional, and the feature space is typically
infinite-dimensional!
This 'curse of dimensionality' can be surmounted by
resorting to the kernel trick, which is one of the most
appealing characteristics of SVM.
55
Kernel trick
Recall the SVM optimization problem
    \text{maximize } -\frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*)
    \text{subject to } \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0 \;\text{ and }\; \alpha_i, \alpha_i^* \in [0, C]
The data points only appear as inner products.
As long as we can calculate the inner product in the feature
space, we do not need to know the nonlinear mapping
explicitly.
Define the kernel function K by
    K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
56
Kernel: Bridge from linear to nonlinear
Change all inner products to kernel functions
For training,
Original (linear):
    \text{maximize } -\frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \sum_{i=1}^{\ell} (\varepsilon - y_i)\,\alpha_i - \sum_{i=1}^{\ell} (\varepsilon + y_i)\,\alpha_i^*
    \text{subject to } \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0 \;\text{ and }\; \alpha_i, \alpha_i^* \in [0, C]
With kernel function:
    \text{maximize } -\frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) - \sum_{i=1}^{\ell} (\varepsilon - y_i)\,\alpha_i - \sum_{i=1}^{\ell} (\varepsilon + y_i)\,\alpha_i^*
    \text{subject to } \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0 \;\text{ and }\; \alpha_i, \alpha_i^* \in [0, C]
57
Kernel expansion representation
For testing,
Original (linear):
    w = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, x_i
    f(x) = \langle w, x \rangle + b = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b
With kernel function:
    w = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, \phi(x_i)
    f(x) = \langle w, \phi(x) \rangle + b = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) K(x_i, x) + b
58
Compared to SV classification
The size of the SVR problem, with respect to the size of an
SV classifier design task, is doubled: there are now 2ℓ
unknown dual variables (ℓ α_i's and ℓ α_i*'s) for a
support vector regression.
Besides the penalty parameter C and the shape parameters
of the kernel function (such as the variance of a Gaussian
kernel or the order of a polynomial), the insensitivity zone ε also
needs to be set beforehand when constructing SV machines for
regression.
59
Influence of the insensitivity zone ε
60
Linear programming SV regression
In an attempt to improve computational efficiency and
model sparsity, the linear programming SV regression was
formulated as
    \text{minimize } \frac{1}{2}\|\alpha\|_1 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*)
    \text{subject to } y_i - \sum_{j=1}^{\ell} \alpha_j\, k(x_j, x_i) \le \varepsilon + \xi_i
    \qquad\qquad\quad \sum_{j=1}^{\ell} \alpha_j\, k(x_j, x_i) - y_i \le \varepsilon + \xi_i^*
    \qquad\qquad\quad \xi_i, \xi_i^* \ge 0
where
    \alpha = [\alpha_1 \; \alpha_2 \; \cdots \; \alpha_\ell]^T .
61
Linear programming SV regression
The optimization problem can be converted into a linear
programming problem as follows:
    \text{minimize } c^T \begin{pmatrix} \alpha^+ \\ \alpha^- \\ \xi \end{pmatrix}
    \text{subject to } \begin{pmatrix} K & -K & -I \\ -K & K & -I \end{pmatrix} \begin{pmatrix} \alpha^+ \\ \alpha^- \\ \xi \end{pmatrix} \le \begin{pmatrix} \varepsilon \mathbf{1} + y \\ \varepsilon \mathbf{1} - y \end{pmatrix}, \qquad \alpha^+, \alpha^-, \xi \ge 0
where
    c = (1, 1, \dots, 1, \; 1, 1, \dots, 1, \; 2C, 2C, \dots, 2C)^T, \qquad \alpha_i = \alpha_i^+ - \alpha_i^-,
    K_{ij} = k(x_i, x_j), \qquad \alpha^+ = (\alpha_1^+, \alpha_2^+, \dots, \alpha_\ell^+)^T, \qquad \alpha^- = (\alpha_1^-, \alpha_2^-, \dots, \alpha_\ell^-)^T, \qquad \xi = (\xi_1, \xi_2, \dots, \xi_\ell)^T
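A hedged sketch of assembling and solving this LP with scipy.optimize.linprog (the RBF kernel, toy data, and parameter values are my own choices; like the formulation on this slide, there is no separate bias term).

```python
# Minimal sketch: LP support vector regression, decision variables (alpha+, alpha-, xi).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-3, 3, size=40))
y = np.sin(x) + rng.normal(scale=0.1, size=40)
l, C, eps, gamma = len(x), 10.0, 0.1, 0.5

K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)       # K_ij = k(x_i, x_j)
I = np.eye(l)

c = np.concatenate([np.ones(l), np.ones(l), 2 * C * np.ones(l)])
A_ub = np.block([[ K, -K, -I],
                 [-K,  K, -I]])
b_ub = np.concatenate([eps + y, eps - y])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))  # alpha+, alpha-, xi >= 0
alpha = res.x[:l] - res.x[l:2 * l]                        # alpha_i = alpha_i+ - alpha_i-
print("nonzero coefficients:", int(np.sum(np.abs(alpha) > 1e-6)), "of", l)
```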
62
MATLAB Demo for SV regression
MATLAB support vector machines toolbox developed by
Steve R. Gunn at the University of Southampton, UK.
The software can be downloaded from
http://www.isis.ecs.soton.ac.uk/resources/svminfo/
63
Questions?
64