A tutorial about SVM
Omer Boehm
[email protected]
IBM Haifa Labs
Neurocomputation Seminar
© 2011 IBM Corporation
Outline
• Introduction
• Classification
• Perceptron
• SVM for linearly separable data
• SVM for almost linearly separable data
• SVM for non-linearly separable data
Introduction
• Machine learning is a branch of artificial intelligence: a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
• An important task of machine learning is classification.
• Classification is also referred to as pattern recognition.
Example
Objects (input data) → Learning Machine → Classes (approve / deny)

Name      Income    Debt     Married   Age
Shelley    60,000    1,000   No         30
Elad      200,000        0   Yes        80
Dan        20,000        0   No         25
Alona     100,000   10,000   Yes        40
Types of learning problems
• Supervised learning (n classes, n > 1)
  • Classification
  • Regression
• Unsupervised learning (0 classes)
  • Clustering (building equivalence classes)
  • Density estimation
Supervised learning
• Regression
  • Learn a continuous function from input samples
  • Example: stock prediction
    • Input – a future date
    • Output – the stock price
    • Training – information on the stock price over the last period
• Classification
  • Learn a separation function from discrete inputs to classes
  • Example: Optical Character Recognition (OCR)
    • Input – images of digits
    • Output – labels 0–9
    • Training – labeled images of digits
• In fact, both are approximation problems
Regression
Classification
Density estimation
What makes learning difficult
Given the following examples
How should we draw the line?
What makes learning difficult
Which one is most appropriate?
What makes learning difficult
The hidden test points
What is Learning (mathematically)?
• We would like to ensure that a small change in an input point relative to a training point will not result in a jump to a different classification:
  x_i → y_i
  x_i + Δx → y_i
• Such an approximation is called a stable approximation
• As a rule of thumb, small derivatives ensure a stable approximation
Stable vs. Unstable approximation
• Lagrange approximation (unstable)
  Given n points (x_i, y_i), we find the unique polynomial f(x) = L_{n−1}(x) that passes through the given points
• Spline approximation (stable)
  Given n points (x_i, y_i), we find a piecewise approximation f(x) by third-degree polynomials, such that they pass through the given points, have common tangents at the division points, and in addition:
  ∫_{x_1}^{x_n} |f''(x)|² dx → min
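To make the stability contrast concrete, here is a small sketch (not from the original slides) that compares global polynomial interpolation with a cubic spline on the same points using SciPy; the sample data and the size of the perturbation are arbitrary choices for illustration:

```python
import numpy as np
from scipy.interpolate import lagrange, CubicSpline

# A handful of sample points (arbitrary, for illustration only).
x = np.linspace(0.0, 10.0, 11)
y = np.sin(x)

# Lagrange interpolation: one global degree-10 polynomial through all points.
poly = lagrange(x, y)

# Cubic spline: piecewise third-degree polynomials with matching tangents
# at the division points.
spline = CubicSpline(x, y)

# Perturb a single data point slightly and rebuild both approximations.
y_perturbed = y.copy()
y_perturbed[5] += 0.01
poly_p = lagrange(x, y_perturbed)
spline_p = CubicSpline(x, y_perturbed)

# Compare how far the two approximations move on a dense grid.
grid = np.linspace(0.0, 10.0, 500)
print("max |delta Lagrange| :", np.max(np.abs(poly_p(grid) - poly(grid))))
print("max |delta spline|   :", np.max(np.abs(spline_p(grid) - spline(grid))))
```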
What would be the best choice?
• The “simplest” solution
• A solution where the distance from each example is as small as possible and where the derivative is as small as possible
Vector Geometry
Just in case ….
Dot product
• The dot product of two vectors a = [a_1, a_2, a_3, ..., a_n] and b = [b_1, b_2, b_3, ..., b_n] is defined as:
  a · b = Σ_{i=1}^n a_i b_i = a_1 b_1 + a_2 b_2 + a_3 b_3 + ... + a_n b_n
• An example:
  [1, 3, 5] · [4, −2, −1] = (1)(4) + (3)(−2) + (5)(−1) = −3
Dot product
• a · a = ‖a‖², where ‖a‖ denotes the length (magnitude) of a
• a · b = ‖a‖ ‖b‖ cos θ
  If a is perpendicular to b, then a · b = 0 (cos 90° = 0)
• Unit vector: â = a / ‖a‖
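As a quick sanity check of these identities (my own illustration, not part of the original deck), NumPy reproduces the worked example and the length/angle relations:

```python
import numpy as np

a = np.array([1.0, 3.0, 5.0])
b = np.array([4.0, -2.0, -1.0])

print(np.dot(a, b))                          # -3.0, matches the worked example
print(np.dot(a, a), np.linalg.norm(a) ** 2)  # a.a equals ||a||^2

# Angle between a and b from a.b = ||a|| ||b|| cos(theta)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))

# Unit vector in the direction of a
a_hat = a / np.linalg.norm(a)
print(np.linalg.norm(a_hat))                 # 1.0
```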
Plane/Hyperplane
• A hyperplane can be defined by:
  • Three points
  • Two vectors
  • A normal vector and a point
Plane/Hyperplane
• Let n be a vector perpendicular (normal) to the hyperplane H
• Let p_1 be the position vector of some known point P_1 in the plane. A point P with position vector p is in the plane iff the vector drawn from P_1 to P is perpendicular to n
• Two vectors are perpendicular iff their dot product is zero
• The hyperplane H can therefore be expressed as
  n · (p − p_1) = 0
  n · p − n · p_1 = 0    // substitute w = n, x = p, b = −n · p_1
  ⇒ w · x + b = 0
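A small sketch of this formulation in code (my own illustration of the equations above): given a normal w, an offset b and a point x, the sign of w · x + b tells which side of the hyperplane the point lies on, and dividing by ‖w‖ gives the distance. The example hyperplane is an arbitrary choice.

```python
import numpy as np

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Example hyperplane in 2D: x + y - 1 = 0  (w = [1, 1], b = -1)
w = np.array([1.0, 1.0])
b = -1.0
for x in [np.array([2.0, 2.0]), np.array([0.0, 0.0]), np.array([0.5, 0.5])]:
    d = signed_distance(w, b, x)
    print(x, "side:", np.sign(d), "distance:", abs(d))
```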
Classification
Solving approximation problems
• First we define the family of approximating functions F
• Next we define the cost function C(f). This function tells how well f ∈ F performs the required approximation
• Having done this, the approximation/classification consists of solving the minimization problem min_{f ∈ F} C(f)
• A first necessary condition (after Fermat) is ∂C/∂f = 0
• As we know, it is always possible to apply Newton-Raphson and obtain a sequence of approximations
Classification
• A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories.
• X is the input space; x ∈ X is a data point from the input space.
• A typical input space is high-dimensional, for example
  x = (x_1, x_2, ..., x_d) ∈ R^d, d > 1
  x is also called a feature vector.
• Ω is a finite set of categories to which the input data points belong: Ω = {1, 2, ..., C}.
• ω_i ∈ Ω are called labels.
Classification
• Y is a finite set of decisions – the output set of the classifier.
• The classifier is a function f : X → Y:
  x ∈ X → f (classify) → y = f(x) ∈ Y
The Perceptron
Perceptron - Frank Rosenblatt (1957)
• Linear separation of the input space:
  f(x) = w · x + b
  h(x) = sign(f(x))
Perceptron algorithm
• Start:
  The weight vector w_0 is generated randomly; set t = 0
• Test:
  A vector x ∈ P ∪ N is selected randomly
  if x ∈ P and w_t · x > 0, go to Test
  if x ∈ P and w_t · x ≤ 0, go to Add
  if x ∈ N and w_t · x < 0, go to Test
  if x ∈ N and w_t · x ≥ 0, go to Subtract
• Add:
  w_{t+1} = w_t + x,  t = t + 1,  go to Test
• Subtract:
  w_{t+1} = w_t − x,  t = t + 1,  go to Test
Perceptron algorithm
Shorter version
Update rule for iteration k+1 (one iteration per data point):
  if y_i (w_k · x_i) ≤ 0, then
    w_{k+1} = w_k + y_i x_i
    k = k + 1
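The update rule above translates almost line for line into code. The following is a minimal NumPy sketch of the perceptron (my own illustration; the toy data, the bias handled via an appended constant feature, and the epoch cap are choices not specified in the slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron. X: (n, d) data, y: labels in {-1, +1}.
    A constant 1 is appended to each sample so the last weight acts as the bias b."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi              # w_{k+1} = w_k + y_i x_i
                errors += 1
        if errors == 0:                   # converged: all points classified correctly
            break
    return w[:-1], w[-1]                  # weights and bias

# Tiny linearly separable example
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print("w =", w, "b =", b)
print("predictions:", np.sign(X @ w + b))
```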
Perceptron – visualization (intuition)
Perceptron - analysis
• The solution is a linear combination of training points:
  w = Σ_i α_i y_i x_i,   α_i ≥ 0, ∀i
• Only informative points are used (mistake driven)
• The coefficient of a point reflects its 'difficulty'
• The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR)
Support Vector Machines
Advantages of SVM (Vladimir Vapnik, 1979, 1998)
• Exhibits good generalization
• Can implement confidence measures, etc.
• The hypothesis has an explicit dependence on the data (via the support vectors)
• Learning involves optimization of a convex function (no false minima, unlike neural networks)
• Few parameters are required for tuning the learning machine (unlike neural networks, where the architecture and various parameters must be found)
Advantages of SVM
• From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error.
• These generalization bounds have two important features:
Advantages of SVM
• The upper bound on the generalization error does not depend on the dimensionality of the space.
• The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data points of each class.
Basic scenario - Separable data set
Basic scenario – define margin
• In an arbitrary-dimensional space, a separating hyperplane can be written as:
  w · x + b = 0
  where w is the normal vector.
• The decision function would be:
  D(x) = sign(w · x + b)
• Note that the argument of D(x) is invariant under a rescaling of the form
  w → λw,  b → λb
• Implicitly, the scale can be fixed by defining
  H1: w · x + b = +1
  H2: w · x + b = −1
  as the canonical hyperplanes (the data points lying on them are the support vectors)
• The task is to select w and b so that the training data can be described as:
  x_i · w + b ≥ +1  for  y_i = +1
  x_i · w + b ≤ −1  for  y_i = −1
• These can be combined into:
  y_i (x_i · w + b) − 1 ≥ 0,  ∀i
• The margin will be given by the projection of the vector (x_1 − x_2) onto the unit normal vector to the hyperplane, i.e. w / ‖w‖
• So the (Euclidean) distance can be written as
  d = |(x_1 − x_2) · w| / ‖w‖
• Note that x_1 lies on H1, i.e. w · x + b = +1, so
  w · x_1 + b = +1
• Similarly, x_2 lies on H2:
  w · x_2 + b = −1
• Subtracting the two results in
  w · (x_1 − x_2) = 2
• The margin can therefore be written as
  m = |(x_1 − x_2) · w| / ‖w‖ = 2 / ‖w‖
• Maximizing the margin is equivalent to minimizing
  J(w) = ½ (w · w)
  subject to the constraints:
  y_i (x_i · w + b) − 1 ≥ 0,  ∀i
• J(w) is a quadratic function, thus there is a single global minimum
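For a small data set this quadratic program can be handed to an off-the-shelf convex solver almost verbatim. The sketch below uses CVXPY (my choice of library, not something the slides prescribe) on arbitrary toy data:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2) w.w  subject to  y_i (x_i.w + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w.value))
```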
Lagrange multipliers
• Problem definition:
  Maximize f(x, y) subject to g(x, y) = c
• A new variable λ, called a 'Lagrange multiplier', is used to define
  Λ(x, y, λ) = f(x, y) + λ (g(x, y) − c)
Lagrange multipliers - example
• Maximize f(x, y) = x + y subject to x² + y² = 1
• Formally, set Λ(x, y, λ) = x + y + λ(x² + y² − 1)
• Set the derivatives to 0:
  ∂Λ/∂x = 1 + 2λx = 0
  ∂Λ/∂y = 1 + 2λy = 0
  ∂Λ/∂λ = x² + y² − 1 = 0
• Combining the first two yields x = y
• Substituting into the last gives x = y = ±√2/2
• Evaluating the objective function f at these points yields
  f(√2/2, √2/2) = √2
  f(−√2/2, −√2/2) = −√2
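The same stationary points can be recovered mechanically with a computer algebra system; this is only a check of the worked example above, with SymPy as my choice of tool:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y
g = x**2 + y**2 - 1
L = f + lam * g

# Solve dL/dx = dL/dy = dL/dlam = 0
solutions = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)],
                     [x, y, lam], dict=True)
for s in solutions:
    print(s, " f =", f.subs(s))
```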
• Primal problem:
  minimize J(w) = ½ (w · w)
  s.t. y_i (x_i · w + b) − 1 ≥ 0, ∀i
• Introduce Lagrange multipliers α_i ≥ 0, ∀i, associated with the constraints
• The solution to the primal problem is equivalent to determining the saddle point of the function:
  L_p(w, b, α) = ½ (w · w) − Σ_{i=1}^n α_i (y_i (x_i · w + b) − 1)
• At the saddle point, L_p has a minimum with respect to w and b, requiring
  ∂L_p/∂w = w − Σ_i α_i y_i x_i = 0  ⇒  w = Σ_i α_i y_i x_i
  ∂L_p/∂b = Σ_i α_i y_i = 0
Primal-Dual
L_p = ½ (w · w) − Σ_{i=1}^n α_i y_i (x_i · w + b) + Σ_{i=1}^n α_i

• Primal:
  Minimize L_p with respect to w, b, subject to α_i ≥ 0, ∀i
  Substitute w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0
• Dual:
  Maximize L_d = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)
  with respect to α, subject to α_i ≥ 0, ∀i, and Σ_i α_i y_i = 0
Solving QP using dual problem
• Maximize
  L_d = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)
  constrained to α_i ≥ 0, ∀i, and Σ_i α_i y_i = 0
• We have n new variables α = {α_1, α_2, ..., α_n}, one for each data point
• This is a convex quadratic optimization problem, and we run a QP solver to get α and w (w = Σ_i α_i y_i x_i)
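In practice the dual QP is rarely coded by hand; a library such as scikit-learn solves it internally and exposes the multipliers, the support vectors, w and b. A minimal sketch (the toy data and the large C value, used here to approximate the hard-margin separable case, are my own choices):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1, 1, -1, -1])

# A large C approximates the hard-margin (separable) formulation
clf = SVC(kernel='linear', C=1e6).fit(X, y)

print("support vector indices:", clf.support_)
print("alpha_i * y_i          :", clf.dual_coef_)       # signed multipliers
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # w = sum_i alpha_i y_i x_i
```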
• b can be determined from the optimal α and the Karush-Kuhn-Tucker (KKT) conditions:
  α_i [y_i (w · x_i + b) − 1] = 0,  ∀i
• α_i > 0 (data points residing on the canonical hyperplanes, i.e. the support vectors) implies
  y_i (w · x_i + b) = 1  ⇒  w · x_i + b = y_i
  ⇒  b = y_i − w · x_i
  or, averaging over the N_s support vectors:
  b = (1/N_s) Σ_i (y_i − w · x_i)
• For every data point i, one of the following must hold:
  • α_i = 0, or
  • α_i > 0 and y_i (w · x_i + b) − 1 = 0
• Many α_i = 0  ⇒  w = Σ_i α_i y_i x_i is a sparse solution
• Data points with α_i > 0 are the support vectors
• The optimal hyperplane is completely defined by the support vectors
SVM - The classification
• Given a new data point z, find its label y:
  D(z) = sign( Σ_{i=1}^m α_i y_i (x_i · z) + b )
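The sum in D(z) can be evaluated directly from the multipliers and support vectors returned by the solver. The sketch below (toy data and scikit-learn are my own choices) checks the hand-computed score against the library's decision_function:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.0], [-1.5, -0.5]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, y)

z = np.array([0.5, 0.3])

# D(z) = sign( sum_i alpha_i y_i (x_i . z) + b ), summing over support vectors only
alpha_y = clf.dual_coef_[0]                 # alpha_i * y_i for each support vector
sv = clf.support_vectors_
score = np.sum(alpha_y * (sv @ z)) + clf.intercept_[0]

print("manual D(z):", np.sign(score))
print("sklearn    :", clf.predict([z])[0], clf.decision_function([z])[0])
```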
Extended scenario - Non-Separable data set
• The data is most likely not separable (inconsistencies, outliers, noise), but a linear classifier may still be appropriate
• SVM can be applied in this non-separable case, as long as the data is almost linearly separable
SVM with slacks
• Use non-negative slack variables ξ_1, ξ_2, ..., ξ_n, one per data point
• Change the constraints from y_i (x_i · w + b) ≥ 1, ∀i
  to y_i (x_i · w + b) ≥ 1 − ξ_i, ∀i
• ξ_i is a measure of deviation from the ideal position for sample i:
  ξ_i > 1: the sample is on the wrong side of the separating hyperplane
  0 < ξ_i ≤ 1: the sample is on the correct side, but inside the margin
SVM with slacks
• We would like to minimize
  J(w, ξ_1, ξ_2, ..., ξ_n) = ½ (w · w) + C Σ_i ξ_i
  constrained to y_i (x_i · w + b) ≥ 1 − ξ_i, ∀i and ξ_i ≥ 0
• The parameter C is a regularization term, which provides a way to control over-fitting:
  • if C is small, we allow many samples to be not in the ideal position
  • if C is large, we want very few samples to be not in the ideal position
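A quick way to see the role of C is to fit the same soft-margin SVM with a small and a large C and compare the number of support vectors and the training accuracy. The data below, which includes one deliberate outlier, and the two C values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two noisy clusters plus one outlier placed on the wrong side
X = np.vstack([rng.randn(20, 2) + [2, 2],
               rng.randn(20, 2) + [-2, -2],
               [[1.8, 1.8]]])
y = np.array([1] * 20 + [-1] * 20 + [-1])   # the last point is the outlier

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:6}: {len(clf.support_)} support vectors, "
          f"training accuracy = {clf.score(X, y):.2f}")
```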
SVM with slacks - Dual formulation
• Maximize
  L_d(α) = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)
• Constrained to
  0 ≤ α_i ≤ C, ∀i  and  Σ_{i=1}^n α_i y_i = 0
SVM - non linear mapping
• Cover's theorem:
  "pattern-classification problem cast in a high dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space"
• One-dimensional space, not linearly separable
• Lift to a two-dimensional space with φ(x) = (x, x²)
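A tiny numerical illustration of this lifting (my own example, not from the slides): labels on a line that no single threshold can separate become linearly separable in the plane after the map φ(x) = (x, x²).

```python
import numpy as np

# 1D points: the negative class sits between the two positive groups,
# so no single threshold on x separates them.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])

# Lift to 2D with phi(x) = (x, x^2)
phi = np.column_stack([x, x ** 2])

# In the lifted space the horizontal line x2 = 2.5 separates the classes:
w, b = np.array([0.0, 1.0]), -2.5
print(np.sign(phi @ w + b))   # matches y
print(y)
```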
SVM - non linear mapping
• Solve a non-linear classification problem with a linear classifier:
  • Project the data x to a high dimension using the function φ(x)
  • Find a linear discriminant function f(x) for the transformed data
  • The final non-linear discriminant function is g(x) = w^t φ(x) + w_0
• In 2D, the discriminant function is linear
• In 1D, the discriminant function is NOT linear
SVM - non linear mapping
• Any linear classifier can be used after lifting the data into a higher-dimensional space. However, we will have to deal with the "curse of dimensionality":
  • poor generalization to test data
  • computationally expensive
• SVM handles the "curse of dimensionality" problem:
  • enforcing the largest margin permits good generalization
    (it can be shown that generalization in SVM is a function of the margin, independent of the dimensionality)
  • computation in the higher-dimensional case is performed only implicitly, through the use of kernel functions
Non linear SVM - kernels
• Recall: maximize
  L_d = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i · x_j)
  classification: D(z) = sign( Σ_{i=1}^m α_i y_i (x_i · z) + b )
• The data points x_i appear only in dot products
• If x_i is mapped to a high-dimensional space using φ(x), the high-dimensional product φ(x_i)^t φ(x_j) is needed:
  maximize L_d = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j φ(x_i)^t φ(x_j)
• The dimensionality of the space F is not necessarily important. We may not even know the map φ
Kernel
• A kernel is a function that returns the value of the dot product between the images of its two arguments:
  k(x, y) = φ(x) · φ(y)
• Given a function K, it is possible to verify that it is a kernel.
• Now we only need to compute k(x, y) instead of φ(x) · φ(y):
  maximize L_d = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j k(x_i, x_j)
• The "kernel trick": we do not need to perform operations in the high-dimensional space explicitly
Kernel Matrix
• The central structure in kernel machines
• Contains all necessary information for the learning algorithm
• Fuses information about the data AND the kernel
• Many interesting properties
Mercer’s Theorem
• The kernel matrix is symmetric positive definite
• Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space
• Every (semi) positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write
  k(x, y) = φ(x) · φ(y)
• Positive definite:
  ∫∫ k(x, y) f(x) f(y) dx dy ≥ 0,  ∀f ∈ L₂
Examples of kernels
• Some common choices (both satisfying Mercer's condition):
  • Polynomial kernel:
    k(x_i, x_j) = (x_i · x_j + 1)^p
  • Gaussian radial basis function (RBF):
    k(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )
Polynomial Kernel - example
(x · z)² = (x_1 z_1 + x_2 z_2)²
         = x_1² z_1² + x_2² z_2² + 2 x_1 z_1 x_2 z_2
         = (x_1², x_2², √2 x_1 x_2) · (z_1², z_2², √2 z_1 z_2)
         = φ(x) · φ(z)
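This identity is easy to check numerically. The sketch below (an illustration of the algebra above, with arbitrary example vectors) compares the kernel value (x · z)² with the explicit dot product in the lifted space:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel (no constant term)."""
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = np.dot(x, z) ** 2          # kernel trick: work in the original 2D space
k_mapped = np.dot(phi(x), phi(z))     # explicit computation in the 3D lifted space

print(k_direct, k_mapped)             # both equal (x.z)^2 = 1.0
```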
Applying - non linear SVM
• Start with data {x_1, x_2, ..., x_n}, which lives in a feature space of dimension d
• Choose a kernel k(x_i, x_j) corresponding to some function φ(x), which takes a data point x_i to a higher-dimensional space
• Find the largest-margin linear discriminant function in the higher-dimensional space by using a quadratic programming package to solve
  maximize L_d(α) = Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j k(x_i, x_j)
  constrained to 0 ≤ α_i ≤ C, ∀i  and  Σ_{i=1}^n α_i y_i = 0
Applying - non linear SVM
• Weight vector w in the high-dimensional space:
  w = Σ_i α_i y_i φ(x_i)
• Linear discriminant function of largest margin in the high-dimensional space:
  g(φ(x)) = w^t φ(x) = ( Σ_{x_i ∈ SV} α_i y_i φ(x_i) )^t φ(x)
• Non-linear discriminant function in the original space:
  g(x) = w^t φ(x) = Σ_{x_i ∈ SV} α_i y_i φ^t(x_i) φ(x) = Σ_{x_i ∈ SV} α_i y_i k(x_i, x)
SVM summary
• Advantages:
  • Based on a solid theoretical foundation
  • Excellent generalization properties
  • The objective function has no local minima
  • Can be used to find non-linear discriminant functions
  • The complexity of the classifier is characterized by the number of support vectors rather than by the dimensionality of the transformed space
• Disadvantages:
  • It is not clear how to select a kernel function in a principled manner
  • Tends to be slower than other methods (in the non-linear case)