
CSE 4705
Artificial Intelligence
Jinbo Bi
Department of Computer Science &
Engineering
http://www.engr.uconn.edu/~jinbo
1
Introduction and History

SVM is a supervised learning technique applicable to both classification and regression.

A classifier derived from statistical
learning theory by Vapnik in 1995.

SVM is a method that generates input-output mapping
functions from a set of labeled training data.
x → f(x, α)

where α denotes a set of adjustable parameters.
Its appeal lies in the simple geometrical interpretation of the margin, the uniqueness of the solution, the statistical robustness of the loss function, the modularity of the kernel function, and the control of overfitting through the choice of a single regularization parameter.
2
Advantage of SVM

Another name for regularization is capacity control.
SVM controls the capacity by optimizing the classification
margin.
– What is the margin?
– How can we optimize it?
– Linearly separable case
– Linearly inseparable case (using hinge loss)
– Primal-dual optimization

The other key feature of SVMs is the use of kernels.
– What are kernels? (May be omitted in this class)
3
Support Vector Machine

Find a linear hyperplane (decision boundary) that
will separate the data
4
Support Vector Machine

One possible solution: hyperplane B1
5
Support Vector Machine

Another possible solution: hyperplane B2
6
Support Vector Machine

Other possible solutions
7
Support Vector Machine
Which one is better, B1 or B2?
 How do you define "better", or the optimal boundary?
8
Support Vector Machine Definitions
Basic concepts

Examples closest to the hyperplane are support vectors.

Distance from a support vector to the separator is  r = y (wᵀx + b) / ‖w‖
Margin of the separator is the width of separation
between support vectors of classes.
9
Support Vector Machine
To find the optimal solution, find the hyperplane that maximizes the margin.
 So B1 is better than B2.
10
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. The model is immune to removal of any non-support-vector data points.
4. There is statistical learning theory (using the VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very well.
11
Linearly separable case

Decision boundary B1:  w · x + b = 0
Margin boundaries:  w · x + b = +1  and  w · x + b = −1

f(x) = +1  if  w · x + b ≥ 1
f(x) = −1  if  w · x + b ≤ −1

Margin = 2 / ‖w‖
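To make the margin formula concrete, here is a minimal sketch (an assumed toy example, not from the slides) that fits a linear SVM on separable data and reads off the geometric margin 2 / ‖w‖:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (assumed for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],         # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)              # width between w.x + b = +1 and w.x + b = -1
print(margin)
print(clf.support_vectors_)                   # the support vectors lie on the margin boundaries
```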
12
Linearly non-separable case (soft margin)

Noisy data

Slack variables ξi measure the error for each point

f(xi) = +1  if  w · xi + b ≥ 1 − ξi
f(xi) = −1  if  w · xi + b ≤ −1 + ξi
13
Nonlinear Support Vector Machines

What if the problem is not linearly separable?
14
Higher Dimensions

Mapping the data into a higher-dimensional space (the kernel trick), where it becomes linearly separable, lets us then use a linear SVM, which is easier to solve.
[Figure: 1-D data at −1, 0, +1 mapped to the 2-D points (0,0), (1,0), (0,1), where the two classes become linearly separable.]
15
Extension to Non-linear Decision Boundary
 Possible
problems of the transformation
– High computational overhead, and it is hard to get a good estimate.
 SVM solves these two issues simultaneously
– Kernel trick for efficient computation
– Minimizing ||w||² can lead to a "good" classifier

Non-linearly separable case:  Φ: x → φ(x)
16
Extension to Non-linear Decision Boundary
 Then
we can solve it easily.
Non-linearly separable case
17
Non-linear, non-separable case

What if decision boundary is not linear?

The concept of a kernel mapping function is very powerful. It allows SVM models to perform separation even with very complex boundaries.
18
Linear Support Vector Machine
Find the hyperplane that maximizes the margin.
 So B1 is better than B2.
19
Definition
Define the hyperplane H (decision function) such
that:
d1 = the shortest distance to the closest positive point
 d2 = the shortest distance to the closest negative point
 The margin of a separating hyperplane is d1 + d2.

20
Distance

Distance between Xn and the plane:
– Take any point x on the plane (so wᵀx + b = 0).
– The distance is the length of the projection of (xn − x) onto the unit normal ŵ = w / ‖w‖:
  distance = |ŵᵀ(xn − x)|
           = (1/‖w‖) |wᵀxn − wᵀx|
           = (1/‖w‖) |wᵀxn + b − wᵀx − b|
           = (1/‖w‖) |wᵀxn + b|
           = 1/‖w‖,  if xn is a support vector (which makes |wᵀxn + b| = 1)
21
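A quick numeric check of this derivation (the values below are assumed for illustration, not from the slides): the projection form and the closed form |wᵀxn + b| / ‖w‖ agree.

```python
import numpy as np

w, b = np.array([2.0, 1.0]), -3.0       # hypothetical hyperplane w.x + b = 0
x_n = np.array([4.0, 5.0])              # point whose distance we want

x0 = np.array([1.5, 0.0])               # any point on the plane: w.x0 + b = 0
w_hat = w / np.linalg.norm(w)           # unit normal
dist_projection = abs(w_hat @ (x_n - x0))              # |w_hat . (x_n - x0)|
dist_formula = abs(w @ x_n + b) / np.linalg.norm(w)    # |w.x_n + b| / ||w||
print(dist_projection, dist_formula)    # both print the same distance
```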
SVM boundary and margin

Want: find w and b (offset) such that we

maximize  1 / ‖w‖

subject to the constraint

min_{n=1,2,...,N} |wᵀxn + b| = 1.

Since |wᵀxn + b| = yn(wᵀxn + b) for correctly classified points, the constraint can be written as

yn(wᵀxn + b) ≥ 1  for all n.
22
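With this normalization, the margin-maximization problem is equivalent to the quadratic program that the following slides minimize (a standard restatement, written out here for reference; w, b, xn, yn are as above):

```latex
\max_{w,b}\ \frac{1}{\lVert w\rVert}
\quad\text{s.t.}\quad \min_{n=1,\dots,N}\,\lvert w^{T}x_{n}+b\rvert = 1
\quad\Longleftrightarrow\quad
\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{s.t.}\quad y_{n}\!\left(w^{T}x_{n}+b\right)\ \ge\ 1,\ \ n=1,\dots,N.
```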
The Lagrangian trick
We need to minimize LP with respect to w and b, and maximize it with respect to α ≥ 0.
Reformulate the optimization problem: a "trick" often used in optimization is to form the Lagrangian of the problem.
The constraints will be replaced by constraints on the Lagrange multipliers, and the training data will appear only as dot products.
23

The new formulation depends on α; we need to maximize:
The dual form requires only the dot product of each pair of input vectors xi to be calculated.
Having moved from minimizing LP to maximizing LD, we need to find:
This is a convex quadratic optimization problem; we run a QP solver, which returns α, and from α we can compute w. What remains is to calculate b.
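As a hedged sketch of this step (the deck does not prescribe a solver; cvxpy and the toy data are assumptions, and any QP package would do), the standard dual LD = Σi αi − ½ Σi Σj αi αj yi yj (xi · xj) can be maximized as follows, after which w = Σi αi yi xi:

```python
import numpy as np
import cvxpy as cp

# Toy separable data (assumed for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j (x_i . x_j)
alpha = cp.Variable(N)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q)))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()       # psd_wrap marks Q as PSD (it is a Gram matrix)

a = alpha.value
w = (a * y) @ X                                   # w = sum_i alpha_i y_i x_i
print(np.round(a, 3), w)
```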
24
The Karush-Kuhn-Tucker Conditions

Definition of KKT conditions:
The standard form of an optimization problem is:

  min f(x)
  s.t.  hj(x) = 0,  j = 1, ..., p,
        gk(x) ≤ 0,  k = 1, ..., q,
        x ∈ X ⊆ Rⁿ

The corresponding KKT conditions (x* is the local minimum point) are:

1. Feasibility:  hj(x*) = 0, j = 1, ..., p,  and  gk(x*) ≤ 0, k = 1, ..., q

2. Direction:  ∇f(x*) + Σ_{j=1}^{p} λj ∇hj(x*) + Σ_{k=1}^{q} μk ∇gk(x*) = 0,
   with the λj unrestricted in sign, μk ≥ 0, and μk gk(x*) = 0 (complementary slackness).
25
The Karush-Kuhn-Tucker Conditions

Geometric meaning of KKT conditions
−∇f(x*) = Σ_{j=1}^{p} λj ∇hj(x*) + Σ_{k=1}^{q} μk ∇gk(x*)

(Nonlinear Programming, Dimitri P. Bertsekas)
26
The Karush-Kuhn-Tucker Conditions

Condition for KKT:
The intersection of the set of feasible directions with the set of descent directions coincides with the intersection of the set of feasible directions for the linearized constraints with the set of descent directions.
SVM problems always satisfy this condition.
(Nonlinear Programming, Dimitri P. Bertsekas)
27
The Karush-Kuhn-Tucker Conditions
For the primal problem of linear SVMs,
L_P = ½ ‖w‖² − Σ_{i=1}^{l} αi yi (xi · w + b) + Σ_{i=1}^{l} αi

subject to  yi (xi · w + b) − 1 ≥ 0,  i = 1, ..., l,  and  αi ≥ 0 ∀i.

The KKT conditions are:

∂L_P/∂w_v = w_v − Σ_i αi yi x_iv = 0,  v = 1, ..., d
∂L_P/∂b = −Σ_i αi yi = 0
yi (xi · w + b) − 1 ≥ 0,  i = 1, ..., l
αi ≥ 0 ∀i
αi ( yi (w · xi + b) − 1 ) = 0 ∀i

The SVM problem is convex (a convex objective function and a convex feasible region), so the KKT conditions are necessary and sufficient; the primal problem can therefore be reduced to a KKT problem.
28
Linear Support Vector Machines

There are two cases of linear Support Vector Machines:

1. The separable case.

2. The inseparable case.
29
1. The separable case:
Use it when there is no noise in the training data.

w · x + b = +1
hyperplane:  w · x + b = 0
w · x + b = −1
30
If there is noise in the training data, we need to move to the non-separable case.
31
2. The inseparable case:
 Often, the data will be noisy, which does not allow any hyperplane to correctly classify all observations. Thus, the data are linearly inseparable.
 Idea: relax the constraints using slack variables ξi, one for each sample.
[Figure: separable case vs. non-separable case]
32
Slack variables
ξi is a measure of deviation from the ideal for xi.
- ξi > 1 : x is on the wrong side of the separating hyperplane.
- 0 < ξi < 1 : x is correctly classified, but lies inside the margin.
- ξi = 0 : x is correctly classified, and lies on or outside the margin.
Σi ξi is the total distance by which points fall on the wrong side of their margin.
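The slack of each point can be read off a fitted linear boundary as ξi = max(0, 1 − yi(w · xi + b)); a small sketch with assumed (hypothetical) values:

```python
import numpy as np

# Hypothetical boundary and labeled points (not from the slides)
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[2.0, 2.0],     # far on the correct side          -> xi = 0
              [0.8, 0.5],     # correct side but inside margin   -> 0 < xi < 1
              [1.5, 1.0]])    # wrong side of the hyperplane     -> xi > 1
y = np.array([1, 1, -1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack = hinge loss of each point
print(xi)                                     # -> roughly [0.0, 0.7, 2.5]
```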
33
To deal with the non-separable case, we can rewrite
the problem as:
minimize:  ½ ‖w‖² + C Σi ξi
subject to:  yi (w · xi + b) ≥ 1 − ξi,  ξi ≥ 0  for all i

The parameter C controls the trade-off between maximizing the margin and minimizing the training error.
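A hedged illustration of this trade-off (assumed synthetic data, not from the slides): small C tolerates slack and gives a wide margin, while large C penalizes training errors and narrows it.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6}  margin={margin:.3f}  support vectors={clf.n_support_.sum()}")
```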
34
Use the Lagrangian formulation for the
optimization problem.
 Introduce a positive Lagrangian multiplier for
each inequality constraint.
– The additional multipliers are the Lagrange multipliers introduced to enforce positivity of the slack (error) variables.
– C is the weight coefficient.
35
We get the following Lagrangian:
Reformulating as a Lagrangian, we need to minimize it with respect to w, b, and ξ, and maximize it with respect to the multipliers α.
36
Differentiating with respect to w, b, and ξi, and setting the derivatives to zero, gives conditions (1) and (2):
37

Substituting (1) and (2) into the Lagrangian,
we get a new formulation:
38
Calculate b
We run a QP solver, which returns α, and from α we can compute w. What remains is to calculate b.
 Any data point satisfying (1) that is a support vector xs will have the form:
 Substituting into (1), we get:
 where S denotes the set of indices of the support vectors.
 Finally, we get:

39
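A minimal sketch in code of the recipe on the previous slide (the variable names here are assumptions; α is whatever the QP solver returned): recover w = Σi αi yi xi, then average b = ys − Σ_{m∈S} αm ym (xm · xs) over the support vectors S.

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """Recover w and b from the dual solution alpha (hypothetical helper, not from the deck)."""
    sv = alpha > tol                                  # boolean mask of the support vectors S
    w = (alpha * y) @ X                               # w = sum_i alpha_i y_i x_i
    b = np.mean([y[s] - (alpha[sv] * y[sv]) @ (X[sv] @ X[s])   # y_s - sum_{m in S} alpha_m y_m x_m.x_s
                 for s in np.where(sv)[0]])
    return w, b

# usage: w, b = recover_w_b(alpha, X, y), with alpha returned by the QP solver
```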
Summary of the inseparable case
40
Non-linear SVMs
Cover’s theorem on the separability of patterns

“A complex pattern-classification problem cast in a high-dimensional space non-linearly
is more likely to be linearly separable than in a low-dimensional space”

The power of SVMs resides in the fact that they represent a robust and efficient
implementation of Cover’s theorem

SVMs operate in two stages
 Perform a non-linear mapping of the feature vector x onto a high-dimensional
space that is hidden from the inputs or the outputs
 Construct an optimal separating hyperplane in the high-dim space
41
Non-linear SVMs

Datasets that are linearly separable with some noise work out great.
 But what are we going to do if the dataset is just too hard?
 How about... mapping the data to a higher-dimensional space, e.g. x → (x, x²)?
42
Non-linear SVMs: Feature Space

General idea: the original input space can be mapped to
some higher-dimensional feature space where the training set
is separable:
Φ: x → φ(x)
43
Non-linear SVMs
44
Non-linear SVMs

Data is linearly separable in 3D

This means that the problem can still be solved by a linear classifier
45
Non-linear SVMs
Naïve application of this concept by simply projecting to a high-dimensional non-linear manifold
has two major problems

Statistical: operating in high-dimensional spaces is ill-conditioned due to the "curse of dimensionality" and the subsequent risk of overfitting
 Computational: working in high dimensions requires more computational power, which limits the size of the problems that can be tackled
SVMs bypass these two problems in a robust and efficient manner

First, generalization capabilities in the high-dimensional manifold are ensured by enforcing a largest-margin classifier

Recall that generalization in SVMs is strictly a function of the margin (or the VC dimension),
regardless of the dimensionality of the feature space
Second, projection onto a high-dimensional manifold is only implicit

Recall that the SVM solution depends only on the dot products ⟨xi, xj⟩ between training examples
 Therefore, operations in the high-dimensional space φ(x) do not have to be performed explicitly if we find a function K(xi, xj) such that K(xi, xj) = ⟨φ(xi), φ(xj)⟩
 K(xi, xj) is called a kernel function in SVM terminology
46
Nonlinear SVMs: The Kernel Trick

With this mapping, our discriminant function is now:
g(x) = wᵀ φ(x) + b = Σ_{i∈SV} αi yi φ(xi)ᵀ φ(x) + b

No need to know this mapping explicitly, because we only
use the dot product of feature vectors in both the training
and test.

A kernel function is defined as a function that
corresponds to a dot product of two feature vectors in
some expanded feature space:
K(xi, xj) = φ(xi)ᵀ φ(xj)
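A small sketch (assumed names, and an assumed RBF kernel choice) of evaluating this kernelized discriminant without ever forming φ explicitly:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) -- one possible kernel choice."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def discriminant(x, sv_X, sv_alpha, sv_y, b, kernel=rbf_kernel):
    """g(x) = sum_{i in SV} alpha_i y_i K(x_i, x) + b."""
    return sum(a_i * y_i * kernel(x_i, x)
               for a_i, y_i, x_i in zip(sv_alpha, sv_y, sv_X)) + b

# predicted label: +1 if discriminant(x, ...) >= 0 else -1
```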
47
Nonlinear SVMs: The Kernel Trick
An example:

Assume we choose the kernel function K(xi, xj) = (1 + xiᵀxj)²

Our goal is to find a non-linear projection φ(x) such that (1 + xiᵀxj)² = φᵀ(xi) φ(xj)

Performing the expansion of K(xi, xj):
K(xi, xj) = (1 + xiᵀxj)²
= 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
= φ(xi)ᵀ φ(xj),
where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]ᵀ

So in using the kernel K(xi, xj) = (1 + xiᵀxj)², we are implicitly operating on a higher-dimensional non-linear manifold defined by
φ(xi) = [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2]ᵀ

Notice that the inner product φᵀ(xi) φ(xj) can be computed in R² by means of the kernel (1 + xiᵀxj)², without ever having to project onto R⁶!
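A quick numeric check of this expansion (the 2-D vectors below are chosen for illustration): the kernel value and the explicit 6-dimensional dot product coincide.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(xi, xj) = (1 + xi.xj)^2 in 2-D."""
    return np.array([1.0, x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2,
                     np.sqrt(2)*x[0], np.sqrt(2)*x[1]])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1.0 + xi @ xj) ** 2)    # kernel computed in R^2
print(phi(xi) @ phi(xj))       # same value, computed via the R^6 projection
```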
48
Kernel methods
Let’s now see how to put together all these
concepts
49
Kernel methods
Let’s now see how to put together all these
concepts

Assume that our original feature vector x lives in a space R^D

We are interested in non-linearly projecting x onto a higher-dimensional implicit space φ(x) ∈ R^{D1}, D1 > D, where the classes have a better chance of being linearly separable
– Notice that we are not guaranteeing linear separability; we are only saying that we have a better chance, because of Cover's theorem

The separating hyperplane in R^{D1} will be defined by

To eliminate the bias term 𝑏, let’s augment the feature vector in the implicit space with
a constant dimension 𝜑0(𝑥)=1

Using vector notation, the resulting hyperplane becomes

From our previous results, the optimal (maximum margin) hyperplane in the implicit
space is given by
50
Kernel methods

Merging this optimal weight vector with the hyperplane equation

and, since 𝜑𝑇(𝑥𝑖)𝜑(𝑥𝑗)=𝐾(𝑥𝑖,𝑥𝑗), the optimal hyperplane becomes

Therefore, classification of an unknown example 𝑥 is performed by computing the
weighted sum of the kernel function with respect to the support vectors 𝑥𝑖 (remember
that only the support vectors have non-zero dual variables 𝛼𝑖)
51
Kernel methods
How do we compute dual variables 𝜶𝒊 in the implicit space?

Very simple: we use the same optimization problem as before, and replace the dot
product 𝜑𝑇(𝑥𝑖) 𝜑(𝑥𝑗) with the kernel 𝐾 (𝑥𝑖,𝑥𝑗)

The Lagrangian dual problem for the non-linear SVM is simply

subject to the constraints
52
Kernel methods
How do we select the implicit mapping 𝝋(𝑥)?

As we saw in the example a few slides back, we will normally select a kernel function
first, and then determine the implicit mapping φ(x) that it corresponds to
Then, how do we select the kernel function K(xi, xj)?

We must select a kernel for which an implicit mapping exists, that is, a kernel that can be expressed as the dot product of two vectors
For which kernels K(xi, xj) does there exist an implicit mapping φ(x)?

The answer is given by Mercer’s Condition
53
Some Notes On Kernel Trick
What's the point of the kernel trick? It is easier to compute!
 Example: consider the quadratic kernel, e.g. K(a, b) = (a · b + 1)², where our original data x have m dimensions.
Number of terms in the explicit feature vector: (m+2)-choose-2, around m²/2.
Figure is from Andrew Moore, CMU
54
Some Notes On Kernel Trick
That dot product requires m²/2 additions and multiplications.
So, how about computing the kernel value directly? It is only O(m)!
Figure is from Andrew Moore, CMU
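A small sketch of this cost comparison (assuming the quadratic kernel has the form K(a, b) = (1 + a·b)², and picking an arbitrary dimension m): computing the kernel directly is one O(m) dot product, while the equivalent explicit feature map has about m²/2 coordinates.

```python
import numpy as np

m = 500
rng = np.random.default_rng(0)
a, b = rng.standard_normal(m), rng.standard_normal(m)

def phi(x):
    """Explicit feature map for K(a, b) = (1 + a.b)^2: 1 + 2m + m(m-1)/2 coordinates."""
    cross = np.sqrt(2.0) * np.outer(x, x)[np.triu_indices(len(x), k=1)]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, x ** 2, cross))

k_direct = (1.0 + a @ b) ** 2      # O(m) work
k_explicit = phi(a) @ phi(b)       # O(m^2) work
print(phi(a).size)                 # (m+2)-choose-2 features, roughly m^2 / 2
print(np.isclose(k_direct, k_explicit))   # True: same kernel value either way
```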
55
Some Notes on Φ and H

Mercer’s condition tells us whether or not a prospective
kernel is a dot product in some space.

But how do we construct Φ, or even know what H is, if we are given a kernel? (Usually we don't need to know Φ; here we just have fun exploring what Φ looks like.)
– For homogeneous polynomial kernels we can actually construct Φ explicitly
– Example: for data in R², with a homogeneous quadratic kernel such as K(x, z) = (x · z)², we can easily get φ(x) = (x1², √2 x1x2, x2²)
56
Some Notes on Φ and H

Extend to arbitrary homogeneous polynomial kernels
– Remember the Multinomial Theorem:

  (x1 + x2 + ⋯ + x_dL)^p = Σ_{r1 + r2 + ⋯ + r_dL = p} ( p! / (r1! r2! ⋯ r_dL!) ) Π_{1 ≤ i ≤ dL} xi^{ri}

– Number of terms: choose(dL + p − 1, p)
– Consider a homogeneous polynomial kernel: we can explicitly get Φ
57
Some Notes on Φ and H

We can also start with Φ, and then construct the kernel.

Example:
– Consider the Fourier expansion of the data x in R, cut off after N terms
– This Φ maps x to a vector in R^{2N−1}
– We can get the (Dirichlet) kernel:
58
Some Notes on Φ and H

Prove: Given
Then

Proof

Finally, it is clear that the above implicit mapping trick will
work for any algorithm in which the data only appear as
dot products (for example, the nearest neighbor
algorithm).
59
Some Examples of Nonlinear SVMs


Linear kernel
Polynomial kernels
– where P = 2, 3, .... To get the feature vectors, we concatenate all polynomial terms up to order P in the components of x
 Radial basis function (RBF) kernels
– σ is a user-specified parameter; in this case the feature space is an infinite-dimensional function space
 Hyperbolic tangent kernel (sigmoid kernel)
– This kernel satisfies Mercer's condition only for certain values of the parameters κ and δ
– An SVM using a sigmoid kernel function is equivalent to a two-layer perceptron neural network
– A common value for κ is 1/N, where N is the data dimension
– A nice paper is available for further information
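For reference, here is a hedged sketch (assumed toy data; the parameter values are illustrative) of these kernel choices as exposed by scikit-learn's SVC:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

models = {
    "linear":     SVC(kernel="linear"),
    "poly (P=3)": SVC(kernel="poly", degree=3, coef0=1.0),
    "RBF":        SVC(kernel="rbf", gamma=0.5),       # gamma plays the role of 1/(2*sigma^2)
    "sigmoid":    SVC(kernel="sigmoid", coef0=-1.0),  # Mercer holds only for some kappa, delta
}
for name, clf in models.items():
    print(name, clf.fit(X, y).score(X, y))            # training accuracy, just for illustration
```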
60
Polynomial Kernel SVM Example
Slide from Tommi S.
Jaakkola, MIT
61
RBF Kernel SVM Example
Kernel we are using:
Notice that: decreasing sigma moves the classifier towards a nearest-neighbor classifier
Slides from A. Zisserman
62
RBF Kernel SVM Example
Kernel we are using:
Notice that: decreasing C gives a wider (soft) margin
Slides from A. Zisserman
63
Global Solutions and Uniqueness

Fact:
– Training an SVM amounts to solving a convex quadratic
programming problem.
– A proper kernel must satisfy Mercer's positivity condition.

Notes from optimization theory:
– If the objective function is strictly convex, the solution is
guaranteed to be unique.
– For quadratic programming problems, convexity of the objective
function is equivalent to positive semi-definiteness of the
Hessian, and strict convexity, to positive definiteness.
– For loosely convex problems (convex but not strictly convex), one must examine uniqueness case by case.
64
Global Solutions and Uniqueness

So, if the Hessian is positive definite, we are happy to announce that the solution of the following objective function is unique.
The objective function:
The Hessian:
(Quick reminder: the Hessian is the square matrix of second-order partial derivatives of a function.)

What if the Hessian is only positive semi-definite, i.e., the objective function is loosely convex?
– There is a necessary and sufficient condition proposed by J. C. Burges et al.; see their paper for further information.
– Non-uniqueness of the SVM solution will be the exception rather than the rule.
66
Thanks
Pic by Mark. A. Hicks
67