Support Vector Machines
Tao Tao
Department of Computer Science
University of Illinois
Adapting content and slides from:
• Gentle Guide to Support Vector Machines, Ming-Hsuan Yang
• Support Vector and Kernel Methods, Thorsten Joachims
Problem

• Find the optimal hyperplane to classify the data points
• How do we choose this hyperplane?
• What is “optimal”?

Intuition: to maximize the margin
What is “optimal”?

• Statistically: risk minimization
  Risk function:
  Risk_P(h) = P(h(x) ≠ y) = ∫ Δ(h(x) ≠ y) dP(x, y),   h ∈ H
  where h is the hyperplane decision function, x is an input vector, y ∈ {1, −1}, and Δ is the indicator function
• Minimization:
  h_opt = argmin_{h ∈ H} Risk_P(h)
In practice…




• Given N observations (xi, yi), with labels yi ∈ {1, −1}
• Look for a mapping x → f(x, α) ∈ {1, −1}
• Expected risk: R(α) = ∫ ½ |y − f(x, α)| dP(x, y)
• Empirical risk: R_emp(α) = (1/2N) ∑i |yi − f(xi, α)|  (a numerical sketch follows below)
• Question: are the two consistent in terms of minimization?
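To make the two notions concrete, here is a minimal numerical sketch (toy data and a made-up hyperplane, not from the slides) that evaluates the empirical risk of a linear decision function:

```python
import numpy as np

# Estimate the empirical risk R_emp = (1/2N) * sum_i |y_i - f(x_i)| for a
# linear decision function f(x) = sgn(w.x + b). Data and (w, b) are made up.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0           # hypothetical hyperplane parameters

f = np.sign(X @ w + b)                      # predictions in {1, -1}
emp_risk = np.mean(np.abs(y - f)) / 2.0     # equals the misclassification rate
print(emp_risk)                             # 0.0 for this separable toy set
```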
Vapnik/Chervonenkis (VC) dimension
Definition: the VC dimension of H is the maximum number h of examples that can be split into two classes in all 2^h possible ways using functions from H.
Example: for linear separators in R², the VC dimension is 3 (in R^n it is n+1).
But 4 points in R² (e.g., the XOR labeling) cannot be shattered, as the check below illustrates.
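A small brute-force check of these claims, assuming NumPy and SciPy are available (the point sets below are illustrative):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Check feasibility of y_i * (w . x_i + b) >= 1 via a linear program."""
    n, d = X.shape
    # Variables z = (w_1, ..., w_d, b); constraints -y_i*(w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0   # 0 means a feasible point was found

def shatterable(X):
    """True if every +/-1 labeling of the points in X is linearly separable."""
    n = len(X)
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([1, -1], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # 3 generic points
xor4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR configuration
print(shatterable(three))   # True: lines shatter 3 points in R^2
print(shatterable(xor4))    # False: the XOR labeling is not separable
```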
Upper bound for expected risk
The bound, written out below, is the sum of the training (empirical) error and a capacity term, and it holds for the expected risk with probability 1 − η.
h is the VC dimension; the second term is called the VC confidence.
Keeping both terms small is what avoids overfitting.
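A standard written-out form of this bound, consistent with the labels above (N is the number of observations):

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}
```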
Error vs. VC dimension
Want to minimize expected risk?

• It is not enough to minimize only the empirical risk
• We also need to choose an appropriate VC dimension
• Both parts of the bound must be kept small

Solution: Structural Risk Minimization (SRM)


Structural risk minimization

• Nested structure of hypothesis spaces: H1 ⊆ H2 ⊆ … ⊆ Hn ⊆ …
  h(n) ≤ h(n+1), where h(n) is the VC dimension of Hn
• Tradeoff between VC dimension and empirical risk:
  within each Hn take the hypothesis with minimum empirical risk, then choose the Hn (i.e., the VC dimension) that minimizes the bound (see the sketch below)
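A rough sketch of the SRM idea on toy data, assuming scikit-learn: the nested hypothesis spaces are polynomial-kernel SVMs of increasing degree, and held-out error is used here as a practical stand-in for the bound (the true SRM recipe minimizes the bound itself):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Nested hypothesis spaces H_1 ⊆ H_2 ⊆ ... realized as polynomial-kernel SVMs
# of increasing degree (capacity grows with the degree).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] - 0.5)          # a nonlinear toy concept
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 2, 3, 5):
    clf = SVC(kernel="poly", degree=degree, C=1.0).fit(X_tr, y_tr)
    print(degree, 1 - clf.score(X_tr, y_tr), 1 - clf.score(X_va, y_va))
    # training error shrinks with capacity; held-out error need not
```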
Linear SVM
• Given data points xi ∈ R^n with labels yi ∈ {1, −1}
• Linearly separable: there exist w ∈ R^n and b ∈ R such that
  yi(w●xi + b) ≥ 1
• Scale (w, b) so that the distance of the closest points, say xj, to the hyperplane equals 1/||w||
• Optimal separating hyperplane (OSH): the one that maximizes 1/||w|| (see the sketch below)
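A minimal sketch of fitting such a linear SVM with scikit-learn (an assumed dependency; the data are made up) and reading off w, b, the margin 1/||w||, and the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable toy set, made up for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin linear SVM of the slides.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]             # normal vector of the separating hyperplane
b = clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)
print(w, b, margin)
print(clf.support_vectors_)  # the xi with yi(w.xi + b) = 1
```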
Linear SVM example


• Given (xi, yi), find (w, b) defining the hyperplane <w●x> + b = 0
• Additional requirement: min_i |<w●xi> + b| = 1
ID    x1   x2   x3   x4   x5   x6   x7    y
D1     1    2    0    0    2    0    2    1
D2     0    0    0    3    0    1    1   -1
D3     0    2    1    0    0    0    3    1
D4     0    0    1    1    1    1    1   -1
w      2    3   -1   -3   -1   -1    0   b=1
f(x,w,b) = sgn(x●w+b)
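A quick check of this decision function on the table above (values transcribed as given; the snippet itself is not part of the original slides):

```python
import numpy as np

# Document vectors D1-D4 and labels, transcribed from the table above.
X = np.array([[1, 2, 0, 0, 2, 0, 2],
              [0, 0, 0, 3, 0, 1, 1],
              [0, 2, 1, 0, 0, 0, 3],
              [0, 0, 1, 1, 1, 1, 1]], dtype=float)
y = np.array([1, -1, 1, -1])
w = np.array([2, 3, -1, -3, -1, -1, 0], dtype=float)
b = 1.0

# f(x, w, b) = sgn(x.w + b); the signs reproduce the labels y.
print(np.sign(X @ w + b))   # [ 1. -1.  1. -1.]
```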
VC dimension upper bound

Lemma [Vapnik 1995]:
• Let R be the radius of the smallest ball {x : ||x − a|| < R} covering all the data points, and let f_{w,b}(x) = sgn(w●x + b) be decision functions with ||w|| ≤ A.
• Then the VC dimension satisfies h ≤ R²A² + 1.
[Figure: a ball of radius R containing the data, the hyperplane normal w, and the margin δ]
With the canonical scaling, ||w|| = 1/δ, where δ is the margin.
So …
Maximizing the margin δ
═> Minimizing ||w||
═> Smallest acceptable VC dimension
═> Constructing an optimal hyper-plane
Is everything clear??
How to do it? Quadratic Programming!
Constrained quadratic programming
Minimize ½ <w●w>
Subject to yi(<w●xi>+b) ≥ 1
Solve it with Lagrange multipliers by finding the saddle point (a numerical sketch follows below).
For more details, see the book “An Introduction to Support Vector Machines”.
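A sketch of that quadratic program solved directly, assuming the cvxopt package; the toy data are made up, and a tiny regularization term on b is added purely for numerical stability:

```python
import numpy as np
from cvxopt import matrix, solvers

# Hard-margin primal QP:
#   minimize (1/2) <w, w>   subject to   y_i (<w, x_i> + b) >= 1
# Variables are z = (w_1, ..., w_d, b).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8                       # tiny term on b to keep the solver stable
q = np.zeros(d + 1)
# Constraints y_i(w.x_i + b) >= 1 rewritten as G z <= h.
G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
h = -np.ones(n)

solvers.options["show_progress"] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
z = np.array(sol["x"]).ravel()
w, b = z[:d], z[d]
print(w, b, 1.0 / np.linalg.norm(w))   # hyperplane and margin
```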
What are “support vectors”?
yi(w●xi + b) ≥ 1
Most xi satisfy the strict inequality; the xi for which equality holds are called support vectors.
[Figure: separating hyperplane with the support vectors lying on the margin]
Inseparable data
Soft margin classifier



• Loosen the margin by introducing N nonnegative slack variables ξ = (ξ1, ξ2, …, ξN)
• so that yi(<w●xi> + b) ≥ 1 − ξi
• Problem:
  Minimize ½ <w●w> + C ∑ ξi
  Subject to yi(<w●xi> + b) ≥ 1 − ξi,  ξi ≥ 0
(a small sketch computing the slacks follows below)
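A small sketch of the soft-margin classifier on toy data, assuming scikit-learn: it recovers the slacks ξi = max(0, 1 − yi(<w●xi> + b)) and shows how C trades margin width against misclassification (this also illustrates the next slide):

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping toy data, made up for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 1.0, size=(30, 2)),
               rng.normal(-1.0, 1.0, size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack of each point
    print(f"C={C}: margin={1/np.linalg.norm(w):.2f}, "
          f"misclassified={(xi > 1).sum()}, inside margin={((0 < xi) & (xi <= 1)).sum()}")
    # small C favors a wide margin; large C penalizes slack (misclassification)
```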
C and ξ

• C:
  • small C: emphasizes maximizing the minimum distance (a wide margin)
  • large C: emphasizes minimizing the number of misclassified points
• ξi:
  • ξi > 1: misclassified points
  • 0 < ξi < 1: correctly classified, but closer to the hyperplane than 1/||w||
  • ξi = 0: margin vectors
Nonlinear SVM
[Figure: a map Φ from the input space (R²) to a feature space (R³) in which the data become linearly separable]
Example: attributes a | b | c map under a degree-2 Φ to a | b | c | aa | ab | ac | bb | bc | cc

Problem: very many parameters! O(N^p) attributes in the feature space, for N attributes and degree p (counted concretely in the sketch below).

Solution: Kernel methods!
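A quick way to see the O(N^p) growth, assuming scikit-learn's PolynomialFeatures as a stand-in for the explicit map Φ:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Count how many attributes an explicit degree-p feature map produces
# for N input attributes -- this grows like O(N^p), as noted above.
for N in (3, 10, 100):
    for p in (2, 3):
        n_features = PolynomialFeatures(degree=p).fit(np.zeros((1, N))).n_output_features_
        print(f"N={N}, degree={p}: {n_features} features")
```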
Dual representations
Lagrange multipliers:
L(w, b, α) = ½ <w●w> − ∑i αi [ yi(<w●xi> + b) − 1 ],  αi ≥ 0
Require (saddle point): ∂L/∂w = 0 ⇒ w = ∑i αi yi xi,  ∂L/∂b = 0 ⇒ ∑i αi yi = 0
Substitute back to obtain the dual:
Maximize W(α) = ∑i αi − ½ ∑i,j αi αj yi yj <xi●xj>
Constrained QP using dual
Maximize ∑i αi − ½ αᵀDα subject to ∑i αi yi = 0 and αi ≥ 0, where D is an N×N matrix with Di,j = yi yj <xi●xj>.
Observation: the only way the data points appear in the training problem is in the form of dot products <xi●xj> (see the sketch below).
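A small sketch of that observation: the matrix D is built entirely from the Gram matrix of dot products, which is exactly what a kernel will later replace (toy data, NumPy assumed):

```python
import numpy as np

# Build D_ij = y_i y_j <x_i, x_j> from the dual formulation above.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

gram = X @ X.T                      # all pairwise dot products <x_i, x_j>
D = np.outer(y, y) * gram           # D_ij = y_i y_j <x_i, x_j>

def dual_objective(alpha):
    """W(alpha) = sum_i alpha_i - 1/2 * alpha^T D alpha."""
    return alpha.sum() - 0.5 * alpha @ D @ alpha

# alpha = 0.1 everywhere happens to satisfy sum_i alpha_i y_i = 0 here.
print(dual_objective(np.full(len(y), 0.1)))
```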
Go back to nonlinear SVM…

• Original (linear) dual: maximize ∑i αi − ½ ∑i,j αi αj yi yj <xi●xj>
• Expanding to a high-dimensional feature space: replace <xi●xj> with Φ(xi)●Φ(xj)
• Problem: computing Φ explicitly is expensive.
• Fortunately, we only ever need the dot products Φ(xi)●Φ(xj).

Kernel function
K(xi,xj) = Φ(xi)●Φ(xj)



• Compute K(xi, xj) without knowing the exact Φ
• Replace <xi●xj> by K(xi, xj)
• All the previous derivations for the linear SVM still hold (a numerical check follows below)
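A numerical check of the kernel trick for one concrete case, the quadratic kernel K(u, v) = (<u●v>)² in R², whose explicit feature map is known in closed form (this specific kernel and map are illustrative choices, not from the slides):

```python
import numpy as np

# Explicit feature map for K(u, v) = (u.v)^2 in R^2:
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(u, v):
    return np.dot(u, v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(u), phi(v)))   # 1.0
print(K(u, v))                  # 1.0 -- same value, Phi never needed explicitly
```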
How do we decide whether a function is a valid kernel?

Mercer's condition (necessary and sufficient):
K(u, v) is symmetric and ∫∫ K(u, v) g(u) g(v) du dv ≥ 0 for every g with ∫ g(u)² du < ∞.
Some examples of kernel functions:
• Polynomial: K(u, v) = (<u●v> + 1)^p
• Gaussian RBF: K(u, v) = exp(−||u − v||² / 2σ²)
• Sigmoid: K(u, v) = tanh(κ<u●v> − δ)
Multiple classes (k)




• One-against-the-rest: k SVMs
• One-against-one: k(k−1)/2 SVMs
• k-class SVM (a single joint optimization)
• John Platt’s DAG method (DAGSVM)
(a sketch of the first two strategies follows below)
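A sketch of the first two strategies built from binary SVMs, assuming scikit-learn; the blob data set is made up:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# k binary SVMs for one-vs-rest, k(k-1)/2 for one-vs-one.
X, y = make_blobs(n_samples=150, centers=4, random_state=0)   # k = 4 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))   # 4 = k binary SVMs
print(len(ovo.estimators_))   # 6 = k(k-1)/2 binary SVMs
```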
Application in text classification


Counting each term in an article
An article, therefore, becomes a vector (x)
[Figure: a sample passage (“Further reading and advanced topics… the problem of linear regression is much older than classification…”) reduced to a vector of term counts, e.g. the terms read, problem, class with counts 2, 4, 5]
Attributes: terms; values: occurrence counts or frequencies
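A minimal end-to-end sketch, assuming scikit-learn: term counts as attributes and a linear SVM as the classifier (documents and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Each article becomes a vector of term counts, then a linear SVM is trained.
docs = ["the theory of linear classification",
        "support vector machines maximize the margin",
        "linear regression is older than classification",
        "kernel methods map data to a feature space"]
labels = [1, 1, -1, -1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse count vectors x
clf = LinearSVC().fit(X, labels)

test = vectorizer.transform(["a linear support vector classifier"])
print(clf.predict(test))
```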
Conclusions







• Linear SVM
• VC dimension
• Soft margin classifier
• Dual representation
• Nonlinear SVM
• Kernel methods
• Multi-class classification
Thank you!