Support Vector Machines
Tao Tao
Department of Computer Science
University of Illinois
Adapting much content and even slides from:
• Gentle Guide to Support Vector Machines, Ming-Hsuan Yang
• Support Vector and Kernel Methods, Thorsten Joachims
Problem
Optimal hyper-plane to classify data points
How do we choose this hyper-plane?
What is “optimal”?
Intuition: to maximize the margin
What is “optimal”?
• Statistically: risk minimization
Risk function:
Risk_P(h) = P(h(x) ≠ y) = ∫ Δ(h(x) ≠ y) dP(x, y),  h ∈ H
where h is the hyper-plane (decision) function, x is an input vector, y ∈ {1, -1} is its label, and Δ is the indicator function.
• Minimization:
h_opt = argmin_{h ∈ H} Risk_P(h)
In practice…
Given N observations (xi, yi), i = 1, …, N, with labels yi ∈ {1, -1}
Looking for a mapping x → f(x, α) ∈ {1, -1}
Expected risk: R(α) = ∫ ½ |y − f(x, α)| dP(x, y)
Empirical risk: Remp(α) = (1/2N) ∑i |yi − f(xi, α)|
Question: are the two consistent in terms of minimization?
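As a concrete illustration, here is a minimal Python sketch of the empirical risk: the fraction of the N observations on which a candidate decision function disagrees with the label. The data and the (w, b) below are made up.

```python
import numpy as np

# Empirical risk of a candidate linear decision function f(x) = sgn(<w, x> + b):
# the fraction of training points it misclassifies. Data and (w, b) are hypothetical.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

predictions = np.sign(X @ w + b)
empirical_risk = np.mean(predictions != y)   # Remp(f) = (1/N) Σ Δ(f(xi) ≠ yi)
print(empirical_risk)
```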
Vapnik/Chervonenkis (VC) dimension
Definition: the VC dimension of H is the maximum number h of examples that can be split into two classes in all 2^h ways using functions from H (i.e., shattered by H).
Example: in R², the VC dimension of linear decision functions is 3 (in R^n it is n + 1).
But 4 points in R² (e.g., the XOR configuration) cannot be shattered by a line.
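A small sketch of shattering, assuming scikit-learn is available: three non-collinear points in R² (hypothetical coordinates) can be labeled in all 2³ = 8 ways by a linear classifier; a perceptron is used here only as a convenient linear separator.

```python
import itertools
import numpy as np
from sklearn.linear_model import Perceptron

# Three non-collinear points in R^2; every one of the 2^3 labelings is linearly
# separable, so this set is shattered by linear decision functions.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

for labels in itertools.product([-1, 1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) == 1:
        separable = True                          # a single class is trivially separable
    else:
        clf = Perceptron(max_iter=10_000, tol=None).fit(X, y)
        separable = clf.score(X, y) == 1.0        # perceptron converges on separable data
    print(labels, separable)                      # every labeling should be separable
```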
Upper bound for expected risk
With probability 1 − η, the expected risk is bounded by
Risk(f) ≤ Riskemp(f) + sqrt( (h (ln(2N/h) + 1) − ln(η/4)) / N )
Riskemp(f): the training error
h: the VC dimension
the second term: the VC confidence
To avoid overfitting, both terms must be kept small.
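Assuming the bound takes the standard form written above, here is a small Python sketch of how the VC confidence term grows with h (the values of N and η are arbitrary choices):

```python
import numpy as np

# VC confidence term of the bound, assuming the standard form
# sqrt((h (ln(2N/h) + 1) - ln(eta/4)) / N).
def vc_confidence(N, h, eta=0.05):
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

for h in (10, 100, 1000):
    print(h, vc_confidence(N=10_000, h=h))   # grows with the VC dimension h
```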
Error vs. VC dimension
Want to minimize expected risk?
It is not enough just to minimize the empirical risk;
we also need to choose an appropriate VC dimension
so that both parts of the bound are small.
Solution: Structural Risk Minimization (SRM)
Structural risk minimization
Nested structure of hypothesis spaces:
H1 ⊆ H2 ⊆ … ⊆ Hn ⊆ …, with h(n) ≤ h(n+1), where h(n) is the VC dimension of Hn
Tradeoff: a larger VC dimension allows a smaller minimum empirical risk but a larger VC confidence;
SRM picks the Hn that minimizes the bound as a whole.
Linear SVM
Given xi ∈ Rn
Linearly separable: there exist w ∈ Rn and b ∈ R such that
yi(<w●xi>+b) ≥ 1 for all i
Scale (w, b) so that the distance from the hyper-plane to the closest points, say xj, equals 1/||w||
Optimal separating hyper-plane (OSH): maximize the margin 1/||w||
Linear SVM example
Given (xi, yi), find (w, b) such that the hyper-plane <w●x>+b = 0 separates the data,
with the additional requirement mini |<w●xi>+b| = 1
ID   x1  x2  x3  x4  x5  x6  x7    y
D1    1   2   0   0   2   0   2    1
D2    0   0   0   3   0   1   1   -1
D3    0   2   1   0   0   0   3    1
D4    0   0   1   1   1   1   1   -1
w     2   3  -1  -3  -1  -1   0   b = 1
f(x,w,b) = sgn(x●w+b)
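As a quick sanity check in Python (reading the table with documents D1-D4 as rows and x1-x7 as attributes), applying f(x, w, b) = sgn(<x●w>+b) with the listed w and b reproduces the labels y:

```python
import numpy as np

# Documents D1-D4 from the table above (rows), attributes x1-x7 (columns).
X = np.array([
    [1, 2, 0, 0, 2, 0, 2],   # D1
    [0, 0, 0, 3, 0, 1, 1],   # D2
    [0, 2, 1, 0, 0, 0, 3],   # D3
    [0, 0, 1, 1, 1, 1, 1],   # D4
])
y = np.array([1, -1, 1, -1])

# Hyper-plane parameters as listed in the last row of the table.
w = np.array([2, 3, -1, -3, -1, -1, 0])
b = 1

scores = X @ w + b
print(scores)            # signed scores <w, xi> + b
print(np.sign(scores))   # f(x, w, b) = sgn(<x, w> + b); matches the labels y
```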
VC dimension upper bound
Lemma [Vapnik 1995]
• Let R be the radius of the smallest ball {x : ||x − a|| < R} that covers all the data points x.
• Let fw,b = sgn(<w●x>+b) be the decision functions with ||w|| ≤ A.
• Then the VC dimension h < R²A² + 1.
(Figure: a ball of radius R containing the data, with separating hyper-plane w and margin δ.)
Since ||w|| = 1/δ, where δ is the margin, a large margin means a small ||w|| and hence a small VC dimension.
So …
Maximizing the margin δ
⇒ minimizing ||w||
⇒ smallest acceptable VC dimension
⇒ constructing an optimal hyper-plane
Is everything clear??
How to do it? Quadratic Programming!
Constrained quadratic programming
Minimize ½ <w●w>
Subject to yi(<w●xi>+b) ≥ 1
Solve it with Lagrange multipliers to find the saddle point.
For more details, see the book:
An Introduction to Support Vector Machines
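As a minimal sketch (not the Lagrangian derivation used in the book), the primal QP can also be handed to a general-purpose constrained optimizer; the 2-D data below is made up, and scipy's SLSQP method is assumed to be available.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data in R^2 (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
n_features = X.shape[1]

def objective(params):
    w = params[:n_features]
    return 0.5 * np.dot(w, w)            # ½<w, w>; b is not penalized

def margin_constraints(params):
    w, b = params[:n_features], params[n_features]
    return y * (X @ w + b) - 1           # must be ≥ 0: yi(<w, xi> + b) ≥ 1

result = minimize(
    objective,
    x0=np.zeros(n_features + 1),
    constraints=[{"type": "ineq", "fun": margin_constraints}],
    method="SLSQP",
)
w, b = result.x[:n_features], result.x[n_features]
print(w, b)                              # optimal separating hyper-plane
print(2 / np.linalg.norm(w))             # width of the margin between the classes, 2/||w||
```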
What is “support vectors”?
yi(w●xi+b) ≥ 1
Most of the xi satisfy this constraint with strict inequality;
the xi for which equality holds, yi(<w●xi>+b) = 1,
are called support vectors.
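A short sketch, assuming scikit-learn, that fits a (nearly) hard-margin linear SVM on made-up 2-D data and inspects which points come out as support vectors, i.e. those with yi(<w●xi>+b) ≈ 1:

```python
import numpy as np
from sklearn.svm import SVC

# Linear SVC with a very large C approximates the hard-margin problem.
# The 2-D data is hypothetical.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 4.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

margins = y * (X @ w + b)
print(clf.support_)           # indices of the support vectors
print(np.round(margins, 3))   # support vectors have yi(<w, xi> + b) ≈ 1, others > 1
```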
Inseparable data
Soft margin classifier
Loosen the margin by introducing N nonnegative slack variables ξ = (ξ1, ξ2, …, ξN),
so that yi(<w●xi>+b) ≥ 1 − ξi
Problem:
Minimize ½ <w●w> + C ∑ ξi
Subject to yi(<w●xi>+b) ≥ 1 − ξi
ξi ≥ 0
C and ξ
C (see the sketch below for its effect):
• C small: maximize the minimum distance (a wide margin)
• C large: minimize the number of misclassified points
ξi:
• ξi > 1: misclassified points
• 0 < ξi < 1: correctly classified, but closer to the hyper-plane than 1/||w||
• ξi = 0: margin vectors
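A sketch of how C trades margin width against violations, assuming scikit-learn; the overlapping 2-D data is synthetic, and the slacks are recovered from the fitted (w, b) as ξi = max(0, 1 − yi(<w●xi>+b)):

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds (synthetic data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2, 2], scale=1.2, size=(50, 2)),
               rng.normal(loc=[0, 0], scale=1.2, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    xi = np.maximum(0, 1 - y * (X @ w + b))      # slack of each training point
    print(f"C={C}: margin width={2 / np.linalg.norm(w):.2f}, "
          f"misclassified={(xi > 1).sum()}, margin violations={(xi > 0).sum()}")
```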
Nonlinear SVM
(Figure: a map Φ from the input space, e.g. R², into a higher-dimensional feature space, e.g. R³.)
Example: Φ maps attributes a | b | c to a | b | c | aa | ab | ac | bb | bc | cc (all monomials up to degree 2).
Problem: very many parameters! O(N^p) attributes in the feature space for N input attributes and degree p.
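To see the blow-up concretely: the number of monomials of degree at most p in N attributes is C(N + p, p), i.e. O(N^p) for fixed p. A tiny sketch:

```python
from math import comb

# Number of monomials of degree <= p in N input attributes: C(N + p, p).
for N in (10, 100, 1000):
    for p in (2, 3, 5):
        print(f"N={N}, p={p}: {comb(N + p, p)} features")
```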
Solution: Kernel methods!
Dual representations
Lagrange multipliers:
L(w, b, α) = ½ <w●w> − ∑i αi [ yi(<w●xi>+b) − 1 ],  αi ≥ 0
Require (saddle point):
∂L/∂w = 0 ⇒ w = ∑i αi yi xi,   ∂L/∂b = 0 ⇒ ∑i αi yi = 0
Substitute back to obtain the dual:
Maximize W(α) = ∑i αi − ½ ∑i ∑j αi αj yi yj <xi●xj>
Subject to αi ≥ 0 and ∑i αi yi = 0
Constrained QP using dual
D is an N×N matrix such that Di,j = yiyj<xi●xj>
Observation: the only way the data points appear in the training problem is in the form of dot products <xi●xj>.
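A minimal sketch of building D on made-up data, showing that the training points enter the dual only through their pairwise dot products:

```python
import numpy as np

# D_ij = yi * yj * <xi, xj>: the data appears only via the Gram matrix of dot products.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])   # hypothetical data
y = np.array([1, 1, -1])

gram = X @ X.T                   # <xi, xj> for all pairs
D = np.outer(y, y) * gram        # D_ij = yi yj <xi, xj>
print(D)
```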
Go back to nonlinear SVM…
Original: the dual uses the dot products <xi●xj>
Expanding to a high-dimensional space: replace <xi●xj> with Φ(xi)●Φ(xj)
Problem: Φ is computationally expensive.
Fortunately: We only need Φ(xi)●Φ(xj)
Kernel function
K(xi,xj) = Φ(xi)●Φ(xj)
Without knowing Φ explicitly:
replace <xi●xj> by K(xi,xj);
all the previous derivations for the linear SVM still hold.
How to decide on a kernel function?
Mercer's condition (necessary and sufficient):
K(u,v) is symmetric and ∫∫ K(u,v) g(u) g(v) du dv ≥ 0 for every square-integrable function g.
Some examples for kernel functions
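Common choices (not necessarily the exact ones shown on the original slide) include the polynomial kernel K(x,z) = (<x●z> + 1)^p and the Gaussian RBF K(x,z) = exp(−||x − z||² / (2σ²)). The sketch below checks numerically that the homogeneous degree-2 polynomial kernel K(x,z) = <x●z>² equals Φ(x)●Φ(z) for the explicit map Φ(x) = (x1², √2·x1x2, x2²) on R²:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for x in R^2.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    # Homogeneous degree-2 polynomial kernel.
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 3.0]), np.array([2.0, -1.0])
print(np.dot(phi(x), phi(z)), k(x, z))   # the two values agree (up to floating point)
```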
Multiple classes (k)
One-against-the-rest: k SVMs
One-against-one: k(k−1)/2 SVMs
k-class SVM
John Platt’s DAG method
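A sketch of the two standard reductions, assuming scikit-learn; the 4-class data is synthetic, and the estimators_ attribute simply exposes how many binary SVMs each strategy trains:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 4-class problem.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))   # one-against-the-rest: k = 4 binary SVMs
print(len(ovo.estimators_))   # one-against-one: k(k-1)/2 = 6 binary SVMs
```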
Application in text classification
Count the occurrences of each term in an article;
the article thereby becomes a vector x.
(Figure: an example passage, "Further reading and advanced topics… The problem of linear regression is much older than classification…", is converted into a term-count vector, e.g. read → 2, problem → 4, …, class → 5.)
Attributes: terms
Values: occurrence counts or frequencies
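A minimal sketch of this representation using scikit-learn's CountVectorizer; the two example passages are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each article becomes a vector of term counts (attributes = terms, values = occurrences).
docs = [
    "the problem of linear regression is much older than classification",
    "support vector machines solve the classification problem",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # attributes: terms
print(X.toarray())                          # values: occurrence counts
```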
Conclusions
Linear SVM
VC dimension
Soft margin classifier
Dual representation
Nonlinear SVM
Kernel methods
Multi-class classification
Thank you!