
Support Vector Machines
Joseph Gonzalez
From a linear classifier to ...
*One of the most famous
slides you will see, ever!
[Figure: training points in the plane, X = positive examples, O = negative examples, separated by a linear classifier]
Maximum margin
Maximum possible separation between
positive and negative training examples
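To make "maximum possible separation" concrete (a standard fact the slide leaves implicit): if the closest points satisfy $y_i(w \cdot x_i + b) = 1$, the gap between the two margin hyperplanes is

$$\text{margin} = \frac{2}{\|w\|}$$

so maximizing the margin is the same as minimizing $\|w\|^2$, which is where the primal objective below comes from.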
Geometric Intuition
SUPPORT VECTORS
[Figure: the separating hyperplane with its margin; the X and O points lying on the margin are the support vectors]
Geometric Intuition
SUPPORT VECTORS
[Figure: the same data with the support vectors highlighted; the maximum-margin boundary is determined by these points alone]
Primal Version
$\min_{w,b,\xi}\;\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$
s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$
$\xi_i \ge 0$
DUAL Version
$\max_\alpha\;\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t. $\sum_i \alpha_i y_i = 0$
$0 \le \alpha_i \le C$
Where did this come from?
Remember Lagrange multipliers:
“incorporate” the constraints into the objective,
then solve the problem in the “dual” space of
Lagrange multipliers.
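As a sketch of that derivation for the soft-margin primal (the standard algebra, not spelled out on the slide; $\mu_i \ge 0$ are the multipliers for $\xi_i \ge 0$):

$$\begin{aligned}
L(w,b,\xi,\alpha,\mu) &= \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big[y_i(w \cdot x_i + b) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i \\
\partial L/\partial w = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \\
\partial L/\partial b = 0 &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
\partial L/\partial \xi_i = 0 &\;\Rightarrow\; \alpha_i = C - \mu_i \;\Rightarrow\; 0 \le \alpha_i \le C
\end{aligned}$$

Substituting $w = \sum_i \alpha_i y_i x_i$ back into $L$ eliminates $w$, $b$, and $\xi$ and leaves exactly the dual objective above.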
Primal vs Dual
$\min_{w,b,\xi}\;\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$
s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$

$\max_\alpha\;\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t. $\sum_i \alpha_i y_i = 0$, $0 \le \alpha_i \le C$
Number of parameters?
Primal: one weight per feature. Large # features?
Dual: one $\alpha_i$ per example. Large # examples?
For a large # of features, the DUAL is preferred,
and many $\alpha_i$ can go to zero!
DUAL: the “support vector” version
$\max_\alpha\;\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t. $\sum_i \alpha_i y_i = 0$, $0 \le \alpha_i \le C$
How do we find α? Quadratic programming.
How do we find C? Cross-validation!
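Since the dual is just a small quadratic program, a generic constrained optimizer can solve it. A minimal sketch (my own illustration, not lecture code; the two points and labels come from the worked example later in the deck, and C = 10 is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 1.0], [2.0, 2.0]])  # O(0,1) and X(2,2)
y = np.array([1.0, -1.0])               # O is positive, X is negative
C = 10.0                                # arbitrary; large C ~ hard margin

Q = np.outer(y, y) * (X @ X.T)          # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                        # negate: minimizing = maximizing dual
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, x0=np.zeros(2),
               bounds=[(0.0, C)] * 2,                               # 0 <= a_i <= C
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum a_i y_i = 0
print(res.x)                            # ~ [0.4, 0.4], i.e. alpha_i = 2/5
```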
Wait... how do we predict y for a new point x?
$y = \mathrm{sign}(w \cdot x + b)$
How do we find w?
$w = \sum_i \alpha_i y_i x_i$
How do we find b? From any support vector $x_k$ with $0 < \alpha_k < C$:
$b = y_k - w \cdot x_k$
Equivalently, predict directly from the α's:
$y = \mathrm{sign}\big(\sum_i \alpha_i y_i (x_i \cdot x) + b\big)$
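Put together as code, a hypothetical helper (my sketch; the names are mine, not the lecture's):

```python
import numpy as np

def predict(x_new, X, y, alpha, b):
    """sign( sum_i alpha_i y_i (x_i . x_new) + b )"""
    return np.sign((alpha * y) @ (X @ x_new) + b)
```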
“Support Vectors”?
A worked example with two points: O at (0,1) with $y_1 = +1$ (multiplier $\alpha_1$) and X at (2,2) with $y_2 = -1$ (multiplier $\alpha_2$).

$\max_\alpha\;\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t. $\sum_i \alpha_i y_i = 0$, $0 \le \alpha_i \le C$

Plugging in the inner products:
$\max\;\; \alpha_1 + \alpha_2 - \alpha_1\alpha_2(-1)(0+2) - \tfrac{1}{2}\alpha_1^2(1)(0+1) - \tfrac{1}{2}\alpha_2^2(1)(4+4)$
$= \max\;\; \alpha_1 + \alpha_2 + 2\alpha_1\alpha_2 - \alpha_1^2/2 - 4\alpha_2^2$
s.t. $\alpha_1 - \alpha_2 = 0$, $0 \le \alpha_i \le C$

The equality constraint forces $\alpha_1 = \alpha_2 = \alpha$, leaving
$\max\;\; 2\alpha - \tfrac{5}{2}\alpha^2 = \max\;\; \tfrac{5}{2}\alpha\big(\tfrac{4}{5} - \alpha\big)$
which is maximized at $\alpha_1 = \alpha_2 = 2/5$.
[Figure: the parabola $2\alpha - \tfrac{5}{2}\alpha^2$, with roots at $\alpha = 0$ and $\alpha = 4/5$ and peak at $\alpha = 2/5$]

$w = \sum_i \alpha_i y_i x_i = 0.4\,([0\;\,1] - [2\;\,2]) = 0.4\,[-2\;\,-1] = [-0.8\;\,-0.4]$
From $y = w \cdot x + b$ at the support vector $x_1 = (0,1)$:
$b = y_1 - w \cdot x_1 = 1 - [-0.8\;\,-0.4] \cdot [0\;\,1] = 1 + 0.4 = 1.4$
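A quick numeric check of the worked example (my sketch, not lecture code):

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.4, 0.4])     # alpha_1 = alpha_2 = 2/5

w = (alpha * y) @ X              # -> [-0.8, -0.4]
b = y[0] - w @ X[0]              # -> 1.4
print(w, b)
print(X @ w + b)                 # -> [ 1. -1.]: both points sit exactly
                                 #    on the margin, as support vectors do
```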
“Support Vectors”?
Same setup, now with a third point O (multiplier $\alpha_3$) added to O(0,1) ($\alpha_1$) and X(2,2) ($\alpha_2$).

$\max_\alpha\;\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
s.t. $\sum_i \alpha_i y_i = 0$, $0 \le \alpha_i \le C$

What is $\alpha_3$? Try this at home.
Playing With SVMs
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/
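One easy way to play with it: scikit-learn's SVC wraps libsvm. A minimal sketch on the two-point example from the earlier slides (C = 10 is an arbitrary choice of mine):

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC is built on libsvm

X = np.array([[0.0, 1.0], [2.0, 2.0]])
y = np.array([1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # ~ [[-0.8, -0.4]], [1.4]
print(clf.support_vectors_)        # both training points are support vectors
```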
More on Kernels
• Kernels represent inner products
– K(a,b) = a · b
– K(a,b) = φ(a) · φ(b)
• The kernel trick allows an extremely complex φ(·)
while keeping K(a,b) simple
• Goal: avoid having to directly construct φ(·)
at any point in the algorithm
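A concrete instance of the trick (a standard identity, not from the slides): in 2-D, the quadratic kernel $K(a,b) = (a \cdot b)^2$ equals $\phi(a) \cdot \phi(b)$ for the explicit 3-D map $\phi(x) = (x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2)$, yet evaluating K never builds $\phi$:

```python
import numpy as np

def K(a, b):                   # kernel: one 2-D dot product, then a square
    return (a @ b) ** 2

def phi(x):                    # the explicit feature map K implicitly uses
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(K(a, b), phi(a) @ phi(b))   # both 16.0: same inner product, no phi built
```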
Kernels
The complexity of the optimization problem depends
only on the dimensionality of the input space,
not on the dimensionality of the feature space!
Can we use Kernels to Measure Distances?
• Can we measure the distance between φ(a) and
φ(b) using K(a,b)?
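Yes. Expanding the squared distance (the standard identity that answers the slide's question):

$$\|\phi(a) - \phi(b)\|^2 = \phi(a)\cdot\phi(a) - 2\,\phi(a)\cdot\phi(b) + \phi(b)\cdot\phi(b) = K(a,a) - 2K(a,b) + K(b,b)$$

so distances in feature space need only kernel evaluations.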
Popular Kernel Methods (continued)
• Gaussian Processes
• Kernel Regression (Smoothing)
– Nadaraya-Watson Kernel Regression
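For a flavor of the last item, a minimal Nadaraya-Watson sketch (my illustration, not lecture code): the prediction at x is a kernel-weighted average of the observed y values.

```python
import numpy as np

def nw_predict(x, X, y, h=0.5):
    """yhat(x) = sum_i k_h(x - x_i) y_i / sum_i k_h(x - x_i), Gaussian k_h."""
    w = np.exp(-((X - x) ** 2) / (2 * h ** 2))   # kernel weights
    return (w * y).sum() / w.sum()               # weighted average of the y_i

X = np.linspace(0, np.pi, 20)     # toy 1-D data: y = sin(x)
y = np.sin(X)
print(nw_predict(1.0, X, y))      # close to sin(1.0) ~ 0.84
```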