Support Vector Machines

More information can be found at
http://www.cs.cmu.edu/~awm/tutorials
Linear Classifiers
(Figure: two-class data, one class of points labeled +1 and the other -1; the
classifier maps an input x to an estimated label y_est via f(x,w,b) = sign(w.x - b).)
How would you classify this data?
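For concreteness, the decision rule in code (a minimal sketch; the weight vector w,
offset b, and the input point are made-up illustrative values):

```python
import numpy as np

def f(x, w, b):
    # Linear classifier: predict +1 or -1 from the sign of w.x - b.
    return np.sign(np.dot(w, x) - b)

# Hypothetical example values, just to show the call.
print(f(np.array([1.0, 2.0]), w=np.array([0.5, -0.25]), b=0.1))  # prints 1.0 or -1.0
```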
Linear Classifiers
(Figure: the same data with several different candidate separating lines drawn
through it.)
Any of these would be fine... but which is best?
Classifier Margin
(Figure: a separating line with its margin shown as a band around it.)
Define the margin of a linear classifier as the width that the boundary could be
increased by before hitting a datapoint.
Maximum Margin
(Figure: the separating line with the widest possible margin; the datapoints lying on
the edge of the margin are highlighted.)
The maximum margin linear classifier is the linear classifier with the, um, maximum
margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Support vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in
   its perpendicular direction), this gives us the least chance of causing a
   misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector
   datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as)
   the proposition that this is a good thing.
5. Empirically it works very very well.
Estimate the Margin
(Figure: a point x and its perpendicular distance to the separating line wx + b = 0.)
• What is the distance expression for a point x to the line wx + b = 0?
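A worked answer, using the standard point-to-hyperplane distance formula (this
formula is not in the extracted text):
\[
d(\mathbf{x}) \;=\; \frac{|\mathbf{w}\cdot\mathbf{x} + b|}{\sqrt{\mathbf{w}\cdot\mathbf{w}}}
\;=\; \frac{|\mathbf{w}\cdot\mathbf{x} + b|}{\lVert \mathbf{w} \rVert_2}
\]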
Estimate the Margin
(Figure: the margin shown as the distance from the boundary wx + b = 0 to the nearest
datapoint.)
• What is the expression for the margin?
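A standard answer (not part of the extracted text): the margin is the distance from
the boundary to the closest training point,
\[
\text{margin} \;=\; \min_{i}\; \frac{|\mathbf{w}\cdot\mathbf{x}_i + b|}{\lVert \mathbf{w} \rVert_2}.
\]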
Maximize Margin
(Figure: the same picture; we now want to choose w and b so that this margin is as
large as possible.)
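Written out (a standard statement of the problem, assuming separable data; the exact
formula on the slide is not in the extracted text):
\[
\operatorname*{arg\,max}_{\mathbf{w},\,b}\;\; \min_{i}\; \frac{|\mathbf{w}\cdot\mathbf{x}_i + b|}{\lVert \mathbf{w} \rVert_2}
\quad\text{subject to}\quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 0 \;\;\forall i.
\]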
Maximize Margin
(Figure: same maximum-margin picture.)
• A min-max problem → a game problem.
Maximize Margin
Strategy:
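The strategy shown on the slide is not in the extracted text; a standard version of
it: rescale (w, b) so that the closest points satisfy |w.x_i + b| = 1. The margin is
then 1/||w||, and maximizing it is equivalent to
\[
\min_{\mathbf{w},\,b}\;\; \mathbf{w}\cdot\mathbf{w}
\quad\text{subject to}\quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \;\;\forall i.
\]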
Maximum Margin Linear Classifier
• How to solve it?
Learning via Quadratic Programming
• QP is a well-studied class of optimization problems: maximize (or minimize) a
  quadratic function of some real-valued variables subject to linear constraints.
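As an illustration (not from the original slides), here is a minimal sketch that
solves the hard-margin linear SVM QP on a tiny, hypothetical dataset with a
general-purpose constrained optimizer; a real SVM package would use a dedicated QP
solver instead.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
d = X.shape[1]

def objective(u):
    # u = (w, b); minimize (1/2) w.w, the QP's quadratic criterion.
    w = u[:d]
    return 0.5 * np.dot(w, w)

# One linear inequality per training point: y_i (w.x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq",
     "fun": (lambda u, xi=xi, yi=yi: yi * (np.dot(u[:d], xi) + u[d]) - 1.0)}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)
print("train predictions:", np.sign(X @ w + b))  # should match y
```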
Quadratic Programming
Find
\[
\operatorname*{arg\,max}_{\mathbf{u}} \;\; c + \mathbf{d}^T\mathbf{u} + \frac{\mathbf{u}^T R\,\mathbf{u}}{2}
\qquad \text{(quadratic criterion)}
\]
Subject to n additional linear inequality constraints
\[
\begin{aligned}
a_{11}u_1 + a_{12}u_2 + \dots + a_{1m}u_m &\le b_1 \\
a_{21}u_1 + a_{22}u_2 + \dots + a_{2m}u_m &\le b_2 \\
&\;\;\vdots \\
a_{n1}u_1 + a_{n2}u_2 + \dots + a_{nm}u_m &\le b_n
\end{aligned}
\]
And subject to e additional linear equality constraints
\[
\begin{aligned}
a_{(n+1)1}u_1 + a_{(n+1)2}u_2 + \dots + a_{(n+1)m}u_m &= b_{n+1} \\
a_{(n+2)1}u_1 + a_{(n+2)2}u_2 + \dots + a_{(n+2)m}u_m &= b_{n+2} \\
&\;\;\vdots \\
a_{(n+e)1}u_1 + a_{(n+e)2}u_2 + \dots + a_{(n+e)m}u_m &= b_{n+e}
\end{aligned}
\]
Uh-oh!
(Figure: data that are not linearly separable; the two classes overlap.)
This is going to be a problem!
What should we do?
Uh-oh!
This is going to be a problem! What should we do?
Idea 1:
Find minimum w.w, while minimizing the number of training set errors.
Problemette: two things to minimize makes for an ill-defined optimization.
Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1:
Minimize  w.w + C (#train errors)
          (C is a tradeoff parameter)
There's a serious practical problem that's about to make us reject this approach.
Can you guess what it is?
Uh-oh!
Idea 1.1:
Minimize  w.w + C (#train errors)
          (C is a tradeoff parameter)
The serious practical problem: this can't be expressed as a Quadratic Programming
problem, so solving it may be too slow.
(Also, it doesn't distinguish between disastrous errors and near misses.)
Uh-oh!
This is going to be a problem! What should we do?
Idea 2.0:
Minimize  w.w + C (distance of error points to their correct place)
Support Vector Machine (SVM) for Noisy Data
(Figure: a separating line with three margin-violating points, each labeled with its
error distance ε1, ε2, ε3, alongside the corresponding optimization formulation.)
• Any problem with the above formulation?
Support Vector Machine (SVM) for Noisy Data
(Figure: the same picture, with error distances ε1, ε2, ε3.)
• Balance the trade-off between the margin and the classification errors.
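The formulation on these slides is not in the extracted text; a standard soft-margin
objective of this form is
\[
\min_{\mathbf{w},\,b,\,\varepsilon}\;\; \frac{1}{2}\,\mathbf{w}\cdot\mathbf{w} \;+\; C \sum_{k} \varepsilon_k
\quad\text{subject to}\quad y_k\,(\mathbf{w}\cdot\mathbf{x}_k + b) \ge 1 - \varepsilon_k,\;\; \varepsilon_k \ge 0 \;\;\forall k,
\]
which trades the margin (the w.w term) off against the total error distance (the
ε terms), with C controlling the balance.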
Support Vector Machine for Noisy Data
How do we determine the appropriate value for C?
An Equivalent QP: Determine b
Fix w: then solving for b is a linear programming problem!
Suppose we're in 1-dimension
(Figure: a one-dimensional dataset plotted on the number line around x = 0.)
What would SVMs do with this data?
Suppose we're in 1-dimension
Not a big surprise:
(Figure: the maximum-margin boundary is a point near x = 0, with the positive "plane"
on one side and the negative "plane" on the other.)
Harder 1-dimensional dataset
(Figure: a 1-d dataset around x = 0 that is not linearly separable.)
That's wiped the smirk off SVM's face.
What can be done about this?
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much
nicer? Let's permit them here too:
\[
\mathbf{z}_k = (x_k,\; x_k^2)
\]
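A small sketch of the idea on a hypothetical "harder" 1-d dataset (the data values and
the separating line below are made up for illustration):

```python
import numpy as np

# Hypothetical 1-d data: positives in the middle, negatives on the outside,
# so no single threshold on x can separate the classes.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([-1,   -1,    1,   1,   1,  -1,  -1])

# Map each scalar x_k to the basis-function vector z_k = (x_k, x_k^2).
Z = np.column_stack([x, x ** 2])

# In the (x, x^2) plane the classes become linearly separable, e.g. by the
# horizontal line x^2 = 4: positives lie below it, negatives above it.
w, b = np.array([0.0, -1.0]), 4.0
print(np.sign(Z @ w + b) == y)   # all True for this toy data
```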