Maximum Margin Classifiers and Support Vector Machines

Kernel Methods: Support Vector Machines
Generalized Linear Discriminant Functions
A linear discriminant function g(x) can be written as:
g(x) = w0 + Σi wi xi,   i = 1, …, d   (d is the number of features).
We could add additional terms to obtain a quadratic function:
g(x) = w0 + Σi wi xi + Σi Σj wij xi xj
The quadratic discriminant function introduces d(d-1)/2 additional
coefficients for the products of distinct attributes, plus d for the squared terms.
The decision surfaces are thus more complicated (hyperquadric surfaces).
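As a concrete illustration (a minimal sketch assuming scikit-learn is available, with made-up data), the product terms can be generated explicitly; a linear discriminant trained on the expanded features is then quadratic in the original attributes:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative examples with d = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree=2 adds the bias, the linear terms, the squares, and the cross-products.
quad = PolynomialFeatures(degree=2, include_bias=True)
X_quad = quad.fit_transform(X)

print(quad.get_feature_names_out())   # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_quad)
# A linear g(x) = w · (expanded features) is quadratic in the original x.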
Generalized Linear Discriminant Functions
We could even add more terms wijk xi xj xk and obtain the class
of polynomial discriminant functions. The generalized form is:
g(x) = Σi ai φi(x),   or in vector form   g(x) = aᵗ φ(x)
where the summation goes over all of the functions φi(x).
The φi(x) are called the phi (φ) functions.
The discriminant is now linear in the φi(x).
The φ functions map a d-dimensional x-space into a d′-dimensional
y-space. Example: g(x) = a1 + a2 x + a3 x²,   with   φ(x) = (1, x, x²)ᵗ
(See Figure 5.5 for an illustration of this mapping.)
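Below is a tiny sketch (plain NumPy, with illustrative coefficients) of this one-dimensional example: g(x) = a1 + a2 x + a3 x² becomes a linear function of φ(x) = (1, x, x²)ᵗ:

import numpy as np

a = np.array([1.0, -2.0, 1.0])        # illustrative coefficients a1, a2, a3

def phi(x):
    return np.array([1.0, x, x**2])   # maps 1-D x into the 3-D y-space

def g(x):
    return a @ phi(x)                 # g(x) = aᵗ φ(x), linear in φ(x)

print(g(0.0), g(1.0), g(3.0))         # 1.0 0.0 4.0  (here g(x) = (x - 1)²)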
Support Vector Machines
What are support vector machines (SVMs)?
A very popular classifier based on the linear discriminant
concepts discussed previously, together with the new concept
of margins.
To begin, SVMs preprocess the data by representing all
examples in a higher-dimensional space. With sufficiently
high dimensions, the classes can be separated by a
hyperplane.
The Margin
The Goal in Support Vector Machines
Now, let t be 1 or -1 depending on whether the example x
belongs to the positive or the negative class. A separating
hyperplane ensures that:
t g(x) >= 0
The goal in support vector machines is to find the
separating hyperplane with the “largest” margin.
The margin is the distance between the hyperplane and
the closest example to it.
The Support Vectors
Now the (signed) distance from a pattern x to the hyperplane
is g(x) / ||w||.
So let’s change our objective to finding a vector w
that maximizes the margin m in:
t g(x) / ||w|| >= m
We can also say that the support vectors are those
patterns x for which t g(x) = 1: because we can rescale the
vector w (and w0) without moving the hyperplane, we fix the
scale so that the closest patterns satisfy t g(x) = 1.
The support vectors are therefore all equally close to the
hyperplane, at distance 1 / ||w||.
These are the patterns that are most difficult to classify,
and hence the most “informative” ones.
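The sketch below (plain NumPy, with an assumed hand-picked hyperplane and toy data) computes these distances t g(x)/||w|| and picks out the closest patterns, i.e. the candidate support vectors:

import numpy as np

w, w0 = np.array([1.0, 1.0]), -1.0               # illustrative hyperplane g(x) = w·x + w0
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, 0.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])

g = X @ w + w0                                   # g(x) for every pattern
dist = t * g / np.linalg.norm(w)                 # distance of each (correctly classified) pattern
margin = dist.min()                              # margin = distance of the closest pattern(s)
closest = np.where(np.isclose(dist, margin))[0]
print("margin:", margin, "closest patterns:", closest)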
The Support Vectors
We said we want to find a vector w that maximizes the margin,
which after the rescaling above is 1 / ||w||, subject to:
t g(x) >= 1
This means all we really need to do is to minimize ||w||
under these constraints.
So we have the following optimization problem:
arg min_w   ½ ||w||²   subject to   ti g(xi) >= 1 for every training example i
This can be solved using Lagrange multipliers.
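As an illustration of that optimization problem, here is a minimal numerical sketch (assuming SciPy and a tiny, made-up linearly separable data set) that solves the constrained problem directly with SLSQP rather than through an explicit Lagrangian:

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
t = np.array([1, 1, -1, -1])                     # labels t in {+1, -1}

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)                    # ½ ||w||²

def constraints(params):
    w, w0 = params[:2], params[2]
    return t * (X @ w + w0) - 1.0                # ti g(xi) - 1 >= 0 for every i

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0, "margin =", 1.0 / np.linalg.norm(w))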
The Support Vectors
What happens when there are unavoidable errors?
arg min_w   ½ ||w||² + λ Σi ei   subject to   ti g(xi) >= 1 - ei,   ei >= 0
where ei is the error incurred by example xi.
The ei are known as slack variables.
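A minimal soft-margin sketch using scikit-learn (assumed available; the data is synthetic). SVC’s C parameter plays the role of the slack penalty weight λ above: a small C tolerates more slack, a large C penalizes it heavily:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])   # two overlapping classes
t = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"margin = {1.0 / np.linalg.norm(clf.coef_):.3f}")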
The Support Vectors
We can write this in a dual form (via the Karush-Kuhn-Tucker
construction):
max_α   Σi αi - ½ Σi Σj αi αj ti tj (xi · xj)
subject to
0 <= αi <= λ   and   Σi αi ti = 0
The Support Vectors
The final result is a set of multipliers αi, one for each training
example. The optimal hyperplane can be expressed
in the dual representation as:
f(x) = Σi ti αi (xi · x) + b
where w = Σi ti αi xi
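A quick check of this dual representation with scikit-learn (assumed available; toy data). For a linear kernel, dual_coef_ stores ti αi for each support vector, so the expansion w = Σi ti αi xi should reproduce the primal weight vector coef_:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
t = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, t)

w_dual = clf.dual_coef_ @ clf.support_vectors_   # Σi ti αi xi over the support vectors
print("w from dual expansion:", w_dual)
print("w from primal (coef_):", clf.coef_)       # the two should match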
The Support Vectors
We can use kernel functions to map from the
original space to a new space:
max_α   Σi αi - ½ Σi Σj αi αj ti tj (φ(xi) · φ(xj))
subject to
0 <= αi <= λ   and   Σi αi ti = 0
The Support Vectors
Computing the dot product is simplified.
Polynomial kernels (degree 2):
φ(xi) · φ(xj) = 1 + 2 Σk xik xjk + Σk xik² xjk² + …
But fortunately that is equal to:
(1 + xi · xj)² = K(xi, xj)
In general, all we need is to compute the dot products between
all pairs of examples in the original space. This results in the
Gram matrix K.
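The identity above can be checked numerically; the sketch below (plain NumPy, random toy vectors) compares (1 + xi · xj)² with the dot product of the explicitly expanded features φ(x) = (1, √2·x1, √2·x2, √2·x1x2, x1², x2²), and builds the Gram matrix for a small data set:

import numpy as np

def phi(x):                                      # explicit degree-2 feature map (d = 2)
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1**2, x2**2])

def K(x, z):                                     # degree-2 polynomial kernel
    return (1.0 + np.dot(x, z)) ** 2

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.dot(phi(x), phi(z)), K(x, z))           # the two values agree

X = rng.normal(size=(4, 2))
gram = (1.0 + X @ X.T) ** 2                      # Gram matrix K over the whole data set
print(gram)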
The Support Vectors
The final formulation is as follows:
max_α   Σi αi - ½ Σi Σj αi αj ti tj K(xi, xj)
subject to
0 <= αi <= λ   and   Σi αi ti = 0
Historical Background
Vladimir Vapnik:
Publications: 6 books and over a hundred
research papers.
Developed a theory for expected risk
minimization.
Invented Support Vector Machines
Historical Background
Alexey Chervonenkis
With Vladimir Vapnik developed the
concept of the Vapnik-Chervonenkis
dimension.
An Example
The XOR problem is known to be not linearly separable:
[Plot: the four XOR patterns at (±1, ±1) in the (x1, x2) plane.]
We use the phi functions φ(x) = (1, √2·x1, √2·x2, √2·x1x2, x1², x2²), with √2 ≈ 1.41
(hidden inside the kernel function).
An Example
The optimal hyperplane is found to be g(x1,x2) = x1x2 = 0.
The margin is √2 ≈ 1.41.
[Plot in the transformed (√2·x1, √2·x1x2) coordinates: the decision boundary g = 0 lies along the √2·x1 axis, with margin b = √2 ≈ 1.41 on either side.]
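A sketch of this XOR example with scikit-learn (assumed available): a degree-2 polynomial kernel K(xi, xj) = (1 + xi · xj)² separates the four points, and all of them end up as support vectors:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
t = np.array([1, 1, -1, -1])                     # class given by the sign of x1·x2

clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=10.0).fit(X, t)
print("support vectors:", clf.support_vectors_)  # all four points
print("predictions:", clf.predict(X))            # matches t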
Benefits of SVMs
Benefits:
• The complexity of the classifier depends on the number
of support vectors rather than on the dimensionality of the
feature space.
• This makes the algorithm less prone to overfitting.