7. Support Vector Machines (SVMs)
Basic Idea:
1. Transform the data with a non-linear mapping f so
that it is linearly separable. Cf. Cover's theorem:
non-linearly separable data can be transformed into a
new feature space in which it is linearly separable,
provided 1) the mapping is non-linear and 2) the
dimensionality of the feature space is high enough
2. Construct the ‘optimal’ hyperplane (linear
weighted sum of outputs of first layer) which
maximises the degree of separation (the margin of
separation: denoted by r) between the 2 classes
MLPs and RBFNs stop training when all points are classified
correctly. Thus their decision surfaces are not optimised, in the
sense that the generalisation error is not minimised
[Figure: decision boundaries produced by an MLP, an RBFN and an SVM for the same data; only the SVM maximises the margin of separation r]
[Figure: SVM as a two-layer network: m-D input vector x = (x1, .., xm), a first layer of m1 non-linear functions f1(x), .., fm1(x), weights w1, .., wm1 and bias b = w0]
y = Output = S wi fi(x) + b = wT f(x)
First layer: mapping performed from the input space into a feature
space of higher dimension where the data is now linearly separable
using a set of m1 non-linear functions (cf RBFNs)
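As a quick illustration of this first-layer mapping (not from the original slides; a minimal numpy sketch with a made-up feature): the XOR points are not linearly separable in 2-D, but adding one hypothetical non-linear feature x1*x2 makes them separable by a plane.

# Sketch: XOR becomes linearly separable after a simple non-linear mapping.
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, 1, 1, -1])

# Hypothetical non-linear first layer: map (x1, x2) -> (x1, x2, x1*x2).
# In the extra dimension x1*x2 the two classes sit on opposite sides of 0.
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# A linear separator in the feature space: w = (0, 0, -1), b = 0
w, b = np.array([0.0, 0.0, -1.0]), 0.0
print(np.sign(Phi @ w + b))   # [-1.  1.  1. -1.]  -> matches the labels d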
1. After learning both RBFN and MLP decision surfaces might
not be at the optimal position. For example, as shown in the
figure, both learning rules will not perform further iterations
(learning) since the error criterion is satisfied (cf perceptron)
2. In contrast the SVM algorithm generates the optimal decision
boundary (the dotted line) by maximising the distance between
the classes r which is specified by the distance between the
decision boundary and the nearest data points
3. Points which lie exactly r/2 away from the decision boundary
are known as Support Vectors
4. Intuition is that these are the most important points since
moving them moves the decision boundary
[Figure: moving a support vector moves the decision boundary; moving the other vectors has no effect]
The algorithm to generate the weights proceeds in such a way that
only the support vectors determine the weights and thus the boundary.
However, we shall see that the output of the SVM can also be
interpreted as a weighted sum of the inner (dot) products of the
images of the input x and the support vectors xi in the feature space,
which are computed by an inner-product kernel function K(x, xi)
[Figure: SVM as a kernel network: m-D input vector x, hidden units K1(x), .., KN(x), output weights a1 d1, .., aN dN and bias b]
y = Output = S ai di K(x, xi) + b
where: Ki(x) = K(x, xi) = fT(x) f(xi)
f(x) = [f1(x), f2(x), .., fm1(x)]T, i.e. the image of x in feature space
and di = +/- 1 depending on the class of xi
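A minimal sketch of this output computation, assuming the multipliers ai, labels di, support vectors and bias are already known (the names and toy numbers below are illustrative, not from a real fit):

# SVM output as a weighted sum of kernel evaluations against the support vectors.
import numpy as np

def svm_output(x, support_vectors, alphas, labels, b, kernel):
    """y(x) = sum_i a_i d_i K(x, x_i) + b  -- only support vectors contribute."""
    return sum(a * d * kernel(x, xi)
               for a, d, xi in zip(alphas, labels, support_vectors)) + b

# Example with a simple inner-product (linear) kernel:
linear = lambda u, v: float(np.dot(u, v))
sv   = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
a    = [0.5, 0.5]          # illustrative multipliers, not from a real fit
dlab = [+1, -1]
print(svm_output(np.array([2.0, 0.0]), sv, a, dlab, b=0.0, kernel=linear))  # 2.0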
Why should inner product kernels be involved in pattern
recognition?
-- Intuition is that they provide some measure of similarity
-- cf Inner product in 2D between 2 vectors of unit length
returns the cosine of the angle between them.
e.g. x = [1, 0]T , y = [0, 1]T
If two vectors are parallel the inner product is 1: xTx = x.x = 1
If they are perpendicular the inner product is 0: xTy = x.y = 0
Differs to MLP (etc) approaches in a
fundamental way
• In MLPs complexity is controlled by keeping number of hidden
nodes small
• Here complexity is controlled independently of dimensionality
• The mapping means that the decision surface is constructed in a
very high (often infinite) dimensional space
• However, the curse of dimensionality (which makes finding the
optimal weights difficult) is avoided by using the notion of an
inner product kernel (see: the kernel trick, later) and optimising
the weights in the input space
SVMs are a superclass of network containing both MLPs
and RBFNs (and both can be generated using the SV
algorithm)
Strengths:
As noted on the previous slide: complexity/capacity is independent of
the dimensionality of the data, thus avoiding the curse of dimensionality
Statistically motivated => Can get bounds on the error, can use the
theory of VC dimension and structural risk minimisation (theory
which characterises generalisation abilities of learning machines)
Finding the weights is a quadratic programming problem guaranteed
to find the (global) minimum of the error surface. Thus the algorithm is
efficient, SVMs generate near-optimal classification, and they are
insensitive to overtraining
Obtain good generalisation performance due to high dimension of
feature space
Most important (?): by using a suitable kernel, SVM
automatically computes all network parameters for that kernel. Eg
RBF SVM: automatically selects the number and position of
hidden nodes (and weights and bias)
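As a hedged illustration of this point, scikit-learn's SVC (assumed available) with an RBF kernel can be used to show that the number and position of the 'hidden nodes' (the support vectors) come out of the SV algorithm itself; the data set, gamma and C below are made up:

# RBF-kernel SVM: the algorithm itself picks the 'hidden nodes' (support vectors).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular boundary

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print("number of 'hidden nodes' chosen:", clf.support_vectors_.shape[0])
print("their positions (first 3):\n", clf.support_vectors_[:3])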
Weaknesses:
Scale (metric) dependent
Slow training (compared to RBFNs/MLPs) due to
computationally intensive solution to QP problem especially for
large amounts of training data => need special algorithms
Generates complex solutions (normally > 60% of training points
are used as support vectors) especially for large amounts of
training data. E.g. from Haykin: increase in performance of 1.5%
over MLP. However, MLP used 2 hidden nodes, SVM used 285
Difficult to incorporate prior knowledge
The SVM was proposed by Vapnik and colleagues in the 70's but only
became popular in the early 90's. It (and other kernel techniques) is
currently a very active (and trendy) topic of research
See for example:
http://www.kernel-machines.org
or (book):
An Introduction to Support Vector Machines (and other
kernel-based learning methods). N. Cristianini and J. Shawe-Taylor,
Cambridge University Press, 2000. ISBN: 0 521 78019 5
for recent developments
First consider a linearly separable problem where the decision
boundary is given by
g(x) = wTx + b = 0
and a set of training data X = {(xi, di): i = 1, .., N} where di = +1
if xi is in class 1 and -1 if it is in class 2. Let the optimal
weight-bias combination be w0 and b0.
[Figure: x decomposed into its projection xp onto the optimal hyperplane and a component xn of length r along the normal w0]
Now: x = xp + xn = xp + r w0 / ||w0|| where r = ||xn||
Since g(xp) = 0:
g(x) = w0T(xp + r w0 / ||w0||) + b0 = r w0Tw0 / ||w0|| = r ||w0||
or:
r = g(x) / ||w0||
Thus, as g(x) gives us the algebraic distance to the hyperplane, we
want: g(xi) = w0Txi + b0 >= 1 for di = 1
and
g(xi) = w0Txi + b0 <= -1 for di = -1
(remembering that w0 and b0 can be rescaled without changing the
boundary) with equality for the support vectors xs. Thus, considering
points on the boundary and that:
r = g(x)/ ||w0||
we have:
r = 1/ ||w0|| for dS = 1 and r = -1/ ||w0|| for dS = -1
and so the margin of separation is:
r = 2 / ||w0||
• Thus, the solution w0 maximises the margin of separation
• Maximising this margin is equivalent to minimising ||w||
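A small numerical sketch of these two formulas, assuming a toy hyperplane in canonical form (|g(x)| = 1 on the nearest points); the numbers are illustrative:

# Signed distance r = g(x)/||w|| and margin of separation 2/||w||.
import numpy as np

w = np.array([3.0, 4.0])          # ||w|| = 5
b = -5.0                           # hyperplane: 3*x1 + 4*x2 - 5 = 0

x = np.array([3.0, 1.0])
r_x = (w @ x + b) / np.linalg.norm(w)      # signed distance of x from the plane: 1.6

margin = 2.0 / np.linalg.norm(w)           # margin of separation: 0.4
print(r_x, margin)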
We now need a computationally efficient algorithm to find w0 and b0
using the training data (xi, di). That is we want to minimise:
F(w) = 1/2 wTw
subject to:
di(wTxi + b) >= 1 for i= 1, .. N
which is known as the primal problem. Note that the cost function F
is convex in w (=> a unique solution) and that the constraints are
linear in w.
Thus we can solve for w using the technique of Lagrange multipliers
(non-maths: technique for solving constrained optimisation
problems). For a geometrical interpretation of Lagrange multipliers
see Bishop, 95, Appendix C.
First we construct the Lagrangian function:
L(w , a) = 1/2 wTw - Si ai [di(wTxi + b) - 1]
where ai are the Lagrange multipliers. L must be minimised with
respect to w and b and maximised with respect to ai (it can be shown
that such problems have a saddle-point at the optimal solution). Note
that the Karush-Kuhn-Tucker (or, intuitively, the
maximisation/constraint) conditions mean that at the optimum:
ai [di(wTxi + b) - 1] = 0
This means that unless a data point is a support vector, ai = 0, and
such points are not involved in the optimisation.
We then set the partial derivatives of L with respect to w and b to zero
to obtain the conditions of optimality:
w = Si ai di xi
and
Si ai di = 0
Given such a constrained convex problem, we can use the optimality
conditions to reformulate the primal problem as the equivalent dual
problem:
Given the training data sample {(xi,di), i=1, …,N}, find the
Lagrangian multipliers ai which maximise:
Q(a) = S ai – ½ S S ai aj di dj xiT xj
subject to the constraints:
S ai di = 0
and
ai >= 0
Notice that the input vectors are only involved as an inner
product
Once the optimal Lagrangian multipliers a0, i have been found we
can use them to find the optimal w:
w0 = S a0,i di xi
and the optimal bias from the fact that for a positive support vector (di = +1):
w0Txi + b0 = 1
=>
b0 = 1 - w0Txi
[However, from a numerical perspective it is better to take the mean
value of b0 resulting from all such data points in the sample]
Since a0,i = 0 if xi is not a support vector, ONLY the support
vectors determine the optimal hyperplane which was our
intuition
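A minimal sketch of solving this dual problem numerically, using a general-purpose constrained optimiser (scipy's SLSQP) rather than a dedicated SVM solver; the four data points are made up for illustration:

# Hard-margin dual: maximise Q(a) subject to sum(a_i d_i) = 0, a_i >= 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

H = (d[:, None] * d[None, :]) * (X @ X.T)          # H_ij = d_i d_j x_i.x_j
neg_Q = lambda a: 0.5 * a @ H @ a - a.sum()        # minimise -Q(a)

res = minimize(neg_Q, x0=np.zeros(len(d)), method="SLSQP",
               bounds=[(0, None)] * len(d),
               constraints=[{"type": "eq", "fun": lambda a: a @ d}])
a0 = res.x
print("alphas:", np.round(a0, 3))                  # (near-)zero except for the SVs

w0 = (a0 * d) @ X                                  # w0 = sum_i a_i d_i x_i
sv = np.argmax(a0)                                 # any support vector
b0 = d[sv] - w0 @ X[sv]                            # from d_s (w0.x_s + b0) = 1
print("w0:", np.round(w0, 3), " b0:", round(b0, 3))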
For a non-linearly separable problem we have to first map the data into
a feature space in which they are linearly separable:
[Figure: each input xi mapped to its image f(xi) in feature space]
The procedure for determining w is then the same except that xi is
replaced by f(xi), that is:
Given the training data sample {(xi,di), i=1, …,N}, find the
optimum values of the weight vector w and bias b
w = S a0,i di f(xi)
where a0,i are the optimal Lagrange multipliers determined by
maximising the following objective function
Q(a) = S ai – ½ S S ai aj di dj f(xi)T f(xj)
subject to the constraints:
S ai di = 0 ;
ai >= 0
Example XOR problem revisited:
Let the nonlinear mapping be :
f(x) = (1,x12, 21/2 x1x2, x22, 21/2 x1 , 21/2 x2)T
And: f(xi)=(1,xi12, 21/2 xi1xi2, xi22, 21/2 xi1 , 21/2 xi2)T
Therefore the feature space is in 6D with input data in 2D
x1 = (-1,-1), d1 = -1
x2 = (-1,1), d2 = 1
x3 = (1,-1), d3 = 1
x4 = (1,1), d4 = -1
Q(a) = S ai – ½ S S ai aj di dj f(xi)T f(xj)
= a1 + a2 + a3 + a4 - 1/2 (1+1+2+1+2+2) a1a1 + 1/2 (1+1-2+1+2-2) a1a2 + …
= a1 + a2 + a3 + a4 – ½ (9a1a1 - 2a1a2 - 2a1a3 + 2a1a4 + 9a2a2 + 2a2a3 - 2a2a4 + 9a3a3 - 2a3a4 + 9a4a4)
To maximise Q, we need only set partial Q / partial ai = 0 (from the
optimality conditions), which gives
1 = 9 a 1 - a2 - a3 + a4
1 = -a1 + 9 a2 + a3 - a4
1 = -a1 + a2 + 9 a3 - a4
1 = a 1 - a2 - a3 + 9 a4
The solution of which gives the optimal values:
a0,1 =a0,2 =a0,3 =a0,4 =1/8
w0 = S a0,i di f(xi)
= 1/8 [ -f(x1) + f(x2) + f(x3) - f(x4) ]
= [0, 0, -1/√2, 0, 0, 0]T
where the first element of w0 gives the bias b (here b = 0)
From earlier we have that the optimal hyperplane is defined by:
w0T f(x) = 0
That is:
w0T f(x) = [0, 0, -1/√2, 0, 0, 0] [1, x12, √2 x1x2, x22, √2 x1, √2 x2]T
= -x1x2 = 0, i.e. x1x2 = 0
which is the optimal decision boundary for the XOR problem.
Furthermore we note that the solution is unique since the optimal
decision boundary is unique
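A short numerical check of this worked example, assuming numpy and using the feature map f given above (nothing here goes beyond the quantities already derived):

# Verify the XOR solution: alphas = 1/8 and boundary x1*x2 = 0.
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1.0, 1.0, 1.0, -1.0])

phi = lambda x: np.array([1, x[0]**2, np.sqrt(2)*x[0]*x[1],
                          x[1]**2, np.sqrt(2)*x[0], np.sqrt(2)*x[1]])
Phi = np.array([phi(x) for x in X])

K = Phi @ Phi.T                        # Gram matrix: 9 on the diagonal, 1 elsewhere
H = (d[:, None] * d[None, :]) * K      # quadratic coefficients of Q

a = np.linalg.solve(H, np.ones(4))     # the four equations dQ/da_i = 0
print(np.round(a, 3))                  # [0.125 0.125 0.125 0.125]

w0 = Phi.T @ (a * d)                   # w0 = sum_i a0,i d_i f(x_i)
print(np.round(w0, 3))                 # [0. 0. -0.707 0. 0. 0.] -> w0.f(x) = -x1*x2, boundary x1*x2 = 0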
[Figure: SVM outputs for polynomial and RBF kernels]
SVM building procedure:
1. Pick a nonlinear mapping f
2. Solve for the optimal weight vector
However: how do we pick the function f?
• In practical applications, if it is not totally impossible to
find f, it is very hard
• In the previous example, the function f is quite
complex: How would we find it?
Answer: the Kernel Trick
Notice that in the dual problem the images of the input vectors are only
involved as an inner product, meaning that the optimisation can be
performed in the (lower dimensional) input space and that the inner
product can be replaced by an inner-product kernel
Q(a) = S ai – ½ S S ai aj di dj f(xi) T f(xj)
= S ai – ½ S S ai aj di dj K(xi, xj)
How do we relate the output of the SVM to the kernel K?
Look at the equation of the boundary in the feature space and use the
optimality conditions derived from the Lagrangian formulations
The hyperplane is defined by:
Sj wj fj(x) + b = 0 (sum over j = 1, .., m1)
or, defining f0(x) = 1 so that the bias is absorbed as w0 f0(x):
Sj wj fj(x) = 0 (sum over j = 0, .., m1)
Writing f(x) = [f0(x), f1(x), .., fm1(x)]T we get:
wT f(x) = 0
From the optimality conditions: w = Si ai di f(xi)
Thus: Si ai di fT(xi) f(x) = 0
and so the boundary is: Si ai di K(x, xi) = 0
and Output = wT f(x) = Si ai di K(x, xi)
where: K(x, xi) = Sj fj(x) fj(xi) (sum over j = 0, .., m1)
In the XOR problem, we chose to use the kernel function:
K(x, xi) = (xT xi + 1)2
= 1 + x12 xi12 + 2 x1x2 xi1xi2 + x22 xi22 + 2x1xi1 + 2x2xi2
Which implied the form of our nonlinear functions:
f(x) = (1,x12, 21/2 x1x2, x22, 21/2 x1 , 21/2 x2)T
And: f(xi)=(1,xi12, 21/2 xi1xi2, xi22, 21/2 xi1 , 21/2 xi2)T
However, we did not need to calculate f at all and could simply have
used the kernel to calculate:
Q(a) = S ai – ½ S S ai aj di dj K(xi, xj)
Maximised and solved for ai and derived the hyperplane via:
Si ai di K(x, xi) = 0 (sum over i = 1, .., N)
We therefore only need a suitable choice of kernel function cf:
Mercer’s Theorem:
Let K(x,y) be a continuous symmetric kernel defined on the
closed interval [a,b]. The kernel K can be expanded in the form
K(x,y) = fT(x) f(y)
provided it is positive definite. Some of the usual choices for K are:
Polynomial SVM: K(x, xi) = (xT xi + 1)p, where p is specified by the user
RBF SVM: K(x, xi) = exp(-1/(2s2) || x – xi||2), where s is specified by the user
MLP SVM: K(x, xi) = tanh(s0 xT xi + s1), where Mercer's theorem is not satisfied for all s0 and s1
How do we recover f from a given K? It is not essential that we do…
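A sketch of the three kernels in the list above, plus a numerical check (for the p = 2 polynomial kernel) that K really is an inner product in the feature space used for the XOR example; the parameter values and test points are illustrative:

# The three standard kernels, and a check that the p=2 polynomial kernel
# equals the explicit feature-space inner product.
import numpy as np

def poly_kernel(x, xi, p=2):
    return (x @ xi + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def mlp_kernel(x, xi, s0=1.0, s1=-1.0):          # Mercer not satisfied for all s0, s1
    return np.tanh(s0 * (x @ xi) + s1)

def phi(x):                                      # explicit map for the p = 2 polynomial kernel
    return np.array([1, x[0]**2, np.sqrt(2)*x[0]*x[1],
                     x[1]**2, np.sqrt(2)*x[0], np.sqrt(2)*x[1]])

x, y = np.array([0.5, -2.0]), np.array([1.5, 3.0])
print(poly_kernel(x, y), phi(x) @ phi(y))        # same value (up to floating point)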
Further development
1. In practical applications, it is found that the support vector
machine can outperform other learning machines
2. How to choose the kernel?
3. How much better is the SVM compared with traditional
learning machines?
Feng J., and Williams P. M. (2001) The generalization error of the
symmetric and scaled support vector machine IEEE T. Neural
Networks Vol. 12, No. 5. 1255-1260
What about regularisation?
Important that we don’t allow noise to spoil our generalisation: we
want a soft margin of separation
Introduce slack variables ei >= 0 such that:
di(wTxi + b) >= 1 – ei for i = 1, .., N
rather than:
di(wTxi + b) >= 1
[Figure: three cases relative to the margin: ei = 0 (on the margin), 0 < ei <= 1 (inside the margin but correctly classified), ei > 1 (misclassified)]
But all 3 cases are support vectors since
di(wTxi + b) = 1 – ei
Thus the slack variables measure our deviation from the ideal
pattern separability and also allow us some freedom in
specifying the hyperplane
Therefore we formulate a new problem: minimise
F(w, e) = 1/2 wTw + C S ei
subject to:
di(wTxi + b) >= 1 – ei for i = 1, .., N
and:
ei >= 0
where C acts as an (inverse) regularisation parameter which can
be determined experimentally or analytically.
The solution proceeds in the same way as before (Lagrangian,
formulate dual and maximise) to obtain optimal ai for:
Q(a)= S ai – ½ S S ai aj di dj K(xi, xj)
subject to the constraints
S ai di =0 ;
0<= ai <= C
Thus, the nonseparable problem differs from the separable one
only in that the second constraint is more stringent. Again the
optimal solution is:
w0 = S a0,i di f(xi)
However, this time the KKT conditions imply that:
ei = 0 if ai < C
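As a hedged illustration of the role of C, the sketch below uses scikit-learn's soft-margin SVC (assumed available) on a made-up overlapping data set: small C tolerates more margin violations (more multipliers at the bound ai = C), large C approaches the hard margin:

# Soft-margin SVM: how C affects the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])   # overlapping classes
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_at_bound = np.sum(np.abs(clf.dual_coef_) >= C - 1e-8)   # a_i = C (slack active)
    print(f"C={C:6}: {clf.n_support_.sum()} SVs, {n_at_bound} at the bound a_i = C")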
SVMs for non-linear regression
SVMs can also be used for non-linear regression. However, unlike
MLPs and RBFs the formulation does not follow directly from the
classification case
Starting point: we have input data
X = {(x1,d1), …., (xN,dN)}
Where xi is in D dimensions and di is a scalar. We want to find a
robust function f(x) that has at most e deviation from the targets d,
while at the same time being as flat (in the regularisation sense of a
smooth boundary) as possible.
Thus setting:
f(x) = wTf(x) + b
The problem becomes, minimise:
½ wTw
(for flatness)
[think of gradient between (0,0) and (1,1) if weights are (1,1) vs
(1000, 1000)]
Subject to:
di - (wTf(xi) + b) <= e
(wTf(xi) + b) - di <= e
[Figure: the e-insensitive loss function L(f, y): zero inside a band of width e around zero error, linear outside]
This formalisation is called e -insensitive regression as it is
equivalent to minimising the empirical risk (amount you might be
wrong) using an e -insensitive loss function:
L(f, d, x) = | f(x) – d | - e for | f(x) – d | >= e
= 0 otherwise
Comparing e -insensitive loss function to least squares loss
function (used for MLP/RBFN)
• More robust (robust to small changes in data/ model)
• Less sensitive to outliers
• Non-continuous derivative
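A minimal numpy sketch comparing the two loss functions on a few residuals (the values are illustrative):

# e-insensitive loss versus squared loss.
import numpy as np

def eps_insensitive(residual, eps=0.1):
    """max(|f(x) - d| - eps, 0): zero inside the tube, linear outside."""
    return np.maximum(np.abs(residual) - eps, 0.0)

def squared(residual):
    return residual ** 2

r = np.array([-0.5, -0.05, 0.0, 0.05, 0.5, 3.0])
print(eps_insensitive(r))   # [0.4  0.  0.  0.  0.4  2.9]      -- grows only linearly
print(squared(r))           # [0.25 0.0025 0. 0.0025 0.25 9.]  -- the outlier at 3.0 dominates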
Cost function is:
C Si L(f, di, xi)
Where C can be viewed as a regularisation parameter
[Figure: regression of the original function (O) for e = 0.1, 0.2 and 0.5: the function selected is the flattest]
We now introduce 2 slack variables, ei and ei* as in the case of
nonlinearly separable data and write:
di - (wTf(xi) + b) <= e + ei
(wTf(xi) + b) - di <= e + ei*
where:
ei , ei* >= 0
Thus:
C S L(f, di, xi) = C S (ei + ei*)
and the problem becomes to minimise:
F(w, e) = 1/2 wTw + C S (ei + ei*)
subject to:
di - (wTf(xi) + b) <= e + ei
(wTf(xi) + b) - di <= e + ei*
and:
ei , ei* >= 0
We now form the Lagrangian, and find the dual. Note that this time,
there will be 2 sets of Lagrangian multipliers as there are 2
constraints. The dual to be maximised is:
Q(a, a*) = S di (ai – ai*) – e S (ai + ai*) – ½ S S (ai – ai*)(aj – aj*) K(xi, xj)
subject to:
S (ai – ai*) = 0
and
0 <= ai <= C , 0 <= ai* <= C
Where e and C are free parameters that control the approximating
function:
f(x) = wTf(x)
= Si (ai – ai*) K (x, xi)
From the KKT conditions we now have:
ai (e + ei - di + wTf(xi) + b) = 0
ai* (e + ei* + di - wTf(xi) - b) = 0
This means that the Lagrange multipliers will only be non-zero
for points where:
| f(xi) – di | >= e
That is, only for points outside the e tube.
[Figure: the e tube around the regression function; points on or outside the tube are the support vectors]
Thus these points are the support vectors and we have a sparse
expansion of w in terms of the data points x.
[Figure: fits, data points and support vectors (SVs) for e = 0.5, 0.2, 0.1 and 0.02: e controls the number of SVs selected]
Only non-zero a's can contribute: the Lagrange multipliers act like
forces on the regression. However, they can only be applied at points
outside or touching the e tube.
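As a hedged illustration of this, scikit-learn's SVR (assumed available) can be used on a made-up noisy curve to show how e controls the number of support vectors; C and the kernel parameters below are arbitrary:

# e-insensitive regression: larger epsilon -> wider tube -> fewer SVs.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(-3, 3, size=100))[:, None]
d = np.sinc(x).ravel() + 0.05 * rng.normal(size=100)   # noisy target

for eps in (0.5, 0.2, 0.1, 0.02):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(x, d)
    print(f"epsilon={eps:4}: {len(model.support_)} of 100 points are SVs")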
One note of warning:
Regression is much harder than classification for 2 reasons
1. Regression is intrinsically more difficult than classification
2. e and C must be tuned simultaneously
Research issues:
• Incorporation of prior knowledge, e.g.:
1. train a machine,
2. add in virtual support vectors, which incorporate known invariances, of the SVs found in 1,
3. retrain
• Speeding up training time?
Various techniques, mainly to deal with reducing the
size of the data set. Chunking: use subsets of the data
at a time and only keep the SVs. Also, more sophisticated
versions which use linear combinations of the training
points as inputs
• Optimisation packages/techniques?
Off-the-shelf ones are not brilliant (including the
MATLAB one). Sequential Minimal Optimisation (SMO) is
widely used. For details of that and others see:
A. J. Smola and B. Schölkopf. A Tutorial on Support
Vector Regression. NeuroCOLT Technical Report NC-TR98-030,
Royal Holloway College, University of London, UK, 1998.
• Selection of e and C