Discriminative learning methods for analysing genomes

Support vector machines
Usman Roshan
Separating hyperplanes
• For two sets of points there are many hyperplane separators
• Which one should we choose for classification?
• In other words, which one is most likely to produce the least error?
Theoretical foundation
• Margin error theorem (7.3 from Learning with kernels)
Separating hyperplanes
• Best hyperplane is the one that maximizes the minimum distance of all training points to the plane (Learning with kernels, Scholkopf and Smola, 2002)
• Its expected error is at most the fraction of misclassified points plus a complexity term (Learning with kernels, Scholkopf and Smola, 2002)
Margin of a plane
• We define the margin as the minimum distance to training points (distance to closest point)
• The optimally separating plane is the one with the maximum margin
Optimally separating hyperplane
[Figure: the optimally separating hyperplane between two classes of points, with its normal vector w]
Optimally separating hyperplane
• How do we find the optimally separating hyperplane?
• Recall the distance of a point to the plane defined earlier
Hyperplane separators
[Figure: a point x, its projection x_p onto the separating plane, the distance r between them, and the normal vector w]
Distance of a point to the separating plane
• And so the distance to the plane r is given by

r = (w^T x + w_0) / ||w||

or

r = y (w^T x + w_0) / ||w||

where y is -1 if the point is on the left side of the plane and +1 otherwise.
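As a quick numerical illustration of this formula (a minimal sketch with made-up values for w, w0, x, and y):

```python
import numpy as np

# Made-up plane parameters and a sample point, purely for illustration.
w = np.array([2.0, -1.0])   # normal vector of the plane
w0 = 0.5                    # offset term
x = np.array([1.0, 3.0])    # a point
y = -1                      # its label (this point lies on the negative side)

# Signed distance: negative when x falls on the negative side of the plane.
r_signed = (w @ x + w0) / np.linalg.norm(w)

# Multiplying by the label y makes r positive when the point is correctly classified.
r = y * (w @ x + w0) / np.linalg.norm(w)
print(r_signed, r)   # approximately -0.224 and 0.224
```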
Support vector machine: optimally separating hyperplane
Distance of point x (with label y) to the hyperplane is given by

y (w^T x + w_0) / ||w||

We want this to be at least some value r:

y (w^T x + w_0) / ||w|| ≥ r

By scaling w we can obtain infinitely many solutions. Therefore we require that

r ||w|| = 1

So we minimize ||w|| to maximize the distance, which gives us the SVM optimization problem.
Support vector machine: optimally separating hyperplane
SVM optimization criterion (primal form):

min_w (1/2) ||w||^2   subject to   y_i (w^T x_i + w_0) ≥ 1, for all i

We can solve this with Lagrange multipliers. That tells us that

w = Σ_i α_i y_i x_i

The x_i for which α_i is non-zero are called support vectors.
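A hedged sketch of this relationship using scikit-learn (assumed installed; the toy data is made up): with a linear kernel, SVC exposes the support vectors and the products α_i·y_i, so w can be reconstructed as the sum above and compared against the fitted coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy problem (made-up data).
X = np.array([[1.0, 1.0], [2.0, 2.5], [2.5, 1.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only.
alpha_times_y = clf.dual_coef_.ravel()
support_vectors = clf.support_vectors_

# w = sum_i alpha_i y_i x_i, where the sum runs over the support vectors.
w = alpha_times_y @ support_vectors

print("reconstructed w:", w)
print("coef_ from sklearn:", clf.coef_.ravel())   # should match the line above
```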
SVM dual problem
Let L be the Lagrangian:

L = (1/2) ||w||^2 − Σ_i α_i ( y_i (w^T x_i + w_0) − 1 )

Setting ∂L/∂w = 0 and ∂L/∂w_0 = 0 gives us the dual form

arg max_α  L_d = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j

subject to

Σ_i α_i y_i = 0   and   0 ≤ α_i ≤ C
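One way to solve this dual numerically is to hand it to an off-the-shelf quadratic programming solver. Below is a sketch using the cvxopt package (assumed installed); the function name svm_dual and its interface are illustrative, not from the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0):
    """Solve the SVM dual as a QP: min (1/2) a'Qa - 1'a  s.t.  y'a = 0, 0 <= a <= C."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    Yx = y[:, None] * X                              # rows of X scaled by their labels
    Q = Yx @ Yx.T                                    # Q_ij = y_i y_j x_i^T x_j
    P = matrix(Q)
    q = matrix(-np.ones(n))                          # maximizing sum(alpha) = minimizing -1'alpha
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1))                     # sum_i alpha_i y_i = 0
    b = matrix(0.0)

    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    alpha = np.array(sol["x"]).ravel()
    w = (alpha * y) @ X                              # recover w = sum_i alpha_i y_i x_i
    return alpha, w
```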
Support vector machine: optimally separating hyperplane
[Figure: the maximum-margin hyperplane with normal vector w; the distance from the plane to the closest points on either side is 1/||w||, so the total margin width is 2/||w||]
Another look at the SVM objective
• Consider the objective:

arg min_{w,w_0}  Σ_i max( 0, 1 − y_i (w^T x_i + w_0) / ||w|| )

• This is non-convex and harder to solve than the convex SVM objective
• Roughly, it measures the total sum of distances of points to the plane
• Misclassified points are given a negative distance whereas correct ones have a 0 distance
SVM objective
• The SVM objective is

min_w (1/2) ||w||^2   subject to   y_i (w^T x_i + w_0) ≥ 1, for all i

• Compare this with

arg min_{w,w_0}  Σ_i max( 0, 1 − y_i (w^T x_i + w_0) / ||w|| )

• The SVM objective can be viewed as a convex approximation of this by separating the numerator and denominator
• The SVM objective is equivalent to minimizing the regularized hinge loss:

arg min_{w,w_0}  Σ_{i=0}^{n-1} max( 0, 1 − y_i (w^T x_i + w_0) ) + ||w||^2 / 2
Hinge loss optimization
• Hinge loss:

arg min_{w,w_0}  Σ_{i=0}^{n-1} max( 0, 1 − y_i (w^T x_i + w_0) )

• But the max function is non-differentiable. Therefore we use the sub-gradient:

df/dw_j = −x_ij y_i   if y_i (w^T x_i + w_0) < 1
df/dw_j = 0           if y_i (w^T x_i + w_0) ≥ 1
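A bare-bones NumPy sketch of this update (assuming ±1 labels; the function name, fixed step size, and epoch count are illustrative choices, and no regularization term is included):

```python
import numpy as np

def hinge_subgradient_descent(X, y, lr=0.01, epochs=200):
    """Minimize sum_i max(0, 1 - y_i (w^T x_i + w0)) by sub-gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    w0 = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        active = margins < 1                  # points violating the margin
        # Sub-gradient: -x_ij * y_i for violating points, 0 for the rest.
        grad_w = -(y[active, None] * X[active]).sum(axis=0)
        grad_w0 = -y[active].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```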
Inseparable case
• What if there is no separating hyperplane? For example, the XOR function.
• One solution: consider all hyperplanes and select the one with the minimal number of misclassified points
• Unfortunately this is NP-complete (see paper by Ben-David, Eiron, Long on course website)
• Even NP-complete to polynomially approximate (Learning with kernels, Scholkopf and Smola, and paper on website)
Inseparable case
• But if we measure error as the sum of the distances of misclassified points to the plane, then we can solve for a support vector machine in polynomial time
• Roughly speaking, the margin error bound theorem applies (Theorem 7.3, Scholkopf and Smola)
• Note that the total distance error can be considerably larger than the number of misclassified points
0/1 loss vs. distance-based loss
Optimally separating hyperplane with errors
[Figure: a separating hyperplane with normal vector w and some points lying on the wrong side (errors)]
Support vector machine: optimally separating hyperplane
In practice we allow for error terms in case there is no separating hyperplane:

min_{w, w_0, ξ}  (1/2) ||w||^2 + C Σ_i ξ_i   subject to   y_i (w^T x_i + w_0) ≥ 1 − ξ_i and ξ_i ≥ 0, for all i
SVM software
• Plenty of SVM software out there. Two popular packages:
– SVM-light
– LIBSVM
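scikit-learn is another common option; its SVC class is built on top of LIBSVM. A minimal usage sketch (the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2D training and test points.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_train = np.array([-1, -1, 1, 1])
X_test = np.array([[0.5, 0.5], [3.5, 3.5]])

# C is the soft-margin parameter from the slack-variable formulation above.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)
print(clf.predict(X_test))   # expected: [-1  1]
```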
Kernels
• What if no separating hyperplane exists?
• Consider the XOR function.
• In a higher dimensional space we can find a separating hyperplane
• Example with SVM-light
Kernels
• The solution to the SVM is obtained by applying the KKT conditions (a generalization of Lagrange multipliers to inequality constraints). The problem to solve becomes

L_d = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j

subject to

Σ_i α_i y_i = 0   and   0 ≤ α_i ≤ C
Kernels
• The previous problem can in turn be solved, again using the KKT conditions.
• The dot product can be replaced by a matrix K(i,j) = x_i^T x_j, or more generally by any positive definite kernel matrix K:

L_d = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)

subject to

Σ_i α_i y_i = 0   and   0 ≤ α_i ≤ C
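A sketch of using a precomputed kernel matrix K with scikit-learn's SVC (the RBF kernel, the bandwidth gamma, and the XOR-style toy data are illustrative choices):

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel_matrix(A, B, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# XOR-style data: not linearly separable in the original 2D space.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

K_train = rbf_kernel_matrix(X, X)
clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y)

# At prediction time we need the kernel between test and training points;
# here we simply re-score the training points themselves.
K_test = rbf_kernel_matrix(X, X)
print(clf.predict(K_test))   # expected: [-1  1  1 -1]
```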
Kernels
• With the kernel approach we can avoid explicit calculation of features in high dimensions
• How do we find the best kernel?
• Multiple Kernel Learning (MKL) solves this by learning K as a linear combination of base kernels.
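MKL learns the combination weights jointly with the SVM; the sketch below only illustrates the underlying idea with fixed, hand-picked weights on two base kernels (the weights, kernel choices, and toy data are arbitrary, and no weight learning is performed):

```python
import numpy as np
from sklearn.svm import SVC

def linear_kernel(A, B):
    return A @ B.T

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

# Toy data (made up).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

# Fixed combination weights; a real MKL solver would learn these,
# typically under a constraint such as beta >= 0, sum(beta) = 1.
beta = [0.3, 0.7]
K = beta[0] * linear_kernel(X, X) + beta[1] * rbf_kernel(X, X)

# A non-negative combination of kernel matrices is again a valid kernel,
# so it can be passed to the SVM as a precomputed matrix.
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)
print(clf.predict(K))
```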