
Improving the Accuracy and Speed of
Support Vector Machines
Chris J.C. Burges
Bell Laboratories
Lucent Technologies, Room 3G429
101 Crawford's Corner Road
Holmdel, NJ 07733-3030
[email protected]
Bernhard Schölkopf
Max-Planck-Institut für
biologische Kybernetik,
Spemannstr. 38
72076 Tübingen, Germany
[email protected]
Abstract
Support Vector Learning Machines (SVM) are finding application
in pattern recognition, regression estimation, and operator inversion for ill-posed problems. Against this very general backdrop,
any methods for improving the generalization performance, or for
improving the speed in test phase, of SVMs are of increasing interest. In this paper we combine two such techniques on a pattern
recognition problem. The method for improving generalization performance (the "virtual support vector" method) does so by incorporating known invariances of the problem. This method achieves
a drop in the error rate on 10,000 NIST test digit images from 1.4%
to 1.0%. The method for improving the speed (the "reduced set"
method) does so by approximating the support vector decision surface. We apply this method to achieve a factor of fifty speedup in
test phase over the virtual support vector machine. The combined
approach yields a machine which is both 22 times faster than the
original machine and which has better generalization performance,
achieving 1.1% error. The virtual support vector method is applicable to any SVM problem with known invariances. The reduced
set method is applicable to any support vector machine.
1 INTRODUCTION
Support Vector Machines are known to give good results on pattern recognition
problems despite the fact that they do not incorporate problem domain knowledge.
However, they exhibit classification speeds which are substantially slower than those
of neural networks (LeCun et al., 1995). (Part of this work was done while B.S. was
with AT&T Research, Holmdel, NJ.)
The present study is motivated by the above two observations. First, we shall
improve accuracy by incorporating knowledge about invariances of the problem at
hand. Second, we shall increase classification speed by reducing the complexity of
the decision function representation. This paper thus brings together two threads
explored by us during the last year (Schölkopf, Burges & Vapnik, 1996; Burges,
1996).
The method for incorporating invariances is applicable to any problem for which
the data is expected to have known symmetries. The method for improving the
speed is applicable to any support vector machine. Thus we expect these methods
to be widely applicable to problems beyond pattern recognition (for example, to
the regression estimation problem (Vapnik, Golowich & Smola, 1996)).
After a brief overview of Support Vector Machines in Section 2, we describe how
problem domain knowledge was used to improve generalization performance in Section 3. Section 4 contains an overview of a general method for improving the
classification speed of Support Vector Machines. Results are collected in Section 5.
We conclude with a discussion.
2 SUPPORT VECTOR LEARNING MACHINES
This Section summarizes those properties of Support Vector Machines (SVMs) which
are relevant to the discussion below. For details on the basic SVM approach, the
reader is referred to (Boser, Guyon & Vapnik, 1992; Cortes & Vapnik, 1995; Vapnik,
1995). We end by noting a physical analogy.
Let the training data be elements $\mathbf{x}_i \in \mathcal{L}$, $\mathcal{L} = \mathbf{R}^d$, $i = 1, \ldots, \ell$, with corresponding
class labels $y_i \in \{\pm 1\}$. An SVM performs a mapping $\Phi: \mathcal{L} \to \mathcal{H}$, $\mathbf{x} \mapsto \bar{\mathbf{x}}$, into a
high (possibly infinite) dimensional Hilbert space $\mathcal{H}$. In the following, vectors in
$\mathcal{H}$ will be denoted with a bar. In $\mathcal{H}$, the SVM decision rule is simply a separating
hyperplane: the algorithm constructs a decision surface with normal $\bar{\Psi} \in \mathcal{H}$ which
separates the $\bar{\mathbf{x}}_i$ into two classes:

$\bar{\Psi} \cdot \bar{\mathbf{x}}_i + b \geq k_0 - \xi_i, \quad y_i = +1$   (1)
$\bar{\Psi} \cdot \bar{\mathbf{x}}_i + b \leq k_1 + \xi_i, \quad y_i = -1$   (2)

where the $\xi_i$ are positive slack variables, introduced to handle the non-separable
case (Cortes & Vapnik, 1995), and where $k_0$ and $k_1$ are typically defined to be $+1$
and $-1$, respectively. $\bar{\Psi}$ is computed by minimizing the objective function

$\frac{\|\bar{\Psi}\|^2}{2} + C \left( \sum_{i=1}^{\ell} \xi_i \right)^p$   (3)

subject to (1), (2), where $C$ is a constant, and we choose $p = 2$. In the separable case,
the SVM algorithm constructs that separating hyperplane for which the margin
between the positive and negative examples in $\mathcal{H}$ is maximized. A test vector $\mathbf{x} \in \mathcal{L}$
is then assigned a class label $\{+1, -1\}$ depending on whether $\bar{\Psi} \cdot \Phi(\mathbf{x}) + b$ is greater
or less than $(k_0 + k_1)/2$. Support vectors $\mathbf{s}_j \in \mathcal{L}$ are defined as training samples
for which one of Equations (1) or (2) is an equality. (We name the support vectors
$\mathbf{s}$ to distinguish them from the rest of the training data.) The solution may be
expressed

$\bar{\Psi} = \sum_{j=1}^{N_S} \alpha_j y_j \Phi(\mathbf{s}_j)$   (4)

where $\alpha_j > 0$ are the positive weights, determined during training, $y_j \in \{\pm 1\}$ the
class labels of the $\mathbf{s}_j$, and $N_S$ the number of support vectors. Thus in order to
classify a test point $\mathbf{x}$ one must compute
$\bar{\Psi} \cdot \bar{\mathbf{x}} = \sum_{j=1}^{N_S} \alpha_j y_j \bar{\mathbf{s}}_j \cdot \bar{\mathbf{x}} = \sum_{j=1}^{N_S} \alpha_j y_j \Phi(\mathbf{s}_j) \cdot \Phi(\mathbf{x}) = \sum_{j=1}^{N_S} \alpha_j y_j K(\mathbf{s}_j, \mathbf{x}).$   (5)
One of the key properties of support vector machines is the use of the kernel $K$ to
compute dot products in $\mathcal{H}$ without having to explicitly compute the mapping $\Phi$.
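As a concrete illustration, the following minimal sketch evaluates the decision rule of Equation (5) entirely in input space, assuming the homogeneous polynomial kernel used in Section 5; the function names and array layout are ours, not the authors' implementation.

import numpy as np

def poly_kernel(A, B, degree=5):
    # K(x, y) = (x . y)^degree: the dot product in H without ever forming Phi.
    return (A @ B.T) ** degree

def classify(x, support_vectors, alpha, y_sv, b, degree=5):
    """Evaluate Psi . Phi(x) + b via Equation (5); with k0 = +1 and k1 = -1 the
    class label depends on the sign relative to (k0 + k1)/2 = 0."""
    k = poly_kernel(support_vectors, x[None, :], degree).ravel()
    return +1 if np.sum(alpha * y_sv * k) + b > 0 else -1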
It is interesting to note that the solution has a simple physical interpretation in
the high dimensional space $\mathcal{H}$. If we assume that each support vector $\bar{\mathbf{s}}_j$ exerts a
perpendicular force of size $\alpha_j$ and sign $y_j$ on a solid plane sheet lying along the
hyperplane $\bar{\Psi} \cdot \bar{\mathbf{x}} + b = (k_0 + k_1)/2$, then the solution satisfies the requirements of
mechanical stability. At the solution, the $\alpha_j$ can be shown to satisfy $\sum_{j=1}^{N_S} \alpha_j y_j = 0$,
which translates into the forces on the sheet summing to zero; and Equation (4)
implies that the torques also sum to zero.
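To spell out the torque statement (a short check consistent with Equation (4), in our notation): writing $\hat{\Psi} = \bar{\Psi}/\|\bar{\Psi}\|$ for the unit normal along which each force acts, the total torque about the origin is

$\sum_{j=1}^{N_S} \bar{\mathbf{s}}_j \times \left( \alpha_j y_j \hat{\Psi} \right) = \left( \sum_{j=1}^{N_S} \alpha_j y_j \bar{\mathbf{s}}_j \right) \times \hat{\Psi} = \bar{\Psi} \times \hat{\Psi} = 0.$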
3 IMPROVING ACCURACY
This section follows the reasoning of (Schölkopf, Burges & Vapnik, 1996). Problem
domain knowledge can be incorporated in two different ways: the knowledge can
be directly built into the algorithm, or it can be used to generate artificial training
examples ("virtual examples"). The latter significantly slows down training times,
due to both correlations in the artificial data and to the increased training set size
(Simard et al., 1992); however it has the advantage of being readily implemented for
any learning machine and for any invariances. For instance, if instead of Lie groups
of symmetry transformations one is dealing with discrete symmetries, such as the
bilateral symmetries of Vetter, Poggio & Bülthoff (1994), then derivative-based
methods (e.g. Simard et al., 1992) are not applicable.
For support vector machines, an intermediate method which combines the advantages of both approaches is possible. The support vectors characterize the solution
to the problem in the following sense: if all the other training data were removed
and the system retrained, then the solution would be unchanged. Furthermore,
those support vectors $\mathbf{s}_i$ which are not errors are close to the decision boundary
in $\mathcal{H}$, in the sense that they either lie exactly on the margin ($\xi_i = 0$) or close to
it ($\xi_i < 1$). Finally, different types of SVM, built using different kernels, tend to
produce the same set of support vectors (Schölkopf, Burges & Vapnik, 1995). This
suggests the following algorithm: first, train an SVM to generate a set of support
vectors $\{\mathbf{s}_1, \ldots, \mathbf{s}_{N_S}\}$; then, generate the artificial examples (virtual support vectors) by applying the desired invariance transformations to $\{\mathbf{s}_1, \ldots, \mathbf{s}_{N_S}\}$; finally,
train another SVM on the new set. To build a ten-class classifier, this procedure is
carried out separately for ten binary classifiers.
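As an illustration, the following sketch implements this two-stage procedure for a single binary classifier on top of scikit-learn's SVC. The library choice and the transform_fns argument (a list of invariance transformations, such as the one-pixel shifts of Section 5) are our assumptions rather than the original implementation; note also that SVC penalizes the slacks linearly rather than quadratically as in Equation (3).

import numpy as np
from sklearn.svm import SVC

def train_virtual_sv_machine(X, y, transform_fns, C=10.0, degree=5):
    """Two-stage 'virtual support vector' training for one binary classifier
    (a ten-class system repeats this separately for each digit-vs-rest task)."""
    # Stage 1: ordinary SVM with the homogeneous polynomial kernel K(x, y) = (x . y)^degree.
    svm = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=0.0, C=C).fit(X, y)
    sv, sv_labels = svm.support_vectors_, y[svm.support_]

    # Stage 2: apply each invariance transformation to the support vectors only, and
    # retrain; as noted in Section 5, the second machine sees only the transformed copies.
    X_virtual = np.vstack([t(sv) for t in transform_fns])
    y_virtual = np.tile(sv_labels, len(transform_fns))
    return SVC(kernel="poly", degree=degree, gamma=1.0, coef0=0.0, C=C).fit(X_virtual, y_virtual)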
Apart from the increase in overall training time (by a factor of two, in our experiments), this technique has the disadvantage that many of the virtual support
vectors become support vectors for the second machine, increasing the number of
summands in Equation (5) and hence decreasing classification speed. However, the
latter problem can be solved with the reduced set method, which we describe next.
4 IMPROVING CLASSIFICATION SPEED
The discussion in this Section follows that of (Burges, 1996). Consider a set of
vectors $\mathbf{z}_k \in \mathcal{L}$, $k = 1, \ldots, N_Z$, and corresponding weights $\beta_k \in \mathbf{R}$ for which

$\bar{\Psi}' \equiv \sum_{k=1}^{N_Z} \beta_k \Phi(\mathbf{z}_k)$   (6)

minimizes (for fixed $N_Z$) the Euclidean distance to the original solution:

$\rho = \| \bar{\Psi} - \bar{\Psi}' \|.$   (7)

Note that $\rho$, expressed here in terms of vectors in $\mathcal{H}$, can be expressed entirely
in terms of functions (using the kernel $K$) of vectors in the input space $\mathcal{L}$. The set
$\{ (\beta_k, \mathbf{z}_k) \mid k = 1, \ldots, N_Z \}$ is called the reduced set. To classify a test point $\mathbf{x}$, the
expansion in Equation (5) is replaced by the approximation

$\bar{\Psi}' \cdot \bar{\mathbf{x}} = \sum_{k=1}^{N_Z} \beta_k \bar{\mathbf{z}}_k \cdot \bar{\mathbf{x}} = \sum_{k=1}^{N_Z} \beta_k K(\mathbf{z}_k, \mathbf{x}).$   (8)
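For instance, substituting Equations (4) and (6) into Equation (7) gives the kernel-only form (in our notation)

$\rho^2 = \sum_{i,j=1}^{N_S} \alpha_i \alpha_j y_i y_j K(\mathbf{s}_i, \mathbf{s}_j) \; - \; 2 \sum_{j=1}^{N_S} \sum_{k=1}^{N_Z} \alpha_j y_j \beta_k K(\mathbf{s}_j, \mathbf{z}_k) \; + \; \sum_{k,l=1}^{N_Z} \beta_k \beta_l K(\mathbf{z}_k, \mathbf{z}_l),$

which involves only kernel evaluations between vectors in $\mathcal{L}$.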
The goal is then to choose the smallest $N_Z < N_S$, and corresponding reduced
set, such that any resulting loss in generalization performance remains acceptable.
Clearly, by allowing $N_Z = N_S$, $\rho$ can be made zero. Interestingly, there are nontrivial cases where $N_Z < N_S$ and $\rho = 0$, in which case the reduced set leads to
an increase in classification speed with no loss in generalization performance. Note
that reduced set vectors are not support vectors, in that they do not necessarily lie
on the separating margin and, unlike support vectors, are not training samples.
While the reduced set can be found exactly in some cases, in general an unconstrained conjugate gradient method is used to find the $\mathbf{z}_k$ (while the corresponding
optimal $\beta_k$ can be found exactly, for any choice of the $\mathbf{z}_k$). The method for finding the reduced set
is computationally very expensive (the final phase constitutes a conjugate gradient
descent in a space of $(d + 1) \cdot N_Z$ variables, which in our case is typically of order
50,000).
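The following is a minimal sketch of such a search, not the authors' code: for any fixed set of $\mathbf{z}_k$ the optimal $\beta_k$ solve a linear system, and the $\mathbf{z}_k$ themselves are found by unconstrained minimization of $\rho^2$. Here the objective is handed to SciPy's conjugate gradient routine with a numerical gradient, whereas the original work uses analytic gradients.

import numpy as np
from scipy.linalg import solve
from scipy.optimize import minimize

def poly_kernel(A, B, degree=5):
    return (A @ B.T) ** degree

def rho_squared(Z, S, w, degree=5):
    """rho^2 = ||Psi - Psi'||^2 for reduced vectors Z, support vectors S and
    weights w = alpha * y, with the optimal beta computed exactly for this Z."""
    Kzz = poly_kernel(Z, Z, degree)
    Kzs = poly_kernel(Z, S, degree)
    beta = solve(Kzz + 1e-8 * np.eye(len(Z)), Kzs @ w)  # exact beta for fixed Z
    # The first term is constant in Z and could be precomputed in practice.
    return w @ (poly_kernel(S, S, degree) @ w) - 2 * beta @ (Kzs @ w) + beta @ (Kzz @ beta), beta

def find_reduced_set(S, w, n_z, degree=5, seed=0):
    d = S.shape[1]
    Z0 = S[np.random.default_rng(seed).choice(len(S), n_z, replace=False)]
    objective = lambda z: rho_squared(z.reshape(n_z, d), S, w, degree)[0]
    res = minimize(objective, Z0.ravel(), method="CG")  # conjugate gradient over the n_z * d coordinates
    Z = res.x.reshape(n_z, d)
    return Z, rho_squared(Z, S, w, degree)[1]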
5 EXPERIMENTAL RESULTS
In this Section, by "accuracy" we mean generalization performance, and by "speed"
we mean classification speed. In our experiments, we used the MNIST database of
60,000 + 10,000 handwritten digits, which was used in the comparison study
of LeCun et al. (1995). In that study, the error rate record of 0.7% is held by a
boosted convolutional neural network ("LeNet4").
We start by summarizing the results of the virtual support vector method. We
trained ten binary classifiers using $C = 10$ in Equation (3). We used a polynomial
kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^5$. Combining the classifiers then gave 1.4% error on the 10,000-element
test set; this system is referred to as ORIG below. We then generated new training data by translating the resulting support vectors by one pixel in each of four
directions, and trained a new machine (using the same parameters). This machine,
which is referred to as VSV below, achieved 1.0% error on the test set. The results
for each digit are given in Table 1.
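Concretely, the virtual examples for 28x28 MNIST images can be generated along the following lines (a sketch; the zeroing of the wrapped border and the function name are our choices):

import numpy as np

def one_pixel_shifts(images):
    """Translate each 28x28 image by one pixel in each of the four directions,
    returning four virtual examples per input image (flattened, like the input)."""
    imgs = images.reshape(-1, 28, 28)
    shifted = []
    for dy, dx in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        s = np.roll(imgs, shift=(dy, dx), axis=(1, 2))
        # Zero the row/column that wrapped around, so the shift is a pure translation.
        if dy == 1:   s[:, 0, :] = 0
        if dy == -1:  s[:, -1, :] = 0
        if dx == 1:   s[:, :, 0] = 0
        if dx == -1:  s[:, :, -1] = 0
        shifted.append(s.reshape(len(imgs), -1))
    return np.concatenate(shifted, axis=0)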
Note that the improvement in accuracy comes at a cost in speed of approximately
a factor of 2. Furthermore, the speed of ORIG was comparatively slow to start
with (LeCun et al., 1995), requiring approximately 14 million multiply-adds for one
classification (this can be reduced by caching results of repeated support vectors
(Burges, 1996)).
Table 1: Generalization performance improvement by incorporating invariances.
N_E and N_SV are the number of errors and the number of support vectors, respectively; "ORIG" refers to the original support vector machine, "VSV" to the machine
trained on virtual support vectors.

Digit   N_E ORIG   N_E VSV   N_SV ORIG   N_SV VSV
0          17         15        1206        2938
1          15         13         757        1887
2          34         23        2183        5015
3          32         21        2506        4764
4          30         30        1784        3983
5          29         23        2255        5235
6          30         18        1347        3328
7          43         39        1712        3968
8          47         35        3053        6978
9          56         40        2720        6348
Table 2: Dependence of the performance of the reduced set system on the threshold.
The numbers in parentheses give the corresponding number of errors on the test set.
Note that Thrsh Test gives a lower bound for these numbers.

Digit   Thrsh VSV        Thrsh Bayes      Thrsh Test
0        1.39606 (19)     1.48648 (18)     1.54696 (17)
1        3.98722 (24)     4.43154 (12)     4.32039 (10)
2        1.27175 (31)     1.33081 (30)     1.26466 (29)
3        1.26518 (29)     1.42589 (27)     1.33822 (26)
4        2.18764 (37)     2.3727 (35)      2.30899 (33)
5        2.05222 (33)     2.21349 (27)     2.27403 (24)
6        0.95086 (25)     1.06629 (24)     0.790952 (20)
7        3.0969 (59)      3.34772 (57)     3.27419 (54)
8       -1.06981 (39)    -1.19615 (40)    -1.26365 (37)
9        1.10586 (40)     1.10074 (40)     1.13754 (39)
In order to become competitive with systems with comparable
accuracy, we will need approximately a factor of fifty improvement in speed. We
therefore approximated VSV with a reduced set system RS with a factor of fifty
fewer vectors than the number of support vectors in VSV.
Since the reduced set method computes an approximation to the decision surface in
the high dimensional space, it is likely that the accuracy of RS could be improved
by choosing a different threshold $b$ in Equations (1) and (2). We computed that
threshold which gave the empirical Bayes error for the RS system, measured on
the training set. This can be done easily by finding the maximum of the difference
between the two un-normalized cumulative distributions of the values of the dot
products $\bar{\Psi}' \cdot \bar{\mathbf{x}}_i$, where the $\mathbf{x}_i$ are the original training data. Note that the effects of
bias are reduced by the fact that VSV (and hence RS) was trained only on shifted
data, and not on any of the original data. Thus, in the absence of a validation
set, the original training data provide a reasonable means of estimating the Bayes
threshold. This is a serendipitous bonus of the VSV approach. Table 2 compares
results obtained using the threshold generated by the training procedure for the
VSV system; the estimated Bayes threshold for the RS system; and, for comparison
purposes only (to see the maximum possible effect of varying the threshold), the
Bayes threshold computed on the test set.
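A small sketch of this threshold selection (the names are ours): given the decision values of the RS system on the original training data and the corresponding labels, the empirical Bayes threshold is the point at which the difference between the two un-normalized cumulative distributions is largest.

import numpy as np

def bayes_threshold(scores, labels):
    """labels in {+1, -1}; returns the threshold minimizing training errors,
    i.e. maximizing cum_neg(t) - cum_pos(t) over the candidate split points."""
    order = np.argsort(scores)
    s, y = scores[order], labels[order]
    gain = np.cumsum(y == -1) - np.cumsum(y == +1)  # difference of un-normalized CDFs
    i = int(np.argmax(gain))
    # Place the threshold just above the best split point.
    return s[i] if i == len(s) - 1 else 0.5 * (s[i] + s[i + 1])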
Table 3: Speed improvement using the reduced set method. The second through
fourth columns give the number of errors on the test set for the original system, the
virtual support vector system, and the reduced set system. The last three columns
give, for each system, the number of vectors whose dot product must be computed
in test phase.

Digit   ORIG Err   VSV Err   RS Err   ORIG # SV   VSV # SV   # RSV
0          17         15        18       1206        2938       59
1          15         13        12        757        1887       38
2          34         23        30       2183        5015      100
3          32         21        27       2506        4764       95
4          30         30        35       1784        3983       80
5          29         23        27       2255        5235      105
6          30         18        24       1347        3328       67
7          43         39        57       1712        3968       79
8          47         35        40       3053        6978      140
9          56         40        40       2720        6348      127
Table 3 compares results on the test set for the three systems, where the Bayes
threshold (computed with the training set) was used for RS. The results for all ten
digits combined are 1.4% error for ORIG, 1.0% for VSV (with roughly twice as
many multiply-adds) and 1.1% for RS (with a factor of 22 fewer multiply-adds than
ORIG).
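As a rough sanity check of these factors, one can count one 784-dimensional dot product per expansion vector in Table 3, ignoring the caching of support vectors shared between classifiers:

# Vector counts taken from Table 3; d = 28 x 28 pixels.
d = 784
orig = [1206, 757, 2183, 2506, 1784, 2255, 1347, 1712, 3053, 2720]
vsv  = [2938, 1887, 5015, 4764, 3983, 5235, 3328, 3968, 6978, 6348]
rs   = [59, 38, 100, 95, 80, 105, 67, 79, 140, 127]
print(sum(orig) * d)        # ~15.3M, close to the quoted ~14 million multiply-adds for ORIG
print(sum(rs) * d)          # ~0.70M, close to the quoted ~650K multiply-adds for RS
print(sum(orig) / sum(rs))  # ~21.9: the factor of 22 relative to ORIG
print(sum(vsv) / sum(rs))   # ~49.9: the factor of fifty relative to VSV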
The reduced set conjugate gradient algorithm does not reduce the objective function
$\rho^2$ (Equation (7)) to zero. For example, for the first five digits, $\rho^2$ is only reduced
on average by a factor of 2.4 (the algorithm is stopped when progress becomes too
slow). It is striking that, nevertheless, good results are achieved.
6 DISCUSSION
The only systems in LeCun et al. (1995) with better than 1.1% error are LeNet5
(0.9% error, with approximately 350K multiply-adds) and boosted LeNet4 (0.7%
error, approximately 450K multiply-adds). Clearly SVMs are not in this league yet
(the RS system described here requires approximately 650K multiply-adds).
However, SVMs present clear opportunities for further improvement. (In fact, we
have since trained a VSV system with 0.8% error, by choosing a different kernel.)
More invariances (for example, for the pattern recognition case, small rotations,
or varying ink thickness) could be added to the virtual support vector approach.
Further, one might use only those virtual support vectors which provide new information about the decision boundary, or use a measure of such information to keep
only the most important vectors. Known invariances could also be built directly
into the SVM objective function.
Viewed as an approach to function approximation, the reduced set method is currently restricted in that it assumes a decision function with the same functional
form as the original SVM. In the case of quadratic kernels, the reduced set can be
computed both analytically and efficiently (Burges, 1996). However, the conjugate
gradient descent computation for the general kernel is very inefficient. Perhaps
relaxing the above restriction could lead to analytical methods which would apply to
more complex kernels also.
Acknowledgements
We wish to thank V. Vapnik, A. Smola and H. Drucker for discussions. C. Burges
was supported by ARPA contract N00014-94-C-0186. B. Schölkopf was supported
by the Studienstiftung des deutschen Volkes.
References
"1] Boser, B. E., Guyon, I. M., Vapnik, V., A Training Algorithm for Optimal
Margin Classiers, Fifth Annual Workshop on Computational Learning Theory,
Pittsburgh ACM (1992) 144{152.
"2] Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., Le
Cun, Y., Muller, U. A., Sackinger, E., Simard, P., Vapnik, V., Comparison of
Classier Methods: a Case Study in Handwritten Digit Recognition, Proceedings of the 12th International Conference on Pattern Recognition and Neural
Networks, Jerusalem (1994)
"3] Burges, C. J. C., Simplied Support Vector Decision Rules, 13th International
Conference on Machine Learning (1996), pp. 71 { 77.
"4] Cortes, C., Vapnik, V., Support Vector Networks, Machine Learning 20 (1995)
pp. 273 { 297
"5] LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker,
H., Guyon, I., Muller, U., Sackinger, E., Simard, P., and Vapnik, V., Comparison of Learning Algorithms for Handwritten Digit Recognition, International
Conference on Articial Neural Networks, Ed. F. Fogelman, P. Gallinari, pp.
53-60, 1995.
"6] Scholkopf, B., Burges, C.J.C., Vapnik, V., Extracting Support Data for a Given
Task, in Fayyad, U. M., Uthurusamy, R. (eds.), Proceedings, First International
Conference on Knowledge Discovery & Data Mining, AAAI Press, Menlo Park,
CA (1995)
"7] Scholkopf, B., Burges, C.J.C., Vapnik, V., Incorporating Invariances in Support
Vector Learning Machines, in Proceedings ICANN'96 | International Conference on Articial Neural Networks. Springer Verlag, Berlin, (1996)
"8] Simard, P., Victorri, B., Le Cun, Y., Denker, J., Tangent Prop | a Formalism
for Specifying Selected Invariances in an Adaptive Network, in Moody, J. E.,
Hanson, S. J., Lippmann, R. P., Advances in Neural Information Processing
Systems 4, Morgan Kaufmann, San Mateo, CA (1992)
"9] Vapnik, V., Estimation of Dependences Based on Empirical Data, "in Russian]
Nauka, Moscow (1979) English translation: Springer Verlag, New York (1982)
"10] Vapnik, V., The Nature of Statistical Learning Theory, Springer Verlag, New
York (1995)
"11] Vapnik, V., Golowich, S., and Smola, A., Support Vector Method for Function
Approximation, Regression Estimation, and Signal Processing, Submitted to
Advances in Neural Information Processing Systems, 1996
"12] Vetter, T., Poggio, T., and Bultho, H., The Importance of Symmetry and Virtual Views in Three{Dimensional Object Recognition, Current Biology 4 (1994)
18{23