Support Vector Algorithms as Regularization Networks
Andrea Caponnetto1,2
Lorenzo Rosasco2
Alessandro Verri2
Francesca Odone2
1-DISI, Università di Genova,
v. Dodecaneso 35,
16146 Genova, Italy
2- CBCL - Massachusetts Institute of Technology,
Bldg.E25-201, 45 Carleton St.,
Cambridge, MA 02142
Abstract.
Many recently proposed learning algorithms are clearly inspired by Support Vector Machines. Some of them were developed
while trying to simplify the quadratic programming problem that has to
be solved in training SVMs. Some others have been proposed to solve
problems other than binary classification (for example one-class SVM for
novelty detection).
Though indeed attractive, for most of the machine learning community the above algorithms lack a clear theoretical motivation. In this context
it seems that the connection to regularization networks is most promising
both from a theoretical and a practical point of view and might be of great
use to understand the mathematical properties of various SV algorithms.
In this paper we contribute to filling this gap by reviewing several SV algorithms from a regularization point of view.
1
Introduction
A binary pattern classification problem amounts to determining a classifier able to distinguish between elements of two classes. In learning from examples we are given a training set of instances of the two classes and we have to determine a classifier capable of generalization, that is, able to perform well on new instances.
Support Vector Machine (SVM) algorithms proved to be extremely effective
in a variety of different application fields and, as shown in [10], they can be
seen as a particular instance of a wider class of learning algorithms usually
called Regularization Networks (RN). In the RN formulation the original SVM
problem is stated as the minimization problem of a regularized functional on a
reproducing kernel Hilbert space (RKHS). The latter proved to be a convenient
formulation in order to study the mathematical properties of SVMs.
Recently several learning algorithms have been inspired by SVMs. Some of them resulted from the effort to obtain algorithms that are easier to implement. In fact, it is well known that training standard SVMs involves a dual QP problem which
can be hard to solve. A possible way to overcome these difficulties is to slightly modify the dual problem formulation. The above approach led to a number of
learning algorithms to which we refer as non standard Support Vector Machines (NS-SVM). Some other algorithms, as for instance one-class SVM for novelty detection, share the same geometrical intuition as the original SVM.
Though indeed promising from a practical viewpoint, the above algorithms lack a clear theoretical foundation. In this paper we show that all of them can be described in a general regularization framework, which automatically makes many existing theoretical results available. In particular, through the formulation in terms of RN we can easily apply known existence and uniqueness results, describe the explicit form of the solution and discuss generalization properties.
The paper is organized as follows. In Section 2, after briefly recalling some basic concepts from learning theory, we introduce the general formulation of
RN and discuss in detail the meaning of the offset term. In Section 3 we review
several NS-SVM and in Section 4 we deal with one-class SVM. Finally, in Section
5 we end with some comments and remarks.
2 Binary Classification and Regularization Networks
We preliminarily review the main concepts and notation of the binary pattern classification problem in a statistical learning framework (for details see [10],
[24], [9]).
We consider an input space X ⊂ Rn and an output space Y = {−1, 1}. The
space X × Y is endowed with a probability distribution p(x, y) = p(x)p(y|x),
where p(x) denotes the marginal probability distribution on X and p(y|x) the
conditional probability distribution of y given x. In a learning problem the
probability distribution p(x, y) is fixed but unknown and the data we are given
are $\ell$ pairs of examples $D = \{(x_i, y_i)\}_{i=1}^{\ell}$, called the training set, that we assume to be drawn i.i.d. with respect to $p(x, y)$. Roughly speaking, given the training set D the classification problem amounts to finding a deterministic classification rule in such a way that the misclassification error on future examples is as small as
possible.
A learning algorithm is a map that, given the training set D, provides us with
a classification rule $f_D$. The class of algorithms that we consider throughout is that of the so-called Regularization Networks (RN), which amount, in their simplest version, to solving the following minimization problem
\[
\min_{f \in H}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_{H}^{2} \right\} \tag{1}
\]
where the first term measures how well the function f fits the given data and the second term is the squared norm of f in the RKHS H, which controls the
complexity (smoothness) of the solution. The parameter λ is the regularization
parameter that balances the tradeoff between the two terms.
The term V (y, f (x)) can be interpreted as a measure of the penalty (loss)
when classifying by f (x) a new input x whose label is indeed y. Usually in the
classification setting the loss function V depends on its arguments through the
product yf (x). However it should be stressed that this particular form of the
loss function implicitly models situations in which false negative (y = +1 and
f (x) < 0) and false positive (y = −1 and f (x) > 0) errors are equally penalized.
More general situations have been considered in the literature (see for example [16]); in the general case an extra factor depending on y has to be added to the loss function,
\[
V(y, f(x)) = L(y)\, V(yf(x)). \tag{2}
\]
We will return to this point when considering one-class SVM. Different loss functions give rise to different learning algorithms [10]; some choices that we will consider in the following are listed below (a short code sketch of these losses is given after the list):
• the square loss $V(y, w) = (w - y)^2 = (1 - wy)^2$,
• the hinge loss $V(y, w) = \max\{1 - wy,\, 0\} =: |1 - wy|_{+}$,
• the truncated square loss $V(y, w) = \max\{1 - wy,\, 0\}^2 =: |1 - wy|_{+}^{2}$.
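As a concrete illustration, the following short Python sketch (our own addition, not part of the original formulation) implements the three losses as functions of the margin $m = wy$; note that only the square loss also penalizes margins larger than 1.

\begin{verbatim}
# A minimal sketch (ours) of the three loss functions above, written as
# functions of the margin m = w * y.
import numpy as np

def square_loss(m):
    return (1.0 - m) ** 2                   # V(y, w) = (1 - wy)^2

def hinge_loss(m):
    return np.maximum(1.0 - m, 0.0)         # V(y, w) = |1 - wy|_+

def truncated_square_loss(m):
    return np.maximum(1.0 - m, 0.0) ** 2    # V(y, w) = |1 - wy|_+^2

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
for name, loss in [("square", square_loss), ("hinge", hinge_loss),
                   ("truncated square", truncated_square_loss)]:
    print(name, loss(margins))
\end{verbatim}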
In practice an unpenalized offset term is often added to the classifier in Prob.
(1), yielding
\[
\min_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, f(x_i) + b) + \lambda \|f\|_{H}^{2} \right\}. \tag{3}
\]
In Section 2.1 we discuss in detail how the introduction of the unpenalized
offset can be interpreted in terms of RKHSs. The problem above can be further
generalized to include explicit penalization of the offset term. In particular we will consider the additive penalization $b^c$, with $c$ a fixed parameter
\[
\min_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, f(x_i) + b) + \lambda\big(\|f\|_{H}^{2} + b^{c}\big) \right\}. \tag{4}
\]
The cases c = 0 and c = 2 will be the main subject of the following subsection
and of Section 3. The case c = 1 will be studied in Section 4, where it is identified
with one-class SVM.
Before developing our analysis we briefly recall the minimization problem
arising in standard SVM and Regularized Least Squares Classification (RLSC).
The following minimization problem describes the SVM algorithm in its primal
formulation
\[
\min_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} |1 - y_i(f(x_i) + b)|_{+} + \lambda \|f\|_{H}^{2} \right\}. \tag{5}
\]
The above problem can be alternatively stated in terms of slack variables and
margin, that is
\[
\min_{w,\, b,\, \xi_i}\left\{ C\sum_{i=1}^{\ell} \xi_i + \frac{1}{2}\langle w, w\rangle \right\} \tag{6}
\]
\[
\text{s.t. } \forall i \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.
\]
Finally, the following minimization problem describes instead the RLSC algorithm
\[
\min_{f \in H}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} (1 - y_i f(x_i))^{2} + \lambda \|f\|_{H}^{2} \right\}. \tag{7}
\]
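Since with the square loss problem (7) is quadratic in the coefficients of the representer expansion $f = \sum_j c_j K(\cdot, x_j)$, it reduces to a linear system: assuming a strictly positive definite kernel matrix, $(K + \lambda\ell I)c = y$. The following sketch (our own illustration, assuming a Gaussian kernel; it is not code from the paper) solves this system on toy data.

\begin{verbatim}
# A minimal RLSC sketch (ours, not the authors' code): with the square loss,
# the representer theorem turns problem (7) into (K + lam*ell*I) c = y.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def rlsc_fit(X, y, lam=0.1, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)

def rlsc_predict(X_train, c, X_test, sigma=1.0):
    return np.sign(gaussian_kernel(X_test, X_train, sigma) @ c)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
c = rlsc_fit(X, y)
print((rlsc_predict(X, c, X) == y).mean())   # training accuracy on toy data
\end{verbatim}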
2.1 Offset Term and RKHS
Before reviewing a number of different SV algorithms from a regularization theoretical point of view, in this section we report two results which clarify the
meaning of the offset b. When considering RN algorithms one of the following
cases usually occurs
1. no offset term is considered,
2. an unpenalized offset term is considered,
3. a quadratic penalization of the offset is added.
In this subsection we discuss the connections between these situations. The two
theorems stated in the following refer respectively to cases 2 and 3 showing that
they can both be seen as special instances of case 1.
In order to state the first result of this section we note the following facts. First, we recall that the space of constant functions $B = \mathbb{R}$ is a RKHS with kernel $K_B(s, x) = 1$. Second, we note that when we consider the minimization problem with an offset term, we separately minimize on $f \in H$ and $b \in B = \mathbb{R}$, that is, we are considering couples $(f, b) \in H \times \mathbb{R}$. Third, since we are looking for a solution of the form $\bar f(x) + \bar b$, it is useful to consider the sum space
\[
S = H + B = H + \mathbb{R}. \tag{8}
\]
The hypothesis space S is again a RKHS and the associated kernel is simply the sum of the kernels of H and B, that is, K + 1. We are now ready to state the following theorem, which is a simple reformulation of Theorem 3 in [8] (see [8] for details and proof).
Theorem 2.1. Let Q be the orthogonal projection onto the subspace of functions orthogonal to B w.r.t. the scalar product in S, that is, onto the following closed subspace of S
\[
S_0 = \{\, s \in S \mid \langle s, g\rangle_S = 0 \ \ \forall g \in B \,\}. \tag{9}
\]
We have that the couple $(\bar f, \bar b)$ is a solution of the problem
\[
\min_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, f(x_i) + b) + \lambda \|f\|_{H}^{2} \right\} \tag{10}
\]
if and only if $\bar f = Q\bar s$ and $\bar b = \bar s - Q\bar s$, with $\bar s$ (that is, $\bar f + \bar b$) a solution of the problem
\[
\min_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, s(x_i)) + \lambda \|Qs\|_{S}^{2} \right\}. \tag{11}
\]
The above result shows that the case in which an unpenalized offset term
appears in the minimization problem corresponds to the following situation: we
are simply considering the minimization of a functional in the space S but the
penalty term appearing in the functional is a seminorm.
But what happens when we consider an offset term and we do penalize it? The next theorem formally answers this question. Intuitively, it shows that we are
considering the minimization of a functional in the sum space S but this time
the penalty term is the standard squared norm in the space. Before stating and
proving the theorem which formalizes the above statements, let us introduce
some notation. Let $s \in S$, then it is well known (see [1] or [19]) that
\[
\|s\|_{S}^{2} = \min_{\substack{f \in H,\, b \in \mathbb{R} \\ f + b = s}} \{\|f\|_{H}^{2} + b^{2}\}. \tag{12}
\]
Moreover the minimum is attained by the couple $(f_s, b_s)$ where $f_s = Qs$ and $b_s = s - Qs$ (here Q is the projector defined in the text of Theorem 2.1). This last
fact is a consequence of Lemma 6 in [8] applied to the case of finite dimensional
B.
We are now ready to state the theorem. The couple $(\bar f, \bar b)$ is a solution of the problem
\[
\min_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, f(x_i) + b) + \lambda\big(\|f\|_{H}^{2} + b^{2}\big) \right\} \tag{13}
\]
if and only if $\bar f = Q\bar s$ and $\bar b = \bar s - Q\bar s$, with $\bar s$ (that is, $\bar f + \bar b$) a solution of the problem
\[
\min_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, s(x_i)) + \lambda \|s\|_{S}^{2} \right\}. \tag{14}
\]
Proof. Let us preliminarily notice that the minima achieved by the two functionals above are equal; in fact
\[
\inf_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, f(x_i) + b) + \lambda\big(\|f\|_{H}^{2} + b^{2}\big) \right\} \tag{15}
\]
\[
= \inf_{s \in S}\ \inf_{\substack{f \in H,\, b \in \mathbb{R} \\ f + b = s}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, s(x_i)) + \lambda\big(\|f\|_{H}^{2} + b^{2}\big) \right\} \tag{16}
\]
\[
= \inf_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, s(x_i)) + \lambda \|s\|_{S}^{2} \right\}, \tag{17}
\]
where the last equality follows from Eq. (12). Now assume that $\bar s$ is a solution of Prob. (14); then
\[
\frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, \bar s(x_i)) + \lambda\big(\|f_{\bar s}\|_{H}^{2} + (b_{\bar s})^{2}\big) = \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, \bar s(x_i)) + \lambda \|\bar s\|_{S}^{2}. \tag{18}
\]
Recalling that we set $\bar f = Q\bar s$ and $\bar b = \bar s - Q\bar s$, it follows that the couple $(f_{\bar s}, b_{\bar s})$ is a solution of Prob. (13) since it attains the common minimum. This proves one half of the theorem.
On the other hand, assume that $(\bar f, \bar b)$ is a solution of Prob. (13); then, setting $\bar s = \bar f + \bar b$ and recalling again Eq. (12), we get
\[
\frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, \bar s(x_i)) + \lambda \|\bar s\|_{S}^{2} \le \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, \bar f(x_i) + \bar b) + \lambda\big(\|\bar f\|_{H}^{2} + \bar b^{2}\big). \tag{19}
\]
But the r.h.s. of the inequality above is the common minimum of problems (13) and (14), hence equality must hold. It follows both that $\bar s$ is a solution of Prob. (14) and that $(\bar f, \bar b) = (f_{\bar s}, b_{\bar s})$. This concludes the proof.
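As a quick numerical sanity check of the theorem (our own sketch; the square loss and the Gaussian kernel are illustration choices, not requirements of the result), one can minimize (13) directly over a representer expansion of f plus an explicitly penalized offset b, solve (14) as plain RLSC with the kernel K + 1, and verify that the two predictors coincide on the training points.

\begin{verbatim}
# Numerical check (ours) that problem (13) with square loss and penalized offset
# gives the same classifier as problem (14), i.e. RLSC in S with kernel K + 1.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
ell, lam, sigma = 30, 0.1, 1.0
X = rng.normal(size=(ell, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=ell))

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * sigma ** 2))

# Problem (14): RLSC in the sum space S; the Gram matrix of K + 1 is K + ones.
alpha = np.linalg.solve(K + 1.0 + lam * ell * np.eye(ell), y)
s_bar = (K + 1.0) @ alpha

# Problem (13): f = sum_j c_j K(., x_j) plus an offset b penalized by b^2.
def objective(p):
    c, b = p[:ell], p[ell]
    r = K @ c + b - y
    return r @ r / ell + lam * (c @ K @ c + b ** 2)

def gradient(p):
    c, b = p[:ell], p[ell]
    r = K @ c + b - y
    return np.concatenate([2 / ell * (K @ r) + 2 * lam * (K @ c),
                           [2 / ell * r.sum() + 2 * lam * b]])

res = minimize(objective, np.zeros(ell + 1), jac=gradient, method="L-BFGS-B",
               options={"gtol": 1e-10, "ftol": 1e-15})
c_bar, b_bar = res.x[:ell], res.x[ell]
print(np.allclose(K @ c_bar + b_bar, s_bar, atol=1e-4))   # expected: True
\end{verbatim}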
3 Non Standard SVMs Revisited
In this section we review a number of SV algorithms and show that each one is
indeed equivalent to a particular RN algorithm. We stress that in our analysis
we never make use of the dual formulation of the algorithms, and on the other hand we always consider the general non-linear setting, making use of the feature map Φ(x) (see [24]).
3.1 Regularization Networks with Unpenalized Offset Term
We first consider algorithms where the offset term appearing in the form of the solution is not penalized.
3.1.1 L2-SVM.
We consider (see [5] for reference) the problem
\[
\min_{w,\, b,\, \xi_i}\left\{ C\sum_{i=1}^{\ell} \xi_i^{2} + \frac{1}{2}\langle w, w\rangle \right\} \tag{20}
\]
\[
\text{s.t. } \forall i \quad y_i(w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\]
that is we consider the squares of the slack variables and do not penalize the
bias term b.
The above problem can be seen as the regularization network obtained considering the truncated square loss (see Section 2) and penalizing the estimators by the seminorm $\|Q\,\cdot\,\|_{S}$ in the sum space S defined in Theorem 2.1. In fact we are simply considering the minimization problem
\[
\min_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} |1 - y_i s(x_i)|_{+}^{2} + \lambda \|Qs\|_{S}^{2} \right\}. \tag{21}
\]
We have seen in Section 2.1 that the above problem can also be written in the (more direct) form
\[
\min_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} |1 - y_i(f(x_i) + b)|_{+}^{2} + \lambda \|f\|_{H}^{2} \right\} \tag{22}
\]
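To make the RN reading of (22) concrete, here is a small sketch (ours; the Gaussian kernel and the generic quasi-Newton solver are our choices, not the QP machinery normally used for (20)) that minimizes the truncated-square-loss functional over a representer expansion of f with an unpenalized offset b.

\begin{verbatim}
# A sketch (ours) of the RN form (22) of L2-SVM: truncated square loss,
# unpenalized offset, generic smooth optimization over f = sum_j c_j K(., x_j).
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def l2svm_rn_fit(X, y, lam=0.1, sigma=1.0):
    ell = len(y)
    K = gaussian_kernel(X, X, sigma)
    def objective(p):
        c, b = p[:ell], p[ell]
        slack = np.maximum(1.0 - y * (K @ c + b), 0.0)   # |1 - y_i(f(x_i)+b)|_+
        return (slack ** 2).mean() + lam * c @ K @ c     # functional in (22)
    res = minimize(objective, np.zeros(ell + 1), method="L-BFGS-B")
    return res.x[:ell], res.x[ell]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (15, 2)), rng.normal(+1, 1, (15, 2))])
y = np.hstack([-np.ones(15), np.ones(15)])
c, b = l2svm_rn_fit(X, y)
print(np.mean(np.sign(gaussian_kernel(X, X) @ c + b) == y))   # training accuracy
\end{verbatim}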
3.1.2 Least Squares SVM (LS-SVM).
We consider (see [5] and [21] for reference) the problem
\[
\min_{w,\, b,\, \xi_i}\left\{ C\sum_{i=1}^{\ell} \xi_i^{2} + \frac{1}{2}\langle w, w\rangle \right\} \tag{23}
\]
\[
\text{s.t. } \forall i \quad y_i(w \cdot \Phi(x_i) + b) = 1 - \xi_i, \quad \xi_i \ge 0,
\]
the inequality constraints are now replaced by equality constraints and the squares of the slack variables are considered; moreover, an unpenalized offset term is added. It is easy to see that the above problem corresponds to a regularization network with square loss, the sum space S as hypothesis space and again penalty term $\lambda \|Qs\|_{S}^{2}$. It follows that in this case the minimization problem can be written as
\[
\min_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} (1 - y_i s(x_i))^{2} + \lambda \|Qs\|_{S}^{2} \right\}. \tag{24}
\]
As we have seen in Subsection 2.1 another equivalent formulation is the following
\[
\min_{f \in H,\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} (1 - y_i(f(x_i) + b))^{2} + \lambda \|f\|_{H}^{2} \right\}. \tag{25}
\]
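With the square loss, problem (25) also admits a closed form: assuming a strictly positive definite kernel matrix and the representer expansion $f = \sum_j c_j K(\cdot, x_j)$, the first-order conditions reduce to the linear system $(K + \lambda\ell I)c + b\mathbf{1} = y$ together with $\mathbf{1}^{\top}c = 0$. The sketch below (ours, for illustration only) solves this system; K can be, e.g., the Gaussian Gram matrix used in the previous sketches.

\begin{verbatim}
# A sketch (ours) of LS-SVM in its RN form (25): square loss with an
# unpenalized offset. Assuming a strictly positive definite Gram matrix K,
# the stationarity conditions read (K + lam*ell*I) c + b*1 = y and 1^T c = 0.
import numpy as np

def lssvm_fit(K, y, lam=0.1):
    ell = len(y)
    A = np.zeros((ell + 1, ell + 1))
    A[:ell, :ell] = K + lam * ell * np.eye(ell)
    A[:ell, ell] = 1.0           # offset column
    A[ell, :ell] = 1.0           # constraint row: 1^T c = 0
    rhs = np.concatenate([y, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:ell], sol[ell]   # coefficients c and offset b
\end{verbatim}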
We now consider algorithms where the offset term is quadratically penalized.
3.2 Regularization Networks Penalizing the Offset Term
For all the following algorithms the bias term appears explicitly in the form of
the solution but it is now quadratically penalized by the complexity term.
3.2.1 Modified SVM.
We consider (see [13] for reference) the problem
\[
\min_{w,\, b,\, \xi_i}\left\{ C\sum_{i=1}^{\ell} \xi_i + \frac{1}{2}\big(\langle w, w\rangle + b^{2}\big) \right\} \tag{26}
\]
\[
\text{s.t. } \forall i \quad y_i(w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\]
that is, we now penalize the bias term b. The above problem can be easily
interpreted in the sum space S with kernel K + 1. In fact by Theorem 2.1 in
Subsection 2.1, it is equivalent to the problem
\[
\min_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} |1 - y_i s(x_i)|_{+} + \lambda \|s\|_{S}^{2} \right\}. \tag{27}
\]
As we mentioned before the above situation is formally equivalent to the case in
which no offset term is considered.
3.2.2 Smooth SVM (SSVM).
We consider (see [15] for reference) the problem
\[
\min_{w,\, b,\, \xi_i}\left\{ C\sum_{i=1}^{\ell} \xi_i^{2} + \frac{1}{2}\big(\langle w, w\rangle + b^{2}\big) \right\} \tag{28}
\]
\[
\text{s.t. } \forall i \quad y_i(w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\]
that is we consider the squares of the slack variables and do penalize the bias term b.
The above problem can be seen as the regularization network obtained considering the truncated square loss and searching for a solution in the sum space
S. In fact we are considering the minimization problem
\[
\min_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} |1 - y_i s(x_i)|_{+}^{2} + \lambda \|s\|_{S}^{2} \right\}. \tag{29}
\]
3.2.3 Proximal SVM (PSVM).
We consider (see [11] for reference) the problem
\[
\min_{w,\, b,\, \xi_i}\left\{ C\sum_{i=1}^{\ell} \xi_i^{2} + \frac{1}{2}\big(\langle w, w\rangle + b^{2}\big) \right\} \tag{30}
\]
\[
\text{s.t. } \forall i \quad y_i(w \cdot \Phi(x_i) + b) = 1 - \xi_i, \quad \xi_i \ge 0,
\]
the inequality constraints are now replaced by equality constraints and the squares of the slack variables are considered; moreover, the bias term b is penalized.
The above problem can be interpreted in the sum space S. In fact we are
considering the minimization problem
\[
\min_{s \in S}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} (1 - y_i s(x_i))^{2} + \lambda \|s\|_{S}^{2} \right\}. \tag{31}
\]
It should be apparent that the above problem is completely equivalent to the
RLSC algorithm in the RKHS S.
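Computationally, this equivalence means that PSVM can be trained exactly as in the RLSC sketch of Section 2, simply using the Gram matrix of the kernel K + 1; a minimal sketch (ours, under the same assumptions as before):

\begin{verbatim}
# PSVM in its RN form (31) (our sketch): plain RLSC with the kernel K + 1.
import numpy as np

def psvm_fit(K, y, lam=0.1):
    ell = len(y)
    alpha = np.linalg.solve(K + 1.0 + lam * ell * np.eye(ell), y)
    return alpha   # s(x) = sum_j alpha_j (K(x, x_j) + 1)
\end{verbatim}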
4 One-class Support Vector Machine as a Regularization Network
One-class classification techniques were originally developed to cope with binary
classification problems in which statistics for one of the two classes was virtually
absent ([22, 18]). In this setting the component of the training set $(x_i, y_i)_{i=1}^{\ell}$ labelled according to the minority class ($y = -1$ in the following) is intentionally removed, generating the reduced one-class training set $(x_i)_{i=1}^{\ell_+}$.
Intuitively, the idea behind the one-class SVM algorithm, in its simplest formulation, is to look for the smallest sphere enclosing the examples in the data
space. Hence the training procedure amounts to the solution of the following
constrained minimization problem with respect to the balls of center a and radius
R in the input space
\[
\min_{R,\, \xi_i}\left\{ C\sum_{i=1}^{\ell_+} \xi_i + R^{2} \right\}, \tag{32}
\]
conditioned to the existence of a vector $a \in \mathbb{R}^n$ s.t. $\forall i \le \ell_+$
\[
(x_i - a) \cdot (x_i - a)^{T} \le R^{2} + \xi_i, \quad \xi_i \ge 0.
\]
A non-linear extension of the previous algorithm can be directly achieved by substituting scalar products with kernel functions. From a more geometrical point of view, we consider balls in a suitable feature space, that is, the squared distance appearing in (32) is now replaced by the expression $(\Phi(x_i) - a) \cdot (\Phi(x_i) - a)^{T}$. It can be shown (see for example [7]) that the centers a can be mapped one-to-one onto the functions f of the RKHS H of kernel $K(x, s) = \Phi(x) \cdot \Phi(s)$. This correspondence is such that $\Phi(x) \cdot a = f(x)$, so that the problem can be written as follows
\[
\min_{R^{2},\, \xi_i}\left\{ C\sum_{i=1}^{\ell_+} \xi_i + R^{2} \right\}, \tag{33}
\]
requiring that there exists a vector $f \in H$ s.t. $\forall i \le \ell_+$
\[
K(x_i, x_i) - 2f(x_i) + \|f\|_{H}^{2} \le R^{2} + \xi_i, \quad \xi_i \ge 0.
\]
In the minimization problems above the relevant complexity measure $R^2$ runs over non-negative real values. In order to interpret the above problems from a regularization point of view it is convenient to slightly modify the problem, allowing the complexity measure to run over unconstrained reals. This is obtained simply by introducing the real variable ρ and formulating the modified problem as follows
\[
\min_{\rho,\, \xi_i}\left\{ C\sum_{i=1}^{\ell_+} \xi_i + \rho \right\}, \tag{34}
\]
conditioned to the existence of a function $f \in H$ s.t. $\forall i \le \ell_+$
\[
K(x_i, x_i) - 2f(x_i) + \|f\|_{H}^{2} \le \rho + \xi_i, \quad \xi_i \ge 0.
\]
The two problems are indeed virtually equivalent; in fact it is straightforward to verify that whenever $C > \frac{1}{\ell_+}$ problems (33) and (34) have the same solutions. On the other hand, if $C < \frac{1}{\ell_+}$ both problems are trivial: any solution of (33) attains $R = 0$ while no solution of (34) exists.
Introducing the offset b and the fixed bias function D(x), and suitably rescaling the parameter C and the kernel K, the previous problem becomes
\[
\min_{f,\, b,\, \xi_i}\left\{ \tilde{C}\sum_{i=1}^{\ell_+} \xi_i + \frac{1}{2}\big(\|f\|_{\tilde{H}}^{2} + b\big) \right\}, \tag{35}
\]
\[
\text{s.t. } \forall i \le \ell_+ \quad D(x_i) + f(x_i) + b \ge -\xi_i, \quad \xi_i \ge 0,
\]
where we set
\[
b = \frac{1}{2}\big(\rho - \|f\|_{H}^{2} - K(0, 0)\big), \qquad
D(x) = \frac{1}{2}\big(K(x, x) - K(0, 0)\big), \qquad
\tilde{C} = \frac{1}{4}C, \qquad
\tilde{K} = 2K.
\]
The couple (f, b) in (35) runs over $\tilde{H} \times \mathbb{R}$, with $\tilde{H}$ the RKHS of kernel $\tilde{K}$.
Finally, by standard reasoning the problem can be rewritten in terms of loss function, penalty term and the original two-class training set; in fact, considering the loss function
\[
V(y, w) = \theta(y) \cdot |-wy|_{+}, \tag{36}
\]
which matches the general form in Eq. (2) (θ denoting the Heaviside step function), we easily obtain
\[
\min_{f \in \tilde{H},\, b \in \mathbb{R}}\left\{ \frac{1}{\ell}\sum_{i=1}^{\ell} V(y_i, D(x_i) + f(x_i) + b) + \lambda\big(\|f\|_{\tilde{H}}^{2} + b\big) \right\}. \tag{37}
\]
By definition the fixed bias function D(x) is null for translation-invariant kernel functions (e.g. the Gaussian kernel). In this case problem (37) fits the general regularization network form (4) with c = 1, as claimed.
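For completeness, the following small sketch (ours) implements the asymmetric loss (36), with θ taken to be the Heaviside step function: examples of the removed class (y = -1) contribute nothing, while retained examples are penalized only when their score D(x) + f(x) + b is negative.

\begin{verbatim}
# A sketch (ours) of the asymmetric one-class loss (36): V(y, w) = theta(y)|-wy|_+.
import numpy as np

def one_class_loss(y, w):
    theta = (y > 0).astype(float)            # Heaviside step: 1 for y=+1, 0 for y=-1
    return theta * np.maximum(-w * y, 0.0)   # max(0, -w) on the retained class

print(one_class_loss(np.array([1, 1, -1]), np.array([-0.5, 2.0, -3.0])))
# prints [0.5 0.  0. ]
\end{verbatim}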
5 Remarks and Conclusions
In this paper we reviewed several SV algorithms as particular instances of functional minimization on a RKHS. We showed that each algorithm can be recovered in the general RN framework by choosing an appropriate loss function and penalty term. From this formulation many theoretical results easily follow.
Consistency results are available for all the NS-SVM algorithms considered (see for example [20] and references therein). A reasonable question would be to ask which algorithm provides the best performance in the classification task, but unfortunately the theory shows that there cannot be a general answer to this question (see [9]). Moreover, most of the theoretical results concerning consistency do not shed light on the role (if any at all) played by the underlying loss function (some results can be found in [17], [2], [25]) and, most importantly, the probabilistic bounds at the basis of the consistency results are often too loose to be useful in practice.
Representation theorems for the solution of the various methods can be applied.
In particular, recent results ([8], [25]) express the solution in closed form by exploiting a convexity assumption on the loss function.
Results about existence and uniqueness can be found in [4] and [8] for all the NS-SVM algorithms except L2-SVM, though the latter should be easily treated using results on the standard SVM.
It would be interesting to extend our review to algorithms using more general penalty terms (as for instance the L1 norm); in such cases a careful mathematical analysis is required since we are no longer working in a Hilbertian setting (see [23] and [14]).
Other interesting topics could be the extension of known results about consistency and generalization to one-class SVM and to the case of asymmetric loss functions.
References
[1] Aronszajn, N.: Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337-404, 1950.
[2] Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D.: Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley, 2003.
[3] Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D.: Discussion of boosting papers. The Annals of Statistics, to appear, 2003.
[4] Burges, C. and Crisp, D.: Uniqueness theorems for kernel methods. To appear in Neurocomputing.
[5] Cristianini, N. and Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
[6] Cucker, F. and Smale, S.: Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2, 413-428, 2002.
[7] Cucker, F. and Smale, S.: On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39, 1-49, 2002.
[8] De Vito, E., Rosasco, L., Caponnetto, A., Piana, M. and Verri, A.: Some properties of regularized kernel methods. Submitted, 2003.
[9] Devroye, L., Györfi, L. and Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics, 31, Springer, New York, 1996.
[10] Evgeniou, T., Pontil, M. and Poggio, T.: Regularization networks and support vector machines. Adv. Comp. Math., 13, 1-50, 2000.
[11] Fung, G. and Mangasarian, O. L.: Proximal support vector machine classifiers. Technical Report 01-02, Data Mining Institute, University of Wisconsin, Madison, February 2001.
[12] Mangasarian, O. L. and Musicant, D. R.: Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161-177, 2001.
[13] Mangasarian, O. L. and Musicant, D. R.: Data discrimination via nonlinear generalized support vector machines. In: Complementarity: Applications, Algorithms and Extensions, 233-251, Kluwer Academic Publishers, 2001.
[14] Micchelli, C. A. and Pontil, M.: A function representation for learning in Banach spaces. Research Note RN/04/03, Dept. of Computer Science, UCL, February 2004.
[15] Lee, Y. J. and Mangasarian, O. L.: SSVM: a smooth support vector machine for classification. Computational Optimization and Applications, 20(1), 5-22, 2001.
[16] Lin, Y., Lee, Y. and Wahba, G.: Support vector machines for classification in nonstandard situations. Machine Learning, 46, 191-202, 2002.
[17] Rosasco, L., De Vito, E., Caponnetto, A., Piana, M. and Verri, A.: Are loss functions all the same? Neural Computation, 16(6), 1063-1076, 2004.
[18] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J. and Williamson, R. C.: Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443-1471, 2001.
[19] Schwartz, L.: Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés. Journal d'Analyse Mathématique, 13, 115-256, 1964.
[20] Steinwart, I.: Consistency of support vector machines and other regularized kernel machines. Submitted to IEEE Transactions on Information Theory, 2002.
[21] Suykens, J. A. K., Van Gestel, T., De Brabanter, J. and De Moor, B.: Least Squares Support Vector Machines. World Scientific, 2002.
[22] Tax, D. M. J. and Duin, R. P. W.: Data domain description using support vectors. In: Proceedings of the European Symposium on Artificial Neural Networks '99, Brugge, 1999.
[23] von Luxburg, U. and Bousquet, O.: Distance-based classification with Lipschitz functions. Submitted, 2004.
[24] Vapnik, V.: Statistical Learning Theory. Wiley, New York, 1998.
[25] Zhang, T.: Convergence of large margin separable linear classification. In: Advances in Neural Information Processing Systems 13, 357-363, MIT Press, 2001.