Lecture 05 - NYU Computer Science

Foundations of Machine Learning
Kernel Methods
Mehryar Mohri
Courant Institute and Google Research
[email protected]
Motivation
Efficient computation of inner products in high
dimension.
Non-linear decision boundary.
Non-vectorial inputs.
Flexible selection of more complex features.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Non-Linear Separation
Linear separation impossible in most problems.
Non-linear mapping from input space to high-dimensional feature space: Φ: X → F.
Generalization ability: independent of dim(F ),
depends only on margin and sample size.
Kernel Methods
Idea:
• Define K: X × X → R, called a kernel, such that:
Φ(x) · Φ(y) = K(x, y).
• K often interpreted as a similarity measure.
Benefits:
• Efficiency: K is often more efficient to compute than Φ and the dot product.
• Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (PDS condition or Mercer's condition).
PDS Condition
Definition: a kernel K: X × X → R is positive definite symmetric (PDS) if for any {x₁, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semi-definite (SPSD).
K is SPSD if it is symmetric and one of the two equivalent conditions holds:
• its eigenvalues are non-negative;
• for any c ∈ R^{m×1}, c⊤Kc = ∑_{i,j=1}^m c_i c_j K(x_i, x_j) ≥ 0.
Terminology: PDS for kernels, SPSD for kernel matrices (see (Berg et al., 1984)).
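As a quick sanity check of the SPSD condition (an illustrative sketch, not part of the original slides; assumes NumPy and an arbitrary polynomial kernel and sample):

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # Polynomial kernel K(x, y) = (x . y + c)^d, c > 0.
    return (np.dot(x, y) + c) ** d

# Small sample {x_1, ..., x_m} in R^2 (arbitrary choice for illustration).
X = np.array([[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
m = len(X)

# Kernel (Gram) matrix K = [K(x_i, x_j)]_{ij}.
K = np.array([[poly_kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

# SPSD check: symmetry and non-negative eigenvalues (up to numerical tolerance).
assert np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
print("eigenvalues:", eigvals)
assert np.all(eigvals >= -1e-10)
```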
Example - Polynomial Kernels
Definition: for all x, y ∈ R^N,
K(x, y) = (x · y + c)^d, c > 0.
Example: for N = 2 and d = 2,
K(x, y) = (x₁y₁ + x₂y₂ + c)²
 = (x₁², x₂², √2 x₁x₂, √(2c) x₁, √(2c) x₂, c) · (y₁², y₂², √2 y₁y₂, √(2c) y₁, √(2c) y₂, c).
XOR Problem
Use second-degree polynomial kernel with c = 1:
[Figure: left panel, axes (x₁, x₂): the four points (−1, 1), (1, 1), (−1, −1), (1, −1) are linearly non-separable. Right panel, axes (√2 x₁, √2 x₁x₂): their images
(−1, 1) ↦ (1, 1, −√2, −√2, +√2, 1),  (1, 1) ↦ (1, 1, +√2, +√2, +√2, 1),
(−1, −1) ↦ (1, 1, +√2, −√2, −√2, 1),  (1, −1) ↦ (1, 1, −√2, +√2, −√2, 1)
are linearly separable by x₁x₂ = 0.]
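A small illustrative sketch (not from the original slides; assumes NumPy) verifying on the XOR points that the explicit degree-2 feature map satisfies Φ(x)·Φ(y) = (x·y + c)², and that the √2 x₁x₂ coordinate separates the two classes:

```python
import numpy as np

def phi(x, c=1.0):
    # Explicit feature map of the degree-2 polynomial kernel (x . y + c)^2 in R^2.
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def K(x, y, c=1.0):
    return (np.dot(x, y) + c) ** 2

points = np.array([[-1., 1.], [1., 1.], [-1., -1.], [1., -1.]])
labels = [-1, +1, +1, -1]  # XOR labeling: class = sign(x1 * x2)

# Check Phi(x) . Phi(y) = K(x, y) for all pairs of points.
for x in points:
    for y in points:
        assert np.isclose(np.dot(phi(x), phi(y)), K(x, y))

# In feature space, the third coordinate sqrt(2) x1 x2 alone separates the classes.
for x, y_lab in zip(points, labels):
    print(x, phi(x)[2], y_lab)
```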
Normalized Kernels
Definition: the normalized kernel K′ associated to a kernel K is defined by
∀x, x′ ∈ X, K′(x, x′) = 0 if (K(x, x) = 0) or (K(x′, x′) = 0),
and K′(x, x′) = K(x, x′)/√(K(x, x)K(x′, x′)) otherwise.
• If K is PDS, then K′ is PDS:
∑_{i,j=1}^m c_i c_j K′(x_i, x_j) = ∑_{i,j=1}^m c_i c_j ⟨Φ(x_i), Φ(x_j)⟩ / √(K(x_i, x_i)K(x_j, x_j))
 = ∑_{i,j=1}^m c_i c_j ⟨Φ(x_i)/‖Φ(x_i)‖_H, Φ(x_j)/‖Φ(x_j)‖_H⟩
 = ‖ ∑_{i=1}^m c_i Φ(x_i)/‖Φ(x_i)‖_H ‖²_H ≥ 0.
• By definition, for all x with K(x, x) ≠ 0, K′(x, x) = 1.
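A minimal sketch of kernel normalization (not part of the original slides; assumes NumPy, and an arbitrary polynomial kernel for illustration):

```python
import numpy as np

def normalize_kernel_matrix(K):
    # Normalized kernel K'(x, x') = K(x, x') / sqrt(K(x, x) K(x', x')),
    # with K'(x, x') = 0 whenever K(x, x) = 0 or K(x', x') = 0.
    diag = np.diag(K)
    denom = np.sqrt(np.outer(diag, diag))
    return np.divide(K, denom, out=np.zeros_like(K), where=denom > 0)

# Example: normalize a polynomial-kernel Gram matrix.
X = np.random.RandomState(0).randn(5, 3)
K = (X @ X.T + 1.0) ** 2
Kp = normalize_kernel_matrix(K)

print(np.diag(Kp))                       # all ones: K'(x, x) = 1 when K(x, x) != 0
print(np.linalg.eigvalsh(Kp) >= -1e-10)  # normalization preserves the PDS property
```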
Other Standard PDS Kernels
Gaussian kernels:
K(x, y) = exp(−‖x − y‖²/(2σ²)), σ ≠ 0.
• Normalized kernel of (x, x′) ↦ exp(x · x′ / σ²).
Sigmoid kernels:
K(x, y) = tanh(a(x · y) + b), a, b ≥ 0.
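A small numerical check (not from the original slides; assumes NumPy, arbitrary sample and σ) that the Gaussian kernel is indeed the normalized kernel of (x, x′) ↦ exp(x·x′/σ²):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(6, 4)
sigma = 1.5

# K0(x, x') = exp(x . x' / sigma^2)
K0 = np.exp(X @ X.T / sigma**2)

# Normalization of K0: K0(x, x') / sqrt(K0(x, x) K0(x', x'))
d = np.diag(K0)
K0_norm = K0 / np.sqrt(np.outer(d, d))

# Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists / (2 * sigma**2))

print(np.allclose(K0_norm, K_gauss))  # True: the Gaussian kernel is the normalized K0
```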
Reproducing Kernel Hilbert Space
(Aronszajn, 1950)
Theorem: Let K: X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that
∀x, y ∈ X, K(x, y) = Φ(x) · Φ(y).
Proof: For any x ∈ X, define Φ(x) ∈ R^X as follows:
∀y ∈ X, Φ(x)(y) = K(x, y).
• Let H₀ = { ∑_{i∈I} a_i Φ(x_i) : a_i ∈ R, x_i ∈ X, card(I) < ∞ }.
• We are going to define an inner product ⟨·, ·⟩ on H₀.
• Definition: for any f = ∑_{i∈I} a_i Φ(x_i), g = ∑_{j∈J} b_j Φ(y_j),
⟨f, g⟩ = ∑_{i∈I, j∈J} a_i b_j K(x_i, y_j) = ∑_{j∈J} b_j f(y_j) = ∑_{i∈I} a_i g(x_i).
• ⟨·, ·⟩ does not depend on the representations of f and g.
• ⟨·, ·⟩ is bilinear and symmetric.
• ⟨·, ·⟩ is positive semi-definite since K is PDS: for any f,
⟨f, f⟩ = ∑_{i,j∈I} a_i a_j K(x_i, x_j) ≥ 0.
• note: for any f₁, ..., f_m and c₁, ..., c_m,
∑_{i,j=1}^m c_i c_j ⟨f_i, f_j⟩ = ⟨ ∑_{i=1}^m c_i f_i, ∑_{j=1}^m c_j f_j ⟩ ≥ 0.
⟨·, ·⟩ is a PDS kernel on H₀.
• ⟨·, ·⟩ is definite:
• first, the Cauchy-Schwarz inequality for PDS kernels: if K is PDS, then the matrix
M = [[K(x, x), K(x, y)], [K(y, x), K(y, y)]]
is SPSD for all x, y ∈ X. In particular, the product of its eigenvalues, det(M), is non-negative:
det(M) = K(x, x)K(y, y) − K(x, y)² ≥ 0.
• since ⟨·, ·⟩ is a PDS kernel, for any f ∈ H₀ and x ∈ X,
⟨f, Φ(x)⟩² ≤ ⟨f, f⟩ ⟨Φ(x), Φ(x)⟩.
• observe the reproducing property of ⟨·, ·⟩:
∀f ∈ H₀, ∀x ∈ X, f(x) = ∑_{i∈I} a_i K(x_i, x) = ⟨f, Φ(x)⟩.
• Thus, [f(x)]² ≤ ⟨f, f⟩ K(x, x) for all x ∈ X, which shows the definiteness of ⟨·, ·⟩.
• Thus, ⟨·, ·⟩ defines an inner product on H₀, which thereby becomes a pre-Hilbert space.
• H₀ can be completed to form a Hilbert space H in which it is dense.
Notes:
• H is called the reproducing kernel Hilbert space (RKHS) associated to K.
• A Hilbert space such that there exists Φ: X → H with K(x, y) = Φ(x) · Φ(y) for all x, y ∈ X is also called a feature space associated to K. Φ is called a feature mapping.
• Feature spaces associated to K are in general not unique.
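An illustrative numerical sketch (not from the original slides; assumes NumPy and an arbitrary Gaussian kernel and random points): for f = ∑ᵢ aᵢ K(xᵢ, ·), it evaluates f(x) via the reproducing property and checks the inequality [f(x)]² ≤ ⟨f, f⟩ K(x, x) used in the definiteness argument.

```python
import numpy as np

def K(x, y, sigma=1.0):
    # Gaussian kernel, a PDS kernel on R^n.
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma**2))

rng = np.random.RandomState(0)
xs = rng.randn(4, 2)   # points x_i defining f = sum_i a_i K(x_i, .)
a = rng.randn(4)       # coefficients a_i
x = rng.randn(2)       # an arbitrary evaluation point

# Reproducing property: f(x) = <f, Phi(x)> = sum_i a_i K(x_i, x).
f_x = sum(a[i] * K(xs[i], x) for i in range(4))

# RKHS squared norm: <f, f> = sum_{i,j} a_i a_j K(x_i, x_j).
f_norm_sq = sum(a[i] * a[j] * K(xs[i], xs[j]) for i in range(4) for j in range(4))

# Cauchy-Schwarz in H0: [f(x)]^2 <= <f, f> K(x, x).
print(f_x**2, "<=", f_norm_sq * K(x, x), ":", f_x**2 <= f_norm_sq * K(x, x) + 1e-12)
```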
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
SVMs with PDS Kernels
(Boser, Guyon, and Vapnik, 1992)
Constrained optimization:
max_α ∑_{i=1}^m α_i − (1/2) ∑_{i,j=1}^m α_i α_j y_i y_j K(x_i, x_j)   (with K(x_i, x_j) = Φ(x_i)·Φ(x_j))
subject to: 0 ≤ α_i ≤ C ∧ ∑_{i=1}^m α_i y_i = 0, i ∈ [1, m].
Solution:
h(x) = sgn( ∑_{i=1}^m α_i y_i K(x_i, x) + b ),
with b = y_i − ∑_{j=1}^m α_j y_j K(x_j, x_i) for any x_i with 0 < α_i < C.
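A minimal usage sketch (not from the original slides; assumes scikit-learn and NumPy are available, with an arbitrary polynomial kernel, C value, and synthetic XOR-like data): the dual problem above is solved here by SVC with a precomputed Gram matrix.

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel_matrix(X, Y, c=1.0, d=2):
    # Gram matrix of the polynomial kernel K(x, y) = (x . y + c)^d.
    return (X @ Y.T + c) ** d

# XOR-like training data: labels given by the sign of x1 * x2.
rng = np.random.RandomState(0)
X_train = rng.uniform(-1, 1, size=(40, 2))
y_train = np.sign(X_train[:, 0] * X_train[:, 1])

K_train = poly_kernel_matrix(X_train, X_train)
clf = SVC(kernel="precomputed", C=10.0)
clf.fit(K_train, y_train)

# To predict on new points, pass the matrix K(x_test, x_train).
X_test = rng.uniform(-1, 1, size=(5, 2))
K_test = poly_kernel_matrix(X_test, X_train)
print(clf.predict(K_test))
```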
Rad. Complexity of Kernel-Based Hypotheses
Theorem: Let K: X × X → R be a PDS kernel and let Φ: X → H be a feature mapping associated to K. Let S ⊆ {x : K(x, x) ≤ R²} be a sample of size m, and let H = {x ↦ w·Φ(x) : ‖w‖_H ≤ Λ}. Then,
R̂_S(H) ≤ Λ√(Tr[K])/m ≤ √(R²Λ²/m).
Proof:
R̂_S(H) = (1/m) E_σ[ sup_{‖w‖≤Λ} w · ∑_{i=1}^m σ_i Φ(x_i) ]
 = (Λ/m) E_σ[ ‖ ∑_{i=1}^m σ_i Φ(x_i) ‖ ]
 ≤ (Λ/m) [ E_σ ‖ ∑_{i=1}^m σ_i Φ(x_i) ‖² ]^{1/2}   (Jensen's ineq.)
 = (Λ/m) [ ∑_{i=1}^m K(x_i, x_i) ]^{1/2}
 = Λ√(Tr[K])/m ≤ √(R²Λ²/m).
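A Monte-Carlo sketch of the quantity bounded above (not from the original slides; assumes NumPy, with an arbitrary Gaussian kernel, sample size, and Λ): it estimates R̂_S(H) = (Λ/m) E_σ[√(σ⊤Kσ)] and compares it to Λ√(Tr[K])/m.

```python
import numpy as np

rng = np.random.RandomState(0)
m, Lambda, sigma_rbf = 50, 1.0, 1.0

# Sample and its Gaussian-kernel Gram matrix.
X = rng.randn(m, 3)
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * sigma_rbf**2))

# R_hat_S(H) = (Lambda / m) E_sigma[ sqrt(sigma^T K sigma) ]
# for H = {x -> w . Phi(x) : ||w|| <= Lambda}.
n_trials = 2000
vals = []
for _ in range(n_trials):
    s = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
    vals.append(np.sqrt(s @ K @ s))
estimate = Lambda / m * np.mean(vals)

bound = Lambda * np.sqrt(np.trace(K)) / m   # <= sqrt(R^2 Lambda^2 / m), R^2 = max K(x, x)
print(estimate, "<=", bound)
```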
Generalization: Representer Theorem
(Kimeldorf and Wahba, 1971)
Theorem: Let K: X × X → R be a PDS kernel with H the corresponding RKHS. Then, for any non-decreasing function G: R → R and any L: R^m → R ∪ {+∞}, the optimization problem
argmin_{h∈H} F(h) = argmin_{h∈H} G(‖h‖_H) + L(h(x₁), ..., h(x_m))
admits a solution of the form h* = ∑_{i=1}^m α_i K(x_i, ·).
If G is further assumed to be increasing, then any solution has this form.
• Proof: let H₁ = span({K(x_i, ·) : i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h₁ + h⊥ according to H = H₁ ⊕ H₁⊥.
• Since G is non-decreasing,
G(‖h₁‖_H) ≤ G(√(‖h₁‖²_H + ‖h⊥‖²_H)) = G(‖h‖_H).
• By the reproducing property, for all i ∈ [1, m],
h(x_i) = ⟨h, K(x_i, ·)⟩ = ⟨h₁, K(x_i, ·)⟩ = h₁(x_i).
• Thus, L(h(x₁), ..., h(x_m)) = L(h₁(x₁), ..., h₁(x_m)) and F(h₁) ≤ F(h).
• If G is increasing, then F(h₁) < F(h) when h⊥ ≠ 0, and any solution of the optimization problem must be in H₁.
Kernel-Based Algorithms
PDS kernels used to extend a variety of algorithms in classification and other areas:
• regression,
• ranking,
• dimensionality reduction,
• clustering.
But, how do we define PDS kernels?
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Closure Properties of PDS Kernels
Theorem: Positive definite symmetric (PDS) kernels are closed under:
• sum,
• product,
• tensor product,
• pointwise limit,
• composition with a power series with non-negative coefficients.
Closure Properties - Proof
Proof: closure under sum:
(c⊤Kc ≥ 0) ∧ (c⊤K′c ≥ 0) ⟹ c⊤(K + K′)c ≥ 0.
• closure under product: writing K = MM⊤,
∑_{i,j=1}^m c_i c_j (K_{ij} K′_{ij}) = ∑_{i,j=1}^m c_i c_j [ ∑_{k=1}^m M_{ik} M_{jk} ] K′_{ij}
 = ∑_{k=1}^m [ ∑_{i,j=1}^m c_i c_j M_{ik} M_{jk} K′_{ij} ]
 = ∑_{k=1}^m z_k⊤ K′ z_k ≥ 0,  where z_k = (c₁ M_{1k}, ..., c_m M_{mk})⊤.
• Closure under tensor product:
• definition: for all x₁, x₂, y₁, y₂ ∈ X,
(K₁ ⊗ K₂)(x₁, y₁, x₂, y₂) = K₁(x₁, x₂) K₂(y₁, y₂).
• thus, PDS kernel as a product of the kernels
(x₁, y₁, x₂, y₂) ↦ K₁(x₁, x₂) and (x₁, y₁, x₂, y₂) ↦ K₂(y₁, y₂).
• Closure under pointwise limit: if for all x, y ∈ X,
lim_{n→∞} K_n(x, y) = K(x, y),
then (∀n, c⊤K_n c ≥ 0) ⟹ lim_{n→∞} c⊤K_n c = c⊤Kc ≥ 0.
• Closure under composition with a power series:
• assumptions: K PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and f(x) = ∑_{n=0}^∞ a_n x^n, a_n ≥ 0, a power series with radius of convergence ρ.
• f ∘ K is a PDS kernel since K^n is PDS by closure under product, ∑_{n=0}^N a_n K^n is PDS by closure under sum, and the result follows by closure under pointwise limit.
Example: for any PDS kernel K, exp(K) is PDS.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Sequence Kernels
Definition: Kernels defined over pairs of strings.
• Motivation: computational biology, text and
speech classification.
• Idea: two sequences are related when they share
some common substrings or subsequences.
• Example: bigram kernel:
K(x, y) = ∑_{bigram u} count_x(u) × count_y(u).
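A direct (naive) implementation of the bigram kernel (not from the original slides; assumes the Python standard library), computing the same quantity that the transducer construction below computes via composition:

```python
from collections import Counter

def bigram_counts(s):
    # Counts of all contiguous bigrams of s.
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def bigram_kernel(x, y):
    # K(x, y) = sum_u count_x(u) * count_y(u) over bigrams u.
    cx, cy = bigram_counts(x), bigram_counts(y)
    return sum(cx[u] * cy[u] for u in cx)

# "ababab": ab x3, ba x2; "abba": ab x1, bb x1, ba x1  ->  3*1 + 2*1 = 5
print(bigram_kernel("ababab", "abba"))
```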
Weighted Transducers
[Figure: a weighted transducer over {a, b} with states 0, 1, 2 and final state 3/0.1, and transitions such as b:a/0.6, b:a/0.2, a:b/0.1, a:a/0.4, b:a/0.3, a:b/0.5.]
T(x, y) = sum of the weights of all accepting paths with input x and output y.
T(abb, baa) = .1 × .2 × .3 × .1 + .5 × .3 × .6 × .1.
Rational Kernels over Strings
(Cortes et al., 2004)
Definition: a kernel K: Σ* × Σ* → R is rational if K = T for some weighted transducer T.
Definition: let T₁: Σ* × Δ* → R and T₂: Δ* × Ω* → R be two weighted transducers. Then, the composition of T₁ and T₂ is defined for all x ∈ Σ*, y ∈ Ω* by
(T₁ ∘ T₂)(x, y) = ∑_z T₁(x, z) T₂(z, y).
Definition: the inverse of a transducer T: Σ* × Δ* → R is the transducer T⁻¹: Δ* × Σ* → R obtained from T by swapping input and output labels.
PDS Rational Kernels
General Construction
Theorem: for any weighted transducer T: Σ* × Δ* → R, the function K = T ∘ T⁻¹ is a PDS rational kernel.
Proof: by definition, for all x, y ∈ Σ*,
K(x, y) = ∑_z T(x, z) T(y, z).
• K is the pointwise limit of (K_n)_{n≥0} defined by
∀x, y ∈ Σ*, K_n(x, y) = ∑_{|z|≤n} T(x, z) T(y, z).
• K_n is PDS since, for any sample (x₁, ..., x_m), its kernel matrix can be written K_n = AA⊤ with A = (T(x_i, z_j))_{i∈[1,m], j∈[1,N]}, where z₁, ..., z_N enumerate the strings of length at most n.
PDS Sequence Kernels
PDS sequence kernels in computational biology, text classification, and other applications:
• special instances of PDS rational kernels.
• PDS rational kernels are easy to define and modify.
• single general algorithm for their computation: composition + shortest-distance computation.
• no need for a specific 'dynamic-programming' algorithm and proof for each kernel instance.
• general sub-family: based on counting transducers.
Counting Transducers
[Figure: counting transducer T_X with states 0 and 1/1, self-loops a:ε/1, b:ε/1 on both states, and a transition X:X/1 from 0 to 1.]
Example: X = ab, Z = bbabaabba, with matching alignments εεabεεεεε and εεεεεabεε.
X may be a string or an automaton representing a regular expression.
Count of occurrences of X in Z: sum of the weights of accepting paths of Z ∘ T_X.
Transducer Counting Bigrams
[Figure: bigram-counting transducer T_bigram with states 0, 1 and final state 2/1; self-loops a:ε/1, b:ε/1 on states 0 and 2, and transitions a:a/1, b:b/1 from 0 to 1 and from 1 to 2.]
Counts of the bigram ab in Z given by Z ∘ T_bigram ∘ ab.
Transducer Counting Gappy Bigrams
[Figure: gappy-bigram-counting transducer T_gappy bigram, identical to T_bigram except for additional self-loops a:ε/λ, b:ε/λ on the middle state, which allow gaps between the two symbols at cost λ per skipped symbol.]
Counts of gappy occurrences of ab in Z given by Z ∘ T_gappy bigram ∘ ab, gap penalty λ ∈ (0, 1).
Composition
Theorem: the composition of two weighted transducers is also a weighted transducer.
Proof: constructive proof based on the composition algorithm.
• states identified with pairs.
• ε-free case: transitions defined by
E = { ((q₁, q₁′), a, c, w₁ ⊗ w₂, (q₂, q₂′)) : (q₁, a, b, w₁, q₂) ∈ E₁, (q₁′, b, c, w₂, q₂′) ∈ E₂ }.
• general case: use of an intermediate ε-filter.
Composition Algorithm
ε-Free Case
[Figure: two weighted transducers T₁ and T₂ and their composition. Composed states are pairs of states of T₁ and T₂, e.g. (0, 0), (0, 1), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)/.42; each composed transition multiplies the weights of the matched transitions, e.g. a:b/.01, b:a/.06, a:a/.02, a:b/.18, b:a/.08, a:b/.24.]
Complexity: O(|T₁| |T₂|) in general, linear in some cases.
Redundant ε-Paths Problem
(MM, Pereira, and Riley, 1996; Pereira and Riley, 1997)
[Figure: composition in the presence of ε-transitions. The ε-transitions on the output side of T₁ and on the input side of T₂ are marked as ε₁ and ε₂, yielding marked transducers T̃₁ and T̃₂, and a filter transducer F is inserted between them so that each ε-path is matched only once.]
T = T̃₁ ∘ F ∘ T̃₂.
Kernels for Other Discrete Structures
Similarly, PDS kernels can be defined on other
discrete structures:
• images,
• graphs,
• parse trees,
• automata,
• weighted automata.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels
Questions
Gaussian kernels have the form exp(−d²) where d is a metric.
• for what other functions d does exp(−d²) define a PDS kernel?
• what other PDS kernels can we construct from a metric in a Hilbert space?
Negative Definite Kernels
(Schoenberg, 1938)
Definition: A function K: X × X → R is said to be a negative definite symmetric (NDS) kernel if it is symmetric and if for all {x₁, ..., x_m} ⊆ X and c ∈ R^{m×1} with 1⊤c = 0,
c⊤Kc ≤ 0.
Clearly, if K is PDS, then −K is NDS, but the converse does not hold in general.
Examples
The squared distance ‖x − y‖² in a Hilbert space H defines an NDS kernel. If ∑_{i=1}^m c_i = 0,
∑_{i,j=1}^m c_i c_j ‖x_i − x_j‖² = ∑_{i,j=1}^m c_i c_j (x_i − x_j) · (x_i − x_j)
 = ∑_{i,j=1}^m c_i c_j (‖x_i‖² + ‖x_j‖² − 2 x_i · x_j)
 = ∑_{i,j=1}^m c_i c_j (‖x_i‖² + ‖x_j‖²) − 2 ‖ ∑_{i=1}^m c_i x_i ‖²
 ≤ ∑_{i,j=1}^m c_i c_j (‖x_i‖² + ‖x_j‖²)
 = ∑_{i=1}^m c_i ‖x_i‖² ( ∑_{j=1}^m c_j ) + ( ∑_{i=1}^m c_i ) ∑_{j=1}^m c_j ‖x_j‖² = 0.
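A numerical check of this NDS property (an illustrative sketch, not part of the original slides; assumes NumPy, with arbitrary random points and zero-sum vectors):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 4)

# Squared-distance matrix D_ij = ||x_i - x_j||^2.
D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)

# For any c with sum(c) = 0, c^T D c should be <= 0 (NDS condition).
for _ in range(5):
    c = rng.randn(10)
    c -= c.mean()                  # enforce 1^T c = 0
    print(c @ D @ c <= 1e-10)      # True: ||x - y||^2 is NDS
```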
NDS Kernels - Property
(Schoenberg, 1938)
Theorem: Let K: X × X → R be an NDS kernel such that for all x, y ∈ X, K(x, y) = 0 iff x = y. Then, there exists a Hilbert space H and a mapping Φ: X → H such that
∀x, y ∈ X, K(x, y) = ‖Φ(x) − Φ(y)‖².
Thus, under the hypothesis of the theorem, √K defines a metric.
PDS and NDS Kernels
(Schoenberg, 1938)
Theorem: let K: X × X → R be a symmetric kernel. Then:
• K is NDS iff exp(−tK) is a PDS kernel for all t > 0.
• Let K′ be defined for any x₀ by
K′(x, y) = K(x, x₀) + K(y, x₀) − K(x, y) − K(x₀, x₀)
for all x, y ∈ X. Then, K is NDS iff K′ is PDS.
Example
The kernel defined by K(x, y) = exp(−t‖x − y‖²) is PDS for all t > 0 since ‖x − y‖² is NDS.
The kernel exp(−|x − y|^p) is not PDS for p > 2. Otherwise, for any t > 0, {x₁, ..., x_m} ⊆ X and c ∈ R^{m×1},
∑_{i,j=1}^m c_i c_j e^{−t|x_i − x_j|^p} = ∑_{i,j=1}^m c_i c_j e^{−|t^{1/p} x_i − t^{1/p} x_j|^p} ≥ 0.
This would imply that |x − y|^p is NDS for p > 2, but that cannot be (see past homework assignments).
Conclusion
PDS kernels:
• rich mathematical theory and foundation.
• general idea for extending many linear algorithms to non-linear prediction.
• flexible method: any PDS kernel can be used.
• widely used in modern algorithms and applications.
• can we further learn a PDS kernel and a hypothesis based on that kernel from labeled data? (see tutorial: http://www.cs.nyu.edu/~mohri/icml2011tutorial/)
References
•
•
N. Aronszajn, Theory of Reproducing Kernels, Trans. Amer. Math. Soc., 68, 337-404, 1950.
•
Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on
Semigroups. Springer-Verlag: Berlin-New York, 1984.
•
Bernhard Boser, Isabelle M. Guyon, and Vladimir Vapnik. A training algorithm for optimal
margin classifiers. In proceedings of COLT 1992, pages 144-152, Pittsburgh, PA, 1992.
•
Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and
Algorithms. Journal of Machine Learning Research (JMLR), 5:1035-1062, 2004.
•
Corinna Cortes and Vladimir Vapnik, Support-Vector Networks, Machine Learning, 20,
1995.
•
Kimeldorf, G. and Wahba, G. Some results on Tchebycheffian Spline Functions, J. Mathematical
Analysis and Applications, 33, 1 (1971) 82-95.
Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector
machines and other pattern classifiers. In Advances in kernel methods: support vector learning,
pages 43–54. MIT Press, Cambridge, MA, USA, 1999.
References
•
James Mercer. Functions of Positive and Negative Type, and Their Connection with the
Theory of Integral Equations. In Proceedings of the Royal Society of London. Series A,
Containing Papers of a Mathematical and Physical Character, Vol. 83, No. 559, pp. 69-70, 1909.
•
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and
Speech Processing, In Proceedings of the 12th biennial European Conference on Artificial
Intelligence (ECAI-96),Workshop on Extended finite state models of language. Budapest,
Hungary, 1996.
•
Fernando C. N. Pereira and Michael D. Riley. Speech Recognition by Composition of
Weighted Finite Automata. In Finite-State Language Processing, pages 431-453. MIT Press,
1997.
•
I. J. Schoenberg, Metric Spaces and Positive Definite Functions. Transactions of the American
Mathematical Society, Vol. 44, No. 3, pp. 522-536, 1938.
•
Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
Appendix
Mercer’s Condition
(Mercer, 1909)
Theorem: Let X ⊂ R^N be a compact subset and let K: X × X → R be in L_∞(X × X) and symmetric. Then, K admits a uniformly convergent expansion
K(x, y) = ∑_{n=0}^∞ a_n φ_n(x) φ_n(y), with a_n > 0,
iff for any function c in L₂(X),
∫_{X×X} c(x) c(y) K(x, y) dx dy ≥ 0.
SVMs with PDS Kernels
Constrained optimization:
max_α 2 1⊤α − (α ∘ y)⊤ K (α ∘ y)   (∘: Hadamard product)
subject to: 0 ≤ α ≤ C ∧ α⊤y = 0.
Solution:
h = sgn( ∑_{i=1}^m α_i y_i K(x_i, ·) + b ),
with b = y_i − (α ∘ y)⊤ K e_i for any x_i with 0 < α_i < C.