Foundations of Machine Learning
Kernel Methods

Mehryar Mohri
Courant Institute and Google Research
[email protected]

Motivation

• Efficient computation of inner products in high dimension.
• Non-linear decision boundaries.
• Non-vectorial inputs.
• Flexible selection of more complex features.

This Lecture

• Kernels
• Kernel-based algorithms
• Closure properties
• Sequence kernels
• Negative kernels

Non-Linear Separation

• Linear separation is impossible in most problems.
• Non-linear mapping from the input space to a high-dimensional feature space: Φ: X → F.
• Generalization ability: independent of dim(F); it depends only on the margin and the sample size.

Kernel Methods

Idea:
• Define K: X × X → R, called a kernel, such that Φ(x) · Φ(y) = K(x, y).
• K is often interpreted as a similarity measure.

Benefits:
• Efficiency: K is often more efficient to compute than Φ and the dot product.
• Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (PDS condition, or Mercer's condition).

PDS Condition

Definition: a kernel K: X × X → R is positive definite symmetric (PDS) if for any {x_1, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semi-definite (SPSD).

K is SPSD if it is symmetric and one of the two equivalent conditions holds:
• its eigenvalues are non-negative;
• for any c ∈ R^{m×1},
  c^\top \mathbf{K} c = \sum_{i,j=1}^m c_i c_j K(x_i, x_j) \ge 0.

Terminology: PDS for kernels, SPSD for kernel matrices (see (Berg et al., 1984)).

Example - Polynomial Kernels

Definition: for all x, y ∈ R^N, K(x, y) = (x · y + c)^d, c > 0.

Example: for N = 2 and d = 2,
K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2
= \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix} \cdot \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{bmatrix}.

XOR Problem

Use the second-degree polynomial kernel with c = 1. The XOR points (1, 1), (-1, 1), (-1, -1), (1, -1) are linearly non-separable in the input space (x_1, x_2). Their images under the associated feature map,
(1, 1) → (1, 1, +√2, +√2, +√2, 1)
(-1, 1) → (1, 1, -√2, -√2, +√2, 1)
(-1, -1) → (1, 1, +√2, -√2, -√2, 1)
(1, -1) → (1, 1, -√2, +√2, -√2, 1),
are linearly separable in the feature space (plotted along the coordinates √2 x_1 x_2 and √2 x_1 in the figure) by the hyperplane x_1 x_2 = 0.

Normalized Kernels

Definition: the normalized kernel K' associated to a kernel K is defined for all x, x' ∈ X by
K'(x, x') = 0 if K(x, x) = 0 or K(x', x') = 0,
K'(x, x') = \frac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}} otherwise.

• If K is PDS, then K' is PDS:
\sum_{i,j=1}^m c_i c_j K'(x_i, x_j) = \sum_{i,j=1}^m \frac{c_i c_j \langle \Phi(x_i), \Phi(x_j) \rangle_H}{\|\Phi(x_i)\|_H \|\Phi(x_j)\|_H} = \Big\| \sum_{i=1}^m \frac{c_i \Phi(x_i)}{\|\Phi(x_i)\|_H} \Big\|_H^2 \ge 0.
• By definition, for all x with K(x, x) ≠ 0, K'(x, x) = 1.

Other Standard PDS Kernels

• Gaussian kernels:
K(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big), \quad \sigma \ne 0.
This is the normalized kernel associated to (x, x') ↦ \exp\big(\frac{x \cdot x'}{\sigma^2}\big).
• Sigmoid kernels:
K(x, y) = \tanh(a (x \cdot y) + b), \quad a, b \ge 0.
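To see the polynomial-kernel identity concretely, here is a minimal NumPy sketch (not part of the original slides): it checks numerically that the explicit 6-dimensional feature map for N = 2, d = 2 reproduces K(x, y) = (x · y + c)^2, and that the XOR points become separable by the sign of the x_1 x_2 coordinate.

```python
import numpy as np

def poly_kernel(x, y, c=1.0):
    # Degree-2 polynomial kernel K(x, y) = (x.y + c)^2.
    return (np.dot(x, y) + c) ** 2

def phi(x, c=1.0):
    # Explicit feature map for N = 2, d = 2:
    # (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2c) x1, sqrt(2c) x2, c).
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

xor_points = [np.array(p, dtype=float) for p in [(1, 1), (-1, 1), (-1, -1), (1, -1)]]
labels = [1, -1, 1, -1]  # XOR labels: positive iff x1 * x2 > 0.

for x in xor_points:
    for y in xor_points:
        # Kernel value and explicit inner product agree.
        assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))

# In the feature space, the coordinate sqrt(2) x1 x2 separates the two classes.
print([np.sign(phi(x)[2]) == l for x, l in zip(xor_points, labels)])  # all True
```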
Reproducing Kernel Hilbert Space (Aronszajn, 1950)

Theorem: Let K: X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that
∀x, y ∈ X, K(x, y) = ⟨Φ(x), Φ(y)⟩ = Φ(x) · Φ(y).

Proof: For any x ∈ X, define Φ(x) ∈ R^X as follows:
∀y ∈ X, Φ(x)(y) = K(x, y).
• Let H_0 = \{ \sum_{i \in I} a_i \Phi(x_i) : a_i \in R, x_i \in X, \mathrm{card}(I) < \infty \}.
• We are going to define an inner product ⟨·, ·⟩ on H_0.
• Definition: for any f = \sum_{i \in I} a_i \Phi(x_i) and g = \sum_{j \in J} b_j \Phi(y_j),
\langle f, g \rangle = \sum_{i \in I, j \in J} a_i b_j K(x_i, y_j) = \sum_{j \in J} b_j f(y_j) = \sum_{i \in I} a_i g(x_i).
• ⟨·, ·⟩ does not depend on the representations of f and g.
• ⟨·, ·⟩ is bilinear and symmetric.
• ⟨·, ·⟩ is positive semi-definite since K is PDS: for any f,
\langle f, f \rangle = \sum_{i,j \in I} a_i a_j K(x_i, x_j) \ge 0.
• Note: for any f_1, ..., f_m and c_1, ..., c_m,
\sum_{i,j=1}^m c_i c_j \langle f_i, f_j \rangle = \Big\langle \sum_{i=1}^m c_i f_i, \sum_{j=1}^m c_j f_j \Big\rangle \ge 0.
Thus, ⟨·, ·⟩ is a PDS kernel on H_0.
• ⟨·, ·⟩ is definite:
  • first, the Cauchy-Schwarz inequality for PDS kernels: if K is PDS, then for all x, y ∈ X the matrix
  M = \begin{bmatrix} K(x, x) & K(x, y) \\ K(y, x) & K(y, y) \end{bmatrix}
  is SPSD. In particular, the product of its eigenvalues, det(M), is non-negative:
  \det(M) = K(x, x)\, K(y, y) - K(x, y)^2 \ge 0.
  • since ⟨·, ·⟩ is a PDS kernel, for any f ∈ H_0 and x ∈ X,
  \langle f, \Phi(x) \rangle^2 \le \langle f, f \rangle \, \langle \Phi(x), \Phi(x) \rangle.
  • observe the reproducing property of ⟨·, ·⟩:
  \forall f \in H_0, \forall x \in X, \quad f(x) = \sum_{i \in I} a_i K(x_i, x) = \langle f, \Phi(x) \rangle.
  • Thus, [f(x)]^2 \le \langle f, f \rangle K(x, x) for all x ∈ X, which shows the definiteness of ⟨·, ·⟩.
• Thus, ⟨·, ·⟩ defines an inner product on H_0, which thereby becomes a pre-Hilbert space.
• H_0 can be completed to form a Hilbert space H in which it is dense.

Notes:
• H is called the reproducing kernel Hilbert space (RKHS) associated to K.
• A Hilbert space such that there exists Φ: X → H with K(x, y) = Φ(x) · Φ(y) for all x, y ∈ X is also called a feature space associated to K. Φ is called a feature mapping.
• Feature spaces associated to K are in general not unique.

SVMs with PDS Kernels (Boser, Guyon, and Vapnik, 1992)

Constrained optimization:
\max_{\alpha} \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j), \quad \text{with } K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j),
subject to: 0 \le \alpha_i \le C \wedge \sum_{i=1}^m \alpha_i y_i = 0, \; i \in [1, m].

Solution:
h(x) = \mathrm{sgn}\Big( \sum_{i=1}^m \alpha_i y_i K(x_i, x) + b \Big),
with b = y_i - \sum_{j=1}^m \alpha_j y_j K(x_j, x_i) for any x_i with 0 < \alpha_i < C.

Rademacher Complexity of Kernel-Based Hypotheses

Theorem: Let K: X × X → R be a PDS kernel and let Φ: X → H be a feature mapping associated to K. Let S ⊆ {x : K(x, x) ≤ R^2} be a sample of size m, and let H = \{x \mapsto w \cdot \Phi(x) : \|w\|_H \le \Lambda\}. Then,
\widehat{\mathfrak{R}}_S(H) \le \frac{\Lambda \sqrt{\mathrm{Tr}[\mathbf{K}]}}{m} \le \sqrt{\frac{R^2 \Lambda^2}{m}}.

Proof:
\widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathop{E}_{\sigma}\Big[ \sup_{\|w\| \le \Lambda} w \cdot \sum_{i=1}^m \sigma_i \Phi(x_i) \Big]
= \frac{\Lambda}{m} \mathop{E}_{\sigma}\Big[ \Big\| \sum_{i=1}^m \sigma_i \Phi(x_i) \Big\| \Big]
\le \frac{\Lambda}{m} \Big[ \mathop{E}_{\sigma} \Big\| \sum_{i=1}^m \sigma_i \Phi(x_i) \Big\|^2 \Big]^{1/2} \quad \text{(Jensen's inequality)}
= \frac{\Lambda}{m} \Big[ \sum_{i=1}^m \|\Phi(x_i)\|^2 \Big]^{1/2}
= \frac{\Lambda}{m} \Big[ \sum_{i=1}^m K(x_i, x_i) \Big]^{1/2} = \frac{\Lambda \sqrt{\mathrm{Tr}[\mathbf{K}]}}{m} \le \sqrt{\frac{R^2 \Lambda^2}{m}}.

Generalization: Representer Theorem (Kimeldorf and Wahba, 1971)

Theorem: Let K: X × X → R be a PDS kernel with H the corresponding RKHS. Then, for any non-decreasing function G: R → R and any L: R^m → R ∪ {+∞}, the optimization problem
\operatorname*{argmin}_{h \in H} F(h) = \operatorname*{argmin}_{h \in H} \; G(\|h\|_H) + L\big(h(x_1), \ldots, h(x_m)\big)
admits a solution of the form h^\star = \sum_{i=1}^m \alpha_i K(x_i, \cdot). If G is further assumed to be increasing, then any solution has this form.

Proof: let H_1 = span({K(x_i, ·) : i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h_1 + h^⊥ according to H = H_1 ⊕ H_1^⊥.
• Since G is non-decreasing,
G(\|h_1\|_H) \le G\Big(\sqrt{\|h_1\|_H^2 + \|h^\perp\|_H^2}\Big) = G(\|h\|_H).
• By the reproducing property, for all i ∈ [1, m],
h(x_i) = \langle h, K(x_i, \cdot) \rangle = \langle h_1, K(x_i, \cdot) \rangle = h_1(x_i).
• Thus, L(h(x_1), ..., h(x_m)) = L(h_1(x_1), ..., h_1(x_m)) and F(h_1) ≤ F(h).
• If G is increasing, then F(h_1) < F(h) when h^⊥ ≠ 0, and any solution of the optimization problem must be in H_1.
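The representer theorem is what makes kernel-based learning computationally tractable: rather than searching over all of H, one only needs the m coefficients α_i. As an illustration, here is a minimal kernel ridge regression sketch (my own example, not an algorithm discussed on these slides), whose minimizer has exactly the form Σ_i α_i K(x_i, ·) with α = (K + λ m I)^{-1} y.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=0.01, sigma=1.0):
    # Minimizes lam * ||h||_H^2 + (1/m) * sum_i (h(x_i) - y_i)^2 over the RKHS H.
    # By the representer theorem, the solution is h = sum_i alpha_i K(x_i, .),
    # with alpha = (K + lam * m * I)^{-1} y.
    m = len(y)
    K = gaussian_kernel_matrix(X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def kernel_ridge_predict(alpha, X_train, x, sigma=1.0):
    # h(x) = sum_i alpha_i K(x_i, x).
    k = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * sigma ** 2))
    return k @ alpha

# Hypothetical one-dimensional regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(alpha, X, np.array([0.5])))  # prediction at x = 0.5
```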
Kernel-Based Algorithms

PDS kernels can be used to extend a variety of algorithms in classification and other areas:
• regression,
• ranking,
• dimensionality reduction,
• clustering.
But how do we define PDS kernels?

Closure Properties of PDS Kernels

Theorem: positive definite symmetric (PDS) kernels are closed under:
• sum,
• product,
• tensor product,
• pointwise limit,
• composition with a power series with non-negative coefficients.

Closure Properties - Proof

Proof:
• closure under sum:
c^\top \mathbf{K} c \ge 0 \;\wedge\; c^\top \mathbf{K}' c \ge 0 \;\Longrightarrow\; c^\top (\mathbf{K} + \mathbf{K}') c \ge 0.
• closure under product: write the SPSD matrix K as K = M M^⊤; then
\sum_{i,j=1}^m c_i c_j (\mathbf{K}_{ij} \mathbf{K}'_{ij}) = \sum_{i,j=1}^m c_i c_j \Big[ \sum_{k=1}^m M_{ik} M_{jk} \Big] \mathbf{K}'_{ij}
= \sum_{k=1}^m \sum_{i,j=1}^m (c_i M_{ik})(c_j M_{jk}) \mathbf{K}'_{ij}
= \sum_{k=1}^m \begin{bmatrix} c_1 M_{1k} \\ \cdots \\ c_m M_{mk} \end{bmatrix}^\top \mathbf{K}' \begin{bmatrix} c_1 M_{1k} \\ \cdots \\ c_m M_{mk} \end{bmatrix} \ge 0.
• closure under tensor product:
  • definition: for all x_1, x_2, y_1, y_2 ∈ X,
  (K_1 \otimes K_2)(x_1, y_1, x_2, y_2) = K_1(x_1, x_2) \, K_2(y_1, y_2).
  • thus, K_1 ⊗ K_2 is a PDS kernel as the product of the two PDS kernels
  (x_1, y_1, x_2, y_2) ↦ K_1(x_1, x_2) and (x_1, y_1, x_2, y_2) ↦ K_2(y_1, y_2).
• closure under pointwise limit: if for all x, y ∈ X, \lim_{n \to \infty} K_n(x, y) = K(x, y), then
(\forall n, \; c^\top \mathbf{K}_n c \ge 0) \;\Longrightarrow\; \lim_{n \to \infty} c^\top \mathbf{K}_n c = c^\top \mathbf{K} c \ge 0.
• closure under composition with a power series:
  • assumptions: K is a PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and f(x) = \sum_{n=0}^{\infty} a_n x^n, a_n ≥ 0, is a power series with radius of convergence ρ.
  • f ∘ K is a PDS kernel since K^n is PDS by closure under product, \sum_{n=0}^N a_n K^n is PDS by closure under sum, and f ∘ K follows by closure under pointwise limit.
Example: for any PDS kernel K, exp(K) is PDS.

Sequence Kernels

Definition: kernels defined over pairs of strings.
• Motivation: computational biology, text and speech classification.
• Idea: two sequences are related when they share some common substrings or subsequences.
• Example: the bigram kernel
K_{\mathrm{bigram}}(x, y) = \sum_{\text{bigram } u} \mathrm{count}_x(u) \times \mathrm{count}_y(u).
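As a concrete illustration (a plain-Python sketch of the formula above, not the transducer machinery described next), the bigram kernel can be computed by counting bigrams in each string and taking the dot product of the two count vectors; it is PDS since it is an explicit inner product of count vectors.

```python
from collections import Counter

def bigram_counts(x: str) -> Counter:
    # Maps each bigram u to count_x(u), its number of occurrences in x.
    return Counter(x[i:i + 2] for i in range(len(x) - 1))

def bigram_kernel(x: str, y: str) -> int:
    # K(x, y) = sum over bigrams u of count_x(u) * count_y(u).
    cx, cy = bigram_counts(x), bigram_counts(y)
    return sum(cx[u] * cy[u] for u in cx.keys() & cy.keys())

print(bigram_kernel("abab", "abba"))  # shared bigrams: ab (2*1) and ba (1*1) -> 3
```

The transducer view presented next computes the same values in the form K = T ∘ T^{-1}, with T a bigram-counting transducer, and generalizes to gappy and other counting kernels.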
Weighted Transducers

[Figure: a weighted transducer over {a, b} with states 0, 1, 2 and final state 3/0.1, and transitions labeled b:a/0.6, b:a/0.2, a:b/0.1, a:a/0.4, b:a/0.3, a:b/0.5.]

T(x, y) = sum of the weights of all accepting paths with input x and output y.
Example: T(abb, baa) = 0.1 × 0.2 × 0.3 × 0.1 + 0.5 × 0.3 × 0.6 × 0.1.

Rational Kernels over Strings (Cortes et al., 2004)

Definition: a kernel K: Σ* × Σ* → R is rational if K = T for some weighted transducer T.

Definition: let T_1: Σ* × Δ* → R and T_2: Δ* × Ω* → R be two weighted transducers. Then, the composition of T_1 and T_2 is defined for all x, y by
(T_1 \circ T_2)(x, y) = \sum_{z} T_1(x, z) \, T_2(z, y).

Definition: the inverse of a transducer T: Σ* × Δ* → R is the transducer T^{-1}: Δ* × Σ* → R obtained from T by swapping input and output labels.

PDS Rational Kernels - General Construction

Theorem: for any weighted transducer T: Σ* × Δ* → R, the function K = T ∘ T^{-1} is a PDS rational kernel.

Proof: by definition, for all x, y,
K(x, y) = \sum_{z} T(x, z) \, T(y, z).
• K is the pointwise limit of the kernels (K_n)_{n ≥ 0} defined for all x, y by
K_n(x, y) = \sum_{|z| \le n} T(x, z) \, T(y, z).
• K_n is PDS since for any sample (x_1, ..., x_m), its kernel matrix can be written K_n = A A^⊤ with A = (T(x_i, z_j))_{i ∈ [1, m], j ∈ [1, N]}, where z_1, ..., z_N are the strings of length at most n.

PDS Sequence Kernels

PDS sequence kernels used in computational biology, text classification, and other applications are special instances of PDS rational kernels:
• PDS rational kernels are easy to define and modify.
• a single general algorithm suffices for their computation: composition + shortest-distance computation.
• no need for a specific 'dynamic-programming' algorithm and proof for each kernel instance.
• a general sub-family: kernels based on counting transducers.

Counting Transducers

[Figure: counting transducer T_X with self-loops a:ε/1, b:ε/1 at state 0 and at the final state 1/1, and a transition X:X/1 from 0 to 1.]

Example: X = ab, Z = bbabaabba; the two alignments εεabεεεεε and εεεεεabεε mark the two occurrences of ab in Z.

X may be a string or an automaton representing a regular expression. The count of occurrences of X in Z is the sum of the weights of the accepting paths of Z ∘ T_X.

Transducer Counting Bigrams

[Figure: bigram-counting transducer T_bigram with self-loops a:ε/1, b:ε/1 at states 0 and 2/1, and transitions a:a/1, b:b/1 from 0 to 1 and from 1 to 2.]

The count of the bigram ab in Z is given by Z ∘ T_bigram ∘ ab.

Transducer Counting Gappy Bigrams

[Figure: gappy-bigram-counting transducer T_gappy bigram, identical to T_bigram except that state 1 carries self-loops a:ε/λ, b:ε/λ, so that each symbol occurring in the gap is penalized by λ.]

The count of the gappy bigram ab in Z is given by Z ∘ T_gappy bigram ∘ ab, with gap penalty λ ∈ (0, 1).
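To make the gappy count concrete, here is a direct sketch (my own helper functions, no transducer library): under the weighting just described, the gappy count of a bigram u = u_1 u_2 in z is the sum, over all positions i < j with z_i = u_1 and z_j = u_2, of λ raised to the number of skipped symbols, and the corresponding kernel is the dot product of these weighted count vectors.

```python
def gappy_bigram_count(z: str, u: str, lam: float) -> float:
    # Weighted count of the gappy bigram u = u[0]u[1] in z:
    # sum over i < j with z[i] == u[0] and z[j] == u[1] of lam ** (j - i - 1).
    a, b = u
    total = 0.0
    for i, zi in enumerate(z):
        if zi != a:
            continue
        for j in range(i + 1, len(z)):
            if z[j] == b:
                total += lam ** (j - i - 1)
    return total

def gappy_bigram_kernel(x: str, y: str, lam: float, alphabet: str = "ab") -> float:
    # K(x, y) = sum over bigrams u of count_x^lam(u) * count_y^lam(u).
    return sum(gappy_bigram_count(x, a + b, lam) * gappy_bigram_count(y, a + b, lam)
               for a in alphabet for b in alphabet)

# Contiguous occurrences contribute 1 each; gapped occurrences are discounted by lam^gap.
print(gappy_bigram_count("bbabaabba", "ab", 0.5))  # 3.4375
print(gappy_bigram_kernel("abab", "abba", 0.5))
```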
Composition

Theorem: the composition of two weighted transducers is also a weighted transducer.

Proof: constructive proof based on the composition algorithm:
• states are identified with pairs of states of the two transducers.
• ε-free case: transitions are defined by
E = \bigcup_{\substack{(q_1, a, b, w_1, q_2) \in E_1 \\ (q_1', b, c, w_2, q_2') \in E_2}} \big\{ \big((q_1, q_1'), a, c, w_1 w_2, (q_2, q_2')\big) \big\}.
• general case: use of an intermediate ε-filter.

Composition Algorithm - ε-Free Case

[Figure: an example of ε-free composition: two weighted transducers T_1 and T_2 and their composition, whose states are pairs of states such as (0, 0), (1, 1), (3, 2) and whose transition weights are the products of the matched transition weights (e.g. a:a/.02, a:b/.18, b:a/.06), with final weight (3, 3)/.42.]

Complexity: O(|T_1| |T_2|) in general, linear in some cases.

Redundant ε-Paths Problem (MM, Pereira, and Riley, 1996; Pereira and Riley, 1997)

[Figure: an example where T_1 has ε's on its output side and T_2 has ε's on its input side; the ε's are marked (ε_1, ε_2) and composition is carried out with an intermediate ε-filter F so that redundant ε-paths are not counted: T = T̃_1 ∘ F ∘ T̃_2.]

Kernels for Other Discrete Structures

Similarly, PDS kernels can be defined on other discrete structures:
• images,
• graphs,
• parse trees,
• automata,
• weighted automata.

Questions

Gaussian kernels have the form exp(-d^2), where d is a metric.
• For what other functions d does exp(-d^2) define a PDS kernel?
• What other PDS kernels can we construct from a metric in a Hilbert space?

Negative Definite Kernels (Schoenberg, 1938)

Definition: a function K: X × X → R is said to be a negative definite symmetric (NDS) kernel if it is symmetric and if for all {x_1, ..., x_m} ⊆ X and c ∈ R^{m×1} with 1^⊤ c = 0,
c^\top \mathbf{K} c \le 0.
Clearly, if K is PDS, then -K is NDS, but the converse does not hold in general.

Examples

The squared distance ||x - y||^2 in a Hilbert space H defines an NDS kernel: if \sum_{i=1}^m c_i = 0,
\sum_{i,j=1}^m c_i c_j \|x_i - x_j\|^2 = \sum_{i,j=1}^m c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 - 2\, x_i \cdot x_j \big)
= \sum_{i,j=1}^m c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big) - 2 \sum_{i=1}^m c_i x_i \cdot \sum_{j=1}^m c_j x_j
\le \sum_{i,j=1}^m c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big)
= \sum_{i=1}^m c_i \|x_i\|^2 \sum_{j=1}^m c_j + \sum_{j=1}^m c_j \|x_j\|^2 \sum_{i=1}^m c_i = 0.

NDS Kernels - Property (Schoenberg, 1938)

Theorem: let K: X × X → R be an NDS kernel such that for all x, y ∈ X, K(x, y) = 0 iff x = y. Then, there exists a Hilbert space H and a mapping Φ: X → H such that
\forall x, y \in X, \quad K(x, y) = \|\Phi(x) - \Phi(y)\|^2.
Thus, under the hypothesis of the theorem, \sqrt{K} defines a metric.

PDS and NDS Kernels (Schoenberg, 1938)

Theorem: let K: X × X → R be a symmetric kernel. Then:
• K is NDS iff exp(-tK) is a PDS kernel for all t > 0.
• Let K' be defined for any x_0 by
K'(x, y) = K(x, x_0) + K(y, x_0) - K(x, y) - K(x_0, x_0)
for all x, y ∈ X. Then, K is NDS iff K' is PDS.

Example

The kernel defined by K(x, y) = exp(-t ||x - y||^2) is PDS for all t > 0, since ||x - y||^2 is NDS.

The kernel exp(-|x - y|^p) is not PDS for p > 2. Otherwise, for any t > 0, {x_1, ..., x_m} ⊆ X and c ∈ R^{m×1},
\sum_{i,j=1}^m c_i c_j \, e^{-t |x_i - x_j|^p} = \sum_{i,j=1}^m c_i c_j \, e^{-|t^{1/p} x_i - t^{1/p} x_j|^p} \ge 0.
This would imply that |x - y|^p is NDS for p > 2, which cannot be (see past homework assignments).
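A quick numerical check (a small NumPy sketch with my own example points, not from the slides) makes the contrast tangible: for three closely spaced points and the second-difference coefficients c = (1, -2, 1), the quadratic form c^⊤ K c is non-negative for exp(-t|x - y|^2) but negative for exp(-t|x - y|^3), witnessing that the latter is not PDS.

```python
import numpy as np

def quad_form(points, c, t=1.0, p=2.0):
    # c^T K c for the kernel matrix K = [exp(-t |x_i - x_j|^p)]_{ij}.
    x = np.asarray(points, dtype=float)
    K = np.exp(-t * np.abs(x[:, None] - x[None, :]) ** p)
    return c @ K @ c

x = [0.0, 0.1, 0.2]              # three closely spaced points
c = np.array([1.0, -2.0, 1.0])   # a discrete second difference

print(quad_form(x, c, p=2.0))  # ~ +0.0012: consistent with exp(-t|x-y|^2) being PDS
print(quad_form(x, c, p=3.0))  # ~ -0.0079: exp(-t|x-y|^3) is not PDS
```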
Conclusion

PDS kernels:
• a rich mathematical theory and foundation.
• a general idea for extending many linear algorithms to non-linear prediction.
• a flexible method: any PDS kernel can be used.
• widely used in modern algorithms and applications.
• can we further learn a PDS kernel and a hypothesis based on that kernel from labeled data? (see the tutorial: http://www.cs.nyu.edu/~mohri/icml2011tutorial/)

References

• N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
• Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning, pages 43-54. MIT Press, Cambridge, MA, 1999.
• Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, Berlin-New York, 1984.
• Bernhard Boser, Isabelle M. Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of COLT 1992, pages 144-152, Pittsburgh, PA, 1992.
• Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and Algorithms. Journal of Machine Learning Research (JMLR), 5:1035-1062, 2004.
• Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.
• G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82-95, 1971.
• James Mercer. Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations. In Proceedings of the Royal Society of London, Series A, Vol. 83, No. 559, pp. 69-70, 1909.
• Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended Finite State Models of Language, Budapest, Hungary, 1996.
• Fernando C. N. Pereira and Michael D. Riley. Speech Recognition by Composition of Weighted Finite Automata. In Finite-State Language Processing, pages 431-453. MIT Press, 1997.
• I. J. Schoenberg. Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society, 44(3):522-536, 1938.
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Appendix

Mercer's Condition (Mercer, 1909)

Theorem: Let X ⊂ R^N be a compact subset and let K: X × X → R be in L_∞(X × X) and symmetric. Then, K admits a uniformly convergent expansion
K(x, y) = \sum_{n=0}^{\infty} a_n \, \phi_n(x) \, \phi_n(y), \quad \text{with } a_n > 0,
iff for any function c in L_2(X),
\int_{X \times X} c(x)\, c(y)\, K(x, y)\, dx\, dy \ge 0.

SVMs with PDS Kernels (matrix form)

Constrained optimization:
\max_{\alpha} \; 2\, \mathbf{1}^\top \alpha - (\alpha \circ \mathbf{y})^\top \mathbf{K} (\alpha \circ \mathbf{y})
subject to: 0 \le \alpha \le C \wedge \alpha^\top \mathbf{y} = 0,
where ∘ denotes the Hadamard (entrywise) product.

Solution:
h = \mathrm{sgn}\Big( \sum_{i=1}^m \alpha_i y_i K(x_i, \cdot) + b \Big),
with b = y_i - (\alpha \circ \mathbf{y})^\top \mathbf{K} e_i for any x_i with 0 < \alpha_i < C.
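The matrix form above is convenient to evaluate numerically. Here is a short sketch (NumPy, with hypothetical toy values for K, y and α; it only evaluates the expressions, it does not solve the quadratic program):

```python
import numpy as np

def dual_objective(alpha, y, K):
    # 2 * 1^T alpha - (alpha o y)^T K (alpha o y), with "o" the Hadamard product.
    v = alpha * y
    return 2.0 * alpha.sum() - v @ K @ v

def offset_b(alpha, y, K, i):
    # b = y_i - (alpha o y)^T K e_i, valid for any i with 0 < alpha_i < C.
    return y[i] - (alpha * y) @ K[:, i]

# Hypothetical toy values: a 3-point SPSD kernel matrix, labels, and dual variables.
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.3, 0.5, 0.2])  # satisfies alpha^T y = 0

print(dual_objective(alpha, y, K))
print(offset_b(alpha, y, K, i=1))
```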