Kernel Fisher Discriminant Analysis in Gaussian Reproducing Kernel Hilbert Spaces – Theory

Su-Yun Huang [email protected]
Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, R.O.C.

Chii-Ruey Hwang [email protected]
Institute of Mathematics, Academia Sinica, Taipei 11529, Taiwan, R.O.C.

November 21, 2006

Running title: Kernel Fisher Discriminant Analysis in Gaussian RKHS. Corresponding author: Su-Yun Huang.

Abstract

Kernel Fisher discriminant analysis (KFDA) has been proposed for nonlinear binary classification. It is a hybrid method of the classical Fisher linear discriminant analysis and a kernel machine. Experimental results have shown that the KFDA performs slightly better in terms of prediction error than the popular support vector machines and is a strong competitor to the latter. However, there is very limited statistical justification for this method. This article aims to provide a fundamental study of it in the framework of Gaussian reproducing kernel Hilbert spaces (RKHS). The success of KFDA is mainly due to two attractive features. (1) The KFDA has the flexibility of a nonparametric model using kernel mixtures, (2) while its implementation algorithm uses a parametric notion via a kernel machine (being linear in an RKHS and arising from the log-likelihood ratio of Gaussian measures). The KFDA emerging from the machine learning community can be linked to some classical results on discrimination of Gaussian processes in Hilbert space (Grenander, 1950). One of the main purposes of this article is to provide a justification of the underlying Gaussian assumption. It is shown that, under suitable conditions, most low dimensional projections of kernel transformed data in an RKHS are approximately Gaussian. The study of approximate Gaussianity of kernel data is not limited to the KFDA context. It can be applied to general kernel machines as well, e.g., support vector machines.

Key words and phrases: Gaussian measure, Gaussian discrimination, kernel Fisher discriminant analysis, maximum likelihood ratio, projection pursuit, reproducing kernel Hilbert space, support vector machine.

1 Introduction

The aim of discriminant analysis is to classify an object into one of k given groups based on training data consisting of $\{(x_j, y_j)\}_{j=1}^n$, where $x_j \in \mathcal{X} \subset \mathbb{R}^p$ is a p-variate input measurement and $y_j \in \{1, \ldots, k\}$ indicates the corresponding group membership. The classical Fisher linear discriminant analysis (FLDA) is a commonly used and time-honored tool for multiclass classification because of its simplicity and probabilistic outputs. With $k \le p+1$, the FLDA finds $k-1$ canonical variates that are optimal (in a certain sense) for separating the groups, and the FLDA's decision boundaries are linear in these canonical variates. Often such a linear formulation of the decision rule is not adequate, and quadratic decision boundaries are called for. But there are still commonly seen cases which need a more general nonlinear decision rule. Motivated by the active development of statistical learning theory (see, e.g., Vapnik, 1998; Hastie, Tibshirani and Friedman, 2001) and the popular and successful usage of various kernel machines (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002), there has emerged a hybrid approach which combines the idea of the feature map in SVM with the classical FLDA. Its usage can be traced back to Mika, Rätsch, Weston, Schölkopf and Müller (1999), Solla, Keen and Müller (1999) and Baudat and Anouar (2000).
Later it was also studied by Mika, Rätsch and Müller (2001), Van Gestel, Suykens and De Brabanter (2001), Mika, Smola and Schölkopf (2001), Xu, Zhang and Li (2001), Mika (2002) and Suykens, Van Gestel, De Brabanter, De Moor and Vandewalle (2002). However, there is very limited statistical justification for this method despite its successful performance in classification. In this article we provide a fundamental study of it in the framework of Gaussian reproducing kernel Hilbert spaces (RKHS).

Mika et al. (1999), Baudat and Anouar (2000) and others have used the term kernel Fisher discriminant analysis (KFDA) for the hybrid method of FLDA and kernel machine. In all the above-mentioned articles, the KFDA is based on performing the FLDA in a kernel-spectrum-based feature space. In this article, we introduce an alternative but equivalent KFDA formulation. This reformulation makes the theoretical study more convenient. The reformulated KFDA is a two-stage procedure. The first stage is to embed the input space $\mathcal{X}$ into an infinite dimensional RKHS, denoted by $\mathcal{H}_\kappa$, via a kernel function $\kappa$. The second stage is to carry out Fisher's procedure using the notion of the maximum likelihood ratio of Gaussian measures. Data embedded into an infinite dimensional RKHS become sparse and can be better separated by simple hyperplanes. The classical FLDA finds an optimal low dimensional subspace and the discrimination is done in this low-dimensional subspace. The optimality refers to being a maximum likelihood ratio criterion as well as a Bayes classifier under a Gaussian assumption on the input data distribution. Parallel to the classical theory, the KFDA discussed in this article extends the maximum likelihood ratio criterion and Bayes classification to their kernel generalization under a Gaussian Hilbert space assumption. Theorems will be given to justify such an underlying assumption. We show that under suitable conditions most low-dimensional projections of kernel transformed data are approximately Gaussian. This extends part of Diaconis and Freedman's (1984) results to functional data. Readers are referred to their Theorems 1.1 and 1.2 on limiting distributions for low-dimensional projections of high-dimensional data, and their Example 3.1 for iid coordinates. The heuristics are as follows. Hyperplanes in Euclidean spaces are rigid, while hyperplanes in an RKHS, which consist of kernel mixtures, are much more flexible. Also, data that are first embedded into an infinite dimensional Hilbert space and then projected onto a low dimensional subspace can be better approximated by a normal distribution.

The rest of the article is organized as follows. In Section 2 we give a brief review of the KFDA emerging from the machine learning community in a kernel-spectrum-based feature space. In Section 3 we introduce an alternative, but equivalent, feature map into an RKHS and link the theory of KFDA in this RKHS to some classical statistical results. In Section 4 we provide theoretic and empirical justification of the underlying Gaussian assumption, which can also be applied to general kernel machines. All proofs are in the Appendix.

2 Review: KFDA in feature space

This section gives a very brief review of the original KFDA emerging from the machine learning community. The KFDA procedure in Mika et al. (1999) and Baudat and Anouar (2000) was formulated in a kernel-spectrum-based feature space.
For a given positive definite kernel and its spectrum,
$$ \kappa(x, u) = \sum_{q=1}^{d} \lambda_q \psi_q(x)\psi_q(u), \qquad d \le \infty, \tag{1} $$
the main idea of the KFDA is first to map the data in the input space $\mathcal{X} \subset \mathbb{R}^p$ into the spectrum-based feature space $\mathbb{R}^d$ via the transformation
$$ x \mapsto \Psi(x) := \bigl(\sqrt{\lambda_1}\,\psi_1(x), \ldots, \sqrt{\lambda_d}\,\psi_d(x)\bigr)'. \tag{2} $$
Let $\mathcal{Z} := \Psi(\mathcal{X})$ be the image of the feature map. Next the classical Fisher procedure is operated on the transformed data in the feature space $\mathcal{Z}$. For binary classification, the KFDA finds a discriminant function of the form
$$ \text{discriminant function}(z) = w'z + b = \sum_q w_q \sqrt{\lambda_q}\,\psi_q(x) + b, \qquad z \in \mathcal{Z}, \tag{3} $$
where $w$ is the canonical variate that maximizes the so-called Rayleigh coefficient in the feature space $\mathcal{Z}$,
$$ J_{\mathrm{RKFDA}}(w) \equiv \frac{w' S_b w}{w' S_w w + r A(w)}, $$
where $S_b = (\bar z_1 - \bar z_2)(\bar z_1 - \bar z_2)'$ and $S_w = \bigl\{\sum_{j=1}^n z_j z_j' - (n_1 \bar z_1 \bar z_1' + n_2 \bar z_2 \bar z_2')\bigr\}/(n-2)$ are respectively the between- and within-group sample covariances in the feature space $\mathcal{Z}$, $\bar z_1, \bar z_2$ are the group means in $\mathcal{Z}$, $n_1, n_2$ are the group sizes, $A(w)$ is a penalty functional on $w$, and $r$ is the associated regularization parameter.

In practice, the kernel function $\kappa$ is defined directly without an explicit expression for its spectrum $\Psi$. Thus, there is no way that $S_b$ and $S_w$ can be explicitly known. However, as the inner product in the feature space can be represented via kernel values,
$$ \langle \Psi(x), \Psi(u)\rangle_{\mathcal{Z}} = \sum_q \lambda_q \psi_q(x)\psi_q(u) = \kappa(x, u), \tag{4} $$
it allows us to work directly with the kernel values without knowing the spectrum-based transformation $\Psi: \mathcal{X} \to \mathcal{Z}$, nor the associated sample means and covariances. By the kernel trick (4), one can show (Mika et al., 1999; Schölkopf and Smola, 2002) that the solution $w$ can be expanded as
$$ w = \sum_{j=1}^n \alpha_j \Psi(x_j) = Z'\alpha, \tag{5} $$
where $Z = (\Psi(x_1), \ldots, \Psi(x_n))'$ is the transformed input data matrix in $\mathcal{Z}$. The discriminant function can be re-formulated as
$$ \text{discriminant function}(x) = \sum_{j=1}^n \alpha_j \kappa(x_j, x) + b. \tag{6} $$
The coefficients $\alpha_j$ can then be obtained as the solution to the following optimization problem:
$$ \arg\max_{\alpha\in\mathbb{R}^n} J_{\mathrm{RKFDA}}(\alpha) \equiv \arg\max_{\alpha\in\mathbb{R}^n} \frac{\alpha' M_b \alpha}{\alpha' M_w \alpha + r A(\alpha)}, \tag{7} $$
where $M_b = (\bar k_1 - \bar k_2)(\bar k_1 - \bar k_2)'$, $M_w = (K^2 - \sum_{i=1}^2 n_i \bar k_i \bar k_i')/(n-2)$, $K = [\kappa(x_j, x_{j'})]_{n\times n}$, $\bar k_i = n_i^{-1}\sum_{j\in I_i} K_j$, $K_j$ is the $j$-th column of $K$, and $I_i$ is the index set for group $i$.

Since the matrix $M_w$ is singular, a penalty functional $A(\alpha)$ is added to overcome the numerical problems. With $\alpha$ solved from (7), the intercept $b$ of the discriminant hyperplane (6) is determined by setting the hyperplane to pass through the midpoint of the two group means, i.e., $b = -\alpha'(\bar k_1 + \bar k_2)/2$.
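For concreteness, the feature-space procedure above can be summarized in a short sketch. The code below is ours, not the authors': it assumes a Gaussian kernel, takes the ridge penalty $A(\alpha) = \alpha'\alpha$, and uses the fact that, with $M_b$ of rank one, the maximizer of (7) is proportional to $(M_w + rI)^{-1}(\bar k_1 - \bar k_2)$; the function and variable names are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, U, sigma):
    # Pairwise Gaussian kernel values kappa(x, u) = exp(-||x - u||^2 / (2 sigma^2)).
    sq = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / sigma ** 2)

def kfda_binary(X, y, sigma=1.0, ridge=1e-3):
    """Solve (7) with a ridge penalty; return (alpha, b) of the
    discriminant function f(x) = sum_j alpha_j kappa(x_j, x) + b as in (6)."""
    K = gaussian_kernel(X, X, sigma)                        # n x n kernel matrix
    g1, g2 = (y == 1), (y == 2)
    n, n1, n2 = len(y), g1.sum(), g2.sum()
    k1, k2 = K[:, g1].mean(axis=1), K[:, g2].mean(axis=1)   # group mean columns
    Mw = (K @ K - n1 * np.outer(k1, k1) - n2 * np.outer(k2, k2)) / (n - 2)
    # With M_b = (k1 - k2)(k1 - k2)', the maximizer of (7) is proportional to
    # (M_w + r I)^{-1}(k1 - k2); the overall scale of alpha is immaterial.
    alpha = np.linalg.solve(Mw + ridge * np.eye(n), k1 - k2)
    b = -alpha @ (k1 + k2) / 2                              # hyperplane through the midpoint
    return alpha, b
```

A test point $x$ would then be classified according to the sign of $\sum_j \alpha_j \kappa(x_j, x) + b$.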
3 KFDA in Gaussian RKHS

The kernel trick in (4), which turns inner products in $\mathcal{Z}$ into kernel values, allows us to carry out the KFDA in the spectrum-based feature space without explicitly knowing the spectrum itself. For the convenience of a theoretic study, we introduce an alternative, but equivalent, KFDA formulation. This formulation uses an alternative data representation, (8) below, which embeds the input space $\mathcal{X}$ directly into the kernel-associated RKHS. Also, with this kernel map (8), the KFDA can be reformulated so that statistical properties, e.g., the maximum likelihood ratio, can be naturally developed in this framework.

The classical FLDA is good for data with predictors having an approximately normal distribution, or at least an approximately elliptically symmetric distribution. Normality or elliptical symmetry is a very restrictive condition on the data distribution. One way to improve data normality is to map the original input space $\mathcal{X}$ into a very high dimensional (often infinite dimensional) space, called the embedding input space, and then to project the embedded data onto a low dimensional subspace. Normality can be dramatically improved when the data are handled in this way. As the effective working subspace of the KFDA, like that of other kernel machines, is of low dimension, approximate Gaussianity in low dimensions is enough. Assume, for the time being, that the data distribution in the embedding Hilbert space is Gaussian. In this section we focus mainly on the KFDA in a Gaussian Hilbert space. Theory as well as empirical examples to justify the underlying Gaussian assumption are deferred to Section 4.

3.1 Aronszajn kernel map and Gaussian measure

Below we introduce a kernel map for embedding data into a high dimensional space. Kernels used throughout the article are positive definite kernels (also known as reproducing kernels). See Aronszajn (1950) for the theory of reproducing kernels and reproducing kernel Hilbert spaces. Given a positive definite kernel $\kappa(\cdot,\cdot)$ on $\mathcal{X}\times\mathcal{X}$, we are going to associate with it an RKHS.

Definition 1 (Reproducing kernel Hilbert space) An RKHS is a Hilbert space of real-valued functions on $\mathcal{X}$ satisfying the property that all the evaluation functionals are bounded linear functionals. To every RKHS there corresponds a unique positive-definite kernel $\kappa$ satisfying the reproducing property, i.e., $\langle f, \kappa(x,\cdot)\rangle = f(x)$ for all $f$ in this RKHS and all $x\in\mathcal{X}$. We say that this RKHS admits the kernel $\kappa$. Conversely, given a positive-definite kernel $\kappa$ on $\mathcal{X}\times\mathcal{X}$ there exists a unique Hilbert space admitting this kernel. We denote this Hilbert space by $\mathcal{H}_\kappa$.

Let $\mu$ be a probability measure on $\mathcal{X}$. (The measure $\mu$ need not be the underlying probability distribution of the input data.) Throughout this article we assume that all the reproducing kernels employed are (1) measurable, (2) of trace type, i.e., $\int_{\mathcal{X}}\kappa(x,x)\,d\mu < \infty$, and (3) such that for $x \ne u$, $\kappa(x,\cdot) \ne \kappa(u,\cdot)$ in the $L_2(\mathcal{X},\mu)$ sense. Consider the transformation $\Gamma: \mathcal{X}\to\mathcal{H}_\kappa$ given by
$$ x \mapsto \Gamma(x) := \kappa(x,\cdot). \tag{8} $$
The original input space $\mathcal{X}$ is then embedded into a new input space $\mathcal{H}_\kappa$ via the transformation $\Gamma$. Each input point $x\in\mathcal{X}$ is mapped to an element $\kappa(x,\cdot)\in\mathcal{H}_\kappa$; this map is called the Aronszajn map in Hein and Bousquet (2004). Their article gives a survey of results in the mathematical literature on positive definite kernels and associated structures potentially relevant for machine learning. Let $J:\mathcal{Z}\to\mathcal{H}_\kappa$ be the map from the spectrum-based feature space $\mathcal{Z}$ to the subset $\Gamma(\mathcal{X})\subset\mathcal{H}_\kappa$ defined by $J(\Psi(x)) = \kappa(x,\cdot)$. Notice that $J$ is a one-to-one linear transformation satisfying $\|\Psi(x)\|_{\mathcal{Z}}^2 = \kappa(x,x) = \|\kappa(x,\cdot)\|_{\mathcal{H}_\kappa}^2$. Thus, $\Psi(\mathcal{X})$ and $\Gamma(\mathcal{X})$ are isometrically isomorphic, and the two feature representations (2) and (8) are equivalent in this sense.

Below we introduce a Gaussian measure on an arbitrary Hilbert space $H$.

Definition 2 (Gaussian measure on a Hilbert space) Let $H$ be an arbitrary real separable Hilbert space. A probability measure $P_H$ defined on $H$ is said to be Gaussian if the distribution of $\langle f, h\rangle_H$ is a one-dimensional normal for any $f\in H$, where $h$ denotes the random element having the probability measure $P_H$. It can be shown that for any $m$ and any $\{f_1,\ldots,f_m\}\subset H$, the joint distribution of $\langle f_1,h\rangle_H,\ldots,\langle f_m,h\rangle_H$ is normal. For references on Gaussian measures on Hilbert spaces see, e.g., Grenander (1963), Vakhania, Tarieladze and Chobanyan (1987), and Janson (1997).
For simplicity we assume throughout this article that $E_{P_H}\langle h, h\rangle_H < \infty$. For such a probability measure $P_H$ on $H$, there exist an element $m\in H$, the mean, and an operator $\Lambda$, known as the covariance operator, such that
$$ \langle m, f\rangle_H = E\langle h, f\rangle_H, \ \ \forall f\in H; \qquad \langle \Lambda f, g\rangle_H = E\{\langle h-m, f\rangle_H\,\langle h-m, g\rangle_H\}, \ \ \forall f,g\in H. $$
The covariance operator $\Lambda$ is of trace type with $\mathrm{trace}(\Lambda) = E\langle h-m, h-m\rangle_H$.

3.2 Maximum likelihood ratio of Gaussian measures

In this subsection we establish the KFDA as a discriminant rule based on the likelihood ratios of Gaussian measures in a Hilbert space. Let $\pi_1,\ldots,\pi_k$ denote the underlying populations with probability distributions $(\mathcal{X}, \mathcal{F}, P_i)$, $i=1,\ldots,k$. Let $I_i$ be the index set of training samples from $\pi_i$ and $I = \cup_{i=1}^k I_i$ be the index set for the entire data. The training data can then be partitioned as $\cup_{i=1}^k\{(x_j, y_j)\}_{j\in I_i}$. Let $n_i = |I_i|$, the size of $I_i$, and $n = |I|$, the size of $I$. Suppose that $\pi_i$ has probability density function $f_i(x)$, $i=1,\ldots,k$. Also assume that the prior probability of an observation coming from $\pi_i$ is $q_i$. Then the conditional probability of a given input measurement $x$ coming from population $\pi_i$ is
$$ \mathrm{prob}(\pi_i\mid x) = \frac{q_i f_i(x)}{q_1 f_1(x) + \cdots + q_k f_k(x)}, \qquad i=1,\ldots,k. \tag{9} $$
An equivalent expression for $\mathrm{prob}(\pi_i\mid x)$ is
$$ \mathrm{prob}(\pi_i\mid x) = \frac{q_i f_i(x)/\{q_1 f_1(x)\}}{1 + q_2 f_2(x)/\{q_1 f_1(x)\} + \cdots + q_k f_k(x)/\{q_1 f_1(x)\}}, \qquad i=1,\ldots,k. \tag{10} $$
Thus, a test input $x$ is assigned to $\pi_i$ if $\mathrm{prob}(\pi_i\mid x)$ is the maximum, which is the Bayes classifier. By this conditional probability approach it is sufficient to train as many as $k-1$ binary classifiers of $\pi_i$ against an arbitrarily chosen reference group, say $\pi_1$, via likelihood ratios for $i=2,\ldots,k$.

The KFDA solves this multiclass classification problem in two steps. First, it maps the data via $\Gamma$ into $\mathcal{H}_\kappa$. Assume that in $\mathcal{H}_\kappa$ the underlying populations $\pi_1,\ldots,\pi_k$ can be approximated by Gaussian measures with a common covariance. (The Gaussian assumption and its justification are discussed later in Section 4.) Next, with the Gaussian assumption, the log-likelihood ratios result in linear decision boundaries (Grenander, 1950). Below we give a formal definition for linear classifiers in a Hilbert space and then recall Grenander's result on the likelihood ratio of two Gaussian measures.

Footnote 3: These Gaussian measures do not live in the underlying RKHS $\mathcal{H}_\kappa$, but rather in a larger Banach space, namely the abstract Wiener space, denoted by $(i, \mathcal{H}_\kappa, B)$. In the special case of $B$ being a Hilbert space, $B$ is the completion of $\mathcal{H}_\kappa$ with respect to the norm $\|f\|_B := \|Af\|_{\mathcal{H}_\kappa}$, where $A$ is a Hilbert–Schmidt operator on $\mathcal{H}_\kappa$, and $i$ is the injection map of $\mathcal{H}_\kappa$ into $B$. For reference, see Kuo (1975).

Definition 3 (Linear classifier) Consider a binary classification in a Hilbert space $H$. We say that a classifier is linear if and only if its decision boundary is given by $\ell(h) + b = 0$, where $\ell(\cdot)$ is a bounded linear functional, $b$ is a real scalar and $h$ is an element in $H$. By the Riesz representation theorem, for each linear functional $\ell(\cdot)$ there exists a unique $g\in H$ such that the decision boundary is given by
$$ \langle g, h\rangle_H + b = 0 \quad\text{for test input } h\in H. \tag{11} $$
The function $g$ acts as a functional normal direction for the separating hyperplane $\langle g, h\rangle_H + b = 0$.

Theorem 1 (Grenander, 1950) Assume that $P_{1,H}$ and $P_{2,H}$ are two equivalent Gaussian measures on $H$ with means $m_1$ and $m_2$ and a common nonsingular covariance operator $\Lambda$ of trace type. Let $L_{2,1} = \log(dP_{2,H}/dP_{1,H})$ and let $h$ be an element in $H$. Let $m_a = (m_1+m_2)/2$ and $m_d = m_2 - m_1$.
A necessary and sufficient condition for the log-likelihood ratio $L_{2,1}$ to be linear is that $m_d \in \mathcal{R}(\Lambda^{1/2})$, where $\mathcal{R}(\Lambda^{1/2})$ is the range of $\Lambda^{1/2}$. The log-likelihood ratio is then given by
$$ L_{2,1}(h) = \langle h - m_a, \Lambda^{-1} m_d\rangle_H. \tag{12} $$

Footnote 4: Notice that $\langle h-m_a, \Lambda^{-1}m_d\rangle_H$ exists a.s. for $h$ from the Gaussian measure with mean $m_a$ and covariance operator $\Lambda$. Let $\lambda_i$ and $e_i$ be the eigenvalues and eigenvectors of $\Lambda$, respectively. Then $\langle h-m_a, \Lambda^{-1}m_d\rangle_H = \sum_i \langle h-m_a, e_i\rangle_H\langle m_d, e_i\rangle_H/\lambda_i$. As the $\{\langle h-m_a, e_i\rangle_H\}$ are independent normal random variables with mean zero and variance $E\langle h-m_a, e_i\rangle_H^2 = \lambda_i$, we have $\sum_i E(\langle h-m_a, e_i\rangle_H\langle m_d, e_i\rangle_H/\lambda_i)^2 = \sum_i \langle m_d, e_i\rangle_H^2/\lambda_i < \infty$, since $m_d\in\mathcal{R}(\Lambda^{1/2})$. Thus, $\langle h-m_a, \Lambda^{-1}m_d\rangle_H$ exists a.s. (For independent random variables $X_i$ with zero mean, $\sum_i X_i$ converges a.s. if and only if $\sum_i \mathrm{var}(X_i) < \infty$.)

To separate two Gaussian populations in $H$, the log-likelihood ratio in Theorem 1 leads to an ideal optimal linear decision boundary. Specifically, a binary KFDA looks for a functional normal direction $g$ which is optimal in separating the two groups of kernel inputs $\{\Gamma(x_j)\}_{j\in I_1}$ and $\{\Gamma(x_j)\}_{j\in I_2}$. Heuristically, when the data patterns (conveyed in the realizations $\Gamma(x_j)$, $j=1,\ldots,n$) are projected along $g$, the group centers are far apart while the spread within each group is small, so that the overlap of the two groups is as small as possible when projected along this functional direction. The optimality can be formalized in the sense of the maximum likelihood ratio of two Gaussian measures.

There are parameters, including the population means and the covariance operator, involved in the log-likelihood ratio (12), which have to be estimated from the data. Below we derive their maximum likelihood estimates.

Theorem 2 (Maximum likelihood estimates) Let $H$ be a Hilbert space of real-valued functions on $\mathcal{X}$. Assume that $\{h_j\}_{j=1}^n$ are iid random elements from a Gaussian measure on $H$ with mean $m$ and nonsingular covariance operator $\Lambda$ of trace type. Then, for any $g, f\in H$, the maximum likelihood estimate for $\langle g, m\rangle_H$ is given by $\langle g, \hat m\rangle_H$ with
$$ \hat m = \frac{1}{n}\sum_{j=1}^n h_j, \tag{13} $$
and the maximum likelihood estimate for $\langle g, \Lambda f\rangle_H$ is given by $\langle g, \hat\Lambda f\rangle_H$ with
$$ \hat\Lambda = \frac{1}{n}\sum_{j=1}^n (h_j - \hat m)\otimes(h_j - \hat m), \tag{14} $$
where $\otimes$ denotes the tensor product. In particular, for any given $x, u\in\mathcal{X}$, by taking $g$ and $f$ to be the evaluation functionals at $x$ and $u$ respectively, the MLEs for $m(x)$ and $\Lambda(x,u)$ are given, respectively, by $\hat m(x) = n^{-1}\sum_{j=1}^n h_j(x)$ and
$$ \hat\Lambda(x, u) = \frac{1}{n}\sum_{j=1}^n \bigl(h_j(x)-\hat m(x)\bigr)\bigl(h_j(u)-\hat m(u)\bigr). \tag{15} $$

For multiple populations sharing a common covariance operator, we pool together the sample covariance estimates from all populations according to their sizes to obtain a single pooled estimate. Abusing the notation a little, the KFDA decision functions are then given by
$$ \log\bigl(q_i f_i(x)/\{q_1 f_1(x)\}\bigr) = \bigl\langle \Gamma(x) - (\bar\Gamma_i + \bar\Gamma_1)/2,\; M_w^{-1}(\bar\Gamma_i - \bar\Gamma_1)\bigr\rangle + \rho_i, \tag{16} $$
where $\rho_i = \log(q_i/q_1)$, $M_w$ is the pooled within-group sample covariance of the kernel training inputs, $\bar\Gamma_i$ is the $i$-th group sample mean of the kernel inputs, and $\Gamma(x)$ is the kernel test input. The test input $x$ is assigned to $\pi_i$ if its corresponding log-likelihood ratio is the maximum.
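The multiclass rule (16) can also be sketched in code. The sketch below is an assumption-laden illustration, not the authors' implementation: it assumes a Gaussian kernel, represents each direction $M_w^{-1}(\bar\Gamma_i - \bar\Gamma_1)$ by a coefficient vector in the span of the kernel training inputs (the formulation of Mika et al., which the text notes coincides with the MLE-based rule), and adds a ridge term in anticipation of Remark 1 below; all names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, U, sigma):
    # Gaussian kernel matrix: kappa(x, u) = exp(-||x - u||^2 / (2 sigma^2)).
    return np.exp(-cdist(X, U, "sqeuclidean") / (2.0 * sigma ** 2))

def kfda_fit(X, y, priors, sigma=1.0, ridge=1e-3):
    """Pooled-covariance KFDA rule in the spirit of (16), with pi_1 as reference.

    Each direction is represented by a coefficient vector beta_i, so that the
    projection of a kernel test input Gamma(x) is k(x)' beta_i with
    k(x) = (kappa(x_1, x), ..., kappa(x_n, x))'.
    """
    classes = list(np.unique(y))
    ref = classes[0]
    K = rbf_kernel(X, X, sigma)
    kbar = {c: K[:, y == c].mean(axis=1) for c in classes}
    # Pooled within-group scatter expressed through kernel columns, plus a
    # ridge term r*I because the empirical scatter is singular (Remark 1).
    N = sum((y == c).sum() * np.cov(K[:, y == c], bias=True) for c in classes)
    N += ridge * np.eye(K.shape[0])
    betas, intercepts = {}, {}
    for c in classes[1:]:
        betas[c] = np.linalg.solve(N, kbar[c] - kbar[ref])
        # Midpoint rule plus the prior adjustment rho_i = log(q_i / q_1).
        intercepts[c] = -betas[c] @ (kbar[c] + kbar[ref]) / 2 \
                        + np.log(priors[c] / priors[ref])

    def classify(Xnew):
        Knew = rbf_kernel(X, Xnew, sigma)          # n x n_new
        # Log-likelihood-ratio score of each class against the reference;
        # the reference class itself gets score 0.
        scores = {ref: np.zeros(Xnew.shape[0])}
        for c in classes[1:]:
            scores[c] = betas[c] @ Knew + intercepts[c]
        stacked = np.vstack([scores[c] for c in classes])
        return np.asarray(classes)[stacked.argmax(axis=0)]

    return classify
```

In this sketch the class with the largest prior-adjusted log-likelihood-ratio score wins, mirroring the assignment rule stated after (16).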
Remark 1 (Regularization) Often the empirical covariance $M_w$ is singular and some kind of regularization is necessary. See, for instance, the ridge approach in Friedman (1989).

By taking the $h_j$ as if they were the kernel data $\Gamma(x_j)$ given in (8), Theorems 1 and 2 lead to the maximum likelihood method that coincides with the existing KFDA algorithm of Mika et al. (1999).

Footnote 5: In Theorems 1 and 2 we have assumed that the data $h_j$ are generated from Gaussian measures. As the kernel data $\Gamma(x_j)$, at least for the Gaussian kernel, are positive functions in $\mathcal{H}_\kappa$, an assumption that the $\Gamma(x_j)$ are Gaussian data would be truly void. However, in Section 4 we will show that kernel data $\Gamma(x_j)$ projected onto a low dimensional subspace, say $m$-dimensional, can be well approximated by an $m$-dimensional normal. The effective working subspace for the KFDA is at most $(k-1)$-dimensional for a $k$-class problem.

Remark 2 (Bayesian interpretation) If prior probabilities $q_1$ and $q_2$ are considered, there is an adjustment $\rho = \log(q_2/q_1)$ that should be added to the log-likelihood ratio (12). This prior-adjusted log-likelihood ratio provides a Bayesian interpretation for the KFDA. There are other Bayesian kernel machines; see, e.g., the relevance vector machine (Tipping, 2001) and the Bayes point machine (Herbrich, Graepel and Campbell, 2001).

Remark 3 (Discriminant analysis by Gaussian mixtures) It can easily be seen that the decision boundaries given by (16) are kernel mixtures. If a Gaussian kernel is used, then the KFDA is a discriminant rule by Gaussian mixtures. The Gaussian mixture approach is not new in the statistical and pattern recognition literature; see, e.g., Hastie and Tibshirani (1996) and Taxt, Hjort and Eikvil (1991). However, the KFDA as a Gaussian mixture has two main attractive features over other Gaussian mixture approaches. One is that the KFDA has the flexibility of a nonparametric model; the other is that its implementation algorithm uses a parametric notion via a kernel machine (being linear in an RKHS and arising from the log-likelihood ratio of Gaussian measures).

4 Justification of the Gaussian assumption

Results in the previous section are based on the Gaussian assumption in the underlying Hilbert space. In this section we provide both theoretic and empirical justification for it. In some previous works (Mika, Rätsch and Müller, 2001; Mika, 2002; Schölkopf and Smola, 2002) there has been empirical evidence that Gaussianity can be improved by kernel maps. Though rudimentary (they show only histograms), their findings are interesting and original. In this section, we provide a more rigorous justification. The basic phenomenon is as follows: most low-dimensional projections of high-dimensional data are approximately Gaussian under suitable conditions. We start our illustration with a single population (the non-centered and centered cases), and then extend it to multiple populations. As the task here is classification, some extra care should be taken to prevent the asymptotic distributions of the underlying populations from collapsing into an identical Gaussian.

4.1 Single population

We start with a single population. Though the KFDA involves at least two populations, there are two main reasons for treating a single population. One is that it is easier and more comprehensible to gain an idea of the asymptotic behavior of low-dimensional projections in the single population case. The other is that, besides classification problems, the results obtained below can be used for other problems involving only one population, e.g., kernel principal component analysis, kernel regression, kernel canonical correlation analysis, etc. As for the non-centered and the centered cases, their purpose will become clear in Subsection 4.2 on multiple population discrimination. Briefly speaking, what we need in practice is the centered case.
It is conceptually similar to the Central Limit Theorem, where data are centered at their sample mean. However, for kernel-based classification algorithms, KFDA or otherwise, we have to be careful about the control of the window width. If the window width goes to zero too fast, projections of kernel data from different populations will have zero mean and converge to a common normal distribution, and thus become indistinguishable. This indistinguishability can be avoided by a proper control of the kernel width.

4.1.1 Non-centered case

Given data $x_1,\ldots,x_n$, let $\Gamma(x_1),\ldots,\Gamma(x_n)$ be (nonrandom) functions in $\mathcal{H}_\kappa$. They form the data set in kernel representation. The kernel window width $\sigma$ depends on $n$, and so do the kernel $\kappa$ and the associated Hilbert space $\mathcal{H}_\kappa$. The notations $\sigma_n$, $\kappa_n$ and $H_n$ will be used from now on to indicate their dependence on $n$. Notice that the kernel data $\Gamma(x_j)$ also depend implicitly on $n$, but we still stick to the notation $\Gamma$, instead of $\Gamma_n$, for simplicity. Let $\kappa_o$ denote the baseline kernel with window width one. Then $\kappa_n(x,u) = \sigma_n^{-p}\kappa_o(x/\sigma_n, u/\sigma_n)$. The size of $\sigma_n$ controls the resolution of the associated Hilbert space: the smaller $\sigma_n$ is, the finer the resolution of the space $H_n$. The resolution can easily be seen in the spline- and wavelet-associated Hilbert spaces via a sequence of nested RKHSs. The window width $\sigma_n$ should decrease to zero as the sample size $n$ approaches infinity.

Footnote 6: In our setup we consider a sequence of reproducing kernel Hilbert spaces $H_n$, $n=1,2,\ldots$. We may choose the same $B = L_2(\mathcal{X},\mu)$ in the abstract Wiener spaces $(i, H_n, B)$, but in $B$ we have a family of Gaussian measures with different covariance operators $\kappa_n$, $n=1,2,\ldots$. Moreover, $H_n = \kappa_n^{1/2}(B)$.

Suppose that there exists a constant $\tau^2 > 0$ such that for any $\epsilon > 0$, as $n\to\infty$,
$$ \frac{1}{n}\,\mathrm{card}\bigl\{1\le j\le n: \bigl|\sigma_n^p\|\Gamma(x_j)\|_{H_n}^2 - \tau^2\bigr| > \epsilon\bigr\} \to 0, $$
$$ \frac{1}{n^2}\,\mathrm{card}\bigl\{1\le j, j'\le n: \sigma_n^p\bigl|\langle\Gamma(x_j), \Gamma(x_{j'})\rangle_{H_n}\bigr| > \epsilon\bigr\} \to 0, $$
where $\sigma_n\to 0$. By the reproducing property of the kernel, the two conditions above are equivalent to
$$ \frac{1}{n}\,\mathrm{card}\bigl\{1\le j\le n: |\sigma_n^p\kappa_n(x_j,x_j) - \tau^2| > \epsilon\bigr\} \to 0, \tag{17} $$
$$ \frac{1}{n^2}\,\mathrm{card}\bigl\{1\le j, j'\le n: \sigma_n^p|\kappa_n(x_j,x_{j'})| > \epsilon\bigr\} \to 0. \tag{18} $$
Condition (17) says that most kernel data $\{\sqrt{\sigma_n^p}\,\Gamma(x_j)\}_{j=1}^n$ have squared $H_n$-norm near $\tau^2$. Condition (18) says that most kernel data are nearly orthogonal in $H_n$.

Let $h$ be a random element from a Gaussian measure with zero mean and covariance operator $\kappa_n$. The Karhunen–Loève expansion for $h$ is
$$ h(x) = \sum_q e_q\,\tilde\psi_{q,n}(x), \qquad e_q \sim \text{iid } N(0,1), $$
where $\tilde\psi_{q,n}(x) = \sqrt{\lambda_q}\,\psi_{q,n}(x) = \sqrt{\lambda_q\sigma_n^{-p}}\,\psi_{q,o}(x/\sigma_n)$ has unitary $H_n$-norm, i.e., $\|\tilde\psi_{q,n}\|_{H_n} = 1$. The kernel data $\{\sqrt{\sigma_n^p}\,\Gamma(x_j)\}_{j=1}^n$ projected along the random direction $h$ have values
$$ \sqrt{\sigma_n^p}\,\langle h, \Gamma(x_1)\rangle_{H_n}, \;\ldots,\; \sqrt{\sigma_n^p}\,\langle h, \Gamma(x_n)\rangle_{H_n}. $$
Let $\theta_n(h)$ be the empirical distribution of this sequence, assigning probability mass $n^{-1}$ to each $\sqrt{\sigma_n^p}\,\langle h, \Gamma(x_j)\rangle_{H_n}$.

Theorem 3 Under conditions (17) and (18), as $n\to\infty$, the empirical distribution $\theta_n(h)$ converges weakly to $N(0,\tau^2)$ in probability.

In Theorem 3 we have used the same convention as Diaconis and Freedman (1984). We show in the Appendix that the random characteristic function $\phi_{n,h}(t)$ of $\theta_n(h)$ converges in probability to the characteristic function $\phi_0(t)$ of $N(0,\tau^2)$, for all $t$. "In probability" refers to the distribution of the random direction $h$. Theorem 3 is established for one-dimensional projections along most random directions $h$, and the result extends to $m$-dimensional projections along random directions $h_1,\ldots,h_m$ for an arbitrary but fixed $m$.
Corollary 4 For $h_k$, $k=1,\ldots,m$, iid from the Gaussian measure with zero mean and covariance operator $\kappa_n$, the kernel data $\{\sqrt{\sigma_n^p}\,\Gamma(x_j)\}_{j=1}^n$ projected along the random directions $\mathbf{h} = (h_1,\ldots,h_m)$ have values
$$ \sqrt{\sigma_n^p}\,\langle \mathbf{h}, \Gamma(x_1)\rangle_{H_n}, \;\ldots,\; \sqrt{\sigma_n^p}\,\langle \mathbf{h}, \Gamma(x_n)\rangle_{H_n}, $$
where $\langle\mathbf{h},\Gamma(x_j)\rangle_{H_n} = \bigl(\langle h_1,\Gamma(x_j)\rangle_{H_n},\ldots,\langle h_m,\Gamma(x_j)\rangle_{H_n}\bigr)\in\mathbb{R}^m$. Let $\theta_n(\mathbf{h})$ be the empirical distribution of this sequence, assigning probability mass $n^{-1}$ to each point $\sqrt{\sigma_n^p}\,\langle\mathbf{h},\Gamma(x_j)\rangle_{H_n}\in\mathbb{R}^m$. Under conditions (17) and (18), as $n\to\infty$, the empirical distribution $\theta_n(\mathbf{h})$ converges weakly to $N(0,\tau^2 I_m)$ in probability.

Let $t_n^2(h) = n^{-1}\sum_{j=1}^n \sigma_n^p\langle h,\Gamma(x_j)\rangle_{H_n}^2$ be the empirical second moment, and let $\theta_{n1}(h)$ be the scaled empirical distribution, assigning probability mass $n^{-1}$ to each of the scaled projections $\sqrt{\sigma_n^p}\,\langle h,\Gamma(x_j)\rangle_{H_n}/t_n$. Under some slightly stronger conditions we will establish their asymptotic distributions. By the inequality $I(|t|>\epsilon)\le |t|^2/\epsilon^2$, the following conditions imply (17) and (18):
$$ \frac{\sigma_n^p}{n}\sum_{j=1}^n \kappa_n(x_j,x_j) \to \tau^2, \tag{19} $$
$$ \frac{1}{n}\,\mathrm{card}\bigl\{1\le j\le n: |\sigma_n^p\kappa_n(x_j,x_j)-\tau^2|>\epsilon\bigr\} \to 0, \tag{20} $$
$$ \frac{\sigma_n^{2p}}{n^2}\sum_{j,j'=1}^n \kappa_n^2(x_j,x_{j'}) \to 0. \tag{21} $$

Theorem 5 Assume conditions (19)–(21) hold. Then (a) the empirical second moment $t_n^2(h)$ converges to $\tau^2$ in probability, and (b) the scaled empirical distribution $\theta_{n1}(h)$ converges weakly to $N(0,1)$ in probability.

4.1.2 Centered case

Let $\bar\Gamma = n^{-1}\sum_{j=1}^n\Gamma(x_j)$ be the empirical data centroid in the feature space $H_n$, and let $\tilde\Gamma(x_j) = \Gamma(x_j) - \bar\Gamma$, $j=1,\ldots,n$, be the centered kernel data. Below we consider the asymptotic distribution of these centered data projected along a random direction $h$ from $N(0,\kappa_n)$. Let $\theta_{n0}(h)$ be the empirical distribution of the centered data projected along $h$, assigning probability mass $n^{-1}$ to each $\sqrt{\sigma_n^p}\,\langle h,\tilde\Gamma(x_j)\rangle_{H_n}$. The conditions below are centered versions of conditions (19)–(21):
$$ \frac{\sigma_n^p}{n}\sum_{j=1}^n \|\tilde\Gamma(x_j)\|_{H_n}^2 \to \tau^2, \qquad \frac{1}{n}\,\mathrm{card}\bigl\{1\le j\le n: \bigl|\sigma_n^p\|\tilde\Gamma(x_j)\|_{H_n}^2-\tau^2\bigr|>\epsilon\bigr\} \to 0, \qquad \frac{\sigma_n^{2p}}{n^2}\sum_{j,j'=1}^n \langle\tilde\Gamma(x_j),\tilde\Gamma(x_{j'})\rangle_{H_n}^2 \to 0. $$
By the kernel reproducing property, the conditions above are equivalent to conditions (22)–(24):
$$ \frac{\sigma_n^p}{n}\sum_{j=1}^n \bigl[\kappa_n(x_j,x_j)-\kappa_{\cdot\cdot}\bigr] \to \tau^2, \tag{22} $$
$$ \frac{1}{n}\,\mathrm{card}\bigl\{1\le j\le n: \bigl|\sigma_n^p(\kappa_n(x_j,x_j)-2\kappa_{j\cdot}+\kappa_{\cdot\cdot})-\tau^2\bigr|>\epsilon\bigr\} \to 0, \tag{23} $$
$$ \frac{\sigma_n^{2p}}{n^2}\sum_{j,j'=1}^n \bigl[\kappa_n(x_j,x_{j'})-\kappa_{j\cdot}-\kappa_{\cdot j'}+\kappa_{\cdot\cdot}\bigr]^2 \to 0, \tag{24} $$
where $\kappa_{j\cdot} = n^{-1}\sum_{j'=1}^n\kappa_n(x_j,x_{j'})$, $\kappa_{\cdot j'} = n^{-1}\sum_{j=1}^n\kappa_n(x_j,x_{j'})$ and $\kappa_{\cdot\cdot} = n^{-2}\sum_{j,j'=1}^n\kappa_n(x_j,x_{j'})$.

Let $s_n^2(h) = n^{-1}\sigma_n^p\sum_{j=1}^n\langle h,\tilde\Gamma(x_j)\rangle_{H_n}^2$, and let $\theta_{n1}(h)$ be the scaled empirical distribution, which assigns probability mass $n^{-1}$ to each of $\sqrt{\sigma_n^p}\,\langle h,\tilde\Gamma(x_j)\rangle_{H_n}/s_n$.

Theorem 6 Assume conditions (22)–(24) hold. Then (a) the empirical variance $s_n^2(h)$ converges to $\tau^2$ in probability, (b) the centered empirical distribution $\theta_{n0}(h)$ converges weakly to $N(0,\tau^2)$ in probability, and (c) the scaled empirical distribution $\theta_{n1}(h)$ converges weakly to $N(0,1)$ in probability.

Theorem 6 extends to $m$-dimensional projections along random directions $h_1,\ldots,h_m$ for an arbitrary but fixed $m$.

Corollary 7 For $h_k$, $k=1,\ldots,m$, iid from the Gaussian measure with zero mean and covariance operator $\kappa_n$, the centered kernel data $\{\sqrt{\sigma_n^p}\,\tilde\Gamma(x_j)\}_{j=1}^n$ projected along the random directions $\mathbf{h} = (h_1,\ldots,h_m)$ have values
$$ \sqrt{\sigma_n^p}\,\langle\mathbf{h},\tilde\Gamma(x_1)\rangle_{H_n}, \;\ldots,\; \sqrt{\sigma_n^p}\,\langle\mathbf{h},\tilde\Gamma(x_n)\rangle_{H_n}, $$
where $\langle\mathbf{h},\tilde\Gamma(x_j)\rangle_{H_n} = \bigl(\langle h_1,\tilde\Gamma(x_j)\rangle_{H_n},\ldots,\langle h_m,\tilde\Gamma(x_j)\rangle_{H_n}\bigr)\in\mathbb{R}^m$. Let $\theta_{n0}(\mathbf{h})$ be the empirical distribution of this sequence, assigning probability mass $n^{-1}$ to each point $\sqrt{\sigma_n^p}\,\langle\mathbf{h},\tilde\Gamma(x_j)\rangle_{H_n}$.
Under conditions (22)–(24), the empirical distribution $\theta_{n0}(\mathbf{h})$ converges weakly to $N(0,\tau^2 I_m)$ in probability as $n\to\infty$.

With all the above asymptotic results, it is natural to ask when conditions (22)–(24) will be met so that Theorem 6 and Corollary 7 apply.

Theorem 8 Let $X_1, X_2, X_3, \ldots$ be an iid sequence from a distribution having a continuous probability density function $p(x)$ on $\mathcal{X}$ which satisfies $\int p^2(x)\,dx < \infty$. Assume that $\kappa_o(s,t)$ is a symmetric translation-type decreasing kernel with tail decay to zero as $\|t-s\|\to\infty$. Also assume that $\sigma_n\to 0$ as $n\to\infty$. Then conditions (22)–(24) hold for almost all realizations of $X_1, X_2, X_3, \ldots$.

Remark 4 The KFDA algorithm for binary classification has an effective working subspace of one dimension. Theorem 6 says that most one-dimensional projections of kernel data can be well approximated by a Gaussian distribution under suitable conditions. Corollary 7 further extends the result to $m$-dimensional projections for multiclass discriminant analysis. Theorem 8 validates conditions (22)–(24) for the asymptotic Gaussian approximation. Though Theorem 8 is established for symmetric translation-type kernels, it extends to integral-translation-invariant kernels, i.e., kernels with $\kappa_o(x,u) = \kappa_o(x-m, u-m)$ for any integer $m$. Wavelets and splines are of this type.

Remark 5 In either the centered or the non-centered case, the asymptotic empirical distribution does not depend on the random projection direction $h$. This phenomenon indicates that the distribution of the kernel data $\Gamma(x_j)$ looks spherically symmetric over $H_n$ when $n$ is large. Here the spherical symmetry is with respect to the $H_n$-norm rather than the $L_2$-norm.

4.2 Multiple populations

Suppose the training inputs $x_1,\ldots,x_n$ are from one of the populations $\pi_1,\ldots,\pi_k$ with corresponding group labels $y_1,\ldots,y_n$. With the kernel representation, the data set then consists of $\{(\Gamma(x_j), y_j)\}_{j=1}^n$. Let $\bar\Gamma_i := n_i^{-1}\sum_{j\in I_i}\Gamma(x_j)$ be the $i$-th group data centroid and let $\tilde\Gamma(x_j)$ be the centered data (about the individual group centroid) given by $\tilde\Gamma(x_j) = \Gamma(x_j) - \bar\Gamma_i$ if $x_j$ belongs to the $i$-th group, i.e., $y_j = i$. Suppose, for each group, conditions (22)–(24) hold. By projecting the centered data from all groups along a common random direction $h\sim N(0,\kappa_n)$, Theorem 6 says that, for each group, the empirical distribution of the centered data converges weakly to a common normal distribution. For the task of classification it is necessary that these group centroids be distinct. Notice that Theorem 6 is established under conditions (22)–(24). However, if conditions (17)–(18) are also met, then by Theorem 3 all $k$ non-centered empirical distributions, denoted by $\theta_{ni}(h)$, $i=1,\ldots,k$, will converge weakly to an identical distribution $N(0,\tau^2)$, which implies that all group centroids converge to zero and become indistinguishable by their asymptotic empirical distributions, centered or non-centered alike. Therefore, ideally we need conditions (22)–(24) to be met while condition (18) is violated, so that the $k$ populations can be discriminated by their group means. Otherwise, they are asymptotically indistinguishable, as their means collapse to zero. Next we discuss the control of the window width $\sigma_n$ to validate or to fail these technical conditions.

Proposition 9 Let $X_0, X_1, X_2, X_3, \ldots$ be an iid sequence from a distribution having a continuous probability density function $p(x)$ on $\mathcal{X}$ and satisfying $\int_{\mathcal{X}}p^2(u)\,du < \infty$.
Assume that $\kappa_o(s,t)$ is a symmetric translation-type decreasing kernel with tail decay to zero as $\|s-t\|\to\infty$. Also assume that $\sigma_n\to 0$. Let $r = \lim_{n\to\infty} n\sigma_n^p$ and $S_{n,\epsilon} = \mathrm{card}\{1\le j\le n: \sigma_n^p|\kappa_n(X_j, X_0)| > \epsilon\}$.
(a) If $0\le r<\infty$, then for any small $\epsilon>0$ and any positive integer $i$ we have
$$ \lim_{n\to\infty} P\{S_{n,\epsilon}=i\} = \frac{e^{-rq_\epsilon}(rq_\epsilon)^i}{i!} $$
for some $0<q_\epsilon<\infty$ (where $q_\epsilon$ is given in the proof). In particular, if $r=0$, we have $\sigma_n^p\kappa_n(X_j, X_0)\to 0$ a.s. as $n\to\infty$.
(b) If $r=\infty$, then for any $\epsilon>0$ we have $S_{n,\epsilon}\to\infty$ in probability.

Again, the above results extend to integral-translation-invariant kernels.

Remark 6 Result (a) above implies that if $\sigma_n\to 0$ at the rate $O(n^{-1/p})$ or faster, then for an arbitrarily small but fixed $\eta>0$ we have
$$ \lim_{n\to\infty} P\Bigl\{\frac{S_{n,\epsilon}}{n} < \eta\Bigr\} = \lim_{n\to\infty}\sum_{0\le i<n\eta} P\{S_{n,\epsilon}=i\} = 1. $$
Thus,
$$ \frac{S_{n,\epsilon}}{n} = \frac{1}{n}\,\mathrm{card}\{1\le j\le n: \sigma_n^p|\kappa_n(X_j, X_0)|>\epsilon\} \to 0 \ \text{in probability}, $$
which also implies
$$ \frac{1}{n^2}\,\mathrm{card}\{1\le j, j'\le n: \sigma_n^p|\kappa_n(X_j, X_{j'})|>\epsilon\} \to 0 \ \text{in probability}. $$
Therefore, "$\sigma_n = O(n^{-1/p})$ or of a smaller order of magnitude" is the case that we should avoid. On the other hand, if $\sigma_n\to 0$ is controlled in such a way that $r = \lim_{n\to\infty} n\sigma_n^p = \infty$, then, although result (b) of Proposition 9 does not guarantee the almost-sure or in-probability failure of condition (18), it says that for each $j'$-th column
$$ \mathrm{card}\{1\le j\le n: \sigma_n^p|\kappa_n(X_j, X_{j'})|>\epsilon\} \to \infty \ \text{in probability}. $$
In conclusion, "$\lim_{n\to\infty} n\sigma_n^p = \infty$" is the asymptotic window size that we should take.

Remark 7 From Proposition 9, we see that the kernel employed should have tail decay and the ideal window width should be controlled in a way that $n\sigma_n^p\to\infty$ and $\sigma_n\to 0$. This is compatible with the well-known optimal window width $\sigma_n = O(n^{-1/(2s+p)})$ for nonparametric function estimation, where $p$ is the dimension of $x$ and $s$ is the kernel order. (In particular, $s=2$ for the Gaussian and Epanechnikov kernels used in the empirical study below.)

Remark 8 Polynomial kernels do not satisfy the tail decay property. Hence, the theorems discussed in Section 4 do not apply to polynomial kernels.

4.3 Empirical examples

We use three examples to show (1) the influence of the kernel transformation on data normality and elliptical symmetry, (2) the effect of window widths and (3) projections along random directions versus the most discriminant direction.

Example 1 In this example, we show how the kernel map can bring the empirical data distribution closer to elliptical symmetry in low-dimensional projections. Consider a random sample of size 200 consisting of data $\{x_i := (x_{i1},\ldots,x_{i5})\}_{i=1}^{200}$, where
$$ x_{i1}, x_{i3}, x_{i4}, x_{i5} \sim \text{iid uniform}(0, 2\pi), \quad\text{and}\quad x_{i2} = \sin(x_{i1}) + \epsilon_i, \ \ \epsilon_i \sim \text{iid } N(0,\tau^2) \text{ with } \tau = 0.4. $$
The data scatter plot over the first two coordinates is in Figure 1, and it reveals the sine pattern. Next, we compute the kernel data $K = [\kappa(x_j, x_{j'})]$ using a Gaussian kernel with window width $\sigma = \sqrt{10}$, where the data $\{x_j\}_{j=1}^{200}$ are scaled to have unitary variance in each coordinate prior to forming the kernel data. Two random columns are drawn from $K$ and their data scatter is plotted in Figure 2. It shows much better elliptical symmetry. (Notice that a random column of $K$ is exactly the projection of the kernel data $\{\Gamma(x_j)\}_{j=1}^{200}$ along a direction randomly picked from the set $D = \{\Gamma(x_j)\}_{j=1}^{200}$.) The normal probability plot of the raw data for the second coordinate is in Figure 3. This normal probability plot is produced using the Matlab m-file "normplot". A characteristic of the normal probability plot is that a normal distribution is plotted as a straight line.
Substantial departures from a straight line are signs of non-normality. The blue "+" marks in Figure 3 are data points and the red line is the reference line for a fitted normal. For comparison, projections of kernel data along directions from the set $D$ are taken and their normality is checked. We present normal probability plots for the best four, median four and worst four projections among $D$ in Figures 4-6.

[Figure 1: Scatter plot of (x1, x2).]
[Figure 2: Kernel data scatter plot.]
[Figure 3: Normal probability plot for x2.]
[Figure 4: Best four projections of kernel data.]
[Figure 5: Median four projections of kernel data.]
[Figure 6: Worst four projections of kernel data.]

Example 2 Draw 10000 random samples from an exponential distribution with probability density function $p(x) = \exp(-x)$. Arrange them in a $200\times 50$ matrix, denoted by $X$. The data matrix $X$ represents a random sample of size 200 in $\mathbb{R}^{50}$. Each row of $X$ is an observation in a 50-dimensional space and there are 200 observations in total. Each column of $X$ represents a variable coordinate axis and is a one-dimensional projection of the observations along that particular variable coordinate. There are 50 variable coordinate axes. Subject all 50 one-dimensional projections to normality checks using (1) the Kolmogorov-Smirnov test of a continuous distribution (centered and coordinatewise scaled to unitary variance) against a standard normal, and (2) the normal probability plot. The average p-value and its standard deviation over these 50 Kolmogorov-Smirnov tests are: average p-value = 2.853 × 10^-4, sample standard deviation = 5.633 × 10^-4. We also show the normal probability plots for the best 4 out of 50 projections (along the coordinate axes) closest to a Gaussian (Figure 7), the median 4 projections (Figure 8), and the worst 4 projections farthest from a Gaussian (Figure 9). The associated p-values for the best four, median four and worst four cases are reported below:

best 4 : 0.0025 0.0023 0.0018 0.0010
median 4 : 0.0000 0.0000 0.0000 0.0000
worst 4 : 0.0000 0.0000 0.0000 0.0000

From these p-values and the normal probability plots in Figures 7-9, it is clear that this random sample of size 200 from a 50-dimensional exponential distribution is far from Gaussian.

[Figure 7: Best four projections of original data.]
[Figure 8: Median four projections of original data.]
[Figure 9: Worst four projections of original data.]
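The raw-data check just described can be rendered in a few lines. The sketch below is ours, not the authors' Matlab code: it assumes the same 200 × 50 exponential design and the column-wise Kolmogorov-Smirnov test of standardized coordinates against the standard normal; the exact p-values depend on the random seed and will differ from those reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 200 observations in R^50 with iid standard-exponential coordinates (Example 2).
X = rng.exponential(scale=1.0, size=(200, 50))

# KS test of each coordinate projection, centered and scaled to unit variance,
# against the standard normal.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pvals = np.array([stats.kstest(Z[:, k], "norm").pvalue for k in range(Z.shape[1])])
print("average p-value:", pvals.mean(), " std:", pvals.std(ddof=1))
print("best 4:", np.sort(pvals)[-4:], " worst 4:", np.sort(pvals)[:4])
```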
To improve the Gaussianity of the data distribution, we transform the data by the Aronszajn kernel map $\Gamma$ in (8). We try both the Gaussian kernel and the Epanechnikov kernel. Via the kernel map $\Gamma$, each data point $x_j\in\mathbb{R}^{50}$ is given a new representation as a function. For the case of a Gaussian kernel, we have
$$ x_j \mapsto \Gamma(x_j) = \kappa(x_j, x) = \exp\bigl(-0.5\|x - x_j\|_2^2/\sigma^2\bigr) \quad\text{with } \sigma = 10. $$
Let $x = (x_1,\ldots,x_k,\ldots,x_{50})$ and $x_j = (x_{j1},\ldots,x_{jk},\ldots,x_{j50})$. For the case of an Epanechnikov kernel, we have
$$ x_j \mapsto \Gamma(x_j) = \kappa(x_j, x) = \prod_{k=1}^{50}\bigl(1 - (x_k - x_{jk})^2/\sigma^2\bigr)\,I\{|x_k - x_{jk}| < \sigma\} \quad\text{with } \sigma = 10, $$
where $I$ is an indicator function. We check whether one-dimensional projections of the kernel data $\{\Gamma(x_j)\}_{j=1}^{200}$ resemble a Gaussian random sample. We consider one-dimensional projections along directions randomly picked from the set $D = \{\Gamma(x_j)\}_{j=1}^{200}$. That is, the $k$-th column of the kernel matrix $K$ comes from projecting the kernel data $\{\Gamma(x_j)\}_{j=1}^{200}$ along the functional direction $\kappa(x_k,\cdot)$. In other words, the columns of $K$ are the result of one-dimensional projections. These columns are then subjected to normality checks. Results show that one-dimensional projections of the kernel-transformed data are much closer to normal. Listed below are the summarized p-values for the best four, median four and worst four out of 200 columns (i.e., projections) of the transformed data. For the case of a Gaussian kernel, we have

best 4 : 0.9943 0.9794 0.9727 0.9634
median 4 : 0.5771 0.5769 0.5702 0.5658
worst 4 : 0.0975 0.0854 0.0288 0.0202
average : 0.5650 (std deviation = 0.2542)

For the case of an Epanechnikov kernel, we have

best 4 : 0.9583 0.9527 0.9134 0.9063
median 4 : 0.3301 0.3291 0.3123 0.3122
worst 4 : 0.0263 0.0261 0.0210 0.0044
average : 0.3662 (std deviation = 0.2441)

Normal probability plots are presented in Figures 10-12 for the Gaussian-kernel transformed data, and in Figures 13-15 for the Epanechnikov-kernel transformed data. Again, colored "+" marks denote data points, different colors stand for different projections, and the dashed lines are reference lines for fitted normals. It is clear that the kernel data in Figures 10-15 follow their reference lines much more closely than the raw data in Figures 7-9, which indicates that projections of kernel-transformed data are much closer to Gaussian than those of the untransformed data.

[Figure 10: Best four projections of Gaussian kernel data.]
[Figure 11: Median four projections of Gaussian kernel data.]
[Figure 12: Worst four projections of Gaussian kernel data.]
[Figure 13: Best four projections of Epanechnikov kernel data.]
[Figure 14: Median four projections of Epanechnikov kernel data.]
[Figure 15: Worst four projections of Epanechnikov kernel data.]
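The kernel-column check can be sketched analogously. The code below is an illustration under our own assumptions (same simulated data as the previous sketch, window widths as stated in the text, product form of the Epanechnikov kernel); the p-values it prints will differ from the paper's since the sample is random.

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.exponential(size=(200, 50))          # same data as the previous sketch
sigma = 10.0

# Gaussian-kernel data: column j of K is the projection of {Gamma(x_l)}
# along the functional direction kappa(x_j, .).
K_gauss = np.exp(-0.5 * cdist(X, X, "sqeuclidean") / sigma ** 2)

# Epanechnikov product kernel with window sigma.
diff = X[:, None, :] - X[None, :, :]
K_epan = np.prod((1 - diff ** 2 / sigma ** 2) * (np.abs(diff) < sigma), axis=-1)

def column_pvalues(K):
    # KS p-value of each standardized column against the standard normal.
    Z = (K - K.mean(axis=0)) / K.std(axis=0, ddof=1)
    return np.array([stats.kstest(Z[:, j], "norm").pvalue for j in range(K.shape[1])])

for name, K in [("Gaussian", K_gauss), ("Epanechnikov", K_epan)]:
    p = column_pvalues(K)
    print(name, "average:", p.mean(), "best 4:", np.sort(p)[-4:], "worst 4:", np.sort(p)[:4])
```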
Example 3 This is a two-population example, which aims to show the importance of choosing a proper window size so as to avoid merging the two populations into an indistinguishable one. Throughout this example an Epanechnikov kernel is used. Random samples of size 200 each are drawn from distributions on $\mathbb{R}^{50}$ with probability density functions $p_1(x) = \prod_{i=1}^{50}\exp(-x_i)$, $x_i>0$, and $p_2(x) = \prod_{i=1}^{50}\exp(-(x_i-2))$, $x_i>2$, respectively. Next, we compute the kernel matrix $K = [\kappa_\sigma(x_j, x_{j'})]_{400\times 400}$ with $\sigma = 10, 5, 3, 2$. The first 200 rows of $K$ are from population one and the remaining 200 rows are from population two. Then a one-dimensional projection along a random direction $e\sim N(0, 0.05^2 I_{400})$ is calculated, and the resulting normal probability plots are displayed in Figures 16-19. (The quantity $0.05^2$ is used to normalize the scale of the projection vector. It does not affect the appearance of the normal probability plots, only the scale of the horizontal axis.) From these plots we see clearly that one-dimensional random projections have good normality and that, as the window width decreases, the two normals merge together. Next the kernel data $K$ together with their group labels are used to train for the most discriminant direction using the KFDA. We then draw two extra test samples, each of size 200, from $p_1$ and $p_2$ respectively, and project their kernel transforms along the KFDA-found most discriminant direction. The resulting normal probability plots are displayed in Figures 20-23. We see that data projected along the KFDA-found most discriminant direction have better separation between the groups (Figure 20) than along a random direction (Figure 16), if the window size is right. As the window size decreases, the two underlying samples tend to merge together even along the most discriminant direction.

[Figure 16: Random direction, σ = 10.]
[Figure 17: Random direction, σ = 5.]
[Figure 18: Random direction, σ = 3.]
[Figure 19: Random direction, σ = 2.]
[Figure 20: Test samples projected along the most discriminant direction, σ = 10.]
[Figure 21: Test samples projected along the most discriminant direction, σ = 5.]
[Figure 22: Test samples projected along the most discriminant direction, σ = 3.]
[Figure 23: Test samples projected along the most discriminant direction, σ = 2.]

5 Concluding discussion

In Subsection 3.2 we only considered the case of Gaussian measures having a common covariance operator, which leads to a linear classifier in $H$. For the case of distinct covariance operators, see Rao and Varadarajan (1963); that case leads to a quadratic classifier in $H$. Often an extension to quadratic machines (in the feature space) is straightforward using some existing statistical notion, e.g., the maximum likelihood ratio discussed in this article. In Subsection 4.2 we have discussed the asymptotic behavior of projections for the purpose of multiclass classification. The populations are distinguished solely by their means in the feature space. Their covariances are assumed to share a common structure and are pooled together, and do not play any role in the classification. Sometimes, covariance structures may also carry useful population-dependent information and should be allowed to enter the classifier, which will lead to a quadratic extension. We do not advocate a routine use of quadratic extensions, but merely point out the existing opportunities. The study of asymptotic distributions of projections of kernel data is not limited to the KFDA. The results in Section 4 apply to general kernel machines including support vector machines.

Acknowledgment

The authors thank Yuh-Jye Lee for helpful comments. This research was partially supported by the National Science Council of Taiwan, R.O.C., grant number NSC93-2118-M-001-015.

Appendix

In this appendix, $E_h$, $\mathrm{var}_h$ and $\mathrm{cov}_h$ denote, respectively, the expectation, variance and covariance with respect to the distribution of the random element $h$.

Proof of Theorem 2: For an arbitrary pair $f, g\in H$, the random vector $(\langle g,h\rangle_H, \langle f,h\rangle_H)$ has a 2-dimensional Gaussian distribution by Definition 2.
The mean of the first variate is $E_h\langle g,h\rangle_H = \langle g,m\rangle_H$, and the covariance of the two variates is $\mathrm{cov}_h\{\langle g,h\rangle_H, \langle f,h\rangle_H\} = \langle g,\Lambda f\rangle_H$. Notice that the pairs $(\langle g,h_j\rangle_H, \langle f,h_j\rangle_H)$ are iid from the distribution above. The MLEs for a 2-dimensional Gaussian are clear. □

Proof of Theorem 3: The characteristic function of $\theta_n(h)$ is
$$ \phi_{n,h}(t) = n^{-1}\sum_{j=1}^n \exp\bigl\{it\sqrt{\sigma_n^p}\,\langle h,\Gamma(x_j)\rangle_{H_n}\bigr\}. $$
Notice that $\mathrm{var}_h\{\langle h,\Gamma(x_j)\rangle_{H_n}\} = \langle\Gamma(x_j),\Gamma(x_j)\rangle_{H_n} = \kappa_n(x_j,x_j)$. Then,
$$ E_h\phi_{n,h}(t) = n^{-1}\sum_{j=1}^n\exp\bigl\{-t^2\sigma_n^p\kappa_n(x_j,x_j)/2\bigr\} \ \xrightarrow{\text{by (17)}}\ \exp\{-t^2\tau^2/2\} := \phi_0(t), $$
where $\phi_0(t)$ is the characteristic function of $N(0,\tau^2)$. Likewise, since $\mathrm{var}_h\{\langle h,\Gamma(x_j)-\Gamma(x_{j'})\rangle_{H_n}\} = \kappa_n(x_j,x_j)+\kappa_n(x_{j'},x_{j'})-2\kappa_n(x_j,x_{j'})$, we have
$$ E_h\{|\phi_{n,h}(t)|^2\} = n^{-2}\sum_{j,j'=1}^n E_h\bigl\{\exp\bigl(it\sqrt{\sigma_n^p}\,\langle h,\Gamma(x_j)-\Gamma(x_{j'})\rangle_{H_n}\bigr)\bigr\} = n^{-2}\sum_{j,j'=1}^n\exp\bigl\{-t^2\sigma_n^p[\kappa_n(x_j,x_j)+\kappa_n(x_{j'},x_{j'})-2\kappa_n(x_j,x_{j'})]/2\bigr\} \to \phi_0^2(t) $$
by conditions (17) and (18). Next, by Chebychev's inequality,
$$ \mathrm{Prob}\{|\phi_{n,h}(t)-\phi_0(t)|>\epsilon\} \le \epsilon^{-2}E_h|\phi_{n,h}(t)-\phi_0(t)|^2 = \epsilon^{-2}\bigl(E_h|\phi_{n,h}(t)|^2 + \phi_0^2(t) - 2\phi_0(t)E_h\phi_{n,h}(t)\bigr) \to 0. $$
That is, $\phi_{n,h}(t)\to\phi_0(t)$ in probability for each $t$. Using Lemma 2.2 of Diaconis and Freedman (1984), we conclude that $\theta_n(h)$ converges weakly to $N(0,\tau^2)$ in probability. □

Proof of Corollary 4: Let $t = (t_1,\ldots,t_m)$. The characteristic function of $\theta_n(\mathbf{h})$ is
$$ \phi_{n,\mathbf{h}}(t) = n^{-1}\sum_{j=1}^n\exp\Bigl\{i\sum_{k=1}^m t_k\sqrt{\sigma_n^p}\,\langle h_k,\Gamma(x_j)\rangle_{H_n}\Bigr\}. $$
Notice that $\mathrm{cov}_h\{\langle\mathbf{h},\Gamma(x_j)\rangle_{H_n}\} = \langle\Gamma(x_j),\Gamma(x_j)\rangle_{H_n}I_m = \kappa_n(x_j,x_j)I_m$, where $I_m$ is the $m\times m$ identity matrix. Then, by condition (17),
$$ E_h\phi_{n,\mathbf{h}}(t) = n^{-1}\sum_{j=1}^n\exp\Bigl\{-\sum_{k=1}^m t_k^2\,\sigma_n^p\kappa_n(x_j,x_j)/2\Bigr\} \to \exp\Bigl\{-\sum_{k=1}^m t_k^2\,\tau^2/2\Bigr\} := \phi_0(t), $$
where $\phi_0(t)$ is the characteristic function of $N(0,\tau^2 I_m)$. Likewise, since
$$ \mathrm{cov}_h\{\langle\mathbf{h},\Gamma(x_j)-\Gamma(x_{j'})\rangle_{H_n}\} = \bigl(\kappa_n(x_j,x_j)+\kappa_n(x_{j'},x_{j'})-2\kappa_n(x_j,x_{j'})\bigr)I_m, $$
and by conditions (17) and (18),
$$ E_h\{|\phi_{n,\mathbf{h}}(t)|^2\} = n^{-2}\sum_{j,j'=1}^n E_h\Bigl\{\exp\Bigl(i\sum_{k=1}^m t_k\sqrt{\sigma_n^p}\,\langle h_k,\Gamma(x_j)-\Gamma(x_{j'})\rangle_{H_n}\Bigr)\Bigr\} \to \phi_0^2(t). $$
Similar to the proof of Theorem 3, we have, by Chebychev's inequality, $\phi_{n,\mathbf{h}}(t)\to\phi_0(t)$ in probability for each $t$. Lemma 2.2 of Diaconis and Freedman (1984) easily extends to a multivariate setting. We then conclude that $\theta_n(\mathbf{h})$ converges weakly to $N(0,\tau^2 I_m)$ in probability. □

Proof of Theorem 5: (a) By (19),
$$ E_h t_n^2 = n^{-1}\sum_{j=1}^n E_h\,\sigma_n^p\langle h,\Gamma(x_j)\rangle_{H_n}^2 = n^{-1}\sigma_n^p\sum_{j=1}^n\kappa_n(x_j,x_j) \to \tau^2. $$
Since $(\langle h,\Gamma(x_j)\rangle_{H_n}, \langle h,\Gamma(x_{j'})\rangle_{H_n})$ is a bivariate normal with zero mean and covariance matrix
$$ \begin{pmatrix} \kappa_n(x_j,x_j) & \kappa_n(x_j,x_{j'}) \\ \kappa_n(x_{j'},x_j) & \kappa_n(x_{j'},x_{j'}) \end{pmatrix}, $$
similarly to Lemma 2.3 in Diaconis and Freedman (1984) we have
$$ E_h\bigl[\langle h,\Gamma(x_j)\rangle_{H_n}^2\langle h,\Gamma(x_{j'})\rangle_{H_n}^2\bigr] = 2[\kappa_n(x_j,x_{j'})]^2 + \kappa_n(x_j,x_j)\kappa_n(x_{j'},x_{j'}). $$
By conditions (19)–(21),
$$ E_h t_n^4 = n^{-2}\sigma_n^{2p}E_h\Bigl[\sum_{j=1}^n\langle h,\Gamma(x_j)\rangle_{H_n}^2\Bigr]^2 = n^{-2}\sigma_n^{2p}\sum_{j,j'=1}^n E_h\bigl[\langle h,\Gamma(x_j)\rangle_{H_n}^2\langle h,\Gamma(x_{j'})\rangle_{H_n}^2\bigr] = 2n^{-2}\sigma_n^{2p}\sum_{j,j'=1}^n[\kappa_n(x_j,x_{j'})]^2 + n^{-2}\sigma_n^{2p}\sum_{j,j'=1}^n\kappa_n(x_j,x_j)\kappa_n(x_{j'},x_{j'}) \to \tau^4. $$
Then,
$$ P\{|t_n^2-\tau^2|>\epsilon\} \le \epsilon^{-2}E_h(t_n^2-\tau^2)^2 = \epsilon^{-2}\bigl(E_h t_n^4 - 2\tau^2 E_h t_n^2 + \tau^4\bigr) \to 0 $$
for any $\epsilon>0$. Thus, $t_n^2\to\tau^2$ in probability.
(b) By Theorem 3 and Theorem 5(a), the assertion that $\theta_{n1}\to N(0,1)$ weakly in probability follows from Slutsky's lemma. □

Proof of Theorem 6: (a) By (22),
$$ E_h s_n^2 = n^{-1}\sum_{j=1}^n E_h\,\sigma_n^p\langle h,\tilde\Gamma(x_j)\rangle_{H_n}^2 = n^{-1}\sigma_n^p\sum_{j=1}^n\bigl[\kappa_n(x_j,x_j)-\kappa_{\cdot\cdot}\bigr] \to \tau^2. $$
Similarly, by conditions (23) and (24), we have $E_h s_n^4\to\tau^4$. Thus, $s_n^2\to\tau^2$ in probability.
(b) The characteristic function of $\theta_{n0}(h)$ is
$$ \phi_n(h,t) = n^{-1}\sum_{j=1}^n\exp\bigl\{it\sqrt{\sigma_n^p}\,\langle h,\tilde\Gamma(x_j)\rangle_{H_n}\bigr\}. $$
Since $\mathrm{var}_h\{\langle h,\tilde\Gamma(x_j)\rangle_{H_n}\} = \langle\tilde\Gamma(x_j),\tilde\Gamma(x_j)\rangle_{H_n} = \kappa_n(x_j,x_j)-2\kappa_{j\cdot}+\kappa_{\cdot\cdot}$, we have
$$ E_h\phi_n(h,t) = n^{-1}\sum_{j=1}^n\exp\bigl\{-t^2\sigma_n^p[\kappa_n(x_j,x_j)-2\kappa_{j\cdot}+\kappa_{\cdot\cdot}]/2\bigr\} \to \exp\{-t^2\tau^2/2\} $$
by condition (23). Likewise, since $\mathrm{var}_h\{\langle h,\tilde\Gamma(x_j)-\tilde\Gamma(x_{j'})\rangle_{H_n}\} = \|\tilde\Gamma(x_j)-\tilde\Gamma(x_{j'})\|_{H_n}^2$, then by conditions (22)–(24),
$$ E_h\{|\phi_n(h,t)|^2\} = n^{-2}\sum_{j,j'=1}^n E_h\bigl\{\exp\bigl(it\sqrt{\sigma_n^p}\,\langle h,\tilde\Gamma(x_j)-\tilde\Gamma(x_{j'})\rangle_{H_n}\bigr)\bigr\} = n^{-2}\sum_{j,j'=1}^n\exp\bigl\{-t^2\sigma_n^p\|\tilde\Gamma(x_j)-\tilde\Gamma(x_{j'})\|_{H_n}^2/2\bigr\} \to \exp\{-t^2\tau^2\}. $$
Thus, by Chebychev's inequality, $\phi_n(h,t)\to\exp\{-t^2\tau^2/2\}$ in probability. We may then conclude that $\theta_{n0}(h)$ converges weakly to $N(0,\tau^2)$ in probability.
(c) It follows from Slutsky's lemma. □

The proof of Corollary 7 is similar to those of Corollary 4 and Theorem 6, and is thus omitted.

Proof of Theorem 8: We first show that condition (22) holds for almost all realizations of $X_1, X_2, X_3, \ldots$. Let $U_n := (n(n-1))^{-1}\sum_{j\ne j'}\kappa_o(X_j/\sigma_n, X_{j'}/\sigma_n)$. By Hoeffding's inequality for U-statistics we have
$$ \sum_n P\{|U_n - EU_n| > \epsilon\} \le \sum_n 2\exp\{-n\epsilon^2/\kappa_o^2(0)\} < \infty. $$
Then by the Borel-Cantelli lemma $U_n - EU_n\to 0$ almost surely. Notice that $EU_n = E\kappa_o(X/\sigma_n, X'/\sigma_n)\to 0$ as $\sigma_n\to 0$. Thus, $U_n\to 0$ almost surely. That is, the average of the off-diagonal elements of the kernel matrix converges to zero a.s. As the diagonal elements are all equal to $\tau^2$ for a symmetric translation-type kernel and the average of the off-diagonal elements converges to zero a.s., condition (22) holds almost surely. Similarly one can show that condition (23) holds a.s. Next we show that condition (24), i.e.,
$$ \lim_{n\to\infty} n^{-2}\sigma_n^{2p}\sum_{j,j'=1}^n\langle\tilde\Gamma(X_j),\tilde\Gamma(X_{j'})\rangle_{H_n}^2 = 0 \ \text{a.s.,} $$
holds. The proof can be resolved into the steps below.

• After some straightforward calculation, we obtain the following decomposition:
$$ n^{-2}\sigma_n^{2p}\sum_{j,j'=1}^n\langle\tilde\Gamma(X_j),\tilde\Gamma(X_{j'})\rangle_{H_n}^2 = \sigma_n^{2p}\|\bar\Gamma - m_n\|_{H_n}^4 - 2n^{-1}\sigma_n^{2p}\sum_{j=1}^n\langle\Gamma(X_j)-m_n,\ \bar\Gamma-m_n\rangle_{H_n}^2 \tag{25} $$
$$ \qquad\qquad + \ n^{-2}\sigma_n^{2p}\sum_{j,j'=1}^n\langle\Gamma(X_j)-m_n,\ \Gamma(X_{j'})-m_n\rangle_{H_n}^2, \qquad\text{where } m_n = E\Gamma(X). \tag{26} $$

• Since the $\sqrt{\sigma_n^p}\,\Gamma(X_j)$ are iid with finite second moment $\tau^2$, we have $\sigma_n^p\|\bar\Gamma - m_n\|_{H_n}^2\to 0$ a.s. Thus, the first term in (25), $\sigma_n^{2p}\|\bar\Gamma - m_n\|_{H_n}^4$, converges to 0 a.s. as well.
• Next, by the Cauchy–Schwarz inequality, the second term in (25) satisfies
$$ n^{-1}\sigma_n^{2p}\sum_{j=1}^n\langle\Gamma(X_j)-m_n,\ \bar\Gamma-m_n\rangle_{H_n}^2 \le \Bigl(n^{-1}\sum_{j=1}^n\sigma_n^p\|\Gamma(X_j)-m_n\|_{H_n}^2\Bigr)\,\sigma_n^p\|\bar\Gamma-m_n\|_{H_n}^2 \to 0 \ \text{a.s.} $$

• Let $V_n = \sum_{j\ne j'}\sigma_n^{2p}\langle\Gamma(X_j)-m_n,\ \Gamma(X_{j'})-m_n\rangle_{H_n}^2/[n(n-1)]$. Using Hoeffding's inequality for U-statistics, we have
$$ \sum_n P\{|V_n - EV_n| > \epsilon\} \le \sum_n 2\,e^{-n\epsilon^2/\kappa_o^4(0)} < \infty, \qquad \forall\epsilon>0. $$
Thus, $V_n - EV_n\to 0$ a.s. Finally, we show that $\lim_{n\to\infty}EV_n = 0$. For any fixed $\epsilon>0$, let $U_\epsilon = \sup\{|t| : |\kappa_o(t)|>\epsilon\}$. Notice that we have the equalities
$$ E\langle\Gamma(X_j),\Gamma(X_{j'})\rangle_{H_n} = E\kappa_n(X_j, X_{j'}) = E\langle\Gamma(X_j), m_n\rangle_{H_n} = \|m_n\|_{H_n}^2 \qquad (j\ne j'). $$
Then
$$ EV_n = \sigma_n^{2p}E\langle\Gamma(X)-m_n,\ \Gamma(X')-m_n\rangle_{H_n}^2 = \sigma_n^{2p}E\bigl\{\kappa_n(X,X') - \langle m_n,\Gamma(X)\rangle_{H_n} - \langle m_n,\Gamma(X')\rangle_{H_n} + \|m_n\|_{H_n}^2\bigr\}^2 \le \sigma_n^{2p}\bigl\{E\kappa_n^2(X,X') - \|m_n\|_{H_n}^4\bigr\} \le \sigma_n^{2p}E\kappa_n^2(X,X') = \int\!\!\int\kappa_o^2\bigl((x-u)/\sigma_n\bigr)p(x)p(u)\,dx\,du \le \tau^4\!\int\!\!\int_{\|x-u\|\le\sigma_n U_\epsilon}p(x)p(u)\,dx\,du + \epsilon^2, $$
where $\lim_{n\to\infty}\int\!\!\int_{\|x-u\|\le\sigma_n U_\epsilon}p(x)p(u)\,dx\,du = 0$ if $\int p^2(x)\,dx < \infty$. □

Lemma 1 (Poisson approximation) Suppose that for each $n$, $Z_{n1},\ldots,Z_{nr_n}$ are independent random variables, where each $Z_{nk}$ is a Bernoulli trial with success probability $p_{nk}$. If $\lim_{n\to\infty}\sum_{k=1}^{r_n}p_{nk} = q$, $0\le q<\infty$, and $\lim_{n\to\infty}\max_{1\le k\le r_n}p_{nk} = 0$, then
$$ \lim_{n\to\infty}P\Bigl\{\sum_{k=1}^{r_n}Z_{nk} = i\Bigr\} = \frac{e^{-q}q^i}{i!}, \qquad i = 0, 1, 2, \ldots. $$
A proof of the above lemma can be found in Billingsley (1986). Also notice that, by letting $q\to\infty$, $\sum_{k=1}^{r_n}Z_{nk}\to\infty$ in probability.

Proof of Proposition 9: For any fixed $\epsilon>0$, let $U_\epsilon = \sup\{|t| : |\kappa_o(t)|>\epsilon\}$. Then,
$$ \sigma_n^{-p}P\{|\kappa_o((X_j - X_0)/\sigma_n)|>\epsilon\} = \sigma_n^{-p}P\{\|X_j - X_0\|\le\sigma_n U_\epsilon\} = \sigma_n^{-p}\int\!\!\int_{\|t-u\|/\sigma_n\le U_\epsilon}p(t)p(u)\,dt\,du = \int\!\!\int_{\|t'\|\le U_\epsilon}p(\sigma_n t' + u)p(u)\,dt'\,du \to v_p U_\epsilon^p\int_{\mathcal{X}}p^2(u)\,du, \quad\text{as } n\to\infty, \tag{27} $$
where $v_p$ is the volume of the unit ball in $\mathbb{R}^p$. Let $I_{nj} = I\{\sigma_n^p|\kappa_n(X_j, X_0)|>\epsilon\}$, where $I$ is an indicator function, and $S_{n,\epsilon} = \sum_{j=1}^n I_{nj}$. From (27) and the definition of $r$, we have
$$ \lim_{n\to\infty}\sum_{j=1}^n P\{I_{nj}=1\} = r\,v_p U_\epsilon^p\int_{\mathcal{X}}p^2(u)\,du. $$
Denote the limit above by $rq_\epsilon$, where $0 < q_\epsilon = v_p U_\epsilon^p\int_{\mathcal{X}}p^2(u)\,du < \infty$. Also notice that
$$ \max_{1\le j\le n}P\{I_{nj}=1\} = P\{|\kappa_o(X_j/\sigma_n, X_0/\sigma_n)|>\epsilon\} \to 0, \quad\text{as } \sigma_n\to 0. $$
Thus, by Lemma 1, for the case $0\le r<\infty$, we have
$$ \lim_{n\to\infty}P\{S_{n,\epsilon}=i\} = \frac{e^{-rq_\epsilon}(rq_\epsilon)^i}{i!}, $$
and, for the case $r=\infty$, we have $S_{n,\epsilon}\to\infty$ in probability. □

References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.
Baudat, G. and Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Computation, 12, 2385–2404.
Billingsley, P. (1986). Probability and Measure. 2nd ed., John Wiley & Sons, New York.
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–279.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.
Diaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection pursuit. Ann. Statist., 12, 793–815.
Friedman, J. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc., 84, 165–175.
Grenander, U. (1950). Stochastic processes and statistical inference. Arkiv för Matematik, 1, 195–277.
Grenander, U. (1963). Probabilities on Algebraic Structures. Almqvist & Wiksells, Stockholm, and John Wiley & Sons, New York.
Grenander, U. (1981). Abstract Inference. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. J. R. Statist. Soc. B, 58, 155–176.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.
Hein, M. and Bousquet, O. (2004). Kernels, associated structures and generalizations.
Technical report, Max Planck Institute for Biological Cybernetics, Germany. http://www.kyb.tuebingen.mpg.de/techreports.html.
Herbrich, R., Graepel, T. and Campbell, C. (2001). Bayes point machines. J. Machine Learning Research, 1, 245–279.
Janson, S. (1997). Gaussian Hilbert Spaces. Cambridge University Press, Cambridge.
Kuo, H.H. (1975). Gaussian Measures in Banach Spaces. Lecture Notes in Mathematics, 463. Springer-Verlag.
Lee, Y.J. and Mangasarian, O.L. (2001). RSVM: reduced support vector machines. Proceedings 1st International Conference on Data Mining, SIAM.
Mahalanobis, P.C. (1925). Analysis of race mixture in Bengal. J. Asiat. Soc. (Bengal), 23, 301.
Mika, S. (2002). Kernel Fisher Discriminants. Ph.D. dissertation, Electrical Engineering and Computer Science, Technische Universität Berlin.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B. and Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Hu, Y.-H., Larsen, J., Wilson, E., and Douglas, S., eds, Neural Networks for Signal Processing, IX, 41–48, IEEE.
Mika, S., Rätsch, G. and Müller, K.-R. (2001). A mathematical programming approach to the kernel Fisher algorithm. In T.K. Leen, T.G. Dietterich and V. Tresp, editors, Advances in Neural Information Processing Systems, 13, 591–597, MIT Press.
Mika, S., Smola, A. and Schölkopf, B. (2001). An improved training algorithm for kernel Fisher discriminants. In T. Jaakkola and T. Richardson, editors, Artificial Intelligence and Statistics, 98–104, Morgan Kaufmann.
Rao, C.R. and Varadarajan, V.S. (1963). Discrimination of Gaussian processes. Sankhyā, A, 25, 303–330.
Schölkopf, B. and Smola, A.J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
Solla, S.A., Keen, T.K. and Müller, K.R. (1999). Nonlinear Discriminant Analysis Using Kernel Functions. MIT Press, Cambridge, MA.
Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B. and Vandewalle, J. (2002). Least Squares Support Vector Machines. World Scientific, New Jersey.
Taxt, T., Hjort, N. and Eikvil, L. (1991). Statistical classification using a linear mixture of multinormal probability densities. Pattn. Recogn. Lett., 12, 731–737.
Tipping, M.E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Machine Learning Research, 1, 211–244.
Vakhania, N.N., Tarieladze, V.I. and Chobanyan, S.A. (1987). Probability Distributions on Banach Spaces. Translated from the Russian by W.A. Woyczynski. Mathematics and Its Applications (Soviet Series), 14, D. Reidel Publishing Co., Dordrecht, Holland.
Van Gestel, T., Suykens, J.A.K. and De Brabanter, J. (2001). Least squares support vector machine regression for discriminant analysis. Proc. International Joint INNS-IEEE Conf. Neural Networks (INNS2001), 2445–2450.
Vapnik, V.N. (1998). Statistical Learning Theory. Wiley, New York.
Xu, J., Zhang, X. and Li, Y. (2001). Kernel MSE algorithm: a unified framework for KFD, LS-SVM and KRR. Proceedings Intern. Joint Conf. Neural Networks, 2, 1486–1491, IEEE Press.