
Kernel Fisher Discriminant Analysis in Gaussian Reproducing Kernel Hilbert Spaces – Theory
Su-Yun Huang (corresponding author)
[email protected]
Institute of Statistical Science
Academia Sinica, Taipei 11529, Taiwan, R.O.C.
Chii-Ruey Hwang
[email protected]
Institute of Mathematics
Academia Sinica, Taipei 11529, Taiwan, R.O.C.
November 21, 2006
Abstract
Kernel Fisher discriminant analysis (KFDA) has been proposed for nonlinear binary classification. It is a hybrid of the classical Fisher linear discriminant analysis and a kernel machine. Experimental results have shown that the KFDA performs slightly better in terms of prediction error than the popular support vector machines and is a strong competitor to them. However, there is very limited statistical justification for this method. This article aims to provide a fundamental study of it in the framework of Gaussian reproducing kernel Hilbert spaces (RKHS).
The success of the KFDA is mainly due to two attractive features: (1) it has the flexibility of a nonparametric model through kernel mixtures, and (2) its implementation algorithm uses a parametric notion via a kernel machine (being linear in an RKHS and arising from the log-likelihood ratio of Gaussian measures). The KFDA emerging from the machine learning community can be linked to classical results on the discrimination of Gaussian processes in Hilbert space (Grenander, 1950). One of the main purposes of this article is to provide a justification of the underlying Gaussian assumption. It is shown that, under suitable conditions, most low-dimensional projections of kernel-transformed data in an RKHS are approximately Gaussian. The study of approximate Gaussianity of kernel data is not limited to the KFDA context; it applies to general kernel machines as well, e.g., support vector machines.
Key words and phrases: Gaussian measure, Gaussian discrimination, kernel Fisher discriminant analysis, maximum likelihood ratio, projection pursuit, reproducing kernel
Hilbert space, support vector machine.
1 Introduction
The aim of discriminant analysis is to classify an object into one of k given groups based on training data {(x_j, y_j)}_{j=1}^n, where x_j ∈ X ⊂ R^p is a p-variate input measurement and y_j ∈ {1, . . . , k} indicates the corresponding group membership. The classical Fisher linear discriminant analysis (FLDA) is a commonly used and time-honored tool for multiclass classification because of its simplicity and probabilistic outputs. With k ≤ p + 1, the FLDA finds k − 1 canonical variates that are optimal (in a certain sense) for separating the groups, and the FLDA's decision boundaries are linear in these canonical variates. Often such a linear formulation of the decision rule is not adequate, and quadratic decision boundaries are called for. Still, there are common cases that require a more general nonlinear decision rule. Motivated by the active development of statistical learning theory (see, e.g., Vapnik, 1998; Hastie, Tibshirani and Friedman, 2001) and the popular and successful use of various kernel machines (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002), there has emerged a hybrid approach which
combines the idea of feature map in SVM with the classical FLDA. Its usage can be
traced back to Mika, Rätsch, Weston, Schölkopf and Müller (1999), Solla, Keen and
Müller (1999) and Baudat and Anouar (2000). Later it was also studied by Mika,
Rätsch and Müller (2001), Van Gestel, Suykens and De Brabanter (2001), Mika,
Smola and Schölkopf (2001), Xu, Zhang and Li (2001), Mika (2002) and Suykens,
Van Gestel, De Brabanter, De Moor and Vandewalle (2002). However, there is very
limited statistical justification of this method despite its successful performance in
classification. In this article we provide a fundamental study of it in the framework
of Gaussian reproducing kernel Hilbert spaces (RKHS).
Mika et al. (1999), Baudat and Anouar (2000) and others have used the term
kernel Fisher discriminant analysis (KFDA) for the hybrid method of FLDA and
kernel machine. In all the above-mentioned articles, the KFDA is based on performing
the FLDA in a kernel-spectrum-based feature space. In this article, we introduce an
alternative but equivalent KFDA formulation, which makes the theoretical study more convenient. The reformulated KFDA is a two-stage procedure.
The first stage is to embed the input space X into an infinite dimensional RKHS,
denoted by Hκ , via a kernel function κ. The second stage is to carry out Fisher’s
procedure using the notion of the maximum likelihood ratio of Gaussian measures.
Data, which are embedded into an infinite dimensional RKHS, become sparse and can
be better separated by simple hyperplanes. The classical FLDA finds an optimal low
dimensional subspace and the discrimination is done in this low-dimensional subspace.
The optimality refers to the maximum likelihood ratio criterion as well as the Bayes classifier under a Gaussian assumption on the input data distribution. Parallel
to the classical theory, the KFDA discussed in this article extends the maximum
likelihood ratio criterion and Bayes classification to their kernel generalization under
a Gaussian Hilbert space assumption. Theorems will be given to justify such an
underlying assumption. We show that under suitable conditions most low-dimensional
projections of kernel transformed data are approximately Gaussian. This extends part
of Diaconis and Freedman’s (1984) results to functional data. Readers are referred to
their Theorems 1.1 and 1.2 on limiting distributions for low-dimensional projections
of high-dimensional data, and Example 3.1 for iid coordinates. The heuristics are
as follows. Hyperplanes in Euclidean spaces are rigid, while hyperplanes in RKHS,
which consist of kernel mixtures, are much more flexible. Also, data, first embedded
into an infinite dimensional Hilbert space and then projected to a low dimensional
subspace, can be better approximated by a normal distribution.
The rest of the article is organized as follows. In Section 2 we give a brief review
of the KFDA emerging from the machine learning community in a kernel-spectrum-based feature space. In Section 3 we introduce an alternative, but equivalent, feature
map into an RKHS and link the theory of KFDA in this RKHS to some classical
statistical results. In Section 4 we provide theoretic and empirical justification of
the underlying Gaussian assumption, which can also be applied to general kernel
machines. All proofs are in the Appendix.
2 Review: KFDA in feature space
This section gives a very brief review of the original KFDA emerging from the machine learning community. The KFDA procedure in Mika et al. (1999) and Baudat and
Anouar (2000) was formulated in a kernel-spectrum-based feature space. For a given
positive definite kernel and its spectrum:
κ(x, u) = ∑_{q=1}^{d} λ_q ψ_q(x) ψ_q(u),  d ≤ ∞,  (1)
the main idea of the KFDA is first to map the data in the input space X ⊂ Rp into
the spectrum-based feature space Rd via a transformation
x → Ψ(x) := (√λ_1 ψ_1(x), . . . , √λ_d ψ_d(x))′.  (2)
Let Z := Ψ(X) be the image of the feature map. Next, the classical Fisher procedure is applied to the transformed data in the feature space Z. For binary classification, the KFDA finds a discriminant function of the form
discriminant function(z) = w′z + b = ∑_q w_q √λ_q ψ_q(x) + b,  z ∈ Z,  (3)
where w is the canonical variate that maximizes the so-called Rayleigh coefficient in
the feature space Z
J_RKFDA(w) ≡ w′S_b w / (w′S_w w + r A(w)),
where S_b = (z̄_1 − z̄_2)(z̄_1 − z̄_2)′ and S_w = {∑_{j=1}^n z_j z_j′ − (n_1 z̄_1 z̄_1′ + n_2 z̄_2 z̄_2′)}/(n − 2) are
respectively the between- and within-group sample covariances in feature space Z,
and z̄1 , z̄2 are group means in Z, n1 , n2 are group sizes, A(w) is a penalty functional
on w and r is the associated regularization parameter. In practice, the kernel function
κ is defined directly without explicit expression for its spectrum Ψ. Thus, there is no
way that Sb and Sw are explicitly known. However, as the inner product in feature
space can be represented via the following kernel value:
⟨Ψ(x), Ψ(u)⟩_Z = ∑_q λ_q ψ_q(x) ψ_q(u) = κ(x, u),  (4)
it allows us to work directly on the kernel values without knowing the spectrum-based
transformation Ψ : X → Z, nor the associated sample means and covariances. By
the kernel trick (4), one can show that (Mika et al., 1999, Schölkopf and Smola, 2002)
the solution w can be expanded as
w = ∑_{j=1}^n α_j Ψ(x_j) = Zα,  (5)
where Z = (Ψ(x1 ), . . . , Ψ(xn )) is the transformed input data matrix in Z. The
discriminant function can be re-formulated as
discriminant function(x) = ∑_{j=1}^n α_j κ(x_j, x) + b.  (6)
The coefficients α_j can then be obtained as the solution to the following optimization problem:
arg max_{α∈R^n} J_RKFDA(α) ≡ arg max_{α∈R^n} α′M_b α / (α′M_w α + r A(α)),  (7)
where M_b = (k̄_1 − k̄_2)(k̄_1 − k̄_2)′, M_w = (K² − ∑_{i=1}^2 n_i k̄_i k̄_i′)/(n − 2), K = [κ(x_j, x_{j′})]_{n×n}, k̄_i = n_i^{−1} ∑_{j∈I_i} K_j, K_j is the j-th column of K, and I_i is the index set for group i.
Since the matrix M_w is singular, a penalty functional A(α) is added to overcome the numerical problems. With α solved from (7), the intercept b of the discriminant hyperplane (6) is determined by requiring the hyperplane to pass through the midpoint of the two group means, i.e., b = −α′(k̄_1 + k̄_2)/2.
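To make the algorithmic steps above concrete, the following Python sketch implements binary KFDA training from the kernel matrix, under the assumptions of a Gaussian kernel and a simple ridge penalty A(α) = α′α. The function and variable names (rbf_kernel, kfda_train, and so on) are ours and purely illustrative, not the implementation of the cited references.

```python
import numpy as np

def rbf_kernel(X, U, sigma):
    # Gaussian kernel matrix: K[i, j] = exp(-||x_i - u_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / sigma**2)

def kfda_train(X, y, sigma=1.0, ridge=1e-3):
    """Binary KFDA sketch; y is a 0/1 label array.
    Returns (alpha, b) of the discriminant f(x) = sum_j alpha_j kappa(x_j, x) + b."""
    K = rbf_kernel(X, X, sigma)                                  # n x n kernel matrix
    n = len(y)
    kbar = [K[:, y == c].mean(axis=1) for c in (0, 1)]           # group means of kernel columns
    n_c = [(y == c).sum() for c in (0, 1)]
    # Within-group scatter M_w = (K^2 - sum_i n_i kbar_i kbar_i') / (n - 2), cf. (7)
    Mw = (K @ K.T - sum(nc * np.outer(kb, kb) for nc, kb in zip(n_c, kbar))) / (n - 2)
    # Rayleigh-quotient maximizer with ridge penalty: alpha ∝ (M_w + r I)^{-1} (kbar_2 - kbar_1)
    alpha = np.linalg.solve(Mw + ridge * np.eye(n), kbar[1] - kbar[0])
    b = -alpha @ (kbar[0] + kbar[1]) / 2                         # hyperplane through midpoint of group means
    return alpha, b

def kfda_predict(X_train, alpha, b, X_test, sigma=1.0):
    # sign of the discriminant value gives the predicted group
    return rbf_kernel(X_test, X_train, sigma) @ alpha + b
```

Because M_b in (7) is of rank one, the maximizer of the penalized Rayleigh quotient is proportional to (M_w + rI)^{−1}(k̄_2 − k̄_1), which is what the sketch solves for.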
3 KFDA in Gaussian RKHS
The kernel trick in (4) of turning inner products in Z into kernel values allows us to
carry out the KFDA in the spectrum-based feature space without explicitly knowing the spectrum itself. For the convenience of a theoretic study, we introduce an
alternative, but equivalent, KFDA formulation. This formulation will use an alternative data representation (8) below, which embeds the input space X directly into
the kernel associated RKHS. Also with this kernel map (8), the KFDA can be reformulated so that statistical properties, e.g., the maximum likelihood ratio, can be
naturally developed in this framework. The classical FLDA works well for data whose predictors have an approximately normal, or at least approximately elliptically symmetric, distribution. Normality or elliptical symmetry is a very restrictive condition on the data distribution. A way to improve data normality is to map the original input space X into a very high dimensional (often infinite dimensional) space, called the embedding input space, and then to project the embedding input space onto a low dimensional subspace. Data normality can be dramatically improved if the data are handled in this way. As the effective working subspace of the KFDA, and of other kernel machines alike, is of low dimensionality, approximate Gaussianity of low-dimensional projections is enough. Assume, for the time being, that the data distribution in the embedding Hilbert space is Gaussian. In this section we focus mainly on the KFDA in a Gaussian Hilbert space. Theory as well as empirical examples to justify the underlying Gaussian assumption are deferred to Section 4.
3.1 Aronszajn kernel map and Gaussian measure
Below we introduce a kernel map for embedding data into a high dimensional space.
Kernels used throughout the article are positive definite kernels (also known as reproducing kernels). See Aronszajn (1950) for the theory of reproducing kernels and
reproducing kernel Hilbert spaces. Given a positive definite kernel κ(·, ·) on X × X ,
we are going to associate with it an RKHS.
Definition 1 (Reproducing kernel Hilbert space) An RKHS is a Hilbert space of real-valued functions on X with the property that all the evaluation functionals are bounded linear functionals.
To every RKHS there corresponds a unique positive-definite kernel κ satisfying the reproducing property, i.e., ⟨f, κ(x, ·)⟩ = f(x) for all f in this RKHS and all x ∈ X.
We say that this RKHS admits the kernel κ. Conversely, given a positive-definite
kernel κ on X × X there exists a unique Hilbert space admitting this kernel. We
denote this Hilbert space by Hκ .
Let µ be a probability measure on X. (The measure µ need not be the underlying probability distribution of the input data.) Throughout this article we assume that all the reproducing kernels employed are (1) measurable, (2) of trace type, i.e., ∫_X κ(x, x) dµ < ∞, and (3) such that, for x ≠ u, κ(x, ·) ≠ κ(u, ·) in the L_2(X, µ) sense. Consider the transformation Γ : X → H_κ given by
x → Γ(x) := κ(x, ·).  (8)
The original input space X is then embedded into a new input space H_κ via the transformation Γ. Each input point x ∈ X is mapped to an element κ(x, ·) ∈ H_κ; the map Γ is called the Aronszajn map in Hein and Bousquet (2004). Their article gives a survey of results in the mathematical literature on positive definite kernels and associated structures potentially relevant for machine learning. Let J : Z → H_κ be the map from the spectrum-based feature space Z to the kernel-associated Hilbert space Γ(X) ⊂ H_κ defined by J(Ψ(x)) = κ(x, ·). Notice that J is a one-to-one linear transformation satisfying
‖Ψ(x)‖²_Z = κ(x, x) = ‖κ(x, ·)‖²_{H_κ}.
Thus, Ψ(X) and Γ(X) are isometrically isomorphic, and the two feature representations (2) and (8) are equivalent in this sense.
Below we introduce a Gaussian measure on an arbitrary Hilbert space H.
Definition 2 (Gaussian measure on a Hilbert space) Let H be an arbitrary real separable Hilbert space. A probability measure P_H defined on H is said to be Gaussian if the distribution of ⟨f, h⟩_H is one-dimensional normal for any f ∈ H, where h denotes the random element having the probability measure P_H.
It can be shown that for any m and any f_1, . . . , f_m ∈ H, the joint distribution of ⟨f_1, h⟩_H, . . . , ⟨f_m, h⟩_H is normal. For references on Gaussian measures on Hilbert spaces, see, e.g., Grenander (1963), Vakhania, Tarieladze and Chobanyan (1987), and Janson (1997).
For simplicity we assume that E_{P_H}⟨h, h⟩_H < ∞ throughout this article. For a probability measure P_H on H, there exist m ∈ H, the mean, and an operator Λ, known as the covariance operator, such that
⟨m, f⟩_H = E⟨h, f⟩_H, ∀f ∈ H;  and  ⟨Λf, g⟩_H = E{⟨h − m, f⟩_H ⟨h − m, g⟩_H}, ∀f, g ∈ H.
The covariance operator Λ is of trace type with trace(Λ) = E⟨h − m, h − m⟩_H.
3.2 Maximum likelihood ratio of Gaussian measures
In this subsection we establish the KFDA as a discriminant rule based on the likelihood ratios of Gaussian measures in a Hilbert space. Let π1 , . . . , πk denote the
underlying populations with probability distributions (X , F, Pi ), i = 1, . . . , k. Let
I_i be the index set of training samples from π_i and I = ∪_{i=1}^k I_i be the index set for the entire data. The training data can then be partitioned as ∪_{i=1}^k {(x_j, y_j)}_{j∈I_i}. Let
ni = |Ii |, the size of Ii , and n = |I|, the size of I. Suppose that πi has probability
density function fi (x), i = 1, . . . , k. Also assume that the prior probability of an
observation coming from πi is qi . Then the conditional probability of a given input
measurement x coming from population πi is
prob(π_i | x) = q_i f_i(x) / (q_1 f_1(x) + · · · + q_k f_k(x)),  i = 1, . . . , k.  (9)
An equivalent expression for prob(π_i | x) is
prob(π_i | x) = (q_i f_i(x)/q_1 f_1(x)) / (1 + q_2 f_2(x)/q_1 f_1(x) + · · · + q_k f_k(x)/q_1 f_1(x)),  i = 1, . . . , k.  (10)
Thus, for a test input x we assign it to πi , if prob(πi |x) is the maximum, which
is the Bayes classifier. By this conditional probability approach it is sufficient to
train as many as k − 1 binary classifiers of πi against an arbitrarily chosen reference
group, say π_1, via likelihood ratios for i = 2, . . . , k. The KFDA solves this multiclass classification problem in two steps. First, it maps the data via Γ into H_κ. Assume that in H_κ the underlying populations π_1, . . . , π_k can be approximated by Gaussian measures with a common covariance.³ (The Gaussian assumption and its justification are discussed later in Section 4.) Next, with the Gaussian assumption, the log-likelihood ratios result in linear decision boundaries (Grenander, 1950). Below we give a formal definition of linear classifiers in a Hilbert space and then recall Grenander's result on the likelihood ratio of two Gaussian measures.
Definition 3 (linear classifier) Consider a binary classification in a Hilbert space H. We say that a classifier is linear if and only if its decision boundary is given by ℓ(h) + b = 0, where ℓ(·) is a bounded linear functional, b is a real scalar, and h is an element in H.
³These Gaussian measures do not live in the underlying RKHS H_κ, but rather in a larger Banach space, namely, the abstract Wiener space, denoted by (i, H_κ, B). In the special case of B being a Hilbert space, B is the completion of H_κ with respect to the norm ‖f‖_B := ‖Af‖_{H_κ}, where A is a Hilbert–Schmidt operator on H_κ, and i is the injection map of H_κ into B. For reference, see Kuo (1975).
By the Riesz Representation Theorem, for each linear functional ℓ(·) there exists a unique g ∈ H such that the decision boundary is given by
⟨g, h⟩_H + b = 0 for test input h ∈ H.  (11)
The function g acts as a functional normal direction for the separating hyperplane ⟨g, h⟩_H + b = 0.
Theorem 1 (Grenander, 1950) Assume that P_{1,H} and P_{2,H} are two equivalent Gaussian measures on H with means m_1 and m_2 and a common nonsingular covariance operator Λ of trace type. Let L_{2,1} = log(dP_{2,H}/dP_{1,H}) and let h be an element in H. Let m_a = (m_1 + m_2)/2 and m_d = m_2 − m_1. A necessary and sufficient condition for the log-likelihood ratio L_{2,1} to be linear is that m_d ∈ R(Λ^{1/2}), where R(Λ^{1/2}) is the range of Λ^{1/2}. The log-likelihood ratio is then given by
L_{2,1}(h) = ⟨h − m_a, Λ^{−1} m_d⟩_H.⁴  (12)
To separate two Gaussian populations in H, the log-likelihood ratio in Theorem 1
leads to an ideal optimal linear decision boundary. More precisely, a binary KFDA looks for a functional normal direction g that is optimal in separating the two groups of kernel inputs {Γ(x_j)}_{j∈I_1} and {Γ(x_j)}_{j∈I_2}. Heuristically, when the data patterns (conveyed in the realizations Γ(x_j), j = 1, . . . , n) are projected along g, the group centers are far apart, while the spread within each group is small, so that the overlap of the two groups is as small as possible along this functional direction.
⁴Notice that ⟨h − m_a, Λ^{−1} m_d⟩_H exists a.s. for h from the Gaussian measure with mean m_a and covariance operator Λ. Let λ_i and e_i be the eigenvalues and eigenvectors of Λ, respectively. Then ⟨h − m_a, Λ^{−1} m_d⟩_H = ∑_i ⟨h − m_a, e_i⟩_H ⟨m_d, e_i⟩_H / λ_i. Since the ⟨h − m_a, e_i⟩_H are independent normal random variables with mean zero and variance E⟨h − m_a, e_i⟩²_H = λ_i, we have ∑_i E(⟨h − m_a, e_i⟩_H ⟨m_d, e_i⟩_H / λ_i)² = ∑_i ⟨m_d, e_i⟩²_H / λ_i < ∞, as m_d ∈ R(Λ^{1/2}). Thus, ⟨h − m_a, Λ^{−1} m_d⟩_H exists a.s. (For independent random variables X_i with zero mean, ∑_i X_i converges a.s. if and only if ∑_i var(X_i) < ∞.)
The optimality can be formalized in the sense of the maximum likelihood ratio
of two Gaussian measures. There are parameters including the population means
and the covariance operator involved in the log-likelihood ratio (12), which have to
be estimated from the data. Below we derive their maximum likelihood estimates.
Theorem 2 (Maximum likelihood estimates) Let H be a Hilbert space of real-valued functions on X. Assume that {h_j}_{j=1}^n are iid random elements from a Gaussian measure on H with mean m and nonsingular covariance operator Λ of trace type. Then, for any g, f ∈ H, the maximum likelihood estimate for ⟨g, m⟩_H is given by ⟨g, m̂⟩_H with
m̂ = (1/n) ∑_{j=1}^n h_j,  (13)
and the maximum likelihood estimate for ⟨g, Λf⟩_H is given by ⟨g, Λ̂f⟩_H with
Λ̂ = (1/n) ∑_{j=1}^n (h_j − m̂) ⊗ (h_j − m̂),  (14)
where ⊗ denotes the tensor product. In particular, for any given x, u ∈ X, by taking g and f to be the evaluation functionals at x and u respectively, the MLEs for m(x) and Λ(x, u) are given, respectively, by m̂(x) = n^{−1} ∑_{j=1}^n h_j(x) and
Λ̂(x, u) = (1/n) ∑_{j=1}^n (h_j(x) − m̂(x))(h_j(u) − m̂(u)).  (15)
For multiple populations sharing a common covariance operator, we pool the sample covariance estimates from all populations according to their sizes to get a single pooled estimate.
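For concreteness, a minimal sketch of the estimates (13)–(15) is given below, treating the h_j as the kernel sections Γ(x_j) = κ(x_j, ·) of an assumed Gaussian kernel. The helper names are hypothetical and the code is only an illustration of the formulas, not the authors' implementation.

```python
import numpy as np

def gauss_kernel(x, u, sigma):
    return np.exp(-0.5 * np.sum((x - u) ** 2) / sigma**2)

def mle_mean_cov(X_train, sigma):
    """Return callables m_hat(x) and Lambda_hat(x, u), cf. (13)-(15),
    with h_j taken to be the kernel sections Gamma(x_j) = kappa(x_j, .)."""
    n = len(X_train)
    def h(j, x):                      # h_j(x) = kappa(x_j, x)
        return gauss_kernel(X_train[j], x, sigma)
    def m_hat(x):                     # (13): empirical mean function evaluated at x
        return np.mean([h(j, x) for j in range(n)])
    def Lambda_hat(x, u):             # (15): empirical covariance function at (x, u)
        hx = np.array([h(j, x) for j in range(n)])
        hu = np.array([h(j, u) for j in range(n)])
        return np.mean((hx - hx.mean()) * (hu - hu.mean()))
    return m_hat, Lambda_hat

# usage sketch
rng = np.random.default_rng(1)
X_train = rng.standard_normal((30, 2))
m_hat, Lambda_hat = mle_mean_cov(X_train, sigma=1.5)
print(m_hat(np.zeros(2)), Lambda_hat(np.zeros(2), np.ones(2)))
```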
By abusing the notation a little bit, the KFDA decision functions are then given by
log(q_i f_i(x)/q_1 f_1(x)) = {Γ(x) − (Γ̄_i + Γ̄_1)/2}′ M_w^{−1} (Γ̄_i − Γ̄_1) + ρ_i,  (16)
where ρ_i = log(q_i/q_1), M_w is the pooled within-group sample covariance of the kernel training inputs, Γ̄_i is the i-th group sample mean of the kernel inputs, and Γ(x) is the kernel test input. The test input x is assigned to π_i if its corresponding log-likelihood ratio is the maximum.
Remark 1 (Regularization) Often the empirical covariance Mw is singular and
some kind of regularization is necessary. See, for instance, the ridge approach in
Friedman (1989). By taking hj as if they were the kernel data Γ(xj ) given in (8),
Theorems 1 and 2 lead to the maximum likelihood method that coincides with the
existing KFDA algorithm of Mika et al. (1999).⁵
Remark 2 (Bayesian interpretation) If prior probabilities q1 and q2 are considered, there is an adjustment ρ = log(q2 /q1 ) that should be added to the log-likelihood
ratio (12). This prior adjusted log-likelihood ratio provides a Bayesian interpretation
for KFDA. There are other Bayesian kernel machines, see, e.g., the relevance vector machine (Tipping, 2001) and the Bayes point machine (Herbrich, Graepel and
Campbell, 2001).
Remark 3 (Discriminant analysis by Gaussian mixtures) It can be easily seen
that decision boundaries given by (16) are kernel mixtures. If a Gaussian kernel is
used, then the KFDA is a discriminant rule by Gaussian mixtures. The Gaussian
mixture approach is not new in statistical and pattern recognition literature, see, e.g.,
Hastie and Tibshirani (1996), Taxt, Hjort and Eikvil (1991). However, the KFDA
as a Gaussian mixture has two main attractive features over other Gaussian mixture
approaches. One is that the KFDA has the flexibility of a nonparametric model; the other is that its implementation algorithm uses a parametric notion via a kernel machine (being linear in an RKHS and arising from the log-likelihood ratio of Gaussian measures).
⁵In Theorems 1 and 2 we have assumed that the data h_j are generated from Gaussian measures. As the kernel data Γ(x_j), at least for the Gaussian kernel, are positive functions in H_κ, an assumption that the Γ(x_j) are Gaussian data would be truly void. However, in Section 4 we will show that kernel data Γ(x_j) projected onto a low dimensional subspace, say m-dimensional, can be well approximated by an m-dimensional normal. The effective working subspace for the KFDA is at most (k − 1)-dimensional for a k-class problem.
4 Justification of the Gaussian assumption
Results in the previous section are based on the Gaussian assumption in the underlying Hilbert space. In this section we provide both theoretic and empirical justification
for it. In some previous works (Mika, Rätsch and Müller, 2001; Mika, 2002; Schölkopf
and Smola, 2002), there has been empirical evidence that Gaussianity can be improved by kernel maps. Though rudimentary (they show only histograms), these findings are interesting and original. In this section we provide a more rigorous justification. The basic phenomenon is as follows: most low-dimensional projections of high-dimensional data are approximately Gaussian under suitable conditions. We start our illustration with a single population (non-centered and centered cases), and then extend it to multiple populations. As the task here is classification, some extra care should be taken to prevent the asymptotic distributions of the underlying populations from collapsing into an identical Gaussian.
4.1 Single population
We start with a single population. Though the KFDA involves at least two populations, there are two main reasons for treating a single population. One is that it is easier and more comprehensible to gain an idea of the asymptotic behavior of low-dimensional projections in the single-population case. The other is that, besides classification problems, the results obtained below can be used for other problems involving only one population, e.g., kernel principal component analysis, kernel regression, kernel canonical correlation analysis, etc. As for the non-centered and the centered cases, their purpose will become clear in Subsection 4.2 for multiple-population discrimination. Briefly speaking, what we need in practice is the centered case. It is conceptually similar to the Central Limit Theorem, where data are centered at their sample mean. However, for kernel-based classification algorithms, KFDA or otherwise, we have to be careful about the control of the window width. If the window width goes to zero too fast, projections of kernel data from different populations will have zero mean and converge to a common normal distribution, and thus become indistinguishable. This indistinguishability can be avoided by a proper control of the kernel width.
4.1.1 Non-centered case
Given data x_1, . . . , x_n, let Γ(x_1), . . . , Γ(x_n) be (nonrandom) functions in H_κ. They form the data set in kernel representation. The kernel window width σ depends on n and so do the kernel κ and the associated Hilbert space H_κ. The notations σ_n, κ_n and H_n will be used from now on to indicate their dependence on n.⁶
⁶In our setup we consider a sequence of reproducing kernel Hilbert spaces H_n, n = 1, 2, . . .. We may choose the same B = L_2(X, µ) in the abstract Wiener spaces (i, H_n, B), but in B we have a family of Gaussian measures with different covariance operators κ_n, n = 1, 2, . . .. Moreover, H_n = κ_n^{1/2}(B).
Notice that the kernel data Γ(x_j) also depend implicitly on n, but we still stick to the notation Γ, instead of Γ_n, for simplicity. Let κ_o denote the baseline kernel with window width one. Then κ_n(x, u) = σ_n^{−p} κ_o(x/σ_n, u/σ_n). The size of σ_n controls the resolution of the associated Hilbert space: the smaller σ_n is, the finer the resolution of the space H_n. The resolution can easily be seen in the spline- and wavelet-associated Hilbert spaces via a sequence of nested RKHSs. The window width σ_n should decrease to zero as the sample size n approaches infinity. Suppose that there exists a constant τ² > 0 such that for any ε > 0, as n → ∞,
(1/n) card{1 ≤ j ≤ n : |σ_n^p ‖Γ(x_j)‖²_{H_n} − τ²| > ε} → 0,
(1/n²) card{1 ≤ j, j′ ≤ n : σ_n^p |⟨Γ(x_j), Γ(x_{j′})⟩_{H_n}| > ε} → 0,
where σ_n → 0. By the reproducing property of the kernel, the two conditions above are equivalent to
(1/n) card{1 ≤ j ≤ n : |σ_n^p κ_n(x_j, x_j) − τ²| > ε} → 0,  (17)
(1/n²) card{1 ≤ j, j′ ≤ n : σ_n^p |κ_n(x_j, x_{j′})| > ε} → 0.  (18)
Condition (17) says that most kernel data {√(σ_n^p) Γ(x_j)}_{j=1}^n have squared H_n-norm near τ². Condition (18) says that most kernel data are nearly orthogonal in H_n. Let h be a
random element from a Gaussian measure with zero mean and covariance operator κ_n. The Karhunen–Loève expansion for h is
h(x) = ∑_q e_q ψ̃_{q,n}(x),  e_q ∼ iid N(0, 1),
where ψ̃_{q,n}(x) = √λ_q ψ_{q,n}(x) = √(λ_q σ_n^{−p}) ψ_{q,o}(x/σ_n) has unitary H_n-norm, i.e., ‖ψ̃_{q,n}‖_{H_n} = 1. Kernel data {√(σ_n^p) Γ(x_j)}_{j=1}^n projected along the random direction h have values given by
√(σ_n^p) ⟨h, Γ(x_1)⟩_{H_n}, · · · , √(σ_n^p) ⟨h, Γ(x_n)⟩_{H_n}.
Let θ_n(h) be the empirical distribution of this sequence, assigning probability mass n^{−1} to each √(σ_n^p) ⟨h, Γ(x_j)⟩_{H_n}.
Theorem 3 Under conditions (17) and (18), as n → ∞, the empirical distribution
θn (h) converges weakly to N (0, τ 2 ) in probability.
In Theorem 3 we have used the same convention as Diaconis and Freedman (1984).
We show in the Appendix that the random characteristic function φn,h (t) for θn (h)
converges in probability to the characteristic function φ_0(t) of N(0, τ²) for all t. Here "in probability" refers to the distribution of the random direction h. Theorem 3 is established for one-dimensional projections along most random directions h, and the result extends to m-dimensional projections along random directions h_1, . . . , h_m for an arbitrary but fixed m.
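Theorem 3 can be illustrated by simulation. By the reproducing property (see the proof of Theorem 3 in the Appendix), the projections ⟨h, Γ(x_1)⟩_{H_n}, . . . , ⟨h, Γ(x_n)⟩_{H_n} of the kernel data along a random direction h ∼ N(0, κ_n) are jointly Gaussian with covariance matrix K = [κ_n(x_j, x_{j′})], so one projected sample can be drawn as a single multivariate normal vector with covariance K. The sketch below does this for an assumed Gaussian kernel and standard-normal inputs, and compares the standardized empirical distribution with N(0, 1); the specific n, p and σ are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, sigma = 400, 10, 3.0
X = rng.standard_normal((n, p))                              # inputs x_1, ..., x_n

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2 / sigma**2)                             # K[j, j'] = kappa(x_j, x_j')

# For h ~ N(0, kappa_n), the projections <h, Gamma(x_j)>_{H_n} are jointly
# N(0, K); draw one realization of the projected sample directly.
K_jit = K + 1e-10 * np.eye(n)                                # tiny jitter for numerical stability
proj = rng.multivariate_normal(np.zeros(n), K_jit)

z = (proj - proj.mean()) / proj.std()                        # standardized empirical distribution
print(stats.kstest(z, "norm"))                               # close to N(0, 1) for a suitable window width
```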
Corollary 4 For h_k, k = 1, . . . , m, iid from the Gaussian measure with zero mean and covariance operator κ_n, the kernel data {√(σ_n^p) Γ(x_j)}_{j=1}^n projected along the random directions consisting of h = (h_1, . . . , h_m) have values given by
√(σ_n^p) ⟨h, Γ(x_1)⟩_{H_n}, · · · , √(σ_n^p) ⟨h, Γ(x_n)⟩_{H_n},
where ⟨h, Γ(x_j)⟩_{H_n} = (⟨h_1, Γ(x_j)⟩_{H_n}, . . . , ⟨h_m, Γ(x_j)⟩_{H_n}) ∈ R^m. Let θ_n(h) be the empirical distribution of this sequence, assigning probability mass n^{−1} to each point √(σ_n^p) ⟨h, Γ(x_j)⟩_{H_n} ∈ R^m. Under conditions (17) and (18), as n → ∞, the empirical distribution θ_n(h) converges weakly to N(0, τ² I_m) in probability.
Let t_n²(h) = n^{−1} ∑_{j=1}^n σ_n^p ⟨h, Γ(x_j)⟩²_{H_n} be the empirical second moment and let θ_n^1(h) be the scaled empirical distribution, assigning probability mass n^{−1} to each of the scaled projections √(σ_n^p) ⟨h, Γ(x_j)⟩_{H_n}/t_n. Under some slightly stronger conditions we will establish their asymptotic distributions. By the inequality I(|t| > ε) ≤ |t|²/ε², the following conditions imply (17) and (18):
(σ_n^p/n) ∑_{j=1}^n κ_n(x_j, x_j) → τ²,  (19)
(1/n) card{1 ≤ j ≤ n : |σ_n^p κ_n(x_j, x_j) − τ²| > ε} → 0,  (20)
(σ_n^{2p}/n²) ∑_{j,j′=1}^n κ_n²(x_j, x_{j′}) → 0.  (21)
Theorem 5 Assume conditions (19)-(21) hold. Then (a) the empirical second moment t_n²(h) converges to τ² in probability, and (b) the scaled empirical distribution θ_n^1(h) converges weakly to N(0, 1) in probability.
4.1.2 Centered case
Let Γ̄ = n^{−1} ∑_{j=1}^n Γ(x_j) be the empirical data centroid in the feature space H_κ, and let Γ̃(x_j) = Γ(x_j) − Γ̄, j = 1, . . . , n, be the centered kernel data. Below we consider the asymptotic distribution of these centered data projected along a random direction h from N(0, κ_n). Let θ_n^0(h) be the empirical distribution of the centered data projected along h, assigning probability mass n^{−1} to each √(σ_n^p) ⟨h, Γ̃(x_j)⟩_{H_n}. The conditions below are centered versions of conditions (19)-(21):
(σ_n^p/n) ∑_{j=1}^n ‖Γ̃(x_j)‖²_{H_n} → τ²,
(1/n) card{1 ≤ j ≤ n : |σ_n^p ‖Γ̃(x_j)‖²_{H_n} − τ²| > ε} → 0,
(σ_n^{2p}/n²) ∑_{j,j′=1}^n ⟨Γ̃(x_j), Γ̃(x_{j′})⟩²_{H_n} → 0.
By the kernel reproducing property, the conditions above are equivalent to conditions (22)-(24):
(σ_n^p/n) ∑_{j=1}^n [κ_n(x_j, x_j) − κ_··] → τ²,  (22)
(1/n) card{1 ≤ j ≤ n : |σ_n^p (κ_n(x_j, x_j) − 2κ_j· + κ_··) − τ²| > ε} → 0,  (23)
(σ_n^{2p}/n²) ∑_{j,j′=1}^n [κ_n(x_j, x_{j′}) − κ_j· − κ_·j′ + κ_··]² → 0,  (24)
where
κ_j· = n^{−1} ∑_{j′=1}^n κ_n(x_j, x_{j′}),  κ_·j′ = n^{−1} ∑_{j=1}^n κ_n(x_j, x_{j′}),  and  κ_·· = n^{−2} ∑_{j,j′=1}^n κ_n(x_j, x_{j′}).
Let s_n²(h) = n^{−1} σ_n^p ∑_{j=1}^n ⟨h, Γ̃(x_j)⟩²_{H_n}, and let θ_n^1(h) be the scaled empirical distribution, which assigns probability mass n^{−1} to each of the scaled projections √(σ_n^p) ⟨h, Γ̃(x_j)⟩_{H_n}/s_n.
Theorem 6 Assume conditions (22)-(24) hold. Then (a) the empirical variance s_n²(h) converges to τ² in probability, (b) the centered empirical distribution θ_n^0(h) converges weakly to N(0, τ²) in probability, and (c) the scaled empirical distribution θ_n^1(h) converges weakly to N(0, 1) in probability.
Theorem 6 extends to m-dimensional projections along random directions h_1, . . . , h_m for an arbitrary but fixed m.
Corollary 7 For h_k, k = 1, . . . , m, iid from the Gaussian measure with zero mean and covariance operator κ_n, the centered kernel data {√(σ_n^p) Γ̃(x_j)}_{j=1}^n projected along the random directions consisting of h = (h_1, . . . , h_m) have values given by
√(σ_n^p) ⟨h, Γ̃(x_1)⟩_{H_n}, · · · , √(σ_n^p) ⟨h, Γ̃(x_n)⟩_{H_n},
where ⟨h, Γ̃(x_j)⟩_{H_n} = (⟨h_1, Γ̃(x_j)⟩_{H_n}, . . . , ⟨h_m, Γ̃(x_j)⟩_{H_n}) ∈ R^m. Let θ_n^0(h) be the empirical distribution of this sequence, assigning probability mass n^{−1} to each point √(σ_n^p) ⟨h, Γ̃(x_j)⟩_{H_n}. Under conditions (22)-(24), the empirical distribution θ_n^0(h) converges weakly to N(0, τ² I_m) in probability as n → ∞.
With all the above asymptotic results, it is natural to ask when conditions (22)-(24) will be met to validate Theorem 6 and Corollary 7.
Theorem 8 Let X_1, X_2, X_3, . . . be an iid sequence from a distribution having a continuous probability density function p(x) on X which satisfies the condition ∫ p²(x) dx < ∞. Assume that κ_o(s, t) is a symmetric translation-type decreasing kernel with tail decay as ‖t − s‖ → ∞. Also assume that σ_n → 0 as n → ∞. Then conditions (22)-(24) hold for almost all realizations of X_1, X_2, X_3, . . ..
Remark 4 The KFDA algorithm for binary classification has an effective working subspace of dimension one. Theorem 6 says that most one-dimensional projections of kernel data can be well approximated by a Gaussian distribution under suitable conditions. Corollary 7 further extends the result to m-dimensional projections for multiclass discriminant analysis. Theorem 8 validates conditions (22)-(24) for the asymptotic approximate Gaussianity. Though Theorem 8 is established for symmetric translation-type kernels, it extends to integral-translation-invariant kernels, i.e., kernels with κ_o(x, u) = κ_o(x − m, u − m) for any integer m. Wavelets and splines are of this type.
Remark 5 In either the centered or non-centered case, the asymptotic empirical distribution does not depend on the random projection direction h. This phenomenon
indicates that the distribution of the kernel data Γ(xj ) looks spherically symmetric
over Hn when n is large. Here the spherical symmetry is with respect to the Hn norm
rather than the L2 norm.
4.2 Multiple populations
Suppose training inputs x1 , . . . , xn are from one of the populations π1 , . . . , πk with
corresponding group labels y1 , . . . , yn . With kernel representation, the data set then
consists of {(Γ(x_j), y_j)}_{j=1}^n. Let Γ̄_i := n_i^{−1} ∑_{j∈I_i} Γ(x_j) be the i-th group data centroid and let Γ̃(x_j) be the centered data (about the individual group centroid) given by
Γ̃(x_j) = Γ(x_j) − Γ̄_i, if x_j belongs to the i-th group, i.e., y_j = i.
Suppose, for each group, conditions (22)-(24) hold. By projecting the centered data
from all groups along a common random direction h ∼ N (0, κn ), Theorem 6 says
that, for each group, the empirical distribution of centered data converges weakly to
a common normal distribution. For the task of classification it is necessary that these
group centroids are distinct. Notice that Theorem 6 is established under conditions
(22)-(24). However, if conditions (17)-(18) are also met, then by Theorem 3 all these
k non-centered empirical distributions, denoted by θni (h), i = 1, . . . , k, will converge
weakly to an identical distribution N (0, τ 2 ), which implies that all group centroids
are converging to zero and becoming indistinguishable by their asymptotic empirical
distributions, centered or non-centered alike. Therefore, ideally we need conditions (22)-(24) to be met while condition (18) is violated, so that these k populations can be discriminated by their group means. Otherwise, they are asymptotically indistinguishable, as their means collapse to zero. Next we discuss the control of the window width σ_n to validate or invalidate these technical conditions.
Proposition 9 Let X_0, X_1, X_2, X_3, . . . be an iid sequence from a distribution having a continuous probability density function p(x) on X and satisfying ∫_X p²(u) du < ∞. Assume that κ_o(s, t) is a symmetric translation-type decreasing kernel with tail decay to zero as ‖s − t‖ → ∞. Also assume that σ_n → 0. Let r = lim_{n→∞} nσ_n^p and
S_{n,ε} = card{1 ≤ j ≤ n : σ_n^p |κ_n(X_j, X_0)| > ε}.
(a) If 0 ≤ r < ∞, then for any small ε > 0 and any positive integer i we have
lim_{n→∞} P{S_{n,ε} = i} = e^{−r q_ε} (r q_ε)^i / i!
for some 0 < q_ε < ∞ (where q_ε is given in the proof). In particular, if r = 0, we have σ_n^p κ_n(X_j, X_0) → 0 a.s. (b) If r = ∞, then for any ε > 0 we have S_{n,ε} → ∞ in probability.
Again, the above results extend to integral-translation-invariant kernels.
Remark 6 Result (a) above implies that if σ_n → 0 at a rate O(n^{−1/p}) or faster, then for arbitrarily small but fixed η > 0 we have
lim_{n→∞} P{S_{n,ε}/n < η} = lim_{n→∞} ∑_{0≤i<nη} P{S_{n,ε} = i} = 1.
Thus,
S_{n,ε}/n = (1/n) card{1 ≤ j ≤ n : σ_n^p |κ_n(X_j, X_0)| > ε} → 0 in probability,
which also implies
(1/n²) card{1 ≤ j, j′ ≤ n : σ_n^p |κ_n(X_j, X_{j′})| > ε} → 0 in probability.
Therefore, "σ_n = O(n^{−1/p}) or of a smaller order of magnitude" is the case that we should avoid. On the other hand, if σ_n → 0 is controlled in such a way that r = lim_{n→∞} nσ_n^p = ∞, then although result (b) of Proposition 9 does not guarantee the almost-sure or in-probability failure of condition (18), it says that for each j′-th column, card{1 ≤ j ≤ n : σ_n^p |κ_n(X_j, X_{j′})| > ε} → ∞ in probability. In conclusion, "lim_{n→∞} nσ_n^p = ∞" is the asymptotic window size that we should take.
Remark 7 From Proposition 9, we see that the kernel employed should have tail decay and the ideal window width should be controlled so that nσ_n^p → ∞ and σ_n → 0. This is compatible with the well-known optimal window width σ_n = O(n^{−1/(2s+p)}) for nonparametric function estimation, where p is the dimension of x and s is the kernel order. (In particular, s = 2 for the Gaussian and Epanechnikov kernels used in the later empirical study.)
Remark 8 Polynomial kernels do not satisfy the tail decay property. Hence, the theorems discussed in Section 4 do not apply to polynomial kernels.
4.3 Empirical examples
We use three examples to show (1) the influence of kernel transformation on data
normality and their elliptical symmetry, (2) the effect of window widths and (3)
projections along random directions versus the most discriminant direction.
Example 1 In this example, we show how the kernel map can bring the empirical data distribution to a better elliptical symmetry in low-dimensional projections.
Consider a random sample of size 200 consisting of data {x_i := (x_{i1}, . . . , x_{i5})}_{i=1}^{200}, where
x_{i1}, x_{i3}, x_{i4}, x_{i5} ∼ iid uniform(0, 2π), and
x_{i2} = sin(x_{i1}) + ε_i,  ε_i ∼ iid N(0, τ²) with τ = 0.4.
The data scatter plot over the first two coordinates is shown in Figure 1, and it reveals the pattern of the sine function. Next, we compute the kernel data K = [κ(x_j, x_{j′})] using a Gaussian kernel with window width σ = √10, where the data {x_j}_{j=1}^{200} are scaled to have unit variance in each coordinate prior to forming the kernel data. Two random columns are drawn from K and their scatter is plotted in Figure 2. It shows much better elliptical symmetry. (Notice that a random column from K is exactly the projection of the kernel data {Γ(x_j)}_{j=1}^{200} along a direction randomly picked from the set D = {Γ(x_j)}_{j=1}^{200}.) The raw-data normal probability plot for the second coordinate is in Figure 3. This normal probability plot is produced using the Matlab m-file "normplot". A characteristic of the normal probability plot is that a normal distribution is plotted as a straight line; substantial departures from a straight line are signs of non-normality. The "blue +" in Figure 3 are data points and the red line is the reference line for a fitted normal. For comparison, projections of the kernel data along directions from the set D are taken and their normality is checked. We present normal probability plots for the best four, median four and worst four projections among D in Figures 4-6.
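A sketch reproducing this experiment in Python, with a Kolmogorov-Smirnov check standing in for the Matlab normal probability plots, might look as follows; the random seed, and hence the exact numbers, are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 2 * np.pi, n)
X = np.column_stack([x1,
                     np.sin(x1) + rng.normal(0, 0.4, n),      # x_2 = sin(x_1) + eps_i
                     rng.uniform(0, 2 * np.pi, (n, 3))])      # x_3, x_4, x_5

Xs = X / X.std(axis=0)                                        # unit variance in each coordinate
d2 = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2 / 10.0)                                  # Gaussian kernel, sigma = sqrt(10)

# A random column of K is the projection of {Gamma(x_j)} along a direction in D;
# compare the raw coordinate x_2 with such a projection via a normality check.
col = K[:, rng.integers(n)]
for name, v in [("raw x_2", X[:, 1]), ("kernel projection", col)]:
    z = (v - v.mean()) / v.std()
    print(name, stats.kstest(z, "norm").pvalue)
```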
Figure 1: Scatter plot of (x_1, x_2).
Figure 2: Kernel data scatter plot.
Figure 3: Normal probability plot for x_2.
Figure 4: Best four projections of kernel data.
Example 2 Draw 10000 random samples from an exponential distribution with probability density function p(x) = exp(−x). Arrange them in a 200 × 50 matrix, denoted
by X. The data matrix X represents a random sample of size 200 in R50 . Each row
of X is an observation in a 50-dimensional space and there are 200 observations in total. Each column of X represents a variable coordinate axis and is a one-dimensional
projection of observations along that particular variable coordinate. There are 50
variable coordinate axes. Subject all these 50 one-dimensional projections to normality checks using (1) the Kolmogorov-Smirnov test of a continuous distribution (centered and coordinatewise scaled to unit variance) against a standard normal, and (2) the normal probability plot. The average p-value and its standard deviation over these 50 Kolmogorov-Smirnov tests are:
average p-value = 2.853 × 10^{−4}, sample standard deviation = 5.633 × 10^{−4}.
We also plot the normal probability plots for the best 4 out of 50 projections (along the
coordinate axes) closest to a Gaussian (Figure 7), the median 4 projections (Figure 8),
and the worst 4 projections farthest from a Gaussian (Figure 9).
Figure 5: Median four projections of kernel data.
Figure 6: Worst four projections of kernel data.
Associated p-values for the best four, median four and worst four cases are reported below:
best 4 : 0.0025 0.0023 0.0018 0.0010
median 4 : 0.0000 0.0000 0.0000 0.0000
worst 4 : 0.0000 0.0000 0.0000 0.0000
From these p-values and the normal probability plots in Figures 7-9, it is clear that this random sample of size 200 from a 50-dimensional exponential distribution is far from Gaussian.
Figure 7: Best four projections of original data.
Figure 8: Median four projections of original data.
Figure 9: Worst four projections of original data.
To improve the Gaussianity of data distribution, we transform the data by the
Aronszajn kernel map Γ in (8). We try out using both the Gaussian kernel and the Epanechnikov kernel. Via the kernel map Γ, each data point x_j ∈ R^50 is given a new representation as a function. For the case of a Gaussian kernel, we have
x_j → Γ(x_j) = κ(x_j, x) = exp(−0.5 ‖x − x_j‖²_2 / σ²) with σ = 10.
Let x = (x_1, . . . , x_k, . . . , x_50) and x_j = (x_{j1}, . . . , x_{jk}, . . . , x_{j50}). For the case of an Epanechnikov kernel, we have
x_j → Γ(x_j) = κ(x_j, x) = ∏_{k=1}^{50} (1 − (x_k − x_{jk})²/σ²) I{|x_k − x_{jk}| < σ} with σ = 10,
where I is an indicator function. We check whether one-dimensional projections of the kernel data {Γ(x_j)}_{j=1}^{200} resemble a Gaussian random sample. We consider one-dimensional projections along directions randomly picked from the set D = {Γ(x_j)}_{j=1}^{200}. That is, the k-th column of the kernel matrix K comes from projecting the kernel data {Γ(x_j)}_{j=1}^{200} along the functional direction κ(x_k, ·). In other words, the columns of K are the result of one-dimensional projections. These columns are then subjected to normality checks. The results show that one-dimensional projections of kernel-transformed data are much closer to a normal. Listed below are summary p-values for the best four, median four and worst four out of 200 columns (i.e., projections) of the transformed data. For the case of a Gaussian kernel, we have
best 4 : 0.9943 0.9794 0.9727 0.9634
median 4 : 0.5771 0.5769 0.5702 0.5658
worst 4 : 0.0975 0.0854 0.0288 0.0202
average : 0.5650 (std deviation = 0.2542)
For the case of an Epanechnikov kernel, we have
best 4 : 0.9583 0.9527 0.9134 0.9063
median 4 : 0.3301 0.3291 0.3123 0.3122
worst 4 : 0.0263 0.0261 0.0210 0.0044
average : 0.3662 (std deviation = 0.2441)
Normal probability plots are presented in Figures 10-12 for Gaussian-kernel transformed data, and in Figures 13-15 for Epanechnikov-kernel transformed data. Again,
colored “+” denote data points, different colors stand for different projections, and
the dashed lines are references lines for fitted normals. It is clear that kernel data
in Figures 10-15 follow their reference lines much closer than raw data in Figures
7-9, which indicates projections of kernel transformed data are much better Gaussian
than untransformed data.
Figure 10: Best four projections of Gaussian kernel data.
Figure 11: Median four projections of Gaussian kernel data.
Figure 12: Worst four projections of Gaussian kernel data.
Example 3 This is a two-population example, which aims to show the importance of choosing a proper window size to avoid merging the two populations into an indistinguishable one. Throughout this example an Epanechnikov kernel is used. Random samples of size 200 each are drawn from distributions on R^50 with probability density functions p_1(x) = ∏_{i=1}^{50} exp(−x_i), x_i > 0, and p_2(x) = ∏_{i=1}^{50} exp(−x_i + 2), x_i > 2, respectively. Next, we compute the kernel matrix K = [κ_σ(x_j, x_{j′})]_{400×400}
with σ = 10, 5, 3, 2. The first 200 rows in K are from population one and the remaining 200 rows are from population two. Then a one-dimensional projection along a random direction e ∼ N(0, 0.05² I_400) is calculated, and the resulting normal probability plots are displayed in Figures 16-19. (The quantity 0.05² is used to normalize the scale of the projection vector. It does not affect the appearance of the normal probability plots, only the scale of the horizontal axis.) From these plots we see clearly that one-dimensional random projections have good normality and that, as the window width decreases, the two normals merge together.
Next the kernel data K together with their group labels are used to train for
the most discriminant direction using KFDA. We then draw two extra test samples,
each of size 200 from p1 and p2 respectively, and project their kernel transforms
along the KFDA-found most discriminant direction. Their normal probability plots
are displayed in Figures 20-23. We see that data projected along the KFDA-found
most discriminant direction have better separation (Figure 20) between groups than
along a random direction (Figure 16), if the window size is right. As the window size
decreases, the two underlying samples tend to merge together even along the most discriminant direction.
Figure 13: Best four projections of Epanechnikov kernel data.
Figure 14: Median four projections of Epanechnikov kernel data.
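The two-population experiment of Example 3 can be sketched as follows; the KFDA-style helper mirrors the binary sketch in Section 2 and is ours, not the authors' code. The qualitative point is simply that the standardized separation between the projected test samples degrades as the window width σ decreases.

```python
import numpy as np

def epan_kernel(X, U, sigma):
    diff = X[:, None, :] - U[None, :, :]
    return np.prod((1 - diff**2 / sigma**2) * (np.abs(diff) < sigma), axis=-1)

rng = np.random.default_rng(0)
n, p = 200, 50
X1, X2 = rng.exponential(size=(n, p)), rng.exponential(size=(n, p)) + 2.0    # samples from p_1, p_2
X = np.vstack([X1, X2]); y = np.r_[np.zeros(n, int), np.ones(n, int)]
X1t, X2t = rng.exponential(size=(n, p)), rng.exponential(size=(n, p)) + 2.0  # extra test samples

for sigma in (10, 5, 3, 2):
    K = epan_kernel(X, X, sigma)
    kbar = [K[:, y == c].mean(axis=1) for c in (0, 1)]
    Mw = (K @ K.T - sum(n * np.outer(kb, kb) for kb in kbar)) / (2 * n - 2)
    alpha = np.linalg.solve(Mw + 1e-3 * np.eye(2 * n), kbar[1] - kbar[0])    # discriminant direction
    s1 = epan_kernel(X1t, X, sigma) @ alpha                                  # projected test scores
    s2 = epan_kernel(X2t, X, sigma) @ alpha
    gap = abs(s1.mean() - s2.mean()) / np.sqrt(0.5 * (s1.var() + s2.var()) + 1e-12)
    print(f"sigma = {sigma:>2}: standardized separation = {gap:.2f}")
```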
5 Concluding discussion
In Subsection 3.2 we only considered the case of Gaussian measures having a common
covariance operator, which leads to a linear classifier in H. For the case of distinct
covariance operators, see Rao and Varadarajan (1963), which will lead to a quadratic
classifier in H. Often an extension to quadratic machines (in the feature space) is
straightforward using some existing statistical notion, e.g., the maximum likelihood
ratio discussed in this article. In Subsection 4.2 we have discussed the asymptotic
behavior of projections for the purpose of multiclass classification. The populations
are distinguished solely by their means in the feature space. Their covariances are
assumed to share a common structure and are pooled together, and do not play any
role in the classification. Sometimes, covariance structures may also carry useful
population-dependent information and should be considered to enter the classifier,
which will lead to a quadratic extension. We do not advocate the routine use of quadratic extensions, but merely point out the existing opportunities.
Figure 15: Worst four projections of Epanechnikov kernel data.
The study of asymptotic distributions of projections of kernel data is not limited
to KFDA. Results in Section 4 apply to general kernel machines including the support
vector machines.
Acknowledgment
The authors thank Yuh-Jye Lee for helpful comments. This research was partially
supported by the National Science Council of Taiwan, R.O.C., grant number NSC93-2118-M-001-015.
Appendix
In this appendix, E_h, var_h and cov_h denote, respectively, the expectation, variance and covariance with respect to the distribution of the random element h.
Proof of Theorem 2: For an arbitrary pair f, g ∈ H, the random vector (⟨g, h⟩_H, ⟨f, h⟩_H) has a 2-dimensional Gaussian distribution by Definition 2. The mean of the first variate is E_h⟨g, h⟩_H = ⟨g, m⟩_H, and the covariance of the two variates is
cov_h{⟨g, h⟩_H, ⟨f, h⟩_H} = ⟨g, Λf⟩_H.
Notice that the pairs (⟨g, h_j⟩_H, ⟨f, h_j⟩_H) are iid from the distribution above. The MLEs for a 2-dimensional Gaussian are clear. □
Figure 16: Random direction, σ = 10.
Figure 17: Random direction, σ = 5.
Proof of Theorem 3: The characteristic function of θ_n(h) is
φ_{n,h}(t) = n^{−1} ∑_{j=1}^n exp{i t √(σ_n^p) ⟨h, Γ(x_j)⟩_{H_n}}.
Notice that var_h{⟨h, Γ(x_j)⟩_{H_n}} = ⟨Γ(x_j), Γ(x_j)⟩_{H_n} = κ_n(x_j, x_j). Then,
E_h φ_{n,h}(t) = n^{−1} ∑_{j=1}^n exp{−t² σ_n^p κ_n(x_j, x_j)/2} → exp{−t² τ²/2} := φ_0(t)  (by (17)),
where φ_0(t) is the characteristic function of N(0, τ²). Likewise, since
var_h{⟨h, Γ(x_j) − Γ(x_{j′})⟩_{H_n}} = κ_n(x_j, x_j) + κ_n(x_{j′}, x_{j′}) − 2κ_n(x_j, x_{j′}),
we have
E_h{|φ_{n,h}(t)|²} = n^{−2} ∑_{j,j′=1}^n E_h{exp(i t √(σ_n^p) ⟨h, Γ(x_j) − Γ(x_{j′})⟩_{H_n})}
 = n^{−2} ∑_{j,j′=1}^n exp{−t² σ_n^p [κ_n(x_j, x_j) + κ_n(x_{j′}, x_{j′}) − 2κ_n(x_j, x_{j′})]/2} → φ_0²(t)
by conditions (17) and (18). Next, by Chebychev's inequality,
Prob{|φ_{n,h}(t) − φ_0(t)| > ε} ≤ ε^{−2} E_h|φ_{n,h}(t) − φ_0(t)|² = ε^{−2} [E_h|φ_{n,h}(t)|² + φ_0²(t) − 2φ_0(t) E_h φ_{n,h}(t)] → 0.
That is, φ_{n,h}(t) → φ_0(t) in probability for each t. Using Lemma 2.2 of Diaconis and Freedman (1984), we conclude that θ_n(h) converges weakly to N(0, τ²) in probability. □
Figure 18: Random direction, σ = 3.
Figure 19: Random direction, σ = 2.
Proof of Corollary 4: Let t = (t_1, . . . , t_m). The characteristic function of θ_n(h) is
φ_{n,h}(t) = n^{−1} ∑_{j=1}^n exp{i ∑_{k=1}^m t_k √(σ_n^p) ⟨h_k, Γ(x_j)⟩_{H_n}}.
Notice that cov_h{⟨h, Γ(x_j)⟩_{H_n}} = ⟨Γ(x_j), Γ(x_j)⟩_{H_n} I_m = κ_n(x_j, x_j) I_m, where I_m is the m × m identity matrix. Then, by condition (17),
E_h φ_{n,h}(t) = n^{−1} ∑_{j=1}^n exp{−∑_{k=1}^m t_k² σ_n^p κ_n(x_j, x_j)/2} → exp{−∑_{k=1}^m t_k² τ²/2} := φ_0(t),
where φ_0(t) is the characteristic function of N(0, τ² I_m). Likewise, since
cov_h{⟨h, Γ(x_j) − Γ(x_{j′})⟩_{H_n}} = (κ_n(x_j, x_j) + κ_n(x_{j′}, x_{j′}) − 2κ_n(x_j, x_{j′})) I_m,
and by conditions (17) and (18),
E_h{|φ_{n,h}(t)|²} = n^{−2} ∑_{j,j′=1}^n E_h{exp(i ∑_{k=1}^m t_k √(σ_n^p) ⟨h_k, Γ(x_j) − Γ(x_{j′})⟩_{H_n})} → φ_0²(t).
Similar to the proof of Theorem 3, we have, by Chebychev's inequality, φ_{n,h}(t) → φ_0(t) in probability for each t. Lemma 2.2 of Diaconis and Freedman (1984) easily extends to the multivariate setting. We then conclude that θ_n(h) converges weakly to N(0, τ² I_m) in probability. □
Figure 20: Test samples projected along the most discriminant direction, σ = 10.
Figure 21: Test samples projected along the most discriminant direction, σ = 5.
Proof of Theorem 5: (a) By (19),
E_h t_n² = n^{−1} ∑_{j=1}^n E_h σ_n^p ⟨h, Γ(x_j)⟩²_{H_n} = n^{−1} σ_n^p ∑_{j=1}^n κ_n(x_j, x_j) → τ².
Since (⟨h, Γ(x_j)⟩_{H_n}, ⟨h, Γ(x_{j′})⟩_{H_n}) is bivariate normal with zero mean and covariance matrix
[ κ_n(x_j, x_j)  κ_n(x_j, x_{j′}) ; κ_n(x_{j′}, x_j)  κ_n(x_{j′}, x_{j′}) ],
similarly to Lemma 2.3 in Diaconis and Freedman (1984) we have
E_h[⟨h, Γ(x_j)⟩²_{H_n} ⟨h, Γ(x_{j′})⟩²_{H_n}] = 2[κ_n(x_j, x_{j′})]² + κ_n(x_j, x_j) κ_n(x_{j′}, x_{j′}).
By conditions (20) and (21),
E_h t_n⁴ = n^{−2} σ_n^{2p} E_h[∑_{j=1}^n ⟨h, Γ(x_j)⟩²_{H_n}]² = n^{−2} σ_n^{2p} ∑_{j,j′=1}^n E_h[⟨h, Γ(x_j)⟩²_{H_n} ⟨h, Γ(x_{j′})⟩²_{H_n}]
 = 2 n^{−2} σ_n^{2p} ∑_{j,j′=1}^n [κ_n(x_j, x_{j′})]² + n^{−2} σ_n^{2p} ∑_{j,j′=1}^n κ_n(x_j, x_j) κ_n(x_{j′}, x_{j′}) → τ⁴.
Then, P{|t_n² − τ²| > ε} ≤ ε^{−2} E_h(t_n² − τ²)² = ε^{−2}(E_h t_n⁴ − 2τ² E_h t_n² + τ⁴) → 0 for any ε > 0. Thus, t_n² → τ² in probability.
(b) By Theorem 3 and part (a), the assertion that θ_n^1 → N(0, 1) weakly in probability follows from Slutzky's lemma. □
Figure 22: Test samples projected along the most discriminant direction, σ = 3.
Figure 23: Test samples projected along the most discriminant direction, σ = 2.
Proof of Theorem 6: (a) By (22),
E_h s_n² = n^{−1} ∑_{j=1}^n E_h σ_n^p ⟨h, Γ̃(x_j)⟩²_{H_n} = n^{−1} σ_n^p ∑_{j=1}^n [κ_n(x_j, x_j) − κ_··] → τ².
Similarly, by conditions (23) and (24), we have E_h s_n⁴ → τ⁴. Thus, s_n² → τ² in probability.
(b) The characteristic function of θ_n^0(h) is
φ_n(h, t) = n^{−1} ∑_{j=1}^n exp{i t √(σ_n^p) ⟨h, Γ̃(x_j)⟩_{H_n}}.
Since var_h{⟨h, Γ̃(x_j)⟩_{H_n}} = ⟨Γ̃(x_j), Γ̃(x_j)⟩_{H_n} = κ_n(x_j, x_j) − 2κ_j· + κ_··, we have
E_h φ_n(h, t) = n^{−1} ∑_{j=1}^n exp{−t² σ_n^p [κ_n(x_j, x_j) − 2κ_j· + κ_··]/2} → exp{−t² τ²/2}
by condition (23). Likewise, since var_h{⟨h, Γ̃(x_j) − Γ̃(x_{j′})⟩_{H_n}} = ‖Γ̃(x_j) − Γ̃(x_{j′})‖²_{H_n}, then by conditions (22)-(24),
E_h{|φ_n(h, t)|²} = n^{−2} ∑_{j,j′=1}^n E_h{exp(i t √(σ_n^p) ⟨h, Γ̃(x_j) − Γ̃(x_{j′})⟩_{H_n})}
 = n^{−2} ∑_{j,j′=1}^n exp{−t² σ_n^p ‖Γ̃(x_j) − Γ̃(x_{j′})‖²_{H_n}/2} → exp{−t² τ²}.
Thus, by Chebychev's inequality, φ_n(h, t) → exp{−t² τ²/2} in probability. We may then conclude that θ_n^0(h) converges weakly to N(0, τ²) in probability.
(c) This follows from Slutzky's lemma. □
The proof of Corollary 7 is similar to those of Corollary 4 and Theorem 6, and is thus omitted.
Proof of Theorem 8: We first show that condition (22) holds for almost all realizations of X_1, X_2, X_3, . . .. Let U_n := (n(n − 1))^{−1} ∑_{j≠j′} κ_o(X_j/σ_n, X_{j′}/σ_n). By Hoeffding's inequality for U-statistics we have
∑_n P{|U_n − EU_n| > ε} ≤ ∑_n 2 exp{−n ε²/κ_o²(0)} < ∞.
Then by the Borel-Cantelli lemma, U_n − EU_n → 0 almost surely. Notice that
EU_n = Eκ_o(X/σ_n, X′/σ_n) → 0 as σ_n → 0.
Thus, U_n → 0 almost surely. That is, the average of the off-diagonal elements of the kernel matrix converges to zero a.s. As the diagonal elements are all equal to τ² for a symmetric translation-type kernel and the average of the off-diagonal elements converges to zero a.s., condition (22) holds almost surely. Similarly one can show that condition (23) holds a.s.
Next we show that condition (24), lim_{n→∞} n^{−2} σ_n^{2p} ∑_{j,j′=1}^n ⟨Γ̃(X_j), Γ̃(X_{j′})⟩²_{H_n} = 0, holds a.s. The proof can be resolved into the steps below.
• After some straightforward calculation, we obtain the decomposition
n^{−2} σ_n^{2p} ∑_{j,j′=1}^n ⟨Γ̃(X_j), Γ̃(X_{j′})⟩²_{H_n}
 = σ_n^{2p} ⟨Γ̄ − m_n, Γ̄ − m_n⟩²_{H_n} − 2n^{−1} σ_n^{2p} ∑_{j=1}^n ⟨Γ(X_j) − m_n, Γ̄ − m_n⟩²_{H_n}  (25)
 + n^{−2} σ_n^{2p} ∑_{j,j′=1}^n ⟨Γ(X_j) − m_n, Γ(X_{j′}) − m_n⟩²_{H_n},  where m_n = EΓ(X).  (26)
• Since the √(σ_n^p) Γ(X_j) are iid with finite second moment τ², we have σ_n^p ‖Γ̄ − m_n‖²_{H_n} → 0 a.s. Thus, the first term in (25), σ_n^{2p} ‖Γ̄ − m_n‖⁴_{H_n}, → 0 a.s. as well.
• Next, the second term in (25) satisfies
n^{−1} ∑_{j=1}^n σ_n^{2p} ⟨Γ(X_j) − m_n, Γ̄ − m_n⟩²_{H_n} ≤ [n^{−1} ∑_{j=1}^n σ_n^p ‖Γ(X_j) − m_n‖²_{H_n}] σ_n^p ‖Γ̄ − m_n‖²_{H_n} → 0 a.s.
• Let V_n = ∑_{j≠j′} σ_n^{2p} ⟨Γ(X_j) − m_n, Γ(X_{j′}) − m_n⟩²_{H_n} / [n(n − 1)]. Using Hoeffding's inequality for U-statistics, we have
∑_n P{|V_n − EV_n| > ε} ≤ 2 ∑_n e^{−n ε²/κ_o⁴(0)} < ∞, ∀ε > 0.
Thus, V_n − EV_n → 0 a.s. Finally, we show that lim_{n→∞} EV_n = 0. For any fixed ε > 0, let U_ε = sup{‖t‖ : |κ_o(t)| > ε}. Notice that we have the equalities
E⟨Γ(X_j), Γ(X_{j′})⟩_{H_n} = Eκ_n(X_j, X_{j′}) = E⟨Γ(X_j), m_n⟩_{H_n} = ‖m_n‖²_{H_n}.
Then
EV_n = σ_n^{2p} E⟨Γ(X) − m_n, Γ(X′) − m_n⟩²_{H_n}
 = σ_n^{2p} E{κ_n(X, X′) − ⟨m_n, Γ(X)⟩_{H_n} − ⟨m_n, Γ(X′)⟩_{H_n} + ‖m_n‖²_{H_n}}²
 = σ_n^{2p} {Eκ_n²(X, X′) − ‖m_n‖⁴_{H_n}} ≤ σ_n^{2p} Eκ_n²(X, X′)
 = ∫∫ κ_o²((x − u)/σ_n) p(x) p(u) dx du ≤ ∫∫_{‖x−u‖≤σ_n U_ε} τ⁴ p(x) p(u) dx du + ε²,
where lim_{n→∞} ∫∫_{‖x−u‖≤σ_n U_ε} p(x) p(u) dx du = 0 if ∫ p²(x) dx < ∞. □
Lemma 1 (Poisson approximation) Suppose that for each n, Z_{n1}, . . . , Z_{n r_n} are independent random variables, where each Z_{nk} is a Bernoulli trial with probability of success p_{nk}. If lim_{n→∞} ∑_{k=1}^{r_n} p_{nk} = q, 0 ≤ q < ∞, and lim_{n→∞} max_{1≤k≤r_n} p_{nk} = 0, then
lim_{n→∞} P{∑_{k=1}^{r_n} Z_{nk} = i} = e^{−q} q^i / i!,  i = 0, 1, 2, . . . .
A proof of the above lemma can be found in Billingsley (1986). Also notice that, by letting q → ∞, ∑_{k=1}^{r_n} Z_{nk} → ∞ in probability.
Proof of Proposition 9: For any fixed ε > 0, let U_ε = sup{‖t‖ : |κ_o(t)| > ε}. Then,
σ_n^{−p} P{|κ_o((X_j − X_0)/σ_n)| > ε} = σ_n^{−p} P{‖X_j − X_0‖ ≤ σ_n U_ε}
 = σ_n^{−p} ∫∫_{‖t−u‖/σ_n ≤ U_ε} p(t) p(u) dt du = ∫∫_{‖t‖≤U_ε} p(σ_n t + u) p(u) dt du
 → v_p U_ε^p ∫_X p²(u) du, as n → ∞,  (27)
where v_p is the volume of the unit ball in R^p. Let I_{nj} = I{σ_n^p |κ_n(X_j, X_0)| > ε}, where I is an indicator function, and S_{n,ε} = ∑_{j=1}^n I_{nj}. From (27) and the definition of r, we have
lim_{n→∞} ∑_{j=1}^n P{I_{nj} = 1} = r v_p U_ε^p ∫_X p²(u) du.
Denote the limit above by r q_ε, where 0 < q_ε = v_p U_ε^p ∫_X p²(u) du < ∞. Also notice that
max_{1≤j≤n} P{I_{nj} = 1} = P{|κ_o(X_j/σ_n, X_0/σ_n)| > ε} → 0, as σ_n → 0.
Thus, by Lemma 1, for the case 0 ≤ r < ∞, we have
lim_{n→∞} P{S_{n,ε} = i} = e^{−r q_ε} (r q_ε)^i / i!,
and, for the case r = ∞, we have S_{n,ε} → ∞ in probability. □
References
Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc.,
68, 337–404.
Baudat, G. and Anouar, F. (2000). Generalized discriminant analysis using a kernel
approach. Neural Computation, 12, 2385–2404.
Billingsley, P. (1986). Probability and Measure. 2nd ed., John Wiley & Sons, New
York.
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–279.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector
Machines and Other Kernel-Based Learning Methods. Cambridge University
Press.
Diaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection pursuit.
Ann. Statist., 12, 793–815.
Friedman, J. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc.,
84, 165–175.
Grenander, U. (1950). Stochastic processes and statistical inference. Arkiv för
Matematik, 1, 195–277.
Grenander, U. (1963). Probabilities on Algebraic Structures. Almqvist & Wiksells,
Stockholm, and John Wiley & Sons, New York.
Grenander, U. (1981). Abstract Inference. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York.
Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures.
J. R. Statist. Soc. B, 58, 155–176.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.
Hein, M. and Bousquet, O. (2004). Kernels, associated structures and generalizations. Technical report, Max Planck Institute for Biological Cybernetics,
Germany.
http://www.kyb.tuebingen.mpg.de/techreports.html.
Herbrich, R., Graepel, T. and Campbell, C. (2001). Bayes point machines. J.
Machine Learning Research, 1, 245–279.
Janson, S. (1997). Gaussian Hilbert Spaces. Cambridge University Press, Cambridge.
Kuo, H.H. (1975). Gaussian Measures in Banach Spaces. Lecture Notes in Mathematics, 463. Springer-Verlag.
Lee, Y.J. and Mangasarian, O.L. (2001). RSVM: reduced support vector machines.
Proceeding 1st International Conference on Data Mining, SIAM.
Mahalanobis, P.C. (1925). Analysis of race mixture in Bengal. J. Asiat. Soc.
(Bengal), 23, 301.
Mika, S. (2002). Kernel Fisher Discriminants. Ph.D. dissertation, Electrical Engineering and Computer Science, Technische Universität Berlin.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B. and Müller, K.-R. (1999). Fisher
discriminant analysis with kernels. In Hu, Y.-H., Larsen, J., Wilson, E., and
Douglas, S., eds, Neural Networks for Signal Processing, IX, 41–48, IEEE.
Mika, S., Rätsch, G. and Müller, K.-R. (2001). A mathematical programming approach to the kernel Fisher Algorithm. In T.K. Leen, T.G. Dietterich and V.
Tresp, editors, Advances in Neural Information Processing Systems, 13, 591–
597, MIT Press.
Mika, S., Smola, A. and Schölkopf, B. (2001). An improved training algorithm
for kernel Fisher discriminants. In T. Jaakkola and T. Richardson, editors,
Artificial Intelligence and Statistics, 98–104, Morgan Kaufmann.
Rao, C.R. and Varadarajan, V.S. (1963). Discrimination of Gaussian processes.
Sankhyā, A, 25, 303–330.
Schölkopf, B. and Smola, A.J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
Solla, S.A., Keen, T.K. and Müller, K.R. (1999). Nonlinear Discriminant Analysis
Using Kernel Functions. MIT Press, Cambridge, MA.
Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B. and Vandewalle, J.
(2002). Least Squares Support Vector Machines. World Scientific, New Jersey.
Taxt, T., Hjort, N. and Eikvil, L. (1991). Statistical classification using a linear
mixture of multinormal probability densities. Pattn. Recogn. Lett., 12, 731–
737.
Tipping, M.E. (2001). Sparse Bayesian learning and the relevance vector machine.
J. Machine Learning Research, 1, 211–244.
Vakhania, N.N., Tarieladze, V.I. and Chobanyan, S.A. (1987). Probability Distributions on Banach Spaces. Translated from the Russian by W.A. Woyczynski.
Mathematics and Its Applications (Soviet Series), 14, D. Reidel Publishing Co.,
Dordrecht, Holland.
Van Gestel, T., Suykens, J.A.K. and De Brabanter, J. (2001). Least squares support
vector machine regression for discriminant analysis. Proc. International Joint
INNS-IEEE Conf. Neural Networks (INNS2001), 2445–2450. Wiley, New York.
Vapnik, V.N. (1998). Statistical Learning Theory. Wiley, New York.
Xu, J., Zhang, X. and Li, Y. (2001). Kernel MSE algorithm: a unified framework for
KFD, LS-SVM and KRR. Proceedings Intern. Joint Conf. Neural Networks, 2,
1486–1491, IEEE Press.