
Nonlinear Measure of Association and Test of Independence by Kernel Canonical Correlation Analysis∗
Su-Yun Huang,1† Mei-Hsien Lee2 and Chuhsing Kate Hsiao2
1 Institute of Statistical Science, Academia Sinica, Taiwan
2 Division of Biostatistics, Institute of Epidemiology, National Taiwan University

∗ Running head: Nonlinear measure of association and test of independence.
† Corresponding author: Su-Yun Huang, Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, [email protected].
April 15, 2006
Abstract
Measures of association between two sets of random variables have long been of interest to statisticians. The classical canonical correlation analysis can characterize, but is also limited to, linear association. In this article we study a nonlinear association measure using the kernel method. This measure can further be used to characterize stochastic independence between two sets of variables.
The introduction of the kernel method from the machine learning community has had a great impact on statistical analysis. Kernel canonical correlation analysis generalizes the classical linear canonical correlation analysis to a nonlinear setting. Such a generalization is nonparametric. It allows us to depict the nonlinear relation between two sets of variables and enables the application of many classical multivariate data analyses originally constrained to linear relations. Moreover, kernel-based canonical correlation analysis no longer requires the elliptically symmetric distributional assumption on observations, and therefore greatly enhances applicability.
In this article we first apply kernel canonical correlation analysis to measure the association between two sets of variables. Our proposal can measure relations beyond linear ones. In addition, the kernel-based approach can reduce the dimension of multivariate data and support further discriminant analysis. We also introduce the kernel method for testing independence between two sets of variables without the usual normality assumption. Theoretical background and implementation algorithms are discussed and several examples are illustrated.
Key words and phrases: association, canonical correlation analysis, kernel
canonical correlation analysis, nonlinear canonical correlation analysis, test of
independence.
1 Introduction
The description and classification of the relation between two sets of variables have long been of interest to many researchers. Hotelling (1936) introduced canonical correlation analysis to describe the linear relation between two sets. We will use the abbreviation LCCA for linear canonical correlation analysis. The LCCA is concerned with the linear relation between two sets of variables having a joint distribution. It defines a new orthogonal coordinate system for each of the sets in such a way that the new pair of coordinate systems is optimal in maximizing the correlations. The new coordinate systems are simply linear transformations of the original ones. Thus, the LCCA can only be used to describe linear relations, and via such linear relations it finds only linear dimension reduction subspaces and linear discriminant subspaces. Under the Gaussian assumption, the LCCA can also be used for testing independence between two sets of variables. However, all of these become invalid if the data are neither Gaussian nor at least elliptically symmetrically distributed. Thus, one has to resort to some nonparametric strategy.
Motivated by the active development of statistical learning theory (Vapnik, 1998; Hastie, Tibshirani and Friedman, 2001; and references therein) and the popular and successful use of various kernel machines (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002), a hybrid of the LCCA with a kernel machine has emerged (Akaho, 2001; Bach and Jordan, 2002), named kernel canonical correlation analysis (KCCA). The KCCA was also studied recently by Gretton, Herbrich and Smola (2003), Kuss and Graepel (2003), Hardoon, Szedmak and Shawe-Taylor (2004) and Gretton, Herbrich, Smola, Bousquet and Schölkopf (2005). All the above works lean toward the computer science side, and statistical applications of the KCCA method are still limited. In this article we introduce the method for several statistical applications, including association measurement, nonlinear dimension reduction for discriminant analysis and testing of independence.
The rest of the article is organized as follows. In Section 2 we briefly review the methods of LCCA and KCCA. In Section 3 we discuss some implementation issues concerning regularization and parameter selection. In Section 4 we introduce some statistical applications of KCCA: a nonlinear association measure, dimension reduction for nonlinear discriminant analysis and a test of independence. Finally, concluding remarks and discussion are given in Section 5. All relevant theoretical background, including nonlinear canonical correlation analysis, its approximation, estimation and asymptotic distribution, is given in the Appendix.
2 Canonical correlation analysis

2.1 A review of linear canonical correlation analysis
Suppose the random vector X of q components has a probability distribution P on X ⊂ R^q. We partition X into q_1 and q_2 components:
\[
X = \begin{bmatrix} X^{(1)} \\ X^{(2)} \end{bmatrix}.
\]
Let the corresponding partition of the sample space X be denoted by X_1 ⊕ X_2. All vectors are column vectors throughout, unless transposed to row vectors. We adopt the Matlab convention
\[
[X^{(1)}; X^{(2)}] := \begin{bmatrix} X^{(1)} \\ X^{(2)} \end{bmatrix},
\]
where the semicolon denotes the stacking of the two column vectors on the left hand side of the equality into a longer one, as shown on the right hand side. We are interested in finding the relation between X^{(1)} and X^{(2)}. The LCCA describes linear relation by reducing the correlation structure between these two sets of variables to the simplest possible form by means of linear transformations on X^{(1)} and X^{(2)}. For a given data set A = {x_j = [x_j^{(1)}; x_j^{(2)}]}_{j=1}^n, let X denote the n × (q_1 + q_2) data design matrix
\[
\mathbf{X} = \begin{bmatrix} x_1^{(1)\top} & x_1^{(2)\top} \\ \vdots & \vdots \\ x_n^{(1)\top} & x_n^{(2)\top} \end{bmatrix} =: [\,\mathbf{X}_1 \ \ \mathbf{X}_2\,].
\]
For the first pair of canonical variates, the LCCA seeks a pair of linear variates that maximize the correlation, namely, it solves the following optimization problem:
\[
\rho := \max_{[\alpha;\beta]\in\mathbb{R}^{q_1+q_2}} \alpha^\top \Sigma_{12}\,\beta \quad \text{subject to } \alpha^\top\Sigma_{11}\alpha = 1 \text{ and } \beta^\top\Sigma_{22}\beta = 1, \tag{1}
\]
where Σ_{ij} is the sample covariance matrix of X^{(i)} and X^{(j)}, i, j = 1, 2. Denote the solution to (1) by [α_1; β_1]. For the remaining pairs of canonical vectors, the LCCA solves sequentially the same problem as (1) with extra orthonormality constraints: for the kth pair,
\[
\alpha_k^\top \Sigma_{11}\alpha_i = 0 \quad \text{and} \quad \beta_k^\top \Sigma_{22}\beta_i = 0, \quad \forall\, i = 1, \ldots, k-1. \tag{2}
\]
The sets {α_i} and {β_i}, called (linear) canonical vectors, can be regarded as coordinate axes in a pair of new coordinate systems. The sequence of correlation coefficients {ρ_1, ρ_2, ...} describes only the linear relation between X^{(1)} and X^{(2)}. The LCCA can be justified by assuming that X^{(1)} and X^{(2)} have a joint Gaussian distribution, in which case the likelihood ratio criterion for testing independence of X^{(1)} and X^{(2)} can be expressed entirely in terms of the sample canonical correlation coefficients.
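For readers who prefer to see the computation spelled out, the following is a minimal sketch of solving (1)-(2) numerically by whitening and a singular value decomposition. It assumes centered, well-conditioned data blocks (the regularization issues of Section 3 are ignored here); the function and variable names are illustrative and not part of any package. Standard routines such as Matlab's canoncorr or R's cancor perform an equivalent computation.

```python
import numpy as np

def lcca(X1, X2):
    """Sample canonical correlations rho_k and canonical vectors (alpha_k, beta_k)
    for two data blocks X1 (n x q1) and X2 (n x q2), as in (1)-(2)."""
    n = X1.shape[0]
    X1 = X1 - X1.mean(axis=0)                      # center each block
    X2 = X2 - X2.mean(axis=0)
    S11 = X1.T @ X1 / (n - 1)                      # sample covariance of X^(1)
    S22 = X2.T @ X2 / (n - 1)                      # sample covariance of X^(2)
    S12 = X1.T @ X2 / (n - 1)                      # sample cross-covariance
    # Whitening: the rho_k are the singular values of S11^{-1/2} S12 S22^{-1/2}.
    e1, V1 = np.linalg.eigh(S11)
    e2, V2 = np.linalg.eigh(S22)
    W1 = V1 @ np.diag(e1 ** -0.5) @ V1.T           # S11^{-1/2}
    W2 = V2 @ np.diag(e2 ** -0.5) @ V2.T           # S22^{-1/2}
    U, rho, Vt = np.linalg.svd(W1 @ S12 @ W2)
    alpha = W1 @ U                                 # columns satisfy alpha' S11 alpha = 1
    beta = W2 @ Vt.T                               # columns satisfy beta' S22 beta = 1
    return rho, alpha, beta
```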
2.2 Kernel generalization of canonical correlation analysis
There are cases where linear correlations may not be adequate for describing the "association" between X^{(1)} and X^{(2)}. A natural alternative, therefore, is to explore and exploit their nonlinear relation. Several authors in the machine learning community (see the Introduction) have resorted to the kernel method for exploring nonlinear relations. The so-called KCCA can be viewed as a special case of a series of studies on nonlinear canonical correlation analysis by Dauxois et al. (Dauxois, Romain and Viguier, 1993; Dauxois and Nkiet, 1997; Dauxois and Nkiet, 1998; Dauxois and Nkiet, 2002; Dauxois, Nkiet and Romain, 2004). There are also other nonlinear canonical correlation analyses; see, for instance, Ramsay and Silverman (1997, Chapter 12), Eubank and Hsing (2005), Hsing, Liu, Brun and Dougherty (2005) and references therein. Here we give a brief description of the KCCA.¹
Let κ_1(·, ·) and κ_2(·, ·) be two positive definite kernels defined on X_1 ⊕ X_1 and X_2 ⊕ X_2, respectively. Let each data point be augmented with kernel values via
\[
x \mapsto \gamma_x = [\gamma_x^{(1)}; \gamma_x^{(2)}] \quad \text{or} \quad x_j \mapsto \gamma_j = [\gamma_j^{(1)}; \gamma_j^{(2)}], \tag{3}
\]
where
\[
\gamma_x^{(i)} = [\kappa_i(x^{(i)}, x_1^{(i)}); \ldots; \kappa_i(x^{(i)}, x_n^{(i)})], \quad i = 1, 2,
\]
and
\[
\gamma_j^{(i)} = [\kappa_i(x_j^{(i)}, x_1^{(i)}); \ldots; \kappa_i(x_j^{(i)}, x_n^{(i)})], \quad i = 1, 2, \ j = 1, \ldots, n.
\]
In matrix notation, the augmented kernel design matrix is given by
\[
\mathbf{K} := [\,\mathbf{K}_1\ \ \mathbf{K}_2\,] := \begin{bmatrix} \gamma_1^{(1)\top} & \gamma_1^{(2)\top} \\ \vdots & \vdots \\ \gamma_n^{(1)\top} & \gamma_n^{(2)\top} \end{bmatrix}_{n\times 2n}. \tag{4}
\]
This augmented representation of x_j by γ_j = [γ_j^{(1)}; γ_j^{(2)}] ∈ R^{2n} can be regarded as an alternative way of recording data measurements with high-dimensional inputs.

¹The KCCA formulation here is different from those in the above-mentioned works originating from the machine learning community. Our formulation is closer to that of Dauxois et al., with the Aronszajn kernel map as our feature mapping, which is isometrically isomorphic to the kernel spectrum-based feature mapping.
The KCCA procedure consists of two major steps:
(a) Transform each of the data points to a kernel-augmented representation as in
(3), or equivalently (4) in matrix notation. One common choice for the kernel
function is the Gaussian density function.
(b) Carry out the LCCA on the kernel-augmented data K = [K_1 K_2]. Note that some sort of regularization is necessary here to solve the corresponding canonical analysis problem. It involves a spectral decomposition (singular value decomposition) for extracting the canonical variates and correlation coefficients. Detailed computational implementation is discussed in Section 3; a sketch of the kernel data preparation follows below.
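As a concrete illustration of step (a), the sketch below builds the kernel blocks K_1 and K_2 of (4) with a Gaussian kernel; step (b) then amounts to feeding [K_1 K_2], after the regularization of Section 3, to any LCCA routine such as the sketch above or Matlab's canoncorr. The function names and the generic widths arguments are illustrative; the rule-of-thumb width is discussed in Section 3.2.

```python
import numpy as np

def gaussian_kernel_block(X, Z, widths):
    """Gaussian (RBF) kernel values between the rows of X (n x d) and Z (m x d),
    with a separate width per coordinate."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2 / widths ** 2).sum(axis=2)
    return np.exp(-0.5 * d2)

def kernel_augment(X1, X2, widths1, widths2):
    """Return the two n x n blocks K1, K2 of the kernel design matrix K = [K1 K2] in (4)."""
    K1 = gaussian_kernel_block(X1, X1, widths1)    # rows are the gamma_j^(1)
    K2 = gaussian_kernel_block(X2, X2, widths2)    # rows are the gamma_j^(2)
    return K1, K2
```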
We can consider augmenting the data A = {[x_j^{(1)}; x_j^{(2)}]}_{j=1}^n by a functional representation:
\[
\mathcal{F} = \{\kappa_1(x_j^{(1)}, \cdot) + \kappa_2(x_j^{(2)}, \cdot)\}_{j=1}^n.
\]
The KCCA is a canonical analysis for these functional data. These functional data are not innate to the problem but are intentionally constructed in order to study the nonlinear relation through a canonical analysis. The kernel matrix K in (4) is a discretized approximation of these functional data using a certain choice of basis set. More theoretical details are given in Section 2.3 and the Appendix.
2.3 More on kernel-augmented data
Here we introduce a feature mapping that brings the original sample space X_1 ⊕ X_2 ⊂ R^{q_1+q_2} into an infinite dimensional Hilbert space. Parallel to classical multivariate analysis in Euclidean spaces, procedures of the same kind can be developed in Hilbert spaces for convenient nonlinear analysis. This is often done via a certain parametric notion of a classical method, here the notion of LCCA.
For a given positive definite kernel κ(x, t) = κ_1(x^{(1)}, t^{(1)}) + κ_2(x^{(2)}, t^{(2)}) defined on X ⊕ X, there exists a Hilbert space consisting of the functions
\[
\Big\{ f(x) = \sum_{i=1}^m a_i \kappa(x, t_i) : 1 \le m < \infty,\ a_i \in \mathbb{R},\ t_i \in X \Big\}
\]
and their closure with respect to the norm given by the inner product
\[
\Big\langle \sum_{i=1}^m a_i \kappa(x, t_i),\ \sum_{j=1}^{n} b_j \kappa(x, s_j) \Big\rangle = \sum_{i,j} a_i b_j \kappa(t_i, s_j).
\]
It happens that this Hilbert space is a reproducing kernel Hilbert space (RKHS).
Definition 1 (Reproducing kernel Hilbert space) A Hilbert space of real-valued functions on an index set X with the property that all evaluation functionals are bounded linear functionals is called a reproducing kernel Hilbert space.

To every RKHS of functions on X there corresponds a unique positive definite kernel κ satisfying the reproducing property ⟨f(·), κ(x, ·)⟩ = f(x) for all f in this RKHS and all x ∈ X. We say that this RKHS admits the kernel κ.
Let H_1 and H_2 be the associated RKHSs admitting κ_1 and κ_2, respectively. Consider the transformation γ : X_1 ⊕ X_2 → H_1 ⊕ H_2 given by
\[
\gamma : x = [x^{(1)}; x^{(2)}] \mapsto [\kappa_1(x^{(1)}, \cdot); \kappa_2(x^{(2)}, \cdot)]. \tag{5}
\]
The original sample space X is then embedded into a new sample space H = H_1 ⊕ H_2 via the transformation γ. Each point x ∈ X is mapped to an element in H. Let {φ_ν}_{ν=1}^{m_1} and {ψ_µ}_{µ=1}^{m_2} be our choice of (incomplete) linearly independent systems in H_1 and H_2, respectively. The discretization of the functional data {[κ_1(x_j^{(1)}, ·); κ_2(x_j^{(2)}, ·)]}_{j=1}^n by these systems is given by
\[
\big[\langle \kappa_1(x_j^{(1)}, \cdot), \phi_\nu(\cdot)\rangle_{H_1},\ \langle \kappa_2(x_j^{(2)}, \cdot), \psi_\mu(\cdot)\rangle_{H_2}\big] = \big[\phi_\nu(x_j^{(1)}),\ \psi_\mu(x_j^{(2)})\big],
\]
where j = 1, ..., n, ν = 1, ..., m_1 and µ = 1, ..., m_2, which gives an n × (m_1 + m_2) kernel-augmented data matrix. If we choose {κ_1(x_j^{(1)}, ·)}_{j=1}^n and {κ_2(x_j^{(2)}, ·)}_{j=1}^n as our linearly independent systems for discretization, this results in the kernel data matrix K in (4). That is, the KCCA is carried out by performing LCCA on the discretized kernel data.
3 Implementation
There have been various implementation algorithms for KCCA (e.g., Gretton, Herbrich and Smola, 2003; Hardoon, Szedmak and Shawe-Taylor, 2004; Kuss and Graepel, 2003; Fukumizu, Bach and Jordan, 2004; Gretton, Herbrich, Smola, Bousquet and Schölkopf, 2005). Our aims here are quite different from those of computer scientists. In Section 2.2 we interpreted the KCCA as an LCCA acting on the kernel-augmented data [γ_j^{(1)}; γ_j^{(2)}], j = 1, ..., n, so the implementation will also differ from theirs. The procedure for KCCA has already been stated in Section 2.2. The main purpose of this section is, after bridging the kernel method and the linear procedures, to lessen the programming burden by using code already available in standard packages. The LCCA has long been a standard statistical tool for measuring association, testing independence, etc., and it has been implemented in various mathematical and statistical software packages (e.g., canoncorr in Matlab; cancor in R; cancor in S-Plus; and proc cancorr in SAS). These packages are ready for use in KCCA. One advantage of using a conventional LCCA routine from a popular package is that statisticians unfamiliar with kernel machines can still have easy access to KCCA. The extra effort required is to prepare the data in an appropriate kernel-augmented form. In the following, we discuss two further steps in kernel data preparation, the regularization step and the parameter selection step, before feeding the data into the LCCA procedure. They are adopted to avoid computational instability and to enhance the computational speed and efficiency of the subsequent LCCA procedure.
3.1 Regularization
The optimization problem to be solved is
\[
\max_{[\alpha;\beta]\in\mathbb{R}^{2n}} \alpha^\top \Lambda_{12}\,\beta \quad \text{subject to } \alpha^\top\Lambda_{11}\alpha = 1 \text{ and } \beta^\top\Lambda_{22}\beta = 1, \tag{6}
\]
where Λ_{ik} is the sample covariance of γ_j^{(i)} and γ_j^{(k)}. Note that both Λ_{11} and Λ_{22} are singular, which causes computational difficulty. The optimization problem (6) is ill-conditioned and some sort of regularization is needed in preparing the kernel data matrices K_1 and K_2. There are three commonly used regularization methods for coping with ill-conditioned eigen-component extraction:
• Ridge-type regularization,
• Principal components approach,
• Reduced set (of bases) approach.
The ridge-type regularization adds a small quantity to the diagonals,
\[
\alpha^\top(\Lambda_{11} + \lambda_1 I)\alpha = 1 \quad \text{and} \quad \beta^\top(\Lambda_{22} + \lambda_2 I)\beta = 1,
\]
to stabilize the numerical computation for solving problem (6).
The principal components approach extracts the leading eigenvectors of K_1 and K_2, denoted by U_1 and U_2, respectively. Next, it projects the columns of K_1 and K_2 onto the column spaces of U_1 and U_2, respectively, and obtains the reduced-rank approximation of the kernel data
\[
\tilde K_1 = K_1 U_1 \quad \text{and} \quad \tilde K_2 = K_2 U_2. \tag{7}
\]
The LCCA then acts on the reduced kernel data \tilde K = [\tilde K_1\ \tilde K_2]. The PCA approach corresponds to using the following linearly independent systems:
\[
\phi_\nu(x^{(1)}) = \sum_{j=1}^n \kappa_1(x^{(1)}, x_j^{(1)})\, u_{j\nu}^{(1)}, \quad \nu = 1, \ldots, m_1,
\]
and
\[
\psi_\mu(x^{(2)}) = \sum_{j=1}^n \kappa_2(x^{(2)}, x_j^{(2)})\, u_{j\mu}^{(2)}, \quad \mu = 1, \ldots, m_2,
\]
where [u_{j\nu}^{(1)}] = U_1 and [u_{j\mu}^{(2)}] = U_2.
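A small sketch of the principal components reduction (7): the leading eigenvectors of a kernel block are kept until a chosen fraction of the variation (99% in the later examples) is reached, and the block is projected onto them. The text does not spell out the exact criterion for "variation"; as an assumption, the sketch takes it to be the eigenvalue mass of the kernel matrix.

```python
import numpy as np

def pca_reduce(K, var_explained=0.99):
    """K_tilde = K U with U the leading eigenvectors of the (symmetric) kernel block K."""
    e, V = np.linalg.eigh(K)
    e, V = e[::-1], V[:, ::-1]                     # decreasing eigenvalue order
    e = np.clip(e, 0.0, None)                      # guard against tiny negative values
    frac = np.cumsum(e) / e.sum()
    m = int(np.searchsorted(frac, var_explained)) + 1
    return K @ V[:, :m]                            # n x m reduced kernel data, as in (7)
```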
The reduced set approach is also a type of reduced-rank approximation of the kernel data. Let A_1 = {x_1^{(1)}, ..., x_n^{(1)}} and A_2 = {x_1^{(2)}, ..., x_n^{(2)}} denote the full sets of data. The reduced set method selects a small portion of the data, Ã_1 and Ã_2, from the full sets to form the reduced-rank kernel matrices
\[
\tilde K_1 = K_1(A_1, \tilde A_1) \quad \text{and} \quad \tilde K_2 = K_2(A_2, \tilde A_2), \tag{8}
\]
where K_1(A_1, Ã_1) = [κ_1(x_i^{(1)}, x_j^{(1)})]_{x_i^{(1)}\in A_1,\, x_j^{(1)}\in \tilde A_1} is a thin column matrix of size n × m_1 with m_1 the size of Ã_1, and similarly for K_2(A_2, Ã_2). The corresponding choices of linearly independent systems are
\[
\phi_\nu(x^{(1)}) = \kappa_1(x^{(1)}, x_\nu^{(1)}),\ x_\nu^{(1)}\in \tilde A_1, \quad \text{and} \quad \psi_\mu(x^{(2)}) = \kappa_2(x^{(2)}, x_\mu^{(2)}),\ x_\mu^{(2)}\in \tilde A_2.
\]
The choice of subsets Ã_1 and Ã_2 is often made by uniform random sampling from A_1 and A_2, respectively. This approach, termed the random subset approach, is the most economical of the three in terms of computational complexity, and it results in a sparse representation with only a small number of kernel functions involved in the underlying model and in the computation. It can effectively speed up the computation and cut down the underlying model complexity (Lee and Mangasarian, 2001; Williams and Seeger, 2001; Lee, Hsieh and Huang, 2005; Lee and Huang, 2006). However, as the subset selection is purely random, it is recommended for medium to large data sets.
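The random subset version of (8) is particularly simple to code: kernel columns are evaluated only against a uniformly sampled subset of the data, giving a thin n × m block. The sketch below is illustrative only and reuses the Gaussian kernel of the earlier sketches.

```python
import numpy as np

def reduced_kernel(X, m, widths, seed=None):
    """K(A, A_tilde): the full data block against a random subset of size m, as in (8)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)    # uniform random subset
    d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2 / widths ** 2).sum(axis=2)
    return np.exp(-0.5 * d2)                               # n x m thin kernel matrix
```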
Both the ridge-type regularization and the principal components approach are well suited for small to medium sized problems and provide better approximations than the random subset approach. As the data size n becomes very large, the full kernel matrices K_1 and K_2, each of size n × n, run into the usual problems of massive data computing, including the memory required to store the kernel data, the size of the mathematical programming problem and prediction instability. In the large data case, we adopt the random subset approach, which also works reasonably well for medium sized problems.
3.2 Parameter selection
There are three parameters involved in the entire KCCA procedure: the kernel window width, the random subset size and the number of principal components. (In the examples throughout this article we do not use ridge-type regularization, and hence no ridge parameter is involved.) We use the Gaussian pdf as our kernel, and the window width used throughout is √(10S), where S is the sample variance. Such a choice is based on empirical experience, which gives good normality checks on the kernel data (Huang, Hwang and Lin, 2005). The window width √(10S) is a universal rule-of-thumb choice. Though it might not be optimal, or might even be far from an optimal window width, it gives robust and satisfactory results. Besides the examples considered in this article, the rule of thumb has been used on other data sets as well, including the Adult data set, an educational placement data set and a synthetic data set, with data sizes ranging from hundreds to tens of thousands. As for the choice of random subset size and the number of principal components, we suggest that users start with an adequate proportion of kernel bases and then cut down the effective dimensionality by the PCA approach, depending on the available computing resources and performance. See the individual examples in the next section for further discussion.
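The rule-of-thumb width can be computed in one line per block. The per-coordinate version below matches the diag(10S_1, ..., 10S_16) choice used later for the pendigits example and is otherwise an illustrative reading of the √(10S) rule.

```python
import numpy as np

def rule_of_thumb_widths(X):
    """sqrt(10 * S) per coordinate, with S the sample variance of each column of X."""
    return np.sqrt(10.0 * X.var(axis=0, ddof=1))
```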
4 Statistical applications

4.1 Measure of association
The KCCA results in a sequence of pairs of nonlinear variates and correlation coefficients. Here we use the leading correlation coefficient as a measure of nonlinear association between X^{(1)} and X^{(2)}. That is, we use
\[
\rho := \sup_{f\in H_1,\ g\in H_2} \mathrm{cor}\big(f(X^{(1)}),\ g(X^{(2)})\big)
\]
as a nonlinear association measure. This association measure is also used in Bach and Jordan (2002) and Gretton et al. (2005). Below we give a few examples of KCCA usage. In Example 1 we illustrate the measure of nonlinear association between two univariate random variables. In Example 2 we demonstrate the association measure between two sets of multivariate data. In Example 3 we use the KCCA-found variates for discriminant analysis on the UCI pendigits data set.
Example 1 Synthetic data set: association between two random variables.
In this example, we use the leading kernel canonical correlation as our association measure and compare it with two well-known nonparametric rank-based association measures, Kendall's τ and Spearman's ρ, on their ability to assess nonlinear relations. In the following three cases, let X^{(1)} and ε be two independent and identically distributed standard normals, and the models considered for X^{(2)} are
I. X^{(2)} = f(X^{(1)}) + ε/k with f(x) = cos(πx);
II. X^{(2)} = f(X^{(1)}) + ε/k with f(x) = exp(−|x|); and
III. X^{(2)} = X^{(1)} + ε/k;
where k is a constant in (0, 10]. In these models, the amount of association increases with k.
In each setting, we generate 500 samples of X^{(1)} and ε, and examine the relationship between X^{(1)} and X^{(2)}. Kernel data using the full set of bases are prepared, and the dimensionality is then cut down by PCA extracting 99% of the data variation. For models I and II, X^{(1)} and X^{(2)} are nonlinearly correlated. Five quantities are considered to measure the correlation, and the corresponding curves are plotted in Figure 1 (a) and (b). The solid curve in each plot indicates the Pearson correlation of X^{(2)} and f(X^{(1)}), taken as the target association measure between X^{(1)} and X^{(2)}. Because cos(πx) and exp(−|x|) are both symmetric about zero, the classical correlations, which can only catch linear association, are around zero for models I and II. The rank-based Kendall's τ and Spearman's ρ are also around zero. The results indicate that the linear correlation coefficient as well as the rank-based τ and ρ cannot catch nonlinear association. Overall, the kernel canonical correlation outperforms the rest; it approximates the true association well. For the linearly correlated case (model III in Figure 1 (c)), the solid curve still represents the true association. Spearman's ρ is very close to the true association, but Kendall's τ is somewhat far from the true curve. Again, the kernel canonical correlation follows the true curve very closely.
——— Put Figure 1 here ———
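The following self-contained sketch reproduces the flavor of model I for a single value k = 5: the Pearson, Spearman and Kendall coefficients sit near zero because cos(πx) is symmetric about zero, while a regularized kernel canonical correlation tracks the target Cor(cos(πX^{(1)}), X^{(2)}). The kernel bandwidths follow the √(10S) rule; the ridge value and the particular regularized-KCCA computation below are illustrative choices, not the exact procedure behind Figure 1.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def gauss_gram(v, width):
    return np.exp(-0.5 * (v[:, None] - v[None, :]) ** 2 / width ** 2)

def leading_kernel_cc(x, y, ridge=1e-3):
    """Leading canonical correlation of a ridge-regularized KCCA on two scalar samples."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    K1 = H @ gauss_gram(x, np.sqrt(10 * x.var())) @ H
    K2 = H @ gauss_gram(y, np.sqrt(10 * y.var())) @ H
    lam = ridge * n
    A = np.linalg.solve(K1 @ K1 + lam * np.eye(n), K1 @ K2)
    B = np.linalg.solve(K2 @ K2 + lam * np.eye(n), K2 @ K1)
    rho2 = np.max(np.real(np.linalg.eigvals(A @ B)))   # largest squared correlation
    return float(np.sqrt(np.clip(rho2, 0.0, 1.0)))

rng = np.random.default_rng(0)
n, k = 500, 5.0
x, eps = rng.standard_normal(n), rng.standard_normal(n)
y = np.cos(np.pi * x) + eps / k                    # model I

print("target   :", pearsonr(np.cos(np.pi * x), y)[0])
print("Pearson  :", pearsonr(x, y)[0])
print("Spearman :", spearmanr(x, y)[0])
print("Kendall  :", kendalltau(x, y)[0])
print("kernel cc:", leading_kernel_cc(x, y))
```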
Example 2 Synthetic data set: association between two sets of variables.
Let X_1 and X_2 be independent and identically distributed random variables having the uniform distribution over the interval (−2, 2). Let
\[
Y_1 = X_1^2 + 0.1\,\epsilon_1 \quad \text{and} \quad Y_2 = \cos(\pi X_2) + 0.1\,\epsilon_2,
\]
where the ε's are standard normal random noises. Let X^{(1)} = [X_1; X_2] and X^{(2)} = [Y_1; Y_2]. In each simulation, we sample 1000 pairs of [X^{(1)}; X^{(2)}] from the described model and calculate the leading two pairs of sample canonical variates using both the linear and the kernel CCA approaches. Here we use a random subset of size 200 as our regularization approach. Of course, one could use the PCA approach with the full set of 1000 kernel bases; the PCA with the full set of bases is more accurate but computationally more intensive. For demonstration purposes, we choose the computationally economical random subset approach. The mutual correlations carried in the leading two pairs of canonical variates found by LCCA and KCCA are listed in Table 1 for 30 random replicate runs. Reported are the average correlation measures over the 30 replicate runs and their standard errors. It is evident that the kernel canonical variates capture more correlation, and thus explain more association, between the two groups of data. We randomly choose one of the 30 replicate runs and project the 1000 data points onto the leading two pairs of sample canonical variates found by LCCA and KCCA. Data scatter plots along these variates are depicted in Figure 2. Figures 2(a) and 2(b) show the data scatter along the first and the second pair of linear canonical variates, respectively. There is an indication of some strong association left unexplained. Figures 2(c) and 2(d) show the data scatter along the first and the second pair of kernel canonical variates, respectively. They show strong correlation within each pair of kernel canonical variates; the correlations are 0.993 and 0.965, respectively. After the kernel transformation, the relation between these two vectors is indeed well depicted. Furthermore, the data projected onto the KCCA-found canonical variates have an elliptically symmetric looking distribution. This suggests that many multivariate data analysis tools, originally constrained by the Gaussian assumption, become applicable to the kernel data.
——— Put Table 1 and Figure 2 here ———
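A sketch of the data-generating model of Example 2 is given below; the blocks produced here can be fed to the kernel-augmentation and LCCA sketches above (with a random subset of size 200, as in the text). The random seed and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(-2.0, 2.0, size=(n, 2))                      # X1, X2 ~ Uniform(-2, 2)
Y = np.column_stack([
    X[:, 0] ** 2 + 0.1 * rng.standard_normal(n),             # Y1 = X1^2 + 0.1 eps1
    np.cos(np.pi * X[:, 1]) + 0.1 * rng.standard_normal(n),  # Y2 = cos(pi X2) + 0.1 eps2
])
# X plays the role of X^(1) = [X1; X2] and Y the role of X^(2) = [Y1; Y2].
```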
Example 3 Nonlinear discriminant analysis for pen-based recognition of handwritten digits.
In this example we utilize the kernel canonical variates as discriminant variates for multiple classification. For a k-class problem, we first construct a k-dimensional class indicator variable x^{(2)} = (c_1, ..., c_k) as follows:
\[
c_i = \begin{cases} 1, & \text{if an instance input } x^{(1)} \text{ is from the } i\text{th class},\\ 0, & \text{otherwise.} \end{cases}
\]
We then use KCCA on the training inputs and their associated class labels to find the (k − 1) leading canonical variates as discriminant directions. The pendigits data set is taken from the UCI machine learning repository. The number of training instances is 7494 and the number of testing instances is 3498. For each instance there are 16 input measurements (i.e., x^{(1)} is 16-dimensional) and a corresponding group label from {0, 1, 2, ..., 9}. A Gaussian kernel with covariance matrix diag(10S_1, ..., 10S_16) is used, and a random subset of training inputs of size 30 (stratified over the 10 classes) is adopted to produce the kernel data. Scatter plots of data projected onto kernel canonical variates and onto linear canonical variates are given below. To avoid excessive ink we only sample 20 test instances per digit to produce the scatter plots. Different classes are labeled with distinct symbols. Figure 3 shows scatter plots along kernel canonical variates. Digits 0, 8, 6 and 4 can easily be separated by plotting the first versus the second kernel canonical variates. Digits 3 and 2 are identified in the second versus third kernel canonical variates, digits 9 and 5 in the third versus fourth, and digits 1 and 7 in the fourth versus fifth. In fact, the first three kernel canonical variates can already classify a majority of the ten digits, while in Figure 4 it is still difficult to discriminate the ten digits even with five linear canonical variates.
The 9-dimensional subspace spanned by the kernel canonical variates based on the training inputs is used as the designated discriminant subspace. Test instances are projected onto this subspace and classification is made by Fisher's linear discriminant analysis, i.e., a test instance is assigned to the class with the closest group center (according to the Mahalanobis distance). The accuracy of the kernel classification on the test data reaches 97.24% (with standard error 0.056%).
——— Put Figures 3 & 4 here ———
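The sketch below mimics the discriminant pipeline of Example 3 on synthetic stand-in data (the UCI pendigits files are not downloaded here): labels are coded as a one-hot indicator x^{(2)}, the 16-dimensional inputs are kernel-augmented against a stratified random subset of size 30, the k − 1 leading canonical directions are extracted by a ridge-stabilized LCCA, and classification is by the nearest group centroid in Mahalanobis distance. Every function name and tuning value is illustrative.

```python
import numpy as np

def rbf(X, Z, widths):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2 / widths ** 2).sum(axis=2)
    return np.exp(-0.5 * d2)

def cca_directions(A, B, n_dirs, ridge=1e-6):
    """Leading canonical directions for block A against block B (ridge-stabilized)."""
    A, B = A - A.mean(0), B - B.mean(0)
    Saa = A.T @ A / len(A) + ridge * np.eye(A.shape[1])
    Sbb = B.T @ B / len(B) + ridge * np.eye(B.shape[1])
    Sab = A.T @ B / len(A)
    ea, Va = np.linalg.eigh(Saa)
    eb, Vb = np.linalg.eigh(Sbb)
    Wa = Va @ np.diag(ea ** -0.5) @ Va.T
    Wb = Vb @ np.diag(eb ** -0.5) @ Vb.T
    U, _, _ = np.linalg.svd(Wa @ Sab @ Wb)
    return Wa @ U[:, :n_dirs]

rng = np.random.default_rng(0)
k, n, p = 10, 1000, 16
labels = rng.integers(0, k, n)
means = 3.0 * rng.standard_normal((k, p))
X = means[labels] + rng.standard_normal((n, p))        # stand-in for the 16 inputs
Y = np.eye(k)[labels]                                  # one-hot class indicator x^(2)

widths = np.sqrt(10.0 * X.var(axis=0))                 # diag(10 S_1, ..., 10 S_p) kernel
subset = np.concatenate([rng.choice(np.where(labels == c)[0], 3, replace=False)
                         for c in range(k)])           # stratified subset of size 30
K = rbf(X, X[subset], widths)                          # thin kernel data block

W = cca_directions(K, Y, n_dirs=k - 1)                 # k-1 discriminant variates
Z = (K - K.mean(0)) @ W                                # projected (training) data

# Nearest group centroid in the canonical-variate subspace, Mahalanobis distance
centroids = np.vstack([Z[labels == c].mean(0) for c in range(k)])
Sw = sum((Z[labels == c] - centroids[c]).T @ (Z[labels == c] - centroids[c])
         for c in range(k)) / (n - k)                  # pooled within-class covariance
diff = Z[:, None, :] - centroids[None, :, :]
d2 = np.einsum('ncp,pq,ncq->nc', diff, np.linalg.inv(Sw), diff)
print("training accuracy:", (d2.argmin(axis=1) == labels).mean())
```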
4.2 Test of independence between two sets of variables
In this section we generalize the use of Bartlett's test of independence (Bartlett, 1947a, 1947b) to kernel-augmented data. The theoretical foundation of this generalization is based on Dauxois and Nkiet (1998). The idea is to approximate the nonlinear canonical correlation analysis by a suitable linear canonical correlation analysis on augmented data through certain linearly independent systems of functions, and then to estimate this linear canonical analysis approximation from empirical data. We leave the technical details to the Appendix (see Proposition 1).
This test can easily be carried out with standard statistical packages. Here we use the Matlab m-file canoncorr for Example 4, which implements Bartlett's independence test. Tests of independence between two sets of variables are carried out for several distributions on reduced-rank kernel data. The reduction is done by PCA extracting 99% of the data variation. The first five distributions in this study are taken from Dauxois and Nkiet (1998), and the last one is similar to Example 2 in Section 4.1 without the additive noise.
Example 4 Independence tests.
In the first three cases, we are interested in testing the independence between X and Y, which are both one-dimensional. In cases IV and V, P_θ = (P_{θ1}, P_{θ2}) is two-dimensional, and we test the independence between the first component P_{θ1} and the second component P_{θ2} of P_θ. In the last case, VI, the independence between two vectors X and Y is tested, where both X and Y are two-dimensional.
I. X ∼ N(0, 1) and Y = X^2;
II. [X; Y] ∼ the uniform distribution on the unit disk;
III. [X; Y] ∼ bivariate standard normal with correlation ρ;
IV. a mixture distribution: P_θ = θQ_1 + (1 − θ)Q_2 with θ = 0.5, where Q_1 is the distribution of [X; X^2] with X a univariate standard normal and Q_2 is the bivariate standard normal with correlation ρ = 0.25;
V. a mixture distribution: P_θ = θQ_1 + (1 − θ)Q_2 with θ = 0.75;
VI. X = [X_1; X_2] has a bivariate uniform distribution over (−2, 2) × (−2, 2) and Y = [X_1^2; cos(πX_2)].
Five hundred sample points are drawn from each of the above distributions. The total number of replicate runs is 100. In each run, the kernel transformation is first performed on X and Y separately (on P_{θ1} and P_{θ2} in cases IV and V), and then PCA is conducted to extract the leading variates accounting for 99% of the variation.
The power estimates (or the type-I error estimate for case III with ρ = 0) and their standard errors for the independence test are reported in Table 2. It is evident that the KCCA-based test has power 1 in both the one- and two-dimensional cases, while the LCCA-based test catches the relation only when linearity is present. The pattern remains even when the relation between X and Y is complex (cases IV and V). The type-I error of the KCCA-based test is about the same as that of the LCCA-based test (case III-1, ρ = 0). The KCCA performs about the same as the order-2 and order-3 spline approximations; the kernel approximation provides an alternative choice besides splines.
——— Put Table 2 here ———
5 Conclusion
In this paper, we discuss the use of the kernel method on two sets of multivariate data to characterize their relation via canonical correlation analysis. Various examples are investigated for illustration. We also describe the KCCA procedure and discuss implementation techniques. There are several points worth mentioning.
The association defined on the kernel-transformed data is one special example of an association measure for Hilbert subspaces (Dauxois and Nkiet, 1998 and 2002). It can be used as an association measure, and there are several applications. The LCCA can be used for dimension reduction, and the KCCA can be used for the same purpose in a similar manner. In LCCA, the first few leading canonical variates in X^{(1)} and X^{(2)} are used as dimension reduction directions in X_1 and X_2, respectively. It finds linear dimension reduction subspaces in X_1 and X_2, while the KCCA first maps the data into a very high dimensional space via a nonlinear transformation and then finds dimension reduction subspaces therein. In this way the KCCA allows nonlinear dimension reduction directions, which are actually functional directions. KCCA can be used for nonlinear discriminant analysis as well: only the explanatory variables are subject to kernel augmentation, and the group labels remain unchanged. In the pendigits data the KCCA is used to select the leading discriminant variates, and classification is done in the subspace spanned by these functional discriminant variates; a test instance is assigned to the nearest group centroid according to the Mahalanobis distance.
Traditional tests of independence mostly use statistics that take into account only linear relationships and rely on the Gaussian assumption. With the KCCA independence test, the distributional assumption can be avoided. As seen in the simulations, it outperforms the LCCA test.
6 Appendix: nonlinear canonical correlation analysis
In this appendix we give a brief review of nonlinear canonical correlation analysis (NLCCA), including its definition, approximation, estimation and the asymptotic distribution of the estimates, as introduced by Dauxois et al. (mainly Dauxois and Nkiet, 1998; see also Dauxois and Nkiet, 2002; Dauxois, Romain and Viguier, 1993; Dauxois and Nkiet, 1997; Dauxois, Nkiet and Romain, 2004). We then link the NLCCA to a special case, namely, the KCCA.
NLCCA and measure of association. Consider a probability space (X, B, P) such that the Hilbert space L²(P) is separable.² Let X = [X^{(1)}; X^{(2)}] be a random vector on (X, B, P) with marginals denoted by P_1 and P_2, respectively. Let L²_0(P_1) = {f ∈ L²(P_1) : Ef(X^{(1)}) = 0} denote the centered subspace and let L²_0(P_2) be defined similarly. The NLCCA is the search for two variates f_1(X^{(1)}) (f_1 ∈ L²(P_1)) and g_1(X^{(2)}) (g_1 ∈ L²(P_2)) such that the pair (f_1, g_1) maximizes ⟨f, g⟩_{L²(P)}, where f ∈ L²(P_1) and g ∈ L²(P_2), under the constraints ‖f‖_{L²(P_1)} = 1 and ‖g‖_{L²(P_2)} = 1. For ν ≥ 2, one searches for two variates f_ν ∈ L²(P_1) and g_ν ∈ L²(P_2) that maximize the same criterion with the additional orthonormality constraints ⟨f_k, f_ν⟩_{L²(P_1)} = 0 and ⟨g_k, g_ν⟩_{L²(P_2)} = 0 for all k = 1, ..., ν − 1. Denote ρ_ν = ⟨f_ν(X^{(1)}), g_ν(X^{(2)})⟩_{L²(P)}. The NLCCA is characterized by the sequence of triples
\[
\{(\rho_\nu,\ f_\nu(X^{(1)}),\ g_\nu(X^{(2)})),\ \nu = 0, 1, 2, \ldots\},
\]
where the trivial canonical terms ρ_0 = 1 and f_0(X^{(1)}) = g_0(X^{(2)}) = 1 give no information about the independence of X^{(1)} and X^{(2)}. The numbers 0 ≤ ρ_ν ≤ 1 are arranged in decreasing order and are termed nonlinear canonical coefficients. The trivial canonical terms can easily be avoided by restricting the NLCCA to the centered variables.

²That is, it has a countable dense subset. In a separable Hilbert space, countable orthonormal systems can be used to expand any element as an infinite sum.
Approximation to NLCCA. For k ∈ N, let {φ_ν^k}_{1≤ν≤p_k} be a linearly independent system in L²(P_1) (noncentered subspace) and {ψ_µ^k}_{1≤µ≤q_k} be a linearly independent system in L²(P_2) (noncentered subspace). Let V_1^k and V_2^k be, respectively, the subspaces spanned by the systems {φ_ν^k}_{1≤ν≤p_k} and {ψ_µ^k}_{1≤µ≤q_k}. Assume that these sequences of subspaces are nondecreasing and that their unions are dense in L²(P_1) and L²(P_2), respectively. It was shown (Dauxois and Pousse, 1977; Lafaye de Micheaux, 1978) that the NLCCA can be approximated (in terms of uniform convergence of a certain underlying sequence of linear operators, the details of which we omit here) by the LCCA of the random vectors
\[
\phi^k(X^{(1)}) := [\phi^k_1(X^{(1)}); \ldots; \phi^k_{p_k}(X^{(1)})]
\quad\text{and}\quad
\psi^k(X^{(2)}) := [\psi^k_1(X^{(2)}); \ldots; \psi^k_{q_k}(X^{(2)})].
\]
In summary, the NLCCA of X^{(1)} and X^{(2)} can be approximated by a sequence of suitable LCCAs. The underlying sequence of LCCA approximations depends on the choice of the linearly independent systems {φ_ν^k}_{1≤ν≤p_k} and {ψ_µ^k}_{1≤µ≤q_k}, which are required to satisfy the conditions (C1) V_1^k ⊂ V_1^{k+1} and V_2^k ⊂ V_2^{k+1} for all k ∈ N, and (C2) ∪_k V_1^k and ∪_k V_2^k are dense in L²(P_1) and L²(P_2), respectively. Choices of the systems {φ_ν^k} and {ψ_µ^k} include step functions and B-splines (Dauxois and Nkiet, 1998). In this article we use kernel functions as our choice of linearly independent systems.
Estimates and asymptotic distribution. Let {[x_j^{(1)}; x_j^{(2)}]}_{j=1}^n be an iid sample having the same distribution as [X^{(1)}; X^{(2)}]. Assume, without loss of generality, that p_k ≤ q_k. Let ρ_ν, ν = 1, ..., p_k, be the sample canonical correlation coefficients of the data {φ^k(x_1^{(1)}), ..., φ^k(x_n^{(1)})} and {ψ^k(x_1^{(2)}), ..., ψ^k(x_n^{(2)})}.
Proposition 1 (Proposition 5.3 in Dauxois and Nkiet, 1998) Under the null hypothesis that X^{(1)} and X^{(2)} are independent, n r_{k,n} converges in distribution, as n → ∞, to χ²_{(p_k−1)(q_k−1)}, where
\[
r_{k,n} = -\sum_{\nu=1}^{p_k} \log(1 - \rho_\nu^2).
\]
Proposition 1 is a special case of Proposition 5.3 of Dauxois and Nkiet (1998), obtained by taking, in their notation, Φ_k(λ) = −∑_{ν=1}^{p_k} log(1 − λ_ν) with λ = (λ_1, ..., λ_{p_k}) and λ_ν = ρ_ν² > 0. The constant K_{Φ_k} is then given by K_{Φ_k} = (∂Φ_k/∂λ_1)(0) = 1. For the independence test in Section 4.2, Bartlett's modification is adopted, namely,
\[
\Big(n - \frac{p_k + q_k + 1}{2}\Big)\, r_{k,n} \;\sim\; \chi^2_{(p_k-1)(q_k-1)}.
\]
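A hedged sketch of the Bartlett-modified test statistic of Section 4.2 and Proposition 1 is given below: from the sample canonical correlations of the two (reduced) kernel blocks, the statistic is compared with a chi-square distribution on (p_k − 1)(q_k − 1) degrees of freedom. The helper is illustrative; packages such as Matlab's canoncorr report an equivalent test.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_independence_test(rho, n, pk, qk):
    """Bartlett-modified test of independence from sample canonical correlations rho."""
    rho = np.asarray(rho, dtype=float)
    r_kn = -np.sum(np.log1p(-rho ** 2))              # r_{k,n} = -sum log(1 - rho_nu^2)
    stat = (n - (pk + qk + 1) / 2.0) * r_kn          # Bartlett's modification
    df = (pk - 1) * (qk - 1)
    return stat, chi2.sf(stat, df)                   # test statistic and p-value
```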
Remark 1 The KCCA takes the following special φ- and ψ-functions:
\[
\phi^k_i(x^{(1)}) = \kappa_{1,\sigma_1}(x^{(1)}, x_i^{(1)}), \quad i = 1, \ldots, p_k,
\]
\[
\psi^k_i(x^{(2)}) = \kappa_{2,\sigma_2}(x^{(2)}, x_i^{(2)}), \quad i = 1, \ldots, q_k,
\]
where σ_1 and σ_2 are the kernel widths and {x_i^{(1)}}_{i=1}^{p_k} and {x_i^{(2)}}_{i=1}^{q_k} are often taken to be random subsets of the full data.
References
Akaho, S. (2001). A kernel method for canonical correlation analysis. International
Meeting of Psychometric Society (IMPS2001).
Bach, F.R. and Jordan, M.I. (2002). Kernel independent component analysis. J.
Mach. Learning Res., 3, 1–48.
Bartlett, M.S. (1947a). Multivariate analysis. Supp. J. Roy. Statist. Soc., 9,
176–197.
Bartlett, M.S. (1947b). The general canonical correlation distribution. Ann. Math.
Statist., 18, 1–17.
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–279.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector
Machines and Other Kernel-Based Learning Methods. Cambridge University
Press.
Dauxois, J. and Nkiet, G.M. (1997). Canonical analysis of two Euclidean subspaces
and its applications. Linear Algebra Appl., 264, 355–388.
Dauxois, J. and Nkiet, G.M. (1998). Nonlinear canonical analysis and independence
tests. Ann. Statist., 26, 1254–1278.
Dauxois, J. and Nkiet, G.M. (2002). Measure of association for Hilbert subspaces
and some applications. J. Multivariate. Anal., 82, 263–298.
Dauxois, J., Nkiet, G.M. and Romain, Y. (2004). Canonical analysis relative to a
closed subspace. Linear Algebra Appl., 388, 119–145.
Dauxois, J. and Pousse, A. (1977). Some convergence problems in factor analyses.
Recent Developments in Statistics, 387–402, North-Holland, Amsterdam. (Proc.
European Meeting Statisticians, Grenoble, 1976.)
Dauxois, J., Romain, Y. and Viguier, S. (1993). Comparison of two factor subspaces.
J. Multivariate Anal., 44, 160–178.
Eubank, R. and Hsing, T. (2005). Canonical correlation for stochastic processes.
preprint.
Fukumizu, K., Bach, F.R. and Jordan, M.I. (2004). Dimensionality reduction for
supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learning
Res., 5, 73–99.
Gel'fand, I.M. and Yaglom, A.M. (1959). Calculation of the amount of information
about a random function contained in another such function. Amer. Math.
Soc. Transl. (2), 12, 199–246.
Gretton, A., Herbrich, R. and Smola, A. (2003). The kernel mutual information.
Technical report, Max Planck Institute for Biological Cybernetics, Tübingen, Germany.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O. and Schölkopf, B. (2005). Kernel
methods for measuring independence. J. Machine Learning Research, 6, 2075–
2129.
Hardoon, D.R., Szedmak, S. and Shawe-Taylor, J. (2004). Canonical correlation
analysis: An overview with application to learning methods. Neural Computation, 16, 2639–2664.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–
377.
Hsing, T., Liu, L.-Y., Brun, M. and Dougherty, E. R. (2005). The coefficient of
intrinsic dependence. Pattern Recognition, 38, 623-636.
Huang, S.Y., Hwang, C.R. and Lin, M.H. (2005). Kernel Fisher discriminant analysis
in Gaussian reproducing kernel Hilbert space. Technical report, Institute of
Statistical Science, Academia Sinica, Taiwan.
http://www.stat.sinica.edu.tw/syhuang.
Lafaye de Micheaux, D. (1978). Approximation d’analyses canoniques non linéaires
et analyses factorielles privilégiantes. Thèses de Docteur Ingénieur, Univ. Nice.
Kuss, M. and Graepel, T. (2003). The geometry of kernel canonical correlation
analysis. Technical report, Max Planck Institute for Biological Cybernetics,
Germany.
Lee, Y.J. and Huang, S.Y. (2006). Reduced support vector machines: a statistical
theory. IEEE Trans. Neural Networks, accepted.
http://www.stat.sinica.edu.tw/syhuang.
Lee, Y.J. and Mangasarian, O.L. (2001). RSVM: reduced support vector machines.
Proceeding 1st International Conference on Data Mining, SIAM.
Ramsay, J.O. and Silverman, B.W. (1997). Functional Data Analysis. Springer.
Schölkopf, B. and Smola, A.J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.
Silvey, S.D. (1964). On a measure of association. Ann. Math. Statist., 35, 1157–1166.
Vapnik, V.N. (1998). Statistical Learning Theory. Wiley, New York.
Williams, C.K.I. and Seeger, M. (2001). Using the Nyström method to speed up
kernel machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp,
editors, Advances in Neural Information Processing Systems, 13, 682-688, Cambridge, MA, MIT Press.
Table 1: Correlation for the first and the second pairs of canonical variates and kernel canonical variates for Example 2.

  average correlation measure     LCCA (s.e.)        KCCA (s.e.)
  between 1st pair of cv's        0.0573 (0.0046)    0.9926 (0.0001)
  between 2nd pair of cv's        0.0132 (0.0020)    0.9646 (0.0005)
Table 2: Power estimates (or type-I error estimate for case III-1) for testing independence of (X, Y) for several distributions, at significance level α = 0.05, based on 100 replicate runs. The D&N columns are power estimates taken from Dauxois and Nkiet (1998) using order 2 and order 3 splines for approximation. (For case I, D&N used sample sizes 200 and 400 instead of 500.)

  case              n     LCCA (s.e.)    KCCA (s.e.)    D&N spline2   D&N spline3
  I                 500*  0.42 (0.049)   1.00 (0.000)   1.00          1.00
  II                500   0.02 (0.014)   1.00 (0.000)   0.99          0.99
  III-1, ρ = 0      500   0.06 (0.024)   0.04 (0.020)   NA            NA
  III-2, ρ = 0.2    500   0.99 (0.010)   0.96 (0.020)   NA            NA
  III-3, ρ = 0.5    500   1.00 (0.000)   1.00 (0.000)   NA            NA
  III-4, ρ = 0.8    500   1.00 (0.000)   1.00 (0.000)   0.99          0.73
  IV                500   0.62 (0.049)   1.00 (0.000)   1.00          1.00
  V                 500   0.37 (0.048)   1.00 (0.000)   1.00          1.00
  VI                500   0.09 (0.029)   1.00 (0.000)   NA            NA
[Figure 1: Example 1. Panels (a) model I, (b) model II and (c) model III plot, against k (from 0 to 10), the target association Cor(f(X^(1)), X^(2)) with f as in each model, the kernel canonical correlation, the Pearson correlation Cor(X^(1), X^(2)), Kendall's τ and Spearman's ρ.]
[Figure 2: Scatter plots of the first and second pairs of linear canonical variates ((a) and (b)) and kernel canonical variates ((c) and (d)) for Example 2.]
[Figure 3: Scatter plots of pendigits over the 1st-and-2nd, 2nd-and-3rd, 3rd-and-4th, and 4th-and-5th kernel canonical variates.]
[Figure 4: Scatter plots of pendigits over the 1st-and-2nd, 2nd-and-3rd, 3rd-and-4th, and 4th-and-5th (linear) canonical variates.]