Nonlinear Measure of Association and Test of Independence by Kernel Canonical Correlation Analysis*

Su-Yun Huang,1† Mei-Hsien Lee2 and Chuhsing Kate Hsiao2
1 Institute of Statistical Science, Academia Sinica, Taiwan
2 Division of Biostatistics, Institute of Epidemiology, National Taiwan University

April 15, 2006

* Running head: Nonlinear measure of association and test of independence.
† Corresponding author: Su-Yun Huang, Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, [email protected].

Abstract

Measures of association between two sets of random variables have long been of interest to statisticians. Classical canonical correlation analysis can characterize, but is also limited to, linear association. In this article we study a nonlinear association measure based on the kernel method. The measure can further be used to characterize stochastic independence between two sets of variables. The introduction of the kernel method from the machine learning community has had a great impact on statistical analysis. Kernel canonical correlation analysis generalizes classical linear canonical correlation to the nonlinear setting, and the generalization is nonparametric. It allows us to depict the nonlinear relation between two sets of variables and enables the application of many classical multivariate analyses that were originally constrained to linear relations. Moreover, kernel-based canonical correlation analysis no longer requires an elliptically symmetric distributional assumption on the observations, which greatly enhances its applicability.

In this article we first apply kernel canonical correlation analysis to measure the association between two sets of variables; the proposed measure captures relations beyond linear ones. In addition, the kernel-based approach can reduce the dimension of multivariate data and support subsequent discriminant analysis. We also introduce a kernel-based test of independence between two sets of variables that does not require the usual normality assumption. Theoretical background and implementation algorithms are discussed, and several examples are illustrated.

Key words and phrases: association, canonical correlation analysis, kernel canonical correlation analysis, nonlinear canonical correlation analysis, test of independence.

1 Introduction

The description and classification of the relation between two sets of variables have long been of interest to many researchers. Hotelling (1936) introduced canonical correlation analysis to describe the linear relation between two sets; we will use the abbreviation LCCA for linear canonical correlation analysis. The LCCA is concerned with the linear relation between two sets of variables having a joint distribution. It defines a new orthogonal coordinate system for each of the sets so that the new pair of coordinate systems is optimal in maximizing correlations. The new coordinate systems are simply linear transformations of the original ones. Thus, the LCCA can only be used to describe linear relations, and through such linear relations it finds only linear dimension reduction subspaces and linear discriminant subspaces. Under a Gaussian assumption, the LCCA can also be used for testing independence between two sets of variables. However, all of these become invalid if the data are neither Gaussian nor at least elliptically symmetrically distributed, and one then has to resort to a nonparametric strategy.
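As a toy numerical illustration of this limitation (not from the paper): the linear correlation between a standard normal X and the deterministic transform cos(πX) is essentially zero, even though the two variables are functionally related.

```python
# Minimal illustration (assumed example, not the paper's code): a linear
# correlation coefficient can be near zero for a perfectly deterministic
# but nonlinear, symmetric relationship.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
y = np.cos(np.pi * x)                  # deterministic, but nonlinear and symmetric in x
print(np.corrcoef(x, y)[0, 1])         # essentially 0: the linear measure sees nothing
```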
Motivated by the active development of statistical learning theory (Vapnik, 1998; Hastie, Tibshirani and Friedman, 2001; and references therein) and the popular and successful usage of various kernel machines (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002), a hybrid of the LCCA with a kernel machine has emerged (Akaho, 2001; Bach and Jordan, 2002), named kernel canonical correlation analysis (KCCA). The KCCA was also studied recently by Gretton, Herbrich and Smola (2003), Kuss and Graepel (2003), Hardoon, Szedmak and Shawe-Taylor (2004) and Gretton, Herbrich, Smola, Bousquet and Schölkopf (2005). All of the above works lean toward the computer science side, and statistical applications of the KCCA method are still limited. In this article we introduce the method for several statistical applications, including association measurement, nonlinear dimension reduction for discriminant analysis, and a test of independence.

The rest of the article is organized as follows. In Section 2 we briefly review the LCCA and the KCCA. In Section 3 we discuss some implementation issues concerning regularization and parameter selection. In Section 4 we introduce some statistical applications of the KCCA: a nonlinear association measure, dimension reduction for nonlinear discriminant analysis, and a test of independence. Finally, concluding remarks and discussion are given in Section 5. All relevant theoretical background, including the nonlinear canonical correlation analysis, its approximation, estimation and asymptotic distribution, is in the Appendix.

2 Canonical correlation analysis

2.1 A review of linear canonical correlation analysis

Suppose the random vector X of q components has a probability distribution P on X ⊂ R^q. We partition X into q1 and q2 components, X = [X^(1); X^(2)]. Let the corresponding partition of X be denoted by X1 ⊕ X2. All vectors are column vectors throughout, unless transposed to row vectors. We adopt the Matlab convention

    [X^(1); X^(2)] := the column vector obtained by stacking X^(1) on top of X^(2),

where the semicolon denotes the stacking of the two column vectors on the left-hand side of the equality into the longer column vector on the right-hand side. We are interested in finding the relation between X^(1) and X^(2). The LCCA describes the linear relation by reducing the correlation structure between these two sets of variables to the simplest possible form by means of linear transformations of X^(1) and X^(2). For a given data set A = {x_j = [x_j^(1); x_j^(2)]}_{j=1}^n, let X denote the data design matrix

    X = [x1^(1)′ x1^(2)′ ; … ; xn^(1)′ xn^(2)′]_{n×(q1+q2)} := [X1  X2],

whose jth row is the transposed vector [x_j^(1); x_j^(2)]′. For the first pair of canonical variates, the LCCA seeks a pair of linear variates that maximize the correlation, namely, it solves the optimization problem

    ρ := max_{[α;β] ∈ R^(q1+q2)} α′Σ12 β   subject to   α′Σ11 α = 1 and β′Σ22 β = 1,    (1)

where Σij is the sample covariance matrix of X^(i) and X^(j), i, j = 1, 2. Denote the solution to (1) by [α1; β1]. For the remaining pairs of canonical vectors, the LCCA sequentially solves the same problem as (1) with extra orthogonality constraints: for the kth pair,

    αk′Σ11 αi = 0 and βk′Σ22 βi = 0,  for all i = 1, …, k − 1.    (2)

The sets {αi} and {βi}, called (linear) canonical vectors, can be regarded as coordinate axes in a pair of new coordinate systems. The sequence of correlation coefficients {ρ1, ρ2, …} describes only the linear relation between X^(1) and X^(2).
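The optimization in (1)-(2) can be solved by whitening each block with the inverse square root of its sample covariance and taking a singular value decomposition. The following is a minimal numpy sketch (illustrative, not from the paper), assuming the two block covariance matrices are nonsingular.

```python
# A minimal sketch of LCCA: whiten each block, then take an SVD of the whitened
# cross-covariance.  Singular values are the canonical correlations in (1); the
# back-transformed singular vectors are the canonical vectors alpha_k, beta_k in (2).
import numpy as np

def lcca(X1, X2):
    """Linear CCA of two data blocks (covariances assumed nonsingular)."""
    X1c, X2c = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11, S22 = X1c.T @ X1c / (n - 1), X2c.T @ X2c / (n - 1)
    S12 = X1c.T @ X2c / (n - 1)

    def inv_sqrt(S):                      # S^(-1/2) for a symmetric positive definite S
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, rho, Vt = np.linalg.svd(W1 @ S12 @ W2)
    A, B = W1 @ U, W2 @ Vt.T              # columns: canonical vectors alpha_k, beta_k
    return A, B, rho                      # rho: canonical correlations, in decreasing order
```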
The LCCA can be justified by assuming that X^(1) and X^(2) have a joint Gaussian distribution, in which case the likelihood ratio criterion for testing independence of X^(1) and X^(2) can be expressed entirely in terms of sample canonical correlation coefficients.

2.2 Kernel generalization of canonical correlation analysis

There are cases where linear correlations are not adequate for describing the "association" between X^(1) and X^(2). A natural alternative, therefore, is to explore and exploit their nonlinear relation. Several authors in the machine learning community (see the Introduction) have resorted to the kernel method for exploring nonlinear relations. The so-called KCCA can be viewed as a special case of a series of studies on nonlinear canonical correlation analysis by Dauxois et al. (Dauxois, Romain and Viguier, 1993; Dauxois and Nkiet, 1997; Dauxois and Nkiet, 1998; Dauxois and Nkiet, 2002; Dauxois, Nkiet and Romain, 2004). There are also other nonlinear canonical correlation analyses; see, for instance, Ramsay and Silverman (1997, Chapter 12), Eubank and Hsing (2005), Hsing, Liu, Brun and Dougherty (2005) and references therein. Here we give a brief description of the KCCA. (The KCCA formulation here differs from those in the above-mentioned works originating from the machine learning community. Our formulation is closer to that of Dauxois et al., with the Aronszajn kernel map as our feature mapping, which is isometrically isomorphic to the kernel spectrum-based feature mapping.)

Let κ1(·, ·) and κ2(·, ·) be two positive definite kernels defined on X1 ⊕ X1 and X2 ⊕ X2, respectively. Let each data point be augmented with kernel values via

    x → γx = [γx^(1); γx^(2)]  or  xj → γj = [γj^(1); γj^(2)],    (3)

where

    γx^(i) = [κi(x^(i), x1^(i)); … ; κi(x^(i), xn^(i))],  i = 1, 2,

and

    γj^(i) = [κi(xj^(i), x1^(i)); … ; κi(xj^(i), xn^(i))],  i = 1, 2, j = 1, …, n.

In matrix notation, the augmented kernel design matrix is given by

    K := [K1  K2] := [γ1^(1)′ γ1^(2)′ ; … ; γn^(1)′ γn^(2)′]_{n×2n}.    (4)

This augmented representation of xj by γj = [γj^(1); γj^(2)] ∈ R^(2n) can be regarded as an alternative way of recording the data measurements with high-dimensional inputs. The KCCA procedure consists of two major steps:

(a) Transform each of the data points to a kernel-augmented representation as in (3), or equivalently (4) in matrix notation. One common choice for the kernel function is the Gaussian density function.

(b) Carry out the LCCA on the kernel-augmented data K = [K1  K2]. Note that some sort of regularization is necessary here to solve the corresponding canonical analysis problem. It involves a spectral decomposition (singular value decomposition) for extracting the canonical variates and correlation coefficients. Detailed computational implementation is discussed in the Implementation section.

We can consider augmenting the data A = {[xj^(1); xj^(2)]}_{j=1}^n by a functional representation:

    F = {κ1(xj^(1), ·) + κ2(xj^(2), ·)}_{j=1}^n.

The KCCA is a canonical analysis for these functional data. These functional data are not congenital but are intentionally constructed so that the nonlinear relation can be studied through a canonical analysis. The kernel matrix K in (4) is a discretized approximation of these functional data using a certain choice of basis set. More theoretical details are in Section 2.3 and the Appendix.
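As a concrete illustration of step (a), the following hedged sketch (not the authors' code; the Gaussian kernel form and the helper names are assumptions for illustration) builds the kernel-augmented blocks K1 and K2 of display (4). Step (b) then amounts to running an LCCA routine, such as the lcca sketch above, on [K1 K2] after the regularization discussed in Section 3.

```python
# Kernel augmentation (step (a)): each data point is represented by its vector
# of Gaussian-kernel values against all n points of the same block.
import numpy as np

def gaussian_kernel_block(X, Z, sigma):
    """Kernel values kappa(x_i, z_j) = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_augment(X1, X2, sigma1, sigma2):
    """Build the n x n blocks K1 and K2 of the kernel design matrix (4)."""
    K1 = gaussian_kernel_block(X1, X1, sigma1)
    K2 = gaussian_kernel_block(X2, X2, sigma2)
    return K1, K2       # step (b): run (regularized) LCCA on [K1 K2]
```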
2.3 More on kernel-augmented data

Here we introduce a feature mapping that brings the original sample space X1 ⊕ X2 ⊂ R^(q1+q2) into an infinite-dimensional Hilbert space. Parallel to classical multivariate analysis in Euclidean spaces, procedures of the same kind can be developed in Hilbert spaces for convenient nonlinear analysis. This is often done by adopting the notion of a classical parametric method, here namely the notion of the LCCA. For a given positive definite kernel κ(x, t) = κ1(x^(1), t^(1)) + κ2(x^(2), t^(2)) defined on X ⊕ X, there exists a Hilbert space consisting of the functions

    { f(x) = Σ_{i=1}^m ai κ(x, ti) : 1 ≤ m < ∞, ai ∈ R, ti ∈ X }

and their closure with respect to the norm induced by the inner product

    ⟨ Σ_{i=1}^m ai κ(x, ti), Σ_{j=1}^{m′} bj κ(x, sj) ⟩ = Σ_{i,j} ai bj κ(ti, sj).

It happens that this Hilbert space is a reproducing kernel Hilbert space.

Definition 1 (Reproducing kernel Hilbert space) A Hilbert space of real-valued functions on an index set X satisfying the property that all the evaluation functionals are bounded linear functionals is called a reproducing kernel Hilbert space (RKHS).

To every RKHS of functions on X there corresponds a unique positive definite kernel κ satisfying the reproducing property ⟨f(·), κ(x, ·)⟩ = f(x) for all f in this RKHS and all x ∈ X. We say that this RKHS admits the kernel κ. Let H1 and H2 be the associated RKHSs admitting κ1 and κ2, respectively. Consider a transformation γ: X1 ⊕ X2 → H1 ⊕ H2 given by

    γ: x = [x^(1); x^(2)] → [κ1(x^(1), ·); κ2(x^(2), ·)].    (5)

The original sample space X is then embedded into a new sample space H = H1 ⊕ H2 via the transformation γ; each point x ∈ X is mapped to an element of H. Let {φν}_{ν=1}^{m1} and {ψμ}_{μ=1}^{m2} be our choice of (incomplete) linearly independent systems in H1 and H2, respectively. The discretization of the functional data {[κ1(xj^(1), ·); κ2(xj^(2), ·)]}_{j=1}^n by these systems is given by

    [⟨κ1(xj^(1), ·), φν(·)⟩_{H1}, ⟨κ2(xj^(2), ·), ψμ(·)⟩_{H2}] = [φν(xj^(1)), ψμ(xj^(2))],

where j = 1, …, n, ν = 1, …, m1 and μ = 1, …, m2, which gives an n × (m1 + m2) kernel-augmented data matrix. If we choose {κ1(xj^(1), ·)}_{j=1}^n and {κ2(xj^(2), ·)}_{j=1}^n as the linearly independent systems for discretization, this results in the kernel data matrix K in (4). That is, the KCCA is carried out by performing the LCCA on discretized kernel data.

3 Implementation

There have been various implementation algorithms for the KCCA (e.g., Gretton, Herbrich and Smola, 2003; Hardoon, Szedmak and Shawe-Taylor, 2004; Kuss and Graepel, 2003; Fukumizu, Bach and Jordan, 2004; Gretton, Herbrich, Smola and Schölkopf, 2005). Our aims here are quite different from those of computer scientists. In Section 2.2 we interpreted the KCCA as an LCCA acting on the kernel-augmented data [γj^(1); γj^(2)], j = 1, …, n, and our implementation therefore also differs from theirs. The procedure for the KCCA was already stated in Section 2.2. The main purpose of this section is, after bridging the kernel method and the linear procedures, to lessen the programming burden by using code already available in commercial packages. The LCCA has long been a standard statistical tool for measuring association, testing independence, and so on, and it has been implemented in various mathematical and statistical software packages (e.g., canoncorr in Matlab; cancor in R; cancor in Splus; and proc cancorr in SAS). These packages are ready for use for the KCCA. One advantage of using a conventional LCCA code from a popular package is that statisticians unfamiliar with kernel machines can still have easy access to the KCCA. The extra effort required is to prepare the data in an appropriate kernel-augmented form. In the following we discuss two more steps in the kernel data preparation, a regularization step and a parameter selection step, before the data are fed into the LCCA procedure. They are adopted to avoid computational instability and to enhance the computational speed and efficiency of the subsequent LCCA procedure.
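To illustrate the point about reusing existing LCCA code, the following sketch is a hedged Python analogue of calling Matlab's canoncorr or R's cancor; it uses scikit-learn's CCA class purely as one example of an off-the-shelf routine. It takes already prepared, reduced-rank kernel data matrices (as produced by the regularization step of Section 3.1 below) and reads off the kernel canonical correlations from the paired score columns.

```python
# Feeding prepared kernel data into an existing package CCA routine (a sketch;
# the function name and the choice of scikit-learn are assumptions, standing in
# for canoncorr / cancor in the paper's workflow).
import numpy as np
from sklearn.cross_decomposition import CCA

def kcca_via_package(K1_tilde, K2_tilde, n_components=2):
    """Run an off-the-shelf CCA routine on prepared (reduced-rank) kernel data."""
    cca = CCA(n_components=n_components, scale=True)
    cca.fit(K1_tilde, K2_tilde)
    U, V = cca.transform(K1_tilde, K2_tilde)
    # the kernel canonical correlations are the correlations of paired score columns
    return [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(n_components)]
```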
3.1 Regularization

The optimization problem to be solved is

    max_{[α;β] ∈ R^(2n)} α′Λ12 β   subject to   α′Λ11 α = 1 and β′Λ22 β = 1,    (6)

where Λik is the sample covariance matrix of γj^(i) and γj^(k). Note that both Λ11 and Λ22 are singular, which causes computational difficulty: the optimization problem (6) is ill-conditioned, and some sort of regularization is needed in preparing the kernel data matrices K1 and K2. There are three commonly used regularization methods for coping with ill-conditioned eigen-component extraction:

• ridge-type regularization,
• the principal components approach,
• the reduced set (of bases) approach.

The ridge-type regularization adds a small quantity to the diagonals,

    α′(Λ11 + λ1 I)α = 1 and β′(Λ22 + λ2 I)β = 1,

to stabilize the numerical computation for solving problem (6).

The principal components approach extracts the leading eigenvectors of K1 and K2, denoted by U1 and U2, respectively. Next, it projects the columns of K1 and K2 onto the column spaces of U1 and U2, respectively, and obtains the reduced-rank approximation of the kernel data

    K̃1 = K1 U1 and K̃2 = K2 U2.    (7)

The LCCA then acts on the reduced kernel data K̃ = [K̃1  K̃2]. The PCA approach corresponds to using the following linearly independent systems:

    φν(x^(1)) = Σ_{j=1}^n κ1(x^(1), xj^(1)) u_{jν}^(1),  ν = 1, …, m1,

and

    ψμ(x^(2)) = Σ_{j=1}^n κ2(x^(2), xj^(2)) u_{jμ}^(2),  μ = 1, …, m2,

where [u_{jν}^(1)] = U1 and [u_{jμ}^(2)] = U2.

The reduced set approach is also a type of reduced-rank approximation of the kernel data. Let A1 = {x1^(1), …, xn^(1)} and A2 = {x1^(2), …, xn^(2)} denote the full data sets. The reduced set method selects a small portion of the data, Ã1 and Ã2, from the full sets to form the reduced-rank kernel matrices

    K̃1 = K1(A1, Ã1) and K̃2 = K2(A2, Ã2),    (8)

where K1(A1, Ã1) = [κ1(xi^(1), xj^(1))], with xi^(1) ∈ A1 indexing the rows and xj^(1) ∈ Ã1 indexing the columns, is a thin column matrix of size n × m1 with m1 the size of Ã1, and similarly for K2(A2, Ã2). The corresponding choices of linearly independent systems are

    φν(x^(1)) = κ1(x^(1), xν^(1)), xν^(1) ∈ Ã1,  and  ψμ(x^(2)) = κ2(x^(2), xμ^(2)), xμ^(2) ∈ Ã2.

The choice of the subsets Ã1 and Ã2 is often made by uniform random sampling from A1 and A2, respectively. This approach is termed the random subset approach. It is the most economical of the three in terms of computational complexity, and it results in a sparse representation with only a small number of kernel functions involved in the underlying model and in the computation. It can effectively speed up the computation and cut down the underlying model complexity (Lee and Mangasarian, 2001; Williams and Seeger, 2001; Lee, Hsieh and Huang, 2005; Lee and Huang, 2006). However, as the subset selection is purely random, it is recommended for medium to large data sets. Both the ridge-type regularization and the principal components approach are well suited for small to medium sized problems and provide better approximation than the random subset approach. As the data size n becomes extremely large, the full kernel matrices K1 and K2 are both of size n × n, and massive data computing then faces several problems, including the memory requirement for storing the kernel data, the size of the mathematical programming problem and prediction instability. In the large data case we adopt the random subset approach; it also works reasonably well for medium sized problems.
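The three regularization options can be sketched as follows (illustrative Python with assumed helper names, not the paper's code; gaussian_kernel_block is the earlier sketch, lcca_ridge mirrors the lcca sketch of Section 2.1 with ridge terms added, and the PCA variant below operates on centered kernel data for simplicity).

```python
import numpy as np

def lcca_ridge(K1, K2, lam1=1e-3, lam2=1e-3):
    """Ridge-type regularization: LCCA with lam*I added to the block covariances."""
    K1c, K2c = K1 - K1.mean(axis=0), K2 - K2.mean(axis=0)
    n = K1.shape[0]
    S11 = K1c.T @ K1c / (n - 1) + lam1 * np.eye(K1.shape[1])
    S22 = K2c.T @ K2c / (n - 1) + lam2 * np.eye(K2.shape[1])
    S12 = K1c.T @ K2c / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, rho, Vt = np.linalg.svd(W1 @ S12 @ W2)
    return W1 @ U, W2 @ Vt.T, rho                 # canonical vectors and correlations

def pca_reduce(K, frac=0.99):
    """Principal components approach: keep leading components covering `frac` of the variation."""
    Kc = K - K.mean(axis=0)
    _, s, Vt = np.linalg.svd(Kc, full_matrices=False)
    m = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), frac)) + 1
    return Kc @ Vt[:m].T                          # reduced kernel data, n x m

def reduced_subset_kernel(X, m, sigma, rng):
    """Reduced (random) subset approach: kernel columns only for m randomly chosen points."""
    sub = rng.choice(len(X), size=m, replace=False)
    return gaussian_kernel_block(X, X[sub], sigma)
```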
3.2 Parameter selection

There are three parameters involved in the entire KCCA procedure: the kernel window width, the random subset size and the number of principal components. (In the examples throughout this article we do not consider ridge-type regularization, hence no ridge parameter is involved.) We use the Gaussian pdf as our kernel, and the window width used is √(10S) throughout, where S is the sample variance. This choice is based on empirical experience and results in good normality checks on the kernel data (Huang, Hwang and Lin, 2005). The window width √(10S) is a universal rule-of-thumb choice. Though it might not be optimal, or may even be far from an optimal window width, it gives robust and satisfactory results. Besides the examples considered in this article, this universal rule of thumb has been used on other data sets as well, including the Adult data set, an educational placement data set and a synthetic data set, with data sizes ranging from hundreds to tens of thousands. As for the choice of the random subset size and the number of principal components, we suggest that users start with an adequate proportion of kernel bases and then cut down the effective dimensionality by the PCA approach, depending on the availability and performance of computing facilities. See the individual examples in the next section for further discussion.

4 Statistical applications

4.1 Measure of association

The KCCA results in a sequence of pairs of nonlinear variates and correlation coefficients. Here we use the leading correlation coefficient as a measure of nonlinear association between the variables X^(1) and X^(2). That is, we use

    ρ := sup_{f ∈ H1, g ∈ H2} cor( f(X^(1)), g(X^(2)) )

as a nonlinear association measure. This association measure is also used in Bach and Jordan (2002) and Gretton et al. (2005). Below we give a few examples of KCCA usage. In Example 1 we illustrate the measure of nonlinear association between two univariate random variables. In Example 2 we demonstrate association measures between two sets of multivariate data. In Example 3 we use the KCCA-found variates for discriminant analysis on the UCI Pendigits data set.
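Putting the pieces together, the following is a hedged sketch of the association measure for two univariate samples, using the √(10S) window-width rule, PCA reduction and the leading kernel canonical correlation. The helper functions are the earlier illustrative sketches, and a tiny ridge is included only for numerical stability; the paper's examples rely on PCA or random-subset reduction alone.

```python
import numpy as np

def kernel_association(x, y, frac=0.99, lam=1e-8):
    """Leading kernel canonical correlation between two univariate samples."""
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)
    sig_x = np.sqrt(10 * x.var())                               # sqrt(10 S) rule-of-thumb width
    sig_y = np.sqrt(10 * y.var())
    Kx = pca_reduce(gaussian_kernel_block(x, x, sig_x), frac)   # full bases, then 99% PCA
    Ky = pca_reduce(gaussian_kernel_block(y, y, sig_y), frac)
    _, _, rho = lcca_ridge(Kx, Ky, lam, lam)                    # tiny ridge: numerics only
    return rho[0]
```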
Example 1 Synthetic data set: association between two random variables. In this example we use the leading kernel canonical correlation as our association measure and compare it with two well-known nonparametric rank-based association measures, Kendall's τ and Spearman's ρ, on their ability to assess nonlinear relations. In the following three cases, let X^(1) and ε be two independent and identically distributed standard normals, and consider the following models for X^(2):

I. X^(2) = f(X^(1)) + kε with f(x) = cos(πx);
II. X^(2) = f(X^(1)) + kε with f(x) = exp(−|x|); and
III. X^(2) = X^(1) + kε;

where k is a constant in (0, 10]. In these models, the amount of association increases with respect to k. In each setting we generate 500 samples of X^(1) and ε and assess the relationship between X^(1) and X^(2). Kernel data using the full bases are prepared, and the dimensionality is then cut down by PCA extracting 99% of the data variation.

For models I and II, X^(1) and X^(2) are nonlinearly correlated. Five quantities are considered for measuring the correlation, and the corresponding curves are plotted in Figure 1(a) and (b). The solid curve in each plot indicates the Pearson correlation of X^(2) and f(X^(1)), taken as the target association measure between X^(1) and X^(2). Because the functions cos(πx) and exp(−|x|) are both symmetric about zero, the classical correlations, which can only catch linear association, are around zero for models I and II. The rank-based Kendall's τ and Spearman's ρ are also around zero. The results indicate that the linear correlation coefficient, as well as the rank-based τ and ρ, cannot catch nonlinear association. Overall, the kernel canonical correlation outperforms the rest; it approximates the true association well. For the linearly correlated case (model III in Figure 1(c)), the solid curve still represents the true association. Spearman's ρ is very close to the true association, but Kendall's τ is somewhat far from the true curve. Again, the kernel canonical correlation follows the true curve very closely.

——— Put Figure 1 here ———

Example 2 Synthetic data set: association between two sets of variables. Let X1 and X2 be independent and identically distributed random variables having the uniform distribution over the interval (−2, 2). Let Y1 = X1² + 0.1ε1 and Y2 = cos(πX2) + 0.1ε2, where the ε's are standard normal random noises. Let X^(1) = [X1; X2] and X^(2) = [Y1; Y2]. In each simulation we sample 1000 pairs of [X^(1); X^(2)] from the described model and calculate the leading two pairs of sample canonical variates using both the linear and the kernel CCA approaches. Here we use a random subset of size 200 as our regularization approach. Of course, one could use the PCA approach with the full set of kernel bases of size 1000; the PCA with the full set of bases is more accurate but computationally more intensive. For demonstration purposes we choose the computationally economical random subset approach. The mutual correlations carried in the leading two pairs of canonical variates found by LCCA and KCCA are listed in Table 1 for 30 random replicate runs; reported are the average correlation measures over the 30 replicate runs and their standard errors. It is evident that the kernel canonical variates capture more correlation, and thus explain more association, between the two groups of data. We randomly choose one of the 30 replicate runs and project these 1000 data points along the leading two pairs of sample canonical variates found by LCCA and KCCA. Data scatter plots along these variates are depicted in Figure 2. Figures 2(a) and 2(b) show the data scatter along the first and the second pair of linear canonical variates, respectively; there is an indication of some strong association left unexplained. Figures 2(c) and 2(d) show the data scatter along the first and the second pair of kernel canonical variates, respectively. They show strong correlation within each pair of kernel canonical variates; the correlations are 0.993 and 0.965, respectively. After the kernel transformation, the relation between these two vectors is indeed well depicted. Furthermore, the data projected along the KCCA-found canonical variates have an approximately elliptically symmetric distribution. This implies that many multivariate data analysis tools, which are originally constrained by the Gaussian assumption, become applicable to the kernel data.

——— Put Table 1 and Figure 2 here ———
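For concreteness, the following is a sketch of how a single replicate of Example 2 could be set up with the illustrative helpers above. The seeds, the use of a single width per block and the tiny ridge are assumptions; the realized correlation values will differ from run to run.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 200
X = rng.uniform(-2, 2, size=(n, 2))                        # X^(1) = [X1; X2]
Y = np.column_stack([X[:, 0] ** 2, np.cos(np.pi * X[:, 1])])
Y = Y + 0.1 * rng.standard_normal((n, 2))                  # X^(2) = [Y1; Y2]

sig_x = np.sqrt(10 * X.var(axis=0).mean())                 # one common width per block
sig_y = np.sqrt(10 * Y.var(axis=0).mean())                 #   (a simplification of sqrt(10 S))
K1 = reduced_subset_kernel(X, m, sig_x, rng)               # random subset of size 200
K2 = reduced_subset_kernel(Y, m, sig_y, rng)

_, _, rho_kcca = lcca_ridge(K1, K2, 1e-8, 1e-8)            # tiny ridge for numerics
_, _, rho_lcca = lcca_ridge(X, Y, 0.0, 0.0)                # plain LCCA on the raw data
print(rho_kcca[:2], rho_lcca[:2])                          # leading canonical correlations
```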
Example 3 Nonlinear discriminant analysis for pen-based recognition of handwritten digits. In this example we utilize the kernel canonical variates as discriminant variates for multi-class classification. Consider, for instance, a k-class problem. We first construct a k-dimensional class indicator variable x^(2) = (c1, …, ck) as follows: ci = 1 if the instance input x^(1) is from the ith class, and ci = 0 otherwise. We then use the KCCA on the training inputs and their associated class labels to find the (k − 1) canonical variates used as discriminant directions. The pendigits data set is taken from the UCI machine learning data bases. The number of training instances is 7494 and the number of testing instances is 3498. For each instance there are 16 input measurements (i.e., x^(1) is 16-dimensional) and a corresponding group label from {0, 1, 2, …, 9}. A Gaussian kernel with covariance matrix diag(10S1, …, 10S16) is used, and a random subset of training inputs of size 30 (stratified over the 10 classes) is adopted to produce the kernel data. Scatter plots of the data projected along the kernel canonical variates and along the linear canonical variates are given below. To avoid excessive ink we only sample 20 test instances per digit to produce the scatter plots; different classes are labeled with distinct symbols. Figure 3 shows data scatter plots along the kernel canonical variates. Digits 0, 8, 6 and 4 can be easily separated when plotting the first versus the second kernel canonical variates. Digits 3 and 2 are identified in the second versus third kernel canonical variates, digits 9 and 5 in the third versus fourth, and digits 1 and 7 in the fourth versus fifth variates. In fact, the first three kernel canonical variates can already classify a majority of the ten digits, while in Figure 4 it is still difficult to discriminate the ten digits even with five linear canonical variates. The 9-dimensional subspace spanned by the kernel canonical variates based on the training inputs is used as the designated discriminant subspace. Test instances are projected onto this subspace, and classification is made by Fisher's linear discriminant analysis, i.e., a test instance is assigned to the class with the closest group center (according to Mahalanobis distance). The accuracy of the kernel classification on the test data reaches 97.24% (with standard error 0.056%).

——— Put Figures 3 & 4 here ———
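A hedged sketch of this discriminant scheme (not the authors' code; the landmark selection, the ridge value and the helper names are assumptions): one-hot class indicators play the role of x^(2), the leading (k − 1) kernel canonical variates span the discriminant subspace, and test instances are assigned to the nearest class centroid in the pooled Mahalanobis metric.

```python
import numpy as np

def kcca_discriminant(X_tr, y_tr, X_te, landmarks, sigma, n_classes, lam=1e-3):
    """y_tr: integer labels in {0, ..., n_classes-1}; landmarks: a (e.g. stratified) subset of X_tr."""
    Y = np.eye(n_classes)[y_tr]                        # one-hot class indicators x^(2)
    K_tr = gaussian_kernel_block(X_tr, landmarks, sigma)
    K_te = gaussian_kernel_block(X_te, landmarks, sigma)
    A, _, _ = lcca_ridge(K_tr, Y, lam, lam)            # KCCA = regularized LCCA on kernel data
    W = A[:, :n_classes - 1]                           # (k - 1) discriminant variates
    mu = K_tr.mean(axis=0)
    Z_tr, Z_te = (K_tr - mu) @ W, (K_te - mu) @ W

    # nearest class centroid under the pooled within-class Mahalanobis metric
    centroids = np.array([Z_tr[y_tr == c].mean(axis=0) for c in range(n_classes)])
    R = np.vstack([Z_tr[y_tr == c] - centroids[c] for c in range(n_classes)])
    S_inv = np.linalg.inv(R.T @ R / (len(y_tr) - n_classes))
    D = Z_te[:, None, :] - centroids[None, :, :]       # test-to-centroid differences
    d2 = ((D @ S_inv) * D).sum(axis=2)                 # squared Mahalanobis distances
    return d2.argmin(axis=1)                           # predicted class labels
```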
4.2 Test of independence between two sets of variables

In this section we generalize the use of Bartlett's test of independence (Bartlett, 1947a; 1947b) to kernel-augmented data. The theoretical foundation of this generalization is based on Dauxois and Nkiet (1998). The idea is to approximate the nonlinear canonical correlation analysis by a suitable linear canonical correlation analysis on data augmented through certain linearly independent systems of functions; this approximating linear canonical analysis is then estimated from empirical data. We leave the technical details to the Appendix (see Proposition 1). The test can easily be carried out with commercial statistical packages. Here we use the Matlab m-file canoncorr for Example 4, which includes an implementation of Bartlett's independence test. Tests of independence between two sets of variables are carried out for several distributions on reduced-rank kernel data. The reduction is done by PCA extracting 99% of the data variation. The first five distributions in this study are taken from Dauxois and Nkiet (1998), and the last one is similar to Example 2 in Section 4.1 without the additive noise.

Example 4 Independence tests. In the first three cases we are interested in testing the independence between X and Y, which are both one-dimensional. In cases IV and V, Pθ = (Pθ1, Pθ2) is two-dimensional, and we test the independence between the first component Pθ1 and the second component Pθ2 of Pθ. In the last case, VI, the independence between two vectors X and Y is tested, where both X and Y are two-dimensional.

I. X ∼ N(0, 1) and Y = X²;
II. [X; Y] ∼ the uniform distribution on the unit disk;
III. [X; Y] ∼ the bivariate standard normal with correlation ρ;
IV. a mixture distribution Pθ = θQ1 + (1 − θ)Q2 with θ = 0.5, where Q1 is the distribution of [X; X²] with X a univariate standard normal and Q2 is the bivariate standard normal with correlation ρ = 0.25;
V. a mixture distribution Pθ = θQ1 + (1 − θ)Q2 with θ = 0.75;
VI. X = [X1; X2] has a bivariate uniform distribution over (−2, 2) × (−2, 2) and Y = [X1²; cos(πX2)].

Five hundred sample points are drawn from each of the above distributions. The total number of replicate runs is 100. In each run, the kernel transformation is first performed on X and Y separately (on Pθ1 and Pθ2 in cases IV and V), and PCA is then conducted to extract the leading variates accounting for 99% of the variation. The power estimates (or type-I error estimates for case III with ρ = 0) and their standard errors for the independence test are reported in Table 2. It is evident that the KCCA-based test has power one in both the one- and two-dimensional cases, while the LCCA catches the relation only when linearity is present. The pattern remains even when the relation between X and Y is complex (cases IV and V). The type-I error of the KCCA-based test is about the same as that of the LCCA-based test (case III-1, ρ = 0). The KCCA performs about the same as the order 2 and order 3 spline approximations; the kernel approximation provides an alternative choice besides splines.

——— Put Table 2 here ———
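The test procedure can be sketched as follows (assumed details; the helper functions are the earlier illustrative sketches, not the paper's code). It applies the standard Bartlett chi-square approximation to the kernel canonical correlations of the PCA-reduced, centered kernel data, which parallels the Bartlett-modified statistic of Proposition 1 in the Appendix with p and q playing the roles of pk − 1 and qk − 1.

```python
import numpy as np
from scipy.stats import chi2

def kcca_independence_test(X, Y, sigma1, sigma2, frac=0.99, lam=1e-8):
    """Bartlett-type chi-square test applied to kernel canonical correlations."""
    n = X.shape[0]
    K1 = pca_reduce(gaussian_kernel_block(X, X, sigma1), frac)   # n x p
    K2 = pca_reduce(gaussian_kernel_block(Y, Y, sigma2), frac)   # n x q
    _, _, rho = lcca_ridge(K1, K2, lam, lam)                     # tiny ridge: numerics only
    p, q = K1.shape[1], K2.shape[1]
    rho = np.minimum(rho, 1.0 - 1e-12)                           # guard against log(0)
    stat = -(n - (p + q + 3) / 2.0) * np.sum(np.log(1.0 - rho ** 2))
    return stat, chi2.sf(stat, p * q)                            # statistic and p-value

# e.g. case I of Example 4: X ~ N(0,1) and Y = X^2; independence should be rejected
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 1))
y = x ** 2
s1, s2 = np.sqrt(10 * x.var()), np.sqrt(10 * y.var())            # sqrt(10 S) widths
print(kcca_independence_test(x, y, s1, s2))
```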
5 Conclusion

In this paper we discuss the use of the kernel method on two sets of multivariate data to characterize their relation via canonical correlation analysis. Various examples are investigated for illustration. We also describe the KCCA procedure and discuss implementation techniques. There are several points worth mentioning.

The association defined on the kernel-transformed data is one special example of an association measure for Hilbert subspaces (Dauxois and Nkiet, 1998 and 2002). It can be used as an association measure, and it has several further applications. The LCCA can be used for dimension reduction, and the KCCA can be used for the same purpose in a similar manner. In the LCCA, the first few leading canonical variates in X^(1) and X^(2) are used as dimension reduction directions in X1 and X2, respectively; it thus finds linear dimension reduction subspaces in X1 and X2. The KCCA, in contrast, first maps the data into a very high dimensional space via a nonlinear transformation and then finds dimension reduction subspaces therein. In this way the KCCA allows nonlinear dimension reduction directions, which are in fact functional directions. The KCCA can be used for nonlinear discriminant analysis as well: only the explanatory variables are subject to kernel augmentation, and the group labels remain unchanged. In the pendigits data, the KCCA is used to select the leading discriminant variates, and classification is done in the subspace spanned by these functional discriminant variates; a test instance is assigned to the nearest group centroid according to Mahalanobis distance. Traditional tests of independence mostly employ statistics that take into account only linear relationships and rely on the Gaussian assumption. With the KCCA independence test, the distributional assumption can be avoided; as seen in the simulations, it outperforms the LCCA-based test.

6 Appendix: nonlinear canonical correlation analysis

In this appendix we give a brief review of nonlinear canonical correlation analysis (NLCCA), including its definition, approximation, estimation and the asymptotic distribution of the estimates, as introduced by Dauxois et al. (mainly Dauxois and Nkiet, 1998; also consult Dauxois and Nkiet, 2002; Dauxois, Romain and Viguier, 1993; Dauxois and Nkiet, 1997; Dauxois, Nkiet and Romain, 2004). We then link the NLCCA to a special case, namely the KCCA.

NLCCA and measure of association. Consider a probability space (X, B, P) such that the Hilbert space L2(P) is separable, that is, it has a countable dense subset. (In a separable Hilbert space, countable orthonormal systems can be used to expand any element as an infinite sum.) Let X = [X^(1); X^(2)] be a random vector on (X, B, P) with marginals denoted by P1 and P2, respectively. Let L2,0(P1) = {f ∈ L2(P1) : Ef(X^(1)) = 0} denote the centered subspace, and let L2,0(P2) be defined similarly. The NLCCA is the search for two variates f1(X^(1)) (f1 ∈ L2(P1)) and g1(X^(2)) (g1 ∈ L2(P2)) such that the pair (f1, g1) maximizes ⟨f, g⟩_{L2(P)} over f ∈ L2(P1) and g ∈ L2(P2), under the constraints ||f||_{L2(P1)} = 1 and ||g||_{L2(P2)} = 1. For ν ≥ 2, one searches for two variates fν ∈ L2(P1) and gν ∈ L2(P2) that maximize the same criterion under the additional orthonormality constraints ⟨fν, fk⟩_{L2(P1)} = 0 and ⟨gν, gk⟩_{L2(P2)} = 0 for all k = 1, …, ν − 1. Denote ρν = ⟨fν(X^(1)), gν(X^(2))⟩_{L2(P)}. The NLCCA is characterized by the triples {(ρν, fν(X^(1)), gν(X^(2))), ν = 0, 1, 2, …}, where the trivial canonical terms ρ0 = 1 and f0(X^(1)) = g0(X^(2)) = 1 give no information about the independence of X^(1) and X^(2). The numbers 0 ≤ ρν ≤ 1 are arranged in decreasing order and are termed the nonlinear canonical coefficients. The trivial canonical terms can easily be avoided by restricting the NLCCA to centered variables, i.e., to L2,0(P1) and L2,0(P2).

Approximation to NLCCA. For k ∈ N, let {φν^k}_{1≤ν≤pk} be a linearly independent system in L2(P1) (the noncentered space) and {ψμ^k}_{1≤μ≤qk} be a linearly independent system in L2(P2) (the noncentered space). Let V1^k and V2^k be, respectively, the subspaces spanned by the systems {φν^k}_{1≤ν≤pk} and {ψμ^k}_{1≤μ≤qk}. Assume that these sequences of subspaces are nondecreasing and that their unions are dense in L2(P1) and L2(P2), respectively. It was shown (Dauxois and Pousse, 1977; Lafaye de Micheaux, 1978) that the NLCCA can be approximated (in terms of uniform convergence of a certain underlying sequence of linear operators, whose details we do not go into here) by the LCCA of the random vectors

    φ^k(X^(1)) := [φ1^k(X^(1)); … ; φpk^k(X^(1))]  and  ψ^k(X^(2)) := [ψ1^k(X^(2)); … ; ψqk^k(X^(2))].

In summary, the NLCCA of X^(1) and X^(2) can be approximated by a sequence of suitable LCCAs.
The underlying sequence of LCCA approximations depends on the choice of the linearly independent systems {φν^k}_{1≤ν≤pk} and {ψμ^k}_{1≤μ≤qk}, which are required to satisfy the conditions

(C1) V1^k ⊂ V1^(k+1) and V2^k ⊂ V2^(k+1) for all k ∈ N, and
(C2) ∪k V1^k and ∪k V2^k are dense in L2(P1) and L2(P2), respectively.

The systems {φν^k}_{1≤ν≤pk} and {ψμ^k}_{1≤μ≤qk} can be chosen as step functions or B-splines (Dauxois and Nkiet, 1998). In this article we use kernel functions as our choice of linearly independent systems.

Estimates and asymptotic distribution. Let {[xj^(1); xj^(2)]}_{j=1}^n be an iid sample having the same distribution as [X^(1); X^(2)]. Assume, without loss of generality, that pk ≤ qk. Let ρν, ν = 1, …, pk, be the sample canonical correlation coefficients of the data {φ^k(x1^(1)), …, φ^k(xn^(1))} and {ψ^k(x1^(2)), …, ψ^k(xn^(2))}.

Proposition 1 (Proposition 5.3 in Dauxois and Nkiet, 1998) Under the null hypothesis that X^(1) and X^(2) are independent, n r_{k,n} converges in distribution, as n → ∞, to χ²_{(pk−1)(qk−1)}, where

    r_{k,n} = −Σ_{ν=1}^{pk} log(1 − ρν²).

Proposition 1 is a special case of Proposition 5.3 of Dauxois and Nkiet (1998), obtained by taking, in their notation, Φk(λ) = −Σ_{ν=1}^{pk} log(1 − λν) with λ = (λ1, …, λ_{pk}) and λν = ρν² > 0. The constant K_{Φk} is then given by K_{Φk} = (∂Φk/∂λ1)(0) = 1. For the independence test in Section 4.2, Bartlett's modification is adopted, namely,

    [n − (pk + qk + 1)/2] r_{k,n} ∼ χ²_{(pk−1)(qk−1)}  (approximately).

Remark 1 The KCCA takes the special φ- and ψ-functions

    φi^k(x^(1)) = κ_{1,σ1}(x^(1), xi^(1)), i = 1, …, pk,
    ψi^k(x^(2)) = κ_{2,σ2}(x^(2), xi^(2)), i = 1, …, qk,

where σ1 and σ2 are the kernel widths and {xi^(1)}_{i=1}^{pk} and {xi^(2)}_{i=1}^{qk} are often taken to be random subsets of the full data.

References

Akaho, S. (2001). A kernel method for canonical correlation analysis. International Meeting of the Psychometric Society (IMPS2001).

Bach, F.R. and Jordan, M.I. (2002). Kernel independent component analysis. J. Mach. Learning Res., 3, 1–48.

Bartlett, M.S. (1947a). Multivariate analysis. Supp. J. Roy. Statist. Soc., 9, 176–197.

Bartlett, M.S. (1947b). The general canonical correlation distribution. Ann. Math. Statist., 18, 1–17.

Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–279.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

Dauxois, J. and Nkiet, G.M. (1997). Canonical analysis of two Euclidean subspaces and its applications. Linear Algebra Appl., 264, 355–388.

Dauxois, J. and Nkiet, G.M. (1998). Nonlinear canonical analysis and independence tests. Ann. Statist., 26, 1254–1278.

Dauxois, J. and Nkiet, G.M. (2002). Measure of association for Hilbert subspaces and some applications. J. Multivariate Anal., 82, 263–298.

Dauxois, J., Nkiet, G.M. and Romain, Y. (2004). Canonical analysis relative to a closed subspace. Linear Algebra Appl., 388, 119–145.

Dauxois, J. and Pousse, A. (1977). Some convergence problems in factor analyses. In Recent Developments in Statistics, 387–402, North-Holland, Amsterdam. (Proc. European Meeting of Statisticians, Grenoble, 1976.)

Dauxois, J., Romain, Y. and Viguier, S. (1993). Comparison of two factor subspaces. J. Multivariate Anal., 44, 160–178.

Eubank, R. and Hsing, T. (2005). Canonical correlation for stochastic processes. Preprint.

Fukumizu, K., Bach, F.R. and Jordan, M.I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learning Res., 5, 73–99.
Gel'fand, I.M. and Yaglom, A.M. (1959). Calculation of the amount of information about a random function contained in another such function. Amer. Math. Soc. Transl. (2), 12, 199–246.

Gretton, A., Herbrich, R. and Smola, A. (2003). The kernel mutual information. Technical report, MPI for Biological Cybernetics, Tübingen, Germany.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O. and Schölkopf, B. (2005). Kernel methods for measuring independence. J. Mach. Learning Res., 6, 2075–2129.

Hardoon, D.R., Szedmak, S. and Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16, 2639–2664.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.

Hsing, T., Liu, L.-Y., Brun, M. and Dougherty, E.R. (2005). The coefficient of intrinsic dependence. Pattern Recognition, 38, 623–636.

Huang, S.Y., Hwang, C.R. and Lin, M.H. (2005). Kernel Fisher discriminant analysis in Gaussian reproducing kernel Hilbert space. Technical report, Institute of Statistical Science, Academia Sinica, Taiwan. http://www.stat.sinica.edu.tw/syhuang.

Kuss, M. and Graepel, T. (2003). The geometry of kernel canonical correlation analysis. Technical report, Max Planck Institute for Biological Cybernetics, Germany.

Lafaye de Micheaux, D. (1978). Approximation d'analyses canoniques non linéaires et analyses factorielles privilégiantes. Thèse de Docteur Ingénieur, Univ. Nice.

Lee, Y.J. and Huang, S.Y. (2006). Reduced support vector machines: A statistical theory. IEEE Trans. Neural Networks, accepted. http://www.stat.sinica.edu.tw/syhuang.

Lee, Y.J. and Mangasarian, O.L. (2001). RSVM: Reduced support vector machines. Proceedings of the First SIAM International Conference on Data Mining.

Ramsay, J.O. and Silverman, B.W. (1997). Functional Data Analysis. Springer, New York.

Schölkopf, B. and Smola, A.J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA.

Silvey, S.D. (1964). On a measure of association. Ann. Math. Statist., 35, 1157–1166.

Vapnik, V.N. (1998). Statistical Learning Theory. Wiley, New York.

Williams, C.K.I. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T.K. Leen, T.G. Dietterich and V. Tresp (eds.), Advances in Neural Information Processing Systems, 13, 682–688. MIT Press, Cambridge, MA.

Table 1: Correlation for the first and the second pairs of canonical variates and kernel canonical variates for Example 2.

  average correlation measure    LCCA (s.e.)        KCCA (s.e.)
  between 1st pair cv's          0.0573 (0.0046)    0.9926 (0.0001)
  between 2nd pair cv's          0.0132 (0.0020)    0.9646 (0.0005)

Table 2: Power estimates (or the type-I error estimate in case III-1) for testing independence of (X, Y) for several distributions, at significance level α = 0.05, based on 100 replicate runs. The D&N columns are power estimates taken from Dauxois and Nkiet (1998) using order 2 and order 3 splines for the approximation. (*For case I, D&N used sample sizes 200 and 400 instead of 500.)

  case             n      LCCA (s.e.)     KCCA (s.e.)     D&N spline2    D&N spline3
  I                500*   0.42 (0.049)    1.00 (0.000)    1.00           1.00
  II               500    0.02 (0.014)    1.00 (0.000)    0.99           0.99
  III-1, ρ = 0     500    0.06 (0.024)    0.04 (0.020)    NA             NA
  III-2, ρ = 0.2   500    0.99 (0.010)    0.96 (0.020)    NA             NA
  III-3, ρ = 0.5   500    1.00 (0.000)    1.00 (0.000)    NA             NA
  III-4, ρ = 0.8   500    1.00 (0.000)    1.00 (0.000)    0.99           0.73
  IV               500    0.62 (0.049)    1.00 (0.000)    1.00           1.00
  V                500    0.37 (0.048)    1.00 (0.000)    1.00           1.00
  VI               500    0.09 (0.029)    1.00 (0.000)    NA             NA
Figure 1: Example 1, (a) model I, (b) model II, (c) model III. Each panel plots the five association measures against k: the target correlation Cor(f(X^(1)), X^(2)) (solid curve), the kernel canonical correlation, the linear correlation, Kendall's τ and Spearman's ρ. [Plots not reproduced.]

Figure 2: Scatter plots of the first and second pairs of linear canonical variates ((a) and (b)) and kernel canonical variates ((c) and (d)) for Example 2. [Plots not reproduced.]

Figure 3: Scatter plots of pendigits over the 1st-and-2nd, 2nd-and-3rd, 3rd-and-4th, and 4th-and-5th kernel canonical variates. [Plots not reproduced.]
Figure 4: Scatter plots of pendigits over the 1st-and-2nd, 2nd-and-3rd, 3rd-and-4th, and 4th-and-5th canonical variates. [Plots not reproduced.]