Kernel-Based Contrast Functions for Sufficient Dimension Reduction
Michael I. Jordan
Departments of Statistics and EECS
University of California, Berkeley
Joint work with Kenji Fukumizu and Francis Bach

Outline
• Introduction – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Manifold KDR
• Summary

Sufficient Dimension Reduction
• Regression setting: observe (X, Y) pairs, where the covariate X is high-dimensional
• Find a (hopefully small) subspace S of the covariate space that retains the information pertinent to the response Y
• Semiparametric formulation: treat the conditional distribution p(Y | X) nonparametrically, and estimate the parameter S

Perspectives
• Classically the covariate vector X has been treated as ancillary in regression
• The sufficient dimension reduction (SDR) literature has aimed at making use of the randomness in X (in settings where this is reasonable)
• This has generally been achieved via inverse regression
  – at the cost of introducing strong assumptions on the distribution of the covariate X
• We'll make use of the randomness in X without employing inverse regression

Dimension Reduction for Regression
• Regression: p(Y | X)
  – Y: response variable; X = (X_1, ..., X_m): m-dimensional covariate
• Goal: find the central subspace, which is defined via
  p(Y | X) = p̃(Y | b_1^T X, ..., b_d^T X) = p̃(Y | B^T X)

[Figure: two-dimensional toy example with covariates (X_1, X_2), where Y ≈ 1/(1 + exp(−X_1)) plus N(0, 0.1^2) noise; Y depends on X only through X_1, so the central subspace is the X_1 axis.]

Some Existing Methods
• Sliced Inverse Regression (SIR, Li 1991)
  – PCA of E[X | Y], using slices of Y
  – Elliptic assumption on the distribution of X
• Principal Hessian Directions (pHd, Li 1992)
  – The average Hessian Σ_yxx = E[(Y − E[Y])(X − E[X])(X − E[X])^T] is used
  – If X is Gaussian, its eigenvectors give the central subspace
  – Gaussian assumption on X; Y must be one-dimensional
• Projection pursuit approach (e.g., Friedman et al. 1981)
  – Additive model E[Y | X] = g_1(b_1^T X) + ... + g_d(b_d^T X) is used
• Canonical Correlation Analysis (CCA) / Partial Least Squares (PLS)
  – Linear assumption on the regression
• Contour Regression (Li, Zha & Chiaromonte, 2004)
  – Elliptic assumption on the distribution of X

Dimension Reduction and Conditional Independence
• (U, V) = (B^T X, C^T X), where C is m × (m − d) with columns orthogonal to B
• B gives the projector onto the central subspace:
  p_{Y|X}(y | x) = p_{Y|U}(y | B^T x)
  ⟺ p_{Y|U,V}(y | u, v) = p_{Y|U}(y | u) for all y, u, v
  ⟺ conditional independence Y ⫫ V | U (equivalently Y ⫫ X | U, since X = (U, V))
• Our approach: characterize conditional independence

Outline
• Introduction – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Manifold KDR
• Summary

Reproducing Kernel Hilbert Spaces
• "Kernel methods"
  – RKHSs have generally been used to provide basis expansions for regression and classification (e.g., the support vector machine)
  – Kernelization: map data into the RKHS and apply linear or second-order methods in the RKHS
  – But RKHSs can also be used to characterize independence and conditional independence
• [Figure: feature maps Φ_X : Ω_X → H_X, X ↦ Φ_X(X), and Φ_Y : Ω_Y → H_Y, Y ↦ Φ_Y(Y), from the data spaces Ω_X, Ω_Y into the RKHSs H_X, H_Y.]

Positive Definite Kernels and RKHS
• Positive definite kernel (p.d. kernel): k : Ω × Ω → R
  – k is positive definite if k(x, y) = k(y, x) and, for any n ∈ N and x_1, ..., x_n ∈ Ω, the Gram matrix (k(x_i, x_j))_{i,j} is positive semidefinite
  – Example: Gaussian RBF kernel k(x, y) = exp(−‖x − y‖^2 / σ^2)
• Reproducing kernel Hilbert space (RKHS): for a p.d. kernel k on Ω, H is the RKHS satisfying
  1) k(·, x) ∈ H for all x ∈ Ω
  2) Span{k(·, x) | x ∈ Ω} is dense in H
  3) ⟨k(·, x), f⟩_H = f(x) (reproducing property)
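As a small, self-contained illustration of the Gram-matrix definition above (not part of the original slides; the helper name `rbf_gram` and the toy data are ours), the following NumPy sketch builds a Gaussian RBF Gram matrix, checks symmetry and positive semidefiniteness, and evaluates a function in the span of the features via the reproducing property.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gaussian RBF Gram matrix: K[i, j] = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # 50 toy points in R^3
K = rbf_gram(X, sigma=1.0)

print(np.allclose(K, K.T))                     # symmetry: k(x, y) = k(y, x)
print(np.linalg.eigvalsh(K).min() > -1e-10)    # positive semidefinite (up to round-off)

# Reproducing property on the span of the features:
# for f = sum_i a_i k(., x_i), f(x_j) = <k(., x_j), f> = sum_i a_i k(x_j, x_i),
# so the vector of evaluations (f(x_1), ..., f(x_N)) is simply K @ a.
a = rng.normal(size=50)
f_at_sample = K @ a
```

Everything that follows, including the covariance-operator estimates, reduces to manipulations of such Gram matrices.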
Functional Data
• Feature map Φ : Ω → H, x ↦ k(·, x), i.e. Φ(x) = k(·, x)
• Data X_1, ..., X_N  ⇒  Φ_X(X_1), ..., Φ_X(X_N): functional data
• Why RKHS?
  – By the reproducing property, computing inner products on the RKHS is easy:
    ⟨Φ(x), Φ(y)⟩ = k(x, y)
    For f = Σ_{i=1}^N a_i Φ(x_i) = Σ_i a_i k(·, x_i) and g = Σ_{j=1}^N b_j Φ(x_j) = Σ_j b_j k(·, x_j),
    ⟨f, g⟩ = Σ_{i,j} a_i b_j k(x_i, x_j)
  – The computational cost essentially depends on the sample size: advantageous for high-dimensional data with small sample size

Covariance Operators on RKHS
• X, Y: random variables on Ω_X and Ω_Y, resp.
• Prepare RKHSs (H_X, k_X) and (H_Y, k_Y) defined on Ω_X and Ω_Y, resp.
• Define random variables in the RKHSs H_X and H_Y by
  Φ_X(X) = k_X(·, X),  Φ_Y(Y) = k_Y(·, Y)
• Define the covariance operator Σ_YX:
  Σ_YX = E[Φ_Y(Y) ⟨Φ_X(X), ·⟩] − E[Φ_Y(Y)] E[⟨Φ_X(X), ·⟩]
• [Figure: Σ_YX maps from H_X to H_Y, where Φ_X : Ω_X → H_X and Φ_Y : Ω_Y → H_Y are the feature maps.]

Covariance Operators on RKHS (cont.)
• Definition: Σ_YX = E[Φ_Y(Y) ⟨Φ_X(X), ·⟩] − E[Φ_Y(Y)] E[⟨Φ_X(X), ·⟩]
  Σ_YX is an operator from H_X to H_Y such that
  ⟨g, Σ_YX f⟩ = E[g(Y) f(X)] − E[g(Y)] E[f(X)]  (= Cov[f(X), g(Y)])  for all f ∈ H_X, g ∈ H_Y
• cf. Euclidean case: the covariance matrix V_YX = E[Y X^T] − E[Y] E[X]^T satisfies
  ⟨b, V_YX a⟩ = Cov[⟨b, Y⟩, ⟨a, X⟩]

Characterization of Independence
• Independence and cross-covariance operators:
  X ⫫ Y  ⟺  Σ_XY = O
  ⟺  Cov[f(X), g(Y)] = 0, i.e. E[g(Y) f(X)] = E[g(Y)] E[f(X)], for all f ∈ H_X, g ∈ H_Y
  – The "⟸" direction is always true; the "⟹" direction requires the RKHSs to be "rich enough", i.e. an assumption on the kernel (universality)
  – e.g., Gaussian RBF kernels k(x, y) = exp(−‖x − y‖^2 / σ^2) are universal
• cf. for Gaussian variables, X ⫫ Y ⟺ V_XY = O, i.e. uncorrelated

• Independence and characteristic functions:
  Random variables X and Y are independent
  ⟺ E_XY[e^{iω^T X} e^{iη^T Y}] = E_X[e^{iω^T X}] E_Y[e^{iη^T Y}] for all ω and η,
  i.e., e^{iω^T x} and e^{iη^T y} work as test functions
• RKHS characterization:
  Random variables X ∈ Ω_X and Y ∈ Ω_Y are independent
  ⟺ E_XY[f(X) g(Y)] = E_X[f(X)] E_Y[g(Y)] for all f ∈ H_X, g ∈ H_Y
  – The RKHS approach is a generalization of the characteristic-function approach

RKHS and Conditional Independence
• Conditional covariance operator:
  X and Y are random vectors; H_X, H_Y are RKHSs with kernels k_X, k_Y, resp.
  Def.  Σ_YY|X := Σ_YY − Σ_YX Σ_XX^{-1} Σ_XY  (conditional covariance operator)
  – Under a universality assumption on the kernel,
    ⟨g, Σ_YY|X g⟩ = E[ Var[g(Y) | X] ]
  – cf. for Gaussian variables, Var_{Y|X}[a^T Y | X = x] = a^T (V_YY − V_YX V_XX^{-1} V_XY) a
• Monotonicity of conditional covariance operators:
  For X = (U, V),  Σ_YY|U ≥ Σ_YY|X  in the sense of self-adjoint operators

• Conditional independence:
  Theorem. Let X = (U, V) and Y be random vectors, and let H_X, H_U, H_Y be RKHSs with Gaussian kernels k_X, k_U, k_Y, resp. Then
  Y ⫫ V | U  ⟺  Σ_YY|U = Σ_YY|X
• This theorem provides a new methodology for solving the sufficient dimension reduction problem
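To make the covariance-operator definitions concrete, here is a minimal NumPy sketch (our own illustration, with arbitrary toy data and bandwidths; `rbf_gram` is the helper from the earlier sketch) showing that, for f and g in the spans of the feature maps, the empirical covariance ⟨g, Σ̂_YX f⟩ reduces to a Gram-matrix computation with the centering matrix H = I_N − (1/N) 1_N 1_N^T.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(1)
N = 100
X = rng.normal(size=(N, 4))
Y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(N, 1))   # toy response

K_X = rbf_gram(X, sigma=1.0)
K_Y = rbf_gram(Y, sigma=0.5)
H = np.eye(N) - np.ones((N, N)) / N                     # centering matrix

# Pick f = sum_i a_i k_X(., X_i) and g = sum_j b_j k_Y(., Y_j)
a = rng.normal(size=N)
b = rng.normal(size=N)
f_vals, g_vals = K_X @ a, K_Y @ b                       # f(X_i), g(Y_i)

# Empirical covariance <g, Sigma_YX f>, computed two equivalent ways
cov_direct = f_vals @ g_vals / N - f_vals.mean() * g_vals.mean()
cov_gram = a @ K_X @ H @ K_Y @ b / N
print(np.allclose(cov_direct, cov_gram))                # True
```

The same reduction to centered Gram matrices is what turns the conditional covariance operator into the matrix formula used by KDR on the following slides.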
Outline
• Introduction – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Manifold KDR
• Summary

Kernel Dimension Reduction
• Use a universal kernel for B^T X and Y. Then (with ≥ the partial order of self-adjoint operators):
  Σ_YY|B^T X ≥ Σ_YY|X,  and
  Σ_YY|B^T X = Σ_YY|X  ⟺  Y ⫫ X | B^T X
• KDR objective function:
  min_{B : B^T B = I_d} Tr[Σ_YY|B^T X],
  which is an optimization over the Stiefel manifold

Estimator
• Empirical cross-covariance operator:
  Σ̂^(N)_YX = (1/N) Σ_{i=1}^N (k_Y(·, Y_i) − m̂_Y) ⟨k_X(·, X_i) − m̂_X, ·⟩,
  where m̂_X = (1/N) Σ_{i=1}^N k_X(·, X_i) and m̂_Y = (1/N) Σ_{i=1}^N k_Y(·, Y_i)
  Σ̂^(N)_YX gives the empirical covariance:
  ⟨g, Σ̂^(N)_YX f⟩ = (1/N) Σ_{i=1}^N f(X_i) g(Y_i) − ((1/N) Σ_{i=1}^N f(X_i)) ((1/N) Σ_{i=1}^N g(Y_i))
• Empirical conditional covariance operator:
  Σ̂^(N)_YY|X = Σ̂^(N)_YY − Σ̂^(N)_YX (Σ̂^(N)_XX + ε_N I)^{-1} Σ̂^(N)_XY,
  where ε_N is a regularization coefficient

• Estimating function for KDR (with U = B^T X):
  Tr[Σ̂^(N)_YY|U] = Tr[Σ̂^(N)_YY − Σ̂^(N)_YU (Σ̂^(N)_UU + ε_N I)^{-1} Σ̂^(N)_UY]
                = (1/N) Tr[G_Y − G_Y G_U (G_U + N ε_N I_N)^{-1}],
  where G_U = (I_N − (1/N) 1_N 1_N^T) K_U (I_N − (1/N) 1_N 1_N^T) is the centered Gram matrix, with K_U = (k(B^T X_i, B^T X_j))_{i,j}, and G_Y is defined analogously
• Optimization problem:
  min_{B : B^T B = I_d} Tr[G_Y (G_{B^T X} + N ε_N I_N)^{-1}]

Experiments with KDR
• Wine data: 13 dimensions, 178 data points, 3 classes; 2-dimensional projections
• Kernel: k(z_1, z_2) = exp(−‖z_1 − z_2‖^2 / σ^2), σ = 30
• [Figure: 2-dimensional projections of the Wine data obtained by KDR, Partial Least Squares, CCA, and Sliced Inverse Regression.]

Consistency of KDR
Theorem. Suppose k_d is bounded and continuous, and ε_N → 0, N^{1/2} ε_N → ∞ (N → ∞). Let S_0 be the set of optimal parameters:
  S_0 = { B | B^T B = I_d,  Tr[Σ^B_YY|X] = min_{B'} Tr[Σ^{B'}_YY|X] }.
Then, under some conditions, for any open set U ⊃ S_0,
  Pr(B̂^(N) ∈ U) → 1  (N → ∞).

Lemma. Suppose k_d is bounded and continuous, and ε_N → 0, N^{1/2} ε_N → ∞ (N → ∞). Then, under some conditions,
  sup_{B : B^T B = I_d} | Tr[Σ̂^{(N),B}_YY|X] − Tr[Σ^B_YY|X] | → 0  (N → ∞)
in probability.

Conclusions
• Introduction – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Technical report available at: www.cs.berkeley.edu/~jordan/papers/kdr.pdf
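To close, a minimal end-to-end sketch of the KDR contrast function from the estimator slides. This is our own illustration, not the authors' implementation: the toy data, the Gaussian RBF bandwidths, and the regularization ε_N are arbitrary choices, and no Stiefel-manifold optimizer is included; the code simply evaluates Tr[G_Y (G_{B^T X} + N ε_N I_N)^{-1}] and compares the true projection direction to a random one.

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    """Gaussian RBF Gram matrix on the rows of Z."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def centered_gram(Z, sigma=1.0):
    """Centered Gram matrix G = (I - 11^T/N) K (I - 11^T/N)."""
    N = Z.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ rbf_gram(Z, sigma) @ H

def kdr_objective(B, X, G_Y, eps=1e-3, sigma=1.0):
    """KDR contrast Tr[ G_Y (G_{B^T X} + N*eps*I)^{-1} ], to be minimized over B."""
    N = X.shape[0]
    G_U = centered_gram(X @ B, sigma)       # rows of X @ B are the projections B^T X_i
    return np.trace(G_Y @ np.linalg.inv(G_U + N * eps * np.eye(N)))

# Toy data in the spirit of the earlier example: Y depends on X only through X_1
rng = np.random.default_rng(2)
N, m, d = 100, 5, 1
X = rng.normal(size=(N, m))
Y = 1.0 / (1.0 + np.exp(-X[:, :1])) + 0.1 * rng.normal(size=(N, 1))
G_Y = centered_gram(Y, sigma=0.5)

B_true = np.eye(m)[:, :d]                              # projection onto the X_1 axis
B_rand, _ = np.linalg.qr(rng.normal(size=(m, d)))      # a random orthonormal direction
print(kdr_objective(B_true, X, G_Y), kdr_objective(B_rand, X, G_Y))
# the contrast should typically be smaller for the true direction
```

In a full implementation, B would be updated by gradient steps projected back onto the Stiefel manifold (B^T B = I_d), in line with the optimization problem stated above.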