
Kernel-Based Contrast Functions
for Sufficient Dimension Reduction
Michael I. Jordan
Departments of Statistics and EECS
University of California, Berkeley
Joint work with Kenji Fukumizu and Francis Bach
Outline
• Introduction
  – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Manifold KDR
• Summary
Sufficient Dimension Reduction
• Regression setting: observe (X,Y) pairs, where the
covariate X is high-dimensional
• Find a (hopefully small) subspace S of the covariate
space that retains the information pertinent to the
response Y
• Semiparametric formulation: treat the conditional
distribution p(Y | X) nonparametrically, and estimate the
parameter S
Perspectives
• Classically the covariate vector X has been treated as
ancillary in regression
• The sufficient dimension reduction (SDR) literature has
aimed at making use of the randomness in X (in settings
where this is reasonable)
• This has generally been achieved via inverse regression
  – at the cost of introducing strong assumptions on the distribution of
    the covariate X
• We’ll make use of the randomness in X without employing
inverse regression
Dimension Reduction for Regression
• Regression: $p(Y \mid X)$
  – Y : response variable
  – $X = (X_1, \ldots, X_m)$ : m-dimensional covariate
• Goal: Find the central subspace, which is defined via:
  $p(Y \mid X) = \tilde{p}(Y \mid b_1^T X, \ldots, b_d^T X) = \tilde{p}(Y \mid B^T X)$
[Figure: scatter plots of Y against X1 and against X2 for data generated as
$Y = \frac{1}{1 + \exp(-X_1)} + N(0,\ 0.1^2)$; the central subspace is the X1 axis.]
Some Existing Methods
• Sliced Inverse Regression (SIR, Li 1991)
  – PCA of E[X|Y], computed using slices of Y (see the sketch after this list)
  – Elliptic assumption on the distribution of X
• Principal Hessian Directions (pHd, Li 1992)
  – The average Hessian $\Sigma_{yxx} = E[(Y - \bar{Y})(X - \bar{X})(X - \bar{X})^T]$ is used
  – If X is Gaussian, its eigenvectors give the central subspace
  – Gaussian assumption on X; Y must be one-dimensional
• Projection pursuit approach (e.g., Friedman et al. 1981)
  – Additive model $E[Y|X] = g_1(b_1^T X) + \cdots + g_d(b_d^T X)$ is used
• Canonical Correlation Analysis (CCA) / Partial Least Squares (PLS)
  – Linear assumption on the regression
• Contour Regression (Li, Zha & Chiaromonte, 2004)
  – Elliptic assumption on the distribution of X
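To make the first bullet concrete, here is a minimal sketch of the SIR idea (an illustrative example, not from the original slides): estimate E[X|Y] by averaging X within slices of Y, then take the leading principal directions of the slice means. The slice count, the toy data, and the omission of the usual whitening step are assumptions of the example.

```python
import numpy as np

def sir_directions(X, Y, n_slices=10, d=1):
    """Sketch of SIR: PCA of within-slice means of X (estimates of E[X|Y]).
    Assumes X is roughly standardized; the usual whitening step is omitted."""
    Xc = X - X.mean(axis=0)
    order = np.argsort(Y.ravel())
    slices = np.array_split(order, n_slices)
    means = np.stack([Xc[idx].mean(axis=0) for idx in slices])
    weights = np.array([len(idx) for idx in slices]) / len(Y)
    M = (means * weights[:, None]).T @ means     # weighted covariance of slice means
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, ::-1][:, :d]               # leading d directions

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
Y = np.exp(X[:, 0]) + 0.1 * rng.normal(size=500)   # Y depends on X[:, 0] only
print(sir_directions(X, Y).ravel())                # roughly +/- the first axis
```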
Dimension Reduction and Conditional Independence
• $(U, V) = (B^T X, C^T X)$, where C is $m \times (m-d)$ with columns orthogonal to B
• B gives the projector onto the central subspace
  $\Longleftrightarrow\; p_{Y|X}(y \mid x) = p_{Y|U}(y \mid B^T x)$
  $\Longleftrightarrow\; p_{Y|U,V}(y \mid u, v) = p_{Y|U}(y \mid u)$ for all $y, u, v$
  $\Longleftrightarrow$ conditional independence: $Y \perp\!\!\!\perp V \mid U$
• Our approach: Characterize conditional independence
Outline
• Introduction
  – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Manifold KDR
• Summary
Reproducing Kernel Hilbert Spaces
• “Kernel methods”
  – RKHS's have generally been used to provide basis expansions for
    regression and classification (e.g., the support vector machine)
  – Kernelization: map data into the RKHS and apply linear or second-order
    methods in the RKHS
  – But RKHS's can also be used to characterize independence and
    conditional independence
[Diagram: feature maps $\Phi_X : \Omega_X \to H_X$ and $\Phi_Y : \Omega_Y \to H_Y$ send X and Y into the RKHSs $H_X$ and $H_Y$.]
Positive Definite Kernels and RKHS
• Positive definite kernel (p.d. kernel)
  $k : \Omega \times \Omega \to \mathbb{R}$
  k is positive definite if $k(x, y) = k(y, x)$ and, for any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in \Omega$,
  the Gram matrix $\big(k(x_i, x_j)\big)_{i,j}$ is positive semidefinite (a numerical check is sketched below).
  – Example: Gaussian RBF kernel $k(x, y) = \exp\big(-\|x - y\|^2 / \sigma^2\big)$
• Reproducing kernel Hilbert space (RKHS)
  k : p.d. kernel on $\Omega$
  $\Rightarrow$ H : reproducing kernel Hilbert space (RKHS)
  1) $k(\cdot, x) \in H$ for all $x \in \Omega$.
  2) $\mathrm{Span}\{k(\cdot, x) \mid x \in \Omega\}$ is dense in H.
  3) $\langle k(\cdot, x), f \rangle_H = f(x)$ (reproducing property)
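A small numerical illustration of the definition (an illustrative sketch, not part of the original slides): the Gaussian RBF kernel above yields a symmetric, positive semidefinite Gram matrix on any finite sample. The helper name, sample size, dimension, and σ are arbitrary choices of the example.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / sigma^2)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 sample points in R^3
K = gaussian_kernel(X, X)               # 20 x 20 Gram matrix
K_symmetric = np.allclose(K, K.T)       # k(x, y) = k(y, x)

# Positive semidefiniteness: all eigenvalues are (numerically) nonnegative.
print(K_symmetric, np.linalg.eigvalsh(K).min() >= -1e-10)
```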
• Functional data
  Feature map $\Phi : \Omega \to H$, $x \mapsto k(\cdot, x)$, i.e. $\Phi(x) = k(\cdot, x)$
  Data: $X_1, \ldots, X_N$ $\;\Rightarrow\;$ $\Phi_X(X_1), \ldots, \Phi_X(X_N)$ : functional data
• Why RKHS?
  – By the reproducing property, computing the inner product on the RKHS is easy
    (sketched below):
    $\langle \Phi(x), \Phi(y) \rangle = k(x, y)$
    $f = \sum_{i=1}^N a_i \Phi(x_i) = \sum_i a_i k(\cdot, x_i), \quad g = \sum_{j=1}^N b_j \Phi(x_j) = \sum_j b_j k(\cdot, x_j)$
    $\langle f, g \rangle = \sum_{i,j} a_i b_j k(x_i, x_j)$
  – The computational cost essentially depends on the sample size:
    advantageous for high-dimensional data with small sample size.
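A minimal sketch of the inner-product formula above (an illustrative example, not from the original slides; the coefficients, kernel width, and data are arbitrary): the inner product of two RKHS expansions reduces to the quadratic form $a^T K b$ in the Gram matrix, so its cost depends on N rather than on the dimension of the data.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma**2)

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 50))    # N = 10 points in a 50-dimensional space
a = rng.normal(size=10)          # coefficients of f = sum_i a_i k(., x_i)
b = rng.normal(size=10)          # coefficients of g = sum_j b_j k(., x_j)

K = gaussian_kernel(x, x)        # Gram matrix K[i, j] = k(x_i, x_j)
inner_fg = a @ K @ b             # <f, g>_H = sum_{i,j} a_i b_j k(x_i, x_j)
print(inner_fg)                  # cost is O(N^2), independent of the dimension 50
```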
Covariance Operators on RKHS
• X, Y : random variables on $\Omega_X$ and $\Omega_Y$, resp.
• Prepare RKHSs $(H_X, k_X)$ and $(H_Y, k_Y)$ defined on $\Omega_X$ and $\Omega_Y$, resp.
• Define random variables in the RKHSs $H_X$ and $H_Y$ by
  $\Phi_X(X) = k_X(\cdot, X)$ and $\Phi_Y(Y) = k_Y(\cdot, Y)$
• Define the covariance operator $\Sigma_{YX}$:
  $\Sigma_{YX} = E\big[\Phi_Y(Y)\,\langle \Phi_X(X), \cdot\,\rangle\big] - E[\Phi_Y(Y)]\,E\big[\langle \Phi_X(X), \cdot\,\rangle\big]$
[Diagram: the operator $\Sigma_{YX}$ acts between the RKHSs $H_X$ and $H_Y$ associated with the feature maps $\Phi_X$ and $\Phi_Y$.]
Covariance Operators on RKHS
• Definition
  $\Sigma_{YX} = E\big[\Phi_Y(Y)\,\langle \Phi_X(X), \cdot\,\rangle\big] - E[\Phi_Y(Y)]\,E\big[\langle \Phi_X(X), \cdot\,\rangle\big]$
  $\Sigma_{YX}$ is an operator from $H_X$ to $H_Y$ such that
  $\langle g, \Sigma_{YX} f \rangle = E[g(Y) f(X)] - E[g(Y)]\,E[f(X)] \;\;(= \mathrm{Cov}[f(X), g(Y)])$
  for all $f \in H_X$, $g \in H_Y$
• cf. Euclidean case
  $V_{YX} = E[Y X^T] - E[Y]\,E[X]^T$ : covariance matrix
  $\langle b, V_{YX} a \rangle = \mathrm{Cov}[\langle b, Y \rangle, \langle a, X \rangle]$
Characterization of Independence
• Independence and cross-covariance operators
  If the RKHSs are “rich enough”:
  $X \perp\!\!\!\perp Y \;\Longleftrightarrow\; \Sigma_{XY} = O$
  $\Longleftrightarrow\; \mathrm{Cov}[f(X), g(Y)] = 0$, i.e. $E[g(Y) f(X)] = E[g(Y)]\,E[f(X)]$, for all $f \in H_X$, $g \in H_Y$
  – The direction "$X \perp\!\!\!\perp Y \Rightarrow \Sigma_{XY} = O$" is always true; the converse requires an
    assumption on the kernel (universality)
  – e.g., Gaussian RBF kernels $k(x, y) = \exp\big(-\|x - y\|^2 / \sigma^2\big)$ are universal
    (an empirical version of this criterion is sketched below)
  – cf. for Gaussian variables, X and Y are independent $\Longleftrightarrow\; V_{XY} = O$, i.e. uncorrelated
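A hedged empirical counterpart of this characterization (an illustrative sketch; the original slides do not give this estimator): with centered Gram matrices $G_X$ and $G_Y$, the squared Hilbert-Schmidt norm of the empirical cross-covariance operator is $\mathrm{Tr}[G_X G_Y]/N^2$, which is near zero for independent samples and clearly positive for dependent ones. The function names, kernel width, and toy data are assumptions of the example.

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma**2)

def center(G):
    """(I - 11^T/N) G (I - 11^T/N): centered Gram matrix."""
    N = G.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ G @ H

def dependence_measure(X, Y, sigma=1.0):
    """Squared HS norm of the empirical cross-covariance operator,
    Tr[G_X G_Y] / N^2, computed with centered Gram matrices."""
    G_X, G_Y = center(gaussian_gram(X, sigma)), center(gaussian_gram(Y, sigma))
    return np.trace(G_X @ G_Y) / X.shape[0] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_indep = rng.normal(size=(200, 1))                        # independent of X
Y_dep = np.sin(3 * X) + 0.1 * rng.normal(size=(200, 1))    # a function of X plus noise

print(dependence_measure(X, Y_indep))   # close to zero
print(dependence_measure(X, Y_dep))     # clearly larger
```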
• Independence and characteristic functions
  Random variables X and Y are independent
  $\Longleftrightarrow\; E_{XY}\big[e^{i\omega^T X} e^{i\eta^T Y}\big] = E_X\big[e^{i\omega^T X}\big]\,E_Y\big[e^{i\eta^T Y}\big]$ for all $\omega$ and $\eta$
  I.e., $e^{i\omega^T x}$ and $e^{i\eta^T y}$ work as test functions
• RKHS characterization
  Random variables $X \in \Omega_X$ and $Y \in \Omega_Y$ are independent
  $\Longleftrightarrow\; E_{XY}[f(X)\,g(Y)] = E_X[f(X)]\,E_Y[g(Y)]$ for all $f \in H_X$, $g \in H_Y$
  – The RKHS approach is a generalization of the characteristic-function approach
RKHS and Conditional Independence
• Conditional covariance operator
  X and Y are random vectors; $H_X$, $H_Y$ : RKHSs with kernels $k_X$, $k_Y$, resp.
  Def. $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ : conditional covariance operator
  – Under a universality assumption on the kernel,
    $\langle g, \Sigma_{YY|X}\,g \rangle = E\big[\mathrm{Var}[g(Y) \mid X]\big]$
  – cf. for Gaussian variables,
    $\mathrm{Var}_{Y|X}[a^T Y \mid X = x] = a^T\big(V_{YY} - V_{YX} V_{XX}^{-1} V_{XY}\big)a$
    (a numerical analogue follows below)
  – Monotonicity of conditional covariance operators: for random vectors X = (U, V),
    $\Sigma_{YY|U} \geq \Sigma_{YY|X}$  ($\geq$ in the sense of self-adjoint operators)
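A numerical analogue of the Gaussian case above (an illustrative sketch, not from the original slides): the conditional variance $V_{YY} - V_{YX} V_{XX}^{-1} V_{XY}$ estimated from sample covariances does not decrease when an uninformative coordinate V is added to U, which mirrors the monotonicity statement. The toy model and the helper name `cond_var` are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
U = rng.normal(size=N)
V = rng.normal(size=N)                     # independent of (U, Y)
Y = 2.0 * U + 0.5 * rng.normal(size=N)     # Y depends on U only

def cond_var(Y, predictors):
    """Gaussian-style conditional variance V_YY - V_YX V_XX^{-1} V_XY
    estimated from sample covariances."""
    X = np.column_stack(predictors)
    C = np.cov(np.column_stack([Y, X]), rowvar=False)
    V_YY, V_YX, V_XX = C[0, 0], C[0, 1:], C[1:, 1:]
    return V_YY - V_YX @ np.linalg.solve(V_XX, V_YX)

print(cond_var(Y, [V]))        # ~ Var(Y) = 4.25: V alone is uninformative
print(cond_var(Y, [U]))        # ~ 0.25: U captures all the dependence
print(cond_var(Y, [U, V]))     # ~ 0.25: adding V does not reduce it further
```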
• Conditional independence
  Theorem. X = (U, V) and Y are random vectors, and
  $H_X$, $H_U$, $H_Y$ are RKHSs with Gaussian kernels $k_X$, $k_U$, $k_Y$, resp. Then
  $Y \perp\!\!\!\perp V \mid U \;\Longleftrightarrow\; \Sigma_{YY|U} = \Sigma_{YY|X}$
  This theorem provides a new methodology for solving the sufficient
  dimension reduction problem.
Outline
• Introduction
  – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Manifold KDR
• Summary
Kernel Dimension Reduction
• Use a universal kernel for $B^T X$ and Y
  $\Sigma_{YY|B^T X} \geq \Sigma_{YY|X}$   ($\geq$ : the partial order of self-adjoint operators)
  $\Sigma_{YY|B^T X} = \Sigma_{YY|X} \;\Longleftrightarrow\; Y \perp\!\!\!\perp X \mid B^T X$
• KDR objective function:
  $\min_{B : B^T B = I_d} \mathrm{Tr}\big[\Sigma_{YY|B^T X}\big]$
  which is an optimization over the Stiefel manifold (see the sketch below)
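One simple way to respect the Stiefel constraint $B^T B = I_d$ during a gradient-based search, shown in this hedged sketch (an illustration only; the original slides do not specify the optimizer), is to take an unconstrained step and re-orthonormalize the columns of B with a QR decomposition. The step size and the placeholder gradient are arbitrary.

```python
import numpy as np

def retract(B):
    """Map an m x d matrix back onto the Stiefel manifold (B^T B = I_d)
    via a QR decomposition."""
    Q, R = np.linalg.qr(B)
    return Q * np.sign(np.diag(R))   # fix column signs for continuity

rng = np.random.default_rng(0)
m, d = 13, 2
B = retract(rng.normal(size=(m, d)))      # random feasible starting point

# One illustrative update: unconstrained step along a (placeholder) gradient,
# then retraction back onto the constraint set.
grad = rng.normal(size=(m, d))            # stand-in for the gradient of Tr[Sigma_YY|B^T X]
B_new = retract(B - 0.1 * grad)
print(np.allclose(B_new.T @ B_new, np.eye(d)))   # True: constraint maintained
```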
Estimator
• Empirical cross-covariance operator
  $\hat{\Sigma}_{YX}^{(N)} = \frac{1}{N}\sum_{i=1}^N \big(k_Y(\cdot, Y_i) - \hat{m}_Y\big)\,\big\langle k_X(\cdot, X_i) - \hat{m}_X,\ \cdot\,\big\rangle$
  where $\hat{m}_X = \frac{1}{N}\sum_{i=1}^N k_X(\cdot, X_i)$ and $\hat{m}_Y = \frac{1}{N}\sum_{i=1}^N k_Y(\cdot, Y_i)$
  $\hat{\Sigma}_{YX}^{(N)}$ gives the empirical covariance:
  $\big\langle g, \hat{\Sigma}_{YX}^{(N)} f \big\rangle = \frac{1}{N}\sum_{i=1}^N f(X_i)\,g(Y_i) - \frac{1}{N}\sum_{i=1}^N f(X_i)\cdot\frac{1}{N}\sum_{i=1}^N g(Y_i)$
• Empirical conditional covariance operator
  $\hat{\Sigma}_{YY|X}^{(N)} = \hat{\Sigma}_{YY}^{(N)} - \hat{\Sigma}_{YX}^{(N)}\big(\hat{\Sigma}_{XX}^{(N)} + \varepsilon_N I\big)^{-1}\hat{\Sigma}_{XY}^{(N)}$
  $\varepsilon_N$ : regularization coefficient
• Estimating function for KDR:
  $\mathrm{Tr}\big[\hat{\Sigma}_{YY|U}^{(N)}\big] = \mathrm{Tr}\big[\hat{\Sigma}_{YY}^{(N)} - \hat{\Sigma}_{YU}^{(N)}\big(\hat{\Sigma}_{UU}^{(N)} + \varepsilon_N I\big)^{-1}\hat{\Sigma}_{UY}^{(N)}\big]$,  where $U = B^T X$
  $\propto \mathrm{Tr}\big[G_Y - G_Y G_U\big(G_U + N\varepsilon_N I_N\big)^{-1}\big]$
  where $G_U = \big(I_N - \tfrac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\big) K_U \big(I_N - \tfrac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\big)$ : centered Gram matrix,
  $K_U = \big(k(B^T X_i, B^T X_j)\big)_{ij}$
• Optimization problem:
  $\min_{B : B^T B = I_d} \mathrm{Tr}\big[G_Y\big(G_{B^T X} + N\varepsilon_N I_N\big)^{-1}\big]$
  (a sketch of this computation follows below)
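A hedged sketch of the empirical objective above (an illustrative implementation; the kernel widths, the regularization value $\varepsilon_N$, and the toy data are arbitrary choices): it evaluates $\mathrm{Tr}\big[G_Y (G_{B^T X} + N\varepsilon_N I_N)^{-1}\big]$ from centered Gram matrices and compares a projection onto the informative direction with an uninformative one.

```python
import numpy as np

def gaussian_gram(Z, sigma):
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma**2)

def center(G):
    N = G.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ G @ H

def kdr_objective(B, X, Y, sigma=1.0, eps=1e-3):
    """Tr[ G_Y (G_{B^T X} + N*eps*I_N)^{-1} ] with centered Gram matrices."""
    N = X.shape[0]
    G_U = center(gaussian_gram(X @ B, sigma))
    G_Y = center(gaussian_gram(Y, sigma))
    return np.trace(np.linalg.solve(G_U + N * eps * np.eye(N), G_Y))

# Toy data: Y depends on X only through its first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = np.tanh(X[:, [0]]) + 0.1 * rng.normal(size=(100, 1))

B_informative = np.eye(5)[:, [0]]     # spans the direction Y actually depends on
B_orthogonal = np.eye(5)[:, [1]]      # orthogonal to it
print(kdr_objective(B_informative, X, Y))   # typically the smaller value
print(kdr_objective(B_orthogonal, X, Y))
```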
Experiments with KDR
• Wine data: 13 dimensions, 178 observations, 3 classes; 2-dimensional projections
• Kernel: $k(z_1, z_2) = \exp\big(-\|z_1 - z_2\|^2 / \sigma^2\big)$, $\sigma = 30$
[Figure: 2-dimensional projections of the wine data obtained by KDR, Partial Least Squares, CCA, and Sliced Inverse Regression.]
Consistency of KDR
Theorem
Suppose $k_d$ is bounded and continuous, and
$\varepsilon_N \to 0$, $N^{1/2}\varepsilon_N \to \infty$ $(N \to \infty)$.
Let $S_0$ be the set of optimal parameters:
$S_0 = \big\{ B \;\big|\; B^T B = I_d,\ \mathrm{Tr}\big[\Sigma_{YY|X}^{B}\big] = \min_{B'} \mathrm{Tr}\big[\Sigma_{YY|X}^{B'}\big] \big\}$
Then, under some conditions, for any open set $U \supset S_0$,
$\Pr\big(\hat{B}^{(N)} \in U\big) \to 1 \quad (N \to \infty).$
Lemma
Suppose $k_d$ is bounded and continuous, and
$\varepsilon_N \to 0$, $N^{1/2}\varepsilon_N \to \infty$ $(N \to \infty)$.
Then, under some conditions,
$\sup_{B : B^T B = I_d}\Big|\mathrm{Tr}\big[\hat{\Sigma}_{YY|X}^{B(N)}\big] - \mathrm{Tr}\big[\Sigma_{YY|X}^{B}\big]\Big| \to 0 \quad (N \to \infty)$
in probability.
Conclusions
• Introduction
  – dimension reduction and conditional independence
• Conditional covariance operators on RKHS
• Kernel Dimensionality Reduction for regression
• Technical report available at:
  www.cs.berkeley.edu/~jordan/papers/kdr.pdf