Regularized Sufficient Dimension Reduction
and Variable Selection
Lexin Li
BSWG, 2007 Raleigh, NC
Outline
• Introduction to sufficient dimension reduction (SDR)
• Generalized eigen decomposition approach with regularization
– Review of existing methods
– Regularization
– Examples
• Minimum discrepancy approach with regularization
– Review of existing methods
– Regularization
– Examples
• Concluding remarks and discussions
Introduction to SDR
General framework of SDR:
• study the conditional distribution of Y ∈ IR^r given X ∈ IR^p
• find a p × d matrix Γ = (γ_1, ..., γ_d), d ≤ p, such that

      Y | X  =_D  Y | Γ^T X    ⇔    Y ⊥⊥ X | Γ^T X

• replace X with Γ^T X without loss of information on the regression Y | X

Key concept: central subspace S_{Y|X}

      Y ⊥⊥ X | Γ^T X   ⇒   S_DRS = Span(Γ)   ⇒   S_{Y|X} = ∩ S_DRS  (the intersection of all such dimension reduction subspaces)
Introduction to SDR
Known regression models:
• multi-index model: Y = f_1(γ_1^T X) + ... + f_d(γ_d^T X) + ε
• heteroscedastic model: Y = f(γ_1^T X) + g(γ_2^T X) ε
• logit model: log[Pr(Y = 1 | X) / {1 − Pr(Y = 1 | X)}] = f(γ_1^T X), ...
Goal: reduce dimension of X for the study of Y |X
Existing SDR estimation methods:
• generalized eigen decomposition approaches: SIR, SAVE, PHD, . . .
• minimum discrepancy approaches: IRE, CIRE, . . .
A Motivating Example
Consider a response model:
Y = exp(−0.5 γ_1^T X) + 0.5ε
• all predictors X and error ε are independent standard normal
• S_{Y|X} = Span(γ_1), where γ_1 = (1, −1, 0, ..., 0)^T / √2
• SIR estimate (n = 100, p = 6): (0.651, −0.745, −0.063, 0.134, 0.014, 0.003)^T
Observations:
• all existing methods produce linear combinations of all the predictors
• interpretation can be difficult
Generalized Eigen Decomposition Approach
A generalized eigen decomposition formulation:
      M υ_i = ρ_i G υ_i,   i = 1, ..., p,

where M is a method-specific, symmetric, nonnegative definite matrix; G is a symmetric, positive definite matrix; and υ_i^T G υ_j = 1 if i = j, 0 if i ≠ j.
• Span(υ_1, ..., υ_d) ⊆ S_{Y|X} for the eigenvalues ρ_1 ≥ ... ≥ ρ_d > 0
• the conditions are typically imposed on the marginal distribution of
X, rather than Y |X (model-free)
• asymptotic or permutation tests are available to determine the
structural dimension d = dim(S_{Y|X})
Generalized Eigen Decomposition Approach
Formulations for some common SDR methods:
Sliced inverse regression (SIR):
      M = cov[E{X − E(X) | Y}],   G = Σ_x

Sliced average variance estimation (SAVE):
      M = Σ_x^{1/2} E[{I − cov(Z | Y)}^2] Σ_x^{1/2},   where Z = Σ_x^{−1/2} {X − E(X)},   G = Σ_x

Principal Hessian directions (y-based):
      M = Σ_x^{1/2} Σ_yzz Σ_yzz^T Σ_x^{1/2},   where Σ_yzz = E[{Y − E(Y)} Z Z^T],   G = Σ_x

Principal Hessian directions (r-based):
      M = Σ_x^{1/2} Σ_rzz Σ_rzz^T Σ_x^{1/2},   where Σ_rzz = E[{Y − E(Y) − E(Y Z^T) Z} Z Z^T],   G = Σ_x

Iterative Hessian transformation (IHT):
      M = Σ_x^{1/2} Σ̃_yzz Σ̃_yzz^T Σ_x^{1/2},   where Σ̃_yzz = (β_yz, ..., Σ_yzz^{p−1} β_yz),   β_yz = E(Y Z),   G = Σ_x

Principal components analysis (PCA):
      M = Σ_x,   G = I_p
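As a concrete illustration (not from the original talk), a minimal numerical sketch of the SIR choices M = cov[E{X − E(X) | Y}] and G = Σ_x, followed by the generalized eigen decomposition M v = ρ G v; the function names and the equal-sized slicing scheme are assumptions for illustration.

```python
# Sketch: SIR as a generalized eigen decomposition (illustrative, not the talk's code).
import numpy as np
from scipy.linalg import eigh

def sir_matrices(X, y, h=10):
    """Estimate M = cov[E{X - E(X) | Y}] by slicing y, and G = Sigma_x."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                        # centered predictors
    G = np.cov(Xc, rowvar=False)                   # G = Sigma_x
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), h):   # h roughly equal-sized slices
        f_s = len(idx) / n                         # slice proportion
        m_s = Xc[idx].mean(axis=0)                 # slice mean of centered X
        M += f_s * np.outer(m_s, m_s)
    return M, G

def sdr_directions(M, G, d):
    """Solve M v = rho G v; eigh normalizes the eigenvectors so that v' G v = 1."""
    evals, evecs = eigh(M, G)                      # eigenvalues in ascending order
    return evecs[:, ::-1][:, :d], evals[::-1][:d]  # leading d directions and eigenvalues
```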
A Regression-type Formulation
Proposition 1: Let m_i, i = 1, ..., p, denote the columns of M^{1/2}, and let β be a p × d matrix. Let

      β̂ = arg min_β  Σ_{i=1}^p ||G^{−1} m_i − β β^T m_i||²_G,

subject to β^T G β = I_d, where the norm is with respect to the G inner product. Then β̂_j = υ_j, j = 1, ..., d, where β̂_j is the jth column of β̂.
A Regression-type Formulation
A possible regularization solution:
      min_β  Σ_{i=1}^p ||G^{−1} m_i − β β^T m_i||²_G,   subject to β^T G β = I_d and |β_j|_1 ≤ τ_j,

where |β_j|_1 = Σ_{k=1}^p |β_jk| is the sum of the absolute values of the components of β_j
Observations:
• optimization is difficult, with many potential local optima
• letting M = Σx and G = Ip , this is the idea of SCoTLASS (Jolliffe
et al. 2003) for sparse PCA
A Regression-type Formulation
Proposition 2: Let m_i, i = 1, ..., p, denote the columns of M^{1/2}, and let α and β be p × d matrices. For any λ_2 > 0, let

      (α̂, β̂*) = arg min_{α,β} { Σ_{i=1}^p ||G^{−1} m_i − α β^T m_i||²_G + λ_2 trace(β^T G β) },

subject to α^T G α = I_d, where the norm is with respect to the G inner product. Then β̂*_j ∝ υ_j, j = 1, ..., d, where β̂*_j is the jth column of β̂*.
• motivated by Zou, Hastie, and Tibshirani (2006)
• permits a regularized formulation that is easy to optimize
Regularized Eigen Decomposition Approach
Proposed regularization solution:
      min_{α,β}  Σ_{i=1}^p ||G^{−1} m_i − α β^T m_i||²_G + λ_2 trace(β^T G β) + Σ_{j=1}^d λ_1j |β_j|_1,

subject to α^T G α = I_d, where λ_1j ≥ 0, j = 1, ..., d, are lasso shrinkage parameters. We call the solution β̂ the sparse sufficient dimension reduction estimator.
• as the λ_1j's increase, more coefficients of the β̂_j's are shrunk exactly to zero
• easy to optimize
Regularized Eigen Decomposition Approach
Given α:
      β̂_αj = arg min_{β_j}  β_j^T (M + λ_2 G) β_j − 2 α_j^T M β_j + λ_1j |β_j|_1

• equivalent to d independent lasso optimizations, with (α_j^T M^{1/2}, 0)^T_{2p×1} as the "response" and (M^{1/2}, √λ_2 G^{1/2})^T_{2p×p} as the "predictors"
• can use any lasso algorithm, e.g., LARS (Efron et al., 2004)

Given β:

      α̂_β = arg max_α trace(α^T M β),   subject to α^T G α = I_d

• α̂_β = G^{−1/2} U V^T, where G^{−1/2} M β = U D V^T
Regularized Eigen Decomposition Approach
An alternating minimization algorithm:
1. Choose the usual sufficient dimension reduction estimator with no
lasso constraint as an initial value for α.
2. Given fixed α, solve d independent lasso problems to obtain the
estimate of β = (β1 , . . . , βd ).
3. For fixed β, carry out singular value decomposition of
G−1/2 M β = U DV T , and update α = G−1/2 U V T .
4. Repeat Steps 2 and 3, until β converges.
5. Normalize β as β_j = β_j / ||β_j||, j = 1, ..., d.
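A minimal sketch of this alternating algorithm, assuming M and G have already been estimated (e.g., via the SIR sketch above) and that λ_1j and λ_2 are user-supplied; it substitutes scikit-learn's coordinate-descent lasso for LARS and rescales the penalty for sklearn's 1/(2 n_samples) convention. Names are illustrative, not the talk's implementation.

```python
# Sketch: alternating minimization for sparse SDR (illustrative, not the talk's code).
import numpy as np
from scipy.linalg import eigh, svd
from sklearn.linear_model import Lasso

def sparse_sdr(M, G, d, lam1, lam2, max_iter=100, tol=1e-6):
    p = M.shape[0]
    w, V = eigh(G)                                   # symmetric square roots of G
    G_half = V @ np.diag(np.sqrt(w)) @ V.T
    G_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    w, V = eigh(M)                                   # symmetric square root of M
    M_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

    # Step 1: initialize alpha with the ordinary SDR directions (no lasso constraint).
    evals, evecs = eigh(M, G)
    alpha = evecs[:, ::-1][:, :d]

    X_aug = np.vstack([M_half, np.sqrt(lam2) * G_half])      # "predictors", 2p x p
    beta = alpha.copy()
    for _ in range(max_iter):
        beta_old = beta.copy()
        # Step 2: d independent lasso fits with "response" (alpha_j' M^{1/2}, 0)'.
        for j in range(d):
            y_aug = np.concatenate([M_half @ alpha[:, j], np.zeros(p)])
            fit = Lasso(alpha=lam1[j] / (2.0 * 2 * p), fit_intercept=False)
            beta[:, j] = fit.fit(X_aug, y_aug).coef_
        # Step 3: update alpha from the SVD of G^{-1/2} M beta.
        U, _, Vt = svd(G_inv_half @ M @ beta, full_matrices=False)
        alpha = G_inv_half @ U @ Vt
        # Step 4: stop once beta has converged.
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    # Step 5: normalize each column of beta.
    norms = np.linalg.norm(beta, axis=0)
    return beta / np.where(norms > 0, norms, 1.0)
```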
Regularized Eigen Decomposition Approach
Additional observations:
• the algorithm is guaranteed to converge, and it typically converges
fast
• use an information-type criterion to select the tuning parameters
λ1j ’s and λ2
Motivating example revisited:
• S_{Y|X} = Span(γ_1), where γ_1 = (0.707, −0.707, 0, ..., 0)^T
• SIR estimate: (0.651, −0.745, −0.063, 0.134, 0.014, 0.003)^T
• Sparse-SIR estimate: (0.716, −0.698, 0, 0, 0, 0)^T
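A hypothetical usage sketch on this example, reusing the sir_matrices, sdr_directions, and sparse_sdr functions sketched on earlier slides; the tuning values lam1 and lam2 below are arbitrary illustrations rather than the information-criterion choices behind the estimates reported above.

```python
# Sketch: motivating example, n = 100, p = 6 (uses the functions sketched earlier).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6
gamma1 = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
X = rng.standard_normal((n, p))
y = np.exp(-0.5 * X @ gamma1) + 0.5 * rng.standard_normal(n)

M, G = sir_matrices(X, y, h=5)
b_sir, _ = sdr_directions(M, G, d=1)                      # ordinary SIR: all coefficients nonzero
b_sparse = sparse_sdr(M, G, d=1, lam1=[0.5], lam2=0.1)    # sparse SIR: lasso penalty can zero out coefficients
```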
Sparse Sliced Inverse Regression
Consider a response model:
      Y = sign(β_1^T X) log(|β_2^T X + 5|) + 0.2ε,

• all predictors X and error ε are independent standard normal
• S_{Y|X} = Span(β_1, β_2), with
  – (i) β_1 = (1, 1, 1, 1, 0, ..., 0)^T,   β_2 = (0, ..., 0, 1, 1, 1, 1)^T
  – (ii) β_1 = (1, 1, 0.1, 0.1, 0, ..., 0)^T,   β_2 = (0, ..., 0, 0.1, 0.1, 1, 1)^T
  – (iii) β_1 = (1, ..., 1, 0, ..., 0)^T,   β_2 = (0, ..., 0, 1, ..., 1)^T
• n = 200
Sparse Sliced Inverse Regression
Simulation results:

               β̂_1                        β̂_2                      (β̂_1, β̂_2)
               num     cor     mse        num     cor     mse       vcc
(i)    SIR     0.000   0.926   1.604      0.000   0.911   1.245     0.934
       S-SIR   15.16   0.975   1.352      15.38   0.974   1.026     0.946
(ii)   SIR     0.000   0.884   0.551      0.000   0.856   0.544     0.932
       S-SIR   17.67   0.984   0.245      17.68   0.986   0.205     0.968
(iii)  SIR     0.000   0.916   4.793      0.000   0.942   4.168     0.917
       S-SIR   9.220   0.877   5.006      9.630   0.908   4.329     0.816
Sparse Sliced Average Variance Estimation
Swiss bank notes data:
• response: binary indicator, authentic versus counterfeit
• predictors: p = 6, lengths at center, left-edge, right-edge,
bottom-edge, top-edge, diagonal
• n = 200 observations
• SAVE concluded that d = 2, with estimated directions:
(−0.03×C − 0.20×L + 0.25×R + 0.59×B + 0.57×T − 0.47×D),
(−0.28×C − 0.06×L − 0.16×R + 0.51×B + 0.33×T + 0.73×D)
• Sparse SAVE estimated directions:
(0×C + 0×L + 0×R + 0.79×B + 0.62×T + 0×D),
(0×C + 0×L + 0×R + 0.40×B + 0×T + 0.92×D)
[Figure: two-panel summary plot; panel (a) SAVE, axes "SAVE dir1" vs. "SAVE dir2"; panel (b) Sparse SAVE, axes "S−SAVE dir1" vs. "S−SAVE dir2"]
Figure 1: Summary plot for Swiss bank notes data. Circles indicate authentic notes and dots indicate counterfeit notes.
Discussions
Summary:
• can be applied to most existing SDR methods that can be formulated as a generalized eigen decomposition problem
• can facilitate interpretation
• Li, L. (2007) A Note on Sparse Sufficient Dimension Reduction.
Biometrika, in press.
Limitations:
• does not yet perform genuine variable selection
• difficult to derive asymptotic properties
Minimum Discrepancy Approach
A minimum discrepancy formulation:
• Cook and Ni (IRE, 2005), Cook and Ni (CIRE, 2006)
• assumes Y is categorical; Js (Y ) = 1 if Y is in slice s, s = 1, . . . , h
• starts with the construction of a p × h matrix θ = (θ_1, ..., θ_h), with the property that Span(θ) = S_{Y|X}
  – For SIR: θ_s = f_s Σ_x^{−1} {E(X | J_s = 1) − E(X)},   f_s = Pr(J_s = 1)
  – For CIRE: θ_s = Σ_x^{−1} cov(Y J_s, X)
• let η ∈ IR^{p×d} denote a basis of S_{Y|X}; then θ_{p×h} = η_{p×d} γ_{d×h}
• given data, construct a √n-consistent estimator θ̂ of θ (a code sketch follows)
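A minimal sketch of the initial CIRE-type estimator θ̂, whose sth column estimates θ_s = Σ_x^{−1} cov(Y J_s, X); the function name and the label-based slice coding are illustrative assumptions, not the talk's implementation.

```python
# Sketch: initial CIRE-type estimator theta_hat (illustrative, not the talk's code).
import numpy as np

def cire_theta_hat(X, y, slice_labels):
    """Columns are theta_s = Sigma_x^{-1} cov(Y * J_s, X), one per slice."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma_x = np.cov(Xc, rowvar=False)
    labels = np.unique(slice_labels)
    theta = np.zeros((p, len(labels)))
    for s, lab in enumerate(labels):
        J = (slice_labels == lab).astype(float)            # slice indicator J_s
        YJ = y * J
        cov_yj_x = Xc.T @ (YJ - YJ.mean()) / (n - 1)       # sample cov(Y J_s, X)
        theta[:, s] = np.linalg.solve(Sigma_x, cov_yj_x)
    return theta
```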
Minimum Discrepancy Approach
• estimates (η, γ) by minimizing over B ∈ IR^{p×d} and C ∈ IR^{d×h} the quadratic discrepancy function

      G(B, C) = {vec(θ̂) − vec(BC)}^T V_n {vec(θ̂) − vec(BC)},

  where V_n is an estimator of a positive definite matrix V ∈ IR^{ph×ph}
• represents a class of estimators, with each member determined by the choice of the pair (θ̂, V_n)
• any choice of positive definite Vn will result in a consistent estimator
of a basis for S_{Y|X}.
• the asymptotically optimal choice is obtained by setting Vn as a
consistent estimator of Γ^{−1}, where Γ is the asymptotic covariance matrix of n^{1/2} {vec(θ̂) − vec(θ)}
Regularized Minimum Discrepancy Approach
Proposed regularization solution:
      G_s(α) = n {vec(θ̂) − vec(diag(α) η̂ γ̂)}^T V_n {vec(θ̂) − vec(diag(α) η̂ γ̂)},

subject to Σ_{j=1}^p |α_j| ≤ τ, τ ≥ 0. Here α = (α_1, ..., α_p)^T is a p-dimensional shrinkage vector. Let α̂ = arg min_α G_s(α). We call Span{diag(α̂) η̂} the shrinkage inverse regression estimator of the central subspace S_{Y|X}.
• when τ ≥ p, α̂j = 1 for all j’s
• as τ decreases, some α̂_j's are shrunk to zero, which in turn shrinks the corresponding entire rows of η̂ to zero
Regularized Minimum Discrepancy Approach
Optimization:
      G_s(α) = n {vec(θ̂) − D α}^T V_n {vec(θ̂) − D α},   where D = (diag(η̂γ̂_1)^T, ..., diag(η̂γ̂_h)^T)^T

It becomes a quadratic programming problem with a linear constraint, with "response" U ∈ IR^{ph} and "predictors" W ∈ IR^{ph×p}:

      U = √n V_n^{1/2} vec(θ̂),       W = √n V_n^{1/2} D,

where V_n^{1/2} is the symmetric square root of V_n, and γ̂_s denotes the sth column of γ̂.
Regularized Minimum Discrepancy Approach
Additional observations:
• closely related to nonnegative garrote (Breiman, 1995)
• Pr(α̂_j ≥ 0) → 1 for all j's
• use an information-type criterion to select the tuning parameter τ
• achieves simultaneous dimension reduction and variable selection
The equivalent Lagrangian formulation:
      α̂ = arg min_α ||U − W α||² + λ_n Σ_{j=1}^p |α_j|,

for some nonnegative penalty constant λ_n.
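A minimal sketch of this Lagrangian form solved as a lasso, assuming the initial estimates θ̂ (p × h), η̂ (p × d), γ̂ (d × h) and the symmetric square root of V_n are already available; the function name and the use of scikit-learn's lasso are assumptions for illustration.

```python
# Sketch: shrinkage step as a lasso on (U, W) (illustrative, not the talk's code).
import numpy as np
from sklearn.linear_model import Lasso

def shrinkage_alpha(theta_hat, eta_hat, gamma_hat, Vn_half, n, lam):
    p, h = theta_hat.shape
    # Stack diag(eta_hat @ gamma_hat[:, s]) over the h slices: D is (p*h) x p.
    D = np.vstack([np.diag(eta_hat @ gamma_hat[:, s]) for s in range(h)])
    U = np.sqrt(n) * Vn_half @ theta_hat.reshape(-1, order="F")   # vec(theta_hat)
    W = np.sqrt(n) * Vn_half @ D
    # ||U - W alpha||^2 + lam * sum_j |alpha_j|; sklearn rescales by 1/(2 * n_samples).
    fit = Lasso(alpha=lam / (2.0 * p * h), fit_intercept=False)
    return fit.fit(W, U).coef_
```

Span{diag(α̂) η̂}, with α̂ from such a fit, then gives the shrinkage inverse regression estimator described above.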
Variable Selection without a Model
Goal: to seek the smallest subset of the predictors X_A, with partition X = (X_A^T, X_Ac^T)^T, such that

      Y ⊥⊥ X_Ac | X_A

Here A denotes a subset of the indices {1, ..., p} corresponding to the relevant predictor set X_A, and A^c is the complement of A.
Existence and uniqueness: given the existence of the central subspace S_{Y|X}, A exists and is unique.
Variable Selection without a Model
Observation: (Cook, 2004, Proposition 1)
      η_{p×d} = (η_A^T, η_Ac^T)^T,   η_A ∈ IR^{(p−p0)×d},   η_Ac ∈ IR^{p0×d}.

The rows of a basis of the central subspace corresponding to X_Ac, i.e., η_Ac, are all zero vectors; and all the predictors whose corresponding rows of the S_{Y|X} basis equal zero are irrelevant.

Partition:

      θ_{p×h} = (θ_A^T, θ_Ac^T)^T,   θ_A ∈ IR^{(p−p0)×h},   θ_Ac ∈ IR^{p0×h}.

Re-describe A as A = {j : θ_jk ≠ 0 for some k, 1 ≤ j ≤ p, 1 ≤ k ≤ h}.
Asymptotic Properties
Theorem 1:
1. Let θ̃ = diag(α̂) η̂ γ̂, and let A_n = {j : θ̃_jk ≠ 0 for some k, 1 ≤ j ≤ p, 1 ≤ k ≤ h}.
2. Assume that the initial estimator satisfies √n {vec(θ̂) − vec(θ)} → N(0, Γ) for some Γ > 0, and that V_n^{1/2} = V^{1/2} + o(1/√n).
3. Assume that λ_n → ∞ and λ_n/√n → 0.
Then the shrinkage estimator θ̃ satisfies:
1. Consistency in variable selection: Pr(A_n = A) → 1
2. Asymptotic normality: √n {vec(θ̃_A) − vec(θ_A)} → N(0, Λ), for some Λ > 0
Shrinkage Covariance Inverse Regression Estimation
Simulation:
• works quite well in terms of both variable selection and basis
estimation
• compares competitively with test-based variable selection approaches within the SDR framework
Automobile data:
• response: log car price
• predictors: p = 13
• n = 195 observations
[Figure: solution paths of the shrinkage coefficients α̂_j (vertical axis "alpha") against τ (horizontal axis "tau") for the 13 predictors: citympg, hwympg, engsize, horsepower, cratio, peakrpm, length, wbase, height, width, bore, stroke, curbwt]
Figure 2: Solution paths for the automobile data.
Discussions
Summary:
• can select predictors consistently, without assuming a model
• the basis estimate corresponding to the relevant predictors is √n-consistent
• applies to a wide class of minimum discrepancy based SDR methods
• joint work with Dr. Bondell
Future work:
• still within the traditional asymptotic framework of fixed p and n → ∞; what about p → ∞ along with n?
• alternative options for tuning parameter selection
A Wilder Discussion
A Bayesian framework???
Other future work:
• missing data
• survival data
• longitudinal data
• categorical predictors
• ...
Thank You!