
Trace Pursuit: A General Framework for Model-Free
Variable Selection
Lixing Zhu
Jointly with Zhou Yu and Yuexiao Dong
Hong Kong Baptist University
October 16, 2014
Outline
1. Model-free variable selection
2. Methodology development
3. Numerical studies
Variable selection
Classical variable selection: nonnegative garrote (Breiman, 1995); LASSO, adaptive LASSO, and group LASSO (Tibshirani, 1996; Zou, 2006; Yuan and Lin, 2006); SCAD (Fan and Li, 2001); the Dantzig selector (Candès and Tao, 2007); MCP (Zhang, 2010); etc.
Model-free selection: Let X = (x_1, ..., x_p)^T be the vector of predictors and Y the scalar response. Let A^c denote the complement of A in the index set I = {1, ..., p}; X_A = {x_i : i ∈ A} is the set of active predictors, and X_{A^c} = {x_i : i ∈ A^c} the set of inactive predictors.
Why model-free selection? In high-dimensional scenarios it is hard to form a reliable idea of the data structure, so committing to a specific model form is risky.
Variable selection
Consider the nonparametric dimension-reduction sparse model

    Y ⫫ X | P_S X  and  Y ⫫ X_{A^c} | X_A,        (1)

where ⫫ stands for independence and P_S denotes the q-dimensional projection operator onto S with respect to the standard inner product.
A special case is Y = G(β_1^T X, ..., β_q^T X, ε), where β = (β_1, ..., β_q) is a p × q orthonormal matrix and q is unknown.
Variable selection: p < n
Sufficient dimension reduction-based methods: regularized SIR (Li and Yin, 2008); coordinate-independent sparse dimension reduction (CISE) (Chen et al., 2010). Both require p < n.
Sequential test-based methods: Cook (2004), Shao et al. (2007), and Li et al. (2005).
All these methods rely on an initial estimator of the central space S_{Y|X}, so they cannot handle p > n.
Methods that require no such initial estimator are in demand for ultrahigh-dimensional paradigms.
Variable selection: p > n
Correlation pursuit (COP; Zhong et al., 2012): based on SIR, so it can miss significant variables that are linked to Y only through a quadratic term or an interaction.
SIRI (Jiang and Liu, 2013): also based on SIR, but able to handle models with interaction terms.
Both involve estimating the structural dimension q of S_{Y|X}, which makes the problem very challenging when p > n.
Trace pursuit: the basic procedure
Design method-specific (SIR-, SAVE-, or DR-based) trace tests.
Basic algorithm: the stepwise trace pursuit (STP) algorithm, which, like stepwise regression, selects predictors by forward addition and backward deletion. It is time-consuming when p is large.
A more efficient two-stage hybrid trace pursuit (HTP) algorithm: 1) a forward trace pursuit (FTP) screening step, i.e., forward selection with a modified BIC criterion, yields a chosen model containing all active predictors; 2) STP then performs the refined variable selection.
Trace pursuit: an example of the basic procedure
Assume E(X) = 0 and E(Y) = 0.
The principle of SIR-based trace pursuit: for a working index set F and an index j ∈ F^c, check whether or not

    Y ⫫ x_j | X_F.        (2)

For any index set F, denote X_F = {x_i : i ∈ F}, var(X_F) = Σ_F, and U_{F,h} = E(X_F | Y ∈ J_h). Define the SIR matrix M_F^sir as

    M_F^sir = Σ_F^{-1/2} ( ∑_{h=1}^H p_h U_{F,h} U_{F,h}^T ) Σ_F^{-1/2}.        (3)
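The slice-based construction in Eq. (3) can be sketched in code. This is a minimal illustration, not the authors' implementation; slicing Y by its sample quantiles and the toy data at the end are my own assumptions.

```python
# Sketch of the empirical SIR matrix M^sir_F of Eq. (3) for a working index
# set F: slice Y, average X_F within each slice, then sandwich the weighted
# between-slice covariance with Sigma_F^{-1/2} on both sides.
import numpy as np

def sir_matrix(X, Y, F, H=4):
    """Empirical M^sir_F for the columns of X indexed by the list F."""
    XF = X[:, F] - X[:, F].mean(axis=0)        # center, as E(X) = 0 is assumed
    n = len(Y)
    # symmetric inverse square root of Sigma_F via its eigendecomposition
    Sigma = np.atleast_2d(np.cov(XF, rowvar=False))
    w, V = np.linalg.eigh(Sigma)
    Sig_inv_half = V @ np.diag(w ** -0.5) @ V.T
    # slice Y by its sample quantiles; p_h = slice proportion, U_{F,h} = slice mean
    order = np.argsort(Y)
    M = np.zeros((len(F), len(F)))
    for h in range(H):
        idx = order[h * n // H:(h + 1) * n // H]
        ph = len(idx) / n
        U = XF[idx].mean(axis=0)
        M += ph * np.outer(U, U)
    return Sig_inv_half @ M @ Sig_inv_half

# toy check: tr(M^sir_F) is large when F contains the active predictor x1
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
Y = X[:, 0] + 0.1 * rng.standard_normal(500)
t_active = np.trace(sir_matrix(X, Y, [0, 1]))
t_noise = np.trace(sir_matrix(X, Y, [1, 2]))
```

The toy check mirrors Propositions 1-2: the trace carries signal only through index sets that touch the active predictors.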
Trace pursuit: an example of the basic procedure
Denote by F ∪ j the index set consisting of j together with all the indices in F. Given that X_F is already in the model, check the contribution of the additional variable x_j to Y through tr(M_{F∪j}^sir) − tr(M_F^sir) (≥ 0).
Recall that A denotes the active index set satisfying Y ⫫ X_{A^c} | X_A, and I = {1, ..., p} denotes the full index set.
Proposition 1: For any index set F such that A ⊆ F ⊆ I, we have tr(M_A^sir) = tr(M_F^sir) = tr(M_I^sir). Consequently, tr(M_{F∪j}^sir) − tr(M_F^sir) = 0 whenever A ⊆ F.
In words: once A ⊆ F, any additional variable brings no further information on Y.
Trace pursuit: an example of the basic procedure
Suppose β ∈ R^{p×q} is a basis of S_{Y|X}, where q is the unknown structural dimension. Write β_i = (β_{i,1}, ..., β_{i,p})^T for i = 1, ..., q, and let β_min = min_{j∈A} (∑_{i=1}^q β_{i,j}^2)^{1/2}.
Proposition 2: For any F such that F^c ∩ A ≠ ∅, we have

    max_{j∈F^c∩A} {tr(M_{F∪j}^sir) − tr(M_F^sir)} ≥ λ_q λ_max^{-1}(Σ) λ_min(Σ) β_min^2 > 0,

where λ_max(Σ) and λ_min(Σ) are the largest and smallest eigenvalues of Σ, respectively.
In words: when F does not contain A, at least one element of F^c still carries information on Y.
Trace pursuit: an example of the basic procedure
When the SIR matrix M_F^sir is replaced by its empirical version M̂_F^sir, we can define a test statistic with the following property.
Proposition 3: When Y ⫫ x_j | X_F for j ∈ F^c, the test statistic

    T_{j|F}^sir = n{tr(M̂_{F∪j}^sir) − tr(M̂_F^sir)} → ∑_{k=1}^H ω_{j|F,k}^sir χ²_{k,1}

in distribution, where the χ²_{k,1} are independent χ²_1 variables. Here ω_{j|F,1}^sir ≥ ... ≥ ω_{j|F,H}^sir are the eigenvalues of a matrix Ω_{j|F}^sir that is related to the eigenvectors of the SIR matrix (see Yu, Dong and Zhu (2014) for details).
Similar results can be derived when SAVE or DR is used.
Trace pursuit: an example of the basic procedure
Proposition 4: Assume that there exist 0 < c < C and 0 < ξ_min < 1/2 such that

    min_{F: F^c∩A≠∅} max_{j∈F^c∩A} {tr(M_{F∪j}^sir) − tr(M_F^sir)} > c n^{−ξ_min}.        (4)

If we set 0 < c̄^sir < c n^{1−ξ_min}, then as n → ∞,

    Pr( min_{F: F^c∩A≠∅} max_{j∈F^c∩A} T_{j|F}^sir > c̄^sir ) → 1.

If we set c^sir > C n^{1−ξ_min}, then as n → ∞,

    Pr( max_{F: F^c∩A=∅} min_{j∈F} T_{j|{F\j}}^sir < c^sir ) → 1.
Trace pursuit: a comparison with COP
(The COP procedure, Zhong et al. (2012):) Given the structural dimension q, denote the largest q eigenvalues of M_{F∪j}^sir by λ_{F∪j}^{(k)}, and the largest q eigenvalues of M_F^sir by λ_F^{(k)}, k = 1, ..., q.
COP is based on the key quantity ∑_{k=1}^q (1 − λ_{F∪j}^{(k)})^{-1} (λ_{F∪j}^{(k)} − λ_F^{(k)}).
The COP test reduces to the trace test with our quantity tr(M_{F∪j}^sir) − tr(M_F^sir) if we drop the scaling factor (1 − λ_{F∪j}^{(k)})^{-1} and assume both M_{F∪j}^sir and M_F^sir have rank q.
Compared with COP, SIR-based trace pursuit does not involve estimating q, which is a challenging task when p is large.
The stepwise trace pursuit algorithm
(a) Initialization. Set the initial working set to F = ∅.
(b) Forward addition. Find the index a_F such that

    a_F = arg max_{j∈F^c} tr(M̂_{F∪j}^sir).        (5)

If T_{a_F|F}^sir = n{tr(M̂_{F∪a_F}^sir) − tr(M̂_F^sir)} > c̄^sir, update F to F ∪ a_F.
(c) Backward deletion. Find the index d_F such that

    d_F = arg max_{j∈F} tr(M̂_{F\j}^sir).        (6)

If T_{d_F|{F\d_F}}^sir = n{tr(M̂_F^sir) − tr(M̂_{F\d_F}^sir)} < c^sir, update F to F\d_F.
(d) Repeat steps (b) and (c) until no predictor can be added or deleted.
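Steps (a)-(d) can be sketched generically, with the trace statistic abstracted into a function argument. The `toy_trace` stand-in below is hypothetical: its value saturates once the active set {0, 3} is covered, mimicking Propositions 1-2.

```python
# Sketch of the STP loop. trace_fn(F) stands for tr(M_hat^sir_F); c_add and
# c_del play the roles of the thresholds c_bar^sir and c^sir.
def stepwise_trace_pursuit(trace_fn, p, n, c_add, c_del):
    F = set()                                    # (a) initialization
    while True:
        changed = False
        # (b) forward addition: best candidate by trace, kept if test > c_add
        cand = [j for j in range(p) if j not in F]
        if cand:
            a = max(cand, key=lambda j: trace_fn(F | {j}))
            if n * (trace_fn(F | {a}) - trace_fn(F)) > c_add:
                F |= {a}
                changed = True
        # (c) backward deletion: weakest member by trace, dropped if test < c_del
        if F:
            d = max(F, key=lambda j: trace_fn(F - {j}))
            if n * (trace_fn(F) - trace_fn(F - {d})) < c_del:
                F -= {d}
                changed = True
        if not changed:                          # (d) stop when stable
            return F

active = {0, 3}
toy_trace = lambda F: sum(1.0 for j in F if j in active)  # hypothetical stand-in
sel = stepwise_trace_pursuit(toy_trace, p=6, n=100, c_add=3.0, c_del=3.0)
```

With the saturating toy trace, the loop adds exactly the two active indices and then stops, since every further candidate yields a zero trace increment.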
The stepwise trace pursuit algorithm
To determine a single index a_F, STP must sweep over all candidate indices in F^c and compare p − |F| test statistics: an overwhelming computational burden when p is large!
The forward trace pursuit algorithm
An initial screening algorithm:
(a) Initialization. Set S^(0) = ∅.
(b) Forward addition. For k ≥ 1, S^(k−1) is given at the beginning of the kth iteration. For every j ∈ I\S^(k−1), compute tr(M̂_{S^(k−1)∪j}^sir), and find a_k such that

    a_k = arg max_{j∈I\S^(k−1)} tr(M̂_{S^(k−1)∪j}^sir).

(c) Solution path. Repeat step (b) n times to get a sequence of n nested candidate models. Denote the solution path by S = {S^(k) : 1 ≤ k ≤ n}, where S^(k) = {a_1, ..., a_k}.
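The forward path (a)-(c) is simpler than STP because nothing is ever tested or removed. A minimal sketch, again with a hypothetical toy trace in place of tr(M̂^sir):

```python
# Sketch of FTP: greedily grow a nested path of candidate models by the
# trace statistic, with no testing and no deletion.
def forward_trace_pursuit(trace_fn, p, n_steps):
    S, path = [], []
    for _ in range(n_steps):                 # up to n nested candidate models
        cand = [j for j in range(p) if j not in S]
        if not cand:
            break
        a = max(cand, key=lambda j: trace_fn(S + [j]))
        S = S + [a]
        path.append(list(S))                 # S^(k) = {a_1, ..., a_k}
    return path

# hypothetical stand-in trace: active predictors contribute much more
active = {1, 4}
toy_trace = lambda F: sum(2.0 if j in active else 0.1 for j in F)
path = forward_trace_pursuit(toy_trace, p=6, n_steps=4)
```

Because the path is nested, the active set is contained in every model from some step onward, which is exactly what the screening guarantee (Proposition 5) needs.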
The forward trace pursuit algorithm
Use the modified BIC criterion (following the idea of Chen and Chen, 2008):

    BIC(F) = − log{tr(M̂_F^sir)} + n^{-1} |F| (log n + 2 log p).        (7)

The candidate model S^(m̂) is selected with m̂ = arg min_{1≤k≤n} BIC(S^(k)).
Proposition 5: Under certain regularity conditions, as n → ∞ and p → ∞, Pr(A ⊆ S^(m̂)) → 1.
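The stopping rule (7) can be sketched directly. The path of trace values below is made up for illustration: it rises sharply for the first two (active) additions and is then nearly flat, so the size penalty should stop the path at k = 2.

```python
# Sketch of the modified-BIC selection over a solution path. traces[k-1]
# stands for tr(M_hat^sir_{S^(k)}) of the k-th nested model.
import math

def bic_select(traces, n, p):
    def bic(k):
        # Eq. (7): -log of the trace plus a size penalty that also grows with p
        return -math.log(traces[k - 1]) + (k / n) * (math.log(n) + 2 * math.log(p))
    return min(range(1, len(traces) + 1), key=bic)

traces = [0.9, 1.8, 1.81, 1.82, 1.83]   # hypothetical path values
m_hat = bic_select(traces, n=300, p=1000)
```

The extra 2 log p term (relative to ordinary BIC) is what keeps the selected model small when p grows with n.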
The hybrid trace pursuit (HTP) algorithm
Combines FTP as the initial screening step with STP as the refined selection step:
(a) Perform FTP to get the solution path S = {S^(k) : 1 ≤ k ≤ n}.
(b) Based on the BIC of (7), select S^(m̂) with m̂ = arg min_{1≤k≤n} BIC(S^(k)).
(c) Perform STP with the full index set I = {1, ..., p} replaced by the screened index set S^(m̂), using as the critical value in both the forward-addition and backward-deletion steps the 1 − 0.1/p quantile (α = 0.1/p) of the weighted chi-square distribution. (Jiang and Liu (2013) used cross-validation instead.)
Simulations
We consider the following models:

    I:   Y = sgn(x_1 + x_p) exp(x_2 + x_{p−1}) + ε,  (asymmetric)
    II:  Y = 2 x_1^2 x_p^2 − 2 x_2^2 x_{p−1}^2 + ε,  (symmetric)
    III: Y = x_1^4 − x_p^4 + 3 exp(.8 x_2 + .6 x_{p−1}) + ε.  (mixture)

X = (x_1, ..., x_p)^T is multivariate normal with E(X) = 0 and var(X) = Σ, and ε ∼ N(0, σ²) is independent of X.
The structural dimensions for Models I to III are q = 2, 4, and 3, respectively. The index set of significant predictors is A = {1, 2, p − 1, p}.
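The three models translate directly into a data generator. This is a plain restatement of the formulas above; the function name and its defaults (taken from the settings on the next slide) are mine.

```python
# Sketch of the data-generating step for Models I-III with AR(1)-type
# covariance Sigma_{ij} = rho^{|i-j|} and independent N(0, sigma^2) noise.
import numpy as np

def simulate(model, n=300, p=10, rho=0.5, sigma=0.2, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = sigma * rng.standard_normal(n)
    # active predictors: x_1, x_2, x_{p-1}, x_p (0-based columns below)
    x1, x2, xq, xp = X[:, 0], X[:, 1], X[:, p - 2], X[:, p - 1]
    if model == "I":
        Y = np.sign(x1 + xp) * np.exp(x2 + xq) + eps
    elif model == "II":
        Y = 2 * x1**2 * xp**2 - 2 * x2**2 * xq**2 + eps
    else:  # Model III
        Y = x1**4 - xp**4 + 3 * np.exp(0.8 * x2 + 0.6 * xq) + eps
    return X, Y

X, Y = simulate("I")
```

Note that rho = 0 reduces Sigma to the identity, which is the uncorrelated setting used in the tables.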
Simulations
Set σ = .2, the sample size n = 300, and the number of slices H = 4.
Consider three settings of p: p = 10 for small dimensionality, p = 100 for moderate dimensionality, and p = 1000 for high dimensionality.
Denote the (i, j)th entry of Σ by ρ^{|i−j|}; ρ = 0 gives uncorrelated predictors and ρ = .5 correlated predictors.
Use Bentler and Xie (2000)'s approach to approximate the upper αth quantile of a weighted χ² distribution.
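Bentler and Xie (2000) provide an analytic approximation. As a simple stand-in that avoids their formula, the upper-α quantile of ∑_k ω_k χ²_{k,1} can also be estimated by Monte Carlo; this pure-Python sketch (with made-up weights) is mine, not the paper's method.

```python
# Monte Carlo estimate of the upper-alpha quantile of a weighted sum of
# independent chi-square(1) variables: draw sum_k w_k * Z_k^2 with Z_k ~ N(0,1).
import math
import random

def weighted_chisq_quantile(weights, alpha, n_draws=200_000, seed=1):
    rng = random.Random(seed)
    draws = sorted(
        sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        for _ in range(n_draws)
    )
    return draws[int((1 - alpha) * n_draws)]

# sanity check: with a single unit weight this is just the chi^2_1 quantile
q = weighted_chisq_quantile([1.0], alpha=0.05)
```

For α = 0.1/p with p = 1000 (the HTP critical value), a larger `n_draws` would be needed for a stable tail estimate, which is one reason an analytic approximation is preferable in practice.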
Simulations
Based on N = 100 repetitions, we report the underfitted count (UF), the correctly fitted count (CF), the overfitted count (OF), and the average model size (MS).
Let Â^(i) be the estimated active set in the ith repetition and define

    UF = ∑_{i=1}^N I(A ⊄ Â^(i)),    CF = ∑_{i=1}^N I(A = Â^(i)),
    OF = ∑_{i=1}^N I(A ⊂ Â^(i)),    MS = N^{-1} ∑_{i=1}^N |Â^(i)|.
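The four summaries translate one-to-one into code; the helper name below is mine.

```python
# Direct restatement of the UF/CF/OF/MS definitions over N estimated active sets.
def selection_metrics(A, A_hats):
    A = set(A)
    UF = sum(not A <= set(a) for a in A_hats)       # misses some active predictor
    CF = sum(A == set(a) for a in A_hats)           # exactly the active set
    OF = sum(A < set(a) for a in A_hats)            # all actives plus extras
    MS = sum(len(a) for a in A_hats) / len(A_hats)  # average model size
    return UF, CF, OF, MS

# toy example with A = {1, 2} and three repetitions:
m = selection_metrics({1, 2}, [[1, 2], [1, 2, 5], [2]])
```

Note UF + CF + OF = N, since every estimated set is exactly one of underfitted, correct, or overfitted; the tables below satisfy this check row by row.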
Table 1
Table: Selection performance of HTP for Model I (N = 100 repetitions).

                          ρ = 0                      ρ = .5
Method         p     UF   CF   OF     MS        UF   CF   OF     MS
HTP-SIR       10      0  100    0   4.00         0  100    0   4.00
             100      0  100    0   4.00         0  100    0   4.00
            1000      0  100    0   4.00         0  100    0   4.00
HTP-SAVE      10      9   59   32   4.31         4   39   57   4.00
             100     32    0   68  20.53        46    1   53  18.14
            1000     90    0   10  18.93        91    0    9  15.59
HTP-DR        10      0   98    2   4.02         0   99    1   4.01
             100      0   95    5   4.07         0   93    7   4.08
            1000      0   96    4   4.04         0   94    6   4.08
Table 2
Table: Selection performance of HTP for Model II (N = 100 repetitions).

                          ρ = 0                      ρ = .5
Method         p     UF   CF   OF     MS        UF   CF   OF     MS
HTP-SIR       10    100    0    0    .31       100    0    0    .24
             100    100    0    0    .13       100    0    0    .08
            1000    100    0    0    .03       100    0    0    .03
HTP-SAVE      10      2   97    1   3.99         2   94    4   4.02
             100      3   53   44   4.63         3   50   47   4.71
            1000      3   48   49   4.79         7   41   52   4.95
HTP-DR        10      3   95    2   3.99         3   93    4   4.01
             100      2   56   42   4.70         3   46   51   4.77
            1000      7   44   49   4.76         7   45   48   4.91
Table 3
Table: Selection performance of HTP for Model III (N = 100 repetitions).

                          ρ = 0                      ρ = .5
Method         p     UF   CF   OF     MS        UF   CF   OF     MS
HTP-SIR       10    100    0    0   2.07       100    0    0   2.23
             100    100    0    0   2.58       100    0    0   3.56
            1000    100    0    0   6.34       100    0    0   6.56
HTP-SAVE      10      4   33   63   5.14         7   45   48   4.61
             100     47    8   45  12.42        36    6   58   8.43
            1000     86    0   14  21.76        78    2   20  16.86
HTP-DR        10      0   91    9   4.11         0   98    2   4.02
             100      3   83   14   4.13         4   79   17   4.16
            1000      4   88    8   4.06         5   61   34   4.39
Table 4
Table: Comparison between COP, SIRI, and HTP-DR. Selection performances based on p = 1000 and N = 100 repetitions are reported.

                         ρ = 0                      ρ = .5
Model   Method      UF   CF   OF     MS        UF   CF   OF     MS
I       COP          0   86   14   4.14         0   85   15   4.16
        SIRI         0   66   34   4.46         0   86   14   4.19
        HTP-DR       0   96    4   4.04         0   94    6   4.08
II      COP        100    0    0   4.00       100    0    0   4.00
        SIRI        52   38   10   3.79        36   45   19   4.05
        HTP-DR       7   44   49   4.76         7   45   48   4.91
III     COP        100    0    0   3.09       100    0    0   3.15
        SIRI         1   99    0   3.99         2   98    0   3.98
        HTP-DR       4   88    8   4.06         5   61   34   4.39
Table 5
Table: Comparison between SIS, DC-SIS, and FTP algorithms for screening. Frequencies of cases including all active predictors are reported, based on p = 2000 and N = 100 repetitions.

                         ρ = 0                          ρ = .5
Method       Model I  Model II  Model III    Model I  Model II  Model III
DC-SIS           100       100        100        100       100        100
FTP-SIR          100         0          0        100         0          0
FTP-SAVE          12        97          7         10        98         31
FTP-DR           100        97         98        100        98         97

We also found that DC-SIS tends to select many more variables.
THANK YOU!