STATISTICAL INFERENCE WITH HIGH-DIMENSIONAL DATA

Cun-Hui Zhang
January, 2017
The YES Workshop, Eindhoven
Partially based on recent joint work with Long Feng

Outline: Inference in LM; PLSE; Data Swap
High-dimensional (HD) data
Linear regression: Y = X β + ε ∈ R^n,  β ∈ R^p,  p ≫ n
Graphical models (Gaussian, Ising)
  Vertices: X_1, ..., X_p
  Edges: Cov(X_j, X_k | X_{{j,k}^c}) ≠ 0
  Data: x_1, ..., x_p in R^n
GLM, Cox PH regression, quantile, Kendall’s tau, ...
Statistical inference: Confidence intervals/regions, p-values
Linear regression: Y = X β + ε ∈ R^n,  β ∈ R^p
Model selection
When p > n, β is not identifiable
Sparsity: β is identifiable from X β under the assumption
  ‖β‖_0 = |supp(β)| ≤ rank(X)/2
Capturing the benefit of sparsity
Stepwise regression, subset selection, AIC (Akaike, 73), Cp
(Mallows, 73), BIC (Schwarz, 78), RIC (Foster-George, 94), ...
Lasso (Tibshirani, 96), Dantzig selector (Candes-Tao, 05)
Concave penalized LSE (Fan-Li, 01; Z, 10)
and more ...
Statistical inference of θ = a^T β after model selection
An optimistic approach
Assume conditions to guarantee the "oracle property" (Fan-Li, 01):
  P(Ŝ = S) ≈ 1,  S = supp(β)
Treat Ŝ as the true model and claim with 95% confidence
  |θ − a_Ŝ^T β̂_Ŝ^(lse)| ≤ 1.96 σ {a_Ŝ^T (X_Ŝ^T X_Ŝ)^{-1} a_Ŝ}^{1/2}
The required conditions are quite strong, including necessarily the "beta-min" condition
  min_{j∈S} |β_j| > C_X σ √(log(p)/n)
Leeb-Pötscher (06)
Statistical inference of θ = a^T β after model selection
A conservative approach (Berk et al, 13; Laber-Murphy, 11)
Define model-specific regression coefficient vectors
  β_M = (X_M^T X_M)^{-1} X_M^T E[Y] ∈ R^{|M|}
Construct a confidence band for θ_M = a_M^T β_M:
  P( |θ_M − a_M^T β̂_M^(lse)| ≤ K σ {a_M^T (X_M^T X_M)^{-1} a_M}^{1/2}  ∀ M ) ≥ 95%
Claim with 95% confidence
  |θ_Ŝ − a_Ŝ^T β̂_Ŝ^(lse)| ≤ K σ {a_Ŝ^T (X_Ŝ^T X_Ŝ)^{-1} a_Ŝ}^{1/2}
K is large
Conditional approach (Lockhart et al, 14; Lee et al, 14):
  P( θ_M ∈ CI_M | Ŝ = M ) = 95%
Stability selection (Meinshausen-Bühlmann, 10), ....
Our focus
Construction of θ̂ and v̂ such that
  P( |θ̂ − θ| ≤ 1.96 √v̂ ) ≈ 95%
under proper conditions
Approximate confidence interval in the traditional sense
For simplicity we assume ε ∼ N(0, σ² I_{n×n}) unless otherwise stated
A semi-LD approach
Semiparametric inference:
Parametric component + NP component
Low-dimensional statistical inference with high-dimensional data, or
“semi-LD inference” (Z, 11):
LD component + HD component
General methodology:
HD estimation ⇒ semi-LD inference
is parallel to
NP estimation ⇒ semiparametric inference
Semiparametric literature: Engle et al (81), Bickel (82), Wahba (84),
Chen (85, 88), Rice (86), Heckman (86), Bickel et al (90), Kosorok (09)
Translation of the inference problem to its semiparametric version
Model: E[y | x, t] = θ x + g(t) = θ x + Σ_j β_j ψ_j(t)
The optimistic approach of inference after model selection:
  Model selection: M̂ = {j : |β̂_j| > λ_j}
  Estimation: θ̂ = LSE of θ based on y, x, ψ_j(t), j ∈ M̂
  Confidence interval: θ̂ ± 1.96 √v̂_optimistic
Semiparametric inference:
  Efficient score: x − h(t),  h(t) = E[x | t]
  Estimation after approximate orthogonalization:
    (x − ĥ(t))^T (y − θ̂(x − ĥ(t)) − f̂(t)) = 0
  Confidence interval: θ̂ ± 1.96 √v̂_regular
  Asymptotic efficiency: v̂_regular ≈ 1/(n F_θ)
The difference from the semiparametric setting:
  {x_j} can be less orthogonal/regular than {ψ_j(t)}
  Large |β_j| can be anywhere
Bias correction
Linear bias correction from an initial estimator, e.g. β̂^(init) = β̂^(lasso):
  β̂_j = β̂_j^(init) + w_j^T (Y − X β̂^(init))
Error decomposition:
  β̂_j − β_j = w_j^T ε − (X^T w_j − e_j)^T (β̂^(init) − β)
The LS property: X^T w_j = e_j, i.e. w_j ∝ X_j^⊥ with X_j^⊥ ⊥ X_k ∀ k ≠ j
A relaxed LS property is sufficient
Noise factor: τ_j = ‖w_j‖_2, hopefully τ_j ≍ n^{-1/2}
Bias factor: η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2
Asymptotic theory based on an ℓ_∞-ℓ_1 split:
  η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ ‖w_j‖_2) → N(0, 1)
Low-dimensional projection estimator (LDPE, Z-Zhang, 14; arXiv 11)
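For concreteness, a minimal numpy sketch of the bias-correction display above, assuming an initial estimator and a score vector w_j are already available (their construction is discussed in the following slides); the function name and the plug-in of σ̂ are illustrative.

```python
import numpy as np

def debias_coordinate(X, Y, beta_init, w_j, j, sigma_hat):
    """LDPE sketch: beta_j^(init) + w_j^T (Y - X beta^(init)), with a 95% interval."""
    resid = Y - X @ beta_init                # residual of the initial estimator
    beta_j = beta_init[j] + w_j @ resid      # linear bias correction
    se = sigma_hat * np.linalg.norm(w_j)     # noise factor tau_j times sigma-hat
    return beta_j, (beta_j - 1.96 * se, beta_j + 1.96 * se)
```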
Stability of the LDPE
Linear bias correction from an initial estimator, e.g. β̂^(init) = β̂^(lasso):
  β̂_j = β̂_j^(init) + w_j^T (Y − X β̂^(init))
Asymptotic theory:
  Noise factor: τ_j = ‖w_j‖_2, hopefully τ_j ≍ n^{-1/2}
  Bias factor: η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2
  ℓ_∞-ℓ_1 split: η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ ‖w_j‖_2) → N(0, 1)
A small change in β̂^(init), e.g. from β̂_j^(init) = 0 to ≠ 0, does not change the conclusion
β̂^(init) is treated as an estimate of β, not a model selector
Asymptotic theory
  β̂_j − β_j = w_j^T ε − (X^T w_j − e_j)^T (β̂^(init) − β)
  η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ ‖w_j‖_2) → N(0, 1)
where η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
Sample size requirement
Statistical inference of a single preconceived β_j:
  − Sparsity assumption: Σ_j min(|β_j|/λ_univ, 1) ≤ s*,  λ_univ = σ √((2/n) log p)
  − Minimax ℓ_1 error bound: ‖β̂^(init) − β‖_1 ≍ s* λ_univ
  − η_j ‖β̂^(init) − β‖_1 ≲ s* σ (log p) √(2/n)
  − Sample size requirement: n ≫ (s* log p)²
Bonferroni adjustment for simultaneous interval estimation of all β_j:
  − η_j ‖β̂^(init) − β‖_1 ≪ √(log p)
  − Sample size requirement: n ≫ (s*)² log p
Possible improvements?
Finding a “score vector” for LD projection
Asymptotic theory:
  − η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ τ_j) → N(0, 1)
  − η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
  − τ_j = ‖w_j‖_2, a noise factor, hopefully τ_j ≍ n^{-1/2}
  − Both η_j and τ_j are known given w_j. However, w_j = ?
When rank(X) = p:
  − w_j = X_j^⊥ / (X_j^T X_j^⊥) = X_j^⊥ / ‖X_j^⊥‖_2² yields the LSE
  − η_j = 0 and τ_j = 1/‖X_j^⊥‖_2
When p > n:
  − X_j^⊥ = 0 when X is in "general position"
  − z_j = a relaxed version of X_j^⊥
  − w_j = z_j / (z_j^T X_j) to ensure w_j^T X_j = 1; alternatively, w_j = z_j / ‖z_j‖_2²
Finding a “score vector” for LD projection
Asymptotic theory:
  − η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ τ_j) → N(0, 1)
  − η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
  − τ_j = ‖w_j‖_2, a noise factor, hopefully τ_j ≍ n^{-1/2}
When p > n:
  − X_j^⊥ = 0 when X is in "general position"
  − z_j = a relaxed version of X_j^⊥
  − w_j = z_j / (z_j^T X_j) to ensure w_j^T X_j = 1; alternatively, w_j = z_j / ‖z_j‖_2²
Random designs with E[X^T X/n] = Σ:
  − Regression model: X_j = X_{−j} γ_{−j} + z_j^o,  σ_j² = E‖z_j^o‖_2²/n = 1/(Σ^{-1})_{j,j}
  − Population version of X_j^⊥: z_j^o = X u^o ∝ X Σ^{-1} e_j,  u_j^o = 1,  u_{−j}^o = −γ_{−j}
  − When z_j^o is unknown: z_j ≈ z_j^o, or z_j = X u with u ≈ u^o
  − Expected: τ_j = (1 + o(1)) n^{-1/2} / σ_j
Finding a “score vector” for LD projection
Asymptotic theory:
  − η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ τ_j) → N(0, 1)
  − η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
  − τ_j = ‖w_j‖_2, a noise factor, hopefully τ_j ≍ n^{-1/2}
  − w_j = z_j / (z_j^T X_j) to ensure w_j^T X_j = 1, or w_j = z_j / ‖z_j‖_2²
For random designs with E[X^T X/n] = Σ:
  − Regression model: X_j = X_{−j} γ_{−j} + z_j^o,  σ_j² = E‖z_j^o‖_2²/n = 1/(Σ^{-1})_{j,j}
  − Population version of X_j^⊥: z_j^o = X u^o ∝ X Σ^{-1} e_j,  u_j^o = 1,  u_{−j}^o = −γ_{−j}
  − When z_j^o is unknown: z_j ≈ z_j^o, or z_j = X u with u ≈ u^o
  − Expected: τ_j = (1 + o(1)) n^{-1/2} / σ_j
Algorithms for both random and deterministic designs:
  Choice 1: z_j = residual vector of PLSE(X_{−j}, X_j)
  Choice 2: z_j = arg min_z { ‖z‖_2² : z^T X_j = n, ‖z^T X_{−j}/n‖_∞ ≤ λ_0 }
Finding a score vector with Lasso/PLSE
Asymptotic theory:
  − η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ τ_j) → N(0, 1)
  − η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
  − τ_j = ‖w_j‖_2, a noise factor, hopefully τ_j ≍ n^{-1/2}
  − w_j = z_j / (z_j^T X_j) to ensure w_j^T X_j = 1; scale free in z_j
Choice 1, the Lasso: z_j = X_j − X_{−j} γ̂_{−j}, where
  γ̂_{−j} = arg min_γ { ‖X_j − X_{−j} γ‖_2²/(2n) + λ_0 ‖γ‖_1 }
(see the code sketch below)
Some properties:
  − KKT: X_{−j}^T (X_j − X_{−j} γ̂_{−j})/n = X_{−j}^T z_j/n ∈ λ_0 ∂‖γ̂_{−j}‖_1
  − X_j^T z_j/n = ‖z_j‖_2²/n + (X_{−j} γ̂_{−j})^T z_j/n = ‖z_j‖_2²/n + λ_0 ‖γ̂_{−j}‖_1
  − τ_j = ‖w_j‖_2 = ‖z_j‖_2 / |z_j^T X_j| ≤ 1/‖z_j‖_2
  − η_j = ‖X_{−j}^T w_j‖_∞ / ‖w_j‖_2 = ‖X_{−j}^T z_j‖_∞ / ‖z_j‖_2 = λ_0 n / ‖z_j‖_2
  − Trade-off: as λ_0 decreases, η_j = ‖X_{−j}^T z_j‖_∞/‖z_j‖_2 decreases while the bound τ_j ≤ 1/‖z_j‖_2 increases
  − λ_0 = ‖z_j/n^{1/2}‖_2 √((2/n) log p)  ⇒  η_j ≤ √(2 log p)  (scaled Lasso)
Z-Zhang (14; arXiv 11)
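A short sketch of Choice 1 using scikit-learn's Lasso for the nodewise regression of X_j on X_{−j}, returning the score vector together with the noise and bias factors; the helper name and a fixed λ_0 are illustrative (the scaled-Lasso calibration of λ_0 mentioned above is not implemented here).

```python
import numpy as np
from sklearn.linear_model import Lasso

def score_vector_lasso(X, j, lam0):
    """Choice 1: z_j = residual of a Lasso regression of X_j on X_{-j}."""
    n, p = X.shape
    Xj, Xmj = X[:, j], np.delete(X, j, axis=1)
    # sklearn's Lasso minimizes ||Xj - Xmj g||_2^2/(2n) + lam0 ||g||_1
    gamma = Lasso(alpha=lam0, fit_intercept=False).fit(Xmj, Xj).coef_
    z = Xj - Xmj @ gamma                                  # relaxed orthogonalization of X_j
    w = z / (z @ Xj)                                      # so that w_j^T X_j = 1
    tau = np.linalg.norm(w)                               # noise factor tau_j
    eta = np.max(np.abs(Xmj.T @ z)) / np.linalg.norm(z)   # bias factor eta_j
    return w, tau, eta
```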
Finding a score vector by quadratic programming
Asymptotic theory:
  − η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ τ_j) → N(0, 1)
  − η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
  − τ_j = ‖w_j‖_2, a noise factor, hopefully τ_j ≍ n^{-1/2}
  − w_j = z_j / (z_j^T X_j) to ensure w_j^T X_j = 1; scale free in z_j
  − η_j = ‖X_{−j}^T z_j‖_∞ / ‖z_j‖_2 as X_j^T w_j = 1
  − τ_j = ‖z_j‖_2 / |z_j^T X_j| = ‖z_j‖_2 / n when X_j^T z_j = n
Choice 2, quadratic programming: minimize τ_j given η_j ≤ √(2 log p):
  z_j = arg min_z { τ_j : z^T X_j = n, η_j ≤ √(2 log p) }
      = arg min_z { ‖z‖_2² : z^T X_j = n, ‖X_{−j}^T z‖_∞ ≤ ‖z‖_2 √(2 log p) }
(see the code sketch below)
Z-Zhang (14; arXiv 11), Javanmard-Montanari (14)
Near estimability:
  − If η_j ≤ √(2 log p) is attainable, or ≤ C √(2 log p), β_j is "nearly estimable"
  − Otherwise, β_j is not nearly estimable
  − For Gaussian designs, η_j ≤ √(2 log p) is attainable with large probability
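A sketch of Choice 2 as a convex program. The constraint above couples the ℓ_∞ bound with ‖z‖_2; a common convex surrogate (in the spirit of the Javanmard-Montanari formulation) fixes the bound at a user-chosen level mu, with mu of order √((2/n) log p) recovering η_j ≈ √(2 log p) when ‖z_j‖_2 ≈ √n. The cvxpy formulation and names are illustrative.

```python
import numpy as np
import cvxpy as cp

def score_vector_qp(X, j, mu):
    """Choice 2 (convex surrogate): min ||z||_2^2 s.t. z^T X_j = n, ||X_{-j}^T z / n||_inf <= mu."""
    n, p = X.shape
    Xj, Xmj = X[:, j], np.delete(X, j, axis=1)
    z = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(z)),
                      [Xj @ z == n, cp.norm(Xmj.T @ z / n, "inf") <= mu])
    prob.solve()
    zval = z.value
    w = zval / n                                              # since z^T X_j = n
    tau = np.linalg.norm(w)
    eta = np.max(np.abs(Xmj.T @ zval)) / np.linalg.norm(zval)
    return w, tau, eta
```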
Some simulation results:
n = 200, p = 3000, σ = 1, λ_univ = √((2/n) log p) = 0.283,
β_j = 3 λ_univ = 0.849 for j = 1500, 1800, ..., 3000, and β_j = 3 λ_univ / j^α otherwise; β_j ≠ 0 for all j
(s, s (log p)/√n) = (8.93, 5.05) and (29.24, 16.55) for α = 2 and 1, respectively, while the theory requires s (log p)/√n → 0, where s = Σ_j min(|β_j|/λ_univ, 1)
Generate (X̃, X, ε) in each replication, where X̃ has iid N(0, Σ) rows with Σ = (ρ^{|j−k|})_{p×p} and X is the column-normalized version of X̃
Four settings, labeled (A), (B), (C), and (D), respectively, with (α, ρ) = (2, 1/5), (1, 1/5), (2, 4/5), and (1, 4/5)
Case (D) is most difficult
                all β_j                maximal β_j
              LDPE     R-LDPE        LDPE     R-LDPE
   (A)       0.9597    0.9845       0.9556    0.9855
   (B)       0.9595    0.9848       0.9557    0.9885
   (C)       0.9571    0.9814       0.9029    0.9443
   (D)       0.9614    0.9786       0.9414    0.9786

Table: Mean coverage probability of LDPE and R-LDPE.
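A compact sketch of one replication of this simulation design, shown with a reduced p so that the AR(1) covariance fits comfortably in memory; the exact column-normalization convention (here ‖X_j‖_2 = √n) and the placement of the "large" coefficients are assumptions mirroring the description above at a smaller scale.

```python
import numpy as np

def simulate_design(n=200, p=300, rho=0.8, alpha=1.0, sigma=1.0, seed=0):
    """One replication of the AR(1)-correlated design described above (reduced p)."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X_tilde = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X = X_tilde * np.sqrt(n) / np.linalg.norm(X_tilde, axis=0)   # column normalization
    lam_univ = sigma * np.sqrt(2 * np.log(p) / n)
    beta = 3 * lam_univ / (np.arange(1, p + 1) ** alpha)         # decaying coefficients
    beta[np.arange(p // 2, p, p // 10)] = 3 * lam_univ           # a few "large" coefficients
    Y = X @ beta + sigma * rng.standard_normal(n)
    return X, Y, beta
```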
[Figure: coverage results for the LDPE and the Restricted LDPE.]
Bias correction (summary)
Linear bias correction from an initial estimator, e.g. β̂^(init) = β̂^(lasso):
  β̂_j = β̂_j^(init) + w_j^T (Y − X β̂^(init))
Error decomposition: β̂_j − β_j = w_j^T ε − (X^T w_j − e_j)^T (β̂^(init) − β)
Asymptotic normality (β̂_j − β_j)/(σ̂ ‖w_j‖_2) → N(0, 1) when
  − w_j = z_j / (z_j^T X_j)
  − η_j = ‖X_{−j}^T z_j‖_∞ / ‖z_j‖_2 ≤ √(2 log p)
  − Σ_j min(|β_j|/λ_univ, 1) ≤ s*
  − s* log p ≪ n^{1/2}
Algorithms:
  Choice 1: z_j = residual vector of PLSE(X_{−j}, X_j)
  Choice 2: z_j = arg min_z { ‖z‖_2² : z^T X_j = n, ‖X_{−j}^T z‖_∞ ≤ ‖z‖_2 √(2 log p) }
An approximate confidence interval for β_j is available if either Choice 2 is feasible or Choice 1 yields a feasible solution for Choice 2
Statistical inference for βj is “difficult” conditionally on X if the
algorithms are infeasible, i.e. failing to be nearly estimable
Asymptotic efficiency
Fisher information for θ = τ (β) in small submodels (Stein, 56)
Suppose X has iid rows. Let Σ = E(X^T X/n) and a = (∂/∂β) τ(β)
The minimum Fisher information in submodels of the form β = u θ:
  F_θ = min_{u^T a = 1} u^T Σ u / σ² = (a^T Σ^{-1} a)^{-1} / σ²
The least favorable submodel subject to u^T a = 1:
  u^o = Σ^{-1} a (a^T Σ^{-1} a)^{-1},  z^o = X u^o,  w^o = n^{-1} X Σ^{-1} a
One-step correction: with u ≈ u^o and u^T a = 1,
  θ̂ = τ(β^(init)) + arg max_φ log-lik(β^(init) + u φ)
Le Cam (69), Bickel (82), Schick (86), Klaassen (87), Bickel et al (90);
Z(11, Oberwolfach), Z-Zhang (14), van de Geer et al (2014); Sun-Z (12),
Ren et al (15)
Asymptotic efficiency of the LDPE/de-biasing method
When η_j ‖β̂^(init) − β‖_1 ≪ 1 and τ_j = ‖w_j‖_2 ≈ ‖w_j^o‖_2,
  σ² τ_j² ≈ σ² ‖X Σ^{-1} e_j / n‖_2² ≈ (σ²/n) (Σ^{-1})_{j,j} = 1/(n F_j)
Suppose X has iid sub-Gaussian rows. Let Σ = E[X^T X/n], D = diag(Σ), and
  s* = Σ_{j=1}^p min(1, D_j β_j² / (σ² (2/n) log p))
Suppose λ_min(D^{-1/2} Σ D^{-1/2}) ≥ c_* and s* log p ≪ n^{1/2}. Then the minimum Fisher information is F_j = 1/{σ² (Σ^{-1})_{j,j}}, and
  (n F̂_j)^{1/2} (β̂_j − β_j) → N(0, 1),  F̂_j / F_j → 1.
The sample size condition is rate optimal
Suppose X has iid N(0, Σ) rows under P. Let
  P = { P : c_* ≤ eigen(D^{-1/2} Σ D^{-1/2}) ≤ c^*,  s log p ≤ c_*' n }
For certain c_*'' > 0 depending on {c_*, c^*, c_*'} only,
  inf_θ̂ sup_{P∈P} P( n^{1/2} |θ̂ − θ| ≥ c_*'' s (log p)/n^{1/2} ) > 0
Le Cam (73), Ren et al (15)
When n^{1/2} ≪ s log p ≪ n, the n^{-1/2} rate is impossible without some additional assumption
Javanmard-Montanari (14): Σ = I (known) with n ≫ s log p
Estimation of the noise level
The Lasso theory suggests λ = σ λ_0 with λ_0 = √(2 (log p)/n),
  β̂(λ) = arg min_β { ‖Y − X β‖_2²/(2n) + λ ‖β‖_1 }
A "naive estimate" of σ (Sun-Z, 2010): σ̂² = ‖Y − X β̂(σ̂ λ_0)‖_2²/n
Convex minimization formulation for the same σ̂ (Antoniadis, 2010):
  {β̂, σ̂} = arg min_{β,σ} { ‖Y − X β‖_2²/(2σn) + σ/2 + λ_0 ‖β‖_1 }
Theory for this scaled Lasso (Sun-Z, 2012, 2013):
  |σ̂/σ* − 1| ≲ s* λ_0²,  where σ* = ‖ε‖_2/√n
  s* log p ≪ n^{1/2}  ⇒  √n (σ̂/σ − 1) → N(0, 1/2)
An equivalent "square-root Lasso" formulation of β̂ (Belloni et al, 2011)
An earlier scale-free Lasso proposal (Städler et al, 2010) yields
  |σ̂/σ* − 1| ≲ s* λ_0² + λ_0 ‖β/σ‖_1
Bootstrap (Chernozukov et al, 14; Zhang-Cheng, 16; Dezeure et al, 16)
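A minimal sketch of the alternating minimization that solves the joint scaled-Lasso criterion above: iterate a Lasso step at penalty level σ̂ λ_0 and the residual-based update of σ̂; the stopping rule and the use of scikit-learn's Lasso are implementation choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(X, Y, lam0, n_iter=20, tol=1e-6):
    """Alternating minimization for ||Y - X b||^2/(2 sigma n) + sigma/2 + lam0 ||b||_1."""
    n = X.shape[0]
    sigma = np.std(Y)                                     # crude starting value
    for _ in range(n_iter):
        beta = Lasso(alpha=sigma * lam0, fit_intercept=False).fit(X, Y).coef_
        sigma_new = np.linalg.norm(Y - X @ beta) / np.sqrt(n)   # sigma update
        if abs(sigma_new - sigma) < tol * sigma:
            sigma = sigma_new
            break
        sigma = sigma_new
    return beta, sigma
```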
What has happened?
The de-biasing/LD projection approach has been taken in a spate of papers on
linear regression, graphical models, FDR control, GLM, Cox PH regression,
group inference, bootstrap and more: Zhang-Z (14), Sun-Z (12), Belloni et al
(14,15), van de Geer-Bühlmann-Ritov (14), Javanmard-Montanari (14a, 14b,
15), Chernozukov et al (14), Ren et al (15), Ning-Liu (14), Fang et al (14),
Cai-Guo (15), Mitra-Z (15), Bloniarz et al (15), Zhang-Cheng (16), Dezeure et
al (16), . . .
A basic analysis of the LDPE/de-biasing method:
  β̂_j = β̂_j^(init) + w_j^T (Y − X β̂^(init))
  β̂_j − β_j = w_j^T ε − (X^T w_j − e_j)^T (β̂^(init) − β)
  η_j ‖β̂^(init) − β‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ ‖w_j‖_2) → N(0, 1)
where η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
Sample size requirement
  − λ_univ = σ √((2/n) log p)
  − Sparsity assumption: Σ_j min(|β_j|/λ_univ, 1) ≤ s*
  − Minimax ℓ_1 error bound: ‖β̂^(init) − β‖_1 ≍ s* λ_univ
  − η_j ‖β̂^(init) − β‖_1 ≲ s* σ (log p) √(2/n)
  − Sample size requirement: n ≫ (s* log p)²
Possible improvements?
  − Analysis: error decomposition and the ℓ_1-ℓ_∞ split
  − Choice of β̂^(init)
  − Better approximation of z_j^o or w_j^o
A modification of the asymptotic theory
The oracle LSE: S = supp(β),  β̂_S^o = (X_S^T X_S)^{-1} X_S^T Y,
  β̂_j^o = β_j + I_{j∈S} e_j^T (X_S^T X_S)^{-1} X_S^T ε
An expansion around the oracle LSE:
  β̂_j = β̂_j^(init) + w_j^T (Y − X β̂^(init))
  β̂_j − β̂_j^o = w_j^T P_S^⊥ ε − (X^T w_j − e_j)^T (β̂^(init) − β̂^o)
  η_j ‖β̂^(init) − β̂^o‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ τ̃_j) → N(0, 1)
where η_j = ‖X^T w_j − e_j‖_∞ / ‖w_j‖_2, hopefully η_j ≤ √(2 log p)
Asymptotic variance: for w_j = w_j^o = X Σ^{-1} e_j / n,
  σ² τ̃_j² = Var( I_{j∈S} e_j^T (X_S^T X_S)^{-1} X_S^T ε + w_j^T P_S^⊥ ε ) ≈ (σ²/n) (Σ^{-1})_{j,j}
The complexity of the estimation of β̂^o should depend on {|β_j| : 0 ≤ |β_j| ≤ C_0 λ_univ}
Concave Penalized LSE
PLSE: global or suitable local minimizers of
  L_λ(β) = ‖Y − X β‖_2²/(2n) + ‖p_λ(β)‖_1,  where ‖p_λ(β)‖_1 = Σ_{j=1}^p p_λ(β_j)
Basic properties of penalty functions:
  Zero baseline penalty: p_λ(0) = 0
  Symmetry: p_λ(−t) = p_λ(t)
  Monotonicity: p_λ(x) ≤ p_λ(y) for all 0 ≤ x < y < ∞
  Sparsity: p_λ(0+) + p_λ'(0+) > 0
  Subadditivity: p_λ(x + y) ≤ p_λ(x) + p_λ(y) for all 0 ≤ x ≤ y < ∞
  One-sided differentiability: p_λ'(t±) are defined for all real t
  Convention: p_λ'(t) = x means min{p_λ'(t+), p_λ'(t−)} ≤ x ≤ max{p_λ'(t+), p_λ'(t−)}
Important quantities
Maximum concavity: κ(p_λ) = sup_{t_1 < t_2} {p_λ'(t_1) − p_λ'(t_2)}/(t_2 − t_1)
Bias threshold: a_λ = inf{ t > 0 : p_λ(x) = p_λ(t) ∀ |x| ≥ t }
  − For p = 1, PLSE = LSE when |β̂^(lse)| ≥ a_λ
Global threshold level: λ* = λ*(p_λ) = inf_{t>0} { t/2 + p_λ(t)/t }
  − For p = 1, β̂ = 0 attains the global minimum iff |β̂^(lse)| ≤ λ*
Standardization of penalty functions (scaling key quantities in λ):
  Local threshold level: p_λ'(0+) = λ if max |p_λ'(0±)| < ∞
  Global threshold level: λ*(p_λ) = λ if max |p_λ'(0±)| = ∞
  Bias threshold: a = a_λ/λ
  Local threshold level: a_* = p_λ'(0+)/λ
  Global threshold level: a^* = λ*/λ
  Maximum penalty: γ* = ‖p_λ(t)‖_∞/λ²
The above quantities are indeed constants when pλ (t) = λ2 p1 (t/λ), as in the
following examples. See Fan-Li (2001, JASA), Z (2010, AOS), and Z-Zhang
(2012, StatSci) for general discussion of penalty functions
Penalty       p_1(t)                              Bias a     Max concavity κ(p_λ)   Thresh levels a_*/a^*   Max penalty γ*
ℓ_0           2^{-1} I{t ≠ 0}                     0          ∞                      ∞/1                     1/2
Bridge/ℓ_α    C_α |t|^α                           ∞          ∞                      ∞/1                     ∞
Lasso/ℓ_1     |t|                                 ∞          0                      1/1                     ∞
SCAD          ∫_0^{|t|} (1 − (x−1)/(a−1))_+ dx    a ≥ 2      1/(a−1)                1/1                     a/2 + 1/2
MCP           ∫_0^{|t|} (1 − x/a)_+ dx            a ≥ 1      1/a                    1/1                     a/2
Capped-ℓ_1    min(|t|, a)                         a ≥ 1/2    ∞                      1/1                     a

Table: Specific penalty functions: C_α = {2(1 − α)}^{1−α}/(2 − α)^{2−α}, 0 < α < 1
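For reference, a small sketch of the standardized MCP and SCAD quantities from the table, written as p_λ(t) = λ² p_1(|t|/λ); only the pieces used later (the MCP penalty and the MCP/SCAD derivatives) are included, and the default values of a are conventional choices rather than prescriptions from the slides.

```python
import numpy as np

def mcp(t, lam, a=3.0):
    """MCP penalty: lam*|t| - t^2/(2a) for |t| <= a*lam, and a*lam^2/2 beyond."""
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), 0.5 * a * lam ** 2)

def mcp_deriv(t, lam, a=3.0):
    """MCP derivative for t >= 0: (lam - t/a)_+."""
    return np.maximum(lam - np.abs(t) / a, 0.0)

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative for t >= 0: lam if t <= lam, else (a*lam - t)_+/(a - 1)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))
```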
[Figure] The Lasso, SCAD and MCP: p_1(t) and p_1'(t) for t ≥ 0 (two panels).
[Figure] The Lasso, SCAD and MC+ estimators for p = 1 (vertical line indicates the bias threshold a); the panels show soft, hard, MC+/firm and SCAD thresholding as functions of z.
Concave PLSE: global or suitable local minimizers of
  L_λ(β) = ‖Y − X β‖_2²/(2n) + ‖p_λ(β)‖_1,  where ‖p_λ(β)‖_1 = Σ_{j=1}^p p_λ(β_j)
Critical point:
  lim inf_{t→0+} inf_{0<‖u‖_2≤1} t^{-1} { L_λ(β̂ + t u) − L_λ(β̂) } ≥ 0
Local minimizer: there exists t_0 > 0 such that
  L_λ(b) ≥ L_λ(β̂)  ∀ 0 < ‖b − β̂‖_2 ≤ t_0
Strict local minimizer: there exists t_0 > 0 such that
  L_λ(b) > L_λ(β̂)  ∀ 0 < ‖b − β̂‖_2 ≤ t_0
Global minimizer:
  β̂ = arg min_{b∈R^p} L_λ(b)
For quadratic spline penalties such as the SCAD and MCP, the type of local solution is determined by the second local directional derivative of the penalized loss at β̂.
Theorem. KKT-type conditions for local solutions: Suppose that pλ (t) is
twice left- and right-differentiable for all real t.
(i) A vector β̂ is a critical point iff
  X_j^T (Y − X β̂)/n = p_λ'(β̂_j),  ∀ j = 1, ..., p.
In particular, |X_j^T (Y − X β̂)/n| ≤ p_λ'(0+) for all j with β̂_j = 0.
(ii) A critical point β̂ is a strict local minimizer if
  ‖X u‖_2²/n + Σ_{u_j>0} p_λ''(β̂_j+) u_j² + Σ_{u_j<0} p_λ''(β̂_j−) u_j² > 0
for all u with {j : u_j > 0} ⊆ S_+ and {j : u_j < 0} ⊆ S_−, where
  S_± = { j : X_j^T (Y − X β̂)/n = p_λ'(β̂_j±) }.
(iii) Suppose p_λ(t) is a quadratic spline in [0, ∞). Then, a critical point β̂ is a strict local minimizer iff the condition in (ii) holds, and it is a local minimizer iff
  ‖X u‖_2²/n + Σ_{u_j>0} p_λ''(β̂_j+) u_j² + Σ_{u_j<0} p_λ''(β̂_j−) u_j² ≥ 0
for the same set of u.
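The stationarity condition in part (i) can be checked numerically; the sketch below takes any penalty derivative (for example mcp_deriv from the earlier sketch) and verifies the KKT-type equalities and the inequality at zero up to a tolerance. The function name and the tolerance are illustrative.

```python
import numpy as np

def is_critical_point(X, Y, beta_hat, pen_deriv, tol=1e-6):
    """Check X_j^T(Y - X b)/n = sign(b_j) p'(|b_j|) for b_j != 0 and
    |X_j^T(Y - X b)/n| <= p'(0+) for b_j == 0."""
    n = X.shape[0]
    grad = X.T @ (Y - X @ beta_hat) / n
    nz = beta_hat != 0
    ok_nz = np.all(np.abs(grad[nz] - np.sign(beta_hat[nz]) * pen_deriv(np.abs(beta_hat[nz]))) <= tol)
    ok_z = np.all(np.abs(grad[~nz]) <= pen_deriv(0.0) + tol)
    return ok_nz and ok_z
```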
Oracle local solutions: sgn(β̂) = sgn(β)
Let S = supp(β) and
  θ_select(p_λ, β) = inf{ θ : ‖v_S − β_S‖_∞ ≤ θλ + λ_0 ⇒ ‖Σ_{S,S}^{-1} p_λ'(v_S)‖_∞ ≤ θλ },
  κ_select(p_λ, β) = sup{ ‖Σ_{S^c,S} Σ_{S,S}^{-1} p_λ'(v_S)/λ‖_∞ : ‖v_S − β_S‖_∞ ≤ θ_select(p_λ, β) λ + λ_0 }.
Theorem. Let β̂^o be the oracle LSE. Suppose p_λ'(t) is continuous in t > 0,
rank(Σ_{S,S}) = |S|, min_{j∈S} |β_j| ≥ θ_select(p_λ, β) λ + λ_0, and
  ‖β̂^o − β‖_∞ ≤ λ_0,   ‖X^T (Y − X β̂^o)/n‖_∞ ≤ λ (1 − κ_select(p_λ, β))_+.
Then, there exists a local solution β̂ such that
  sgn(β̂) = sgn(β),   ‖β̂ − β‖_∞ ≤ θ_select(p_λ, β) λ + λ_0.
Moreover, if in addition θ_select(p_λ, β) = 0, then κ_select(p_λ, β) = 0 and β̂^o is a local minimizer.
Remarks:
  The selection consistency theory for the Lasso is included as a special case
  Folded concave penalties provide oracle local solutions under much weaker conditions on X than the Lasso, e.g. β̂^o is an oracle solution under the signal strength condition θ_select(p_λ, β) = 0
  Due to the multiplicity of solutions, the challenges of concave PLSE are
    − to find such oracle solutions when they exist
    − to cause little harm, or even some gain, in performance in prediction and coefficient estimation, e.g. in view of the rate minimaxity of the Lasso, when such an oracle solution does not exist
    − to compute such solutions with reasonable cost
Outline of the proof: apply Brouwer's fixed-point theorem to
  X_S^T (Y − X β̂)/n = p_λ'(β̂_S),  supp(β̂) = S,
on S to find β̂_S, and then check the condition on S^c using κ_select(p_λ, β).
See Z-Zhang (2012).
Proof:
  − Recall that ‖β̂^o − β‖_∞ ≤ λ_0.
  − Let B = { h_S : ‖h_S‖_∞ ≤ θ_select(p_λ, β) λ }.
  − For h_S ∈ B, v_S = β̂_S^o + h_S satisfies ‖v_S − β_S‖_∞ ≤ θ_select(p_λ, β) λ + λ_0.
  − It follows that ‖Σ_{S,S}^{-1} p_λ'(v_S)‖_∞ ≤ θ_select(p_λ, β) λ.
  − Thus, h_S = −Σ_{S,S}^{-1} p_λ'(β̂_S^o + h_S) has a solution (Brouwer).
  − Let β̂_S = β̂_S^o + h_S and β̂_{S^c} = 0. We have
      X_S^T (Y − X β̂)/n = X_S^T (X β̂^o − X β̂)/n = −Σ_{S,S} h_S = p_λ'(β̂_S).
  − Moreover, by the definition of κ_select(p_λ, β),
      ‖X_{S^c}^T (Y − X β̂)/n‖_∞ = ‖X_{S^c}^T (Y − X β̂^o)/n + Σ_{S^c,S} Σ_{S,S}^{-1} p_λ'(β̂_S)‖_∞
        ≤ ‖X_{S^c}^T (Y − X β̂^o)/n‖_∞ + κ_select(p_λ, β) λ ≤ λ.
Local solutions connected to the origin:
  P(λ_0, κ_0) = { p_λ(·) : p_λ'(0+) = λ ≥ λ_0, ‖p_λ'(·)‖_∞ ≤ λ, κ(p_λ) ≤ κ_0 }
  where κ(p_λ) = sup_{t_1<t_2} {p_λ'(t_1) − p_λ'(t_2)}/(t_2 − t_1)
  B(λ_0, κ_0) = the set of all local solutions for some p_λ ∈ P(λ_0, κ_0)
  B_0(λ_0, κ_0) = the set of all solutions connected to 0 in B(λ_0, κ_0)
  RE = inf_{u≠0} { ‖X u‖_2/(n^{1/2} ‖u‖_2) : ‖u_{S^c}‖_1 ≤ ‖u_S‖_1 + η ‖u‖_1 }
General picture: when RE² ≥ κ_0 and λ_0 ≥ η^{-1} σ √(2 (log p)/n),
B_0(λ_0, κ_0) is a set of "good" local solutions for prediction, the estimation of β, and variable selection
Local solutions connected to the origin:
  − B_0(λ_0, κ_0) = the set of local solutions connected to 0, as before
  − S = supp(β),  S_* = { j : 0 < |β_j| < γ λ_0 },  γ > a
  − β* is a target vector satisfying ‖X^T (Y − X β*)/n‖_∞ < η λ_0 with η < 1
  − β̂^o = (X_S^T X_S)^{-1} X_S^T Y is the oracle LSE
Theorem. Let β̂ be a local minimizer in B_0(λ_0, κ_0) with a penalty p_λ ∈ P(λ_0, κ_0). Suppose RE² ≥ κ_0. Then, for all seminorms ‖·‖,
  ‖β̂ − β*‖ ≤ λ sup{ (1 + η) ‖u‖ / ‖Σ u‖_∞ : (1 − η) ‖u_{S^c}‖_1 ≤ w_S^T u_S },
where w_S^T = {X_S^T (Y − X β*)/n − p_λ'(β*_S)}/λ, and
  β̂ = β*  when φ_min(Σ_{S,S}) > κ_0 and β* is an oracle solution for p_λ.
Moreover,
  ‖β̂ − β̂^o‖_2² ≤ ‖X β̂ − X β̂^o‖_2²/n ≤ C_η min{ λ² |S| / RE²,  λ² ‖w_S‖_2² / (RE² (RE² − κ_0)) },
with P( ‖w_S‖_2² ≤ (1 + η)² |S_*| ) → 1 when γ ≥ a + max_{j∈S} (Σ_{S,S}^{-1})_{j,j}.
Remark: The cone is smaller compared with the standard Lasso theory
Remarks:
  B_0(λ_0, κ_0) is the set of all local solutions computable by path-following algorithms starting from the origin, with the constraints λ ≥ λ_0 and κ(p_λ) ≤ κ_0 on the penalty and concavity levels
  The RE condition, which is of the ℓ_2 type, is arguably nearly the weakest proven condition on X for rate-optimal prediction and coefficient estimation error bounds for the Lasso
  Under the RE condition, any path-following solution achieves the same or sharper prediction and coefficient estimation error rates as the Lasso
  When a large majority of the components of |β̂_S^o| are large, w_S can be small for β̂^o. In this case, the concave penalty outperforms the Lasso in prediction and coefficient estimation
  Under the same RE condition, the path-following solution achieves variable selection consistency, β̂ = β̂^o, under the beta-min condition min_{j∈S} |β̂_j^o| ≥ a λ
Implication on the sample size requirement for the inference of θ = a^T β
Let β̂^(init) be a suitable concave penalized LSE
Under proper conditions on the design matrix, the sample size requirement for de-biasing β̂^(init) becomes
  n ≫ (s log p)(1 + s_* log p),  with s = |S| and s_* = |S_*|,
a partial reduction from S = supp(β) to S_* = { j : 0 < |β_j| < γ λ_0 }
Under the beta-min and restricted eigenvalue conditions, we need
  n ≫ (s + 1) log p
If the w_j are computed with the constraint
  |β̂_k^(init)| > λ_0  ⇒  (X^T w_j − e_j)_k = 0
in a "data swap" scheme, then the sample size requirement for de-biasing β̂^(init) becomes
  n ≫ max{ (s + 1) log p, (s_* log p)² }
A modification of the asymptotic theory (recalled)
Expanding around the oracle LSE β̂^o = (X_S^T X_S)^{-1} X_S^T Y as above:
  β̂_j − β̂_j^o = w_j^T P_S^⊥ ε − (X^T w_j − e_j)^T (β̂^(init) − β̂^o),
  η_j ‖β̂^(init) − β̂^o‖_1 ≪ 1  ⇒  (β̂_j − β_j)/(σ̂ τ̃_j) → N(0, 1),
with asymptotic variance σ² τ̃_j² ≈ (σ²/n) (Σ^{-1})_{j,j} for w_j = w_j^o = X Σ^{-1} e_j/n.
The complexity of the estimation of β̂^o should depend on {|β_j| : 0 ≤ |β_j| ≤ C_0 λ_univ}
A path following algorithm:
  p_λ(t) = λ² p_1(|t|/λ) with a quadratic spline p_1(t) in [0, ∞)
  β̂(λ): minimizing ‖Y − X β‖_2²/(2n) + λ² ‖p_1(β/λ)‖_1
  Rescale: τ = 1/λ,  b = b(τ) = β̂(λ)/λ,  Z = X^T y/n
  b: minimizing b^T Σ b/2 − τ Z^T b + ‖p_1(b)‖_1
  "KKT": Σ b(τ) − τ Z + p_1'(b(τ)) = 0, a linear spline path
  Direction of the path: Σ b'(τ) − Z + p_1''(b(τ)) ∘ b'(τ) = 0
Initialization: τ^(0) = 0, Ŝ^(0) = ∅, b(τ) = 0, k = 1
Iteration:
  − Find τ^(k) as the beginning point for a new Ŝ^(k) or new p_λ''(b)
  − Find the new Ŝ^(k) and p_λ''(b)
  − Compute ∂b_Ŝ = (Σ + diag(p_1''(b)))_{Ŝ,Ŝ}^{-1} Z_Ŝ with the new Ŝ^(k) and p_λ''(b)
  − b(τ) = b(τ^(k)) + (τ − τ^(k)) ∂b from τ = τ^(k) to τ = τ^(k+1)
  − k = k + 1
References:
  − Osborne et al (2000), Efron et al (2004), LARS for the Lasso
  − Z (2010), PLUS for the Lasso, SCAD, MCP and ...
[Figure: solution paths of the PLUS algorithm — the main branch is MC+, shown alongside the Lasso path; a loop in the path is visible.]
Majorize-Minimization (MM) algorithms:
  − L_λ(β) = L_0(β) + R_λ(β), current iterate β̂^(old)
  − f(β) ≥ L_λ(β),  f(β̂^(old)) = L_λ(β̂^(old))
  − β̂^(new) = arg min f(β)
  − L_λ(β̂^(new)) ≤ f(β̂^(new)) ≤ f(β̂^(old)) = L_λ(β̂^(old))
Majorizing the penalty:
  − Local quadratic approximation (LQA; Fan-Li, 2001):
      p_λ(|β_j|) ≤ p_λ(|β̂_j^(old)|) + p_λ'(|β̂_j^(old)|) (β_j² − |β̂_j^(old)|²)/(2 |β̂_j^(old)|)
  − Local linear approximation (LLA; Zou-Li, 2008):
      p_λ(|β_j|) ≤ p_λ(|β̂_j^(old)|) + p_λ'(|β̂_j^(old)|) (|β_j| − |β̂_j^(old)|) = p_λ'(|β̂_j^(old)|) |β_j| + C_j
Majorizing the loss:
  − (Fast) iterative shrinkage-thresholding algorithm (FISTA; Nesterov, 83; Daubechies et al, 2004; Beck-Teboulle, 2009):
      L_0(β) ≤ L_0(β̂^(old)) + ⟨∇L_0(β̂^(old)), β − β̂^(old)⟩ + ‖β − β̂^(old)‖_2²/(2s)
    with sup_β ‖∇^{⊗2} L_0(β)‖_spectrum ≤ 1/s
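A compact sketch combining the two majorizations above: an LLA outer loop, which re-weights the ℓ_1 penalty by p_λ'(|β̂^(old)|), with ISTA inner iterations for the resulting weighted-ℓ_1 problem; the step size, iteration counts, and reuse of mcp_deriv from the earlier sketch are illustrative choices.

```python
import numpy as np

def soft_threshold(x, thresh):
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def lla_ista(X, Y, pen_deriv, beta0, n_outer=5, n_inner=200):
    """LLA outer loop with ISTA inner steps for the weighted-l1 surrogate."""
    n = X.shape[0]
    step = n / np.linalg.norm(X, 2) ** 2          # 1/s with s = ||X^T X / n||_spectrum
    beta = beta0.copy()
    for _ in range(n_outer):
        weights = pen_deriv(np.abs(beta))         # LLA weights p'_lam(|beta^(old)|)
        b = beta.copy()
        for _ in range(n_inner):                  # ISTA on ||Y-Xb||^2/(2n) + sum_j w_j |b_j|
            grad = X.T @ (X @ b - Y) / n
            b = soft_threshold(b - step * grad, step * weights)
        beta = b
    return beta

# usage sketch: lla_ista(X, Y, lambda t: mcp_deriv(t, lam), beta0=np.zeros(X.shape[1]))
```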
[Figure: LQA and LLA majorizations of a concave penalty (two panels). See Zou-Li (08); thanks to J. Fan.]
Properties of LLA: Let β̃ = β̂^(old) and let β̂ be an estimator satisfying
  |X_j^T (X β̂ − Y)/n + p_λ'(β̃_j) ∂|β̂_j|| ≤ λ ν_j
Basic inequality: Suppose |p_λ'(t_1) − p_λ'(t_2)| ≤ κ(p_λ) |t_1 − t_2| ∀ 0 < t_1 < t_2. Similar to the basic inequality for the path-following algorithm, we have
  h^T Σ h + (1 − η) λ ‖h_{S^c}‖_1 ≤ λ w_S^T h_S + ‖h‖_2 κ(p_λ) ‖h̃‖_2 + λ ‖ν‖_2
with h = β̂ − β*, h̃ = β̃ − β*, and w_S = {X_S^T (Y − X β*)/n − p_λ'(β*_S)}/λ.
Theorem: (i) Suppose p_λ' satisfies the Lipschitz condition. Then,
  ‖h‖ ≤ λ sup{ ‖u‖ ψ(u) : u^T Σ u = 1 }
with ψ(u) = w_S^T u_S − (1 − η) ‖u_{S^c}‖_1 + ‖u‖_2 κ(p_λ) ‖h̃‖_2/λ + ‖ν‖_2.
(ii) Let ρ_0 = sup{ ‖u‖_2 : ψ(u) > 0, u^T Σ u = 1 }. Then,
  ‖h‖_2 ≤ ρ_0 κ(p_λ) ‖h̃‖_2 + λ sup_{u^T Σ u = 1} { ‖u‖_2 (w_S^T u_S − (1 − η) ‖u_{S^c}‖_1) + ‖u‖_2 ‖ν‖_2 }
  − Multistage LLA/approximate solution (Zhang, 2010, 2013; Z-Zhang, 2012)
  − RSC (Negahban et al, 2012)
Statistical inference of β_j in linear regression:
LDPE/de-biasing: β̂_j = β̂_j^(init) + w_j^T (Y − X β̂^(init))
Analysis based on an ℓ_∞-ℓ_1 split of the inner product on the r.h.s. of
  β̂_j − β_j − w_j^T ε = −(X^T w_j − e_j)^T (β̂^(init) − β)
requires n ≫ (s* log p)² when Σ_j min((|β_j|/σ) √(n/log p), 1) ≤ s*
Ideal choice of w_j = z_j / X_j^T z_j for random design:
  X_j = X_{−j} γ_{−j} + z_j^o,  w_j^o = z_j^o / X_j^T z_j^o,
with X_j^T z_j^o ≈ ‖z_j^o‖_2² ≈ E‖z_j^o‖_2² = n/(Σ^{-1})_{j,j} and Σ = E[X^T X/n]
Possibility of improvement in the sample size requirement:
  (X^T w_j − e_j)^T (β̂^(init) − β) = (X_{−j}^T z_j^o)^T (β̂^(init) − β)_{−j} / (X_j^T z_j^o) = o_P(n^{-1/2})
if w_j^o is used and β̂^(init) is ℓ_2-consistent and based on historical data
Data swap and semi-supervised learning
Data swap
Extension: statistical inference of a smooth function θ(β) of a high-dimensional parameter β with high-dimensional data.
Some examples
Linear regression: y = X β + ε ∈ R^n, β ∈ R^p, p ≫ n.
  Estimation of ⟨a^o, β⟩, ‖β‖_q^q, etc.
GLM, graphical models, Cox regression
Two samples with a common set of covariates.
  Estimation of ⟨β, γ⟩ or Corr(β, γ)
Nonlinear regression: y = f_0(X β) + ε with known f_0. For example, in phase retrieval, f_0(t) = t².
For estimation of regression coefficients, de-biasing leads to
  β̂_j = β_j + N(0, 1/(nF)) + O_P(n^{-1} σ² s log p)
Relatively straightforward extension to the inference of linear functionals ⟨a^o, β⟩
For the estimation of ‖β‖_2²,
  ‖β̂^(Lasso)‖_2² = ‖β‖_2² + 2 ⟨β, β̂^(Lasso) − β⟩ + ‖β̂^(Lasso) − β‖_2²
With â = β̂^(Lasso), can we improve the naive estimator by de-biasing ⟨â, β̂^(Lasso) − β⟩?
  =⇒  N(0, (σ²/n) β^T Σ^{-1} β)
A general solution
Suppose we observe independent data points with negative log-likelihood
  ℓ_i(β) = ℓ_i(β; data_i), i ≤ n, with a general HD β ∈ R^p
Suppose that we are interested in estimating a smooth function of β: θ = θ(β)
Fisher information: F = (∂/∂β)^{⊗2} E Σ_{i=1}^n ℓ_i(β)/n
Let a^o = ∂θ/∂β. The direction of the least favorable submodel is
  u^o = arg min_u { ⟨F u, u⟩ : ⟨a^o, u⟩ = 1 } = F^{-1} a^o / ⟨F^{-1} a^o, a^o⟩
The (minimum) Fisher information for the estimation of θ:
  F_θ = ⟨F u^o, u^o⟩ = 1/⟨F^{-1} a^o, a^o⟩
Stein (1956), Bickel et al (90), Schick (86), ..., Z (11)
Local semi-LD decomposition: for b_0 ≈ β,
  β − b_0 = LD component + HD component = u^o φ + ν
When ⟨a^o, ν⟩ = 0,
  θ(β) − θ(b_0) ≈ ⟨a^o, u^o φ + ν⟩ ≈ φ
LD projection estimator (LDPE): using estimated b_0 and u^o if necessary,
  θ̂ = θ(b_0) + arg min_{φ∈R} Σ_{i=1}^n ℓ_i(b_0 + u^o φ)
     = arg min_{θ∈R} Σ_{i=1}^n ℓ_i(b_0 + u^o {θ − θ(b_0)})
The HD problem is projected to a LD problem
Claim: under proper conditions,
  √(n F_θ) (θ̂ − θ) → N(0, 1)
Linear approximation of nonlinear problems:
  θ(β̂) = θ(β) + ⟨a^o, β̂ − β⟩ + Rem,
where a^o = (∂/∂β) θ(β). For the estimation of ‖β‖_2²,
  ‖β̂^(Lasso)‖_2² = ‖β‖_2² + 2 ⟨β, β̂^(Lasso) − β⟩ + ‖β̂^(Lasso) − β‖_2²
Assume that ‖β̂^(Lasso) − β‖_2² is small. We may de-bias the linear functional ⟨a, β̂^(Lasso) − β⟩ in an estimated direction a = β̂^(Lasso).
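As an illustration of this linearization, a sketch of a de-biased estimate of ‖β‖_2² under the simplifying assumption that the population covariance Σ is known, so the direction Σ^{-1} â can be formed directly; the construction θ̂ = ‖β̂‖_2² + 2 β̂^T Σ^{-1} X^T (Y − X β̂)/n is one natural instance of de-biasing the linear functional, not necessarily the exact estimator analyzed in the talk.

```python
import numpy as np

def debiased_sq_norm(X, Y, beta_lasso, Sigma):
    """De-biased estimate of ||beta||_2^2, assuming the population covariance Sigma is known."""
    n = X.shape[0]
    a_hat = beta_lasso                              # estimated direction a = beta_lasso
    resid = Y - X @ beta_lasso
    correction = 2 * a_hat @ np.linalg.solve(Sigma, X.T @ resid / n)
    return beta_lasso @ beta_lasso + correction
```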
Data swap
Schick (86), Klaassen (87), and more
Partition [n] into I1 and I2 of about the same size
Data-swapped estimation of θ:
  θ̂ = arg min_{θ∈R} Σ_{m=1}^2 Σ_{i∈I_m} ℓ_i( β̂^(m) + u^(m) {θ − θ(β̂^(m))} ),
where β̂^(m) ≈ β and possibly u^(m) ≈ u^o are estimates based on the data in I_m^c
Recall that without data swap,
  θ̂ = arg min_{θ∈R} Σ_{i∈[n]} ℓ_i( β̂^(init) + u^(init) {θ − θ(β̂^(init))} ),
where {β̂^(init), u^(init)} and {ℓ_i} are based on the same data
Idea: decouple the estimated nuisance parameters and still use the likelihood for all data points
When ℓ_i(β) is convex, the local minimizer is unique.
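A minimal sketch of the data-swap idea for a linear functional ⟨a, β⟩ in the linear model: split the sample, fit the initial estimator on one half, apply the de-biasing correction on the other half, and average; the known-Σ direction Σ^{-1} a and the scikit-learn Lasso fit are illustrative simplifications.

```python
import numpy as np
from sklearn.linear_model import Lasso

def data_swap_linear_functional(X, Y, a, Sigma, lam):
    """Data-swapped (cross-fitted) de-biased estimate of <a, beta>, with Sigma known."""
    n = X.shape[0]
    idx = np.arange(n)
    folds = [idx[: n // 2], idx[n // 2 :]]
    v = np.linalg.solve(Sigma, a)                 # least-favorable direction v^o = Sigma^{-1} a
    estimates = []
    for m, fold in enumerate(folds):
        other = folds[1 - m]
        beta_m = Lasso(alpha=lam, fit_intercept=False).fit(X[other], Y[other]).coef_
        resid = Y[fold] - X[fold] @ beta_m        # residuals on the held-out half
        estimates.append(a @ beta_m + v @ (X[fold].T @ resid) / len(fold))
    return float(np.mean(estimates))
```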
An ℓ_2-based theorem: Let a^o = (∂/∂β) θ(β). Suppose
  P( ‖F^{1/2} (β̂^(m) − β)‖_2 ≤ δ_{1,n},  ‖F^{1/2} (û^(m) − u^o)‖_2 ≤ ‖u^o‖_2 δ_{2,n} ) → 1,
and for {h, u} satisfying ‖F^{1/2} h‖_2 ≤ 3 δ_{1,n} and ‖F^{1/2} (u − u^o)‖_2 ≤ ‖u^o‖_2 δ_{2,n},
  |θ(β + h) − θ(β) − ⟨a^o, h⟩| ≤ C_{1,n} ‖h‖_2^{1+α_1},
  Var( ⟨ℓ̇_i(h + β) − ℓ̇_i(β), u⟩ ) ≤ δ_{4,n}² ‖F^{1/2} u‖_2²,
  |E⟨ℓ̇_i(h + β), u⟩ − ⟨F h, u⟩| ≤ C_{2,n} ‖F^{1/2} h‖_2² ‖F^{1/2} u‖_2,
where {C_{j,n}, δ_{j,n}, α_1} are nonnegative constants satisfying δ_{j,n} = o(1) and
  δ_{1,n} δ_{2,n} + F_θ^{1/2} C_{1,n} δ_{1,n}^{1+α_1} + C_{2,n} δ_{1,n}² ≪ n^{-1/2}.
Then, there exists a local minimizer of the data-swapped LDPE such that
  (n F_θ)^{1/2} (θ̂ − θ) = o_P(1) + (n F_θ)^{-1/2} Σ_{i=1}^n ⟨ℓ̇_i(β), u^o⟩ →_D N(0, 1)
In the regression setting: C_{1,n} = C_{2,n} = 0 and δ_{1,n} ≍ σ √(s (log p)/n)
Remarks
The theorem assumes the existence of estimators satisfying
  P( ‖F^{1/2} (β̂^(m) − β)‖_2 ≤ δ_{1,n},  ‖F^{1/2} (û^(m) − u^o)‖_2 ≤ ‖u^o‖_2 δ_{2,n} ) → 1
For the estimation of linear functionals in linear regression, C_{1,n} = C_{2,n} = 0 and the condition becomes
  δ_{1,n} δ_{2,n} ≪ n^{-1/2},  with δ_{1,n} ≍ σ √(s (log p)/n)
Good news: n ≫ (s log p)² can be weakened when u^o is easier to estimate than β
Recall that u^o is the direction of the least favorable submodel for the estimation of θ
Finding the least favorable direction? No: u^o = ...
Remarks
Let a^o = (∂/∂β) θ(β) and
  u^o = arg min_u { ⟨u, F u⟩ : ⟨u, a^o⟩ = 1 } = F^{-1} a^o / ⟨a^o, F^{-1} a^o⟩
A regularized estimate of u^o (e.g. Lasso, elastic net, etc.) is
  û = arg min_u { ⟨u, F̂ u⟩ + pen(u) : ⟨u, â⟩ = 1 }
Good news: fast convergence to u^o under sparsity and regularity conditions
Bad news: u^o does not have to be sparse or easy to estimate
One-step gradient approximation
  θ̂ = arg min_{θ∈R} Σ_{i∈[n]} ℓ_i( β̂^(init) + u^(init) {θ − θ(β̂^(init))} )
One-step gradient approximation (Bickel et al, 90)
  θ̂ = θ(β̂^(init)) − n^{-1} Σ_{i=1}^n ⟨ℓ̇_i(β̂^(init)), v̂^(init)⟩,
where v̂^(init) is an estimate of v^o = u^o / F_θ = F^{-1} a^o
If data swap is deployed, then
  θ̂ = n^{-1} Σ_{m=1}^2 Σ_{i∈I_m} { θ(β̂^(m)) − ⟨ℓ̇_i(β̂^(m)), v̂^(m)⟩ }
If we consider the estimation of all linear functionals of β, the one-step LDPE can be (symbolically) written as
  β̂ = β̂^(init) − (n F̂)^{-1} Σ_{i=1}^n ℓ̇_i(β̂^(init))
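A sketch of the symbolic formula in the last display for the linear model, where −ℓ̇_i(b) ∝ X_i^T (y_i − X_i b) and F̂^{-1} is replaced by a user-supplied matrix M ≈ Σ^{-1} (obtained, e.g., from nodewise Lasso or quadratic programming as earlier); with such an M this is the familiar de-biased Lasso, written here with M treated as given.

```python
import numpy as np

def one_step_debias(X, Y, beta_init, M):
    """One-step LDPE for all coordinates in the linear model, with F^{-1} replaced by M."""
    n = X.shape[0]
    return beta_init + M @ (X.T @ (Y - X @ beta_init)) / n
```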
Theorem: Suppose {u^o, u, û^(m)} can be replaced by {v^o, v, v̂^(m)} in the regularity conditions and
  δ_{1,n} δ_{2,n} + F_θ^{1/2} C_{1,n} δ_{1,n}^{1+α_1} + C_{2,n} δ_{1,n}² ≪ n^{-1/2}.
Then, the one-step estimator is asymptotically efficient,
  √(n F_θ) (θ̂ − θ) = o_P(1) + (n F_θ)^{-1/2} Σ_{i=1}^n ⟨ℓ̇_i(β), u^o⟩ →_D N(0, 1)
Remarks
The estimation of v^o = F^{-1} a^o is a somewhat different problem from the estimation of u^o = v^o F_θ
For example, instead of the Lasso (Choice 1), one may consider quadratic programming (Choice 2):
  v̂ = arg min_v { ⟨v, F̂ v⟩ : ‖F̂ v − â‖_∞ ≤ λ }
as in Z-Zhang (14) and Javanmard-Montanari (14)
This requires an ℓ_1 bound for β̂^(m) and does not take advantage of the sparsity of u^o
Examples
Regular regime for the estimation of θ(β, σ) in LM
Theorem (i) Suppose that the Hessian of θ has bounded spectrum norm. Suppose the data-swapped one-step LDPE is used, a^(m) = (∂/∂b) θ(b, σ̂)|_{b=β̂^(m)}, and quadratic programming is used to find v̂^(m). Suppose
  ‖β̂^(m) − β‖_2² + |σ̂/σ* − 1| ≲ s λ_0²
  X_{i,*} is sub-Gaussian with regular Σ
  λ_a ≍ ‖a‖_2 λ_0
  n ≫ (s log p)²
Then √n (θ̂ − θ) = O_P(1).
(ii) If in addition λ_a ‖u^o‖_1 = o_P(1), then the LDPE is efficient,
  √(n F̂_θ) (θ̂ − θ) → N(0, 1),  F̂_θ/F_θ ≈ 1,
with plug-in estimate F̂_θ.
Remark: the key condition, n ≫ (s log p)², is hard to avoid here due to nonlinearity
Estimation of ‖β‖_q^q, 1 < q ≤ 2: Suppose ‖β‖_0 = s and X is sub-Gaussian with regular Σ. Let a^o = q (sgn(β_j) |β_j|^{q−1}). Then,
  √n (θ̂ − ‖β‖_q^q) = N(0, σ² ⟨a^o, Σ^{-1} a^o⟩) + O_P( σ² (1 + s (log p)^{q/2}) / n^{q/2 − 1/2} )
Related problems: Lepski et al (99), Cai-Low (11)
Estimation of ⟨β, γ⟩, two samples of equal size:
  √n (θ̂ − ⟨β, γ⟩) = N(0, σ_1² γ^T Σ_1^{-1} γ + σ_2² β^T Σ_2^{-1} β)
with error term of the order (log p) √(s_1 s_2)/n
Phase retrieval (Shechtman et al 15; Candes et al, 15; Cai et al, 15): nonlinear regression with y_i = (X_{i,*} β)² + ε_i and X_{i,*} ∼ N(0, I),
  √n (θ̂ − ‖β‖_2²) → N(0, σ²/3)
with error term of the order (σ²/n^{1/2}) (s log p)/‖β‖_2².
Estimation of the precision matrix Θ = Σ^{-1} based on X with iid N(0, Σ) rows
Yuan-Lin (07), Friedman et al (08), Rothman et al (08), Lam-Fan (09), Ravikumar et al (11), Sun-Z (12) and Ren et al (15)
Ren et al (15): n ≫ (s* log p)² is optimal, where s* = max_j ‖Θ_{*,j}‖_0
The problem is linear: ℓ̇_i(Θ) = (X_{i,*}^T X_{i,*} − Θ^{-1})/2
Given the initial estimator Θ̂^(init), the one-step LDPE is
  Θ̂ = n^{-1} Σ_{m=1}^2 Σ_{i∈I_m} { Θ̂^(m) + (Θ̂^(m))^T − (Θ̂^(m))^T (X_{i,*}^T X_{i,*}) Θ̂^(m) }
Automatic symmetrization
Sample size requirement for √n (θ̂_{j,k} − θ_{j,k}) →_D N(0, θ_{j,k}² + θ_{j,j} θ_{k,k}):
  n ≫ s_j s_k (log p)²
Graphical model with a few modest hubs
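A sketch of the one-step update above without the data swap (the swapped version applies the same update with Θ̂^(m) fitted on the complementary half and averages); scikit-learn's GraphicalLasso is an illustrative stand-in for the initial estimator.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def debiased_precision(X, alpha=0.1):
    """One-step de-biased precision matrix: Theta + Theta^T - Theta^T Sigma_hat Theta."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    Theta = GraphicalLasso(alpha=alpha).fit(X).precision_   # initial estimator
    return Theta + Theta.T - Theta.T @ Sigma_hat @ Theta
```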
Estimation of linear functionals in linear regression
Data: y = X β + ε with ε ∼ N(0, σ²), iid X_{i,*} and Σ = E[X^T X/n]
Problem: given a^o, we want to find sufficient conditions for
  √n (θ̂ − ⟨a^o, β⟩) →_D N(0, σ² ⟨a^o, Σ^{-1} a^o⟩)
Known Σ:
  E(X_{i,*} b)^4 ≤ C_0 {E(X_{i,*} b)²}² for all b
  0 < c_* ≤ eigen(Σ) ≤ c^* < ∞
  n ≫ s log p
Sub-Gaussian design with unknown Σ: with s' = capped-ℓ_1(u^o),
  n ≫ min(s' s, s²) (log p)²
Gaussian design with unknown Σ: ‖Σ^{-1} a^o‖_1 ≲ √(log p),
  n ≫ s log p,  n ≫ (s' ∧ s)² log p
Sample size requirement: A summary
Sample size requirement for efficient inference of linear functionals in the linear model

                              Supervised Data                   Semisupervised Data
Sub-Gaussian                  n ≫ s log p,                      n ≫ s log p,
                              n ≫ min(s' s, s²)(log p)²         N ≫ min(s' s, s²)(log p)²
Gaussian design,              n ≫ s log p,                      n ≫ s log p,
  ‖u^o‖_1 ≲ √(log p)          n ≫ (s' ∧ s)²(log p)²             N ≫ (s' ∧ s)²(log p)²
Known Σ                       n ≫ s log p                       n ≫ s log p

Side conditions:
  s = Σ_{j=1}^p min(1, |β_j/σ| √(n/log p))
  s' = Σ_{j=1}^p min(1, (|u_j^o|/‖u^o‖_2) √(n/log p))
Thanks!