Uniform post selection inference for LAD regression models

Alexandre Belloni
Victor Chernozhukov
Kengo Kato
The Institute for Fiscal Studies
Department of Economics, UCL
cemmap working paper CWP24/13
UNIFORM POST SELECTION INFERENCE FOR LAD REGRESSION MODELS
A. BELLONI, V. CHERNOZHUKOV, AND K. KATO
Abstract. We develop uniformly valid con๏ฌdence regions for a regression coe๏ฌƒcient in a high-dimensional
sparse LAD (least absolute deviation or median) regression model. The setting is one where the number
of regressors ๐‘ could be large in comparison to the sample size ๐‘›, but only ๐‘  โ‰ช ๐‘› of them are needed
to accurately describe the regression function. Our new methods are based on the instrumental LAD
regression estimator that assembles the optimal estimating equation from either post โ„“1 -penalized LAD
regression or โ„“1 -penalized LAD regression. The estimating equation is immunized against non-regular
estimation of the nuisance part of the regression function, in the sense of Neyman. We establish that in a
homoscedastic regression model, under certain conditions, the instrumental LAD regression estimator
of the regression coe๏ฌƒcient is asymptotically root-๐‘› normal uniformly with respect to the underlying
sparse model. The resulting con๏ฌdence regions are valid uniformly with respect to the underlying model.
The new inference methods outperform the naive, โ€œoracle basedโ€ inference methods, which are known
to be not uniformly valid – with the coverage property failing to hold uniformly with respect to the underlying
model โ€“ even in the setting with ๐‘ = 2. We also provide Monte-Carlo experiments which demonstrate
that standard post-selection inference breaks down over large parts of the parameter space, and the
proposed method does not.
Key words: median regression, uniformly valid inference, instruments, Neymanization, optimality,
sparsity, post selection inference
1. Introduction
We consider the following regression model
๐‘ฆ๐‘– = ๐‘‘๐‘– ๐›ผ0 + ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 + ๐œ–๐‘– , ๐‘– = 1, . . . , ๐‘›,
(1.1)
where ๐‘‘๐‘– is the โ€œmain regressorโ€ of interest, whose coe๏ฌƒcient ๐›ผ0 we would like to estimate and perform
(robust) inference on. The (๐‘ฅ๐‘– )๐‘›๐‘–=1 are other high-dimensional regressors or โ€œcontrolsโ€ and are treated as
๏ฌxed (๐‘‘๐‘– โ€™s are random). The regression error ๐œ–๐‘– is independent of ๐‘‘๐‘– and has median 0. The errors (๐œ–๐‘– )๐‘›๐‘–=1
are i.i.d. with distribution function ๐น (โ‹…) and probability density function ๐‘“๐œ– (โ‹…) such that ๐น (0) = 1/2
and ๐‘“๐œ– = ๐‘“๐œ– (0) > 0. The assumption on the error term motivates the use of the least absolute deviation
(LAD) or median regression, suitably adjusted for use in high-dimensional settings.
Date: First version: May 2012, this version April 30, 2013. We would like to thank the participants of Luminy conference
on Nonparametric and high-dimensional statistics (December 2012), Oberwolfach workshop on Frontiers in Quantile Regression (November 2012), 8th World Congress in Probability and Statistics (August 2012), and seminar at the University
of Michigan (October 2012). We are grateful to Sara van de Geer, Xuming He, Richard Nickl, Roger Koenker, Vladimir
Koltchinskii, Steve Portnoy, Philippe Rigollet, and Bin Yu for useful comments and discussions.
The dimension ๐‘ of โ€œcontrolsโ€ ๐‘ฅ๐‘– is large, potentially much larger than ๐‘›, which creates a challenge for
inference on ๐›ผ0 . Although the unknown true parameter ๐›ฝ0 lies in this large space, the key assumption
that will make estimation possible is its sparsity, namely ๐‘‡ = support(๐›ฝ0 ) has ๐‘  < ๐‘› elements (where ๐‘ 
can depend on ๐‘›; we shall use array asymptotics). This in turn motivates the use of regularization or
model selection methods.
A standard (non-robust) approach towards inference in this setting would be ๏ฌrst to perform model
selection via the โ„“1 -penalized LAD regression estimator
$(\hat\alpha,\hat\beta) \in \arg\min_{\alpha,\beta}\ \mathbb{E}_n[\,|y_i - d_i\alpha - x_i'\beta|\,] + \frac{\lambda}{n}\|(\alpha,\beta')'\|_1, \qquad (1.2)$
and then to use the post-model selection estimator
$(\tilde\alpha,\tilde\beta) \in \arg\min_{\alpha,\beta}\big\{\,\mathbb{E}_n[\,|y_i - d_i\alpha - x_i'\beta|\,] : \beta_j = 0 \text{ if } \hat\beta_j = 0\,\big\} \qquad (1.3)$
to perform โ€œusualโ€ inference for ๐›ผ0 . (The notation ๐”ผ๐‘› [โ‹…] denotes the average over index 1 โฉฝ ๐‘– โฉฝ ๐‘›.)
This standard approach is justi๏ฌed if (1.2) achieves perfect model selection with probability approaching 1, so that the estimator (1.3) has the โ€œoracleโ€ property with probability approaching 1. However
conditions for โ€œperfect selectionโ€ are very restrictive in this model, in particular, requiring signi๏ฌcant
separation of non-zero coefficients away from zero. If these conditions do not hold, the estimator α̃ does not converge to α0 at the √n-rate – uniformly with respect to the underlying model – which implies that
โ€œusualโ€ inference breaks down and is not valid. (The statements continue to apply if ๐›ผ is not penalized
in (1.2), ๐›ผ is restricted in (1.3), or if thresholding is applied.) We shall demonstrate the breakdown of
such naive inference in the Monte-Carlo experiments where non-zero coe๏ฌƒcients in ๐œƒ0 are not signi๏ฌcantly
separated from zero.
Note that the breakdown of inference does not mean that the aforementioned procedures are not
suitable for prediction purposes. Indeed, the โ„“1 -LAD estimator (1.2) and post โ„“1 -LAD estimator (1.3)
attain (essentially) optimal rates $\sqrt{(s\log p)/n}$ of convergence for estimating the entire median regression function, as has been shown in [24, 3, 13, 26] and in [3]. This property means that while these procedures
will not deliver perfect model recovery, they will only make โ€œmoderateโ€ model selection mistakes (omitting
only controls with coe๏ฌƒcients local to zero).
To achieve uniformly valid inferential performance we propose a procedure whose performance does
not require perfect model selection and allows potential โ€œmoderateโ€ model selection mistakes. The latter
feature is critical in achieving uniformity over a large class of data generating processes, similarly to
the results for instrumental regression and mean regression studied in [27], [2], [7], [6]. This allows us to
overcome the impact of (moderate) model selection mistakes on inference, avoiding (in part) the criticisms
in [17], who prove that the "oracle property" sometimes achieved by the naive estimators necessarily implies
the failure of uniform validity of inference and their semiparametric ine๏ฌƒciency [18].
In order to achieve robustness with respect to moderate model selection mistakes, it will be necessary
to achieve the proper orthogonality condition between the main regressors and the control variables.
UNIFORM POST SELECTION INFERENCE FOR LAD
3
Towards that goal the following auxiliary equation plays a key role (in the homoscedastic case):
๐‘‘๐‘– = ๐‘ฅโ€ฒ๐‘– ๐œƒ0 + ๐‘ฃ๐‘– , E[๐‘ฃ๐‘– ] = 0, ๐‘– = 1, . . . , ๐‘›;
(1.4)
describing the relevant dependence of the regressor of interest ๐‘‘๐‘– to the other controls ๐‘ฅ๐‘– . We shall assume
the sparsity of ๐œƒ0 , namely ๐‘‡๐‘‘ = support(๐œƒ0 ) has at most ๐‘  < ๐‘› elements, and estimate the relation (1.4)
via Lasso or post-Lasso methods described below.
Given ๐‘ฃ๐‘– , which โ€œpartials outโ€ the e๏ฌ€ect of ๐‘ฅ๐‘– from ๐‘‘๐‘– , we shall use it as an instrument in the following
estimating equations for ๐›ผ0 :
E[๐œ‘(๐‘ฆ๐‘– โˆ’ ๐‘‘๐‘– ๐›ผ0 โˆ’ ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 )๐‘ฃ๐‘– ] = 0, ๐‘– = 1, ..., ๐‘›,
where ๐œ‘(๐‘ก) = 1/2 โˆ’ 1(๐‘ก < 1/2). We shall use the empirical analog of this equation to form an instrumental
LAD regression estimator of ๐›ผ0 , using a plug-in estimator for ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 . The estimating equation above has
the following feature:
$\frac{\partial}{\partial\beta}\,\mathrm{E}[\varphi(y_i - d_i\alpha_0 - x_i'\beta)v_i]\Big|_{\beta=\beta_0} = 0, \qquad i = 1,\dots,n. \qquad (1.5)$
As a result, the estimator of ๐›ผ0 will be โ€œimmunizedโ€ against โ€œcrudeโ€ estimation of ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 , for example, via a
post-selection procedure or some regularization procedure. As we explain in Section 5, such immunization
ideas can be traced back to Neyman ([19, 20]).
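For a quick check of (1.5) in the homoscedastic model, a sketch under the maintained assumptions that ε_i is independent of d_i (hence of v_i) and E[v_i] = 0, with F and f_ε as in Section 1, note that
$\mathrm{E}[\varphi(y_i - d_i\alpha_0 - x_i'\beta)v_i] = \mathrm{E}\big[\big(1/2 - 1\{\epsilon_i \leqslant x_i'(\beta-\beta_0)\}\big)v_i\big] = \big(1/2 - F(x_i'(\beta-\beta_0))\big)\,\mathrm{E}[v_i] = 0,$
so the map β ↦ E[φ(y_i − d_iα0 − x_i'β)v_i] is identically zero in a neighborhood of β0; evaluating the derivative directly gives $-f_\epsilon(0)\,\mathrm{E}[v_i]\,x_i = 0$ as well.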
Our estimation procedure has the following three steps.
Step 1: Estimation of the confounding function ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 in (1.1).
Step 2: Estimation of the instruments (residuals) ๐‘ฃ๐‘– in (1.4).
Step 3: Estimation of the main e๏ฌ€ect ๐›ผ0 based on the instrumental
LAD regression using ๐‘ฃ๐‘– as instruments for ๐‘‘๐‘– .
Each step is computationally tractable, involving solutions of convex problems and a one-dimensional
search, and relies on a di๏ฌ€erent identi๏ฌcation condition which in turn requires a di๏ฌ€erent estimation
procedure:
Step 1 constructs an estimate of the nuisance function x_i'β0 and not an estimate of α0. Here we do not need √n-rate consistency for the estimate of the nuisance function; a slower rate like o(n^{-1/4})
will su๏ฌƒce. Thus, this can be based either on the โ„“1 -LAD regression estimator (1.2) or the associated
post-model selection estimator (1.3).
Step 2 partials out the impact of the covariates ๐‘ฅ๐‘– on the main regressor ๐‘‘๐‘– , obtaining the estimate
of the residuals ๐‘ฃ๐‘– in the decomposition (1.4). In order to estimate these residuals we rely either on
heteroscedastic Lasso [2], a version of the Lasso estimator of [23, 9]:
$\hat\theta \in \arg\min_{\theta}\ \mathbb{E}_n[(d_i - x_i'\theta)^2] + \frac{\lambda}{n}\|\hat\Gamma\theta\|_1 \quad \text{and set} \quad \hat v_i = d_i - x_i'\hat\theta, \quad i = 1,\dots,n, \qquad (1.6)$
where ๐œ† and ฮ“ฬ‚ are the penalty level and data-driven penalty loadings described in [2] (restated in Appendix
D), or the associated post-model selection estimator (Post-Lasso) [4, 2] de๏ฌned as
$\tilde\theta \in \arg\min_{\theta}\big\{\,\mathbb{E}_n[(d_i - x_i'\theta)^2] : \theta_j = 0 \text{ if } \hat\theta_j = 0\,\big\} \quad \text{and set} \quad \hat v_i = d_i - x_i'\tilde\theta. \qquad (1.7)$
Step 3 constructs an estimator α̌ of the coefficient α0 via the instrumental LAD regression proposed in [10], using (v̂_i)_{i=1}^n as instruments. Formally, α̌ is defined as
$\check\alpha \in \arg\inf_{\alpha\in\mathcal{A}} L_n(\alpha), \quad \text{where}\quad L_n(\alpha) = \frac{4\,\big|\mathbb{E}_n[\varphi(y_i - x_i'\hat\beta - d_i\alpha)\hat v_i]\big|^2}{\mathbb{E}_n[\hat v_i^2]}, \qquad (1.8)$
φ(t) = 1/2 − 1{t ⩽ 0} and 𝒜 is a parameter space for α0. We will analyze the choice of 𝒜 = [α̂ − C log^{-1} n, α̂ + C log^{-1} n] with a suitable constant C > 0.¹ Several other choices for 𝒜 are possible.
Our main result establishes conditions under which α̌ is root-n consistent for α0, asymptotically normal, and achieves the semi-parametric efficiency bound for estimating α0 in the current homoscedastic setting, provided that (s³ log³ p)/n → 0 and other regularity conditions hold. Specifically, we show that, despite possible model selection mistakes in Steps 1 and 2, the estimator α̌ obeys
$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1), \qquad (1.9)$
where σ_n² := 1/(4 f_ε² Ē[v_i²]) with f_ε = f_ε(0). An alternative (and more robust) expression for σ_n² is given by Huber's sandwich formula:
$\sigma_n^2 = J^{-1}\Omega J^{-1}, \quad \text{where } \Omega := \bar{\mathrm{E}}[v_i^2]/4 \text{ and } J := \bar{\mathrm{E}}[f_\epsilon d_i v_i]. \qquad (1.10)$
We recommend estimating Ω by the plug-in method and estimating J by Powell's method [21]. Furthermore, we show that the criterion function at the true value α0 in Step 3 has the following pivotal behavior:
๐‘›๐ฟ๐‘› (๐›ผ0 ) โ‡ ๐œ’2 (1).
(1.11)
ห†๐‘›,๐œ‰ with asymptotic coverage 1 โˆ’ ๐œ‰ based on the
This allows the construction of a con๏ฌdence region ๐ด
statistic ๐ฟ๐‘› ,
ห†๐‘›,๐œ‰ ) โ†’ 1 โˆ’ ๐œ‰ where ๐ด
ห†๐‘›,๐œ‰ = {๐›ผ โˆˆ ๐’œ : ๐‘›๐ฟ๐‘› (๐›ผ) โฉฝ (1 โˆ’ ๐œ‰)-quantile of ๐œ’2 (1)}.
P(๐›ผ0 โˆˆ ๐ด
(1.12)
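To make the Wald-type inference based on (1.9) and (1.10) operational, Ω and J must be estimated. One standard implementation (a sketch in our notation, with a user-chosen bandwidth h_n → 0, rather than the authors' exact formulas) combines the plug-in estimate of Ω with a Powell-type kernel estimate of J:
$\hat\Omega = \mathbb{E}_n[\hat v_i^2]/4, \qquad \hat J = \frac{1}{2h_n}\,\mathbb{E}_n\big[1\{|y_i - d_i\check\alpha - x_i'\hat\beta| \leqslant h_n\}\,d_i\hat v_i\big], \qquad \hat\sigma_n^2 = \hat J^{-1}\hat\Omega\hat J^{-1}.$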
Importantly, the robustness with respect to moderate model selection mistakes, which occurs because of
(1.5), allows the results (1.9) and (1.11) to hold uniformly over a large range of data generating processes,
similarly to the results for instrumental regression and partially linear mean regression model established
in [6, 27, 2]. One of our proposed algorithms explicitly uses โ„“1 -regularization methods, similarly to [27]
and [2], while the main algorithm we propose uses post-selection methods, similarly to [6, 2].
Throughout the paper, we use array asymptotics โ€“ asymptotics where the model changes with ๐‘› โ€“ to
better capture some ๏ฌnite-sample phenomena such as โ€œsmall coe๏ฌƒcientsโ€ that are local to zero. This
ensures the robustness of conclusions with respect to perturbations of the data-generating process along
various model sequences. This robustness, in turn, translates into uniform validity of con๏ฌdence regions
over substantial regions of data-generating processes.
¹For numerical experiments we used $C = 10(\mathbb{E}_n[d_i^2])^{-1/2}$, and typically we normalize $\mathbb{E}_n[d_i^2] = 1$.
1.1. Notation and convention. Denote by (Ω, ℱ, P) the underlying probability space. The notation $\mathbb{E}_n[\cdot]$ denotes the average over index 1 ⩽ i ⩽ n, i.e., it simply abbreviates the notation $n^{-1}\sum_{i=1}^n[\cdot]$. For example, $\mathbb{E}_n[x_{ij}^2] = n^{-1}\sum_{i=1}^n x_{ij}^2$. Moreover, we use the notation $\bar{\mathrm{E}}[\cdot] = \mathbb{E}_n[\mathrm{E}[\cdot]]$. For example, $\bar{\mathrm{E}}[v_i^2] = n^{-1}\sum_{i=1}^n \mathrm{E}[v_i^2]$. For a function $f : \mathbb{R}\times\mathbb{R}\times\mathbb{R}^p \to \mathbb{R}$, we write $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\big(f(y_i,d_i,x_i) - \mathrm{E}[f(y_i,d_i,x_i)]\big)$. The $\ell_2$-norm is denoted by $\|\cdot\|$, and the $\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero components of a vector. Denote by $\|\cdot\|_\infty$ the maximal absolute element of a vector. For a sequence $(z_i)_{i=1}^n$ of constants, we write $\|z_i\|_{2,n} = \sqrt{\mathbb{E}_n[z_i^2]}$. For example, for a vector $\delta\in\mathbb{R}^p$, $\|x_i'\delta\|_{2,n} = \sqrt{\mathbb{E}_n[(x_i'\delta)^2]}$ denotes the prediction norm of δ. Given a vector $\delta\in\mathbb{R}^p$ and a set of indices $T\subset\{1,\dots,p\}$, we denote by $\delta_T\in\mathbb{R}^p$ the vector such that $(\delta_T)_j = \delta_j$ if $j\in T$ and $(\delta_T)_j = 0$ if $j\notin T$. Also we write the support of δ as $\mathrm{support}(\delta) = \{j\in\{1,\dots,p\} : \delta_j \neq 0\}$. We use the notation $(a)_+ = \max\{a,0\}$, $a\vee b = \max\{a,b\}$, and $a\wedge b = \min\{a,b\}$. We also use the notation $a \lesssim b$ to denote $a \leqslant cb$ for some constant $c > 0$ that does not depend on n, and $a \lesssim_P b$ to denote $a = O_P(b)$. The arrow ⇝ denotes convergence in distribution.
We assume that the quantities such as ๐‘ (the dimension of ๐‘ฅ๐‘– ), ๐‘  (a bound on the numbers of non-zero
elements of ๐›ฝ0 and ๐œƒ0 ), and hence ๐‘ฆ๐‘– , ๐‘ฅ๐‘– , ๐›ฝ0 , ๐œƒ0 , ๐‘‡ and ๐‘‡๐‘‘ are all dependent on the sample size ๐‘›, and allow
for the case where ๐‘ = ๐‘๐‘› โ†’ โˆž and ๐‘  = ๐‘ ๐‘› โ†’ โˆž as ๐‘› โ†’ โˆž. However, for the notational convenience,
we shall omit the dependence of these quantities on ๐‘›.
2. The Methods, Conditions, and Results
2.1. The methods. Each of the steps outlined before uses a di๏ฌ€erent identi๏ฌcation condition. Several
combinations are possible to implement each step, two of which are the following.
Algorithm 1 (Based on Post-Model Selection estimators).
(1) Run Post-ℓ1-penalized LAD (1.3) of y_i on d_i and x_i; keep the fitted value x_i'β̃.
(2) Run Post-Lasso (1.7) of d_i on x_i; keep the residual v̂_i := d_i − x_i'θ̃.
(3) Run Instrumental LAD regression (1.8) of y_i − x_i'β̃ on d_i using v̂_i as the instrument for d_i to compute the estimator α̌. Report α̌ and/or perform inference based upon (1.9) or (1.12).
Algorithm 2 (Based on Regularized Estimators).
(1) Run ℓ1-penalized LAD (1.2) of y_i on d_i and x_i; keep the fitted value x_i'β̂.
(2) Run Lasso (1.6) of d_i on x_i; keep the residual v̂_i := d_i − x_i'θ̂.
(3) Run Instrumental LAD regression (1.8) of y_i − x_i'β̂ on d_i using v̂_i as the instrument for d_i to compute the estimator α̌. Report α̌ and/or perform inference based upon (1.9) or (1.12).
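The following minimal numerical sketch of Algorithm 1 is an illustration only, not the authors' implementation: it uses scikit-learn's QuantileRegressor for the ℓ1-penalized LAD and Post-ℓ1-LAD steps, Lasso with an OLS refit for Step 2, and a one-dimensional grid search for Step 3. The penalty levels lambda_lad and lambda_lasso and the search grid are placeholder choices rather than the data-driven choices described in Appendix D, and an intercept is handled by the fitted models rather than carried inside x.

    # Illustrative sketch of Algorithm 1 (placeholder tuning constants, not the paper's choices).
    import numpy as np
    from sklearn.linear_model import QuantileRegressor, Lasso, LinearRegression

    def algorithm1(y, d, x, lambda_lad=0.05, lambda_lasso=0.05, grid_width=0.5, grid_size=401):
        # Step 1: l1-penalized LAD of y on (d, x), then refit LAD on the selected controls.
        w = np.column_stack([d, x])
        l1_lad = QuantileRegressor(quantile=0.5, alpha=lambda_lad, solver="highs").fit(w, y)
        selected = np.flatnonzero(np.abs(l1_lad.coef_[1:]) > 1e-8)       # selected controls
        post_lad = QuantileRegressor(quantile=0.5, alpha=0.0, solver="highs").fit(
            np.column_stack([d, x[:, selected]]), y)
        alpha_tilde = post_lad.coef_[0]
        xb_tilde = post_lad.intercept_ + x[:, selected] @ post_lad.coef_[1:]   # fitted x_i' beta~

        # Step 2: Lasso of d on x, then Post-Lasso (OLS refit); keep residuals v-hat.
        lasso = Lasso(alpha=lambda_lasso).fit(x, d)
        sel_d = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
        if sel_d.size > 0:
            v_hat = d - LinearRegression().fit(x[:, sel_d], d).predict(x[:, sel_d])
        else:
            v_hat = d - d.mean()

        # Step 3: instrumental LAD step, minimizing L_n(alpha) of (1.8) over a grid A.
        def L_n(a):
            phi = 0.5 - (y - xb_tilde - d * a <= 0)                      # phi(t) = 1/2 - 1{t <= 0}
            return 4.0 * np.mean(phi * v_hat) ** 2 / np.mean(v_hat ** 2)

        grid = alpha_tilde + np.linspace(-grid_width, grid_width, grid_size)
        alpha_check = grid[np.argmin([L_n(a) for a in grid])]
        return alpha_check, v_hat, L_n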
Comment 2.1 (Penalty Levels). In order to perform โ„“1 -LAD and Lasso, one has to suitably choose the
penalty levels. In the Supplementary Appendix D we provide implementation details including penalty
choices for each step of the algorithm, and in all what follows we shall obey the penalty choices described
in Appendix D.
Comment 2.2 (Di๏ฌ€erences). Algorithm 1 relies on Post-โ„“1-LAD and Post-Lasso while Algorithm 2 relies
on ℓ1-LAD and Lasso. Since Algorithm 1 refits the non-zero coefficients without the penalty term, it has a smaller bias. However, it does rely on ℓ1-LAD and Lasso producing sparse solutions, which in turn
typically relies on restricted isometry conditions [3, 2]. Algorithm 2 relies on penalized estimators. Step
3 of both algorithms relies on instrumental LAD regression with estimated data.
Comment 2.3 (Alternative Implementations). As discussed before, the three step approach proposed
here can be implemented with several di๏ฌ€erent methods each with speci๏ฌc features. For instance, Dantzig
selector, square-root Lasso, or the associated post-model selection estimators could be used instead of Lasso or Post-Lasso. Moreover, the instrumental LAD regression can be substituted by a 1-step estimator from the ℓ1-LAD estimator α̂, of the form $\check\alpha = \hat\alpha + (\mathbb{E}_n[f_\epsilon\hat v_i^2])^{-1}\mathbb{E}_n[\varphi(y_i - d_i\hat\alpha - x_i'\hat\beta)\hat v_i]$, or by a LAD regression with all the covariates selected in Steps 1 and 2.
2.2. Regularity Conditions. Here we provide regularity conditions that are su๏ฌƒcient for validity of
the main estimation and inference results. We begin by stating our main condition, which contains the
previously de๏ฌned approximate sparsity as well as other more technical assumptions. Throughout the
paper, let ๐‘ and ๐ถ be positive constants independent of ๐‘›, and let โ„“๐‘› โ†— โˆž, ๐›ฟ๐‘› โ†˜ 0, and ฮ”๐‘› โ†˜ 0 be
sequences of positive constants. Let ๐พ๐‘ฅ := max1โฉฝ๐‘–โฉฝ๐‘› โˆฅ๐‘ฅ๐‘– โˆฅโˆž .
Condition I. (i) (๐œ–๐‘– )๐‘›๐‘–=1 is a sequence of i.i.d. random variables with common distribution function ๐น such that ๐น (0) = 1/2, (๐‘ฃ๐‘– )๐‘›๐‘–=1 is a sequence of independent mean-zero random variables independent of (๐œ–๐‘– )๐‘›๐‘–=1 , and (๐‘ฅ๐‘– )๐‘›๐‘–=1 is a sequence of non-stochastic vectors in โ„๐‘ of covariates normalized in such a way that ๐”ผ๐‘› [๐‘ฅ2๐‘–๐‘— ] = 1 for all 1 โฉฝ ๐‘— โฉฝ ๐‘. The sequence {(๐‘ฆ๐‘– , ๐‘‘๐‘– )โ€ฒ }๐‘›๐‘–=1 of random vectors are generated according to models (1.1) and (1.4). (ii) ๐‘ โฉฝ E[๐‘ฃ๐‘–2 ] โฉฝ ๐ถ for all 1 โฉฝ ๐‘– โฉฝ ๐‘›, and
Eฬ„[๐‘‘4๐‘– ]+ Eฬ„[๐‘ฃ๐‘–4 ]+max1โฉฝ๐‘—โฉฝ๐‘ (Eฬ„[๐‘ฅ2๐‘–๐‘— ๐‘‘2๐‘– ]+ Eฬ„[โˆฃ๐‘ฅ๐‘–๐‘— ๐‘ฃ๐‘– โˆฃ3 ]) โฉฝ ๐ถ. (iii) There exists ๐‘  = ๐‘ ๐‘› โฉพ 1 such that โˆฅ๐›ฝ0 โˆฅ0 โฉฝ ๐‘  and
โˆฅ๐œƒ0 โˆฅ0 โฉฝ ๐‘ . (iv) The error distribution ๐น is absolutely continuous with continuously di๏ฌ€erentiable density
๐‘“๐œ– (โ‹…) such that ๐‘“๐œ– (0) โฉพ ๐‘ > 0 and ๐‘“๐œ– (๐‘ก)โˆจโˆฃ๐‘“๐œ–โ€ฒ (๐‘ก)โˆฃ โฉฝ ๐ถ for all ๐‘ก โˆˆ โ„, and (v) (๐พ๐‘ฅ4 +๐พ๐‘ฅ2 ๐‘ 2 +๐‘ 3 ) log3 (๐‘โˆจ๐‘›) โฉฝ ๐‘›๐›ฟ๐‘› .
Comment 2.4. Condition I(i) imposes the setting discussed in the previous section with the zero conditional median of the error distribution. Condition I(ii) imposes moment conditions on the structural
errors and regressors to ensure good model selection performance of Lasso applied to equation (1.4). The
approximate sparsity I(iii) imposes sparsity of the high-dimensional vectors ๐›ฝ0 and ๐œƒ0 . In the theorems
below we provide the required technical conditions on the growth of ๐‘  log ๐‘ since it is dependent on the
choice of algorithm. Condition I(iv) is a set of standard assumptions in the LAD literature (see [14]) and
in the instrumental quantile regression literature [10]. Condition I(v) restricts the sparsity index, so that
๐‘ 3 log3 (๐‘ โˆจ ๐‘›) = ๐‘œ(๐‘›) is required; this is analogous to the standard assumption ๐‘ 3 (log ๐‘›)2 = ๐‘œ(๐‘›) (see [11])
invoked in the LAD analysis without any selection (i.e., where p = s). Most importantly, no assumptions
on the separation from zero of the non-zero coe๏ฌƒcients of ๐œƒ0 and ๐›ฝ0 are made.
The next condition concerns the behavior of the Gram matrix $\mathbb{E}_n[\tilde x_i\tilde x_i']$ where $\tilde x_i = (d_i, x_i')'$. Whenever $p+1 > n$, the empirical Gram matrix $\mathbb{E}_n[\tilde x_i\tilde x_i']$ does not have full rank and in principle is not well-behaved. However, we only need good behavior of smaller submatrices. Define the minimal and maximal m-sparse eigenvalues of $\mathbb{E}_n[\tilde x_i\tilde x_i']$ as
$\phi_{\min}(m) := \min_{1\leqslant\|\delta\|_0\leqslant m}\frac{\delta'\mathbb{E}_n[\tilde x_i\tilde x_i']\delta}{\|\delta\|^2} \quad\text{and}\quad \phi_{\max}(m) := \max_{1\leqslant\|\delta\|_0\leqslant m}\frac{\delta'\mathbb{E}_n[\tilde x_i\tilde x_i']\delta}{\|\delta\|^2}. \qquad (2.13)$
To assume that φ_min(m) > 0 requires that all empirical Gram submatrices formed by any m components of x̃_i are positive definite. We shall employ the following condition as a sufficient condition for our results.
Condition SE. There exists a sequence of constants ℓ_n → ∞ such that the minimal and maximal ℓ_n s-sparse eigenvalues are bounded away from zero and from above, namely with probability at least 1 − Δ_n,
๐œ…โ€ฒ โฉฝ ๐œ™min (โ„“๐‘› ๐‘ ) โฉฝ ๐œ™max (โ„“๐‘› ๐‘ ) โฉฝ ๐œ…โ€ฒโ€ฒ ,
where 0 < ๐œ…โ€ฒ < ๐œ…โ€ฒโ€ฒ < โˆž are constants independent of ๐‘›.
Comment 2.5. Condition SE is quite plausible for many designs of interest. Essentially it can be
established by combining tail conditions of the regressors and a growth restriction on ๐‘  and ๐‘ relative to ๐‘›.
For instance, Theorem 3.2 in [22] (see also [28] and [1]) shows that Condition SE holds for i.i.d. zero-mean
sub-Gaussian regressors and ๐‘  log2 (๐‘› โˆจ ๐‘) โฉฝ ๐›ฟ๐‘› ๐‘›; while Theorem 1.8 [22] (see also Lemma 1 in [4]) shows
that Condition SE holds for i.i.d. uniformly bounded zero-mean regressors and ๐‘ (log3 ๐‘›) log(๐‘ โˆจ ๐‘›) โฉฝ ๐›ฟ๐‘› ๐‘›.
2.3. Results. We begin by considering Algorithm 1.
Theorem 1 (Robust Inference, Algorithm 1). Let α̌ be obtained by Algorithm 1. Suppose that Conditions I and SE are satisfied for all n ⩾ 1. Moreover, suppose that with probability at least 1 − Δ_n, ‖β̃‖_0 ⩽ Cs. Then, as n → ∞ and for σ_n² = 1/(4 f_ε² Ē[v_i²]),
$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1) \quad \text{and} \quad nL_n(\alpha_0) \rightsquigarrow \chi^2(1).$
Theorem 1 establishes the first main result of the paper. Theorem 1 relies on the post-model selection estimators, which in turn hinge on achieving sufficiently sparse estimates β̂ and θ̂. Sparsity of the former can be directly achieved under sharp penalty choices for optimal rates as discussed in the Supplementary Appendix D.2. The sparsity of the latter potentially requires a heavier penalty, as shown in [3]. Alternatively, sparsity for the estimator in Step 1 can also be achieved by truncating the smallest components of the estimate β̂.²
Next we turn to the analysis of Algorithm 2 which relies on the regularized estimators instead of
the post-model selection estimators. Theorem 2 below establishes that Algorithm 2 achieves the same
inferential guarantees as the results in Theorem 1 for Algorithm 1.
Theorem 2 (Robust Inference, Algorithm 2). Let α̌ be obtained by Algorithm 2. Suppose that Conditions I and SE are satisfied for all n ⩾ 1. Moreover, suppose that with probability at least 1 − Δ_n, ‖β̂‖_0 ⩽ Cs. Then, as n → ∞ and for σ_n² = 1/(4 f_ε² Ē[v_i²]),
$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1) \quad \text{and} \quad nL_n(\alpha_0) \rightsquigarrow \chi^2(1).$
Theorem 2 establishes the second main result of the paper.
An important consequence of these results is the following corollary. Here ๐’ฌ๐‘› denotes a collection of
distributions for {(๐‘ฆ๐‘– , ๐‘‘๐‘– )โ€ฒ }๐‘›๐‘–=1 and for ๐‘„๐‘› โˆˆ ๐’ฌ๐‘› the notation P๐‘„๐‘› means that under P๐‘„๐‘› , {(๐‘ฆ๐‘– , ๐‘‘๐‘– )โ€ฒ }๐‘›๐‘–=1
is distributed according to ๐‘„๐‘› .
²Lemma 3 in Appendix C formally shows that a suitable truncation preserves the rate of convergence under our
conditions.
Corollary 1 (Uniformly Valid Confidence Intervals). Let α̌ be the estimator of α0 constructed according to Algorithm 1 (resp. Algorithm 2) and let 𝒬_n be the collection of all distributions of {(y_i, d_i)'}_{i=1}^n for which the conditions of Theorem 1 (resp. Theorem 2) are satisfied for given n ⩾ 1. Then as n → ∞, uniformly in Q_n ∈ 𝒬_n,
$\mathrm{P}_{Q_n}\big(\alpha_0 \in [\check\alpha \pm \sigma_n z_{\xi/2}/\sqrt{n}\,]\big) \to 1-\xi \quad\text{and}\quad \mathrm{P}_{Q_n}(\alpha_0 \in \hat A_{n,\xi}) \to 1-\xi,$
where $z_{\xi/2} = \Phi^{-1}(1-\xi/2)$ and $\hat A_{n,\xi} = \{\alpha\in\mathcal{A} : nL_n(\alpha) \leqslant (1-\xi)\text{-quantile of }\chi^2(1)\}$.
Corollary 1 establishes the third main result of the paper; it highlights the uniformity nature of the
results. As long as the overall sparsity requirements hold, imperfect model selection in Steps 1 and
2 do not compromise the results. The robustness of the approach is also apparent from the fact that
Corollary 1 allows for the data-generating process to change with ๐‘›. This result is new even under the
traditional case of ๏ฌxed-๐‘ asymptotics. Condition I and SE together with the appropriate side conditions
in the theorems explicitly characterize regions of data-generating processes for which the uniformity
result holds. Simulation results discussed next also provide additional evidence that these regions are substantial.
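As an illustration of how the two regions of Corollary 1 can be computed in practice, here is a short sketch building on the Algorithm 1 code above; sigma_hat (e.g. a plug-in based on (1.10)), the grid standing in for 𝒜, and the 5% level are user-supplied inputs, not quantities fixed by the paper.

    # Illustrative construction of the Wald interval and the L_n-based region of Corollary 1.
    import numpy as np
    from scipy.stats import norm, chi2

    def confidence_regions(alpha_check, sigma_hat, n, L_n, grid, xi=0.05):
        half_width = norm.ppf(1 - xi / 2) * sigma_hat / np.sqrt(n)
        wald_interval = (alpha_check - half_width, alpha_check + half_width)
        cutoff = chi2.ppf(1 - xi, df=1)
        score_region = grid[np.array([n * L_n(a) for a in grid]) <= cutoff]   # set of retained alphas
        return wald_interval, score_region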
3. Monte-Carlo Experiments
In this section we examine the ๏ฌnite sample performance of the proposed estimators. We focus on the
estimator associated with Algorithm 1 based on post-model selection methods.
We considered the following regression model:
๐‘ฆ = ๐‘‘๐›ผ0 + ๐‘ฅโ€ฒ (๐‘๐‘ฆ ๐œƒ0 ) + ๐œ–,
๐‘‘ = ๐‘ฅโ€ฒ (๐‘๐‘‘ ๐œƒ0 ) + ๐‘ฃ,
(3.14)
where ๐›ผ0 = 1/2, ๐œƒ0๐‘— = 1/๐‘— 2 , ๐‘— = 1, . . . , 10, and ๐œƒ0๐‘— = 0 otherwise, ๐‘ฅ = (1, ๐‘ง โ€ฒ )โ€ฒ consists of an intercept and
covariates ๐‘ง โˆผ ๐‘ (0, ฮฃ), and the errors ๐œ– and ๐‘ฃ are independently and identically distributed as ๐‘ (0, 1).
The dimension ๐‘ of the covariates ๐‘ฅ is 300, and the sample size ๐‘› is 250. The regressors are correlated
with ฮฃ๐‘–๐‘— = ๐œŒโˆฃ๐‘–โˆ’๐‘—โˆฃ and ๐œŒ = 0.5. The coe๏ฌƒcients ๐‘๐‘ฆ and ๐‘๐‘‘ are used to control the ๐‘…2 of the reduce
form equation. For each equation, we consider the following values for the ๐‘…2 : {0, 0.1, 0.2, . . . , 0.8, 0.9}.
Therefore we have 100 di๏ฌ€erent designs and results are based on 500 repetitions for each design. For each
repetition we draw new vectors ๐‘ฅ๐‘– โ€™s and errors ๐œ–๐‘– โ€™s and ๐‘ฃ๐‘– โ€™s.
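For completeness, a small sketch that draws one repetition from design (3.14) follows (illustrative code; here c_y and c_d are passed directly, whereas in the study they are calibrated to hit the target R² values).

    # One draw from the Monte-Carlo design (3.14).
    import numpy as np

    def draw_design(n=250, p=300, rho=0.5, alpha0=0.5, c_y=1.0, c_d=1.0, seed=0):
        rng = np.random.default_rng(seed)
        theta0 = np.zeros(p)
        theta0[:10] = 1.0 / (np.arange(1, 11) ** 2)          # theta_0j = 1/j^2, j <= 10
        Sigma = rho ** np.abs(np.subtract.outer(np.arange(p - 1), np.arange(p - 1)))
        z = rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)
        x = np.column_stack([np.ones(n), z])                  # x = (1, z')'
        v = rng.standard_normal(n)
        eps = rng.standard_normal(n)
        d = x @ (c_d * theta0) + v
        y = d * alpha0 + x @ (c_y * theta0) + eps
        return y, d, x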
The design above with ๐‘ฅโ€ฒ (๐‘๐‘ฆ ๐œƒ0 ) is a sparse model. However, the decay of the components of ๐œƒ0 rules
out typical โ€œseparation from zeroโ€ assumptions of the coe๏ฌƒcients of โ€œimportantโ€ covariates (since the
last component is of the order of 1/๐‘›), unless ๐‘๐‘ฆ is very large. Thus, we anticipate that โ€œstandardโ€
post-selection inference procedures โ€“ which rely on model selection of the outcome equation only โ€“ work
poorly in the simulation study. In contrast, based upon the prior theoretical arguments, we anticipate
that our instrumental LAD estimator – which works off both equations in (3.14) – will work well in the simulation study.
The simulation study focuses on Algorithm 1. Standard errors are computed using the formula (1.10).
(Algorithm 2 worked similarly, though somewhat worse due to larger biases). As the main benchmark
we consider the standard post-model selection estimator α̃ based on the post-ℓ1-penalized LAD method, as defined in (1.3).
In Figure 1, we display the (empirical) rejection probability of tests of a true hypothesis ๐›ผ = ๐›ผ0 ,
with nominal size of tests equal to 0.05. The left-top plot shows the rejection frequency of the standard
post-model selection inference procedure based upon α̃ (where the inference procedure assumes perfect
recovery of the true model). The rejection frequency deviates very sharply from the ideal rejection
frequency of 0.05. This con๏ฌrms the anticipated failure (lack of uniform validity) of inference based upon
the standard post-model selection procedure in designs where coe๏ฌƒcients are not well separated from zero
(so that perfect recovery does not happen). In sharp contrast, the right top and bottom plots show that
both of our proposed procedures (based on the estimator α̌ and the result (1.9), and on the statistic L_n and
the result (1.12)) perform well, closely tracking the ideal level of 0.05. This is achieved uniformly over all
the designs considered in the study, and this con๏ฌrms our theoretical results established in Corollary 1.
In Figure 2, we compare the performance of the standard post-selection estimator α̃ (defined in (1.3)) and our proposed post-selection estimator α̌ (obtained via Algorithm 1). We display results in three
di๏ฌ€erent metrics of performance โ€“ mean bias (top row), standard deviation (middle row), and root mean
square error (bottom row) of the two approaches. The signi๏ฌcant bias for the standard post-selection
procedure occurs when the indirect equation (1.4) is nontrivial, that is, when the main regressor is
correlated to other controls. Such bias can be positive or negative depending on the particular design.
The proposed post-selection estimator α̌ performs well in all three metrics. The root mean square errors for the proposed estimator α̌ are typically much smaller than those for the standard post-model selection estimator α̃ (as shown by the bottom plots in Figure 2). This is fully consistent with our theoretical results
and minimax e๏ฌƒciency considerations given in Section 5.
4. Generalization to Heteroscedastic Case
We emphasize that both proposed algorithms exploit the homoscedasticity of the model (1.1) with
respect to the error term ๐œ–๐‘– . The generalization to the heteroscedastic case can be achieved as follows.
In order to achieve the semiparametric e๏ฌƒciency bound we need to consider the weighted version of the
auxiliary equation (1.4). Specifically, we can rely on the following weighted decomposition:
๐‘“๐‘– ๐‘‘๐‘– = ๐‘“๐‘– ๐‘ฅโ€ฒ๐‘– ๐œƒ0โˆ— + ๐‘ฃ๐‘–โˆ— ,
E[๐‘“๐‘– ๐‘ฃ๐‘–โˆ— ] = 0,
๐‘– = 1, . . . , ๐‘›,
(4.15)
where the weights are conditional densities of error terms ๐œ–๐‘– evaluated at their medians of 0,
๐‘“๐‘– = ๐‘“๐œ–๐‘– (0โˆฃ๐‘‘๐‘– , ๐‘ฅ๐‘– ),
๐‘– = 1, . . . , ๐‘›,
(4.16)
which in general vary under heteroscedasticity. With that in mind it is straightforward to adapt the
proposed algorithms when the weights (๐‘“๐‘– )๐‘›๐‘–=1 are known. For example Algorithm 1 becomes as follows.
Algorithm 1โ€ฒ (Based on Post-Model Selection estimators).
(1) Run Post-ℓ1-penalized LAD of y_i on d_i and x_i; keep the fitted value x_i'β̃.
(2) Run Post-Lasso of f_i d_i on f_i x_i; keep the residual v̂_i* := f_i(d_i − x_i'θ̃).
(3) Run Instrumental LAD regression of y_i − x_i'β̃ on d_i using v̂_i* as the instrument for d_i to compute the estimator α̌. Report α̌ and/or perform inference.
An analogous generalization of Algorithm 2 based on regularized estimator results from removing the
word โ€œPostโ€ in the algorithm above.
Under similar regularity conditions, uniformly over a large collection ๐’ฌโˆ—๐‘› of distributions of {(๐‘ฆ๐‘– , ๐‘‘๐‘– )โ€ฒ }๐‘›๐‘–=1 ,
the estimator ๐›ผ
ห‡ above obeys
โˆš
(4Eฬ„[๐‘ฃ๐‘–โˆ—2 ])1/2 ๐‘›(ห‡
๐›ผ โˆ’ ๐›ผ0 ) โ‡ ๐‘ (0, 1).
(4.17)
Moreover, the criterion function at the true value ๐›ผ0 in Step 3 also has a pivotal behavior, namely
๐‘›๐ฟ๐‘› (๐›ผ0 ) โ‡ ๐œ’2 (1),
(4.18)
ห†๐‘›,๐œ‰ based on the ๐ฟ๐‘› -statistic as in (1.12) with
which can also be used to construct a con๏ฌdence region ๐ด
coverage 1 โˆ’ ๐œ‰ uniformly over the collection of distributions ๐’ฌโˆ—๐‘› .
In practice the density function values (๐‘“๐‘– )๐‘›๐‘–=1 are typically unknown and need to be replaced by
estimates (๐‘“ห†๐‘– )๐‘›๐‘–=1 . The analysis of the impact of such estimation is very delicate and is developed in the
companion work [8], which considers the more general problem of uniformly valid inference for quantile
regression models in approximately sparse models.
5. Discussion and Conclusion
5.1. Connection to Neymanization. In this section we make some connections to Neymanโ€™s ๐ถ(๐›ผ)
test ([19, 20]). For the sake of exposition we assume that (๐‘ฆ๐‘– , ๐‘ฅ๐‘– , ๐‘‘๐‘– )๐‘›๐‘–=1 are i.i.d. but we shall use the
heteroscedastic setup introduced in the previous section. We consider the estimating equation for ๐›ผ0 :
E[๐œ‘(๐‘ฆ๐‘– โˆ’ ๐‘‘๐‘– ๐›ผ0 โˆ’ ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 )๐‘ฃ๐‘– ] = 0.
Our problem is to ๏ฌnd useful instruments ๐‘ฃ๐‘– such that
$\frac{\partial}{\partial\beta}\,\mathrm{E}[\varphi(y_i - d_i\alpha_0 - x_i'\beta)v_i]\big|_{\beta=\beta_0} = 0.$
If this property holds, the estimator of ๐›ผ0 will be โ€œimmunizedโ€ against โ€œcrudeโ€ or nonregular estimation
of ๐›ฝ0 , for example, via a post-selection procedure or some regularization procedure. Such immunization
ideas are in fact behind Neymanโ€™s classical construction of his ๐ถ(๐›ผ) test, so we shall use the term
โ€œNeymanizationโ€ to describe such procedure. There will be many instruments ๐‘ฃ๐‘– that can achieve the
property stated above, and there will be one that is optimal.
The instruments can be constructed by taking ๐‘ฃ๐‘– := ๐‘ง๐‘– /๐‘“๐‘– , where ๐‘ง๐‘– is the residual in the regression
equation:
๐‘ค๐‘– ๐‘‘๐‘– = ๐‘ค๐‘– ๐‘š0 (๐‘ฅ๐‘– ) + ๐‘ง๐‘– , E[๐‘ค๐‘– ๐‘ง๐‘– โˆฃ๐‘ฅ๐‘– ] = 0,
(5.19)
where ๐‘ค๐‘– is a nonnegative weight, a function of (๐‘‘๐‘– , ๐‘ง๐‘– ) only, for example ๐‘ค๐‘– = 1 or ๐‘ค๐‘– = ๐‘“๐‘– โ€“ the latter
choice will in fact be optimal. Note that the function m_0(x_i) solves the least squares problem
$\min_{h\in\mathcal{H}}\ \mathrm{E}\big[\{w_i d_i - w_i h(x_i)\}^2\big], \qquad (5.20)$
where โ„‹ is the class of measurable functions โ„Ž(๐‘ฅ๐‘– ) such that E[๐‘ค๐‘–2 โ„Ž2 (๐‘ฅ๐‘– )] < โˆž. Our assumption is that
the ๐‘š0 (๐‘ฅ๐‘– ) is a sparse function ๐‘ฅโ€ฒ๐‘– ๐œƒ0 , with โˆฅ๐œƒ0 โˆฅ0 โฉฝ ๐‘  so that
๐‘ค๐‘– ๐‘‘๐‘– = ๐‘ค๐‘– ๐‘ฅโ€ฒ๐‘– ๐œƒ0 + ๐‘ง๐‘– , E[๐‘ค๐‘– ๐‘ง๐‘– โˆฃ๐‘ฅ๐‘– ] = 0.
(5.21)
In ๏ฌnite samples, the sparsity assumption allows to employ post-Lasso and Lasso to solve the least squares
problem above approximately, and estimate ๐‘ง๐‘– . Of course, the use of other structured assumptions may
motivate the use of other regularization methods.
Arguments similar to those in the proofs show that, for $\sqrt{n}(\alpha - \alpha_0) = O(1)$,
$\sqrt{n}\big\{\mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\hat\beta)v_i] - \mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\beta_0)v_i]\big\} = o_P(1),$
for β̂ based on a sparse estimation procedure, despite the fact that β̂ converges to β0 at a rate slower than 1/√n. That is, the empirical estimating equations behave as if β0 were known. Hence for estimation we can use α̂ as a minimizer of the statistic
$L_n(\alpha) = c_n^{-1}\,\big|\sqrt{n}\,\mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\hat\beta)v_i]\big|^2,$
where ๐‘๐‘› = ๐”ผ๐‘› [๐‘ฃ๐‘–2 ]/4. Since ๐ฟ๐‘› (๐›ผ0 ) โ‡ ๐œ’2 (1), we can also use the statistic directly for testing hypotheses
and for construction of con๏ฌdence sets.
This is in fact a version of Neymanโ€™s ๐ถ(๐›ผ) test statistic, adapted to the present non-smooth setting.
The usual expression of the C(α) statistic is different. To see a more familiar form, note that θ0 = E[w_i² x_i x_i']⁻ E[w_i² x_i d_i], where A⁻ denotes a generalized inverse of A, so that v_i = (w_i/f_i)d_i − (w_i/f_i)x_i' E[w_i² x_i x_i']⁻ E[w_i² x_i d_i], and write φ̂_i := φ(y_i − d_iα − x_i'β̂). Then
$L_n(\alpha) = c_n^{-1}\,\big|\sqrt{n}\big\{\mathbb{E}_n[\hat\varphi_i(w_i/f_i)d_i] - \mathbb{E}_n[\hat\varphi_i(w_i/f_i)x_i]'\,\mathrm{E}[w_i^2x_ix_i']^-\,\mathrm{E}[w_i^2x_id_i]\big\}\big|^2.$
This is indeed a familiar form of a C(α) statistic.
The estimator ๐›ผ
ห† that minimizes ๐ฟ๐‘› up to ๐‘œ๐‘ƒ (1), under suitable regularity conditions,
โˆš
๐œŽ๐‘›โˆ’1 ๐‘›(ห†
๐›ผ โˆ’ ๐›ผ0 ) โ‡ ๐‘ (0, 1),
๐œŽ๐‘›2 =
1
E[๐‘“๐‘– ๐‘‘๐‘– ๐‘ฃ๐‘– ]โˆ’2 E[๐‘ฃ๐‘–2 ].
4
The smallest value of ๐œŽ๐‘›2 is achieved by using ๐‘ฃ๐‘– = ๐‘ฃ๐‘–โˆ— induced by setting ๐‘ค๐‘– = ๐‘“๐‘– :
๐œŽ๐‘›โˆ—2 =
1
E[๐‘ฃ๐‘–โˆ—2 ]โˆ’1 .
4
(5.22)
Thus, setting ๐‘ค๐‘– = ๐‘“๐‘– gives an optimal instrument ๐‘ฃ๐‘–โˆ— amongst all โ€œimmunizingโ€ instruments generated
by the process described above. Obviously, this improvement translates into shorter con๏ฌdence intervals
and better testing based on either α̂ or L_n. While w_i = f_i is optimal, f_i will have to be estimated in practice, resulting in more stringent conditions than when using non-optimal, known weights, e.g.,
๐‘ค๐‘– = 1. The use of known weights may also give better behavior under misspeci๏ฌcation of the model.
Under homoscedasticity, ๐‘ค๐‘– = 1 is an optimal weight.
5.2. Minimax Efficiency. There is also a clean connection to the (local) minimax efficiency analysis from semiparametric efficiency theory. [16] derives an efficient score function for the partially linear
median regression model:
๐‘†๐‘– = 2๐œ‘(๐‘ฆ๐‘– โˆ’ ๐‘‘๐‘– ๐›ผ0 โˆ’ ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 )๐‘“๐‘– [๐‘‘๐‘– โˆ’ ๐‘šโˆ—0 (๐‘ฅ)],
where ๐‘šโˆ—0 (๐‘ฅ๐‘– ) is ๐‘š0 (๐‘ฅ๐‘– ) in (5.19) induced by the weight ๐‘ค๐‘– = ๐‘“๐‘– :
๐‘šโˆ—0 (๐‘ฅ๐‘– ) =
E[๐‘“๐‘–2 ๐‘‘๐‘– โˆฃ๐‘ฅ๐‘– ]
.
E[๐‘“๐‘–2 โˆฃ๐‘ฅ๐‘– ]
Using the assumption ๐‘šโˆ—0 (๐‘ฅ๐‘– ) = ๐‘ฅโ€ฒ๐‘– ๐œƒ0โˆ— , where โˆฅ๐œƒ0โˆ— โˆฅ0 โฉฝ ๐‘  โ‰ช ๐‘› is sparse, we have that
๐‘†๐‘– = 2๐œ‘(๐‘ฆ๐‘– โˆ’ ๐‘‘๐‘– ๐›ผ0 โˆ’ ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 )๐‘ฃ๐‘–โˆ— ,
which is the score that was constructed using Neymanization. It follows that the estimator based on the
instrument ๐‘ฃ๐‘–โˆ— is actually e๏ฌƒcient in the minimax sense (see Theorem 18.4 in [15]), and inference about
๐›ผ0 based on this estimator provides best minimax power against local alternatives (see Theorem 18.12
in [15]).
The claim above is formal as long as, given a law ๐‘„โˆ—๐‘› , the least favorable submodels are permitted
as deviations that lie within the overall model ๐’ฌ๐‘› . Speci๏ฌcally, given a law ๐‘„โˆ—๐‘› , we shall need to allow
for a certain neighborhood ๐’ฌ๐›ฟ๐‘› of ๐‘„โˆ—๐‘› such that ๐‘„โˆ—๐‘› โˆˆ ๐’ฌ๐›ฟ๐‘› โŠ‚ ๐’ฌ๐‘› , where the overall model ๐’ฌ๐‘› is de๏ฌned
similarly as before, except now permitting heteroscedasticity (or we can keep homoscedasticity ๐‘“๐‘– = ๐‘“๐œ– to
maintain formality). To allow for this we consider a collection of laws indexed by a parameter ๐‘ก = (๐‘ก1 , ๐‘ก2 ),
generated by:
๐‘ฆ๐‘–
= ๐‘‘๐‘– (๐›ผโˆ—0 + ๐‘ก1 ) +๐‘ฅโ€ฒ๐‘– (๐›ฝ0โˆ— + ๐‘ก2 ๐œƒ0โˆ— ) +๐œ–๐‘– ,
| {z }
|
{z
}
๐›ผ0
๐‘“๐‘– ๐‘‘๐‘–
=
๐‘“๐‘– ๐‘ฅโ€ฒ๐‘– ๐œƒ0โˆ—
+
โˆฅ๐‘กโˆฅ โฉฝ ๐›ฟ,
(5.23)
๐›ฝ0
๐‘ฃ๐‘–โˆ— ,
E[๐‘“๐‘– ๐‘ฃ๐‘–โˆ— โˆฃ๐‘ฅ๐‘– ]
= 0,
(5.24)
where โˆฅ๐›ฝ0โˆ— โˆฅ0 +โˆฅ๐œƒ0โˆ— โˆฅ0 โฉฝ ๐‘  and conditions as in Section 2 hold. The case with ๐‘ก = 0 generates the law ๐‘„โˆ—๐‘› ; by
varying ๐‘ก within ๐›ฟ-ball, we generate the set of laws, denoted ๐’ฌ๐›ฟ๐‘› , containing the least favorable deviations
from ๐‘ก = 0. By [16], the e๏ฌƒcient score for the model given above is ๐‘†๐‘– , so we cannot have a better regular
estimator than the estimator whose in๏ฌ‚uence function is ๐ฝ โˆ’1 ๐‘†๐‘– , where ๐ฝ = E[๐‘†๐‘–2 ]. Since our overall model
๐’ฌ๐‘› contains ๐’ฌ๐›ฟ๐‘› , all the formal conclusions about (local minimax) optimality of our estimators hold from
theorems cited above (using subsequence arguments to handle models changing with ๐‘›). Our estimators
are regular, since under any law Q_n in the set 𝒬_n^δ with δ → 0, the first order asymptotics of √n(α̌ − α0) does not change, as a consequence of the theorems in Section 2 (in fact our theorems show more than this).
5.3. Conclusion. In this paper we propose a method for inference on the coe๏ฌƒcient ๐›ผ0 of a main regressor
that holds uniformly over many data-generating processes and is robust to possible "moderate" model
selection mistakes. The robustness of the method is achieved by relying on a Neyman type estimating
equation whose gradient with respect to the nuisance parameters is zero. In the present homoscedastic
setting the proposed estimator is asymptotically normal and also achieves the semi-parametric e๏ฌƒciency
bound.
Appendix A. Instrumental LAD Regression with Estimated Inputs
Throughout this section, let
๐œ“๐›ผ,๐›ฝ,๐œƒ (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– ) = (1/2 โˆ’ 1{๐‘ฆ๐‘– โฉฝ ๐‘ฅโ€ฒ๐‘– ๐›ฝ + ๐‘‘๐‘– ๐›ผ})(๐‘‘๐‘– โˆ’ ๐‘ฅโ€ฒ๐‘– ๐œƒ)
= (1/2 โˆ’ 1{๐‘ฆ๐‘– โฉฝ ๐‘ฅโ€ฒ๐‘– ๐›ฝ + ๐‘‘๐‘– ๐›ผ}){๐‘ฃ๐‘– โˆ’ ๐‘ฅโ€ฒ๐‘– (๐œƒ โˆ’ ๐œƒ0 )}.
For ๏ฌxed ๐›ผ โˆˆ โ„ and ๐›ฝ, ๐œƒ โˆˆ โ„๐‘ , de๏ฌne the function
ฮ“(๐›ผ, ๐›ฝ, ๐œƒ) := Eฬ„[๐œ“๐›ผ,๐›ฝ,๐œƒ (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )].
For the notational convenience, let โ„Ž = (๐›ฝ โ€ฒ , ๐œƒโ€ฒ )โ€ฒ , โ„Ž0 = (๐›ฝ0โ€ฒ , ๐œƒ0โ€ฒ )โ€ฒ and โ„Žฬ‚ = (๐›ฝห†โ€ฒ , ๐œƒห†โ€ฒ )โ€ฒ . The partial derivative of
ฮ“(๐›ผ, ๐›ฝ, ๐œƒ) with respect to ๐›ผ is denoted by ฮ“1 (๐›ผ, ๐›ฝ, ๐œƒ) and the partial derivative of ฮ“(๐›ผ, ๐›ฝ, ๐œƒ) with respect
to โ„Ž = (๐›ฝ โ€ฒ , ๐œƒโ€ฒ )โ€ฒ is denoted by ฮ“2 (๐›ผ, ๐›ฝ, ๐œƒ). Consider the following high-level condition. Here (๐›ฝห†โ€ฒ , ๐œƒห†โ€ฒ )โ€ฒ is a
generic estimator of (β0', θ0')' (and not necessarily the ℓ1-LAD and Lasso estimators, resp.), and α̌ is defined by α̌ ∈ arg min_{α∈𝒜} L_n(α) with this (β̂', θ̂')', where 𝒜 here is also a generic (possibly random) compact interval. We assume that (β̂', θ̂')', 𝒜 and α̌ satisfy the following conditions.
Condition ILAD. (i) ๐‘“๐œ– (๐‘ก) โˆจ โˆฃ๐‘“๐œ–โ€ฒ (๐‘ก)โˆฃ โฉฝ ๐ถ for all ๐‘ก โˆˆ โ„, Eฬ„[๐‘ฃ๐‘–2 ] โฉพ ๐‘ > 0 and Eฬ„[๐‘ฃ๐‘–4 ] โˆจ Eฬ„[๐‘‘4๐‘– ] โฉฝ ๐ถ.
Moreover, for some sequences ๐›ฟ๐‘› โ†˜ 0 and ฮ”๐‘› โ†˜ 0, with probability at least 1 โˆ’ ฮ”๐‘› ,
(ii) {๐›ผ : โˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃ โฉฝ ๐‘›โˆ’1/2 /๐›ฟ๐‘› } โŠ‚ ๐’œ, where ๐’œ is a (possibly random) compact interval;
(iii) the estimated parameters (๐›ฝห†โ€ฒ , ๐œƒห†โ€ฒ )โ€ฒ satisfy
$\big\{1 \vee \max_{1\leqslant i\leqslant n}\big(\mathrm{E}[|v_i|] \vee |x_i'(\hat\theta-\theta_0)|\big)\big\}^{1/2}\,\|x_i'(\hat\beta-\beta_0)\|_{2,n} \leqslant \delta_n n^{-1/4}, \qquad \|x_i'(\hat\theta-\theta_0)\|_{2,n} \leqslant \delta_n n^{-1/4}, \qquad (A.25)$
$\sup_{\alpha\in\mathcal{A}}\,|\mathbb{G}_n(\psi_{\alpha,\hat\beta,\hat\theta} - \psi_{\alpha,\beta_0,\theta_0})| \leqslant \delta_n, \qquad (A.26)$
where recall that $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\{f(y_i,d_i,x_i) - \mathrm{E}[f(y_i,d_i,x_i)]\}$; and lastly
(iv) the estimator α̌ satisfies |α̌ − α0| ⩽ δ_n.
Comment A.1. Condition ILAD suffices to make the impact of the estimation of the instruments on the first order asymptotics of the estimator α̌ negligible. We note that Condition ILAD covers several different estimators, including both estimators proposed in Algorithms 1 and 2.
The following lemma summarizes the main inferential result based on the high level Condition ILAD.
Lemma 1. Under Condition ILAD we have, for σ_n² = 1/(4 f_ε² Ē[v_i²]),
$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1) \quad \text{and} \quad nL_n(\alpha_0) \rightsquigarrow \chi^2(1).$
Proof of Lemma 1. We shall separate the proof into two parts.
Part 1. (Proof for the ๏ฌrst assertion). Observe that
๐”ผ๐‘› [๐œ“๐›ผ,
ห† ๐œƒห†(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] = ๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] + ๐”ผ๐‘› [๐œ“๐›ผ,
ห† ๐œƒห†(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– ) โˆ’ ๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )]
ห‡ ๐›ฝ,
ห‡ ๐›ฝ,
ห† ๐œƒ)
ห†
= ๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] + ฮ“(ห‡
๐›ผ, ๐›ฝ,
โˆ’1/2
+ ๐‘›โˆ’1/2 ๐”พ๐‘› (๐œ“๐›ผ,
๐”พ๐‘› (๐œ“๐›ผ,๐›ฝ
ห‡ 0 ,๐œƒ0 ) + ๐‘›
ห‡ 0 ,๐œƒ0 โˆ’ ๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 )
ห† ๐œƒห† โˆ’ ๐œ“๐›ผ,๐›ฝ
ห‡ ๐›ฝ,
= ๐ผ + ๐ผ๐ผ + ๐ผ๐ผ๐ผ + ๐ผ๐‘‰.
By Condition ILAD(iii) (A.26) we have with probability at least 1 − Δ_n that |III| ⩽ δ_n n^{-1/2}. We wish to show that
$II + (f_\epsilon\bar{\mathrm{E}}[v_i^2])(\check\alpha - \alpha_0) \lesssim_P \delta_n n^{-1/2} + \delta_n|\check\alpha - \alpha_0|. \qquad (A.27)$
Observe that
$\Gamma(\alpha,\hat\beta,\hat\theta) = \Gamma(\alpha,\beta_0,\theta_0) + \Gamma(\alpha,\hat\beta,\hat\theta) - \Gamma(\alpha,\beta_0,\theta_0)$
$= \Gamma(\alpha,\beta_0,\theta_0) + \{\Gamma(\alpha,\hat\beta,\hat\theta) - \Gamma(\alpha,\beta_0,\theta_0) - \Gamma_2(\alpha,\beta_0,\theta_0)'(\hat h - h_0)\} + \Gamma_2(\alpha,\beta_0,\theta_0)'(\hat h - h_0).$
Since ฮ“(๐›ผ0 , ๐›ฝ0 , ๐œƒ0 ) = 0, by Taylorโ€™s theorem, there exists some point ๐›ผ
หœ between ๐›ผ0 and ๐›ผ such that
ฮ“(๐›ผ, ๐›ฝ0 , ๐œƒ0 ) = ฮ“1 (หœ
๐›ผ, ๐›ฝ0 , ๐œƒ0 )(๐›ผ โˆ’ ๐›ผ0 ). By its de๏ฌnition, we have
ฮ“1 (๐›ผ, ๐›ฝ, ๐œƒ) = โˆ’Eฬ„[๐‘“๐œ– (๐‘ฅโ€ฒ๐‘– (๐›ฝโˆ’๐›ฝ0 )+๐‘‘๐‘– (๐›ผโˆ’๐›ผ0 ))๐‘‘๐‘– (๐‘‘๐‘– โˆ’๐‘ฅโ€ฒ๐‘– ๐œƒ)] = โˆ’Eฬ„[๐‘“๐œ– (๐‘ฅโ€ฒ๐‘– (๐›ฝโˆ’๐›ฝ0 )+๐‘‘๐‘– (๐›ผโˆ’๐›ผ0 ))๐‘‘๐‘– {๐‘ฃ๐‘– โˆ’๐‘ฅโ€ฒ๐‘– (๐œƒโˆ’๐œƒ0 )}].
Since ๐‘“๐œ– = ๐‘“๐œ– (0) and ๐‘‘๐‘– = ๐‘ฅโ€ฒ๐‘– ๐œƒ0 + ๐‘ฃ๐‘– with E[๐‘ฃ๐‘– ] = 0, we have ฮ“1 (๐›ผ0 , ๐›ฝ0 , ๐œƒ0 ) = โˆ’๐‘“๐œ– Eฬ„[๐‘‘๐‘– ๐‘ฃ๐‘– ] = โˆ’๐‘“๐œ– Eฬ„[๐‘ฃ๐‘–2 ]. Also
โˆฃฮ“1 (๐›ผ, ๐›ฝ0 , ๐œƒ0 ) โˆ’ ฮ“1 (๐›ผ0 , ๐›ฝ0 , ๐œƒ0 )โˆฃ โฉฝ โˆฃEฬ„[{๐‘“๐œ– (0) โˆ’ ๐‘“๐œ– (๐‘‘๐‘– (๐›ผ โˆ’ ๐›ผ0 ))๐‘‘๐‘– ๐‘ฃ๐‘– ]โˆฃ โฉฝ ๐ถโˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃEฬ„[โˆฃ๐‘‘2๐‘– ๐‘ฃ๐‘– โˆฃ].
Hence ฮ“(ห‡
๐›ผ, ๐›ฝ0 , ๐œƒ0 ) = โˆ’๐‘“๐œ– Eฬ„[๐‘ฃ๐‘–2 ] + ๐‘‚(1)โˆฃห‡
๐›ผ โˆ’ ๐›ผ0 โˆฃ.
Observe that
ฮ“2 (๐›ผ, ๐›ฝ, ๐œƒ) =
(
โˆ’Eฬ„[๐‘“๐œ– (๐‘ฅโ€ฒ๐‘– (๐›ฝ โˆ’ ๐›ฝ0 ) + ๐‘‘๐‘– (๐›ผ โˆ’ ๐›ผ0 ))(๐‘‘๐‘– โˆ’ ๐‘ฅโ€ฒ๐‘– ๐œƒ)๐‘ฅ๐‘– ]
โˆ’Eฬ„[(1/2 โˆ’ 1{๐‘ฆ๐‘– โฉฝ ๐‘ฅโ€ฒ๐‘– ๐›ฝ + ๐‘‘๐‘– ๐›ผ})๐‘ฅ๐‘– ]
)
.
Note that since Eฬ„[๐‘“๐œ– (0)(๐‘‘๐‘– โˆ’๐‘ฅโ€ฒ๐‘– ๐œƒ0 )๐‘ฅ๐‘– ] = ๐‘“๐œ– Eฬ„[๐‘ฃ๐‘– ๐‘ฅ๐‘– ] = 0 and Eฬ„[(1/2โˆ’1{๐‘ฆ๐‘– โฉฝ ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 +๐‘‘๐‘– ๐›ผ0 })๐‘ฅ๐‘– ] = Eฬ„[(1/2โˆ’1{๐œ–๐‘– โฉฝ
0})๐‘ฅ๐‘– ] = 0, we have ฮ“2 (๐›ผ0 , ๐›ฝ0 , ๐œƒ0 ) = 0. Moreover,
โˆฃฮ“2 (๐›ผ, ๐›ฝ0 , ๐œƒ0 )โ€ฒ (โ„Žฬ‚ โˆ’ โ„Ž0 )โˆฃ = โˆฃ{ฮ“2 (๐›ผ, ๐›ฝ0 , ๐œƒ0 ) โˆ’ ฮ“2 (๐›ผ0 , ๐›ฝ0 , ๐œƒ0 )}โ€ฒ (โ„Žฬ‚ โˆ’ โ„Ž0 )โˆฃ
โฉฝ โˆฃEฬ„[{๐‘“๐œ– (๐‘‘๐‘– (๐›ผ โˆ’ ๐›ผ0 )) โˆ’ ๐‘“๐œ– (0)}๐‘ฃ๐‘– ๐‘ฅโ€ฒ๐‘– ](๐›ฝห† โˆ’ ๐›ฝ0 )โˆฃ
+ โˆฃEฬ„[{๐น (๐‘‘๐‘– (๐›ผ โˆ’ ๐›ผ0 )) โˆ’ ๐น (0)}๐‘ฅโ€ฒ๐‘– ](๐œƒห† โˆ’ ๐œƒ0 )]โˆฃ
โฉฝ ๐‘‚(1){โˆฅ๐‘ฅโ€ฒ๐‘– (๐›ฝห† โˆ’ ๐›ฝ0 )โˆฅ2,๐‘› + โˆฅ๐‘ฅโ€ฒ๐‘– (๐œƒห† โˆ’ ๐œƒ0 )โˆฅ2,๐‘› }โˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃ
= ๐‘‚๐‘ƒ (๐›ฟ๐‘› )โˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃ.
Hence โˆฃฮ“2 (ห‡
๐›ผ, ๐›ฝ0 , ๐œƒ0 )โ€ฒ (โ„Žฬ‚ โˆ’ โ„Ž0 )โˆฃ โ‰ฒ๐‘ƒ ๐›ฟ๐‘› โˆฃห‡
๐›ผ โˆ’ ๐›ผ0 โˆฃ.
Denote by ฮ“22 (๐›ผ, ๐›ฝ, ๐œƒ) the Hessian matrix of ฮ“(๐›ผ, ๐›ฝ, ๐œƒ) with respect to โ„Ž = (๐›ฝ โ€ฒ , ๐œƒโ€ฒ )โ€ฒ . Then
$\Gamma_{22}(\alpha,\beta,\theta) = \begin{pmatrix} -\bar{\mathrm{E}}[f_\epsilon'(x_i'(\beta-\beta_0)+d_i(\alpha-\alpha_0))(d_i-x_i'\theta)x_ix_i'] & \bar{\mathrm{E}}[f_\epsilon(x_i'(\beta-\beta_0)+d_i(\alpha-\alpha_0))x_ix_i'] \\ \bar{\mathrm{E}}[f_\epsilon(x_i'(\beta-\beta_0)+d_i(\alpha-\alpha_0))x_ix_i'] & 0 \end{pmatrix},$
so that
$(\hat h - h_0)'\Gamma_{22}(\alpha,\beta,\theta)(\hat h - h_0) \leqslant |(\hat\beta-\beta_0)'\bar{\mathrm{E}}[f_\epsilon'(x_i'(\beta-\beta_0)+d_i(\alpha-\alpha_0))(d_i-x_i'\theta)x_ix_i'](\hat\beta-\beta_0)|$
$\qquad + 2|(\hat\beta-\beta_0)'\bar{\mathrm{E}}[f_\epsilon(x_i'(\beta-\beta_0)+d_i(\alpha-\alpha_0))x_ix_i'](\hat\theta-\theta_0)|$
$\leqslant C\big\{\max_{1\leqslant i\leqslant n}\mathrm{E}[|d_i - x_i'\theta|]\,\|x_i'(\hat\beta-\beta_0)\|_{2,n}^2 + 2\|x_i'(\hat\beta-\beta_0)\|_{2,n}\cdot\|x_i'(\hat\theta-\theta_0)\|_{2,n}\big\}.$
Here โˆฃ๐‘‘๐‘– โˆ’ ๐‘ฅโ€ฒ๐‘– ๐œƒโˆฃ = โˆฃ๐‘ฃ๐‘– โˆ’ ๐‘ฅโ€ฒ๐‘– (๐œƒ โˆ’ ๐œƒ0 )โˆฃ โฉฝ โˆฃ๐‘ฃ๐‘– โˆฃ + โˆฃ๐‘ฅโ€ฒ๐‘– (๐œƒ โˆ’ ๐œƒ0 )โˆฃ. Hence by Taylorโ€™s theorem together with ILAD(iii),
we conclude that
ห† ๐œƒ)
ห† โˆ’ ฮ“(ห‡
โˆฃฮ“(ห‡
๐›ผ, ๐›ฝ,
๐›ผ, ๐›ฝ0 , ๐œƒ0 ) โˆ’ ฮ“2 (ห‡
๐›ผ, ๐›ฝ0 , ๐œƒ0 )โ€ฒ (โ„Žฬ‚ โˆ’ โ„Ž0 )โˆฃ โ‰ฒ๐‘ƒ ๐›ฟ๐‘› ๐‘›โˆ’1/2 .
This leads to the expansion in (A.27).
We now proceed to bound the fourth term. By Condition ILAD(iv) we have with probability at least 1 − Δ_n that |α̌ − α0| ⩽ δ_n. Observe that
(๐œ“๐›ผ,๐›ฝ0 ,๐œƒ0 โˆ’ ๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 )(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– ) = (1{๐‘ฆ๐‘– โฉฝ ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 + ๐‘‘๐‘– ๐›ผ0 } โˆ’ 1{๐‘ฆ๐‘– โฉฝ ๐‘ฅโ€ฒ๐‘– ๐›ฝ0 + ๐‘‘๐‘– ๐›ผ})๐‘ฃ๐‘–
= (1{๐œ–๐‘– โฉฝ 0} โˆ’ 1{๐œ–๐‘– โฉฝ ๐‘‘๐‘– (๐›ผ โˆ’ ๐›ผ0 )})๐‘ฃ๐‘– ,
so that โˆฃ(๐œ“๐›ผ,๐›ฝ0 ,๐œƒ0 โˆ’ ๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 )(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )โˆฃ โฉฝ 1{โˆฃ๐œ–๐‘– โˆฃ โฉฝ ๐›ฟ๐‘› โˆฃ๐‘‘๐‘– โˆฃ}โˆฃ๐‘ฃ๐‘– โˆฃ whenever โˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃ โฉฝ ๐›ฟ๐‘› . Since the class of
functions {(๐‘ฆ, ๐‘‘, ๐‘ฅ) 7โ†’ (๐œ“๐›ผ,๐›ฝ0 ,๐œƒ0 โˆ’ ๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 )(๐‘ฆ, ๐‘‘, ๐‘ฅ) : โˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃ โฉฝ ๐›ฟ๐‘› } is a VC subgraph class with VC index
bounded by some constant independent of ๐‘›, using (a version of) Theorem 2.14.1 in [25], we have
$\sup_{|\alpha-\alpha_0|\leqslant\delta_n}|\mathbb{G}_n(\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})| \lesssim_P (\bar{\mathrm{E}}[1\{|\epsilon_i| \leqslant \delta_n|d_i|\}v_i^2])^{1/2} \lesssim_P \delta_n^{1/2}.$
This implies that $|IV| \lesssim_P \delta_n^{1/2} n^{-1/2}$.
Combining these bounds on II, III and IV, we have the following stochastic expansion
$\mathbb{E}_n[\psi_{\check\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)] = -(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\check\alpha - \alpha_0) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + O_P(\delta_n^{1/2}n^{-1/2}) + O_P(\delta_n)|\check\alpha - \alpha_0|.$
Let ๐›ผโˆ— = ๐›ผ0 + (๐‘“๐œ– Eฬ„[๐‘ฃ๐‘–2 ])โˆ’1 ๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )]. Then ๐›ผโˆ— โˆˆ ๐’œ with probability 1 โˆ’ ๐‘œ(1) since
โˆฃ๐›ผโˆ— โˆ’ ๐›ผ0 โˆฃ โ‰ฒ๐‘ƒ ๐‘›โˆ’1/2 . It is not di๏ฌƒcult to see that the above stochastic expansion holds with ๐›ผ
ห‡ replaced
โˆ—
by ๐›ผ , so that
2
โˆ—
1/2 โˆ’1/2
๐”ผ๐‘› [๐œ“๐›ผโˆ— ,๐›ฝ,
) = ๐‘‚๐‘ƒ (๐›ฟ๐‘›1/2 ๐‘›โˆ’1/2 ).
ห† ๐œƒห†(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] = โˆ’(๐‘“๐œ– Eฬ„[๐‘ฃ๐‘– ])(๐›ผ โˆ’ ๐›ผ0 ) + ๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] + ๐‘‚๐‘ƒ (๐›ฟ๐‘› ๐‘›
1/2
โˆ’1/2
Therefore, โˆฃ๐”ผ๐‘› [๐œ“๐›ผ,
), so that
ห† ๐œƒห†(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )]โˆฃ โฉฝ โˆฃ๐”ผ๐‘› [๐œ“๐›ผโˆ— ,๐›ฝ,
ห† ๐œƒห†(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )]โˆฃ = ๐‘‚๐‘ƒ (๐›ฟ๐‘› ๐‘›
ห‡ ๐›ฝ,
(๐‘“๐œ– Eฬ„[๐‘ฃ๐‘–2 ])(ห‡
๐›ผ โˆ’ ๐›ผ0 ) = ๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] + ๐‘‚๐‘ƒ (๐›ฟ๐‘›1/2 ๐‘›โˆ’1/2 ),
โˆš
๐›ผ โˆ’ ๐›ผ0 ) โ‡ ๐‘ (0, 1) since by the Lyapunov CLT,
which immediately implies that ๐œŽ๐‘›โˆ’1 ๐‘›(ห‡
โˆš
(Eฬ„[๐‘ฃ๐‘–2 ]/4)โˆ’1/2 ๐‘›๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] โ‡ ๐‘ (0, 1).
Part 2. (Proof for the second assertion). First consider the denominator of ๐ฟ๐‘› (๐›ผ0 ). We have that
โˆฃ๐”ผ๐‘› [ห†
๐‘ฃ๐‘–2 ] โˆ’ ๐”ผ๐‘› [๐‘ฃ๐‘–2 ]โˆฃ = โˆฃ๐”ผ๐‘› [(ห†
๐‘ฃ๐‘– โˆ’ ๐‘ฃ๐‘– )(ห†
๐‘ฃ๐‘– + ๐‘ฃ๐‘– )]โˆฃ โฉฝ โˆฅห†
๐‘ฃ๐‘– โˆ’ ๐‘ฃ๐‘– โˆฅ2,๐‘› โˆฅห†
๐‘ฃ๐‘– + ๐‘ฃ๐‘– โˆฅ2,๐‘›
โฉฝ โˆฅ๐‘ฅโ€ฒ๐‘– (๐œƒห† โˆ’ ๐œƒ0 )โˆฅ2,๐‘› (2โˆฅ๐‘ฃ๐‘– โˆฅ2,๐‘› + โˆฅ๐‘ฅโ€ฒ๐‘– (๐œƒห† โˆ’ ๐œƒ0 )โˆฅ2,๐‘› ) โ‰ฒ๐‘ƒ ๐›ฟ๐‘› ,
where we have used the fact that โˆฅ๐‘ฃ๐‘– โˆฅ2,๐‘› โ‰ฒ๐‘ƒ (Eฬ„[๐‘ฃ๐‘–2 ])1/2 = ๐‘‚(1) (which is guaranteed by ILAD(i)).
Next consider the numerator of ๐ฟ๐‘› (๐›ผ0 ). Since Eฬ„[๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] = 0 we have
โˆ’1/2
ห† ห†
๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ,
๐”พ๐‘› (๐œ“๐›ผ0 ,๐›ฝ,
ห† ๐œƒห†(๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )] = ๐‘›
ห† ๐œƒห† โˆ’ ๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 ) + ฮ“(๐›ผ0 , ๐›ฝ, ๐œƒ) + ๐”ผ๐‘› [๐œ“๐›ผ0 ,๐›ฝ0 ,๐œƒ0 (๐‘ฆ๐‘– , ๐‘‘๐‘– , ๐‘ฅ๐‘– )].
By Condition ILAD(iii) and the previous calculation, we have
$|\mathbb{G}_n(\psi_{\alpha_0,\hat\beta,\hat\theta} - \psi_{\alpha_0,\beta_0,\theta_0})| \lesssim_P \delta_n \quad \text{and} \quad |\Gamma(\alpha_0,\hat\beta,\hat\theta)| \lesssim_P \delta_n n^{-1/2}.$
Therefore, using the simple identity that $nA_n^2 = nB_n^2 + n(A_n - B_n)^2 + 2nB_n(A_n - B_n)$ with
$A_n = \mathbb{E}_n[\psi_{\alpha_0,\hat\beta,\hat\theta}(y_i,d_i,x_i)] \quad \text{and} \quad B_n = \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] \lesssim_P (\bar{\mathrm{E}}[v_i^2])^{1/2}n^{-1/2},$
we have
$nL_n(\alpha_0) = \frac{4n\,|\mathbb{E}_n[\psi_{\alpha_0,\hat\beta,\hat\theta}(y_i,d_i,x_i)]|^2}{\mathbb{E}_n[\hat v_i^2]} = \frac{4n\,|\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)]|^2}{\bar{\mathrm{E}}[v_i^2]} + O_P(\delta_n)$
since Ē[v_i²] ⩾ c is bounded away from zero. The result then follows since
$(\bar{\mathrm{E}}[v_i^2]/4)^{-1/2}\sqrt{n}\,\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] \rightsquigarrow N(0,1).$ □
Comment A.2 (On 1-step procedure). An inspection of the proof leads to the following stochastic
expansion:
$\mathbb{E}_n[\psi_{\hat\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)] = -(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\hat\alpha - \alpha_0) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + O_P\big(\delta_n^{1/2}n^{-1/2} + \delta_n n^{-1/4}|\hat\alpha - \alpha_0| + |\hat\alpha - \alpha_0|^2\big),$
where α̂ is any consistent estimator of α0. Hence provided that |α̂ − α0| = o_P(n^{-1/4}), the remainder term in the above expansion is o_P(n^{-1/2}), and the 1-step estimator α̌ defined by
$\check\alpha = \hat\alpha + (\mathbb{E}_n[f_\epsilon\hat v_i^2])^{-1}\mathbb{E}_n[\psi_{\hat\alpha,\hat\beta,\hat\theta}(y_i,d_i,x_i)]$
has the following stochastic expansion:
$\check\alpha = \hat\alpha + \{f_\epsilon\bar{\mathrm{E}}[v_i^2] + o_P(n^{-1/4})\}^{-1}\{-(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\hat\alpha - \alpha_0) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + o_P(n^{-1/2})\}$
$= \alpha_0 + (f_\epsilon\bar{\mathrm{E}}[v_i^2])^{-1}\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i,d_i,x_i)] + o_P(n^{-1/2}),$
so that $\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1)$.
Appendix B. Proof of Theorem 1
The proof of Theorem 1 uses the properties of Post-โ„“1-LAD and Post-Lasso. We will collect these
properties together with required regularity conditions in Appendix D.
Proof of Theorem 1. We will verify Condition ILAD and the desired result then follows from Lemma 1.
The assumptions on the error density ๐‘“๐œ– (โ‹…) in Condition ILAD(i) are assumed in Condition I(iv). The
moment conditions on ๐‘‘๐‘– and ๐‘ฃ๐‘– in Condition ILAD(i) are assumed in Condition I(ii).
Condition SE implies that $\kappa_{\mathbf{c}}$ is bounded away from zero with probability $1-\Delta_n$ for $n$ sufficiently large, see [9]. Step 1 relies on Post-$\ell_1$-LAD. By assumption, with probability $1-\Delta_n$ we have $\hat s = \|\tilde\beta\|_0 \leqslant Cs$. Thus, by Condition SE, $\phi_{\min}(\hat s + s)$ is bounded away from zero since $\hat s + s \leqslant \ell_n s$ for large enough $n$ with probability $1-\Delta_n$. Moreover, Condition PLAD in Appendix D is implied by Condition I.
The required side condition of Lemma 4 is satisfied by relations (F.41) and (F.42). By Lemma 4 we have $|\hat\alpha - \alpha_0| \lesssim_P \sqrt{s\log(p\vee n)/n} \leqslant o(1)\log^{-1} n$ under $s^3\log^3(p\vee n) \leqslant \delta_n n$. Note that this implies
{๐›ผ : โˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃ โฉฝ ๐‘›โˆ’1/2 log ๐‘›} โŠ‚ ๐’œ (with probability 1 โˆ’ ๐‘œ(1)) which is required in ILAD(ii) and the
(shrinking) de๏ฌnition of ๐’œ establishes the initial rate of ILAD(iv). By Lemma 5 in Appendix D we have
$\|x_i'(\tilde\beta - \beta_0)\|_{2,n} \lesssim_P \sqrt{s\log(n\vee p)/n}$ since the required side condition holds. Indeed, for $\tilde x_i = (d_i, x_i')'$ and $\delta = (\delta_d, \delta_x')'$, because of Condition SE and the fact that $\mathbb{E}_n[|d_i|^3] \lesssim_P \bar{\mathrm{E}}[|d_i|^3] = O(1)$,
\[
\inf_{\|\delta\|_0 \leqslant s+Cs} \frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]}
\geqslant \inf_{\|\delta\|_0 \leqslant s+Cs} \frac{\{\phi_{\min}(s+Cs)\}^{3/2}\|\delta\|^3}{4\mathbb{E}_n[|x_i'\delta_x|^3] + 4|\delta_d|^3\mathbb{E}_n[|d_i|^3]}
\geqslant \inf_{\|\delta\|_0 \leqslant s+Cs} \frac{\{\phi_{\min}(s+Cs)\}^{3/2}\|\delta\|^3}{4K_x\|\delta_x\|_1\phi_{\max}(s+Cs)\|\delta_x\|^2 + 4\|\delta\|^3\mathbb{E}_n[|d_i|^3]}
\geqslant \frac{\{\phi_{\min}(s+Cs)\}^{3/2}}{4K_x\sqrt{s+Cs}\,\phi_{\max}(s+Cs) + 4\mathbb{E}_n[|d_i|^3]}
\gtrsim_P \frac{1}{K_x\sqrt{s}}.
\]
Therefore, since $K_x^2 s^2\log^2(p\vee n) \leqslant \delta_n n$ and $\lambda \lesssim \sqrt{n\log(p\vee n)}$ we have
\[
\left(\frac{n\sqrt{\phi_{\min}(s+Cs)}}{\lambda\sqrt{s}} + \frac{n}{\sqrt{sn\log(p\vee n)}}\right)\inf_{\|\delta\|_0 \leqslant s+Cs}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]}
\gtrsim_P \frac{\sqrt{n}}{K_x s\sqrt{\log(p\vee n)}} \to \infty.
\]
Step 2 relies on Post-Lasso. Condition HL in Appendix D is implied by Condition I and Lemma 2
applied twice with ๐œ๐‘– = ๐‘ฃ๐‘– and ๐œ๐‘– = ๐‘‘๐‘– under the condition that ๐พ๐‘ฅ4 log ๐‘ โฉฝ ๐›ฟ๐‘› ๐‘›. By Lemma 7 in Appendix
D we have $\|x_i'(\tilde\theta - \theta_0)\|_{2,n} \lesssim_P \sqrt{s\log(n\vee p)/n}$ and $\|\tilde\theta\|_0 \lesssim s$ with probability $1-o(1)$.
The rates established above for $\tilde\theta$ and $\tilde\beta$ imply (A.25) in ILAD(iii) since by Condition I(ii) $\mathrm{E}[|v_i|] \leqslant (\mathrm{E}[v_i^2])^{1/2} = O(1)$ and $\max_{1\leqslant i\leqslant n}|x_i'(\tilde\theta - \theta_0)| \lesssim_P K_x\sqrt{s^2\log(p\vee n)/n} = o(1)$.
We now verify the last requirement in Condition ILAD(iii). Consider the following class of functions
โ„ฑ๐‘  = {(๐‘ฆ, ๐‘‘, ๐‘ฅ) 7โ†’ 1{๐‘ฆ โฉฝ ๐‘ฅโ€ฒ ๐›ฝ + ๐‘‘๐›ผ} : ๐›ผ โˆˆ โ„, โˆฅ๐›ฝโˆฅ0 โฉฝ ๐ถ๐‘ },
(๐‘)
which is the union of ๐ถ๐‘ 
VC-subgraph classes of functions with VC indices bounded by ๐ถ โ€ฒ ๐‘ . Hence
log ๐‘ (๐œ€, โ„ฑ๐‘  , โˆฅ โ‹… โˆฅโ„™๐‘› ,2 ) โ‰ฒ ๐‘  log ๐‘ + ๐‘  log(1/๐œ€).
Likewise, consider the following class of functions $\mathcal{G}_{s,r} = \{(y,d,x)\mapsto x'\theta : \|\theta\|_0\leqslant Cs,\ \|x_i'\theta\|_{2,n}\leqslant r\}$. Then
\[
\log N(\varepsilon\|G_{s,r}\|_{\mathbb{P}_n,2}, \mathcal{G}_{s,r}, \|\cdot\|_{\mathbb{P}_n,2}) \lesssim s\log p + s\log(1/\varepsilon),
\]
where $G_{s,r}(y,d,x) = \max_{\|\theta\|_0\leqslant Cs,\,\|x_i'\theta\|_{2,n}\leqslant r}|x'\theta|$ is an envelope for $\mathcal{G}_{s,r}$.
Note that
\begin{align*}
\sup_{\alpha\in\mathcal{A}}|\mathbb{G}_n(\psi_{\alpha,\tilde\beta,\tilde\theta} - \psi_{\alpha,\beta_0,\theta_0})|
&\leqslant \sup_{\alpha\in\mathcal{A}}|\mathbb{G}_n(\psi_{\alpha,\tilde\beta,\tilde\theta} - \psi_{\alpha,\tilde\beta,\theta_0})| \tag{B.28}\\
&\quad + \sup_{\alpha\in\mathcal{A}}|\mathbb{G}_n(\psi_{\alpha,\tilde\beta,\theta_0} - \psi_{\alpha,\beta_0,\theta_0})|. \tag{B.29}
\end{align*}
We first bound (B.28). Observe that
\[
\psi_{\alpha,\beta,\theta}(y_i,d_i,x_i) - \psi_{\alpha,\beta,\theta_0}(y_i,d_i,x_i) = -(1/2 - 1\{y_i \leqslant x_i'\beta + d_i\alpha\})\,x_i'(\theta - \theta_0),
\]
and consider the class of functions $\mathcal{H}^1_{s,r} = \{(y,d,x)\mapsto (1/2 - 1\{y\leqslant x'\beta + d\alpha\})\,x'(\theta-\theta_0) : \alpha\in\mathbb{R},\ \|\beta\|_0\leqslant Cs,\ \|\theta\|_0\leqslant Cs,\ \|x_i'(\theta-\theta_0)\|_{2,n}\leqslant r\}$ with $r\lesssim \sqrt{s\log(p\vee n)/n}$. Then by Lemma 9 together with the above entropy calculations (and some straightforward algebra), we have
\[
\sup_{g\in\mathcal{H}^1_{s,r}}|\mathbb{G}_n(g)| \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{s\log(p\vee n)/n} = o_P(1),
\]
where ๐‘ 2 log2 (๐‘ โˆจ ๐‘›) โฉฝ ๐›ฟ๐‘› ๐‘› is used. Since โˆฅ๐‘ฅโ€ฒ๐‘– (๐œƒหœ โˆ’ ๐œƒ0 )โˆฅ2,๐‘› โ‰ฒ๐‘ƒ
probability 1 โˆ’ ๐‘œ(1), we conclude that (B.28) = ๐‘œ๐‘ƒ (1).
โˆš
หœ 0 โˆจ โˆฅ๐œƒโˆฅ
หœ 0 โ‰ฒ ๐‘  with
๐‘  log(๐‘› โˆจ ๐‘)/๐‘› and โˆฅ๐›ฝโˆฅ
Lastly, we bound (B.29). Observe that
\[
\psi_{\alpha,\beta,\theta_0}(y_i,d_i,x_i) - \psi_{\alpha,\beta_0,\theta_0}(y_i,d_i,x_i) = -(1\{y_i\leqslant x_i'\beta + d_i\alpha\} - 1\{y_i\leqslant x_i'\beta_0 + d_i\alpha\})\,v_i,
\]
where $v_i = d_i - x_i'\theta_0$, and consider the class of functions $\mathcal{H}^2_{s,r} = \{(y,d,x)\mapsto (1\{y\leqslant x'\beta + d\alpha\} - 1\{y\leqslant x'\beta_0 + d\alpha\})(d - x'\theta_0) : \alpha\in\mathbb{R},\ \|\beta\|_0\leqslant Cs,\ \|x_i'(\beta-\beta_0)\|_{2,n}\leqslant r\}$ with $r\lesssim \sqrt{s\log(p\vee n)/n}$. Then by Lemma 9 together with the above entropy calculations (and some straightforward algebra), we have
\[
\sup_{g\in\mathcal{H}^2_{s,r}}|\mathbb{G}_n(g)| \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{\sup_{g\in\mathcal{H}^2_{s,r}}\mathbb{E}_n[g(y_i,d_i,x_i)^2] \vee \bar{\mathrm{E}}[g(y_i,d_i,x_i)^2]}.
\]
Here we have
\[
\bar{\mathrm{E}}[g(y_i,d_i,x_i)^2] \leqslant C\|x_i'(\beta-\beta_0)\|_{2,n}(\bar{\mathrm{E}}[v_i^4])^{1/2} \lesssim \sqrt{s\log(p\vee n)/n}.
\]
On the other hand,
\begin{equation}
\sup_{g\in\mathcal{H}^2_{s,r}}\mathbb{E}_n[g(y_i,d_i,x_i)^2] \leqslant n^{-1/2}\sup_{g\in\mathcal{H}^2_{s,r}}\mathbb{G}_n(g^2) + \sup_{g\in\mathcal{H}^2_{s,r}}\bar{\mathrm{E}}[g(y_i,d_i,x_i)^2], \tag{B.30}
\end{equation}
and apply Lemma 9 to the first term on the right side of (B.30). Then we have
\[
\sup_{g\in\mathcal{H}^2_{s,r}}\mathbb{G}_n(g^2) \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{\sup_{g\in\mathcal{H}^2_{s,r}}\mathbb{E}_n[g(y_i,d_i,x_i)^4] \vee \bar{\mathrm{E}}[g(y_i,d_i,x_i)^4]}
\lesssim \sqrt{s\log(p\vee n)}\,\sqrt{\mathbb{E}_n[v_i^4] \vee \bar{\mathrm{E}}[v_i^4]} \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{\bar{\mathrm{E}}[v_i^4]}.
\]
Since $\|x_i'(\tilde\beta-\beta_0)\|_{2,n} \lesssim_P \sqrt{s\log(n\vee p)/n}$ and $\|\tilde\beta\|_0 \leqslant Cs$ with probability $1-\Delta_n$, we conclude that
\[
\text{(B.29)} \lesssim_P \sqrt{s\log(p\vee n)}\,(s\log(p\vee n)/n)^{1/4} = o(1),
\]
where $s^3\log^3(p\vee n)\leqslant \delta_n n$ is used.
โ–ก
Appendix C. Auxiliary Technical Results
In this section we collect two auxiliary technical results. Their proofs are given in the supplementary
appendix.
Lemma 2. Let ๐‘ฅ1 , . . . , ๐‘ฅ๐‘› be non-stochastic vectors in โ„๐‘ with max1โฉฝ๐‘–โฉฝ๐‘› โˆฅ๐‘ฅ๐‘– โˆฅโˆž โฉฝ ๐พ๐‘ฅ . Let ๐œ1 , . . . , ๐œ๐‘›
be independent random variables such that Eฬ„[โˆฃ๐œ๐‘– โˆฃ๐‘ž ] < โˆž for some ๐‘ž โฉพ 4. Then with probability at least
1 โˆ’ 8๐œ ,
max โˆฃ(๐”ผ๐‘› โˆ’
1โฉฝ๐‘—โฉฝ๐‘
Eฬ„)[๐‘ฅ2๐‘–๐‘— ๐œ๐‘–2 ]โˆฃ
โฉฝ4
โˆš
log(2๐‘/๐œ ) 2
๐พ๐‘ฅ (Eฬ„[โˆฃ๐œ๐‘– โˆฃ๐‘ž ]/๐œ )4/๐‘ž .
๐‘›
Lemma 3. Let ๐‘‡ = support(๐›ฝ0 ), โˆฃ๐‘‡ โˆฃ = โˆฅ๐›ฝ0 โˆฅ0 โฉฝ ๐‘  and โˆฅ๐›ฝห†๐‘‡ ๐‘ โˆฅ1 โฉฝ cโˆฅ๐›ฝห†๐‘‡ โˆ’ ๐›ฝ0 โˆฅ1 . Moreover, let ๐›ฝห†(2๐‘š)
denote the vector formed by the largest 2๐‘š components of ๐›ฝห† in absolute value and zero in the remaining
components. Then for ๐‘š โฉพ ๐‘  we have that ๐›ฝห†(2๐‘š) satis๏ฌes
\[
\|x_i'(\hat\beta^{(2m)} - \beta_0)\|_{2,n} \leqslant \|x_i'(\hat\beta - \beta_0)\|_{2,n} + \sqrt{\phi_{\max}(m)/m}\;\mathbf{c}\|\hat\beta_T - \beta_0\|_1,
\]
where $\phi_{\max}(m)/m \leqslant 2\phi_{\max}(s)/s$ and $\|\hat\beta_T - \beta_0\|_1 \leqslant \sqrt{s}\,\|x_i'(\hat\beta - \beta_0)\|_{2,n}/\kappa_{\mathbf{c}}$.
References
[1] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random
matrices. Constructive Approximation, 28:253โ€“263, 2008.
[2] A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an
application to eminent domain. Econometrica, 80(6):2369โ€“2430, November 2012.
[3] A. Belloni and V. Chernozhukov. โ„“1 -penalized quantile regression for high dimensional sparse models. Annals of Statistics, 39(1):82โ€“130, 2011.
[4] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli,
19(2):521โ€“547, 2013.
[5] A. Belloni, V. Chernozhukov, and I. Fernandez-Val. Conditional Quantile Processes based on Series or Many Regressors.
ArXiv e-prints, May 2011.
[6] A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment e๏ฌ€ects after selection amongst high-dimensional
controls. ArXiv, 2011.
[7] A. Belloni, V. Chernozhukov, and C. Hansen. Inference for high-dimensional sparse econometric models. Advances in
Economics and Econometrics. 10th World Congress of Econometric Society, held in August 2010, III:245โ€“295, 2013.
[8] A. Belloni, V. Chernozhukov, and K. Kato. Robust inference in high-dimensional sparse quantile regression models.
Working Paper, 2013.
[9] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics,
37(4):1705โ€“1732, 2009.
[10] Victor Chernozhukov and Christian Hansen. Instrumental variable quantile regression: A robust inference approach.
Journal of Econometrics, 142:379โ€“398, 2008.
[11] Xuming He and Qi-Man Shao. On parameters of increasing dimensions. J. Multivariate Anal., 73(1):120โ€“135, 2000.
[12] Bing-Yi Jing, Qi-Man Shao, and Qiying Wang. Self-normalized Cramer-type large deviations for independent random
variables. Ann. Probab., 31(4):2167โ€“2215, 2003.
[13] K. Kato. Group lasso for high dimensional sparse quantile regression models. Preprint, ArXiv, 2011.
[14] Roger Koenker. Quantile regression, volume 38 of Econometric Society Monographs. Cambridge University Press,
Cambridge, 2005.
[15] Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Series in Statistics. Springer,
Berlin, 2008.
[16] Sokbae Lee. E๏ฌƒcient semiparametric estimation of a partially linear quantile regression model. Econometric theory,
19:1โ€“31, 2003.
[17] Hannes Leeb and Benedikt M. Poฬˆtscher. Model selection and inference: facts and ๏ฌction. Economic Theory, 21:21โ€“59,
2005.
[18] Hannes Leeb and Benedikt M. Poฬˆtscher. Sparse estimators and the oracle property, or the return of Hodgesโ€™ estimator.
J. Econometrics, 142(1):201โ€“211, 2008.
[19] J. Neyman. Optimal asymptotic tests of composite statistical hypotheses. In U. Grenander, editor, Probability and
Statistics, the Harold Cramer Volume. New York: John Wiley and Sons, Inc., 1959.
[20] J. Neyman. ๐‘(๐›ผ) tests and their use. Sankhya, 41:1โ€“21, 1979.
[21] J. L. Powell. Censored regression quantiles. Journal of Econometrics, 32:143โ€“155, 1986.
[22] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. ArXiv:1106.1151, 2011.
[23] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267โ€“288, 1996.
[24] S. A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614โ€“645, 2008.
[25] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics, 1996.
[26] Lie Wang. $L_1$ penalized LAD estimator for high dimensional linear regression. ArXiv, 2012.
[27] Cun-Hui Zhang and Stephanie S. Zhang. Con๏ฌdence intervals for low-dimensional parameters with high-dimensional
data. ArXiv.org, (arXiv:1110.2563v1), 2011.
[28] S. Zhou. Restricted eigenvalue conditions on subgaussian matrices. ArXiv:0904.4723v2, 2009.
[Figure 1: four surface plots of the empirical rejection probability rp(0.05) over the axes $R^2$ on $d$ and $R^2$ on $y$, one panel each for "Standard approach", "Proposed Method 1", "Proposed Method 2", and "Ideal".]

Figure 1. The figure displays the empirical rejection probabilities of the nominal 5% level tests of a true hypothesis based on different testing procedures: the top left plot is based on the standard post-model selection procedure based on $\tilde\alpha$, the top right plot is based on the proposed post-model selection procedure based on $\check\alpha$, and the bottom left plot is based on another proposed procedure based on the statistic $L_n$. The results are based on 500 replications for each of the 100 combinations of $R^2$'s in the primary and auxiliary equations in (3.14). Ideally we should observe the 5% rejection rate (of a true null) uniformly across the parameter space (as in the bottom right plot).
[Figure 2: six surface plots over the axes $R^2$ on $d$ and $R^2$ on $y$, showing bias, standard deviation, and RMSE for the "Standard approach" (left column) and the "Proposed Method" (right column).]

Figure 2. The figure displays mean bias (top row), standard deviation (middle row), and root mean square error (bottom row) for the proposed post-model selection estimator $\check\alpha$ (right column) and the standard post-model selection estimator $\tilde\alpha$ (left column). The results are based on 500 replications for each of the 100 combinations of $R^2$'s in the primary and auxiliary equations in (3.14).
Supplementary Appendix for โ€œUniform Post Selection Inference
for LAD Regression Modelsโ€
Appendix D. Auxiliary Results for โ„“1 -LAD and Heteroscedastic Lasso
In this section we state relevant theoretical results on the performance of the estimators $\ell_1$-LAD, Post-$\ell_1$-LAD, heteroscedastic Lasso, and heteroscedastic Post-Lasso. These results were developed in [3] and [2]. The main design condition relies on the restricted eigenvalue proposed in [9], namely, for $\tilde x_i = (d_i, x_i')'$,
\begin{equation}
\kappa_{\mathbf{c}} = \inf_{\|\delta_{T^c}\|_1 \leqslant \mathbf{c}\|\delta_T\|_1} \|\tilde x_i'\delta\|_{2,n}/\|\delta_T\|, \tag{D.31}
\end{equation}
where $\mathbf{c} = (c+1)/(c-1)$ for the slack constant $c>1$, see [9]. It is well known that Condition SE implies that $\kappa_{\mathbf{c}}$ is bounded away from zero if $\mathbf{c}$ is bounded, for any subset $T\subset\{1,\dots,p\}$ with $|T|\leqslant s$.
D.1. $\ell_1$-Penalized LAD. For a data generating process such that $\mathrm{P}(y_i \leqslant \tilde x_i'\eta_0 \mid \tilde x_i) = 1/2$, independent across $i$ $(i=1,\dots,n)$, we consider the estimation of $\eta_0$ via the $\ell_1$-penalized LAD regression estimate
\[
\hat\eta \in \arg\min_{\eta}\ \mathbb{E}_n[|y_i - \tilde x_i'\eta|] + \frac{\lambda}{n}\|\eta\|_1.
\]
As established in [3] and [26], under the event that
\begin{equation}
\frac{\lambda}{n} \geqslant 2c\,\|\mathbb{E}_n[(1/2 - 1\{y_i\leqslant \tilde x_i'\eta_0\})\tilde x_i]\|_\infty, \tag{D.32}
\end{equation}
the estimator above achieves good theoretical guarantees under mild design conditions. Although $\eta_0$ is unknown, we can set $\lambda$ so that the event in (D.32) holds with high probability. In particular, the pivotal rule discussed in [3] proposes to set $\lambda = c'n\Lambda(1-\gamma\mid\tilde x)$ for $c'>c$ and $\gamma\to 0$, where
\begin{equation}
\Lambda(1-\gamma\mid\tilde x) := (1-\gamma)\text{-quantile of } 2\|\mathbb{E}_n[(1/2 - 1\{U_i\leqslant 1/2\})\tilde x_i]\|_\infty, \tag{D.33}
\end{equation}
and where $U_i$ are independent uniform random variables on $(0,1)$, independent of $\tilde x_1,\dots,\tilde x_n$. We suggest $\gamma = 0.1/\log n$ and $c' = 1.1c$. This quantity can be easily approximated via simulations. Below we summarize the required regularity conditions.
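Before stating them, we note that the simulation approximation of $\Lambda(1-\gamma\mid\tilde x)$ is straightforward to code. The sketch below (our own illustration, with the number of replications `n_sims` and the suggested $\gamma = 0.1/\log n$ left to the caller) returns the corresponding penalty level $\lambda = c'n\Lambda(1-\gamma\mid\tilde x)$.
\begin{verbatim}
import numpy as np

def lambda_pivotal_lad(X_tilde, gamma, c_prime=1.1, n_sims=5000, seed=0):
    """Simulate Lambda(1 - gamma | x_tilde) in (D.33) and return lambda = c' * n * Lambda.
    c_prime should exceed the slack constant c; the text suggests c' = 1.1 c."""
    rng = np.random.default_rng(seed)
    n = X_tilde.shape[0]
    stats = np.empty(n_sims)
    for b in range(n_sims):
        U = rng.random(n)                                   # U_i ~ Uniform(0, 1)
        score = (0.5 - (U <= 0.5))[:, None] * X_tilde       # (1/2 - 1{U_i <= 1/2}) x_tilde_i
        stats[b] = 2.0 * np.abs(score.mean(axis=0)).max()   # 2 * || E_n[...] ||_inf
    Lambda = np.quantile(stats, 1.0 - gamma)
    return c_prime * n * Lambda

# Example: lam = lambda_pivotal_lad(X_tilde, gamma=0.1 / np.log(n))
\end{verbatim}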
Condition PLAD. Assume that $\|\eta_0\|_0 = s \geqslant 1$, $\mathbb{E}_n[\tilde x_{ij}^2] = 1$ for all $1\leqslant j\leqslant p$, the conditional density of $y_i$ given $d_i$, denoted by $f_i(\cdot)$, and its derivative are bounded by $\bar f$ and $\bar f'$, respectively, and $f_i(\tilde x_i'\eta_0) \geqslant f > 0$ is bounded away from zero uniformly in $n$.
Condition PLAD is implied by Condition I. The assumption on the conditional density is standard in
the quantile regression literature even with ๏ฌxed ๐‘ or ๐‘ increasing slower than ๐‘› (see respectively [14]
and [5]). Next we present bounds on the prediction norm of the โ„“1 -LAD estimator.
Lemma 4 (Estimation Error of $\ell_1$-LAD). Under Condition PLAD, and using $\lambda = c'n\Lambda(1-\gamma\mid\tilde x)$, we have with probability $1-2\gamma-o(1)$ for $n$ large enough
\[
\|\tilde x_i'(\hat\eta - \eta_0)\|_{2,n} \lesssim \frac{\lambda\sqrt{s}}{n\kappa_{\mathbf{c}}} + \frac{1}{\kappa_{\mathbf{c}}}\sqrt{\frac{s\log(p/\gamma)}{n}},
\]
provided
\[
\left\{\frac{n\kappa_{\mathbf{c}}}{\lambda\sqrt{s}} + \frac{n\kappa_{\mathbf{c}}}{\sqrt{sn\log([p\vee n]/\gamma)}}\right\}\frac{f}{\bar f\,\bar f'}\,\inf_{\delta\in\Delta_{\mathbf{c}}}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]} \to \infty.
\]
Lemma 4 establishes the rate of convergence in the prediction norm for the โ„“1 -LAD estimator in
a parametric setting. The extra growth condition required for identi๏ฌcation is mild. For instance we
typically have $\lambda \lesssim \sqrt{n\log(n\vee p)}$ and for many designs of interest we have $\inf_{\delta\in\Delta_{\mathbf{c}}}\|\tilde x_i'\delta\|_{2,n}^3/\mathbb{E}_n[|\tilde x_i'\delta|^3]$ bounded away from zero (see [3]). For more general designs we have
\[
\inf_{\delta\in\Delta_{\mathbf{c}}}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]}
\geqslant \inf_{\delta\in\Delta_{\mathbf{c}}}\frac{\|\tilde x_i'\delta\|_{2,n}}{\|\delta\|_1\max_{i\leqslant n}\|\tilde x_i\|_\infty}
\geqslant \frac{\kappa_{\mathbf{c}}}{\sqrt{s}\,(1+\mathbf{c})\max_{i\leqslant n}\|\tilde x_i\|_\infty},
\]
which implies the extra growth condition under $K_x^2 s^2\log(p\vee n)\leqslant \delta_n\kappa_{\mathbf{c}}^2 n$.
In order to alleviate the bias introduced by the $\ell_1$-penalty, we can consider the post-model selection estimate associated with a selected support $\hat T$:
\begin{equation}
\tilde\eta \in \arg\min_{\eta}\ \big\{\mathbb{E}_n[|y_i - \tilde x_i'\eta|] : \eta_j = 0 \text{ if } j\notin\hat T\big\}. \tag{D.34}
\end{equation}
The following result characterizes the performance of the estimator in (D.34), see [3] for the proof.
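Before stating that result, here is a minimal computational sketch of the refit in (D.34) (our own illustration: `eta_hat` is assumed to come from an $\ell_1$-penalized LAD fit, and statsmodels' `QuantReg` at the median is used as a generic LAD solver).
\begin{verbatim}
import numpy as np
import statsmodels.api as sm

def post_l1_lad(y, X_tilde, eta_hat, tol=1e-8):
    """Refit LAD (median regression) using only the covariates selected by l1-LAD."""
    support = np.flatnonzero(np.abs(eta_hat) > tol)        # T_hat: selected components
    fit = sm.QuantReg(y, X_tilde[:, support]).fit(q=0.5)   # unpenalized median regression on T_hat
    eta_tilde = np.zeros_like(eta_hat)
    eta_tilde[support] = fit.params
    return eta_tilde
\end{verbatim}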
Lemma 5 (Estimation Error of Post-$\ell_1$-LAD). Assume the conditions of Lemma 4 hold, $\mathrm{support}(\hat\eta)\subseteq\hat T$, and let $\hat s = |\hat T|$. Then we have for $n$ large enough
\[
\|\tilde x_i'(\tilde\eta - \eta_0)\|_{2,n} \lesssim_P \sqrt{\frac{(\hat s+s)\log(n\vee p)}{n\,\phi_{\min}(\hat s+s)}} + \frac{\lambda\sqrt{s}}{n\kappa_{\mathbf{c}}} + \frac{1}{\kappa_{\mathbf{c}}}\sqrt{\frac{s\log(p/\gamma)}{n}},
\]
provided
\[
\left\{\frac{n\sqrt{\phi_{\min}(\hat s+s)}}{\lambda\sqrt{s}} + \frac{n\sqrt{\phi_{\min}(\hat s+s)}}{\sqrt{sn\log([p\vee n]/\gamma)}}\right\}\frac{f}{\bar f\,\bar f'}\,\inf_{\|\delta\|_0\leqslant\hat s+s}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]} \to_P \infty.
\]
Lemma 5 provides the rate of convergence in the prediction norm for the post-model-selection estimator despite possible imperfect model selection. The rates rely on the overall quality of the selected model (which is at least as good as the model selected by $\ell_1$-LAD) and the overall number of components $\hat s$. Once again the extra growth condition required for identification is mild. For more general designs we have
\[
\inf_{\|\delta\|_0\leqslant\hat s+s}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]}
\geqslant \inf_{\|\delta\|_0\leqslant\hat s+s}\frac{\|\tilde x_i'\delta\|_{2,n}}{\|\delta\|_1\max_{i\leqslant n}\|\tilde x_i\|_\infty}
\geqslant \frac{\sqrt{\phi_{\min}(\hat s+s)}}{\sqrt{\hat s+s}\,\max_{i\leqslant n}\|\tilde x_i\|_\infty}.
\]
Comment D.1. In Step 1 of Algorithm 2 we use $\ell_1$-LAD with $\tilde x_i = (d_i,x_i')'$, $\hat\delta := \hat\eta - \eta_0 = (\hat\alpha-\alpha_0, \hat\beta'-\beta_0')'$, and we are interested in rates for $\|x_i'(\hat\beta-\beta_0)\|_{2,n}$ instead of $\|\tilde x_i'\hat\delta\|_{2,n}$. However, it follows that
\[
\|x_i'(\hat\beta-\beta_0)\|_{2,n} \leqslant \|\tilde x_i'\hat\delta\|_{2,n} + |\hat\alpha-\alpha_0|\cdot\|d_i\|_{2,n}.
\]
Since $s\geqslant 1$, without loss of generality we can assume that the component associated with the treatment $d_i$ belongs to $T$ (at the cost of increasing the cardinality of $T$ by one, which will not affect the rate of convergence). Therefore we have that
\[
|\hat\alpha-\alpha_0| \leqslant \|\hat\delta_T\| \leqslant \|\tilde x_i'\hat\delta\|_{2,n}/\kappa_{\mathbf{c}}.
\]
In most applications of interest $\|d_i\|_{2,n}$ and $1/\kappa_{\mathbf{c}}$ are bounded from above with high probability. Similarly, in Step 1 of Algorithm 1 the Post-$\ell_1$-LAD estimator satisfies
\[
\|x_i'(\tilde\beta-\beta_0)\|_{2,n} \leqslant \|\tilde x_i'\tilde\delta\|_{2,n}\Big(1 + \|d_i\|_{2,n}\big/\sqrt{\phi_{\min}(\hat s+s)}\Big).
\]
D.2. Heteroscedastic Lasso. In this section we consider the equation (1.4) in the form
\begin{equation}
d_i = x_i'\theta_0 + v_i, \qquad \mathrm{E}[v_i] = 0, \tag{D.35}
\end{equation}
where we observe $\{(d_i,x_i')'\}_{i=1}^n$, $(x_i)_{i=1}^n$ are non-stochastic and normalized in such a way that $\mathbb{E}_n[x_{ij}^2]=1$ for all $1\leqslant j\leqslant p$, and $(v_i)_{i=1}^n$ are independent across $i$ but not necessarily identically distributed. The unknown support of $\theta_0$ is denoted by $T_d$ and it satisfies $|T_d|\leqslant s$. To estimate $\theta_0$ and consequently $v_i$, we compute
\begin{equation}
\hat\theta \in \arg\min_{\theta}\ \mathbb{E}_n[(d_i - x_i'\theta)^2] + \frac{\lambda}{n}\|\hat\Gamma\theta\|_1 \quad\text{and set}\quad \hat v_i = d_i - x_i'\hat\theta,\ \ i=1,\dots,n, \tag{D.36}
\end{equation}
where $\lambda$ and $\hat\Gamma$ are the associated penalty level and loadings, which are potentially data-driven. In this case the following regularization event plays an important role:
\begin{equation}
\frac{\lambda}{n} \geqslant 2c\,\|\hat\Gamma^{-1}\mathbb{E}_n[x_i(d_i - x_i'\theta_0)]\|_\infty. \tag{D.37}
\end{equation}
As discussed in [9], [4] and [2], the event above implies that the estimator $\hat\theta$ satisfies $\|\hat\theta_{T_d^c}\|_1 \leqslant \mathbf{c}\|\hat\theta_{T_d}-\theta_0\|_1$, where $\mathbf{c} = (c+1)/(c-1)$. Thus rates of convergence for $\hat\theta$ and $\hat v_i$ defined in (D.36) can be established based on the restricted eigenvalue $\kappa_{\mathbf{c}}$ defined in (D.31) with $\tilde x_i = x_i$ and $T = T_d$.
The following are su๏ฌƒcient high-level conditions where again the sequences ฮ”๐‘› and ๐›ฟ๐‘› go to zero and
๐ถ is a positive constant independent of ๐‘›.
Condition HL. For the model (D.35), suppose that for $s = s_n \geqslant 1$ we have $\|\theta_0\|_0 \leqslant s$ and
(i) $\max_{1\leqslant j\leqslant p}(\bar{\mathrm{E}}[|x_{ij}v_i|^3])^{1/3}/(\bar{\mathrm{E}}[|x_{ij}v_i|^2])^{1/2} \leqslant C$ and $\Phi^{-1}(1-\gamma/2p) \leqslant \delta_n n^{1/3}$,
(ii) $\max_{1\leqslant j\leqslant p}|(\mathbb{E}_n-\bar{\mathrm{E}})[x_{ij}^2 v_i^2]| + \max_{1\leqslant j\leqslant p}|(\mathbb{E}_n-\bar{\mathrm{E}})[x_{ij}^2 d_i^2]| \leqslant \delta_n$, with probability $1-\Delta_n$.
Condition HL is implied by Condition I and growth conditions (see Lemma 2). Several primitive moment conditions imply the various cross-moment bounds. These conditions also allow us to invoke moderate deviation theorems for self-normalized sums from [12] to bound some important error components. Despite the heteroscedastic, non-Gaussian noise, those results allow a sharp choice of penalty level and loadings, which was analyzed in [2] and is summarized as follows.
Valid options for setting the penalty level and the loadings, for $j=1,\dots,p$, are
\begin{equation}
\text{initial: } \hat\gamma_j = \sqrt{\mathbb{E}_n[x_{ij}^2(d_i-\bar d)^2]}, \qquad
\text{refined: } \hat\gamma_j = \sqrt{\mathbb{E}_n[x_{ij}^2\hat v_i^2]}, \qquad
\lambda = 2c\sqrt{n}\,\Phi^{-1}(1-\gamma/(2p)), \tag{D.38}
\end{equation}
where $c>1$ is a constant, $\gamma\in(0,1)$, $\bar d := \mathbb{E}_n[d_i]$ and $\hat v_i$ is an estimate of $v_i$ based on Lasso with the initial option (or iterations). [2] established that using either of the choices in (D.38) implies that the regularization event (D.37) holds with high probability. Next we present results on the performance of the estimators generated by Lasso.
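Before turning to those results, a minimal computational sketch of the choices in (D.38) follows (our own illustration: the weighted $\ell_1$ penalty is implemented by rescaling the columns of $X$ and calling scikit-learn's plain Lasso, which is one standard reformulation; it is not the exact algorithm analyzed in [2]).
\begin{verbatim}
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def hetero_lasso(d, X, gamma, c=1.1, refine_steps=1):
    """Lasso of d on X with penalty level and loadings in the spirit of (D.38).
    Loadings are absorbed by rescaling the columns of X."""
    n, p = X.shape
    lam = 2.0 * c * np.sqrt(n) * norm.ppf(1.0 - gamma / (2.0 * p))   # penalty level
    resid = d - d.mean()                                             # initial option uses d_i - d_bar
    for _ in range(1 + refine_steps):
        loadings = np.sqrt(np.mean((X ** 2) * (resid ** 2)[:, None], axis=0))  # gamma_hat_j
        Xw = X / loadings                                            # absorb loadings
        fit = Lasso(alpha=lam / (2.0 * n), fit_intercept=False).fit(Xw, d)
        theta_hat = fit.coef_ / loadings
        resid = d - X @ theta_hat                                    # refined option uses v_hat
    return theta_hat, resid, loadings
\end{verbatim}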
Lemma 6. Under Condition HL and setting $\lambda = 2c'\sqrt{n}\,\Phi^{-1}(1-\gamma/2p)$ for $c'>c>1$, and using penalty loadings as in (D.38), there is a uniformly bounded $\mathbf{c}$ such that we have
\[
\|\hat v_i - v_i\|_{2,n} = \|x_i'(\hat\theta-\theta_0)\|_{2,n} \lesssim_P \frac{\lambda\sqrt{s}}{n\kappa_{\mathbf{c}}}
\quad\text{and}\quad
\|\hat v_i - v_i\|_\infty \leqslant \|\hat\theta - \theta_0\|_1 \max_{i\leqslant n}\|x_i\|_\infty.
\]
Associated with Lasso we can define the Post-Lasso estimator as
\begin{equation}
\tilde\theta \in \arg\min_{\theta}\ \big\{\mathbb{E}_n[(d_i - x_i'\theta)^2] : \theta_j = 0 \text{ if } \hat\theta_j = 0\big\} \quad\text{and set}\quad \tilde v_i = d_i - x_i'\tilde\theta. \tag{D.39}
\end{equation}
That is, the Post-Lasso estimator is simply the least squares estimator applied to the covariates selected
by Lasso in (D.36). Sparsity properties of the Lasso estimator $\hat\theta$ under estimated weights follow similarly
to the standard Lasso analysis derived in [2]. By combining such sparsity properties and the rates in the
prediction norm we can establish rates for the post-model selection estimator under estimated weights.
The following result summarizes the properties of the Post-Lasso estimator.
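As an illustration of (D.39), the following sketch refits ordinary least squares on the support selected by the Lasso step (our own illustration; `theta_hat` is assumed to come from a weighted Lasso fit such as the sketch above).
\begin{verbatim}
import numpy as np

def post_lasso(d, X, theta_hat, tol=1e-10):
    """Post-Lasso: least squares of d on the covariates selected by Lasso."""
    support = np.flatnonzero(np.abs(theta_hat) > tol)   # T_hat_d
    theta_tilde = np.zeros_like(theta_hat)
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], d, rcond=None)
        theta_tilde[support] = coef
    v_tilde = d - X @ theta_tilde                        # residuals used as instruments
    return theta_tilde, v_tilde
\end{verbatim}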
Lemma 7 (Model Selection Properties of Lasso and Properties of Post-Lasso). Suppose that Conditions
HL and SE hold. Consider the Lasso estimator with penalty level and loadings speci๏ฌed as in Lemma 6.
Then the data-dependent model $\hat T_d$ selected by the Lasso estimator $\hat\theta$ satisfies with probability $1-\Delta_n$:
\begin{equation}
\|\tilde\theta\|_0 = |\hat T_d| \lesssim s. \tag{D.40}
\end{equation}
Moreover, the Post-Lasso estimator obeys
\[
\|\tilde v_i - v_i\|_{2,n} = \|x_i'(\tilde\theta - \theta_0)\|_{2,n} \lesssim_P \sqrt{\frac{s\log(p\vee n)}{n}}.
\]
Appendix E. Alternative Implementation via Double Selection
An alternative proposal for the method is reminiscent of the double selection method proposed in [6]
for partial linear models. This version replaces Step 3 with a LAD regression of ๐‘ฆ on ๐‘‘ and all covariates
selected in Steps 1 and 2 (i.e. the union of the selected sets). The method is described as follows:
Algorithm 3. (A Double Selection Method)

Step 1. Run Post-$\ell_1$-LAD of $y_i$ on $d_i$ and $x_i$:
\[
(\hat\alpha,\hat\beta) \in \arg\min_{\alpha,\beta}\ \mathbb{E}_n[|y_i - d_i\alpha - x_i'\beta|] + \frac{\lambda_1}{n}\|(\alpha,\beta)\|_1.
\]
Step 2. Run Heteroscedastic Lasso of $d_i$ on $x_i$:
\[
\hat\theta \in \arg\min_{\theta}\ \mathbb{E}_n[(d_i - x_i'\theta)^2] + \frac{\lambda_2}{n}\|\hat\Gamma\theta\|_1.
\]
Step 3. Run LAD regression of $y_i$ on $d_i$ and the covariates selected in Steps 1 and 2:
\[
(\check\alpha,\check\beta) \in \arg\min_{\alpha,\beta}\ \big\{\mathbb{E}_n[|y_i - d_i\alpha - x_i'\beta|] : \mathrm{support}(\beta) \subseteq \mathrm{support}(\hat\beta)\cup\mathrm{support}(\hat\theta)\big\}.
\]
The double selection algorithm has three steps: (1) select covariates based on the standard โ„“1 -LAD
regression, (2) select covariates based on heteroscedastic Lasso of the treatment equation, and (3) run a
LAD regression with the treatment and all selected covariates.
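A minimal end-to-end sketch of Algorithm 3 is given below (our own illustration: scikit-learn's `QuantileRegressor` is used as a generic $\ell_1$-penalized LAD solver and plain `Lasso` as a stand-in for the heteroscedastic Lasso of Step 2; the penalty levels `alpha1` and `alpha2` are treated as given tuning constants rather than the data-driven choices described above).
\begin{verbatim}
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso, QuantileRegressor

def double_selection_lad(y, d, X, alpha1, alpha2, tol=1e-8):
    """Double selection: select controls by l1-LAD of y on (d, X) and by Lasso of d on X,
    then run unpenalized LAD of y on d and the union of the selected controls."""
    # Step 1: l1-penalized LAD (median) regression of y on d and X
    Z = np.column_stack([d, X])
    step1 = QuantileRegressor(quantile=0.5, alpha=alpha1, fit_intercept=False).fit(Z, y)
    sel1 = np.flatnonzero(np.abs(step1.coef_[1:]) > tol)   # selected controls (exclude d)
    # Step 2: Lasso of d on X
    step2 = Lasso(alpha=alpha2, fit_intercept=False).fit(X, d)
    sel2 = np.flatnonzero(np.abs(step2.coef_) > tol)
    # Step 3: LAD of y on d and the union of the selected controls
    union = np.union1d(sel1, sel2)
    W = np.column_stack([d, X[:, union]])
    step3 = sm.QuantReg(y, W).fit(q=0.5)
    alpha_check = step3.params[0]
    return alpha_check, union
\end{verbatim}
The returned `alpha_check` is the Step 3 LAD coefficient on the treatment; in practice the data-driven penalty rules of Appendix D would replace the fixed `alpha1` and `alpha2`.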
This approach can also be analyzed through Lemma 1 since it creates instruments implicitly. To see
that let $\hat T^*$ denote the variables selected in Steps 1 and 2: $\hat T^* = \mathrm{support}(\hat\beta)\cup\mathrm{support}(\hat\theta)$. By the first order conditions for $(\check\alpha,\check\beta)$ we have
\[
\big\|\mathbb{E}_n[\varphi(y_i - d_i\check\alpha - x_i'\check\beta)(d_i, x_{i\hat T^*}')']\big\| = O\Big\{\big(\max_{1\leqslant i\leqslant n}|d_i| + K_x|\hat T^*|^{1/2}\big)(1+|\hat T^*|)/n\Big\},
\]
which creates an orthogonal relation to any linear combination of $(d_i, x_{i\hat T^*}')'$. In particular, by taking the linear combination $(d_i, x_{i\hat T^*}')(1,-\tilde\theta_{\hat T^*}')' = d_i - x_{i\hat T^*}'\tilde\theta_{\hat T^*} = d_i - x_i'\tilde\theta = \hat z_i$, which is the instrument in Step 2 of Algorithm 1, we have
\[
\mathbb{E}_n[\varphi(y_i - d_i\check\alpha - x_i'\check\beta)\hat z_i] = O\Big\{\|(1,-\tilde\theta')'\|\big(\max_{1\leqslant i\leqslant n}|d_i| + K_x|\hat T^*|^{1/2}\big)(1+|\hat T^*|)/n\Big\}.
\]
As soon as the right side is $o_P(n^{-1/2})$, the double selection estimator $\check\alpha$ approximately minimizes
\[
\tilde L_n(\alpha) = \frac{|\mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\check\beta)\hat z_i]|^2}{\mathbb{E}_n[\{\varphi(y_i - d_i\check\alpha - x_i'\check\beta)\}^2\hat z_i^2]},
\]
where $\hat z_i$ is the instrument created by Step 2 of Algorithm 1. Thus the double selection estimator can be seen as an iterated version of the method based on instruments, where the Step 1 estimate $\tilde\beta$ is updated with $\check\beta$.
Appendix F. Proof of Theorem 2
Proof of Theorem 2. We will verify Condition ILAD and the desired result then follows from Lemma 1. The
assumptions on the error density ๐‘“๐œ– (โ‹…) in Condition ILAD(i) are assumed in Condition I(iv). The moment
conditions on ๐‘‘๐‘– and ๐‘ฃ๐‘– in Condition ILAD(i) are assumed in Condition I(ii).
Condition SE implies that ๐œ…c is bounded away from zero with probability 1 โˆ’ ฮ”๐‘› for ๐‘› su๏ฌƒciently
large, see [9]. Step 1 relies on โ„“1 -LAD. Condition PLAD is implied by Condition I. By Lemma 4 and
Comment D.1 we have
\[
\|x_i'(\hat\beta - \beta_0)\|_{2,n} \lesssim_P \sqrt{s\log(n\vee p)/n}
\quad\text{and}\quad
|\hat\alpha - \alpha_0| \lesssim_P \sqrt{s\log(p\vee n)/n} \lesssim o(1)\log^{-1} n
\]
because ๐‘ 3 log3 (๐‘› โˆจ ๐‘) โฉฝ ๐›ฟ๐‘› ๐‘› and the required side condition holds. Indeed, without loss of generality
assume that ๐‘‡ contains the treatment so that for ๐‘ฅ
หœ๐‘– = (๐‘‘๐‘– , ๐‘ฅโ€ฒ๐‘– )โ€ฒ , ๐›ฟ = (๐›ฟ๐‘‘ , ๐›ฟ๐‘ฅโ€ฒ )โ€ฒ , because of Condition SE
and the fact that ๐”ผ๐‘› [โˆฃ๐‘‘๐‘– โˆฃ3 ] โ‰ฒ๐‘ƒ Eฬ„[โˆฃ๐‘‘๐‘– โˆฃ3 ] = ๐‘‚(1), we have
inf ๐›ฟโˆˆฮ”c
โˆฅหœ
๐‘ฅโ€ฒ๐‘– ๐›ฟโˆฅ32,๐‘›
๐”ผ๐‘› [โˆฃหœ
๐‘ฅโ€ฒ๐‘– ๐›ฟโˆฃ3 ]
โˆฅหœ
๐‘ฅโ€ฒ๐‘– ๐›ฟโˆฅ22,๐‘› โˆฅ๐›ฟ๐‘‡ โˆฅ๐œ…c
โˆฅหœ
๐‘ฅโ€ฒ๐‘– ๐›ฟโˆฅ22,๐‘› โˆฅ๐›ฟ๐‘‡ โˆฅ๐œ…c
4๐”ผ๐‘› [โˆฃ๐‘ฅโ€ฒ๐‘– ๐›ฟ๐‘ฅ โˆฃ3 ]+4๐”ผ๐‘› [โˆฃ๐‘‘๐‘– ๐›ฟ๐‘‘ โˆฃ3 ] โฉพ inf ๐›ฟโˆˆฮ”c 4๐พ๐‘ฅ โˆฅ๐›ฟ๐‘ฅ โˆฅ1 ๐”ผ๐‘› [โˆฃ๐‘ฅโ€ฒ๐‘– ๐›ฟ๐‘ฅ โˆฃ2 ]+4โˆฃ๐›ฟ๐‘‘ โˆฃ3 ๐”ผ๐‘› [โˆฃ๐‘‘๐‘– โˆฃ3 ]
โˆฅหœ
๐‘ฅโ€ฒ๐‘– ๐›ฟโˆฅ22,๐‘› โˆฅ๐›ฟ๐‘‡ โˆฅ๐œ…c
inf ๐›ฟโˆˆฮ”c 4๐พ๐‘ฅ โˆฅ๐›ฟ๐‘ฅ โˆฅ1 {โˆฅหœ๐‘ฅโ€ฒ ๐›ฟโˆฅ2,๐‘› +โˆฅ๐›ฟ
2
2
3
๐‘‘ ๐‘‘๐‘– โˆฅ2,๐‘› } +4โˆฃ๐›ฟ๐‘‘ โˆฃ ๐”ผ
๐‘–
โˆš๐‘› [โˆฃ๐‘‘๐‘– โˆฃ ]โˆฅ๐›ฟ๐‘‡ โˆฅ1
โˆฅหœ
๐‘ฅโ€ฒ๐‘– ๐›ฟโˆฅ22,๐‘› โˆฅ๐›ฟ๐‘‡ โˆฅ1 ๐œ…c / ๐‘ 
inf ๐›ฟโˆˆฮ”c 8๐พ๐‘ฅ (1+c)โˆฅ๐›ฟ๐‘‡ โˆฅ1 โˆฅหœ๐‘ฅโ€ฒ ๐›ฟโˆฅ2 +8๐พ๐‘ฅ (1+c)โˆฅ๐›ฟ๐‘‡ โˆฅ1 โˆฃ๐›ฟ๐‘‘ โˆฃ2 {โˆฅ๐‘‘๐‘– โˆฅ2 +๐”ผ๐‘› [โˆฃ๐‘‘๐‘– โˆฃ3 ]}
2,๐‘›
2,๐‘›
๐‘–
โˆš
๐œ…c / ๐‘ 
โˆš1 .
โ‰ณ
2
๐‘ƒ
2
3
2
8๐พ๐‘ฅ (1+c){1+โˆฅ๐‘‘๐‘– โˆฅ2,๐‘› /๐œ…c +๐”ผ๐‘› [โˆฃ๐‘‘๐‘– โˆฃ ]/๐œ…c }
๐‘ ๐พ๐‘ฅ
โฉพ inf ๐›ฟโˆˆฮ”c
โฉพ
โฉพ
โฉพ
Therefore, since $\lambda \lesssim \sqrt{n\log(p\vee n)}$ we have
\begin{equation}
\left(\frac{n\kappa_{\mathbf{c}}}{\lambda\sqrt{s}} + \frac{n}{\sqrt{sn\log(p\vee n)}}\right)\inf_{\delta\in\Delta_{\mathbf{c}}}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]} \gtrsim_P \frac{\sqrt{n}}{K_x s\sqrt{\log(p\vee n)}} \to_P \infty \tag{F.42}
\end{equation}
under ๐พ๐‘ฅ2 ๐‘ 2 log2 (๐‘ โˆจ ๐‘›) โฉฝ ๐›ฟ๐‘› ๐‘›. Note that the rate for ๐›ผ
ห† and the de๏ฌnition of ๐’œ implies {๐›ผ : โˆฃ๐›ผ โˆ’ ๐›ผ0 โˆฃ โฉฝ
โˆ’1/2
๐‘›
log ๐‘›} โŠ‚ ๐’œ (with probability 1 โˆ’ ๐‘œ(1)) which is required in ILAD(ii). Moreover, by the (shrinking)
de๏ฌnition of ๐’œ we have the initial rate of ILAD(iv). Step 2 relies on Lasso. Condition HL is implied by
Condition I and Lemma 2 applied twice with ๐œ๐‘– = ๐‘ฃ๐‘– and ๐œ๐‘– = ๐‘‘๐‘– under the condition that ๐พ๐‘ฅ4 log ๐‘ โฉฝ ๐›ฟ๐‘› ๐‘›.
โˆš
ห† 0 โ‰ฒ ๐‘  with
By Lemma 6 we have โˆฅ๐‘ฅโ€ฒ (๐œƒห† โˆ’ ๐œƒ0 )โˆฅ2,๐‘› โ‰ฒ๐‘ƒ ๐‘  log(๐‘› โˆจ ๐‘)/๐‘›. Moreover, by Lemma 7 we have โˆฅ๐œƒโˆฅ
๐‘–
probability 1 โˆ’ ๐‘œ(1).
The rates established above for $\hat\theta$ and $\hat\beta$ imply (A.25) in ILAD(iii) since by Condition I(ii) $\mathrm{E}[|v_i|]\leqslant(\mathrm{E}[v_i^2])^{1/2}=O(1)$ and $\max_{1\leqslant i\leqslant n}|x_i'(\hat\theta-\theta_0)|\lesssim_P K_x\sqrt{s^2\log(p\vee n)/n}=o(1)$.
To verify Condition ILAD(iii) (A.26), arguing as in the proof of Theorem 1, we can deduce that
sup โˆฃ๐”พ๐‘› (๐œ“๐›ผ,๐›ฝ,
ห† ๐œƒห† โˆ’ ๐œ“๐›ผ,๐›ฝ0 ,๐œƒ0 )โˆฃ = ๐‘œ๐‘ƒ (1).
๐›ผโˆˆ๐’œ
This completes the proof.
โ–ก
Appendix G. Proof of Auxiliary Technical Results
Proof of Lemma 2. We shall use Lemma 8 ahead. Let $z_i = (x_i,\zeta_i)$ and define $\mathcal{F} = \{f_j(x_i,\zeta_i) = x_{ij}^2\zeta_i^2 : j=1,\dots,p\}$. Since $\mathrm{P}(|X|>t)\leqslant \mathrm{E}[|X|^k]/t^k$, for $k=2$ we have that $\mathrm{median}(|X|)\leqslant\sqrt{2\mathrm{E}[|X|^2]}$, and for $k=q/4$ we have that the $(1-\tau)$-quantile of $|X|$ is bounded by $(\mathrm{E}[|X|^{q/4}]/\tau)^{4/q}$. Then we have
\[
\max_{f\in\mathcal{F}}\ \mathrm{median}\,(|\mathbb{G}_n(f(x_i,\zeta_i))|) \leqslant \sqrt{2\bar{\mathrm{E}}[x_{ij}^4\zeta_i^4]} \leqslant K_x^2\sqrt{2\bar{\mathrm{E}}[\zeta_i^4]}
\]
and
\[
(1-\tau)\text{-quantile of } \max_{j\leqslant p}\sqrt{\mathbb{E}_n[x_{ij}^4\zeta_i^4]} \leqslant (1-\tau)\text{-quantile of } K_x^2\sqrt{\mathbb{E}_n[\zeta_i^4]} \leqslant K_x^2(\bar{\mathrm{E}}[|\zeta_i|^q]/\tau)^{4/q}.
\]
The conclusion follows from Lemma 8.
โ–ก
Proof of Lemma 3. By the triangle inequality we have
\[
\|x_i'(\hat\beta^{(2m)} - \beta_0)\|_{2,n} \leqslant \|x_i'(\hat\beta - \beta_0)\|_{2,n} + \|x_i'(\hat\beta^{(2m)} - \hat\beta)\|_{2,n}.
\]
Now let $T^1$ denote the $m$ largest components of $\hat\beta$ and let $T^k$ correspond to the $m$ largest components of $\hat\beta$ outside $\cup_{d=1}^{k-1}T^d$. It follows that $\hat\beta^{(2m)} = \hat\beta_{T^1\cup T^2}$.
Next note that for $k\geqslant 3$ we have $\|\hat\beta_{T^{k+1}}\| \leqslant \|\hat\beta_{T^k}\|_1/\sqrt{m}$. Indeed, consider the problem $\max\{\|v\|/\|u\|_1 : v,u\in\mathbb{R}^m,\ \max_i|v_i|\leqslant\min_i|u_i|\}$. Given a $v$ and $u$ we can always increase the objective function by using $\tilde v = \max_i|v_i|(1,\dots,1)'$ and $\tilde u = \min_i|u_i|(1,\dots,1)'$ instead. Thus the maximum is achieved at $v^* = u^* = (1,\dots,1)'$, yielding $1/\sqrt{m}$.
Thus, by $\|\hat\beta_{T^c}\|_1 \leqslant \mathbf{c}\|\delta_T\|_1$ for $\delta = \hat\beta - \beta_0$ and $|T| = s$, we have
\begin{align*}
\|x_i'(\hat\beta^{(2m)} - \hat\beta)\|_{2,n}
&= \Big\|x_i'\sum_{k\geqslant 3}\hat\beta_{T^k}\Big\|_{2,n}
\leqslant \sum_{k\geqslant 3}\|x_i'\hat\beta_{T^k}\|_{2,n}
\leqslant \sqrt{\phi_{\max}(m)}\sum_{k\geqslant 3}\|\hat\beta_{T^k}\|\\
&\leqslant \sqrt{\phi_{\max}(m)}\sum_{k\geqslant 2}\frac{\|\hat\beta_{T^k}\|_1}{\sqrt{m}}
\leqslant \sqrt{\phi_{\max}(m)}\frac{\|\hat\beta_{(T^1)^c}\|_1}{\sqrt{m}}
\leqslant \sqrt{\phi_{\max}(m)}\frac{\|\hat\beta_{T^c}\|_1}{\sqrt{m}}
\leqslant \sqrt{\phi_{\max}(m)}\,\mathbf{c}\frac{\|\delta_T\|_1}{\sqrt{m}}.
\end{align*}
โ–ก
Appendix H. Auxiliary Probabilistic Inequalities
Let ๐‘1 , . . . , ๐‘๐‘› be independent random variables taking values in a measurable space (๐‘†, ๐’ฎ), and
โˆ‘๐‘›
consider an empirical process ๐”พ๐‘› (๐‘“ ) = ๐‘›โˆ’1/2 ๐‘–=1 {๐‘“ (๐‘๐‘– ) โˆ’ E[๐‘“ (๐‘๐‘– )]} indexed by a pointwise measurable
class of functions โ„ฑ on ๐‘† (see [25], Chapter 2.3). Denote by โ„™๐‘› the (random) empirical probability
measure that assigns probability ๐‘›โˆ’1 to each ๐‘๐‘– . Let ๐‘ (๐œ–, โ„ฑ , โˆฅ โ‹… โˆฅโ„™๐‘› ,2 ) denote the ๐œ–-covering number of
โ„ฑ with respect to the ๐ฟ2 (โ„™๐‘› ) seminorm โˆฅ โ‹… โˆฅโ„™๐‘› ,2 .
The following maximal inequality is derived in [6].
Lemma 8 (Maximal inequality for ๏ฌnite classes). Suppose that the class โ„ฑ is ๏ฌnite. Then for every
๐œ โˆˆ (0, 1/2) and ๐›ฟ โˆˆ (0, 1), with probability at least 1 โˆ’ 4๐œ โˆ’ 4๐›ฟ,
\[
\max_{f\in\mathcal{F}}|\mathbb{G}_n(f)| \leqslant 4\sqrt{2\log(2|\mathcal{F}|/\delta)}\,\Big\{Q(1-\tau) \vee 2\max_{f\in\mathcal{F}}\mathrm{median}\,(|\mathbb{G}_n(f)|)\Big\},
\]
where $Q(u) := u$-quantile of $\max_{f\in\mathcal{F}}\sqrt{\mathbb{E}_n[f(z_i)^2]}$.
The following maximal inequality is derived in [3].
Lemma 9 (Maximal inequality for infinite classes). Let $F = \sup_{f\in\mathcal{F}}|f|$, and suppose that there exist some constants $\omega_n > 1$, $\tau > 1$, $m > 0$, and $h_n \geqslant h_0$ such that
\[
N(\epsilon\|F\|_{\mathbb{P}_n,2},\mathcal{F},\|\cdot\|_{\mathbb{P}_n,2}) \leqslant (n\vee h_n)^m(\omega_n/\epsilon)^{\tau m}, \quad 0<\epsilon<1.
\]
Set $C := (1+\sqrt{2\tau})/4$. Then for every $\delta\in(0,1/6)$ and every constant $K\geqslant\sqrt{2/\delta}$, we have
\[
\sup_{f\in\mathcal{F}}|\mathbb{G}_n(f)| \leqslant 4\sqrt{2}\,cKC\sqrt{m\log(n\vee h_n\vee\omega_n)}\,\max\Big\{\sup_{f\in\mathcal{F}}\sqrt{\bar{\mathrm{E}}[f(z_i)^2]},\ \sup_{f\in\mathcal{F}}\sqrt{\mathbb{E}_n[f(z_i)^2]}\Big\},
\]
with probability at least $1-\delta$, provided that $n\vee h_0\geqslant 3$; the constant $c<30$ is universal.
References
[1] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random
matrices. Constructive Approximation, 28:253โ€“263, 2008.
[2] A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an
application to eminent domain. Econometrica, 80(6):2369โ€“2430, November 2012.
[3] A. Belloni and V. Chernozhukov. โ„“1 -penalized quantile regression for high dimensional sparse models. Annals of Statistics, 39(1):82โ€“130, 2011.
[4] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli,
19(2):521โ€“547, 2013.
[5] A. Belloni, V. Chernozhukov, and I. Fernandez-Val. Conditional Quantile Processes based on Series or Many Regressors.
ArXiv e-prints, May 2011.
[6] A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment e๏ฌ€ects after selection amongst high-dimensional
controls. ArXiv, 2011.
[7] A. Belloni, V. Chernozhukov, and C. Hansen. Inference for high-dimensional sparse econometric models. Advances in
Economics and Econometrics. 10th World Congress of Econometric Society, held in August 2010, III:245โ€“295, 2013.
[8] A. Belloni, V. Chernozhukov, and K. Kato. Robust inference in high-dimensional sparse quantile regression models.
Working Paper, 2013.
8
UNIFORM POST SELECTION INFERENCE FOR LAD
[9] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics,
37(4):1705โ€“1732, 2009.
[10] Victor Chernozhukov and Christian Hansen. Instrumental variable quantile regression: A robust inference approach.
Journal of Econometrics, 142:379โ€“398, 2008.
[11] Xuming He and Qi-Man Shao. On parameters of increasing dimensions. J. Multivariate Anal., 73(1):120โ€“135, 2000.
[12] Bing-Yi Jing, Qi-Man Shao, and Qiying Wang. Self-normalized Cramer-type large deviations for independent random
variables. Ann. Probab., 31(4):2167โ€“2215, 2003.
[13] K. Kato. Group lasso for high dimensional sparse quantile regression models. Preprint, ArXiv, 2011.
[14] Roger Koenker. Quantile regression, volume 38 of Econometric Society Monographs. Cambridge University Press,
Cambridge, 2005.
[15] Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Series in Statistics. Springer,
Berlin, 2008.
[16] Sokbae Lee. E๏ฌƒcient semiparametric estimation of a partially linear quantile regression model. Econometric theory,
19:1โ€“31, 2003.
[17] Hannes Leeb and Benedikt M. Poฬˆtscher. Model selection and inference: facts and ๏ฌction. Economic Theory, 21:21โ€“59,
2005.
[18] Hannes Leeb and Benedikt M. Poฬˆtscher. Sparse estimators and the oracle property, or the return of Hodgesโ€™ estimator.
J. Econometrics, 142(1):201โ€“211, 2008.
[19] J. Neyman. Optimal asymptotic tests of composite statistical hypotheses. In U. Grenander, editor, Probability and
Statistics, the Harold Cramer Volume. New York: John Wiley and Sons, Inc., 1959.
[20] J. Neyman. ๐‘(๐›ผ) tests and their use. Sankhya, 41:1โ€“21, 1979.
[21] J. L. Powell. Censored regression quantiles. Journal of Econometrics, 32:143โ€“155, 1986.
[22] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. ArXiv:1106.1151, 2011.
[23] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267โ€“288, 1996.
[24] S. A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614โ€“645, 2008.
[25] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics, 1996.
[26] Lie Wang. $L_1$ penalized LAD estimator for high dimensional linear regression. ArXiv, 2012.
[27] Cun-Hui Zhang and Stephanie S. Zhang. Con๏ฌdence intervals for low-dimensional parameters with high-dimensional
data. ArXiv.org, (arXiv:1110.2563v1), 2011.
[28] S. Zhou. Restricted eigenvalue conditions on subgaussian matrices. ArXiv:0904.4723v2, 2009.