Uniform post selection inference for LAD regression models
Alexandre Belloni
Victor Chernozhukov
Kengo Kato
The Institute for Fiscal Studies
Department of Economics, UCL
cemmap working paper CWP24/13
UNIFORM POST SELECTION INFERENCE FOR LAD REGRESSION MODELS
A. BELLONI, V. CHERNOZHUKOV, AND K. KATO
Abstract. We develop uniformly valid confidence regions for a regression coefficient in a high-dimensional sparse LAD (least absolute deviation or median) regression model. The setting is one where the number of regressors $p$ could be large in comparison to the sample size $n$, but only $s \ll n$ of them are needed to accurately describe the regression function. Our new methods are based on the instrumental LAD regression estimator that assembles the optimal estimating equation from either post-$\ell_1$-penalized LAD regression or $\ell_1$-penalized LAD regression. The estimating equation is immunized against non-regular estimation of the nuisance part of the regression function, in the sense of Neyman. We establish that in a homoscedastic regression model, under certain conditions, the instrumental LAD regression estimator of the regression coefficient is asymptotically root-$n$ normal uniformly with respect to the underlying sparse model. The resulting confidence regions are valid uniformly with respect to the underlying model. The new inference methods outperform the naive, "oracle based" inference methods, which are known to be not uniformly valid (their coverage property fails to hold uniformly with respect to the underlying model) even in the setting with $p = 2$. We also provide Monte-Carlo experiments which demonstrate that standard post-selection inference breaks down over large parts of the parameter space, whereas the proposed method does not.
Key words: median regression, uniformly valid inference, instruments, Neymanization, optimality, sparsity, post selection inference
1. Introduction
We consider the following regression model
$$y_i = d_i\alpha_0 + x_i'\beta_0 + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1.1)$$
where $d_i$ is the "main regressor" of interest, whose coefficient $\alpha_0$ we would like to estimate and perform (robust) inference on. The $(x_i)_{i=1}^n$ are other high-dimensional regressors or "controls" and are treated as fixed (the $d_i$'s are random). The regression error $\epsilon_i$ is independent of $d_i$ and has median 0. The errors $(\epsilon_i)_{i=1}^n$ are i.i.d. with distribution function $F(\cdot)$ and probability density function $f_\epsilon(\cdot)$ such that $F(0) = 1/2$ and $f_\epsilon = f_\epsilon(0) > 0$. The assumption on the error term motivates the use of least absolute deviation (LAD) or median regression, suitably adjusted for use in high-dimensional settings.
Date: First version: May 2012; this version: April 30, 2013. We would like to thank the participants of the Luminy conference on Nonparametric and High-Dimensional Statistics (December 2012), the Oberwolfach workshop on Frontiers in Quantile Regression (November 2012), the 8th World Congress in Probability and Statistics (August 2012), and a seminar at the University of Michigan (October 2012). We are grateful to Sara van de Geer, Xuming He, Richard Nickl, Roger Koenker, Vladimir Koltchinskii, Steve Portnoy, Philippe Rigollet, and Bin Yu for useful comments and discussions.
The dimension $p$ of the "controls" $x_i$ is large, potentially much larger than $n$, which creates a challenge for inference on $\alpha_0$. Although the unknown true parameter $\beta_0$ lies in this large space, the key assumption that will make estimation possible is its sparsity, namely that $T = \mathrm{support}(\beta_0)$ has $s < n$ elements (where $s$ can depend on $n$; we shall use array asymptotics). This in turn motivates the use of regularization or model selection methods.
A standard (non-robust) approach towards inference in this setting would be first to perform model selection via the $\ell_1$-penalized LAD regression estimator
$$(\widehat\alpha, \widehat\beta) \in \arg\min_{\alpha,\beta}\ \mathbb{E}_n\big[|y_i - d_i\alpha - x_i'\beta|\big] + \frac{\lambda}{n}\|(\alpha,\beta')'\|_1, \qquad (1.2)$$
and then to use the post-model selection estimator
$$(\widetilde\alpha, \widetilde\beta) \in \arg\min_{\alpha,\beta}\ \Big\{\mathbb{E}_n\big[|y_i - d_i\alpha - x_i'\beta|\big] : \beta_j = 0 \text{ if } \widehat\beta_j = 0\Big\} \qquad (1.3)$$
to perform "usual" inference for $\alpha_0$. (The notation $\mathbb{E}_n[\cdot]$ denotes the average over index $1 \leqslant i \leqslant n$.)
This standard approach is justified if (1.2) achieves perfect model selection with probability approaching one, so that the estimator (1.3) has the "oracle" property with probability approaching one. However, conditions for "perfect selection" are very restrictive in this model; in particular, they require significant separation of the non-zero coefficients away from zero. If these conditions do not hold, the estimator $\widetilde\alpha$ does not converge to $\alpha_0$ at the $\sqrt{n}$-rate uniformly with respect to the underlying model, which implies that "usual" inference breaks down and is not valid. (The statements continue to apply if $\alpha$ is not penalized in (1.2), if $\alpha$ is restricted in (1.3), or if thresholding is applied.) We shall demonstrate the breakdown of such naive inference in the Monte-Carlo experiments, where the non-zero coefficients in $\theta_0$ are not significantly separated from zero.
Note that the breakdown of inference does not mean that the aforementioned procedures are not suitable for prediction purposes. Indeed, the $\ell_1$-LAD estimator (1.2) and the post-$\ell_1$-LAD estimator (1.3) attain (essentially) optimal rates $\sqrt{(s\log p)/n}$ of convergence for estimating the entire median regression function, as has been shown in [24, 3, 13, 26] and in [3]. This property means that while these procedures will not deliver perfect model recovery, they will only make "moderate" model selection mistakes (omitting only controls with coefficients local to zero).
To achieve uniformly valid inferential performance we propose a procedure whose performance does not require perfect model selection and which allows potential "moderate" model selection mistakes. The latter feature is critical in achieving uniformity over a large class of data generating processes, similarly to the results for instrumental regression and mean regression studied in [27], [2], [7], [6]. This allows us to overcome the impact of (moderate) model selection mistakes on inference, avoiding (in part) the criticisms in [17], who prove that the "oracle property" sometimes achieved by the naive estimators necessarily implies the failure of uniform validity of inference and their semiparametric inefficiency [18].

In order to achieve robustness with respect to moderate model selection mistakes, it will be necessary to achieve the proper orthogonality condition between the main regressors and the control variables.
Towards that goal, the following auxiliary equation plays a key role (in the homoscedastic case):
$$d_i = x_i'\theta_0 + v_i, \quad \mathrm{E}[v_i] = 0, \quad i = 1, \ldots, n, \qquad (1.4)$$
describing the relevant dependence of the regressor of interest $d_i$ on the other controls $x_i$. We shall assume sparsity of $\theta_0$, namely that $T_d = \mathrm{support}(\theta_0)$ has at most $s < n$ elements, and estimate the relation (1.4) via the Lasso or Post-Lasso methods described below.

Given $v_i$, which "partials out" the effect of $x_i$ from $d_i$, we shall use it as an instrument in the following estimating equations for $\alpha_0$:
$$\mathrm{E}[\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)v_i] = 0, \quad i = 1, \ldots, n,$$
where $\varphi(t) = 1/2 - 1\{t \leqslant 0\}$. We shall use the empirical analog of this equation to form an instrumental LAD regression estimator of $\alpha_0$, using a plug-in estimator for $x_i'\beta_0$. The estimating equation above has the following feature:
$$\frac{\partial}{\partial\beta}\,\mathrm{E}[\varphi(y_i - d_i\alpha_0 - x_i'\beta)v_i]\Big|_{\beta=\beta_0} = 0, \quad i = 1, \ldots, n. \qquad (1.5)$$
As a result, the estimator of $\alpha_0$ will be "immunized" against "crude" estimation of $x_i'\beta_0$, for example via a post-selection procedure or some regularization procedure. As we explain in Section 5, such immunization ideas can be traced back to Neyman ([19, 20]).

Our estimation procedure has the following three steps.
Step 1: Estimation of the confounding function $x_i'\beta_0$ in (1.1).
Step 2: Estimation of the instruments (residuals) $v_i$ in (1.4).
Step 3: Estimation of the main effect $\alpha_0$ based on the instrumental LAD regression using $v_i$ as instruments for $d_i$.
Each step is computationally tractable, involving solutions of convex problems and a one-dimensional search, and relies on a different identification condition, which in turn requires a different estimation procedure:

Step 1 constructs an estimate of the nuisance function $x_i'\beta_0$, not an estimate of $\alpha_0$. Here we do not need $\sqrt{n}$-rate consistency for the estimate of the nuisance function; a slower rate such as $o(n^{-1/4})$ will suffice. Thus, this step can be based either on the $\ell_1$-LAD regression estimator (1.2) or on the associated post-model selection estimator (1.3).
Step 2 partials out the impact of the covariates $x_i$ on the main regressor $d_i$, obtaining an estimate of the residuals $v_i$ in the decomposition (1.4). In order to estimate these residuals we rely either on the heteroscedastic Lasso [2], a version of the Lasso estimator of [23, 9],
$$\widehat\theta \in \arg\min_\theta\ \mathbb{E}_n\big[(d_i - x_i'\theta)^2\big] + \frac{\lambda}{n}\|\widehat\Gamma\theta\|_1, \quad \text{and set } \widehat v_i = d_i - x_i'\widehat\theta,\ i = 1, \ldots, n, \qquad (1.6)$$
where $\lambda$ and $\widehat\Gamma$ are the penalty level and data-driven penalty loadings described in [2] (restated in Appendix D), or on the associated post-model selection estimator (Post-Lasso) [4, 2], defined as
$$\widetilde\theta \in \arg\min_\theta\ \Big\{\mathbb{E}_n\big[(d_i - x_i'\theta)^2\big] : \theta_j = 0 \text{ if } \widehat\theta_j = 0\Big\}, \quad \text{and set } \widehat v_i = d_i - x_i'\widetilde\theta. \qquad (1.7)$$
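The following minimal sketch (Python, assuming scikit-learn 1.0+ and NumPy) illustrates how Steps 1 and 2 could be implemented with off-the-shelf penalized estimators. The penalty levels `lam` passed to each function are placeholders for the data-driven penalty levels and loadings of Appendix D, and the refits on the selected supports are plain unpenalized LAD and least squares; these are assumptions made for illustration, not the exact implementation analyzed in the paper.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor, Lasso

def step1_post_l1_lad(y, d, x, lam):
    """Step 1 sketch: l1-penalized LAD of y on (d, x), then an unpenalized LAD refit on
    the selected support.  Returns a pilot coefficient on d and the fitted nuisance part
    x_i' beta_hat.  `lam` stands in for the data-driven penalty of Appendix D (assumption)."""
    W = np.column_stack([d, x])
    l1_lad = QuantileRegressor(quantile=0.5, alpha=lam, solver="highs").fit(W, y)
    # keep d in the refit so that a pilot alpha is always available (a convenience assumption)
    support = np.union1d([0], np.flatnonzero(np.abs(l1_lad.coef_) > 1e-8))
    refit = QuantileRegressor(quantile=0.5, alpha=0.0, solver="highs").fit(W[:, support], y)
    coef = np.zeros(W.shape[1])
    coef[support] = refit.coef_
    return coef[0], x @ coef[1:] + refit.intercept_

def step2_post_lasso(d, x, lam):
    """Step 2 sketch: Lasso of d on x, least-squares refit on the selected support,
    and the residual instrument v_hat = d - x' theta_hat."""
    lasso = Lasso(alpha=lam, fit_intercept=True).fit(x, d)
    support = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
    if support.size == 0:
        return d - d.mean()
    Z = np.column_stack([np.ones(len(d)), x[:, support]])
    theta_s, *_ = np.linalg.lstsq(Z, d, rcond=None)
    return d - Z @ theta_s
```

The pilot coefficient on $d_i$ returned by Step 1 is used below only to center the parameter space $\mathcal{A}$ for Step 3.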
Step 3 constructs an estimator $\check\alpha$ of the coefficient $\alpha_0$ via the instrumental LAD regression proposed in [10], using $(\widehat v_i)_{i=1}^n$ as instruments. Formally, $\check\alpha$ is defined as
$$\check\alpha \in \arg\inf_{\alpha\in\mathcal{A}} L_n(\alpha), \quad \text{where } L_n(\alpha) = \frac{4\,\big|\mathbb{E}_n[\varphi(y_i - x_i'\widehat\beta - d_i\alpha)\,\widehat v_i]\big|^2}{\mathbb{E}_n[\widehat v_i^2]}, \qquad (1.8)$$
$\varphi(t) = 1/2 - 1\{t \leqslant 0\}$, and $\mathcal{A}$ is a parameter space for $\alpha_0$. We will analyze the choice $\mathcal{A} = [\widetilde\alpha - C\log^{-1}n,\ \widetilde\alpha + C\log^{-1}n]$ with a suitable constant $C > 0$,$^1$ where $\widetilde\alpha$ is the preliminary Step 1 estimate of $\alpha_0$. Several other choices of $\mathcal{A}$ are possible.
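A minimal sketch of Step 3 follows. Since $\varphi$ is a step function, $L_n(\alpha)$ is piecewise constant in $\alpha$, so a plain grid search over $\mathcal{A}$ is used here to locate a minimizer; the width of $\mathcal{A}$ uses the constant $C = 10(\mathbb{E}_n[d_i^2])^{-1/2}$ mentioned in the footnote, and the grid resolution is an arbitrary choice for illustration.

```python
import numpy as np

def phi(t):
    # median score: phi(t) = 1/2 - 1{t <= 0}
    return 0.5 - (t <= 0)

def L_n(alpha, y, d, xb_hat, v_hat):
    """Criterion L_n(alpha) of (1.8), with xb_hat = x_i' beta_hat and v_hat the instrument."""
    s = np.mean(phi(y - xb_hat - d * alpha) * v_hat)
    return 4.0 * s ** 2 / np.mean(v_hat ** 2)

def step3_ivlad(y, d, xb_hat, v_hat, alpha_init, C=None, grid_size=2001):
    """Step 3 sketch: grid-search minimization of L_n over the interval
    [alpha_init - C/log n, alpha_init + C/log n]."""
    n = len(y)
    if C is None:
        C = 10.0 / np.sqrt(np.mean(d ** 2))   # the choice used in the paper's experiments
    grid = alpha_init + np.linspace(-C / np.log(n), C / np.log(n), grid_size)
    vals = np.array([L_n(a, y, d, xb_hat, v_hat) for a in grid])
    return grid[np.argmin(vals)], grid, vals
```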
Our main result establishes conditions under which $\check\alpha$ is root-$n$ consistent for $\alpha_0$, asymptotically normal, and achieves the semi-parametric efficiency bound for estimating $\alpha_0$ in the current homoscedastic setting, provided that $(s^3\log^3 p)/n \to 0$ and other regularity conditions hold. Specifically, we show that, despite possible model selection mistakes in Steps 1 and 2, the estimator $\check\alpha$ obeys
$$\sigma_n^{-1}\sqrt{n}\,(\check\alpha - \alpha_0) \rightsquigarrow N(0,1), \qquad (1.9)$$
where $\sigma_n^2 := 1/(4 f_\epsilon^2\,\bar{\mathrm{E}}[v_i^2])$ with $f_\epsilon = f_\epsilon(0)$. An alternative (and more robust) expression for $\sigma_n^2$ is given by Huber's sandwich formula:
$$\sigma_n^2 = J^{-1}\Omega J^{-1}, \quad \text{where } \Omega := \bar{\mathrm{E}}[v_i^2]/4 \ \text{ and } \ J := \bar{\mathrm{E}}[f_i d_i v_i]. \qquad (1.10)$$
We recommend estimating $\Omega$ by the plug-in method and estimating $J$ by Powell's method [21]. Furthermore, we show that the criterion function at the true value $\alpha_0$ in Step 3 has the following pivotal behavior:
$$n L_n(\alpha_0) \rightsquigarrow \chi^2(1). \qquad (1.11)$$
This allows the construction of a confidence region $\widehat A_{\xi,n}$ with asymptotic coverage $1-\xi$ based on the statistic $L_n$:
$$\mathrm{P}(\alpha_0 \in \widehat A_{\xi,n}) \to 1 - \xi, \quad \text{where } \widehat A_{\xi,n} = \{\alpha \in \mathcal{A} : n L_n(\alpha) \leqslant (1-\xi)\text{-quantile of } \chi^2(1)\}. \qquad (1.12)$$
Importantly, the robustness with respect to moderate model selection mistakes, which occurs because of (1.5), allows the results (1.9) and (1.11) to hold uniformly over a large range of data generating processes, similarly to the results for instrumental regression and the partially linear mean regression model established in [6, 27, 2]. One of our proposed algorithms explicitly uses $\ell_1$-regularization methods, similarly to [27] and [2], while the main algorithm we propose uses post-selection methods, similarly to [6, 2].

Throughout the paper, we use array asymptotics, that is, asymptotics where the model changes with $n$, to better capture finite-sample phenomena such as "small coefficients" that are local to zero. This ensures the robustness of conclusions with respect to perturbations of the data-generating process along various model sequences. This robustness, in turn, translates into uniform validity of confidence regions over substantial regions of data-generating processes.
$^1$For numerical experiments we used $C = 10(\mathbb{E}_n[d_i^2])^{-1/2}$ (and typically we normalize $\mathbb{E}_n[d_i^2] = 1$).
1.1. Notation and convention. Denote by $(\Omega, \mathcal{F}, \mathrm{P})$ the underlying probability space. The notation $\mathbb{E}_n[\cdot]$ denotes the average over index $1 \leqslant i \leqslant n$, i.e., it simply abbreviates $n^{-1}\sum_{i=1}^n[\cdot]$; for example, $\mathbb{E}_n[x_{ij}^2] = n^{-1}\sum_{i=1}^n x_{ij}^2$. Moreover, we use the notation $\bar{\mathrm{E}}[\cdot] = \mathbb{E}_n[\mathrm{E}[\cdot]]$; for example, $\bar{\mathrm{E}}[v_i^2] = n^{-1}\sum_{i=1}^n \mathrm{E}[v_i^2]$. For a function $f : \mathbb{R}\times\mathbb{R}\times\mathbb{R}^p \to \mathbb{R}$, we write $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n \big(f(y_i, d_i, x_i) - \mathrm{E}[f(y_i, d_i, x_i)]\big)$. The $\ell_2$-norm is denoted by $\|\cdot\|$, and the $\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero components of a vector. Denote by $\|\cdot\|_\infty$ the maximal absolute element of a vector. For a sequence $(z_i)_{i=1}^n$ of constants, we write $\|z_i\|_{2,n} = \sqrt{\mathbb{E}_n[z_i^2]}$; for example, for a vector $\delta \in \mathbb{R}^p$, $\|x_i'\delta\|_{2,n} = \sqrt{\mathbb{E}_n[(x_i'\delta)^2]}$ denotes the prediction norm of $\delta$. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T \in \mathbb{R}^p$ the vector such that $(\delta_T)_j = \delta_j$ if $j \in T$ and $(\delta_T)_j = 0$ if $j \notin T$. Also we write the support of $\delta$ as $\mathrm{support}(\delta) = \{j \in \{1, \ldots, p\} : \delta_j \neq 0\}$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$, and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to denote $a \leqslant cb$ for some constant $c > 0$ that does not depend on $n$, and $a \lesssim_P b$ to denote $a = O_P(b)$. The arrow $\rightsquigarrow$ denotes convergence in distribution.

We assume that quantities such as $p$ (the dimension of $x_i$), $s$ (a bound on the number of non-zero elements of $\beta_0$ and $\theta_0$), and hence $y_i$, $x_i$, $\beta_0$, $\theta_0$, $T$ and $T_d$ are all dependent on the sample size $n$, and we allow for the case where $p = p_n \to \infty$ and $s = s_n \to \infty$ as $n \to \infty$. However, for notational convenience, we shall omit the dependence of these quantities on $n$.
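For concreteness, the empirical operators used throughout can be coded directly; the following tiny helpers (NumPy) compute $\mathbb{E}_n[\cdot]$ and the prediction norm $\|x_i'\delta\|_{2,n}$, and are only a sketch of the notation.

```python
import numpy as np

def En(z):
    """E_n[z_i]: empirical average over i = 1, ..., n."""
    return np.mean(z, axis=0)

def prediction_norm(x, delta):
    """||x_i' delta||_{2,n} = sqrt(E_n[(x_i' delta)^2])."""
    return np.sqrt(np.mean((x @ delta) ** 2))
```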
2. The Methods, Conditions, and Results
2.1. The methods. Each of the steps outlined before uses a different identification condition. Several combinations are possible to implement each step, two of which are the following.
Algorithm 1 (Based on Post-Model Selection estimators).
(1) Run Post-$\ell_1$-penalized LAD (1.3) of $y_i$ on $d_i$ and $x_i$; keep the fitted value $x_i'\widehat\beta$.
(2) Run Post-Lasso (1.7) of $d_i$ on $x_i$; keep the residual $\widehat v_i := d_i - x_i'\widehat\theta$.
(3) Run Instrumental LAD regression (1.8) of $y_i - x_i'\widehat\beta$ on $d_i$, using $\widehat v_i$ as the instrument for $d_i$, to compute the estimator $\check\alpha$. Report $\check\alpha$ and/or perform inference based upon (1.9) or (1.12). (A sketch composing these steps in code is given below.)
Algorithm 2 (Based on Regularized Estimators).
(1) Run $\ell_1$-penalized LAD (1.2) of $y_i$ on $d_i$ and $x_i$; keep the fitted value $x_i'\widehat\beta$.
(2) Run Lasso (1.6) of $d_i$ on $x_i$; keep the residual $\widehat v_i := d_i - x_i'\widehat\theta$.
(3) Run Instrumental LAD regression (1.8) of $y_i - x_i'\widehat\beta$ on $d_i$, using $\widehat v_i$ as the instrument for $d_i$, to compute the estimator $\check\alpha$. Report $\check\alpha$ and/or perform inference based upon (1.9) or (1.12).
Comment 2.1 (Penalty Levels). In order to perform $\ell_1$-LAD and Lasso, one has to suitably choose the penalty levels. In the Supplementary Appendix D we provide implementation details, including penalty choices for each step of the algorithm, and in all that follows we shall obey the penalty choices described in Appendix D.
Comment 2.2 (Differences). Algorithm 1 relies on Post-$\ell_1$-LAD and Post-Lasso, while Algorithm 2 relies on $\ell_1$-LAD and Lasso. Since Algorithm 1 refits the non-zero coefficients without the penalty term, it has a smaller bias. It does, therefore, rely on $\ell_1$-LAD and Lasso producing sparse solutions, which in turn typically relies on restricted isometry conditions [3, 2]. Algorithm 2 relies on penalized estimators. Step 3 of both algorithms relies on instrumental LAD regression with estimated data.
Comment 2.3 (Alternative Implementations). As discussed before, the three-step approach proposed here can be implemented with several different methods, each with specific features. For instance, the Dantzig selector, the square-root Lasso, or the associated post-model selection estimators could be used instead of Lasso or Post-Lasso. Moreover, the instrumental LAD regression can be substituted by a one-step estimator constructed from the $\ell_1$-LAD estimator $\widehat\alpha$, of the form $\widehat\alpha + (\mathbb{E}_n[f_\epsilon\widehat v_i^2])^{-1}\mathbb{E}_n[\varphi(y_i - d_i\widehat\alpha - x_i'\widehat\beta)\widehat v_i]$, or by a LAD regression on all the covariates selected in Steps 1 and 2.
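As a minimal illustration of the one-step alternative in Comment 2.3, the update below computes $\widehat\alpha + (\mathbb{E}_n[f_\epsilon\widehat v_i^2])^{-1}\mathbb{E}_n[\varphi(y_i - d_i\widehat\alpha - x_i'\widehat\beta)\widehat v_i]$, with $f_\epsilon(0)$ replaced by a simple kernel estimate; the bandwidth choice is an assumption made for illustration, and any consistent estimate of $f_\epsilon(0)$ could be used instead.

```python
import numpy as np

def one_step_update(y, d, xb_hat, v_hat, alpha_check, h=None):
    """One-step update of Comment 2.3 around a preliminary estimate alpha_check."""
    eps_hat = y - d * alpha_check - xb_hat
    if h is None:
        h = 1.06 * np.std(eps_hat) * len(y) ** (-1 / 5)    # rule-of-thumb bandwidth (assumption)
    f_hat = np.mean(np.abs(eps_hat) <= h) / (2 * h)        # kernel estimate of f_eps(0)
    phi_vals = 0.5 - (eps_hat <= 0)
    return alpha_check + np.mean(phi_vals * v_hat) / (f_hat * np.mean(v_hat ** 2))
```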
2.2. Regularity Conditions. Here we provide regularity conditions that are sufficient for the validity of the main estimation and inference results. We begin by stating our main condition, which contains the previously defined approximate sparsity as well as other more technical assumptions. Throughout the paper, let $c$ and $C$ be positive constants independent of $n$, and let $\ell_n \to \infty$, $\delta_n \to 0$, and $\Delta_n \to 0$ be sequences of positive constants. Let $K_x := \max_{1\leqslant i\leqslant n}\|x_i\|_\infty$.

Condition I. (i) $(\epsilon_i)_{i=1}^n$ is a sequence of i.i.d. random variables with common distribution function $F$ such that $F(0) = 1/2$; $(v_i)_{i=1}^n$ is a sequence of independent mean-zero random variables independent of $(\epsilon_i)_{i=1}^n$; and $(x_i)_{i=1}^n$ is a sequence of non-stochastic vectors of covariates in $\mathbb{R}^p$, normalized in such a way that $\mathbb{E}_n[x_{ij}^2] = 1$ for all $1 \leqslant j \leqslant p$. The sequence $\{(y_i, d_i)'\}_{i=1}^n$ of random vectors is generated according to models (1.1) and (1.4). (ii) $c \leqslant \mathrm{E}[v_i^2] \leqslant C$ for all $1 \leqslant i \leqslant n$, and $\bar{\mathrm{E}}[d_i^4] + \bar{\mathrm{E}}[v_i^4] + \max_{1\leqslant j\leqslant p}\big(\bar{\mathrm{E}}[x_{ij}^2 d_i^2] + \bar{\mathrm{E}}[|x_{ij} v_i|^3]\big) \leqslant C$. (iii) There exists $s = s_n \geqslant 1$ such that $\|\beta_0\|_0 \leqslant s$ and $\|\theta_0\|_0 \leqslant s$. (iv) The error distribution $F$ is absolutely continuous with continuously differentiable density $f_\epsilon(\cdot)$ such that $f_\epsilon(0) \geqslant c > 0$ and $f_\epsilon(t) \vee |f_\epsilon'(t)| \leqslant C$ for all $t \in \mathbb{R}$. (v) $(K_x^4 + K_x^2 s^2 + s^3)\log^3(p \vee n) \leqslant n\delta_n$.

Comment 2.4. Condition I(i) imposes the setting discussed in the previous section with the zero conditional median of the error distribution. Condition I(ii) imposes moment conditions on the structural errors and regressors to ensure good model selection performance of the Lasso applied to equation (1.4). The approximate sparsity condition I(iii) imposes sparsity of the high-dimensional vectors $\beta_0$ and $\theta_0$. In the theorems below we provide the required technical conditions on the growth of $s\log p$, since these depend on the choice of algorithm. Condition I(iv) is a set of standard assumptions in the LAD literature (see [14]) and in the instrumental quantile regression literature [10]. Condition I(v) restricts the sparsity index, so that $s^3\log^3(p\vee n) = o(n)$ is required; this is analogous to the standard assumption $s^3(\log n)^2 = o(n)$ (see [11]) invoked in the LAD analysis without any selection (i.e., where $s = p$). Most importantly, no assumptions on the separation from zero of the non-zero coefficients of $\theta_0$ and $\beta_0$ are made.
The next condition concerns the behavior of the Gram matrix $\mathbb{E}_n[\tilde x_i \tilde x_i']$, where $\tilde x_i = (d_i, x_i')'$. Whenever $p + 1 > n$, the empirical Gram matrix $\mathbb{E}_n[\tilde x_i \tilde x_i']$ does not have full rank and in principle is not well-behaved. However, we only need good behavior of smaller submatrices. Define the minimal and maximal $m$-sparse eigenvalues of $\mathbb{E}_n[\tilde x_i \tilde x_i']$ as
$$\phi_{\min}(m) := \min_{1\leqslant\|\delta\|_0\leqslant m}\ \frac{\delta'\,\mathbb{E}_n[\tilde x_i\tilde x_i']\,\delta}{\|\delta\|^2} \quad \text{and} \quad \phi_{\max}(m) := \max_{1\leqslant\|\delta\|_0\leqslant m}\ \frac{\delta'\,\mathbb{E}_n[\tilde x_i\tilde x_i']\,\delta}{\|\delta\|^2}. \qquad (2.13)$$
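The $m$-sparse eigenvalues in (2.13) involve an optimization over all supports of size at most $m$, which is combinatorial. The following sketch is only a Monte-Carlo diagnostic: it samples random supports, so it can only bound $\phi_{\min}(m)$ from above and $\phi_{\max}(m)$ from below, and it is not the quantity entering the theory.

```python
import numpy as np

def sparse_eigenvalue_heuristic(X, m, n_draws=200, rng=None):
    """Heuristic check of the m-sparse eigenvalues in (2.13) for the design matrix X (n x p).

    Samples random supports of size m; the returned pair (lo, hi) satisfies
    lo >= phi_min(m) and hi <= phi_max(m), so it is a diagnostic, not the definition."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    G = X.T @ X / n                      # empirical Gram matrix E_n[x_i x_i']
    lo, hi = np.inf, -np.inf
    for _ in range(n_draws):
        S = rng.choice(p, size=m, replace=False)
        eigvals = np.linalg.eigvalsh(G[np.ix_(S, S)])
        lo, hi = min(lo, eigvals[0]), max(hi, eigvals[-1])
    return lo, hi
```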
To assume that $\phi_{\min}(m) > 0$ requires that all empirical Gram submatrices formed by any $m$ components of $\tilde x_i$ are positive definite. We shall employ the following condition as a sufficient condition for our results.

Condition SE. There exists a sequence of constants $\ell_n \to \infty$ such that the maximal and minimal $\ell_n s$-sparse eigenvalues are bounded from above and away from zero, namely, with probability at least $1 - \Delta_n$,
$$\kappa' \leqslant \phi_{\min}(\ell_n s) \leqslant \phi_{\max}(\ell_n s) \leqslant \kappa'',$$
where $0 < \kappa' < \kappa'' < \infty$ are constants independent of $n$.

Comment 2.5. Condition SE is quite plausible for many designs of interest. Essentially it can be established by combining tail conditions on the regressors with a growth restriction on $s$ and $p$ relative to $n$. For instance, Theorem 3.2 in [22] (see also [28] and [1]) shows that Condition SE holds for i.i.d. zero-mean sub-Gaussian regressors and $s\log^2(p\vee n) \leqslant \delta_n n$, while Theorem 1.8 in [22] (see also Lemma 1 in [4]) shows that Condition SE holds for i.i.d. uniformly bounded zero-mean regressors and $s(\log^3 n)\log(p\vee n) \leqslant \delta_n n$.
2.3. Results. We begin by considering Algorithm 1.

Theorem 1 (Robust Inference, Algorithm 1). Let $\check\alpha$ be obtained by Algorithm 1. Suppose that Conditions I and SE are satisfied for all $n \geqslant 1$. Moreover, suppose that with probability at least $1 - \Delta_n$, $\|\widehat\beta\|_0 \leqslant Cs$. Then, as $n \to \infty$ and for $\sigma_n^2 = 1/(4 f_\epsilon^2\,\bar{\mathrm{E}}[v_i^2])$,
$$\sigma_n^{-1}\sqrt{n}\,(\check\alpha - \alpha_0) \rightsquigarrow N(0,1) \quad \text{and} \quad n L_n(\alpha_0) \rightsquigarrow \chi^2(1).$$

Theorem 1 establishes the first main result of the paper. It relies on the post-model selection estimators, which in turn hinge on achieving sufficiently sparse estimates $\widehat\beta$ and $\widehat\theta$. Sparsity of $\widehat\theta$ can be directly achieved under sharp penalty choices for optimal rates, as discussed in the Supplementary Appendix D.2, while sparsity of $\widehat\beta$ potentially requires a heavier penalty, as shown in [3]. Alternatively, sparsity of the estimator in Step 1 can also be achieved by truncating the smallest components of the estimate $\widehat\beta$.$^2$
Next we turn to the analysis of Algorithm 2, which relies on the regularized estimators instead of the post-model selection estimators. Theorem 2 below establishes that Algorithm 2 achieves the same inferential guarantees as those established for Algorithm 1 in Theorem 1.

Theorem 2 (Robust Inference, Algorithm 2). Let $\check\alpha$ be obtained by Algorithm 2. Suppose that Conditions I and SE are satisfied for all $n \geqslant 1$. Moreover, suppose that with probability at least $1 - \Delta_n$, $\|\widehat\beta\|_0 \leqslant Cs$. Then, as $n \to \infty$ and for $\sigma_n^2 = 1/(4 f_\epsilon^2\,\bar{\mathrm{E}}[v_i^2])$,
$$\sigma_n^{-1}\sqrt{n}\,(\check\alpha - \alpha_0) \rightsquigarrow N(0,1) \quad \text{and} \quad n L_n(\alpha_0) \rightsquigarrow \chi^2(1).$$

Theorem 2 establishes the second main result of the paper.

An important consequence of these results is the following corollary. Here $\mathcal{P}_n$ denotes a collection of distributions for $\{(y_i, d_i)'\}_{i=1}^n$, and for $P_n \in \mathcal{P}_n$ the notation $\mathrm{P}_{P_n}$ means that under $\mathrm{P}_{P_n}$, $\{(y_i, d_i)'\}_{i=1}^n$ is distributed according to $P_n$.
$^2$Lemma 3 in Appendix C formally shows that a suitable truncation preserves the rate of convergence under our conditions.
Corollary 1 (Uniformly Valid Confidence Intervals). Let $\check\alpha$ be the estimator of $\alpha_0$ constructed according to Algorithm 1 (resp. Algorithm 2), and let $\mathcal{P}_n$ be the collection of all distributions of $\{(y_i, d_i)'\}_{i=1}^n$ for which the conditions of Theorem 1 (resp. Theorem 2) are satisfied for given $n \geqslant 1$. Then, as $n \to \infty$, uniformly in $P_n \in \mathcal{P}_n$,
$$\mathrm{P}_{P_n}\big(\alpha_0 \in [\check\alpha \pm \sigma_n z_{\xi/2}/\sqrt{n}]\big) \to 1 - \xi \quad \text{and} \quad \mathrm{P}_{P_n}\big(\alpha_0 \in \widehat A_{\xi,n}\big) \to 1 - \xi,$$
where $z_{\xi/2} = \Phi^{-1}(1 - \xi/2)$ and $\widehat A_{\xi,n} = \{\alpha \in \mathcal{A} : n L_n(\alpha) \leqslant (1-\xi)\text{-quantile of }\chi^2(1)\}$.

Corollary 1 establishes the third main result of the paper; it highlights the uniform nature of the results. As long as the overall sparsity requirements hold, imperfect model selection in Steps 1 and 2 does not compromise the results. The robustness of the approach is also apparent from the fact that Corollary 1 allows the data-generating process to change with $n$. This result is new even under the traditional case of fixed-$p$ asymptotics. Conditions I and SE, together with the appropriate side conditions in the theorems, explicitly characterize regions of data-generating processes for which the uniformity result holds. The simulation results discussed next provide additional evidence that these regions are substantial.
3. Monte-Carlo Experiments

In this section we examine the finite-sample performance of the proposed estimators. We focus on the estimator associated with Algorithm 1, based on post-model selection methods.

We considered the following regression model:
$$y = d\alpha_0 + x'(c_y\theta_0) + \epsilon, \qquad d = x'(c_d\theta_0) + v, \qquad (3.14)$$
where $\alpha_0 = 1/2$, $\theta_{0j} = 1/j^2$ for $j = 1, \ldots, 10$, and $\theta_{0j} = 0$ otherwise; $x = (1, z')'$ consists of an intercept and covariates $z \sim N(0, \Sigma)$; and the errors $\epsilon$ and $v$ are independently and identically distributed as $N(0,1)$. The dimension $p$ of the covariates $x$ is 300, and the sample size $n$ is 250. The regressors are correlated, with $\Sigma_{jk} = \rho^{|j-k|}$ and $\rho = 0.5$. The coefficients $c_y$ and $c_d$ are used to control the $R^2$ of the reduced-form equations. For each equation, we consider the following values of $R^2$: $\{0, 0.1, 0.2, \ldots, 0.8, 0.9\}$. Therefore we have 100 different designs, and results are based on 500 repetitions of each design. For each repetition we draw new vectors $x_i$'s and errors $\epsilon_i$'s and $v_i$'s.
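A sketch of one draw from the design (3.14) follows. The mapping from the target $R^2$ values to the constants $c_y$ and $c_d$ (solving $R^2 = c^2 q/(c^2 q + 1)$ with $q = \theta_0'\Sigma\theta_0$, which is consistent with unit-variance errors) and the placement of the non-zero coefficients on the first ten covariates after the intercept are assumptions made for illustration; the exact calibration is not spelled out in this section.

```python
import numpy as np

def simulate_design(n=250, p=300, alpha0=0.5, rho=0.5, R2_y=0.5, R2_d=0.5, rng=None):
    """One draw from the Monte-Carlo design (3.14), under the calibration assumptions above."""
    rng = np.random.default_rng(rng)
    theta0 = np.zeros(p - 1)
    theta0[:10] = 1.0 / (np.arange(1, 11) ** 2)          # theta_{0j} = 1/j^2, j = 1,...,10
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p - 1), np.arange(p - 1)))
    q = theta0 @ Sigma @ theta0
    c_y = np.sqrt(R2_y / ((1 - R2_y) * q)) if R2_y > 0 else 0.0
    c_d = np.sqrt(R2_d / ((1 - R2_d) * q)) if R2_d > 0 else 0.0
    z = rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)
    x = np.column_stack([np.ones(n), z])                 # intercept plus covariates
    theta_full = np.concatenate([[0.0], theta0])
    d = x @ (c_d * theta_full) + rng.standard_normal(n)
    y = d * alpha0 + x @ (c_y * theta_full) + rng.standard_normal(n)
    return y, d, x
```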
The design above with $x'(c_y\theta_0)$ is a sparse model. However, the decay of the components of $\theta_0$ rules out the typical "separation from zero" assumptions on the coefficients of "important" covariates (since the last non-zero component is of the order of $1/n$), unless $c_y$ is very large. Thus, we anticipate that "standard" post-selection inference procedures, which rely on model selection in the outcome equation only, will work poorly in this simulation study. In contrast, based upon the preceding theoretical arguments, we anticipate that our instrumental LAD estimator, which works off both equations in (3.14), will work well.

The simulation study focuses on Algorithm 1. Standard errors are computed using the formula (1.10). (Algorithm 2 worked similarly, though somewhat worse due to larger biases.) As the main benchmark we consider the standard post-model selection estimator $\widetilde\alpha$ based on the post-$\ell_1$-penalized LAD method, as defined in (1.3).
In Figure 1, we display the (empirical) rejection probabilities of tests of the true hypothesis $\alpha = \alpha_0$, with the nominal size of the tests equal to 0.05. The top left plot shows the rejection frequency of the standard post-model selection inference procedure based upon $\widetilde\alpha$ (where the inference procedure assumes perfect recovery of the true model). The rejection frequency deviates very sharply from the ideal rejection frequency of 0.05. This confirms the anticipated failure (lack of uniform validity) of inference based upon the standard post-model selection procedure in designs where coefficients are not well separated from zero (so that perfect recovery does not happen). In sharp contrast, the top right and bottom plots show that both of our proposed procedures (based on the estimator $\check\alpha$ and the result (1.9), and based on the statistic $L_n$ and the result (1.12)) perform well, closely tracking the ideal level of 0.05. This is achieved uniformly over all the designs considered in the study, confirming the theoretical results established in Corollary 1.
In Figure 2, we compare the performance of the standard post-selection estimator $\widetilde\alpha$ (defined in (1.3)) and our proposed post-selection estimator $\check\alpha$ (obtained via Algorithm 1). We display results in three different metrics of performance: mean bias (top row), standard deviation (middle row), and root mean square error (bottom row). The significant bias of the standard post-selection procedure occurs when the auxiliary equation (1.4) is non-trivial, that is, when the main regressor is correlated with the other controls. Such bias can be positive or negative depending on the particular design. The proposed post-selection estimator $\check\alpha$ performs well in all three metrics. The root mean square errors of the proposed estimator $\check\alpha$ are typically much smaller than those of the standard post-model selection estimator $\widetilde\alpha$ (as shown by the bottom plots in Figure 2). This is fully consistent with our theoretical results and the minimax efficiency considerations given in Section 5.
4. Generalization to the Heteroscedastic Case

We emphasize that both proposed algorithms exploit the homoscedasticity of the model (1.1) with respect to the error term $\epsilon_i$. The generalization to the heteroscedastic case can be achieved as follows. In order to achieve the semiparametric efficiency bound we need to consider a weighted version of the auxiliary equation (1.4). Specifically, we can rely on the following weighted decomposition:
$$f_i d_i = f_i x_i'\theta_0^* + v_i^*, \quad \mathrm{E}[f_i v_i^*] = 0, \quad i = 1, \ldots, n, \qquad (4.15)$$
where the weights are the conditional densities of the error terms $\epsilon_i$ evaluated at their medians of 0,
$$f_i = f_{\epsilon_i}(0 \mid d_i, x_i), \quad i = 1, \ldots, n, \qquad (4.16)$$
which in general vary under heteroscedasticity. With that in mind, it is straightforward to adapt the proposed algorithms when the weights $(f_i)_{i=1}^n$ are known. For example, Algorithm 1 becomes as follows.
Algorithm 1$'$ (Based on Post-Model Selection estimators).
(1) Run Post-$\ell_1$-penalized LAD of $y_i$ on $d_i$ and $x_i$; keep the fitted value $x_i'\widehat\beta$.
(2) Run Post-Lasso of $f_i d_i$ on $f_i x_i$; keep the residual $\widehat v_i^* := f_i(d_i - x_i'\widehat\theta)$.
(3) Run Instrumental LAD regression of $y_i - x_i'\widehat\beta$ on $d_i$, using $\widehat v_i^*$ as the instrument for $d_i$, to compute the estimator $\check\alpha$. Report $\check\alpha$ and/or perform inference. (A sketch of the weighted Step 2 with known weights is given below.)
An analogous generalization of Algorithm 2, based on regularized estimators, results from removing the word "Post" in the algorithm above.

Under similar regularity conditions, uniformly over a large collection $\mathcal{P}_n^*$ of distributions of $\{(y_i, d_i)'\}_{i=1}^n$, the estimator $\check\alpha$ above obeys
$$\big(4\bar{\mathrm{E}}[v_i^{*2}]\big)^{1/2}\sqrt{n}\,(\check\alpha - \alpha_0) \rightsquigarrow N(0,1). \qquad (4.17)$$
Moreover, the criterion function at the true value $\alpha_0$ in Step 3 also has pivotal behavior, namely
$$n L_n(\alpha_0) \rightsquigarrow \chi^2(1), \qquad (4.18)$$
which can also be used to construct a confidence region $\widehat A_{\xi,n}$ based on the $L_n$-statistic as in (1.12), with coverage $1 - \xi$ uniformly over the collection of distributions $\mathcal{P}_n^*$.

In practice the density function values $(f_i)_{i=1}^n$ are typically unknown and need to be replaced by estimates $(\widehat f_i)_{i=1}^n$. The analysis of the impact of such estimation is very delicate and is developed in the companion work [8], which considers the more general problem of uniformly valid inference for quantile regression models in approximately sparse models.
5. Discussion and Conclusion

5.1. Connection to Neymanization. In this section we make some connections to Neyman's $C(\alpha)$ test ([19, 20]). For the sake of exposition we assume that $(y_i, x_i, d_i)_{i=1}^n$ are i.i.d., but we shall use the heteroscedastic setup introduced in the previous section. We consider the estimating equation for $\alpha_0$:
$$\mathrm{E}[\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)v_i] = 0.$$
Our problem is to find useful instruments $v_i$ such that
$$\frac{\partial}{\partial\beta}\,\mathrm{E}[\varphi(y_i - d_i\alpha_0 - x_i'\beta)v_i]\Big|_{\beta=\beta_0} = 0.$$
If this property holds, the estimator of $\alpha_0$ will be "immunized" against "crude" or non-regular estimation of $\beta_0$, for example via a post-selection procedure or some regularization procedure. Such immunization ideas are in fact behind Neyman's classical construction of his $C(\alpha)$ test, so we shall use the term "Neymanization" to describe such a procedure. There will be many instruments $v_i$ that achieve the property stated above, and there will be one that is optimal.
The instruments can be constructed by taking $v_i := z_i/f_i$, where $z_i$ is the residual in the regression equation
$$w_i d_i = w_i m_0(x_i) + z_i, \quad \mathrm{E}[w_i z_i \mid x_i] = 0, \qquad (5.19)$$
where $w_i$ is a non-negative weight, a function of $(d_i, x_i)$ only, for example $w_i = 1$ or $w_i = f_i$; the latter choice will in fact be optimal. Note that the function $m_0(x_i)$ solves the least squares problem
$$\min_{h\in\mathcal{H}}\ \mathrm{E}\big[\{w_i d_i - w_i h(x_i)\}^2\big], \qquad (5.20)$$
where $\mathcal{H}$ is the class of measurable functions $h(x_i)$ such that $\mathrm{E}[w_i^2 h^2(x_i)] < \infty$. Our assumption is that $m_0(x_i)$ is a sparse function $x_i'\theta_0$, with $\|\theta_0\|_0 \leqslant s$, so that
$$w_i d_i = w_i x_i'\theta_0 + z_i, \quad \mathrm{E}[w_i z_i \mid x_i] = 0. \qquad (5.21)$$
In finite samples, the sparsity assumption allows us to employ Post-Lasso and Lasso to solve the least squares problem above approximately, and to estimate $z_i$. Of course, the use of other structured assumptions may motivate the use of other regularization methods.
Arguments similar to those in the proofs show that, for $\sqrt{n}(\alpha - \alpha_0) = O(1)$,
$$\sqrt{n}\,\big\{\mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\widehat\beta)v_i] - \mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\beta_0)v_i]\big\} = o_P(1),$$
for $\widehat\beta$ based on a sparse estimation procedure, despite the fact that $\widehat\beta$ converges to $\beta_0$ at a rate slower than $1/\sqrt{n}$. That is, the empirical estimating equations behave as if $\beta_0$ were known. Hence for estimation we can use $\check\alpha$, a minimizer of the statistic
$$L_n(\alpha) = \Omega_n^{-1}\,\big|\sqrt{n}\,\mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\widehat\beta)v_i]\big|^2,$$
where $\Omega_n = \mathbb{E}_n[v_i^2]/4$. Since $L_n(\alpha_0) \rightsquigarrow \chi^2(1)$, we can also use the statistic directly for testing hypotheses and for the construction of confidence sets.
This is in fact a version of Neyman's $C(\alpha)$ test statistic, adapted to the present non-smooth setting. The usual expression of the $C(\alpha)$ statistic is different. To see a more familiar form, note that $\theta_0 = \mathrm{E}[w_i^2 x_i x_i']^-\,\mathrm{E}[w_i^2 x_i d_i]$, where $A^-$ denotes a generalized inverse of $A$, and write
$$\widehat\varphi_i := \varphi(y_i - d_i\alpha - x_i'\widehat\beta), \qquad v_i = (w_i/f_i)d_i - (w_i/f_i)\,x_i'\,\mathrm{E}[w_i^2 x_i x_i']^-\,\mathrm{E}[w_i^2 x_i d_i],$$
so that
$$L_n(\alpha) = \Omega_n^{-1}\,\Big|\sqrt{n}\,\big\{\mathbb{E}_n[\widehat\varphi_i(w_i/f_i)d_i] - \mathbb{E}_n[\widehat\varphi_i(w_i/f_i)x_i]'\,\mathrm{E}[w_i^2 x_i x_i']^-\,\mathrm{E}[w_i^2 x_i d_i]\big\}\Big|^2.$$
This is indeed a familiar form of a $C(\alpha)$ statistic.
The estimator $\check\alpha$ that minimizes $L_n$ up to $o_P(1)$ obeys, under suitable regularity conditions,
$$\sigma_n^{-1}\sqrt{n}\,(\check\alpha - \alpha_0) \rightsquigarrow N(0,1), \qquad \sigma_n^2 = \frac{1}{4}\,\mathrm{E}[f_i d_i v_i]^{-2}\,\mathrm{E}[v_i^2].$$
The smallest value of $\sigma_n^2$ is achieved by using $v_i = v_i^*$, induced by setting $w_i = f_i$:
$$\sigma_n^{*2} = \frac{1}{4}\,\mathrm{E}[v_i^{*2}]^{-1}. \qquad (5.22)$$
Thus, setting $w_i = f_i$ gives an optimal instrument $v_i^*$ amongst all "immunizing" instruments generated by the process described above. Obviously, this improvement translates into shorter confidence intervals and better testing based on either $\check\alpha$ or $L_n$. While $w_i = f_i$ is optimal, $f_i$ will have to be estimated in practice, resulting in more stringent conditions than when using non-optimal, known weights, e.g., $w_i = 1$. The use of known weights may also give better behavior under misspecification of the model. Under homoscedasticity, $w_i = 1$ is an optimal weight.
5.2. Minimax Efficiency. There is also a clean connection to the (local) minimax efficiency analysis from semiparametric efficiency theory. [16] derives the efficient score function for the partially linear median regression model:
$$S_i = 2\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)\,f_i\,[d_i - m_0^*(x_i)],$$
where $m_0^*(x_i)$ is the $m_0(x_i)$ in (5.19) induced by the weight $w_i = f_i$:
$$m_0^*(x_i) = \frac{\mathrm{E}[f_i^2 d_i \mid x_i]}{\mathrm{E}[f_i^2 \mid x_i]}.$$
Using the assumption that $m_0^*(x_i) = x_i'\theta_0^*$, where $\|\theta_0^*\|_0 \leqslant s \ll n$ is sparse, we have that
$$S_i = 2\varphi(y_i - d_i\alpha_0 - x_i'\beta_0)\,v_i^*,$$
which is the score that was constructed using Neymanization. It follows that the estimator based on the instrument $v_i^*$ is in fact efficient in the minimax sense (see Theorem 18.4 in [15]), and inference about $\alpha_0$ based on this estimator provides the best minimax power against local alternatives (see Theorem 18.12 in [15]).
The claim above is formal as long as, given a law $P_n^*$, the least favorable submodels are permitted as deviations that lie within the overall model $\mathcal{P}_n$. Specifically, given a law $P_n^*$, we shall need to allow for a certain neighborhood $\mathcal{P}_n^\delta$ of $P_n^*$ such that $P_n^* \in \mathcal{P}_n^\delta \subset \mathcal{P}_n$, where the overall model $\mathcal{P}_n$ is defined similarly as before, except now permitting heteroscedasticity (or we can keep homoscedasticity $f_i = f_\epsilon$ to maintain formality). To allow for this we consider a collection of laws indexed by a parameter $t = (t_1, t_2)$, generated by
$$y_i = d_i\underbrace{(\alpha_0^* + t_1)}_{\alpha_0} + x_i'\underbrace{(\beta_0^* + t_2\theta_0^*)}_{\beta_0} + \epsilon_i, \qquad \|t\| \leqslant \delta, \qquad (5.23)$$
$$f_i d_i = f_i x_i'\theta_0^* + v_i^*, \qquad \mathrm{E}[f_i v_i^* \mid x_i] = 0, \qquad (5.24)$$
where $\|\beta_0^*\|_0 + \|\theta_0^*\|_0 \leqslant s$ and conditions as in Section 2 hold. The case $t = 0$ generates the law $P_n^*$; by varying $t$ within the $\delta$-ball, we generate a set of laws, denoted $\mathcal{P}_n^\delta$, containing the least favorable deviations from $t = 0$. By [16], the efficient score for the model given above is $S_i$, so we cannot have a better regular estimator than an estimator whose influence function is $J^{-1}S_i$, where $J = \mathrm{E}[S_i^2]$. Since our overall model $\mathcal{P}_n$ contains $\mathcal{P}_n^\delta$, all the formal conclusions about (local minimax) optimality of our estimators follow from the theorems cited above (using subsequence arguments to handle models changing with $n$). Our estimators are regular, since under any law $P_n$ in the set $\mathcal{P}_n^\delta$ with $\delta \to 0$, the first-order asymptotics of $\sqrt{n}(\check\alpha - \alpha_0)$ does not change, as a consequence of the theorems in Section 2 (in fact our theorems show more than this).
5.3. Conclusion. In this paper we propose a method for inference on the coefficient $\alpha_0$ of a main regressor that is valid uniformly over many data-generating processes and robust to possible "moderate" model selection mistakes. The robustness of the method is achieved by relying on a Neyman-type estimating equation whose gradient with respect to the nuisance parameters is zero. In the present homoscedastic setting the proposed estimator is asymptotically normal and also achieves the semi-parametric efficiency bound.
Appendix A. Instrumental LAD Regression with Estimated Inputs

Throughout this section, let
$$\psi_{\alpha,\beta,\theta}(y_i, d_i, x_i) = \big(1/2 - 1\{y_i \leqslant x_i'\beta + d_i\alpha\}\big)(d_i - x_i'\theta) = \big(1/2 - 1\{y_i \leqslant x_i'\beta + d_i\alpha\}\big)\{v_i - x_i'(\theta - \theta_0)\}.$$
For fixed $\alpha \in \mathbb{R}$ and $\beta, \theta \in \mathbb{R}^p$, define the function
$$\Gamma(\alpha, \beta, \theta) := \bar{\mathrm{E}}[\psi_{\alpha,\beta,\theta}(y_i, d_i, x_i)].$$
For notational convenience, let $h = (\beta', \theta')'$, $h_0 = (\beta_0', \theta_0')'$ and $\widehat h = (\widehat\beta', \widehat\theta')'$. The partial derivative of $\Gamma(\alpha, \beta, \theta)$ with respect to $\alpha$ is denoted by $\Gamma_1(\alpha, \beta, \theta)$, and the partial derivative of $\Gamma(\alpha, \beta, \theta)$ with respect to $h = (\beta', \theta')'$ is denoted by $\Gamma_2(\alpha, \beta, \theta)$. Consider the following high-level condition. Here $(\widehat\beta', \widehat\theta')'$ is a generic estimator of $(\beta_0', \theta_0')'$ (and not necessarily the $\ell_1$-LAD and Lasso estimators, resp.), and $\check\alpha$ is defined by $\check\alpha \in \arg\min_{\alpha\in\mathcal{A}} L_n(\alpha)$ with this $(\widehat\beta', \widehat\theta')'$, where $\mathcal{A}$ here is also a generic (possibly random) compact interval. We assume that $(\widehat\beta', \widehat\theta')'$, $\mathcal{A}$ and $\check\alpha$ satisfy the following conditions.

Condition ILAD. (i) $f_\epsilon(t) \vee |f_\epsilon'(t)| \leqslant C$ for all $t \in \mathbb{R}$, $\bar{\mathrm{E}}[v_i^2] \geqslant c > 0$ and $\bar{\mathrm{E}}[v_i^4] \vee \bar{\mathrm{E}}[d_i^4] \leqslant C$. Moreover, for some sequences $\delta_n \to 0$ and $\Delta_n \to 0$, with probability at least $1 - \Delta_n$:
(ii) $\{\alpha : |\alpha - \alpha_0| \leqslant n^{-1/2}/\delta_n\} \subset \mathcal{A}$, where $\mathcal{A}$ is a (possibly random) compact interval;
(iii) the estimated parameters $(\widehat\beta', \widehat\theta')'$ satisfy
$$\Big\{1 \vee \max_{1\leqslant i\leqslant n}\big(\mathrm{E}[|v_i|] \vee |x_i'(\widehat\theta - \theta_0)|\big)\Big\}^{1/2}\,\|x_i'(\widehat\beta - \beta_0)\|_{2,n} \leqslant \delta_n n^{-1/4}, \qquad \|x_i'(\widehat\theta - \theta_0)\|_{2,n} \leqslant \delta_n n^{-1/4}, \qquad (A.25)$$
$$\sup_{\alpha\in\mathcal{A}}\big|\mathbb{G}_n(\psi_{\alpha,\widehat\beta,\widehat\theta} - \psi_{\alpha,\beta_0,\theta_0})\big| \leqslant \delta_n, \qquad (A.26)$$
where recall that $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\{f(y_i, d_i, x_i) - \mathrm{E}[f(y_i, d_i, x_i)]\}$; and lastly
(iv) the estimator $\check\alpha$ satisfies $|\check\alpha - \alpha_0| \leqslant \delta_n$.

Comment A.1. Condition ILAD suffices to make the impact of the estimation of the instruments negligible for the first-order asymptotics of the estimator $\check\alpha$. We note that Condition ILAD covers several different estimators, including both estimators proposed in Algorithms 1 and 2.

The following lemma summarizes the main inferential result based on the high-level Condition ILAD.

Lemma 1. Under Condition ILAD we have, for $\sigma_n^2 = 1/(4 f_\epsilon^2\,\bar{\mathrm{E}}[v_i^2])$,
$$\sigma_n^{-1}\sqrt{n}\,(\check\alpha - \alpha_0) \rightsquigarrow N(0,1) \quad \text{and} \quad n L_n(\alpha_0) \rightsquigarrow \chi^2(1).$$
Proof of Lemma 1. We shall separate the proof into two parts.

Part 1. (Proof of the first assertion). Observe that
$$\begin{aligned}
\mathbb{E}_n[\psi_{\check\alpha,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)] &= \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + \mathbb{E}_n[\psi_{\check\alpha,\widehat\beta,\widehat\theta}(y_i, d_i, x_i) - \psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)]\\
&= \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + \Gamma(\check\alpha, \widehat\beta, \widehat\theta)\\
&\quad + n^{-1/2}\,\mathbb{G}_n(\psi_{\check\alpha,\widehat\beta,\widehat\theta} - \psi_{\check\alpha,\beta_0,\theta_0}) + n^{-1/2}\,\mathbb{G}_n(\psi_{\check\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})\\
&= \mathrm{I} + \mathrm{II} + \mathrm{III} + \mathrm{IV}.
\end{aligned}$$
By Condition ILAD(iii) (A.26) we have, with probability at least $1 - \Delta_n$, that $|\mathrm{III}| \leqslant \delta_n n^{-1/2}$. We wish to show that
$$\mathrm{II} + (f_\epsilon\bar{\mathrm{E}}[v_i^2])(\check\alpha - \alpha_0) \lesssim_P \delta_n n^{-1/2} + \delta_n|\check\alpha - \alpha_0|. \qquad (A.27)$$
Observe that
$$\Gamma(\alpha, \widehat\beta, \widehat\theta) = \Gamma(\alpha, \beta_0, \theta_0) + \big\{\Gamma(\alpha, \widehat\beta, \widehat\theta) - \Gamma(\alpha, \beta_0, \theta_0) - \Gamma_2(\alpha, \beta_0, \theta_0)'(\widehat h - h_0)\big\} + \Gamma_2(\alpha, \beta_0, \theta_0)'(\widehat h - h_0).$$
Since $\Gamma(\alpha_0, \beta_0, \theta_0) = 0$, by Taylor's theorem there exists some point $\bar\alpha$ between $\alpha_0$ and $\alpha$ such that $\Gamma(\alpha, \beta_0, \theta_0) = \Gamma_1(\bar\alpha, \beta_0, \theta_0)(\alpha - \alpha_0)$. By its definition, we have
$$\Gamma_1(\alpha, \beta, \theta) = -\bar{\mathrm{E}}\big[f_\epsilon(x_i'(\beta-\beta_0)+d_i(\alpha-\alpha_0))\,d_i\,(d_i - x_i'\theta)\big] = -\bar{\mathrm{E}}\big[f_\epsilon(x_i'(\beta-\beta_0)+d_i(\alpha-\alpha_0))\,d_i\,\{v_i - x_i'(\theta-\theta_0)\}\big].$$
Since $f_\epsilon = f_\epsilon(0)$ and $d_i = x_i'\theta_0 + v_i$ with $\mathrm{E}[v_i] = 0$, we have $\Gamma_1(\alpha_0, \beta_0, \theta_0) = -f_\epsilon\bar{\mathrm{E}}[d_i v_i] = -f_\epsilon\bar{\mathrm{E}}[v_i^2]$. Also
$$|\Gamma_1(\alpha, \beta_0, \theta_0) - \Gamma_1(\alpha_0, \beta_0, \theta_0)| \leqslant \big|\bar{\mathrm{E}}[\{f_\epsilon(0) - f_\epsilon(d_i(\alpha - \alpha_0))\}d_i v_i]\big| \leqslant C|\alpha - \alpha_0|\,\bar{\mathrm{E}}[|d_i^2 v_i|].$$
Hence $\Gamma_1(\bar\alpha, \beta_0, \theta_0) = -f_\epsilon\bar{\mathrm{E}}[v_i^2] + O(1)|\check\alpha - \alpha_0|$ for $\bar\alpha$ between $\alpha_0$ and $\check\alpha$.
Observe that
$$\Gamma_2(\alpha, \beta, \theta) = \begin{pmatrix} -\bar{\mathrm{E}}\big[f_\epsilon(x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0))(d_i - x_i'\theta)\,x_i\big] \\ -\bar{\mathrm{E}}\big[(1/2 - 1\{y_i \leqslant x_i'\beta + d_i\alpha\})\,x_i\big] \end{pmatrix}.$$
Note that since $\bar{\mathrm{E}}[f_\epsilon(0)(d_i - x_i'\theta_0)x_i] = f_\epsilon\bar{\mathrm{E}}[v_i x_i] = 0$ and $\bar{\mathrm{E}}[(1/2 - 1\{y_i \leqslant x_i'\beta_0 + d_i\alpha_0\})x_i] = \bar{\mathrm{E}}[(1/2 - 1\{\epsilon_i \leqslant 0\})x_i] = 0$, we have $\Gamma_2(\alpha_0, \beta_0, \theta_0) = 0$. Moreover,
$$\begin{aligned}
|\Gamma_2(\alpha, \beta_0, \theta_0)'(\widehat h - h_0)| &= \big|\{\Gamma_2(\alpha, \beta_0, \theta_0) - \Gamma_2(\alpha_0, \beta_0, \theta_0)\}'(\widehat h - h_0)\big|\\
&\leqslant \big|\bar{\mathrm{E}}[\{f_\epsilon(d_i(\alpha - \alpha_0)) - f_\epsilon(0)\}v_i x_i'](\widehat\beta - \beta_0)\big| + \big|\bar{\mathrm{E}}[\{F(d_i(\alpha - \alpha_0)) - F(0)\}x_i'](\widehat\theta - \theta_0)\big|\\
&\leqslant O(1)\big\{\|x_i'(\widehat\beta - \beta_0)\|_{2,n} + \|x_i'(\widehat\theta - \theta_0)\|_{2,n}\big\}|\alpha - \alpha_0| = O_P(\delta_n)|\alpha - \alpha_0|.
\end{aligned}$$
Hence $|\Gamma_2(\check\alpha, \beta_0, \theta_0)'(\widehat h - h_0)| \lesssim_P \delta_n|\check\alpha - \alpha_0|$.

Denote by $\Gamma_{22}(\alpha, \beta, \theta)$ the Hessian matrix of $\Gamma(\alpha, \beta, \theta)$ with respect to $h = (\beta', \theta')'$. Then
$$\Gamma_{22}(\alpha, \beta, \theta) = \begin{pmatrix} -\bar{\mathrm{E}}\big[f_\epsilon'(x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0))(d_i - x_i'\theta)\,x_i x_i'\big] & \bar{\mathrm{E}}\big[f_\epsilon(x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0))\,x_i x_i'\big] \\ \bar{\mathrm{E}}\big[f_\epsilon(x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0))\,x_i x_i'\big] & 0 \end{pmatrix},$$
so that
$$\begin{aligned}
(\widehat h - h_0)'\Gamma_{22}(\alpha, \beta, \theta)(\widehat h - h_0) &\leqslant \big|(\widehat\beta - \beta_0)'\bar{\mathrm{E}}[f_\epsilon'(x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0))(d_i - x_i'\theta)x_i x_i'](\widehat\beta - \beta_0)\big|\\
&\quad + 2\big|(\widehat\beta - \beta_0)'\bar{\mathrm{E}}[f_\epsilon(x_i'(\beta - \beta_0) + d_i(\alpha - \alpha_0))x_i x_i'](\widehat\theta - \theta_0)\big|\\
&\leqslant C\Big\{\max_{1\leqslant i\leqslant n}\mathrm{E}[|d_i - x_i'\theta|]\,\|x_i'(\widehat\beta - \beta_0)\|_{2,n}^2 + 2\|x_i'(\widehat\beta - \beta_0)\|_{2,n}\cdot\|x_i'(\widehat\theta - \theta_0)\|_{2,n}\Big\}.
\end{aligned}$$
Here $|d_i - x_i'\theta| = |v_i - x_i'(\theta - \theta_0)| \leqslant |v_i| + |x_i'(\theta - \theta_0)|$. Hence, by Taylor's theorem together with ILAD(iii), we conclude that
$$\big|\Gamma(\check\alpha, \widehat\beta, \widehat\theta) - \Gamma(\check\alpha, \beta_0, \theta_0) - \Gamma_2(\check\alpha, \beta_0, \theta_0)'(\widehat h - h_0)\big| \lesssim_P \delta_n n^{-1/2}.$$
This leads to the expansion in (A.27).
We now proceed to bound the fourth term. By Condition ILAD(iv) we have, with probability at least $1 - \Delta_n$, that $|\check\alpha - \alpha_0| \leqslant \delta_n$. Observe that
$$(\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})(y_i, d_i, x_i) = \big(1\{y_i \leqslant x_i'\beta_0 + d_i\alpha_0\} - 1\{y_i \leqslant x_i'\beta_0 + d_i\alpha\}\big)v_i = \big(1\{\epsilon_i \leqslant 0\} - 1\{\epsilon_i \leqslant d_i(\alpha - \alpha_0)\}\big)v_i,$$
so that $|(\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})(y_i, d_i, x_i)| \leqslant 1\{|\epsilon_i| \leqslant \delta_n|d_i|\}|v_i|$ whenever $|\alpha - \alpha_0| \leqslant \delta_n$. Since the class of functions $\{(y, d, x) \mapsto (\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})(y, d, x) : |\alpha - \alpha_0| \leqslant \delta_n\}$ is a VC subgraph class with VC index bounded by some constant independent of $n$, using (a version of) Theorem 2.14.1 in [25], we have
$$\sup_{|\alpha - \alpha_0|\leqslant\delta_n}\big|\mathbb{G}_n(\psi_{\alpha,\beta_0,\theta_0} - \psi_{\alpha_0,\beta_0,\theta_0})\big| \lesssim_P \big(\bar{\mathrm{E}}[1\{|\epsilon_i| \leqslant \delta_n|d_i|\}v_i^2]\big)^{1/2} \lesssim_P \delta_n^{1/2}.$$
This implies that $|\mathrm{IV}| \lesssim_P \delta_n^{1/2} n^{-1/2}$.

Combining these bounds on II, III and IV, we have the following stochastic expansion:
$$\mathbb{E}_n[\psi_{\check\alpha,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)] = -(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\check\alpha - \alpha_0) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + O_P(\delta_n^{1/2} n^{-1/2}) + O_P(\delta_n)|\check\alpha - \alpha_0|.$$
Let $\alpha^* = \alpha_0 + (f_\epsilon\bar{\mathrm{E}}[v_i^2])^{-1}\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)]$. Then $\alpha^* \in \mathcal{A}$ with probability $1 - o(1)$ since $|\alpha^* - \alpha_0| \lesssim_P n^{-1/2}$. It is not difficult to see that the above stochastic expansion holds with $\check\alpha$ replaced by $\alpha^*$, so that
$$\mathbb{E}_n[\psi_{\alpha^*,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)] = -(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\alpha^* - \alpha_0) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + O_P(\delta_n^{1/2} n^{-1/2}) = O_P(\delta_n^{1/2} n^{-1/2}).$$
Therefore $|\mathbb{E}_n[\psi_{\check\alpha,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)]| \leqslant |\mathbb{E}_n[\psi_{\alpha^*,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)]| = O_P(\delta_n^{1/2} n^{-1/2})$ (since $\check\alpha$ minimizes $L_n$ over $\mathcal{A} \ni \alpha^*$), so that
$$(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\check\alpha - \alpha_0) = \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + O_P(\delta_n^{1/2} n^{-1/2}),$$
which immediately implies that $\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1)$, since by the Lyapunov CLT,
$$\big(\bar{\mathrm{E}}[v_i^2]/4\big)^{-1/2}\sqrt{n}\,\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] \rightsquigarrow N(0,1).$$
Part 2. (Proof of the second assertion). First consider the denominator of $L_n(\alpha_0)$. We have that
$$\begin{aligned}
\big|\mathbb{E}_n[\widehat v_i^2] - \mathbb{E}_n[v_i^2]\big| &= \big|\mathbb{E}_n[(\widehat v_i - v_i)(\widehat v_i + v_i)]\big| \leqslant \|\widehat v_i - v_i\|_{2,n}\|\widehat v_i + v_i\|_{2,n}\\
&\leqslant \|x_i'(\widehat\theta - \theta_0)\|_{2,n}\big(2\|v_i\|_{2,n} + \|x_i'(\widehat\theta - \theta_0)\|_{2,n}\big) \lesssim_P \delta_n,
\end{aligned}$$
where we have used the fact that $\|v_i\|_{2,n} \lesssim_P (\bar{\mathrm{E}}[v_i^2])^{1/2} = O(1)$ (which is guaranteed by ILAD(i)).

Next consider the numerator of $L_n(\alpha_0)$. Since $\bar{\mathrm{E}}[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] = 0$, we have
$$\mathbb{E}_n[\psi_{\alpha_0,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)] = n^{-1/2}\,\mathbb{G}_n(\psi_{\alpha_0,\widehat\beta,\widehat\theta} - \psi_{\alpha_0,\beta_0,\theta_0}) + \Gamma(\alpha_0, \widehat\beta, \widehat\theta) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)].$$
By Condition ILAD(iii) and the previous calculation, we have
$$\big|\mathbb{G}_n(\psi_{\alpha_0,\widehat\beta,\widehat\theta} - \psi_{\alpha_0,\beta_0,\theta_0})\big| \lesssim_P \delta_n \quad \text{and} \quad \big|\Gamma(\alpha_0, \widehat\beta, \widehat\theta)\big| \lesssim_P \delta_n n^{-1/2}.$$
Therefore, using the simple identity $nA_n^2 = nB_n^2 + n(A_n - B_n)^2 + 2nB_n(A_n - B_n)$ with
$$A_n = \mathbb{E}_n[\psi_{\alpha_0,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)] \quad \text{and} \quad B_n = \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] \lesssim_P (\bar{\mathrm{E}}[v_i^2])^{1/2} n^{-1/2},$$
we have
$$n L_n(\alpha_0) = \frac{4n\big|\mathbb{E}_n[\psi_{\alpha_0,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)]\big|^2}{\mathbb{E}_n[\widehat v_i^2]} = \frac{4n\big|\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)]\big|^2}{\bar{\mathrm{E}}[v_i^2]} + o_P(1),$$
since $\bar{\mathrm{E}}[v_i^2] \geqslant c$ is bounded away from zero. The result then follows since
$$\big(\bar{\mathrm{E}}[v_i^2]/4\big)^{-1/2}\sqrt{n}\,\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] \rightsquigarrow N(0,1). \qquad\square$$
Comment A.2 (On the one-step procedure). An inspection of the proof leads to the following stochastic expansion:
$$\mathbb{E}_n[\psi_{\widetilde\alpha,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)] = -(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\widetilde\alpha - \alpha_0) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + O_P\big(\delta_n^{1/2}n^{-1/2} + \delta_n n^{-1/4}|\widetilde\alpha - \alpha_0| + |\widetilde\alpha - \alpha_0|^2\big),$$
where $\widetilde\alpha$ is any consistent estimator of $\alpha_0$. Hence, provided that $|\widetilde\alpha - \alpha_0| = o_P(n^{-1/4})$, the remainder term in the above expansion is $o_P(n^{-1/2})$, and the one-step estimator $\check\alpha$ defined by
$$\check\alpha = \widetilde\alpha + \big(\mathbb{E}_n[f_\epsilon\widehat v_i^2]\big)^{-1}\mathbb{E}_n[\psi_{\widetilde\alpha,\widehat\beta,\widehat\theta}(y_i, d_i, x_i)]$$
has the following stochastic expansion:
$$\begin{aligned}
\check\alpha &= \widetilde\alpha + \big\{f_\epsilon\bar{\mathrm{E}}[v_i^2] + o_P(n^{-1/4})\big\}^{-1}\big\{-(f_\epsilon\bar{\mathrm{E}}[v_i^2])(\widetilde\alpha - \alpha_0) + \mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + o_P(n^{-1/2})\big\}\\
&= \alpha_0 + (f_\epsilon\bar{\mathrm{E}}[v_i^2])^{-1}\mathbb{E}_n[\psi_{\alpha_0,\beta_0,\theta_0}(y_i, d_i, x_i)] + o_P(n^{-1/2}),
\end{aligned}$$
so that $\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1)$.
Appendix B. Proof of Theorem 1

The proof of Theorem 1 uses the properties of Post-$\ell_1$-LAD and Post-Lasso. We collect these properties, together with the required regularity conditions, in Appendix D.
Proof of Theorem 1. We will verify Condition ILAD; the desired result then follows from Lemma 1. The assumptions on the error density $f_\epsilon(\cdot)$ in Condition ILAD(i) are assumed in Condition I(iv). The moment conditions on $d_i$ and $v_i$ in Condition ILAD(i) are assumed in Condition I(ii).

Condition SE implies that $\kappa_{\bar c}$ is bounded away from zero with probability $1 - \Delta_n$ for $n$ sufficiently large; see [9]. Step 1 relies on Post-$\ell_1$-LAD. By assumption, with probability $1 - \Delta_n$ we have $\widehat s = \|\widehat\beta\|_0 \leqslant Cs$. Thus, by Condition SE, $\phi_{\min}(\widehat s + s)$ is bounded away from zero, since $\widehat s + s \leqslant \ell_n s$ for large enough $n$ with probability $1 - \Delta_n$. Moreover, Condition PLAD in Appendix D is implied by Condition I. The required side condition of Lemma 4 is satisfied by relations (F.41) and (F.42). By Lemma 4 we have $|\widetilde\alpha - \alpha_0| \lesssim_P \sqrt{s\log(p\vee n)/n} \leqslant o(1)\log^{-1}n$ under $s^3\log^3(p\vee n) \leqslant \delta_n n$. Note that this implies $\{\alpha : |\alpha - \alpha_0| \leqslant n^{-1/2}\log n\} \subset \mathcal{A}$ (with probability $1 - o(1)$), which is required in ILAD(ii), and the (shrinking) definition of $\mathcal{A}$ establishes the initial rate in ILAD(iv). By Lemma 5 in Appendix D we have $\|x_i'(\widehat\beta - \beta_0)\|_{2,n} \lesssim_P \sqrt{s\log(p\vee n)/n}$, since the required side condition holds. Indeed, for $\tilde x_i = (d_i, x_i')'$ and $\delta = (\delta_d, \delta_x')'$, because of Condition SE and the fact that $\mathbb{E}_n[|d_i|^3] \lesssim_P \bar{\mathrm{E}}[|d_i|^3] = O(1)$,
$$\begin{aligned}
\inf_{\|\delta\|_0\leqslant s+Cs}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]} &\geqslant \inf_{\|\delta\|_0\leqslant s+Cs}\frac{\{\phi_{\min}(s+Cs)\}^{3/2}\|\delta\|^3}{4\,\mathbb{E}_n[|x_i'\delta_x|^3] + 4|\delta_d|^3\,\mathbb{E}_n[|d_i|^3]}\\
&\geqslant \inf_{\|\delta\|_0\leqslant s+Cs}\frac{\{\phi_{\min}(s+Cs)\}^{3/2}\|\delta\|^3}{4K_x\|\delta_x\|_1\phi_{\max}(s+Cs)\|\delta_x\|^2 + 4\|\delta\|^3\,\mathbb{E}_n[|d_i|^3]}\\
&\geqslant \frac{\{\phi_{\min}(s+Cs)\}^{3/2}}{4K_x\sqrt{s+Cs}\,\phi_{\max}(s+Cs) + 4\,\mathbb{E}_n[|d_i|^3]} \gtrsim_P \frac{1}{K_x\sqrt{s}}.
\end{aligned}$$
Therefore, since $K_x^2 s^2\log^2(p\vee n) \leqslant \delta_n n$ and $\lambda \lesssim \sqrt{n\log(p\vee n)}$, we have
$$\inf_{\|\delta\|_0\leqslant s+Cs}\frac{\|\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[|\tilde x_i'\delta|^3]}\cdot\frac{\sqrt{n}}{\sqrt{s}+\sqrt{s\log(p\vee n)}} \gtrsim_P \frac{\sqrt{n}\,\phi_{\min}(s+Cs)}{K_x\sqrt{s}\,\sqrt{s\log(p\vee n)}} \to \infty.$$
Step 2 relies on Post-Lasso. Condition HL in Appendix D is implied by Condition I and Lemma 2, applied twice with $\zeta_i = v_i$ and $\zeta_i = d_i$, under the condition that $K_x^4\log p \leqslant \delta_n n$. By Lemma 7 in Appendix D we have $\|x_i'(\widehat\theta - \theta_0)\|_{2,n} \lesssim_P \sqrt{s\log(p\vee n)/n}$ and $\|\widehat\theta\|_0 \lesssim s$ with probability $1 - o(1)$.

The rates established above for $\widehat\theta$ and $\widehat\beta$ imply (A.25) in ILAD(iii), since by Condition I(ii) $\mathrm{E}[|v_i|] \leqslant (\mathrm{E}[v_i^2])^{1/2} = O(1)$ and $\max_{1\leqslant i\leqslant n}|x_i'(\widehat\theta - \theta_0)| \lesssim_P K_x\sqrt{s^2\log(p\vee n)/n} = o(1)$.

We now verify the last requirement in Condition ILAD(iii). Consider the following class of functions:
$$\mathcal{F}_s = \big\{(y, d, x) \mapsto 1\{y \leqslant x'\beta + d\alpha\} : \alpha \in \mathbb{R},\ \|\beta\|_0 \leqslant Cs\big\},$$
which is the union of $\binom{p}{Cs}$ VC-subgraph classes of functions with VC indices bounded by $C's$. Hence
$$\log N(\epsilon, \mathcal{F}_s, \|\cdot\|_{\mathbb{P}_n,2}) \lesssim s\log p + s\log(1/\epsilon).$$
Likewise, consider the following class of functions: $\mathcal{G}_{s,R} = \{(y, d, x) \mapsto x'\theta : \|\theta\|_0 \leqslant Cs,\ \|x_i'\theta\|_{2,n} \leqslant R\}$. Then
$$\log N(\epsilon\|G_{s,R}\|_{\mathbb{P}_n,2}, \mathcal{G}_{s,R}, \|\cdot\|_{\mathbb{P}_n,2}) \lesssim s\log p + s\log(1/\epsilon),$$
where $G_{s,R}(y, d, x) = \max_{\|\theta\|_0\leqslant Cs,\ \|x_i'\theta\|_{2,n}\leqslant R}|x'\theta|$ is the envelope.

Note that
$$\sup_{\alpha\in\mathcal{A}}\big|\mathbb{G}_n(\psi_{\alpha,\widehat\beta,\widehat\theta} - \psi_{\alpha,\beta_0,\theta_0})\big| \leqslant \sup_{\alpha\in\mathcal{A}}\big|\mathbb{G}_n(\psi_{\alpha,\widehat\beta,\widehat\theta} - \psi_{\alpha,\widehat\beta,\theta_0})\big| \qquad (B.28)$$
$$\qquad\qquad\qquad + \sup_{\alpha\in\mathcal{A}}\big|\mathbb{G}_n(\psi_{\alpha,\widehat\beta,\theta_0} - \psi_{\alpha,\beta_0,\theta_0})\big|. \qquad (B.29)$$
Consider bounding (B.28). Observe that
$$\psi_{\alpha,\beta,\theta}(y_i, d_i, x_i) - \psi_{\alpha,\beta,\theta_0}(y_i, d_i, x_i) = -(1/2 - 1\{y_i \leqslant x_i'\beta + d_i\alpha\})\,x_i'(\theta - \theta_0),$$
and consider the class of functions $\mathcal{H}^1_{s,R} = \{(y, d, x) \mapsto (1/2 - 1\{y \leqslant x'\beta + d\alpha\})x'(\theta - \theta_0) : \alpha \in \mathbb{R},\ \|\beta\|_0 \leqslant Cs,\ \|\theta\|_0 \leqslant Cs,\ \|x_i'(\theta - \theta_0)\|_{2,n} \leqslant R\}$ with $R \lesssim \sqrt{s\log(p\vee n)/n}$. Then by Lemma 9, together with the above entropy calculations (and some straightforward algebra), we have
$$\sup_{g\in\mathcal{H}^1_{s,R}}|\mathbb{G}_n(g)| \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{s\log(p\vee n)/n} = o_P(1),$$
where $s^2\log^2(p\vee n) \leqslant \delta_n n$ is used. Since $\|x_i'(\widehat\theta - \theta_0)\|_{2,n} \lesssim_P \sqrt{s\log(p\vee n)/n}$ and $\|\widehat\beta\|_0 \vee \|\widehat\theta\|_0 \lesssim s$ with probability $1 - o(1)$, we conclude that (B.28) $= o_P(1)$.

Lastly, consider bounding (B.29). Observe that
$$\psi_{\alpha,\beta,\theta_0}(y_i, d_i, x_i) - \psi_{\alpha,\beta_0,\theta_0}(y_i, d_i, x_i) = -(1\{y_i \leqslant x_i'\beta + d_i\alpha\} - 1\{y_i \leqslant x_i'\beta_0 + d_i\alpha\})\,v_i,$$
where $v_i = d_i - x_i'\theta_0$, and consider the class of functions $\mathcal{H}^2_{s,R} = \{(y, d, x) \mapsto (1\{y \leqslant x'\beta + d\alpha\} - 1\{y \leqslant x'\beta_0 + d\alpha\})(d - x'\theta_0) : \alpha \in \mathbb{R},\ \|\beta\|_0 \leqslant Cs,\ \|x_i'(\beta - \beta_0)\|_{2,n} \leqslant R\}$ with $R \lesssim \sqrt{s\log(p\vee n)/n}$. Then by Lemma 9, together with the above entropy calculations (and some straightforward algebra), we have
$$\sup_{g\in\mathcal{H}^2_{s,R}}|\mathbb{G}_n(g)| \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{\sup_{g\in\mathcal{H}^2_{s,R}}\mathbb{E}_n[g(y_i, d_i, x_i)^2] \vee \bar{\mathrm{E}}[g(y_i, d_i, x_i)^2]}.$$
Here we have
$$\bar{\mathrm{E}}[g(y_i, d_i, x_i)^2] \leqslant C\|x_i'(\beta - \beta_0)\|_{2,n}\,(\bar{\mathrm{E}}[v_i^4])^{1/2} \lesssim \sqrt{s\log(p\vee n)/n}.$$
On the other hand,
$$\sup_{g\in\mathcal{H}^2_{s,R}}\mathbb{E}_n[g(y_i, d_i, x_i)^2] \leqslant n^{-1/2}\sup_{g\in\mathcal{H}^2_{s,R}}\mathbb{G}_n(g^2) + \sup_{g\in\mathcal{H}^2_{s,R}}\bar{\mathrm{E}}[g(y_i, d_i, x_i)^2], \qquad (B.30)$$
and we apply Lemma 9 to the first term on the right side of (B.30). Then we have
$$\sup_{g\in\mathcal{H}^2_{s,R}}\mathbb{G}_n(g^2) \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{\sup_{g\in\mathcal{H}^2_{s,R}}\mathbb{E}_n[g(y_i, d_i, x_i)^4] \vee \bar{\mathrm{E}}[g(y_i, d_i, x_i)^4]} \lesssim \sqrt{s\log(p\vee n)}\,\sqrt{\mathbb{E}_n[v_i^4] \vee \bar{\mathrm{E}}[v_i^4]} \lesssim_P \sqrt{s\log(p\vee n)}\,\sqrt{\bar{\mathrm{E}}[v_i^4]}.$$
Since $\|x_i'(\widehat\beta - \beta_0)\|_{2,n} \lesssim_P \sqrt{s\log(p\vee n)/n}$ and $\|\widehat\beta\|_0 \leqslant Cs$ with probability $1 - \Delta_n$, we conclude that
$$(\mathrm{B.29}) \lesssim_P \sqrt{s\log(p\vee n)}\,\big(s\log(p\vee n)/n\big)^{1/4} = o(1),$$
where $s^3\log^3(p\vee n) \leqslant \delta_n n$ is used. $\square$
Appendix C. Auxiliary Technical Results

In this section we collect two auxiliary technical results. Their proofs are given in the Supplementary Appendix.

Lemma 2. Let $x_1, \ldots, x_n$ be non-stochastic vectors in $\mathbb{R}^p$ with $\max_{1\leqslant i\leqslant n}\|x_i\|_\infty \leqslant K_x$. Let $\zeta_1, \ldots, \zeta_n$ be independent random variables such that $\bar{\mathrm{E}}[|\zeta_i|^q] < \infty$ for some $q \geqslant 4$. Then with probability at least $1 - 8\tau$,
$$\max_{1\leqslant j\leqslant p}\big|(\mathbb{E}_n - \bar{\mathrm{E}})[x_{ij}^2\zeta_i^2]\big| \leqslant 4\sqrt{\frac{\log(2p/\tau)}{n}}\,K_x^2\big(\bar{\mathrm{E}}[|\zeta_i|^q]/\tau\big)^{4/q}.$$

Lemma 3. Let $T = \mathrm{support}(\beta_0)$, $|T| = \|\beta_0\|_0 \leqslant s$ and $\|\widehat\beta_{T^c}\|_1 \leqslant \bar c\,\|\widehat\beta_T - \beta_0\|_1$. Moreover, let $\widehat\beta_{(2m)}$ denote the vector formed by the largest $2m$ components of $\widehat\beta$ in absolute value and zero in the remaining components. Then for $m \geqslant s$ we have that $\widehat\beta_{(2m)}$ satisfies
$$\|x_i'(\widehat\beta_{(2m)} - \beta_0)\|_{2,n} \leqslant \|x_i'(\widehat\beta - \beta_0)\|_{2,n} + \sqrt{\phi_{\max}(m)/m}\ \bar c\,\|\widehat\beta_T - \beta_0\|_1,$$
where $\phi_{\max}(m)/m \leqslant 2\phi_{\max}(s)/s$ and $\|\widehat\beta_T - \beta_0\|_1 \leqslant \sqrt{s}\,\|x_i'(\widehat\beta - \beta_0)\|_{2,n}/\kappa_{\bar c}$.
References
[1] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random
matrices. Constructive Approximation, 28:253โ263, 2008.
[2] A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an
application to eminent domain. Econometrica, 80(6):2369โ2430, November 2012.
[3] A. Belloni and V. Chernozhukov. โ1 -penalized quantile regression for high dimensional sparse models. Annals of Statistics, 39(1):82โ130, 2011.
[4] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli,
19(2):521โ547, 2013.
[5] A. Belloni, V. Chernozhukov, and I. Fernandez-Val. Conditional Quantile Processes based on Series or Many Regressors.
ArXiv e-prints, May 2011.
[6] A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment e๏ฌects after selection amongst high-dimensional
controls. ArXiv, 2011.
[7] A. Belloni, V. Chernozhukov, and C. Hansen. Inference for high-dimensional sparse econometric models. Advances in
Economics and Econometrics. 10th World Congress of Econometric Society, held in August 2010, III:245โ295, 2013.
[8] A. Belloni, V. Chernozhukov, and K. Kato. Robust inference in high-dimensional sparse quantile regression models.
Working Paper, 2013.
[9] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics,
37(4):1705โ1732, 2009.
[10] Victor Chernozhukov and Christian Hansen. Instrumental variable quantile regression: A robust inference approach.
Journal of Econometrics, 142:379โ398, 2008.
[11] Xuming He and Qi-Man Shao. On parameters of increasing dimensions. J. Multivariate Anal., 73(1):120โ135, 2000.
[12] Bing-Yi Jing, Qi-Man Shao, and Qiying Wang. Self-normalized Cramer-type large deviations for independent random
variables. Ann. Probab., 31(4):2167โ2215, 2003.
[13] K. Kato. Group lasso for high dimensional sparse quantile regression models. Preprint, ArXiv, 2011.
[14] Roger Koenker. Quantile regression, volume 38 of Econometric Society Monographs. Cambridge University Press,
Cambridge, 2005.
[15] Michael R. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. Series in Statistics. Springer,
Berlin, 2008.
[16] Sokbae Lee. E๏ฌcient semiparametric estimation of a partially linear quantile regression model. Econometric theory,
19:1โ31, 2003.
[17] Hannes Leeb and Benedikt M. Poฬtscher. Model selection and inference: facts and ๏ฌction. Economic Theory, 21:21โ59,
2005.
[18] Hannes Leeb and Benedikt M. Poฬtscher. Sparse estimators and the oracle property, or the return of Hodgesโ estimator.
J. Econometrics, 142(1):201โ211, 2008.
[19] J. Neyman. Optimal asymptotic tests of composite statistical hypotheses. In U. Grenander, editor, Probability and
Statistics, the Harold Cramer Volume. New York: John Wiley and Sons, Inc., 1959.
[20] J. Neyman. ๐(๐ผ) tests and their use. Sankhya, 41:1โ21, 1979.
[21] J. L. Powell. Censored regression quantiles. Journal of Econometrics, 32:143โ155, 1986.
[22] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. ArXiv:1106.1151, 2011.
[23] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58:267โ288, 1996.
[24] S. A. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614โ645, 2008.
[25] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics, 1996.
[26] Lie Wang. L1 penalized LAD estimator for high dimensional linear regression. ArXiv, 2012.
[27] Cun-Hui Zhang and Stephanie S. Zhang. Con๏ฌdence intervals for low-dimensional parameters with high-dimensional
data. ArXiv.org, (arXiv:1110.2563v1), 2011.
[28] S. Zhou. Restricted eigenvalue conditions on subgaussian matrices. ArXiv:0904.4723v2, 2009.
Figure 1. [Figure: four surface plots of the rejection probability rp(0.05), plotted against the $R^2$ in the equation for $d$ and the $R^2$ in the equation for $y$, with panels "Standard approach", "Proposed Method 1", "Proposed Method 2", and "Ideal".] The figure displays the empirical rejection probabilities of nominal 5% level tests of a true hypothesis based on different testing procedures: the top left plot is based on the standard post-model-selection procedure, the top right plot is based on the proposed post-model-selection procedure, and the bottom left plot is based on another proposed procedure based on the statistic $L_n$. The results are based on 500 replications for each of the 100 combinations of $R^2$'s in the primary and auxiliary equations in (3.14). Ideally we should observe the 5% rejection rate (of a true null) uniformly across the parameter space (as in the bottom right plot).
Figure 2. [Figure: six surface plots, plotted against the $R^2$ in the equation for $d$ and the $R^2$ in the equation for $y$, with panels "Standard approach: Bias", "Proposed Method: Bias", "Standard approach: Standard Deviation", "Proposed Method: Standard Deviation", "Standard approach: RMSE", and "Proposed Method: RMSE".] The figure displays mean bias (top row), standard deviation (middle row), and root mean square error (bottom row) for the proposed post-model-selection estimator (right column) and the standard post-model-selection estimator (left column). The results are based on 500 replications for each of the 100 combinations of $R^2$'s in the primary and auxiliary equations in (3.14).
Supplementary Appendix for โUniform Post Selection Inference
for LAD Regression Modelsโ
Appendix D. Auxiliary Results for ℓ1-LAD and Heteroscedastic Lasso

In this section we state relevant theoretical results on the performance of the estimators ℓ1-LAD, Post-ℓ1-LAD, heteroscedastic Lasso, and heteroscedastic Post-Lasso. These results were developed in [3] and [2]. The main design condition relies on the restricted eigenvalue proposed in [9], namely, for $\tilde x_i = (d_i, x_i')'$,
$$\kappa_{\bar c} \;=\; \inf_{\|\delta_{T^c}\|_1 \leq \bar c\,\|\delta_T\|_1}\ \frac{\|\tilde x_i'\delta\|_{2,n}}{\|\delta_T\|}, \qquad (D.31)$$
where $\bar c = (c+1)/(c-1)$ for the slack constant $c > 1$, see [9]. It is well known that Condition SE implies that $\kappa_{\bar c}$ is bounded away from zero if $\bar c$ is bounded, for any subset $T \subset \{1,\dots,p\}$ with $|T| \leq s$.
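As an aside, the quantity in (D.31) can be probed numerically. The sketch below is our own illustration (hypothetical function and variable names, not from the paper): it draws random directions satisfying the ℓ1 restriction and reports the smallest sampled ratio. Since the infimum runs over a continuum, the sampled minimum is only a crude upper bound on $\kappa_{\bar c}$, not a certificate.

```python
import numpy as np

def restricted_eigenvalue_mc(X, T, cbar, n_draws=5000, seed=0):
    """Crude Monte Carlo upper bound on the restricted eigenvalue in (D.31).

    X    : (n, p) design matrix (rows x_i'), assumed column-normalized.
    T    : indices of the support set (|T| <= s).
    cbar : restriction constant (c + 1)/(c - 1).
    Draws directions delta with ||delta_{T^c}||_1 <= cbar * ||delta_T||_1 and
    returns the smallest sampled value of ||x_i' delta||_{2,n} / ||delta_T||.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Tc = np.setdiff1d(np.arange(p), T)
    best = np.inf
    for _ in range(n_draws):
        delta = np.zeros(p)
        delta[T] = rng.standard_normal(len(T))
        # random direction on T^c, rescaled to respect the l1 restriction
        raw = rng.standard_normal(len(Tc))
        budget = cbar * np.abs(delta[T]).sum() * rng.uniform()
        delta[Tc] = raw / (np.abs(raw).sum() + 1e-12) * budget
        pred_norm = np.sqrt(np.mean((X @ delta) ** 2))   # ||x_i' delta||_{2,n}
        best = min(best, pred_norm / np.linalg.norm(delta[T]))
    return best

# toy example; cbar = 21 corresponds to slack constant c = 1.1
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
X /= np.sqrt((X ** 2).mean(axis=0))          # enforce En[x_ij^2] = 1
print(restricted_eigenvalue_mc(X, T=[0, 1, 2], cbar=21.0))
```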
D.1. ℓ1-Penalized LAD. For a data generating process such that $P(y_i \leq \tilde x_i'\theta_0 \mid \tilde x_i) = 1/2$, independent across $i$ ($i = 1,\dots,n$), we consider the estimation of $\theta_0$ via the ℓ1-penalized LAD regression estimate
$$\hat\theta \in \arg\min_{\theta}\ \mathbb{E}_n[\,|y_i - \tilde x_i'\theta|\,] + \frac{\lambda}{n}\,\|\theta\|_1.$$
As established in [3] and [26], under the event that
$$\frac{\lambda}{n} \;\geq\; 2c\,\bigl\|\mathbb{E}_n[(1/2 - 1\{y_i \leq \tilde x_i'\theta_0\})\,\tilde x_i]\bigr\|_\infty, \qquad (D.32)$$
the estimator above achieves good theoretical guarantees under mild design conditions. Although $\theta_0$ is unknown, we can set $\lambda$ so that the event in (D.32) holds with high probability. In particular, the pivotal rule discussed in [3] proposes to set $\lambda = c'n\Lambda(1-\gamma \mid \tilde x)$ for $c' > c$ and $\gamma \to 0$, where
$$\Lambda(1-\gamma \mid \tilde x) := (1-\gamma)\text{-quantile of } 2\bigl\|\mathbb{E}_n[(1/2 - 1\{U_i \leq 1/2\})\,\tilde x_i]\bigr\|_\infty, \qquad (D.33)$$
and where $U_i$ are independent uniform random variables on $(0,1)$, independent of $\tilde x_1,\dots,\tilde x_n$. We suggest $\gamma = 0.1/\log n$ and $c' = 1.1c$. This quantity can be easily approximated via simulations. Below we summarize the required regularity conditions.
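The simulation referred to above is straightforward. The following is a minimal sketch (our own illustration, not the authors' code) of the pivotal rule in (D.33); the function name, the number of simulation draws, and the toy design are hypothetical choices.

```python
import numpy as np

def pivotal_lambda(X_tilde, gamma, c_prime, n_rep=2000, seed=0):
    """Simulate Lambda(1 - gamma | x~) in (D.33) and return lambda = c' * n * Lambda.

    X_tilde : (n, 1 + p) matrix with rows x~_i' = (d_i, x_i')'.
    gamma   : confidence level, e.g. 0.1 / log(n).
    c_prime : slack constant, e.g. 1.1 * c with c > 1.
    """
    rng = np.random.default_rng(seed)
    n = X_tilde.shape[0]
    stats = np.empty(n_rep)
    for r in range(n_rep):
        u = rng.uniform(size=n)                       # U_i ~ Uniform(0, 1)
        score = (0.5 - (u <= 0.5)).reshape(-1, 1)     # (1/2 - 1{U_i <= 1/2})
        stats[r] = 2.0 * np.abs((score * X_tilde).mean(axis=0)).max()
    Lambda = np.quantile(stats, 1.0 - gamma)          # (1 - gamma)-quantile
    return c_prime * n * Lambda

# toy example with n = 100 and p = 20 controls plus the treatment column
rng = np.random.default_rng(2)
Xt = rng.standard_normal((100, 21))
print(pivotal_lambda(Xt, gamma=0.1 / np.log(100), c_prime=1.1))
```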
Condition PLAD. Assume that $\|\theta_0\|_0 = s \geq 1$, $\mathbb{E}_n[\tilde x_{ij}^2] = 1$ for all $1 \leq j \leq p$, the conditional density of $y_i$ given $\tilde x_i$, denoted by $f_i(\cdot)$, and its derivative are bounded by $\bar f$ and $\bar f'$, respectively, and $f_i(\tilde x_i'\theta_0) \geq \underline f > 0$ is bounded away from zero uniformly in $i$.
Condition PLAD is implied by Condition I. The assumption on the conditional density is standard in
the quantile regression literature even with ๏ฌxed ๐ or ๐ increasing slower than ๐ (see respectively [14]
and [5]). Next we present bounds on the prediction norm of the โ1 -LAD estimator.
Lemma 4 (Estimation Error of ℓ1-LAD). Under Condition PLAD, and using $\lambda = c'n\Lambda(1-\gamma\mid\tilde x)$, we have with probability $1 - 2\gamma - o(1)$, for $n$ large enough,
$$\|\tilde x_i'(\hat\theta - \theta_0)\|_{2,n}\ \lesssim\ \frac{1}{\underline f\,\kappa_{\bar c}}\sqrt{\frac{s\log(n/\gamma)}{n}} + \frac{\lambda\sqrt{s}}{n\,\kappa_{\bar c}},$$
provided that
$$\left\{\frac{\underline f\,\kappa_{\bar c}}{\dfrac{\lambda\sqrt{s}}{n} + \sqrt{\dfrac{s\log([p\vee n]/\gamma)}{n}}}\right\}\frac{1}{\bar f\,\bar f'}\ \inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}^{3}}{\mathbb{E}_n[|\tilde x_i'\delta|^{3}]}\ \longrightarrow\ \infty.$$
Lemma 4 establishes the rate of convergence in the prediction norm for the ℓ1-LAD estimator in a parametric setting. The extra growth condition required for identification is mild. For instance, we typically have $\lambda \lesssim \sqrt{n\log(p\vee n)}$, and for many designs of interest $\inf_{\delta\in\Delta_{\bar c}} \|\tilde x_i'\delta\|_{2,n}^{3}/\mathbb{E}_n[|\tilde x_i'\delta|^{3}]$ is bounded away from zero (see [3]). For more general designs we have
$$\inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}^{3}}{\mathbb{E}_n[|\tilde x_i'\delta|^{3}]}\ \geq\ \inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}}{\|\delta\|_1\,\max_{i\leq n}\|\tilde x_i\|_\infty}\ \geq\ \frac{\kappa_{\bar c}}{\sqrt{s}\,(1+\bar c)\,\max_{i\leq n}\|\tilde x_i\|_\infty},$$
which implies the extra growth condition under $K_x^2 s^2 \log(p\vee n) \leq \delta_n\,\underline f^{2}\kappa_{\bar c}^{2}\,n$.
In order to alleviate the bias introduced by the ℓ1-penalty, we can consider the post-model selection estimate associated with a selected support $\widehat T$:
$$\tilde\theta \in \arg\min_{\theta}\ \bigl\{\,\mathbb{E}_n[\,|y_i - \tilde x_i'\theta|\,] : \theta_j = 0 \ \text{if}\ j \notin \widehat T\,\bigr\}. \qquad (D.34)$$
The following result characterizes the performance of the estimator in (D.34); see [3] for the proof.
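For concreteness, the ℓ1-LAD and Post-ℓ1-LAD pair can be sketched with scikit-learn's `QuantileRegressor`, which fits median regression with an ℓ1 penalty on the coefficients. This is our own illustration, not the paper's implementation: the mapping between its `alpha` parameter and $\lambda/n$ (a factor $1/2$ from the pinball loss), the support-selection threshold, and the toy data are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def l1_lad_and_post(X, y, lam_over_n):
    """l1-penalized LAD followed by the Post-l1-LAD refit as in (D.34).

    lam_over_n plays the role of lambda / n; QuantileRegressor minimizes
    mean pinball loss + alpha * ||theta||_1, and the pinball loss at the
    median equals |residual| / 2, so alpha ~ lambda / (2n) here.
    """
    l1 = QuantileRegressor(quantile=0.5, alpha=lam_over_n / 2.0,
                           fit_intercept=False, solver="highs").fit(X, y)
    support = np.flatnonzero(np.abs(l1.coef_) > 1e-8)
    post = QuantileRegressor(quantile=0.5, alpha=0.0,
                             fit_intercept=False, solver="highs").fit(X[:, support], y)
    theta_post = np.zeros(X.shape[1])
    theta_post[support] = post.coef_
    return l1.coef_, theta_post, support

# toy example with a sparse median-regression design
rng = np.random.default_rng(3)
n, p, s = 200, 100, 3
X = rng.standard_normal((n, p))
theta0 = np.zeros(p); theta0[:s] = 1.0
y = X @ theta0 + rng.standard_normal(n)      # median-zero errors
coef_l1, coef_post, sel = l1_lad_and_post(X, y, lam_over_n=0.1)
print(sel, coef_post[sel])
```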
Lemma 5 (Estimation Error of Post-ℓ1-LAD). Assume the conditions of Lemma 4 hold, $\operatorname{support}(\hat\theta) \subseteq \widehat T$, and let $\hat s = |\widehat T|$. Then we have, for $n$ large enough,
$$\|\tilde x_i'(\tilde\theta - \theta_0)\|_{2,n}\ \lesssim_P\ \frac{1}{\underline f}\sqrt{\frac{(\hat s + s)\log(p\vee n)}{n\,\varphi_{\min}(\hat s + s)}} + \frac{1}{\underline f\,\kappa_{\bar c}}\sqrt{\frac{s\log(n/\gamma)}{n}} + \frac{\lambda\sqrt{s}}{n\,\kappa_{\bar c}},$$
provided that
$$\left\{\frac{\sqrt{\varphi_{\min}(\hat s+s)}}{\dfrac{\lambda\sqrt{s}}{n}+\sqrt{\dfrac{s\log([p\vee n]/\gamma)}{n}}}\right\}\frac{1}{\bar f\,\bar f'}\ \inf_{\|\delta\|_0\leq \hat s+s}\frac{\|\tilde x_i'\delta\|_{2,n}^{3}}{\mathbb{E}_n[|\tilde x_i'\delta|^{3}]}\ \longrightarrow_P\ \infty.$$
Lemma 5 provides the rate of convergence in the prediction norm for the post-model selection estimator despite possibly imperfect model selection. The rates rely on the overall quality of the selected model (which is at least as good as the model selected by ℓ1-LAD) and the overall number of components $\hat s$. Once again the extra growth condition required for identification is mild. For more general designs we have
$$\inf_{\|\delta\|_0\leq\hat s+s}\frac{\|\tilde x_i'\delta\|_{2,n}^{3}}{\mathbb{E}_n[|\tilde x_i'\delta|^{3}]}\ \geq\ \inf_{\|\delta\|_0\leq\hat s+s}\frac{\|\tilde x_i'\delta\|_{2,n}}{\|\delta\|_1\,\max_{i\leq n}\|\tilde x_i\|_\infty}\ \geq\ \frac{\sqrt{\varphi_{\min}(\hat s+s)}}{\sqrt{\hat s+s}\,\max_{i\leq n}\|\tilde x_i\|_\infty}.$$
Comment D.1. In Step 1 of Algorithm 2 we use ℓ1-LAD with $\tilde x_i = (d_i, x_i')'$ and $\hat\delta := \hat\theta - \theta_0 = (\hat\alpha - \alpha_0, \hat\beta' - \beta_0')'$, and we are interested in rates for $\|x_i'(\hat\beta - \beta_0)\|_{2,n}$ rather than $\|\tilde x_i'\hat\delta\|_{2,n}$. However, it follows that
$$\|x_i'(\hat\beta - \beta_0)\|_{2,n}\ \leq\ \|\tilde x_i'\hat\delta\|_{2,n} + |\hat\alpha - \alpha_0|\cdot\|d_i\|_{2,n}.$$
Since $s \geq 1$, without loss of generality we can assume that the component associated with the treatment $d_i$ belongs to $T$ (at the cost of increasing the cardinality of $T$ by one, which does not affect the rate of convergence). Therefore we have that
$$|\hat\alpha - \alpha_0|\ \leq\ \|\hat\delta_T\|\ \leq\ \|\tilde x_i'\hat\delta\|_{2,n}/\kappa_{\bar c}.$$
In most applications of interest, $\|d_i\|_{2,n}$ and $1/\kappa_{\bar c}$ are bounded from above with high probability. Similarly, in Step 1 of Algorithm 1 we have that the Post-ℓ1-LAD estimator satisfies
$$\|x_i'(\hat\beta - \beta_0)\|_{2,n}\ \leq\ \|\tilde x_i'\hat\delta\|_{2,n}\Bigl(1 + \|d_i\|_{2,n}\big/\sqrt{\varphi_{\min}(\hat s+s)}\Bigr).$$
D.2. Heteroscedastic Lasso. In this section we consider the equation (1.4) in the form
$$d_i = x_i'\theta_0 + v_i, \qquad \mathrm{E}[v_i] = 0, \qquad (D.35)$$
where we observe $\{(d_i, x_i')'\}_{i=1}^n$, the $(x_i)_{i=1}^n$ are non-stochastic and normalized so that $\mathbb{E}_n[x_{ij}^2] = 1$ for all $1 \leq j \leq p$, and the $(v_i)_{i=1}^n$ are independent across $i$ but not necessarily identically distributed. The unknown support of $\theta_0$ is denoted by $T_d$ and satisfies $|T_d| \leq s$. To estimate $\theta_0$, and consequently $v_i$, we compute
$$\hat\theta \in \arg\min_{\theta}\ \mathbb{E}_n[(d_i - x_i'\theta)^2] + \frac{\lambda}{n}\|\widehat\Gamma\theta\|_1 \quad\text{and set}\quad \hat v_i = d_i - x_i'\hat\theta,\ \ i = 1,\dots,n, \qquad (D.36)$$
where $\lambda$ and $\widehat\Gamma$ are the associated penalty level and loadings, which are potentially data-driven. In this case the following regularization event plays an important role:
$$\frac{\lambda}{n}\ \geq\ 2c\,\bigl\|\widehat\Gamma^{-1}\mathbb{E}_n[x_i(d_i - x_i'\theta_0)]\bigr\|_\infty. \qquad (D.37)$$
As discussed in [9], [4] and [2], the event above implies that the estimator $\hat\theta$ satisfies $\|\hat\theta_{T_d^c}\|_1 \leq \bar c\,\|\hat\theta_{T_d} - \theta_0\|_1$, where $\bar c = (c+1)/(c-1)$. Thus rates of convergence for $\hat\theta$ and $\hat v_i$ defined in (D.36) can be established based on the restricted eigenvalue $\kappa_{\bar c}$ defined in (D.31) with $\tilde x_i = x_i$ and $T = T_d$.
The following are sufficient high-level conditions, where again the sequences $\Delta_n$ and $\delta_n$ go to zero and $C$ is a positive constant independent of $n$.

Condition HL. For the model (D.35), suppose that for $s = s_n \geq 1$ we have $\|\theta_0\|_0 \leq s$ and
(i) $\max_{1\leq j\leq p}\ (\bar{\mathrm E}[|x_{ij}v_i|^3])^{1/3}/(\bar{\mathrm E}[|x_{ij}v_i|^2])^{1/2} \leq C$ and $\Phi^{-1}(1-\gamma/2p) \leq \delta_n n^{1/3}$,
(ii) $\max_{1\leq j\leq p}\ |(\mathbb{E}_n - \bar{\mathrm E})[x_{ij}^2 v_i^2]| + \max_{j\leq p}\ |(\mathbb{E}_n - \bar{\mathrm E})[x_{ij}^2 d_i^2]| \leq \delta_n$, with probability $1-\Delta_n$.
Condition HL is implied by Condition I and growth conditions (see Lemma 2). Several primitive moment conditions imply the various cross-moment bounds. These conditions also allow us to invoke moderate deviation theorems for self-normalized sums from [12] to bound some important error components. Despite the heteroscedastic, non-Gaussian noise, these results allow a sharp choice of penalty level and loadings, which was analyzed in [2] and is summarized in the following lemma.
Valid options for setting the penalty level and the loadings, for $j = 1,\dots,p$, are
$$\text{initial:}\quad \hat\gamma_j = \sqrt{\mathbb{E}_n[x_{ij}^2(d_i - \bar d)^2]}, \qquad \lambda = 2c\sqrt{n}\,\Phi^{-1}(1-\gamma/(2p)),$$
$$\text{refined:}\quad \hat\gamma_j = \sqrt{\mathbb{E}_n[x_{ij}^2\hat v_i^2]}, \qquad\qquad \lambda = 2c\sqrt{n}\,\Phi^{-1}(1-\gamma/(2p)), \qquad (D.38)$$
where $c > 1$ is a constant, $\gamma \in (0,1)$, $\bar d := \mathbb{E}_n[d_i]$, and $\hat v_i$ is an estimate of $v_i$ based on Lasso with the initial option (or iterations thereof). [2] established that using either of the choices in (D.38) implies that the regularization event (D.37) holds with high probability. Next we present results on the performance of the estimators generated by Lasso.
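Before stating those results, here is a minimal sketch (our own illustration) of the penalty level and loadings in (D.38), assuming `scipy` for the normal quantile; the helper name and the toy design are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def het_lasso_penalty(X, d, v_hat=None, c=1.1, gamma=0.05):
    """Penalty level and loadings for the heteroscedastic Lasso, as in (D.38).

    X     : (n, p) controls, d : (n,) treatment.
    v_hat : residuals from a preliminary Lasso fit; if None, the 'initial'
            option with d_i - mean(d) is used, otherwise the 'refined' one.
    Returns (lam, loadings) with lam = 2 c sqrt(n) Phi^{-1}(1 - gamma / (2p)).
    """
    n, p = X.shape
    resid = (d - d.mean()) if v_hat is None else v_hat
    loadings = np.sqrt(np.mean((X ** 2) * (resid ** 2)[:, None], axis=0))
    lam = 2.0 * c * np.sqrt(n) * norm.ppf(1.0 - gamma / (2.0 * p))
    return lam, loadings

# toy example: heteroscedastic treatment equation, initial loadings
rng = np.random.default_rng(4)
n, p = 200, 50
X = rng.standard_normal((n, p))
d = X[:, 0] + rng.standard_normal(n) * (1 + 0.5 * np.abs(X[:, 1]))
lam, G = het_lasso_penalty(X, d)
print(lam, G[:3])
```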
โ
Lemma 6. Under Condition HL and setting ๐ = 2๐โฒ ๐ฮฆโ1 (1 โ ๐พ/2๐) for ๐โฒ > ๐ > 1, and using penalty
loadings as in (D.38), there is an uniformly bounded c such that we have
โ
๐ ๐
โฒ ห
โฅห
๐ฃ๐ โ ๐ฃ๐ โฅ2,๐ = โฅ๐ฅ๐ (๐ โ ๐0 )โฅ2,๐ โฒ๐
and
โฅห
๐ฃ๐ โ ๐ฃ๐ โฅโ โฉฝ โฅ๐ห โ ๐0 โฅ1 max โฅ๐ฅ๐ โฅโ .
๐โฉฝ๐
๐๐
c
Associated with Lasso, we can define the Post-Lasso estimator as
$$\tilde\theta \in \arg\min_{\theta}\ \bigl\{\,\mathbb{E}_n[(d_i - x_i'\theta)^2] : \theta_j = 0 \ \text{if}\ \hat\theta_j = 0\,\bigr\} \quad\text{and set}\quad \tilde v_i = d_i - x_i'\tilde\theta. \qquad (D.39)$$
That is, the Post-Lasso estimator is simply the least squares estimator applied to the covariates selected by Lasso in (D.36). Sparsity properties of the Lasso estimator $\hat\theta$ under estimated weights follow similarly to the standard Lasso analysis derived in [2]. By combining such sparsity properties and the rates in the prediction norm, we can establish rates for the post-model selection estimator under estimated weights. The following result summarizes the properties of the Post-Lasso estimator.
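Before stating it, the construction in (D.36) and (D.39) can be sketched as follows (our own illustration, using scikit-learn's `Lasso`); the column-rescaling trick used to impose the loadings and the `alpha = lam / (2n)` scaling are assumptions of this sketch, not statements from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_and_post_lasso(X, d, lam, loadings):
    """Weighted Lasso as in (D.36) and the Post-Lasso refit as in (D.39).

    Columns are rescaled by the loadings so that sklearn's plain l1 penalty
    mimics (lambda / n) * ||Gamma_hat theta||_1; sklearn's objective has a
    1/(2n) factor on the squared error, hence alpha = lam / (2 n).
    """
    n, p = X.shape
    Xw = X / loadings                      # x_ij / gamma_hat_j
    fit = Lasso(alpha=lam / (2.0 * n), fit_intercept=False).fit(Xw, d)
    theta_lasso = fit.coef_ / loadings
    support = np.flatnonzero(np.abs(theta_lasso) > 1e-8)
    theta_post = np.zeros(p)
    if support.size > 0:
        theta_post[support], *_ = np.linalg.lstsq(X[:, support], d, rcond=None)
    v_hat = d - X @ theta_post             # Post-Lasso residuals
    return theta_lasso, theta_post, v_hat

# toy example with initial loadings and a rough stand-in for the normal quantile
rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.standard_normal((n, p))
d = X[:, 0] + rng.standard_normal(n)
loadings = np.sqrt(np.mean((X ** 2) * ((d - d.mean()) ** 2)[:, None], axis=0))
lam = 2.0 * 1.1 * np.sqrt(n) * 3.3         # ~ 2 c sqrt(n) Phi^{-1}(1 - gamma/(2p))
print(lasso_and_post_lasso(X, d, lam, loadings)[1][:3])
```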
Lemma 7 (Model Selection Properties of Lasso and Properties of Post-Lasso). Suppose that Conditions HL and SE hold. Consider the Lasso estimator with penalty level and loadings specified as in Lemma 6. Then the data-dependent model $\widehat T_d$ selected by the Lasso estimator $\hat\theta$ satisfies, with probability $1-\Delta_n$,
$$\|\hat\theta\|_0 = |\widehat T_d|\ \lesssim\ s. \qquad (D.40)$$
Moreover, the Post-Lasso estimator obeys
$$\|\tilde v_i - v_i\|_{2,n} = \|x_i'(\tilde\theta - \theta_0)\|_{2,n}\ \lesssim_P\ \sqrt{\frac{s\log(p\vee n)}{n}}.$$
Appendix E. Alternative Implementation via Double Selection
An alternative proposal for the method is reminiscent of the double selection method proposed in [6]
for partial linear models. This version replaces Step 3 with a LAD regression of ๐ฆ on ๐ and all covariates
selected in Steps 1 and 2 (i.e. the union of the selected sets). The method is described as follows:
Algorithm 3. (A Double Selection Method)
Step 1. Run Post-ℓ1-LAD of $y_i$ on $d_i$ and $x_i$:
$$(\hat\alpha, \hat\beta) \in \arg\min_{\alpha,\beta}\ \mathbb{E}_n[\,|y_i - d_i\alpha - x_i'\beta|\,] + \frac{\lambda_1}{n}\|(\alpha,\beta)\|_1.$$
Step 2. Run heteroscedastic Lasso of $d_i$ on $x_i$:
$$\hat\theta \in \arg\min_{\theta}\ \mathbb{E}_n[(d_i - x_i'\theta)^2] + \frac{\lambda_2}{n}\|\widehat\Gamma\theta\|_1.$$
Step 3. Run LAD regression of $y_i$ on $d_i$ and the covariates selected in Steps 1 and 2:
$$(\check\alpha, \check\beta) \in \arg\min_{\alpha,\beta}\ \bigl\{\,\mathbb{E}_n[\,|y_i - d_i\alpha - x_i'\beta|\,] : \operatorname{support}(\beta) \subseteq \operatorname{support}(\hat\beta) \cup \operatorname{support}(\hat\theta)\,\bigr\}.$$
The double selection algorithm has three steps: (1) select covariates based on the standard ℓ1-LAD regression, (2) select covariates based on the heteroscedastic Lasso of the treatment equation, and (3) run a LAD regression of the outcome on the treatment and all selected covariates.
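A minimal end-to-end sketch of these three steps is given below, under simplifying assumptions that are ours rather than the paper's: scikit-learn's `QuantileRegressor` and `Lasso` stand in for the ℓ1-LAD and heteroscedastic-Lasso steps, and the penalty levels `alpha1`, `alpha2` are placeholders for the data-driven choices described above.

```python
import numpy as np
from sklearn.linear_model import Lasso, QuantileRegressor

def double_selection_lad(d, X, y, alpha1=0.05, alpha2=0.05):
    """Algorithm 3 (double selection) with simplified, fixed penalty choices.

    Step 1: l1-LAD of y on (d, x); Step 2: Lasso of d on x;
    Step 3: unpenalized LAD of y on d and the union of selected controls.
    """
    W = np.column_stack([d, X])
    step1 = QuantileRegressor(quantile=0.5, alpha=alpha1,
                              fit_intercept=False, solver="highs").fit(W, y)
    sel1 = np.flatnonzero(np.abs(step1.coef_[1:]) > 1e-8)     # controls from Step 1
    step2 = Lasso(alpha=alpha2, fit_intercept=False).fit(X, d)
    sel2 = np.flatnonzero(np.abs(step2.coef_) > 1e-8)          # controls from Step 2
    union = np.union1d(sel1, sel2)
    W3 = np.column_stack([d, X[:, union]])
    step3 = QuantileRegressor(quantile=0.5, alpha=0.0,
                              fit_intercept=False, solver="highs").fit(W3, y)
    return step3.coef_[0], union                                # coefficient on d

# toy example with one relevant control entering both equations
rng = np.random.default_rng(6)
n, p = 300, 100
X = rng.standard_normal((n, p))
d = X[:, 0] + rng.standard_normal(n)
y = 0.5 * d + X[:, 0] + rng.standard_normal(n)
print(double_selection_lad(d, X, y))
```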
This approach can also be analyzed through Lemma 1, since it creates instruments implicitly. To see this, let $\widehat T^*$ denote the variables selected in Steps 1 and 2: $\widehat T^* = \operatorname{support}(\hat\beta)\cup\operatorname{support}(\hat\theta)$. By the first-order conditions for $(\check\alpha,\check\beta)$ we have
$$\bigl\|\mathbb{E}_n[\varphi(y_i - d_i\check\alpha - x_i'\check\beta)\,(d_i, x_{i\widehat T^*}')']\bigr\|\ =\ O\Bigl\{\bigl(\max_{1\leq i\leq n}|d_i| + K_x|\widehat T^*|^{1/2}\bigr)(1+|\widehat T^*|)/n\Bigr\},$$
which creates an orthogonality relation to any linear combination of $(d_i, x_{i\widehat T^*}')'$. In particular, taking the linear combination $(d_i, x_{i\widehat T^*}')(1, -\hat\theta_{\widehat T^*}')' = d_i - x_{i\widehat T^*}'\hat\theta_{\widehat T^*} = d_i - x_i'\hat\theta = \hat z_i$, which is the instrument in Step 2 of Algorithm 1, we have
$$\mathbb{E}_n[\varphi(y_i - d_i\check\alpha - x_i'\check\beta)\,\hat z_i]\ =\ O\Bigl\{\|(1,-\hat\theta')'\|\,\bigl(\max_{1\leq i\leq n}|d_i| + K_x|\widehat T^*|^{1/2}\bigr)(1+|\widehat T^*|)/n\Bigr\}.$$
As soon as the right side is $o_P(n^{-1/2})$, the double selection estimator $\check\alpha$ approximately minimizes
$$\widehat L_n(\alpha)\ =\ \frac{\bigl|\mathbb{E}_n[\varphi(y_i - d_i\alpha - x_i'\check\beta)\,\hat z_i]\bigr|^2}{\mathbb{E}_n[\{\varphi(y_i - d_i\check\alpha - x_i'\check\beta)\}^2\,\hat z_i^2]},$$
where $\hat z_i$ is the instrument created by Step 2 of Algorithm 1. Thus the double selection estimator can be seen as an iterated version of the method based on instruments, where the Step 1 estimate $\hat\beta$ is updated with $\check\beta$.
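A minimal sketch of the statistic $\widehat L_n(\alpha)$ above follows; it reads $\varphi$ as the median-regression score $1/2 - 1\{t \leq 0\}$ (an assumption about the notation) and uses $\hat z_i = d_i - x_i'\hat\theta$. The function and variable names are our own.

```python
import numpy as np

def score_statistic(alpha, y, d, X, beta_check, theta_hat):
    """Evaluate L_hat_n(alpha) = |En[phi * z_hat]|^2 / En[phi^2 * z_hat^2],

    with phi_i = 1/2 - 1{y_i - d_i alpha - x_i' beta_check <= 0} and
    z_hat_i = d_i - x_i' theta_hat (the Step-2 instrument).
    """
    z_hat = d - X @ theta_hat
    resid = y - d * alpha - X @ beta_check
    phi = 0.5 - (resid <= 0.0)
    num = np.mean(phi * z_hat) ** 2
    den = np.mean((phi ** 2) * (z_hat ** 2))
    return num / den

# toy example: the statistic should be much smaller near the true alpha
rng = np.random.default_rng(7)
n, p = 300, 20
X = rng.standard_normal((n, p))
theta0 = np.zeros(p); theta0[0] = 1.0
d = X @ theta0 + rng.standard_normal(n)
y = 0.5 * d + X[:, 1] + rng.standard_normal(n)
beta = np.zeros(p); beta[1] = 1.0
print(score_statistic(0.5, y, d, X, beta, theta0),
      score_statistic(1.5, y, d, X, beta, theta0))
```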
Appendix F. Proof of Theorem 2
Proof of Theorem 2. We will verify Condition ILAD; the desired result then follows from Lemma 1. The assumptions on the error density $f_\epsilon(\cdot)$ in Condition ILAD(i) are assumed in Condition I(iv). The moment conditions on $d_i$ and $v_i$ in Condition ILAD(i) are assumed in Condition I(ii).
Condition SE implies that $\kappa_{\bar c}$ is bounded away from zero with probability $1-\Delta_n$ for $n$ sufficiently large, see [9]. Step 1 relies on ℓ1-LAD. Condition PLAD is implied by Condition I. By Lemma 4 and Comment D.1 we have
$$\|x_i'(\hat\beta - \beta_0)\|_{2,n}\ \lesssim_P\ \sqrt{s\log(p\vee n)/n} \qquad\text{and}\qquad |\hat\alpha - \alpha_0|\ \lesssim_P\ \sqrt{s\log(p\vee n)/n}\ \lesssim\ o(1)\log^{-1}n,$$
because $s^3\log^3(p\vee n) \leq \delta_n n$ and the required side condition holds. Indeed, without loss of generality assume that $T$ contains the treatment, so that for $\tilde x_i = (d_i, x_i')'$ and $\delta = (\delta_d, \delta_x')'$, because of Condition SE and the fact that $\mathbb{E}_n[|d_i|^3] \lesssim_P \bar{\mathrm E}[|d_i|^3] = O(1)$, we have
$$\begin{aligned}
\inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}^{3}}{\mathbb{E}_n[|\tilde x_i'\delta|^{3}]}
&\geq \inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}^{2}\,\|\delta_T\|\,\kappa_{\bar c}}{4\,\mathbb{E}_n[|x_i'\delta_x|^{3}]+4\,\mathbb{E}_n[|d_i\delta_d|^{3}]}
\ \geq\ \inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}^{2}\,\|\delta_T\|\,\kappa_{\bar c}}{4K_x\|\delta_x\|_1\,\mathbb{E}_n[|x_i'\delta_x|^{2}]+4|\delta_d|^{3}\,\mathbb{E}_n[|d_i|^{3}]}\\
&\geq \inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}^{2}\,\|\delta_T\|_1\,\kappa_{\bar c}/\sqrt{s}}{8K_x(1+\bar c)\|\delta_T\|_1\,\|\tilde x_i'\delta\|_{2,n}^{2}+8K_x(1+\bar c)\|\delta_T\|_1\,|\delta_d|^{2}\{\|d_i\|_{2,n}^{2}+\mathbb{E}_n[|d_i|^{3}]\}}\\
&\gtrsim_P \frac{\kappa_{\bar c}/\sqrt{s}}{8K_x(1+\bar c)\{1+\|d_i\|_{2,n}^{2}/\kappa_{\bar c}^{2}+\mathbb{E}_n[|d_i|^{3}]/\kappa_{\bar c}^{2}\}}
\ \gtrsim_P\ \frac{\kappa_{\bar c}}{\sqrt{s}\,K_x}. \qquad (F.41)
\end{aligned}$$
Therefore, since $\lambda \lesssim \sqrt{n\log(p\vee n)}$, we have
$$\frac{\underline f\,\kappa_{\bar c}\,n}{\lambda\sqrt{s}+\sqrt{s\,n\log(p\vee n)}}\ \inf_{\delta\in\Delta_{\bar c}}\frac{\|\tilde x_i'\delta\|_{2,n}^{3}}{\mathbb{E}_n[|\tilde x_i'\delta|^{3}]}\ \gtrsim\ \frac{\underline f\,\kappa_{\bar c}^{2}\,\sqrt{n}}{K_x\,s\,\sqrt{\log(p\vee n)}}\ \longrightarrow_P\ \infty \qquad (F.42)$$
under $K_x^2 s^2\log^2(p\vee n) \leq \delta_n n$. Note that the rate for $\hat\alpha$ and the definition of $\mathcal A$ imply $\{\alpha : |\alpha - \alpha_0| \leq n^{-1/2}\log n\} \subset \mathcal A$ (with probability $1-o(1)$), which is required in ILAD(ii). Moreover, by the (shrinking) definition of $\mathcal A$ we have the initial rate of ILAD(iv). Step 2 relies on Lasso. Condition HL is implied by Condition I and Lemma 2 applied twice, with $\zeta_i = v_i$ and $\zeta_i = d_i$, under the condition that $K_x^4\log p \leq \delta_n n$.
By Lemma 6 we have $\|x_i'(\hat\theta - \theta_0)\|_{2,n} \lesssim_P \sqrt{s\log(p\vee n)/n}$. Moreover, by Lemma 7 we have $\|\hat\theta\|_0 \lesssim s$ with probability $1-o(1)$.
The rates established above for $\hat\theta$ and $\hat\beta$ imply (A.25) in ILAD(iii), since by Condition I(ii) $\mathrm E[|v_i|] \leq (\mathrm E[v_i^2])^{1/2} = O(1)$ and $\max_{1\leq i\leq n}|x_i'(\hat\theta - \theta_0)| \lesssim_P K_x\sqrt{s^2\log(p\vee n)/n} = o(1)$.
To verify Condition ILAD(iii) (A.26), arguing as in the proof of Theorem 1, we can deduce that
$$\sup_{\alpha\in\mathcal A}\ \bigl|\mathbb{G}_n(\psi_{\alpha,\hat\beta,\hat\theta} - \psi_{\alpha,\beta_0,\theta_0})\bigr| = o_P(1).$$
This completes the proof. □
Appendix G. Proof of Auxiliary Technical Results
Proof of Lemma 2. We shall use Lemma 8 ahead. Let $Z_i = (x_i, \zeta_i)$ and define $\mathcal F = \{f_j(x_i,\zeta_i) = x_{ij}^2\zeta_i^2 : j = 1,\dots,p\}$. Since $\mathrm P(|W| > t) \leq \mathrm E[|W|^q]/t^q$, for $q = 2$ we have that $\operatorname{median}(|W|) \leq \sqrt{2\,\mathrm E[|W|^2]}$, and taking $q = \bar q/4$ the $(1-\tau)$-quantile of $|W|$ is bounded by $(\mathrm E[|W|^{\bar q/4}]/\tau)^{4/\bar q}$. Then we have
$$\max_{f\in\mathcal F}\ \operatorname{median}\bigl(|\mathbb G_n(f(x_i,\zeta_i))|\bigr)\ \leq\ \sqrt{2\,\bar{\mathrm E}[x_{ij}^4\zeta_i^4]}\ \leq\ K_x^2\sqrt{2\,\bar{\mathrm E}[\zeta_i^4]}$$
and
$$(1-\tau)\text{-quantile of}\ \max_{j\leq p}\sqrt{\mathbb E_n[x_{ij}^4\zeta_i^4]}\ \leq\ (1-\tau)\text{-quantile of}\ K_x^2\sqrt{\mathbb E_n[\zeta_i^4]}\ \leq\ K_x^2\,(\bar{\mathrm E}[|\zeta_i|^{\bar q}]/\tau)^{4/\bar q}.$$
The conclusion follows from Lemma 8. □
Proof of Lemma 3. By the triangle inequality we have
$$\|x_i'(\hat\beta^{(2m)} - \beta_0)\|_{2,n}\ \leq\ \|x_i'(\hat\beta - \beta_0)\|_{2,n} + \|x_i'(\hat\beta^{(2m)} - \hat\beta)\|_{2,n}.$$
Now let $T_1$ denote the $m$ largest components of $\hat\beta$, and let $T_k$ correspond to the $m$ largest components of $\hat\beta$ outside $\cup_{l=1}^{k-1}T_l$. It follows that $\hat\beta^{(2m)} = \hat\beta_{T_1\cup T_2}$.
Next note that for $k \geq 2$ we have $\|\hat\beta_{T_{k+1}}\| \leq \|\hat\beta_{T_k}\|_1/\sqrt{m}$. Indeed, consider the problem $\max\{\|v\|/\|u\|_1 : v, u \in \mathbb{R}^m,\ \max_j|v_j| \leq \min_j|u_j|\}$. Given a $v$ and $u$ we can always increase the objective function by using $\tilde v = \max_j|v_j|\,(1,\dots,1)'$ and $\tilde u = \min_j|u_j|\,(1,\dots,1)'$ instead. Thus, the maximum is achieved at $v = u = (1,\dots,1)'$, yielding $1/\sqrt{m}$.
Thus, by $\|\hat\beta_{T^c}\|_1 \leq \bar c\,\|\delta_T\|_1$ and $|T_k| = m$, we have
$$\begin{aligned}
\|x_i'(\hat\beta^{(2m)} - \hat\beta)\|_{2,n}
&= \Bigl\|x_i'\sum_{k=3}^{K}\hat\beta_{T_k}\Bigr\|_{2,n}
\ \leq\ \sum_{k=3}^{K}\|x_i'\hat\beta_{T_k}\|_{2,n}
\ \leq\ \sqrt{\phi_{\max}(m)}\sum_{k=3}^{K}\|\hat\beta_{T_k}\|\\
&\leq\ \sqrt{\phi_{\max}(m)}\sum_{k=2}^{K-1}\frac{\|\hat\beta_{T_k}\|_1}{\sqrt{m}}
\ \leq\ \sqrt{\frac{\phi_{\max}(m)}{m}}\,\|\hat\beta_{T^c}\|_1
\ \leq\ \sqrt{\frac{\phi_{\max}(m)}{m}}\,\bar c\,\|\delta_T\|_1. \qquad\square
\end{aligned}$$
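A quick numerical check (not in the paper, names and design our own) of the elementary bound used in the proof of Lemma 3: for $v, u \in \mathbb{R}^m$ with $\max_j|v_j| \leq \min_j|u_j|$, one has $\|v\| \leq \|u\|_1/\sqrt{m}$.

```python
import numpy as np

rng = np.random.default_rng(8)
m = 7
worst = 0.0
for _ in range(20000):
    u = rng.uniform(0.5, 2.0, size=m) * rng.choice([-1, 1], size=m)
    # draw v satisfying max|v_j| <= min|u_j|
    v = rng.uniform(-1.0, 1.0, size=m) * np.abs(u).min()
    worst = max(worst, np.linalg.norm(v) / np.abs(u).sum())
# the worst sampled ratio never exceeds 1/sqrt(m)
print(worst, 1.0 / np.sqrt(m))
```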
Appendix H. Auxiliary Probabilistic Inequalities
Let $Z_1,\dots,Z_n$ be independent random variables taking values in a measurable space $(S,\mathcal S)$, and consider an empirical process $\mathbb G_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - \mathrm E[f(Z_i)]\}$ indexed by a pointwise measurable class of functions $\mathcal F$ on $S$ (see [25], Chapter 2.3). Denote by $\mathbb P_n$ the (random) empirical probability measure that assigns probability $n^{-1}$ to each $Z_i$. Let $N(\epsilon, \mathcal F, \|\cdot\|_{\mathbb P_n,2})$ denote the $\epsilon$-covering number of $\mathcal F$ with respect to the $L_2(\mathbb P_n)$ seminorm $\|\cdot\|_{\mathbb P_n,2}$.
The following maximal inequality is derived in [6].

Lemma 8 (Maximal inequality for finite classes). Suppose that the class $\mathcal F$ is finite. Then for every $\tau\in(0,1/2)$ and $\delta\in(0,1)$, with probability at least $1-4\tau-4\delta$,
$$\max_{f\in\mathcal F}|\mathbb G_n(f)|\ \leq\ 4\Bigl\{\sqrt{2\log(2|\mathcal F|/\delta)}\,Q(1-\tau)\ \vee\ 2\max_{f\in\mathcal F}\operatorname{median}(|\mathbb G_n(f)|)\Bigr\},$$
where $Q(u) :=$ the $u$-quantile of $\max_{f\in\mathcal F}\sqrt{\mathbb E_n[f(Z_i)^2]}$.
The following maximal inequality is derived in [3].

Lemma 9 (Maximal inequality for infinite classes). Let $F = \sup_{f\in\mathcal F}|f|$, and suppose that there exist some constants $\omega_n > 1$, $\upsilon > 1$, $b > 0$, and $\ell_n \geq \ell_0$ such that
$$N(\epsilon\|F\|_{\mathbb P_n,2},\ \mathcal F,\ \|\cdot\|_{\mathbb P_n,2})\ \leq\ (n\vee\ell_n)^{\upsilon}\,(\omega_n/\epsilon)^{\upsilon},\qquad 0 < \epsilon < 1.$$
Set $C := (1+\sqrt{2b})/4$. Then for every $\delta\in(0,1/6)$ and every constant $K \geq 2/\delta$, we have
$$\sup_{f\in\mathcal F}|\mathbb G_n(f)|\ \leq\ 4\sqrt{2}\,\rho K C\sqrt{\upsilon\log(n\vee\ell_n\vee\omega_n)}\ \max\Bigl\{\sup_{f\in\mathcal F}\sqrt{\bar{\mathrm E}[f(Z_i)^2]},\ \sup_{f\in\mathcal F}\sqrt{\mathbb E_n[f(Z_i)^2]}\Bigr\},$$
with probability at least $1-\delta$, provided that $n\vee\ell_0 \geq 3$; the constant $\rho < 30$ is universal.