Summary - Andrew Tulloch

1. Basic Concepts

Definition (Convergence almost surely). A sequence X_n, n ∈ N, of random variables converges almost surely to a random variable X if

    P(X_n → X) = µ({ω ∈ Ω : X_n(ω) → X(ω)}) = 1.    (1.1)

We write X_n →^{as} X.

Definition (Convergence in probability). X_n →^{P} X (in probability) if for all ε > 0,

    P(|X_n − X| > ε) → 0    (1.2)

as n → ∞. For random vectors, convergence is defined analogously, taking the norm in R^d.

Definition (Convergence in distribution). X_n →^{d} X, or X_n converges to X in distribution, if

    P(X_n ≤ t) → P(X ≤ t)    (1.3)

whenever t ↦ P(X ≤ t) is continuous.

Proposition. Let (X_n, n ∈ N), X take values in X ⊆ R^d. Then:
  (i) X_n →^{as} X ⇒ X_n →^{P} X ⇒ X_n →^{d} X.    (1.4)
  (ii) If X_n → X in any mode, and g : X → R^d is continuous, then g(X_n) → g(X) in the same mode.
  (iii) (Slutsky's lemma) If X_n →^{d} X and Y_n →^{d} c for a constant c, then (a) Y_n →^{P} c, (b) X_n + Y_n →^{d} X + c, (c) X_n Y_n →^{d} cX where Y_n ∈ R, and (d) X_n Y_n^{−1} →^{d} c^{−1} X where Y_n ∈ R, c ≠ 0.
  (iv) If (A_n, n ∈ N) are random matrices with (A_n)_{ij} →^{P} A_{ij} for all i, j and X_n →^{d} X, then A_n X_n →^{d} AX, and if A is invertible, A_n^{−1} X_n →^{d} A^{−1} X, where A = (A_{ij}).

Theorem (Law of Large Numbers). Let X_1, . . . , X_n be IID copies of X ∼ P such that E|X_i| < ∞. Then

    (1/n) Σ_{i=1}^n X_i →^{as} E(X)    (1.5)

as n → ∞.

Theorem (Central Limit Theorem). Let X_1, . . . , X_n be IID copies of X ∼ P on R with V(X) = σ² < ∞. Then

    √n ( (1/n) Σ_{i=1}^n X_i − E(X) ) →^{d} N(0, σ²)    (1.6)

as n → ∞. In the multivariate case, where X ∼ P on R^d with covariance matrix Σ,

    √n ( (1/n) Σ_{i=1}^n X_i − E(X) ) →^{d} N(0, Σ)    (1.7)

as n → ∞.

Theorem (Chebyshev's Inequality). Let µ = E(X), σ² = V(X). Then

    P(|X − µ| ≥ t) ≤ σ²/t².    (1.8)

Theorem (Markov's Inequality). Let X be a non-negative random variable and suppose EX exists. Then for any t > 0,

    P(X > t) ≤ E(X)/t.    (1.9)

Theorem (Gaussian Tail Inequality). If X ∼ N(0, 1), then for every ε > 0,

    P(|X| > ε) ≤ 2 e^{−ε²/2}.    (1.10)

Theorem (Hoeffding's Inequality). Suppose EX = 0 and a ≤ X ≤ b. Then

    E e^{tX} ≤ e^{t²(b−a)²/8}.    (1.11)
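The following is a small numerical sanity check of these limit theorems and tail bounds (my own Python sketch, not part of the original notes; the Uniform[−1, 1] distribution, sample size and repetition count are illustrative assumptions). It compares the empirical tail of a sum with the Gaussian approximation suggested by (1.6) and with the exponential tail bound that follows from (1.11) by Markov's inequality (Chernoff's method):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, reps, t = 200, 20000, 2.0
    # X uniform on [-1, 1]: EX = 0, a = -1, b = 1, V(X) = 1/3
    X = rng.uniform(-1.0, 1.0, size=(reps, n))
    S = X.sum(axis=1)
    sigma = np.sqrt(1.0 / 3.0)
    s = t * sigma * np.sqrt(n)                      # threshold for the sum
    print("empirical  P(S >= s):", np.mean(S >= s))
    print("CLT        P(N(0,1) >= t):", norm.sf(t))
    # Chernoff + Hoeffding (1.11): P(S >= s) <= exp(-2 s^2 / (n (b - a)^2))
    print("Hoeffding  bound:", np.exp(-2 * s ** 2 / (n * (1 - (-1)) ** 2)))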
2. Uniform Laws of Large Numbers

Theorem. Let H be a class of functions from a measurable space T to R. Assume that for every ε > 0 there exists a finite set of brackets [l_j, u_j], j = 1, . . . , N(ε), such that E|l_j(X)| < ∞, E|u_j(X)| < ∞, and E|u_j(X) − l_j(X)| < ε for every j. Suppose moreover that for every h ∈ H there exists j with h ∈ [l_j, u_j], where h ∈ [l, u] ⟺ h ∈ {f : T → R : l(x) ≤ f(x) ≤ u(x) for all x ∈ T}. Then we have a uniform law of large numbers,

    sup_{h∈H} | (1/n) Σ_{i=1}^n ( h(X_i) − E h(X) ) | →^{as} 0.    (2.1)

Proof. Let ε > 0 and choose brackets [l_j, u_j] such that E(|u_j − l_j|(X)) < ε/2, as is possible by hypothesis. Then for every ω ∈ T^∞ outside a null set A there exists n_0(ω, ε) such that

    max_{j=1,...,N(ε/2)} | (1/n) Σ_{i=1}^n u_j(X_i) − E(u_j) | < ε/2    (2.2)

for all n ≥ n_0, by the strong law of large numbers and taking a union over all j (and likewise for the l_j). For h ∈ H with bracket [l_j, u_j], this gives, outside a set of zero measure,

    (1/n) Σ_{i=1}^n h(X_i) − E h(X) ≤ (1/n) Σ_{i=1}^n u_j(X_i) − E(h)    (2.3)
      = (1/n) Σ_{i=1}^n ( u_j(X_i) − E(u_j) ) + ( E(u_j) − E(h) ),    (2.4)

which is bounded above by ε/2 + ε/2 = ε, as required. The matching lower bound follows in the same way using the l_j, and the conclusion follows since ε > 0 was arbitrary.
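As a concrete illustration of the bracketing ULLN (a hedged sketch of mine, not from the notes), take H = {x ↦ 1(x ≤ t) : t ∈ R}, for which brackets with small E|u_j − l_j| are easy to construct; the supremum in (2.1) is then the Kolmogorov–Smirnov distance, and a simulation shows it shrinking:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    grid = np.linspace(-3, 3, 601)      # t-values used to approximate the sup over h in H
    for n in [100, 1000, 10000]:
        X = rng.standard_normal(n)
        emp = (X[:, None] <= grid[None, :]).mean(axis=0)   # (1/n) sum of h_t(X_i)
        print(n, np.max(np.abs(emp - norm.cdf(grid))))     # sup_t |empirical mean - E h_t(X)|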
3. Consistency of M-estimators

Theorem. Let Θ ⊆ R^p be compact. Let Q : Θ → R be a continuous, non-random function that has a unique minimizer θ_0 ∈ Θ. Let Q_n : Θ → R be any sequence of random functions such that

    sup_{θ∈Θ} |Q_n(θ) − Q(θ)| →^{P} 0    (3.1)

as n → ∞. If θ̂_n is any sequence of minimizers of Q_n, then θ̂_n →^{P} θ_0 as n → ∞.

Proof. The key is to consider the set S_ε = {θ ∈ Θ : ‖θ − θ_0‖ ≥ ε}. This is compact, with Q continuous on it, and so an infimum Q(θ_ε) > Q(θ_0) is attained on this set. Choose δ > 0 so that Q(θ_ε) − δ > Q(θ_0) + δ. Then consider the event A_n(ε) = {sup_{θ∈Θ} |Q_n(θ) − Q(θ)| < δ}. On this event we have

    inf_{S_ε} Q_n(θ) ≥ inf_{S_ε} Q(θ) − δ = Q(θ_ε) − δ > Q(θ_0) + δ ≥ Q_n(θ_0).    (3.2)

So if θ̂_n lay in S_ε, then Q_n(θ_0) would be strictly smaller than Q_n(θ̂_n), contradicting that θ̂_n is a minimizer. Conclude that

    A_n(ε) ⇒ ‖θ̂_n − θ_0‖ < ε,

but as P(A_n(ε)) → 1, we have P(‖θ̂_n − θ_0‖ < ε) → 1.

Theorem. Let Θ be compact in R^p, let X ⊆ R^d, and consider observing X_1, . . . , X_n iid from X ∼ P on X. Let q : X × Θ → R be continuous in θ for all x and measurable in x for all θ ∈ Θ. Assume

    E sup_{θ∈Θ} |q(X, θ)| < ∞.    (3.3)

Then

    sup_{θ∈Θ} | (1/n) Σ_{i=1}^n q(X_i, θ) − E q(X, θ) | →^{as} 0    (3.4)

as n → ∞.

Proof. We seek to find suitable bracketing functions and proceed via the uniform law of large numbers. First, define the brackets u(x, θ, η) = sup_{θ'∈B(θ,η)} q(x, θ') and l(x, θ, η) = inf_{θ'∈B(θ,η)} q(x, θ'), so that E|u(X, θ, η)| < ∞ by assumption. By continuity of q(·, x), the supremum is achieved at points ‖θ_u(θ) − θ‖ < η, and so lim_{η→0} |u(x, θ, η) − q(θ, x)| = 0 at every x and every θ; the dominated convergence theorem then gives lim_{η→0} E|u(X, θ, η) − q(θ, X)| = 0, and likewise for l. Then for ε > 0 and θ ∈ Θ we can choose η(ε, θ) small enough that

    E( u(X, θ, η) − l(X, θ, η) ) < ε.    (3.5)

The open balls {B(θ, η(ε, θ))} are an open cover of the compact set Θ, so there exists a finite subcover by Heine–Borel. The corresponding brackets [l(·, θ_j, η(ε, θ_j)), u(·, θ_j, η(ε, θ_j))] constitute a bracketing set for the class {q(·, θ) : θ ∈ Θ}, and so we can apply the uniform law of large numbers.
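A minimal numerical sketch of the consistency statement (my own illustration; the N(θ, 1) location model, the compact set [−5, 5] and the sample sizes are assumptions): Q_n(θ) = (1/n) Σ q(θ, Y_i) with q(θ, y) = (y − θ)²/2 is minimized numerically and θ̂_n is seen to approach θ_0.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(2)
    theta0 = 1.5

    def Qn(theta, Y):
        # q(theta, y) = -log f(theta, y) for the N(theta, 1) density, up to a constant
        return np.mean(0.5 * (Y - theta) ** 2)

    for n in [50, 500, 5000]:
        Y = theta0 + rng.standard_normal(n)
        res = minimize_scalar(Qn, bounds=(-5.0, 5.0), method="bounded", args=(Y,))
        print(n, res.x, abs(res.x - theta0))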
Theorem (Consistency of the Maximum Likelihood Estimator). Consider the model f(θ, y), θ ∈ Θ ⊆ R^p, y ∈ Y ⊂ R^d. Assume f(θ, y) > 0 for all y ∈ Y and all θ ∈ Θ, and that ∫_Y f(θ, y) dy = 1 for every θ ∈ Θ. Assume further that Θ is compact and that the map θ ↦ f(θ, y) is continuous on Θ for every y ∈ Y. Let Y_1, . . . , Y_n be iid with common density f(θ_0, ·), where θ_0 ∈ Θ. Suppose finally that the identification condition (θ_0 is the unique minimizer over Θ of the map θ ↦ E_{θ_0}(− log f(θ, Y))) and the domination condition

    ∫_Y sup_{θ'∈Θ} | log f(θ', y) | f(θ_0, y) dy < ∞    (3.6)

hold. If θ̂_n is the MLE in the model {f(θ, ·) : θ ∈ Θ} based on the sample Y_1, . . . , Y_n, then θ̂_n is consistent, in that θ̂_n →^{P_{θ_0}} θ_0 as n → ∞.

Proof. Setting

    q(θ, y) = − log f(θ, y),    (3.7)
    Q_n(θ) = (1/n) Σ_{i=1}^n q(θ, Y_i),    (3.8)
    Q(θ) = E_{θ_0} q(θ, Y),    (3.9)

this follows from the previous results.

Definition (Uniform Consistency). An estimator T_n is uniformly consistent in θ ∈ Θ if for every δ > 0,

    sup_{θ_0∈Θ} P_{θ_0}( ‖T_n − θ_0‖ > δ ) → 0    (3.10)

as n → ∞.

Theorem. An M-estimator θ̂_n minimizing Q_n is uniformly consistent if, for every ε > 0,

    inf_{θ_0∈Θ}  inf_{θ∈Θ : ‖θ−θ_0‖≥ε} ( Q(θ) − Q(θ_0) ) > 0    (3.11)

and if, for every δ > 0,

    sup_{θ_0∈Θ} P_{θ_0}( sup_{θ∈Θ} |Q_n(θ; Y_1, . . . , Y_n) − Q(θ)| > δ ) → 0.    (3.12)

4. Asymptotic Distribution Theory

Theorem. Consider the model f(θ, y), θ ∈ Θ ⊂ R^p, y ∈ Y ⊂ R^d. Assume f(θ, y) > 0 for all y ∈ Y and all θ ∈ Θ, and that ∫_Y f(θ, y) dy = 1 for every θ ∈ Θ. Let Y_1, . . . , Y_n be iid from the density f(θ_0, ·) for some θ_0 ∈ Θ. Assume moreover:
  (i) θ_0 is an interior point of Θ.
  (ii) There exists an open set U satisfying θ_0 ∈ U ⊂ Θ such that f(θ, y) is, for every y ∈ Y, twice continuously differentiable with respect to θ on U.
  (iii) E_{θ_0}[ ∂² log f(θ_0, Y)/∂θ∂θᵀ ] is nonsingular, and

    E_{θ_0} ‖ ∂ log f(θ_0, Y)/∂θ ‖² < ∞.    (4.1)

  (iv) There exists a compact ball K ⊂ U with nonempty interior, centered at θ_0, such that

    E_{θ_0} sup_{θ∈K} ‖ ∂² log f(θ, Y)/∂θ∂θᵀ ‖ < ∞,    (4.2)
    ∫_Y sup_{θ∈K} ‖ ∂f(θ, y)/∂θ ‖ dy < ∞,    (4.3)
    ∫_Y sup_{θ∈K} ‖ ∂²f(θ, y)/∂θ∂θᵀ ‖ dy < ∞.    (4.4)

Let θ̂_n be the MLE in the model {f(θ, ·) : θ ∈ Θ} based on the sample Y_1, . . . , Y_n, and assume θ̂_n →^{P_{θ_0}} θ_0 as n → ∞. Define the Fisher information

    i(θ_0) = E_{θ_0}[ (∂ log f(θ_0, Y)/∂θ) (∂ log f(θ_0, Y)/∂θ)ᵀ ].    (4.5)

Then

    i(θ_0) = −E_{θ_0}[ ∂² log f(θ_0, Y)/∂θ∂θᵀ ]    (4.6)

and

    √n ( θ̂_n − θ_0 ) →^{d} N(0, i^{−1}(θ_0))    (4.7)

as n → ∞.

Proof. First, note that as ∫ f(θ, y) dy = 1 for all θ ∈ K, we can differentiate under the integral (using (4.3)) to get ∫ (∂ log f(θ, y)/∂θ) f(θ, y) dy = 0 for every θ ∈ int K, so

    E_{θ_0}[ ∂ log f(θ_0, Y)/∂θ ] = 0.

Since θ̂_n →^{P} θ_0, the estimator θ̂_n is an interior point of Θ on events of probability approaching one, so ∂Q_n(θ̂_n)/∂θ = 0 on those events. Applying the mean value theorem, we have

    0 = √n ∂Q_n(θ_0)/∂θ + Ā_n √n ( θ̂_n − θ_0 ),    (4.8)

where Ā_n is the matrix of second derivatives of Q_n, with row j evaluated at a mean value θ̄_{n,j} on the line segment between θ_0 and θ̂_n.
For the first component, we have

    √n ∂Q_n(θ_0)/∂θ = −(1/√n) Σ_{i=1}^n ∂ log f(θ_0, Y_i)/∂θ →^{d} N(0, i(θ_0))    (4.9)

by the central limit theorem.
For the second component, we show Ā_n →^{P} −E_{θ_0}[ ∂² log f(θ_0, Y)/∂θ∂θᵀ ] ≡ Σ(θ_0), which we do component-wise. We have (Ā_n)_{kj} = (1/n) Σ_{i=1}^n h_{kj}(θ̄_{n,j}, Y_i), where h_{kj} is the second mixed partial derivative of − log f(θ, ·), and we seek to show that each such average converges in probability to E h_{kj}(θ_0, Y). This follows from

    | (1/n) Σ_{i=1}^n h_{kj}(θ̄_{n,j}, Y_i) − E h_{kj}(θ_0, Y) |    (4.10)
      ≤ sup_{θ∈K} | (1/n) Σ_{i=1}^n h_{kj}(θ, Y_i) − E h_{kj}(θ, Y) | + | E h_{kj}(θ̄_{n,j}, Y) − E h_{kj}(θ_0, Y) |,    (4.11)

where, by the uniform law of large numbers, the first term converges to zero, and the fact that θ̄_{n,j} → θ_0 in probability implies that the second term converges to zero. Hence Ā_n →^{P_{θ_0}} Σ(θ_0).
As the limit is invertible, Ā_n is invertible on events of probability approaching one, so we can rewrite the previous result as

    √n ( θ̂_n − θ_0 ) = −Ā_n^{−1} √n ∂Q_n(θ_0)/∂θ →^{d} N(0, Σ^{−1}(θ_0) i(θ_0) Σ^{−1}(θ_0))    (4.12)

from Slutsky's lemma.
Finally, we show Σ(θ_0) = i(θ_0), so that the limiting covariance equals i^{−1}(θ_0). This follows from interchanging integration and differentiation (using (4.4)) to show

    (∂/∂θᵀ) ∫ ∂f(θ, y)/∂θ dy = ∫ ∂²f(θ, y)/∂θ∂θᵀ dy = 0    (4.13)

for all θ ∈ int K, then using the chain rule to show

    ∂² log f(θ, y)/∂θ∂θᵀ = (1/f(θ, y)) ∂²f(θ, y)/∂θ∂θᵀ − (∂ log f(θ, y)/∂θ)(∂ log f(θ, y)/∂θ)ᵀ,    (4.14)

and using this identity at θ_0, taking expectations under f(θ_0, ·).

Theorem (Cramér–Rao lower bound). In the framework of the previous theorem with p = 1 and for n ∈ N fixed, let θ̃ = θ̃(Y_1, . . . , Y_n) be any unbiased estimator of θ, that is, one satisfying E_θ θ̃ = θ for all θ ∈ Θ. Then

    V_θ(θ̃) ≥ 1/(n i(θ))    (4.15)

for all θ ∈ int(Θ).

Proof. This follows from the Cauchy–Schwarz inequality and E_θ Σ_{i=1}^n (d/dθ) log f(θ, Y_i) = 0. Specifically, letting l(θ, Y) = Σ_{i=1}^n log f(θ, Y_i) denote the log-likelihood, so that l′(θ, Y) = Σ_{i=1}^n (d/dθ) log f(θ, Y_i), we have

    V_θ(θ̃) ≥ Cov²_θ(θ̃, l′(θ, Y)) / V_θ(l′(θ, Y)) = 1/(n i(θ)),    (4.16)

since V_θ(l′(θ, Y)) = n i(θ) and

    Cov_θ(θ̃, l′(θ, Y)) = ∫ θ̃(y) l′(θ, y) ∏_{i=1}^n f(θ, y_i) dy    (4.17)
      = ∫ θ̃(y) (d/dθ) ∏_{i=1}^n f(θ, y_i) dy = (d/dθ) E_θ θ̃ = (d/dθ) θ = 1.    (4.18)
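A simulation sketch of (4.7) and the Cramér–Rao benchmark (4.15) (mine; the Exponential(θ_0) model, with density f(θ, y) = θe^{−θy}, MLE 1/Ȳ and i(θ) = 1/θ², is an illustrative assumption):

    import numpy as np

    rng = np.random.default_rng(3)
    theta0, n, reps = 2.0, 400, 5000
    Y = rng.exponential(scale=1.0 / theta0, size=(reps, n))
    mle = 1.0 / Y.mean(axis=1)               # MLE of theta in the Exponential(theta) model
    Z = np.sqrt(n) * (mle - theta0)
    print("empirical variance of sqrt(n)(mle - theta0):", Z.var())
    print("asymptotic variance 1/i(theta0) = theta0^2  :", theta0 ** 2)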
Theorem (Delta Method). Let Θ be an open subset of R^p and let Φ : Θ → R^m be differentiable at θ ∈ Θ, with derivative DΦ_θ. Let r_n be a divergent sequence of positive real numbers and let X_n be random variables taking values in Θ such that r_n(X_n − θ) →^{d} X as n → ∞. Then

    r_n( Φ(X_n) − Φ(θ) ) →^{d} DΦ_θ(X)    (4.19)

as n → ∞. If X ∼ N(0, i^{−1}(θ)), then

    DΦ_θ(X) ∼ N(0, DΦ_θ i^{−1}(θ) DΦ_θᵀ).    (4.20)

Definition (Likelihood Ratio test statistic). Suppose we observe Y_1, . . . , Y_n from f(θ, ·), and consider the testing problem H_0 : θ ∈ Θ_0 against H_1 : θ ∈ Θ, where Θ_0 ⊂ Θ ⊂ R^p. The Neyman–Pearson theory suggests testing these hypotheses by the likelihood ratio test statistic

    Λ_n(Θ, Θ_0) = 2 log [ sup_{θ∈Θ} ∏_{i=1}^n f(θ, Y_i) / sup_{θ∈Θ_0} ∏_{i=1}^n f(θ, Y_i) ],    (4.21)

which in terms of the maximum likelihood estimators θ̂_n, θ̂_{n,0} of the models Θ, Θ_0 is

    Λ_n(Θ, Θ_0) = −2 Σ_{i=1}^n [ log f(θ̂_{n,0}, Y_i) − log f(θ̂_n, Y_i) ].    (4.22)

Theorem. Consider a parametric model f(θ, y), θ ∈ Θ ⊂ R^p, that satisfies the assumptions of the theorem on asymptotic normality of the MLE. Consider the simple null hypothesis Θ_0 = {θ_0}, θ_0 ∈ Θ. Then under H_0 the likelihood ratio test statistic is asymptotically chi-squared distributed:

    Λ_n(Θ, Θ_0) →^{d} χ²_p    (4.23)

as n → ∞ under P_{θ_0}.

Proof. Since Λ_n(Θ, Θ_0) = 2nQ_n(θ_0) − 2nQ_n(θ̂_n), we can expand this in a Taylor series around θ̂_n, obtaining

    Λ_n(Θ, Θ_0) = 2n (∂Q_n(θ̂_n)/∂θ)ᵀ (θ_0 − θ̂_n) + n (θ_0 − θ̂_n)ᵀ [ ∂²Q_n(θ̄_n)/∂θ∂θᵀ ] (θ_0 − θ̂_n)    (4.24)

for some vector θ̄_n on the line segment between θ̂_n and θ_0. The first term vanishes on events of probability approaching one, since ∂Q_n(θ̂_n)/∂θ = 0 there. As in the proof of the previous theorem, we show that Ā_n = ∂²Q_n(θ̄_n)/∂θ∂θᵀ converges to i(θ_0) in probability. Thus, by Slutsky's lemma and consistency, √n(θ̂_n − θ_0)ᵀ (Ā_n − i(θ_0)) converges to zero in distribution and hence in probability (as the limit is constant), so we can repeat the argument and obtain

    √n(θ̂_n − θ_0)ᵀ (Ā_n − i(θ_0)) √n(θ̂_n − θ_0) →^{P_{θ_0}} 0,    (4.25)

and so Λ_n(Θ, Θ_0) has the same limit distribution as the random variable

    √n(θ̂_n − θ_0)ᵀ i(θ_0) √n(θ̂_n − θ_0).    (4.26)

By continuity of x ↦ xᵀ i(θ_0) x, the limiting distribution is that of Xᵀ i(θ_0) X with X ∼ N(0, i^{−1}(θ_0)), which is the squared Euclidean norm of an N(0, I_p) vector and hence has a χ²_p distribution.

4.1. Local Asymptotic Normality and Contiguity.

Definition (Local Asymptotic Normality). Consider a parametric model f(θ) ≡ f(θ, ·), θ ∈ Θ ⊂ R^p, and let q(θ, y) = − log f(θ, y). Suppose (∂/∂θ) q(θ_0, y) and the Fisher information i(θ_0) exist at the interior point θ_0 ∈ Θ. We say that the model {f(θ) : θ ∈ Θ} is locally asymptotically normal at θ_0 if for every convergent sequence h_n → h and for Y_1, . . . , Y_n iid ∼ f(θ_0) we have, as n → ∞,

    log ∏_{i=1}^n [ f(θ_0 + h_n/√n)(Y_i) / f(θ_0)(Y_i) ] = −(1/√n) Σ_{i=1}^n hᵀ ∂q(θ_0, Y_i)/∂θ − (1/2) hᵀ i(θ_0) h + o_{P_{θ_0}}(1).    (4.27)

We say that the model {f(θ) : θ ∈ Θ} is locally asymptotically normal if it is locally asymptotically normal at every θ ∈ int Θ.

Theorem. Consider a parametric model {f(θ), θ ∈ Θ}, Θ ⊂ R^p, that satisfies the assumptions of the theorem on the asymptotic normality of the MLE. Then {f(θ) : θ ∈ Θ'} is locally asymptotically normal for every open subset Θ' of Θ.

Proof. We prove the case h_n = h fixed. As before, we can expand log f(θ_0 + h/√n) about log f(θ_0) up to second order, and obtain

    log ∏_{i=1}^n [ f(θ_0 + h/√n)(Y_i) / f(θ_0)(Y_i) ] = −(1/√n) Σ_{i=1}^n hᵀ ∂q(θ_0, Y_i)/∂θ − (1/(2n)) hᵀ [ Σ_{i=1}^n ∂²q(θ̄_n, Y_i)/∂θ∂θᵀ ] h    (4.28)

for some vector θ̄_n on the line segment between θ_0 and θ_0 + h/√n. By the uniform law of large numbers, we have

    (1/(2n)) hᵀ [ Σ_{i=1}^n ∂²q(θ̄_n, Y_i)/∂θ∂θᵀ ] h − (1/2) hᵀ i(θ_0) h →^{P_{θ_0}} 0    (4.29)

as n → ∞.

Definition (Contiguity). Let P_n, Q_n be two sequences of probability measures. We say that Q_n is contiguous with respect to P_n if for every sequence of measurable sets A_n, the hypothesis P_n(A_n) → 0 as n → ∞ implies Q_n(A_n) → 0 as n → ∞, and write Q_n ◁ P_n. The sequences are mutually contiguous if both Q_n ◁ P_n and P_n ◁ Q_n, and we write P_n ◁▷ Q_n.

Theorem (Le Cam's First Lemma). Let P_n, Q_n be probability measures on measurable spaces (Ω_n, A_n). Then the following are equivalent:
  (i) Q_n ◁ P_n.
  (ii) If dP_n/dQ_n →^{d} U under Q_n along a subsequence of n, then P(U > 0) = 1.
  (iii) If dQ_n/dP_n →^{d} V under P_n along a subsequence of n, then E(V) = 1.
  (iv) For any sequence of statistics (measurable functions T_n : Ω_n → R^k), T_n →^{P_n} 0 as n → ∞ implies T_n →^{Q_n} 0 as n → ∞.

Proof. (i) ⟺ (iv): follows by taking A_n = {‖T_n‖ > ε}, so Q_n(A_n) → 0 implies T_n →^{Q_n} 0. Conversely, take T_n = 1_{A_n}. (i) ⇒ (ii):

Theorem. Let P_n, Q_n be sequences of probability measures on measurable spaces (Ω_n, A_n) such that dP_n/dQ_n →^{d} e^X under Q_n, where X ∼ N(−σ²/2, σ²) for some σ² > 0, as n → ∞. Then P_n ◁▷ Q_n.

Proof. Since P(e^X > 0) = 1, we get Q_n ◁ P_n from (ii), and since E e^{N(µ,σ²)} = 1 ⟺ µ = −σ²/2, P_n ◁ Q_n follows from (iii).

Theorem. If {f(θ) : θ ∈ Θ} is locally asymptotically normal and if h_n → h ∈ R^p, then the product measures P^n_{θ+h_n/√n} and P^n_θ corresponding to samples X_1, . . . , X_n from the densities f(θ + h_n/√n) and f(θ), respectively, are mutually contiguous. In particular, if a statistic T(Y_1, . . . , Y_n) converges to zero in probability under P^n_θ, then it also converges to zero in P^n_{θ+h_n/√n}-probability.

Proof. This follows from the fact that the asymptotic expansion of the log-likelihood ratio converges in distribution to N(−hᵀ i(θ) h / 2, hᵀ i(θ) h) under P_θ, so the previous theorem applies; the final claim then follows from (iv).
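To illustrate the χ²_p limit (4.23) of the likelihood ratio statistic (a hedged sketch of mine; the Exponential model and the simple null are assumptions), one can simulate Λ_n under H_0 and compare its rejection rate against the χ²_1 quantile:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(4)
    theta0, n, reps = 1.0, 500, 20000
    Y = rng.exponential(scale=1.0 / theta0, size=(reps, n))
    mle = 1.0 / Y.mean(axis=1)

    def loglik(theta, Y):
        # Exponential(theta) log-likelihood: n log(theta) - theta * sum(Y)
        return Y.shape[-1] * np.log(theta) - theta * Y.sum(axis=-1)

    Lam = 2.0 * (loglik(mle, Y) - loglik(theta0, Y))
    print("rejection rate at the chi2_1 95% quantile (should be near 0.05):",
          np.mean(Lam > chi2.ppf(0.95, df=1)))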
4.2. Bayesian Inference.

Definition. ‖P − Q‖_{TV} = sup_{B∈B(R^p)} |P(B) − Q(B)| is the total variation distance on the set of probability measures on the Borel σ-algebra B(R^p) of R^p.

Theorem (Bernstein–von Mises Theorem). Consider a parametric model {f(θ), θ ∈ Θ}, Θ ⊂ R^p, that satisfies the assumptions of the theorem on the asymptotic normality of the MLE. Suppose the model admits a uniformly consistent estimator T_n. Let X_1, . . . , X_n be iid from the density f(θ_0), let θ̂_n be the MLE based on that sample, and assume the prior measure Π is defined on the Borel sets of R^p and possesses a Lebesgue density π that is continuous and positive in a neighborhood of θ_0. Then, if Π(· | X_1, . . . , X_n) is the posterior distribution given the sample, we have

    ‖ Π(· | X_1, . . . , X_n) − N(θ̂_n, (1/n) i^{−1}(θ_0)) ‖_{TV} →^{P_{θ_0}} 0    (4.30)

as n → ∞.
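A small illustration of the Bernstein–von Mises phenomenon (my own sketch; the Bernoulli(θ_0) likelihood with a Beta(1, 1) prior is an assumption, and i(θ_0) is replaced by the plug-in i(θ̂_n) = 1/(θ̂_n(1 − θ̂_n))): the exact Beta posterior is compared with its normal approximation in an approximate total variation distance.

    import numpy as np
    from scipy.stats import beta, norm

    rng = np.random.default_rng(5)
    theta0, n = 0.3, 2000
    Y = rng.binomial(1, theta0, size=n)
    k = Y.sum()
    mle = k / n
    post = beta(1 + k, 1 + n - k)                 # exact posterior under a Beta(1, 1) prior
    i_hat = 1.0 / (mle * (1.0 - mle))             # Fisher information of Bernoulli(theta) at the MLE
    approx = norm(loc=mle, scale=np.sqrt(1.0 / (n * i_hat)))
    half = 5.0 / np.sqrt(n * i_hat)
    grid = np.linspace(mle - half, mle + half, 2001)
    dx = grid[1] - grid[0]
    tv = 0.5 * np.sum(np.abs(post.pdf(grid) - approx.pdf(grid))) * dx   # ~ total variation distance
    print("approximate TV distance:", tv)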
5. High Dimensional Linear Models

Here we consider the model Y = Xθ + ε, ε ∼ N(0, σ² I_n), θ ∈ Θ = R^p, σ² > 0, where X is an n × p design matrix and ε is a Gaussian noise vector in R^n. Throughout, we denote the resulting p × p Gram matrix by Σ̂ = (1/n) XᵀX, which is symmetric and positive semidefinite. Write a ≲ b for a ≤ Cb for some fixed (ideally harmless) constant C > 0.

Theorem. In the case p ≤ n, the classical least squares estimator introduced by Gauss solves the problem

    min_{θ∈R^p} ‖Y − Xθ‖²₂ / n    (5.1)

with the solution θ̂ = (XᵀX)^{−1} XᵀY ∼ N(θ, σ²(XᵀX)^{−1}), where we rely on X having full column rank so that XᵀX is invertible. Assuming XᵀX/n = I_p, we have

    (1/n) E_θ ‖X(θ̂ − θ)‖²₂ = E_θ ‖θ̂ − θ‖²₂ = (σ²/n) tr(I_p) = σ² p / n.    (5.2)

Definition. θ^0 ∈ B_0(k) ≡ {θ ∈ R^p : θ has at most k nonzero entries}. For θ^0 ∈ B_0(k), call S_0 = {j : θ^0_j ≠ 0} the active set of θ^0.
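A quick Monte Carlo check of the risk identity (5.2) (my sketch; the orthonormalized design with XᵀX/n = I_p and the constants are assumptions):

    import numpy as np

    rng = np.random.default_rng(6)
    n, p, sigma, reps = 200, 10, 1.5, 2000
    Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    X = np.sqrt(n) * Q                              # design with X^T X / n = I_p
    theta = rng.standard_normal(p)
    err = []
    for _ in range(reps):
        Y = X @ theta + sigma * rng.standard_normal(n)
        theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
        err.append(np.sum((theta_hat - theta) ** 2))
    print("Monte Carlo E||theta_hat - theta||^2:", np.mean(err))
    print("theory sigma^2 p / n               :", sigma ** 2 * p / n)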
5.1. The LASSO.

Definition (The LASSO). The LASSO estimator is

    θ̃ = θ̃_{LASSO} = arg min_{θ∈R^p} [ ‖Y − Xθ‖²₂ / n + λ‖θ‖₁ ].

Theorem (The LASSO performs almost as well as the LS estimator). Let θ^0 ∈ B_0(k) be a k-sparse vector in R^p with active set S_0. Suppose Y = Xθ^0 + ε where ε ∼ N(0, I_n), and let θ̃ be the LASSO estimator with penalization parameter

    λ = 4 σ̄ √( (t² + 2 log p) / n ),    σ̄² = max_{j=1,...,p} Σ̂_{jj},    (5.3)

and assume the n × p matrix X is such that, for some r_0 > 0,

    ‖θ̃_{S_0} − θ^0‖²₁ ≤ k r_0 (θ̃ − θ^0)ᵀ Σ̂ (θ̃ − θ^0)    (5.4)

on an event of probability at least 1 − β. Then with probability at least 1 − β − e^{−t²/2} we have

    (1/n) ‖X(θ̃ − θ^0)‖²₂ + λ‖θ̃ − θ^0‖₁ ≤ 4λ²k r_0 ≲ (k/n) log p.    (5.5)

Proof. Note that by definition we have

    (1/n) ‖Y − Xθ̃‖²₂ + λ‖θ̃‖₁ ≤ (1/n) ‖Y − Xθ^0‖²₂ + λ‖θ^0‖₁,    (5.6)
    (1/n) ‖X(θ^0 − θ̃)‖²₂ + λ‖θ̃‖₁ ≤ (2/n) εᵀX(θ̃ − θ^0) + λ‖θ^0‖₁,    (5.7)

by inserting the model equation. Using the tail bound of the next theorem, we have on an event A of probability at least 1 − e^{−t²/2},

    | (2/n) εᵀX(θ̃ − θ^0) | ≤ (λ/2) ‖θ̃ − θ^0‖₁,    (5.8)

and thus, combining with the above result, we obtain on A

    (2/n) ‖X(θ^0 − θ̃)‖²₂ + 2λ‖θ̃‖₁ ≤ λ‖θ̃ − θ^0‖₁ + 2λ‖θ^0‖₁.    (5.9)

Using ‖θ̃‖₁ = ‖θ̃_{S_0}‖₁ + ‖θ̃_{S_0^c}‖₁ ≥ ‖θ^0_{S_0}‖₁ − ‖θ̃_{S_0} − θ^0_{S_0}‖₁ + ‖θ̃_{S_0^c}‖₁, we obtain on this event, noting θ^0_{S_0^c} = 0 by definition of S_0,

    (2/n) ‖X(θ^0 − θ̃)‖²₂ + 2λ‖θ̃_{S_0^c}‖₁ ≤ 3λ‖θ̃_{S_0} − θ^0_{S_0}‖₁ + λ‖θ̃_{S_0^c}‖₁    (5.10)

and so

    (2/n) ‖X(θ^0 − θ̃)‖²₂ + λ‖θ̃_{S_0^c}‖₁ ≤ 3λ‖θ̃_{S_0} − θ^0_{S_0}‖₁    (5.11)

holds on the event. Then, on the intersection with the event where (5.4) holds, we have

    (2/n) ‖X(θ̃ − θ^0)‖²₂ + λ‖θ̃ − θ^0‖₁ = (2/n) ‖X(θ̃ − θ^0)‖²₂ + λ‖θ̃_{S_0} − θ^0_{S_0}‖₁ + λ‖θ̃_{S_0^c}‖₁    (5.12)
      ≤ 4λ‖θ̃_{S_0} − θ^0_{S_0}‖₁    (5.13)
      ≤ 4λ √(k r_0 / n) ‖X(θ̃ − θ^0)‖₂    (5.14)
      ≤ (1/n) ‖X(θ̃ − θ^0)‖²₂ + 4λ²k r_0,    (5.15)

using the previous inequalities, condition (5.4), and 4ab ≤ a² + 4b². Subtracting (1/n)‖X(θ̃ − θ^0)‖²₂ from both sides gives the result.

Theorem. Let λ_0 = λ/2. Then for all t > 0,

    P( max_{j=1,...,p} (2/n) |(εᵀX)_j| ≤ λ_0 ) ≥ 1 − e^{−t²/2}.    (5.16)

Proof. Note that the coordinates of εᵀX/√n are N(0, Σ̂_{jj}) distributed, with Σ̂_{jj} ≤ σ̄². We then have that the probability in question exceeds one minus

    P( max_{j=1,...,p} (1/√n) |(εᵀX)_j| > σ̄ √(t² + 2 log p) )    (5.17)
      ≤ Σ_{j=1}^p P( |Z| > √(t² + 2 log p) ),  Z ∼ N(0, 1),    (5.18)
      ≤ p e^{−t²/2} e^{−log p} = e^{−t²/2},    (5.19)

using the Gaussian tail bound P(|Z| > u) ≤ e^{−u²/2}.

5.2. Coherence Conditions for Design Matrices. The critical condition is

    ‖θ̃_{S_0} − θ^0‖²₁ ≤ k r_0 (θ̃ − θ^0)ᵀ Σ̂ (θ̃ − θ^0)    (5.20)

holding true with high probability.

Theorem. The theorem on the LASSO holds true with the crucial condition (5.20) replaced with the following condition: for S_0 the active set of θ^0 ∈ B_0(k), k ≤ p, assume the n × p matrix X satisfies, for all θ in

    {θ ∈ R^p : ‖θ_{S_0^c}‖₁ ≤ 3‖θ_{S_0} − θ^0_{S_0}‖₁}    (5.21)

and some universal constant r_0,

    ‖θ_{S_0} − θ^0‖²₁ ≤ k r_0 (θ − θ^0)ᵀ Σ̂ (θ − θ^0).    (5.22)
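The following is an illustrative LASSO fit for the oracle bound (5.5) (my own sketch; note that scikit-learn's Lasso minimizes ‖Y − Xθ‖²₂/(2n) + α‖θ‖₁, so α = λ/2 in the notation above, and the dimensions, sparsity level and r_0 ≈ 1 benchmark are assumptions):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    n, p, k, t = 200, 1000, 5, 1.0
    X = rng.standard_normal((n, p))
    theta0 = np.zeros(p)
    theta0[:k] = 3.0                                    # k-sparse truth, active set S0 = {0, ..., k-1}
    Y = X @ theta0 + rng.standard_normal(n)             # noise ~ N(0, I_n)

    sigma_bar = np.sqrt(np.max((X ** 2).mean(axis=0)))  # sqrt of max_j Sigma_hat_jj
    lam = 4 * sigma_bar * np.sqrt((t ** 2 + 2 * np.log(p)) / n)
    fit = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50000).fit(X, Y)
    theta_tilde = fit.coef_

    pred_err = np.sum((X @ (theta_tilde - theta0)) ** 2) / n
    print("prediction error (1/n)||X(theta_tilde - theta0)||^2:", pred_err)
    print("benchmark 4 lambda^2 k (taking r0 ~ 1)             :", 4 * lam ** 2 * k)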
Theorem. Let the n × p matrix X have entries (X_{ij}) ∼ iid N(0, 1), and let Σ̂ = XᵀX/n. Suppose n / log p → ∞ as min(p, n) → ∞. Then for every k ∈ N fixed and every 0 < C < ∞, there exists n large enough such that

    P( θᵀΣ̂θ ≥ (1/2) ‖θ‖²₂  for all θ ∈ B_0(k) ) ≥ 1 − 2 exp(−C k log p).

Proof. For θ = 0 the inequality is trivial. Thus it suffices to bound

    P( θᵀΣ̂θ ≥ (1/2) ‖θ‖²₂  for all θ ∈ B_0(k) \ {0} )    (5.23)
      = P( θᵀΣ̂θ/‖θ‖²₂ − 1 ≥ −1/2  for all θ ∈ B_0(k) \ {0} )    (5.24)
      ≥ P( sup_{θ∈B_0(k), ‖θ‖₂≠0} | θᵀΣ̂θ/θᵀθ − 1 | ≤ 1/2 )    (5.25)

from below by 1 − 2 exp(−C k log p). We can do this over each k-dimensional subspace R^p_S, S ⊂ {1, . . . , p} with |S| = k, and then use

    P( sup_{θ∈B_0(k), ‖θ‖₂≠0} | θᵀΣ̂θ/θᵀθ − 1 | ≥ 1/2 )    (5.26)
      ≤ Σ_{S⊆{1,...,p}, |S|=k} P( sup_{θ∈R^p_S, ‖θ‖₂≠0} | θᵀΣ̂θ/θᵀθ − 1 | ≥ 1/2 ).    (5.27)

Then we just need a bound of 2e^{−(C+1)k log p} = 2e^{−Ck log p} p^{−k} for each term and sum over the (p choose k) ≤ p^k subsets. Using the result below with t = (C+1)k log p is then sufficient, since n/log p → ∞ guarantees that for n large enough 18( √((t + c_0 k)/n) + (t + c_0 k)/n ) ≤ 1/2.
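A rough empirical probe of the event {θᵀΣ̂θ ≥ ‖θ‖²₂/2 for all θ ∈ B_0(k)} for a Gaussian design (my sketch; the Monte Carlo search over random k-sparse directions is only a heuristic check of the lower bound, not the full supremum over B_0(k)):

    import numpy as np

    rng = np.random.default_rng(8)
    n, p, k, trials = 400, 2000, 5, 2000
    X = rng.standard_normal((n, p))

    worst = np.inf
    for _ in range(trials):
        S = rng.choice(p, size=k, replace=False)    # random support of size k
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        quad = np.sum((X[:, S] @ v) ** 2) / n       # theta^T Sigma_hat theta for a unit k-sparse theta
        worst = min(worst, quad)
    print("smallest quadratic form found (should stay above 1/2 w.h.p.):", worst)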
Theorem. Under the conditions of the previous theorem, we have for some universal constant c_0 > 0, every S ⊂ {1, . . . , p} such that |S| = k, and every t > 0,

    P( sup_{θ∈R^p_S, ‖θ‖₂≠0} | θᵀΣ̂θ/θᵀθ − 1 | ≥ 18( √((t + c_0 k)/n) + (t + c_0 k)/n ) ) ≤ 2e^{−t}.    (5.28)

Proof (Nontrivial!). Note that

    sup_{θ∈R^p_S, ‖θ‖₂≠0} | θᵀΣ̂θ/θᵀθ − 1 | = sup_{θ∈R^p_S, ‖θ‖₂≤1} | θᵀ(Σ̂ − I)θ |.    (5.29)

By compactness, we can cover the unit ball B(S) = {θ ∈ R^p_S : ‖θ‖₂ ≤ 1} by a net of points θ^l, l = 1, . . . , N(δ), such that for every θ ∈ B(S) there exists l with ‖θ − θ^l‖₂ ≤ δ. Then with Φ = Σ̂ − I we have

    θᵀΦθ = (θ − θ^l)ᵀΦ(θ − θ^l) + (θ^l)ᵀΦθ^l + 2(θ − θ^l)ᵀΦθ^l.    (5.30)

The first term is bounded in absolute value by δ² sup_{v∈B(S)} |vᵀΦv| and the third term by 2δ sup_{v∈B(S)} |vᵀΦv| (using the symmetry of Φ on R^p_S). Thus

    sup_{θ∈B(S)} |θᵀΦθ| ≤ max_{l=1,...,N(δ)} |(θ^l)ᵀΦθ^l| + (δ² + 2δ) sup_{v∈B(S)} |vᵀΦv|,    (5.31)

which gives the bound

    sup_{θ∈B(S)} |θᵀΦθ| ≤ (9/2) max_{l=1,...,N(δ)} |(θ^l)ᵀΦθ^l|    (5.32)

at δ = 1/3. At θ^l ∈ B(S) fixed, we have

    (θ^l)ᵀΦθ^l = (1/n) Σ_{i=1}^n ( (Xθ^l)²_i − E(Xθ^l)²_i ),    (5.33)

and the random variables (Xθ^l)_i are IID N(0, ‖θ^l‖²₂) distributed, with variance ‖θ^l‖²₂ ≤ 1. Thus, for g_i iid N(0, 1), we have

    P( (9/2) max_{l=1,...,N(1/3)} |(θ^l)ᵀΦθ^l| > 18( √((t + c_0 k)/n) + (t + c_0 k)/n ) )    (5.34)
      ≤ Σ_{l=1}^{N(1/3)} P( |(θ^l)ᵀΦθ^l| > 4‖θ^l‖²₂ ( √((t + c_0 k)/n) + (t + c_0 k)/n ) )    (5.35)
      = Σ_{l=1}^{N(1/3)} P( | Σ_{i=1}^n (g_i² − 1) | > 4( √(n(t + c_0 k)) + t + c_0 k ) )    (5.36)
      ≤ 2 N(1/3) e^{−t} e^{−c_0 k}    (5.37)
      ≤ 2 e^{−t},    (5.38)

where we apply the next inequality with z = t + c_0 k, and rely on the covering numbers of the unit ball in k-dimensional Euclidean space satisfying N(δ) ≤ (A/δ)^k for some universal A > 0, choosing c_0 so that N(1/3) ≤ e^{c_0 k}.

Theorem. Let g_i, i = 1, . . . , n, be iid N(0, 1), and set X = Σ_{i=1}^n (g_i² − 1). Then for all t ≥ 0 and n ∈ N,

    P(|X| ≥ t) ≤ 2 exp( −t² / (4(n + t)) ).    (5.39)

Proof. For |λ| < 1/2 we can compute the MGF

    E e^{λ(g²−1)} = e^{−λ} / √(1 − 2λ) = exp( (1/2)[ −log(1 − 2λ) − 2λ ] ).

Then, taking Taylor expansions, we have

    (1/2)[ −log(1 − 2λ) − 2λ ] = λ²( 1 + (2/3)(2λ) + · · · + (2/(k+2))(2λ)^k + · · · ) ≤ λ²/(1 − 2λ)    (5.40)

and, by independence, log E e^{λX} ≤ nλ²/(1 − 2λ). Then by Markov's inequality,

    P(X > t) ≤ E e^{λX − λt} ≤ exp( nλ²/(1 − 2λ) − λt ) = exp( −t²/(4(n + t)) ),

taking λ = t/(2n + 2t); applying the same argument to −X bounds P(X < −t) and gives the factor 2. Then taking t = 4(√(nz) + z) in (5.39), we obtain the tail bound required in the previous proof.
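Finally, a numerical sanity check of the bound (5.39) (my own sketch; n, the thresholds and the number of repetitions are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(9)
    n, reps = 50, 200000
    G = rng.standard_normal((reps, n))
    X = (G ** 2 - 1.0).sum(axis=1)
    for t in [10.0, 20.0, 40.0]:
        emp = np.mean(np.abs(X) >= t)
        bound = 2.0 * np.exp(-t ** 2 / (4.0 * (n + t)))
        print(t, "empirical:", emp, "bound:", bound)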