Part 1: Prediction with the Lasso

Theory for the Lasso
Recall the linear model
\[
Y_i = \sum_{j=1}^{p} \beta_j X_i^{(j)} + \varepsilon_i, \qquad i = 1, \dots, n,
\]
or, in matrix notation,
\[
Y = X\beta + \varepsilon .
\]
To simplify, we assume that the design $X$ is fixed, and that $\varepsilon$ is $N(0, \sigma^2 I)$-distributed.
We moreover assume that the linear model holds exactly, with some
“true parameter value” $\beta^0$.
What is an oracle inequality?
Suppose for the moment that p ≤ n and that X has full rank p.
Consider the least squares estimator in the linear model
\[
\hat\beta_{\rm LM} := (X^T X)^{-1} X^T Y .
\]
Then the prediction error
\[
\|X(\hat\beta_{\rm LM} - \beta^0)\|_2^2 / \sigma^2
\]
is $\chi^2_p$-distributed. In particular, this means that
\[
\frac{\mathbb{E}\|X(\hat\beta_{\rm LM} - \beta^0)\|_2^2}{n\sigma^2} = \frac{p}{n} .
\]
In words: each parameter $\beta_j^0$ is estimated with squared accuracy $\sigma^2/n$, $j = 1, \dots, p$. The overall squared accuracy is then $(\sigma^2/n) \times p$.
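The $\chi^2_p$ claim is simply the statement that the fitted values project the noise onto a $p$-dimensional subspace; a one-line derivation:
\[
X(\hat\beta_{\rm LM} - \beta^0) = X(X^T X)^{-1} X^T (X\beta^0 + \varepsilon) - X\beta^0 = P\varepsilon,
\qquad P := X(X^T X)^{-1} X^T ,
\]
with $P$ the orthogonal projection onto the $p$-dimensional column space of $X$. Since $\varepsilon \sim N(0, \sigma^2 I)$, we have $\|P\varepsilon\|_2^2/\sigma^2 \sim \chi^2_p$, and taking expectations gives the display above.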
Sparsity
We now turn to the situation where possibly $p > n$. The philosophy that will generally rescue us is to “believe” that in fact only a few, say $s_0$, of the $\beta_j^0$ are non-zero.
We use the notation
\[
S_0 := \{ j : \beta_j^0 \neq 0 \},
\]
so that $s_0 = |S_0|$. We call $S_0$ the active set, and $s_0$ the sparsity index of $\beta^0$.
Notation
\[
\beta_{j, S_0} := \beta_j \, 1\{ j \in S_0 \}, \qquad \beta_{j, S_0^c} := \beta_j \, 1\{ j \notin S_0 \}.
\]
Clearly,
\[
\beta = \beta_{S_0} + \beta_{S_0^c},
\]
and $\beta^0_{S_0^c} = 0$.
If we knew $S_0$, we could simply neglect all variables $X^{(j)}$ with $j \notin S_0$. Then, by the above argument, the overall squared accuracy would be $(\sigma^2/n) \times s_0$.
With $S_0$ unknown, we apply the $\ell_1$-penalty, i.e., the Lasso
\[
\hat\beta := \arg\min_{\beta} \Big\{ \|Y - X\beta\|_2^2/n + \lambda \|\beta\|_1 \Big\}.
\]
Definition: Sparsity oracle inequality. The sparsity constant $\phi_0$ is the largest value $\phi_0 > 0$ such that the Lasso $\hat\beta$ satisfies the $\phi_0$-sparsity oracle inequality
\[
\|X(\hat\beta - \beta^0)\|_2^2/n + \lambda \|\hat\beta_{S_0^c}\|_1 \le \frac{\lambda^2 s_0}{\phi_0^2} .
\]
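As a small illustration of the estimator just defined (a sketch with simulated data, assuming scikit-learn is available; the choice of $\lambda$ is purely illustrative, and note that sklearn's Lasso minimizes $\|Y - X\beta\|_2^2/(2n) + \alpha\|\beta\|_1$, so its $\alpha$ corresponds to $\lambda/2$ in the notation above):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s0, sigma = 100, 200, 5, 1.0            # p > n, sparse truth
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 2.0                               # active set S0 = {1, ..., 5}
y = X @ beta0 + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(p) / n)   # illustrative choice, of the order of lambda_0 below
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)
beta_hat = fit.coef_

print("||X(beta_hat - beta0)||_2^2 / n :", np.sum((X @ (beta_hat - beta0)) ** 2) / n)
print("estimated active set:", np.flatnonzero(beta_hat != 0))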
A digression: the noiseless case
Let $\mathcal{X}$ be some measurable space, $Q$ a probability measure on $\mathcal{X}$, and $\|\cdot\|$ the $L_2(Q)$ norm. Consider a fixed dictionary of functions $\{\psi_j\}_{j=1}^p \subset L_2(Q)$.
Consider linear functions
\[
f_\beta(\cdot) = \sum_{j=1}^{p} \beta_j \psi_j(\cdot), \qquad \beta \in \mathbb{R}^p .
\]
Consider moreover a fixed target
\[
f^0 := \sum_{j=1}^{p} \beta_j^0 \psi_j .
\]
We let $S_0 := \{ j : \beta_j^0 \neq 0 \}$ be its active set, and $s_0 := |S_0|$ be the sparsity index of $f^0$.
For some fixed $\lambda > 0$, the Lasso for the noiseless problem is
\[
\beta^* := \arg\min_{\beta} \Big\{ \|f_\beta - f^0\|^2 + \lambda \|\beta\|_1 \Big\},
\]
where $\|\cdot\|_1$ is the $\ell_1$-norm. We write $f^* := f_{\beta^*}$ and let $S_*$ be the active set of the Lasso.
The Gram matrix is
\[
\Sigma := \int \psi^T \psi \, dQ .
\]
We will need certain conditions on the Gram matrix to make the theory work. We require a certain compatibility of $\ell_1$-norms with $\ell_2$-norms.
Compatibility. Let $L > 0$ be some constant. The compatibility constant is
\[
\phi^2_{\Sigma}(L, S_0) := \phi^2(L, S_0) := \min\Big\{ s_0 \, \beta^T \Sigma \beta : \|\beta_{S_0}\|_1 = 1,\ \|\beta_{S_0^c}\|_1 \le L \Big\}.
\]
We say that the $(L, S_0)$-compatibility condition is met if $\phi(L, S_0) > 0$.
Back to the noisy case
Lemma (Basic Inequality)
We have
\[
\|X(\hat\beta - \beta^0)\|_2^2/n + 2\lambda \|\hat\beta\|_1 \le 2\varepsilon^T X(\hat\beta - \beta^0)/n + 2\lambda \|\beta^0\|_1 .
\]
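A short derivation (written here with the penalty weight $2\lambda$ to match the constants in the display; with the penalty $\lambda\|\beta\|_1$ as on the earlier slide, the identical steps give $\lambda$ in place of $2\lambda$): since $\hat\beta$ minimizes the Lasso objective,
\[
\|Y - X\hat\beta\|_2^2/n + 2\lambda \|\hat\beta\|_1 \le \|Y - X\beta^0\|_2^2/n + 2\lambda \|\beta^0\|_1 .
\]
Writing $Y = X\beta^0 + \varepsilon$ gives $\|Y - X\hat\beta\|_2^2 = \|X(\hat\beta - \beta^0)\|_2^2 - 2\varepsilon^T X(\hat\beta - \beta^0) + \|\varepsilon\|_2^2$ and $\|Y - X\beta^0\|_2^2 = \|\varepsilon\|_2^2$; substituting and rearranging yields the Basic Inequality.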
We introduce the set
\[
\mathcal{T} := \Big\{ \max_{1 \le j \le p} |\varepsilon^T X^{(j)}|/n \le \lambda_0 \Big\}.
\]
We assume that
\[
\lambda > \lambda_0 ,
\]
to make sure that on $\mathcal{T}$ we can get rid of the random part of the problem.
Let us denote the diagonal elements of the Gram matrix $\hat\Sigma := X^T X/n$ by
\[
\hat\sigma_j^2 := \hat\Sigma_{j,j}, \qquad j = 1, \dots, p .
\]
Lemma
Suppose that $\sigma^2 = \hat\sigma_j^2 = 1$ for all $j$. Then we have for all $t > 0$, and for
\[
\lambda_0 := \sqrt{\frac{2t + 2\log p}{n}},
\]
that
\[
P(\mathcal{T}) \ge 1 - 2\exp[-t].
\]
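A sketch of the argument behind the lemma, using the standard Gaussian tail bound $P(|Z| > a) \le 2\exp[-a^2/2]$ for $Z \sim N(0, 1)$ together with a union bound over the $p$ columns: under the stated normalization,
\[
\varepsilon^T X^{(j)}/n \sim N(0, \sigma^2 \hat\sigma_j^2/n) = N(0, 1/n),
\]
so that
\[
P\big( |\varepsilon^T X^{(j)}|/n > \lambda_0 \big) \le 2\exp[-n\lambda_0^2/2] = 2\exp[-(t + \log p)],
\qquad
P(\mathcal{T}^c) \le p \cdot 2\exp[-t - \log p] = 2\exp[-t].
\]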
Compatibility condition (noisy case). Let $L > 0$ be some constant. The compatibility constant is
\[
\phi^2_{\hat\Sigma}(L, S_0) := \phi^2(L, S_0) := \min\Big\{ s_0 \, \beta^T \hat\Sigma \beta : \|\beta_{S_0}\|_1 = 1,\ \|\beta_{S_0^c}\|_1 \le L \Big\}.
\]
We say that the $(L, S_0)$-compatibility condition is met if $\phi(L, S_0) > 0$.
Theorem
Suppose $\lambda > \lambda_0$ and that the compatibility condition holds for $S_0$, with
\[
L = \frac{\lambda + \lambda_0}{\lambda - \lambda_0} .
\]
Then on
\[
\mathcal{T} := \Big\{ \max_{1 \le j \le p} |\varepsilon^T X^{(j)}|/n \le \lambda_0 \Big\},
\]
we have
\[
\|X(\hat\beta - \beta^0)\|_2^2/n \le 4(\lambda + \lambda_0)^2 s_0 / \phi^2(L, S_0),
\]
\[
\|\hat\beta_{S_0} - \beta^0\|_1 \le 2(\lambda + \lambda_0) s_0 / \phi^2(L, S_0),
\]
and
\[
\|\hat\beta_{S_0^c}\|_1 \le 2L(\lambda + \lambda_0) s_0 / \phi^2(L, S_0).
\]
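To read off the rate: by the lemma above, taking $t \asymp \log p$ gives $\lambda_0 \asymp \sqrt{\log p/n}$, and choosing $\lambda$ of the same order the prediction bound becomes
\[
\|X(\hat\beta - \beta^0)\|_2^2/n \;\lesssim\; \frac{s_0 \log p}{n \, \phi^2(L, S_0)},
\]
i.e. the oracle rate $s_0/n$ of the first slides, up to the factor $\log p$ paid for not knowing $S_0$, and up to the compatibility constant.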
When does the compatibility condition hold?
[Diagram: implications between conditions, all leading to oracle inequalities for prediction and estimation. Nodes include: RIP, weak (S, 2s)-RIP, coherence, adaptive (S, 2s)-restricted regression, (S, 2s)-restricted eigenvalue, adaptive (S, s)-restricted regression, (S, s)-restricted eigenvalue, (S, 2s)-irrepresentable, weak (S, 2s)-irrepresentable, (S, s)-uniform irrepresentable, and S-compatibility; the side conditions |S_* \ S| ≤ s and |S_* \ S| = 0 label some of the arrows.]
If $\Sigma$ is non-singular, the compatibility condition holds, with $\phi^2(S_0) \ge \Lambda^2_{\min}$, the latter being the smallest eigenvalue of $\Sigma$.
Example
Consider the matrix
\[
\Sigma := (1 - \rho) I + \rho \iota \iota^T =
\begin{pmatrix}
1 & \rho & \cdots & \rho \\
\rho & 1 & & \rho \\
\vdots & & \ddots & \vdots \\
\rho & \rho & \cdots & 1
\end{pmatrix},
\]
with $0 < \rho < 1$, and $\iota := (1, \dots, 1)^T$ a vector of 1's. Then the smallest eigenvalue of $\Sigma$ is $\Lambda^2_{\min} = 1 - \rho$, so the compatibility condition holds with $\phi^2(S_0) \ge 1 - \rho$.
(The uniform $S_0$-irrepresentable condition is met as well.)
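A quick numerical check of this example (assuming NumPy; the values of $p$ and $\rho$ are arbitrary):

import numpy as np

p, rho = 6, 0.4
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
eigvals = np.linalg.eigvalsh(Sigma)
print(eigvals.min(), 1 - rho)             # smallest eigenvalue equals 1 - rho
print(eigvals.max(), 1 + (p - 1) * rho)   # largest eigenvalue equals 1 + (p - 1) * rho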
Geometric interpretation
Let $X_j \in \mathbb{R}^n$ denote the $j$-th column of $X$ ($j = 1, \dots, p$).
The set $A := \{ X\beta_S : \|\beta_S\|_1 = 1 \}$ is the convex hull of the vectors $\{\pm X_j\}_{j \in S}$ in $\mathbb{R}^n$. Likewise, the set $B := \{ X\beta_{S^c} : \|\beta_{S^c}\|_1 \le L \}$ is the convex hull, including its interior, of the vectors $\{\pm L X_j\}_{j \in S^c}$.
The $\ell_1$-eigenvalue $\delta(L, S)$ is the distance between these two sets.
[Figure: the sets $A$ and $B$, and the distance $\delta(L, S)$ between them.]
We note that:
- if $L$ is large, the $\ell_1$-eigenvalue will be small;
- it will also be small if the vectors in $S$ exhibit strong correlation with those in $S^c$;
- when the vectors in $\{X_j\}_{j \in S}$ are linearly dependent, it holds that $\{ X\beta_S : \|\beta_S\|_1 = 1 \} = \{ X\beta_S : \|\beta_S\|_1 \le 1 \}$, and hence $\delta(L, S) = 0$.
The difference between the compatibility constant and the squared $\ell_1$-eigenvalue lies only in the normalization by the size $|S|$ of the set $S$. This normalization is inspired by the orthogonal case, which we detail in the following example.
Example
Suppose that the columns of $X$ are all orthogonal: $X_j^T X_k = 0$ for all $j \neq k$. Then $\delta(L, S) = 1/\sqrt{|S|}$ and $\phi(L, S) = 1$.
Let $S_\beta := \{ j : \beta_j \neq 0 \}$. We call $|S_\beta|$ the sparsity index of $\beta$. More generally, we call $|S|$ the sparsity index of the set $S$.
Definition
For a set $S$ and constant $L > 0$, the effective sparsity $\Gamma^2(L, S)$ is the inverse of the squared $\ell_1$-eigenvalue, that is,
\[
\Gamma^2(L, S) = \frac{1}{\delta^2(L, S)} .
\]
Example
As a simple numerical example, let us suppose $n = 2$, $p = 3$, $S = \{3\}$, and
\[
X = \sqrt{n}
\begin{pmatrix}
5/13 & 0 & 1 \\
12/13 & 1 & 0
\end{pmatrix}.
\]
The $\ell_1$-eigenvalue $\delta(L, S)$ is equal to the distance of $X_3$ to the line that connects $L X_1$ and $-L X_2$, that is,
\[
\delta(L, S) = \max\{ (5 - L)/\sqrt{26},\ 0 \}.
\]
Hence, for example for $L = 3$ the effective sparsity is $\Gamma^2(3, S) = 13/2$.
Alternatively, when
\[
X = \sqrt{n}
\begin{pmatrix}
12/13 & 0 & 1 \\
5/13 & 1 & 0
\end{pmatrix},
\]
then for example $\delta(3, S) = 0$ and hence $\Gamma^2(3, S) = \infty$. This is due to the sharper angle between $X_1$ and $X_3$.
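A small numerical check of the first case (a sketch, assuming plain NumPy; following the geometric description above, $\delta(L, S)$ is computed as the distance between the sets $A$ and $B$ in the normalized norm $\|v\|_2/\sqrt{n}$, which makes the columns of $X$ unit vectors):

import numpy as np

n = 2
X = np.sqrt(n) * np.array([[5/13, 0, 1],
                           [12/13, 1, 0]])
L = 3.0

# Grid over the l1-ball {(b1, b2) : |b1| + |b2| <= L} of coefficients on S^c = {1, 2}.
grid = np.linspace(-L, L, 1201)
B1, B2 = np.meshgrid(grid, grid)
inside = np.abs(B1) + np.abs(B2) <= L

def dist_to_B(a):
    # Distance (in the norm ||v||_2 / sqrt(n)) from the point a to the set B.
    d0 = a[0] - B1 * X[0, 0] - B2 * X[0, 1]
    d1 = a[1] - B1 * X[1, 0] - B2 * X[1, 1]
    return np.sqrt((d0 ** 2 + d1 ** 2) / n)[inside].min()

# For |S| = 1 the set A is just the two points +X_3 and -X_3.
delta = min(dist_to_B(s * X[:, 2]) for s in (+1.0, -1.0))
print(delta, 2 / np.sqrt(26))   # both approximately 0.392
print(1 / delta ** 2)           # effective sparsity, approximately 13/2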
The compatibility condition is slightly weaker than the restricted
eigenvalue condition of Bickel et al. [2009].
The restricted isometry property of Candes [2005] implies the
restricted eigenvalue condition.
Approximating the Gram matrix
For two (positive semi-definite) matrices $\Sigma_0$ and $\Sigma_1$, we define the supremum distance
\[
\|\Sigma_1 - \Sigma_0\|_\infty := \max_{j,k} |(\Sigma_1)_{j,k} - (\Sigma_0)_{j,k}| .
\]
Lemma
Assume
\[
\|\Sigma_1 - \Sigma_0\|_\infty \le \tilde\lambda .
\]
Then for all $\beta$ with $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$,
\[
\bigg| \frac{\|f_\beta\|_{\Sigma_1}^2}{\|f_\beta\|_{\Sigma_0}^2} - 1 \bigg| \le \frac{16\tilde\lambda s_0}{\phi^2_{\rm compatible}(\Sigma_0, S_0)} .
\]
Corollary
We have
\[
\phi_{\Sigma_1}(3, S_0) \ge \phi_{\Sigma_0}(3, S_0) - 4\sqrt{\|\Sigma_0 - \Sigma_1\|_\infty\, s_0} .
\]
Example
Suppose we have a Gaussian random matrix
\[
\hat\Sigma := X^T X/n = (\hat\sigma_{j,k}),
\]
where $X = (X_{i,j})$ is an $n \times p$ matrix with i.i.d. $N(0, 1)$-distributed entries in each column. For all $t > 0$, and for
\[
\tilde\lambda(t) := \sqrt{\frac{4t + 8\log p}{n}} + \frac{4t + 8\log p}{n},
\]
one has the inequality
\[
P\big( \|\hat\Sigma - \Sigma\|_\infty \ge \tilde\lambda(t) \big) \le 2\exp[-t].
\]
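A small simulation illustrating the order of $\|\hat\Sigma - \Sigma\|_\infty$ (a sketch; here $\Sigma = I$ since the entries are taken i.i.d. standard normal):

import numpy as np

rng = np.random.default_rng(1)
n, p, t = 500, 100, 1.0
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n

a = (4 * t + 8 * np.log(p)) / n
lam_tilde = np.sqrt(a) + a
print(np.abs(Sigma_hat - np.eye(p)).max(), lam_tilde)   # the observed sup-distance is typically below lam_tilde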
Example (continued)
Hence, we know for example that with probability at least $1 - 2\exp[-t]$,
\[
\phi_{\rm compatible}(\hat\Sigma, S_0) \ge \Lambda_{\min}(\Sigma) - 4\sqrt{\tilde\lambda(t)\, s_0}.
\]
This leads to a bound on the sparsity of the form
\[
s_0 = o(1/\tilde\lambda(t)),
\]
which roughly says that $s_0$ should be of small order $\sqrt{n/\log p}$.
Definition. We call a random variable $X$ sub-Gaussian if for some constants $K$ and $\sigma_0^2$,
\[
\mathbb{E}\exp[X^2/K^2] \le \sigma_0^2 .
\]
Theorem
Suppose $X_1, \dots, X_n$ are uniformly sub-Gaussian with constants $K$ and $\sigma_0^2$. Then for a constant $\eta = \eta(K, \sigma_0^2)$, it holds that
\[
\beta^T \hat\Sigma \beta \ge \frac{1}{3}\beta^T \Sigma \beta - \frac{t + \log p}{n}\,\|\beta\|_1^2/\eta^2 ,
\]
with probability at least $1 - 2\exp[-t]$.
See Raskutti, Wainwright and Yu [2010].
General convex loss
Consider data $\{Z_i\}_{i=1}^n$, with $Z_i$ in some space $\mathcal{Z}$.
Consider a linear space $\mathcal{F} := \{ f_\beta(\cdot) = \sum_{j=1}^p \beta_j \psi_j(\cdot) : \beta \in \mathbb{R}^p \}$.
For each $f \in \mathcal{F}$, let $\rho_f : \mathcal{Z} \to \mathbb{R}$ be a loss function. We assume that the map $f \mapsto \rho_f(z)$ is convex for all $z \in \mathcal{Z}$.
For example, $Z_i = (X_i, Y_i)$, and $\rho$ is quadratic loss
\[
\rho_f(\cdot, y) = (y - f(\cdot))^2 ,
\]
or logistic loss
\[
\rho_f(\cdot, y) = -y f(\cdot) + \log(1 + \exp[f(\cdot)]),
\]
or minus log-likelihood loss
\[
\rho_f = -f + \log \int \exp[f],
\]
etc.
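For instance, the $\ell_1$-penalized logistic-loss estimator can be computed with scikit-learn (a sketch with simulated data; an assumption here is sklearn's parametrization, which penalizes $\|\beta\|_1$ and multiplies the summed losses by $C$, so $C$ plays roughly the role of $1/(n\lambda)$):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, s0 = 200, 50, 3
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = 1.5
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta0)))   # logistic model with sparse beta0

lam = np.sqrt(np.log(p) / n)                        # illustrative choice
clf = LogisticRegression(penalty="l1", C=1 / (n * lam), solver="liblinear",
                         fit_intercept=False).fit(X, y)
print("estimated active set:", np.flatnonzero(clf.coef_.ravel() != 0))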
We denote, for a function $\rho : \mathcal{Z} \to \mathbb{R}$, the empirical average by
\[
P_n \rho := \sum_{i=1}^{n} \rho(Z_i)/n,
\]
and the theoretical mean by
\[
P\rho := \sum_{i=1}^{n} \mathbb{E}\rho(Z_i)/n .
\]
The Lasso is
\[
\hat\beta = \arg\min_{\beta} \Big\{ P_n \rho_{f_\beta} + \lambda \|\beta\|_1 \Big\}. \tag{1}
\]
We write $\hat f = f_{\hat\beta}$.
We furthermore define the target as the minimizer of the theoretical risk,
\[
f^0 := \arg\min_{f \in \mathcal{F}} P\rho_f .
\]
The excess risk is
\[
\mathcal{E}(f) := P(\rho_f - \rho_{f^0}).
\]
Note that by definition, $\mathcal{E}(f) \ge 0$ for all $f \in \mathcal{F}$.
We will mainly examine the excess risk $\mathcal{E}(\hat f)$ of the Lasso.
Definition. We say that the margin condition holds with strictly convex function $G$, if
\[
\mathcal{E}(f) \ge G(\|f - f^0\|).
\]
In typical cases, the margin condition holds with quadratic function $G$, that is, $G(u) = c u^2$, $u \ge 0$, where $c$ is a positive constant.
Definition. Let $G$ be a strictly convex function on $[0, \infty)$. Its convex conjugate $H$ is defined as
\[
H(v) = \sup_{u} \{ uv - G(u) \}, \qquad v \ge 0 .
\]
[Figure: the curves $G(u)$, $uv$, and $uv - G(u)$ as functions of $u$; the conjugate value $H(v)$ is the maximal gap $uv - G(u)$.]
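As a worked instance of the conjugate (used for the quadratic margin below): for $G(u) = cu^2$ the supremum is attained at $u = v/(2c)$, so
\[
H(v) = \sup_u \{ uv - cu^2 \} = \frac{v^2}{2c} - \frac{v^2}{4c} = \frac{v^2}{4c};
\]
in particular $G(u) = u^2$ gives $H(v) = v^2/4$.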
Set
\[
Z_M := \sup_{\|\beta - \beta^0\|_1 \le M} |(P_n - P)(\rho_{f_\beta} - \rho_{f_{\beta^0}})| , \tag{2}
\]
and
\[
M_0 := H\bigg( \frac{4\lambda\sqrt{s_0}}{\phi(S_0)} \bigg) \Big/ \lambda_0 , \tag{3}
\]
where $\phi(S_0) = \phi_{\rm compatible}(S_0)$.
Set
\[
\mathcal{T} := \{ Z_{M_0} \le \lambda_0 M_0 \}. \tag{4}
\]
Theorem (Oracle inequality for the Lasso)
Assume the compatibility condition and the margin condition with strictly convex function $G$. Take $\lambda \ge 8\lambda_0$. Then on the set $\mathcal{T}$ given in (4), we have
\[
\mathcal{E}(\hat f) + \lambda \|\hat\beta - \beta^0\|_1 \le 4 H\bigg( \frac{4\lambda\sqrt{s_0}}{\phi(S_0)} \bigg).
\]
Corollary
Assume quadratic margin behavior, i.e., $G(u) = u^2$. Then $H(v) = v^2/4$, and we obtain on $\mathcal{T}$,
\[
\mathcal{E}(\hat f) + \lambda \|\hat\beta - \beta^0\|_1 \le \frac{4\lambda^2 s_0}{\phi^2(S_0)} .
\]
$\ell_2$-rates
To derive rates for $\|\hat\beta - \beta^0\|_2$, we need a stronger compatibility condition.
Definition
We say that the $(S_0, 2s_0)$-restricted eigenvalue condition is satisfied, with constant $\phi = \phi(S_0, 2s_0) > 0$, if for all $\mathcal{N} \supset S_0$ with $|\mathcal{N}| = 2s_0$, and all $\beta \in \mathbb{R}^p$ that satisfy $\|\beta_{S_0^c}\|_1 \le 3\|\beta_{S_0}\|_1$ and $|\beta_j| \le \|\beta_{\mathcal{N} \setminus S_0}\|_\infty$ for all $j \notin \mathcal{N}$, it holds that
\[
\|\beta_{\mathcal{N}}\|_2 \le \|f_\beta\|/\phi .
\]
Lemma
Suppose the conditions of the previous theorem are met, but now with the stronger $(S_0, 2s_0)$-restricted eigenvalue condition. On $\mathcal{T}$,
\[
\|\hat\beta - \beta^0\|_2^2 \le 16\, H\bigg( \frac{4\lambda\sqrt{s_0}}{\phi} \bigg)^{\!2} \Big/ (\lambda^2 s_0) + \frac{\lambda^2 s_0}{4\phi^4} .
\]
In the case of quadratic margin behavior, with $G(u) = u^2$, we then get on $\mathcal{T}$,
\[
\|\hat\beta - \beta^0\|_2^2 \le \frac{16\lambda^2 s_0}{\phi^4} .
\]
Theory for $\ell_1/\ell_2$-penalties
Group Lasso
\[
Y_i = \sum_{j=1}^{p} \sum_{t=1}^{T} X_{i,t}^{(j)} \beta_{j,t}^0 + \varepsilon_i, \qquad i = 1, \dots, n,
\]
where the $\beta_j^0 := (\beta_{j,1}^0, \dots, \beta_{j,T}^0)^T$ have the sparsity property $\beta_j^0 \equiv 0$ for “most” $j$.
$\ell_1/\ell_2$-penalty:
\[
\|\beta\|_{2,1} := \sum_{j=1}^{p} \|\beta_j\|_2 .
\]
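A minimal computational sketch (plain NumPy; illustrative only, not taken from the slides): the proximal operator of $\lambda \sum_j \|\beta_j\|_2$ is group-wise soft-thresholding of the $\ell_2$ norms, so the group Lasso can be computed by proximal gradient descent. Here groups is a list of index arrays, one block of $T$ coefficients per original variable, and any factor $\sqrt{T}$ in the penalty is assumed to be absorbed into lam.

import numpy as np

def group_soft_threshold(b, groups, thresh):
    # Proximal operator of thresh * sum_j ||b_j||_2:
    # shrink each group's l2 norm by thresh; set the group to zero if its norm is below thresh.
    out = b.copy()
    for g in groups:
        norm = np.linalg.norm(b[g])
        out[g] = 0.0 if norm <= thresh else (1 - thresh / norm) * b[g]
    return out

def group_lasso(X, y, groups, lam, n_iter=500):
    # Proximal gradient (ISTA) for ||y - X b||_2^2 / n + lam * sum_j ||b_j||_2.
    n, p = X.shape
    step = n / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ b - y) / n
        b = group_soft_threshold(b - step * grad, groups, step * lam)
    return b

# Example of the grouping: p0 variables, each with a block of T coefficients:
# groups = [np.arange(j * T, (j + 1) * T) for j in range(p0)]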
Multivariate linear model
\[
Y_{i,t} = \sum_{j=1}^{p} X_{i,t}^{(j)} \beta_{j,t}^0 + \varepsilon_{i,t}, \qquad i = 1, \dots, n,\ t = 1, \dots, T,
\]
with, for $\beta_j^0 := (\beta_{j,1}^0, \dots, \beta_{j,T}^0)^T$, the sparsity property $\beta_j^0 \equiv 0$ for “most” $j$.
Linear model with time-varying coefficients
\[
Y_i(t) = \sum_{j=1}^{p} X_i^{(j)}(t) \beta_j^0(t) + \varepsilon_i(t), \qquad i = 1, \dots, n,\ t = 1, \dots, T,
\]
where the coefficients $\beta_j^0(\cdot)$ are “smooth” functions, with the sparsity property that “most” of the $\beta_j^0 \equiv 0$.
High-dimensional additive model
\[
Y_i = \sum_{j=1}^{p} f_j^0(X_i^{(j)}) + \varepsilon_i, \qquad i = 1, \dots, n,
\]
where the $f_j^0$ are (non-parametric) “smooth” functions, with the sparsity property $f_j^0 \equiv 0$ for “most” $j$.
Theorem
Consider the group Lasso
\[
\hat\beta = \arg\min_{\beta} \Big\{ \|Y - X\beta\|_2^2/n + \lambda\sqrt{T}\,\|\beta\|_{2,1} \Big\},
\]
where $\lambda \ge 4\lambda_0$, with
\[
\lambda_0 = \frac{2}{\sqrt{n}} \sqrt{ 1 + \sqrt{\frac{4x + 4\log p}{T}} + \frac{4x + 4\log p}{T} } .
\]
Then with probability at least $1 - \exp[-x]$, we have
\[
\|X\hat\beta - f^0\|_2^2/n + \lambda\sqrt{T}\,\|\hat\beta - \beta^0\|_{2,1} \le \frac{24\lambda^2 T s_0}{\phi_0^2} .
\]
Theorem
Consider the smoothed group Lasso
\[
\hat\beta := \arg\min_{\beta} \Big\{ \|Y - X\beta\|_2^2/n + \lambda\|\beta\|_{2,1} + \lambda^2 \|B\beta\|_{2,1} \Big\},
\]
where $\lambda \ge 4\lambda_0$. Then on
\[
\mathcal{T} := \big\{ 2|\varepsilon^T X\beta|/n \le \lambda_0\|\beta\|_{2,1} + \lambda_0^2\|B\beta\|_{2,1} \big\},
\]
we have
\[
\|\hat f - f^0\|_n^2 + \lambda\,\mathrm{pen}(\hat\beta - \beta^0)/2 \le 3\big( 16\lambda^2 s_0/\phi_0^2 + 2\lambda^2 \|B\beta^0\|_{2,1} \big).
\]
etc. . . .