CHAPTER 13
Model Selection, Penalization and Oracle Inequalities
Our ultimate goal is to obtain sharper bounds on rates of convergence: in fact, exactly optimal rates, rather than rates carrying spurious logarithmic terms.
However, the tools introduced along the way are perhaps of independent interest. These include model selection via penalized least squares, where the penalty function is not $\ell_2$ or even $\ell_1$ but instead a function of the number of terms in the model. We will call such penalties complexity penalties.
Many of the arguments work for general (i.e. non-orthogonal) linear models.
While we will not ultimately use this extra generality in this book, there are important applications (E.A.D.) and the model is of such importance that it seems
reasonable to present part of the theory in this setting.
While it is natural to start with penalties proportional to the number of terms in the model, it will turn out that our later results on exact rates require a larger class of “2k log(p/k)” penalties, in which, roughly speaking, the penalty for entering the kth variable decreases with k approximately like 2 log(p/k).
We will be looking essentially at “all subsets” versions of the model selection problem. If there are p variables, then there are $\binom{p}{k}$ distinct submodels with k variables, and this count grows very quickly with k. In order to control the resulting model explosion, good exponential probability inequalities for the tails of chi-squared distributions are needed. We will derive these as a consequence of a powerful concentration inequality for Gaussian measures on $\mathbb{R}^n$. We give a separate exposition of this result, as it is finding increasing application in statistics.
13.1. A Gaussian concentration inequality
The Lipschitz norm of a function $f: \mathbb{R}^n \to \mathbb{R}$ is
$$\|f\|_{\mathrm{Lip}} = \sup_{x \ne y} \frac{|f(x) - f(y)|}{\|x - y\|},$$
where $\|x\|$ is the usual Euclidean norm on $\mathbb{R}^n$.

Proposition 13.1. If $Z \sim N_n(0, I)$ and $f: \mathbb{R}^n \to \mathbb{R}$ is Lipschitz, then
(13.1)   $P\{ f(Z) \ge E f(Z) + t \} \le e^{-t^2/(2\|f\|_{\mathrm{Lip}}^2)}.$
Thus the tails of the distribution of a Lipschitz function of a Gaussian vector are sub-Gaussian. Some statistically relevant examples of Lipschitz functions include
(i) order statistics. If $z_{(1)} \ge z_{(2)} \ge \cdots \ge z_{(n)}$ denote the order statistics of a data vector z, then $f(z) = z_{(k)}$ has Lipschitz constant $\|f\|_{\mathrm{Lip}} = 1$.
(ii) ordered eigenvalues of symmetric matrices. Let A be an $n \times n$ symmetric matrix with eigenvalues $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$. If E is also symmetric, then (e.g. Golub & Van Loan 1996, pp. 56 and 396)
$$|\lambda_k(A + E) - \lambda_k(A)| \le \|E\|_F,$$
where $\|E\|_F^2 = \sum_{i,j} e_{i,j}^2$ denotes the square of the Frobenius norm, which is the Euclidean norm on $n \times n$ matrices.
(iii) orthogonal projections. If S is a linear subspace of $\mathbb{R}^n$, then $f(z) = \|P_S z\|$ has Lipschitz constant 1. This is the example we use below. If $\dim S = k$, then $\|P_S z\|^2 \overset{\mathcal{D}}{=} \chi^2_{(k)}$, and so
$$E\|P_S z\| \le \{E\|P_S z\|^2\}^{1/2} = \sqrt{k},$$
so that the inequality implies
$$P\{ \|P_S z\| \ge \sqrt{k} + t \} \le e^{-t^2/2}.$$
Note that the dimension n plays a very weak role in the inequality, which is sometimes said to be “infinite-dimensional”. The phrase “concentration of measure” refers, at least in part, to the fact that the distribution of a Lipschitz(1) function of n variables is concentrated about its mean, in the sense that its tails are no heavier than those of a univariate standard Gaussian, regardless of the value of n!¹
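To see the inequality in action, here is a short Python sketch that estimates the tail probability $P\{\|P_S z\| \ge \sqrt{k} + t\}$ by simulation and compares it with the bound $e^{-t^2/2}$; the dimensions, the tail level t, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, t = 500, 20, 2.0          # illustrative dimension, subspace dimension, tail level
reps = 20000

# Orthogonal projection onto a k-dimensional subspace: use the first k coordinates,
# which is no loss of generality by rotational invariance of the Gaussian.
z = rng.standard_normal((reps, n))
proj_norm = np.linalg.norm(z[:, :k], axis=1)   # ||P_S z|| has a chi_(k) distribution

empirical = np.mean(proj_norm >= np.sqrt(k) + t)
bound = np.exp(-t**2 / 2)
print(f"empirical tail: {empirical:.4f}, concentration bound exp(-t^2/2): {bound:.4f}")
```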
13.2. All subsets regression and complexity penalized least squares
We begin with the usual form of the general linear model with Gaussian errors:
(13.2)   $y = X\beta + \epsilon z = \mu + \epsilon z, \qquad z \sim N_n(0, I).$
There are n observations y and p unknown parameters β, connected by an $n \times p$ design matrix X with columns
$$X = [x_1, \ldots, x_p].$$
There is no restriction on p: indeed, we particularly wish to allow for situations in
which p ≫ n. We will assume that the noise level ǫ is known.
Example: Overcomplete dictionaries. Here is a brief indication of why one might wish to take p ≫ n. Consider estimation of f in the continuous Gaussian white noise model (REF) $dY(t) = f(t)\,dt + \epsilon\, dW(t)$, and suppose that the observed data are inner products of Y with n orthonormal functions $\psi_1, \ldots, \psi_n$. Thus
$$y_i = \langle f, \psi_i \rangle + \epsilon z_i, \qquad i = 1, \ldots, n.$$
Now consider the possibility of approximating f by elements from a dictionary D =
{φ1 , φ2 , . . . , φp }. The hope is that by making D sufficiently rich, one might be able to
represent f well by a linear combination of a very few elements of D. This idea has been
advanced by a number of authors ADD REFERENCES. As a simple illustration, the ψi might
be sinusoids at the first n frequencies, while the dictionary elements might allow a much
finer sampling of frequencies:
$$\phi_k(t) = \sin(2\pi k t / p), \qquad k = 1, \ldots, p = n\beta \gg n,$$
for some β > 1. If there is a single dominant frequency in the data, it is possible that it will be essentially captured by an element of the dictionary even if it does not complete an integer number of cycles in the sampling interval.
¹In fact, sharper bounds for the tails of $\chi^2$ random variables are available (Laurent & Massart (1998), Johnstone (2001), Birgé & Massart (2001), [CHECK!]), but this bound will suffice for our purposes and serves as an illustration of the power of the general inequality (13.1).
If we suppose that f has the form $f = \sum_{j=1}^p \beta_j \phi_j$, then these observation equations become an instance of the general linear model (13.2) with
$$X_{ij} = \langle \psi_i, \phi_j \rangle.$$
Again, the hope is that one can find an estimate $\hat\beta$ for which only a small number of components $\hat\beta_j \ne 0$.
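To fix ideas, here is a small Python sketch that assembles such a design matrix of inner products by numerical quadrature. The choices made below are illustrative assumptions rather than part of the text: t is taken in [0, 1], the $\psi_i$ are normalized by $\sqrt{2}$, and the dictionary frequencies are read as $k/\beta$, so that most dictionary elements do not complete an integer number of cycles.

```python
import numpy as np

def design_from_dictionary(n, beta=4, grid=4096):
    """Illustrative X_ij = <psi_i, phi_j>, with psi_i(t) = sqrt(2) sin(2*pi*i*t)
    (the first n frequencies on [0,1]) and a dictionary of sinusoids whose
    frequencies k/beta, k = 1..p = n*beta, sample the same range beta times
    more finely.  Inner products are approximated by Riemann sums."""
    p = n * beta
    t = (np.arange(grid) + 0.5) / grid                                        # midpoints of [0,1]
    psi = np.sqrt(2) * np.sin(2 * np.pi * np.outer(np.arange(1, n + 1), t))   # n x grid
    phi = np.sin(2 * np.pi * np.outer(np.arange(1, p + 1) / beta, t))         # p x grid
    return psi @ phi.T / grid                                                 # n x p matrix

X = design_from_dictionary(n=16)
print(X.shape)   # (16, 64): many more columns than rows, i.e. p >> n
```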
All subsets regression. To each subset J ⊂ {1, . . . , p} of cardinality nJ = |J|
corresponds a regression model which fits only the variables xj for j ∈ J. The
possible fitted vectors µ that could arise from these variables lie in the model space
SJ = span{xj : j ∈ J}.
The dimension of SJ is at most nJ , and could be less in the case of collinearity.
Let $P_J$ denote orthogonal projection onto $S_J$; the least squares estimator $\hat\mu_J$ of µ is then given by $\hat\mu_J = P_J y$. The issue in all-subsets regression is how to select a subset $\hat J$ on the basis of the data y; the resulting estimate of µ is then $\hat\mu = P_{\hat J}\, y$.
Mean squared error properties can be used to motivate all-subsets regression. We will use a predictive risk² criterion to judge an estimator $\hat\beta$ through the fit $\hat\mu = X\hat\beta$ that it generates:
$$E\|X\hat\beta - X\beta\|^2 = E\|\hat\mu - \mu\|^2.$$
The mean of a projection estimator $\hat\mu_J$ is just the projection of µ, namely $E\hat\mu_J = P_J\mu$, while its variance is $\epsilon^2 \operatorname{tr} P_J = \epsilon^2 \dim S_J$. From the variance-bias decomposition of MSE,
$$E\|\hat\mu_J - \mu\|^2 = \|P_J\mu - \mu\|^2 + \epsilon^2 \dim S_J.$$
A saturated model arises from any subset with $\dim S_J = n$, so that $\hat\mu_J = y$ “interpolates the data”. In this case the MSE is just the unrestricted minimax risk for $\mathbb{R}^n$:
$$E\|\hat\mu - \mu\|^2 = n\epsilon^2.$$
Comparing the last two displays, we see that if µ lies close to a low-rank subspace, say $\mu \doteq \sum_{j \in J} \beta_j x_j$ with |J| small, then $\hat\mu_J$ offers substantial risk savings over a saturated model. Thus, it seems that one would wish to expand the dictionary D as much as possible to increase the possibilities for sparse representation. Against this must be set the dangers inherent in fitting over-parametrized models, principally overfitting of the data. Penalized least squares estimators are designed specifically to address this tradeoff.
This discussion leads to a natural generalization of the notion of ideal risk introduced in Chapter ???. For each mean vector µ, there will be an optimal model subset J = J(µ) which attains the ideal risk
$$R(\mu, \epsilon) = \min_J \; \|\mu - P_J\mu\|^2 + \epsilon^2 \dim S_J.$$
²Why the name “predictive risk”? Imagine that new data will be taken from the same design as was used to generate the original observations y and estimator $\hat\beta$: $y^* = X\beta + \epsilon z^*$. A natural prediction of $y^*$ is $X\hat\beta$, and its mean squared error, averaging over the distributions of both z and $z^*$, is
$$E\|y^* - X\hat\beta\|^2 = E\|X\beta - X\hat\beta\|^2 + n\epsilon^2,$$
so that the mean squared error of prediction equals $E\|\hat\mu - \mu\|^2$, up to an additive term that does not depend on the model chosen.
Of course, this choice J(µ) is not available to the statistician, since µ is unknown. The challenge, taken up below, is to see to what extent penalized least squares estimators can “mimic” the ideal risk, in a fashion analogous to the mimicking achieved by threshold estimators in the orthogonal setting.
Complexity penalized least squares. The residual sum of squares (RSS) after fitting model J is
$$\|y - \hat\mu_J\|^2 = \|y - P_J y\|^2,$$
which clearly decreases as the model J increases. To discourage simply using a saturated model, we introduce a penalty on the size of the model, pen(n_J), and define a complexity criterion
(13.3)   $C(J, y) = \|y - \hat\mu_J\|^2 + \epsilon^2 \mathrm{pen}(n_J).$
The complexity penalized RSS estimate $\hat\mu_{\mathrm{pen}}$ is then given by orthogonal projection onto the model space of the subset that minimizes the penalized criterion:
(13.4)   $\hat J_{\mathrm{pen}} = \operatorname{argmin}_J C(J, y), \qquad \hat\mu_{\mathrm{pen}} = P_{\hat J_{\mathrm{pen}}}\, y.$
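For small p, the definition (13.3)-(13.4) can be carried out literally by enumerating all subsets. The following Python sketch does exactly that for a generic design matrix and penalty function; it is exponential in p and intended only as an illustration, and the penalty used in the example, $\mathrm{pen}_0(k) = 2k\log p$, is one arbitrary choice of order 2 log p per variable.

```python
import numpy as np
from itertools import combinations

def complexity_penalized_ls(y, X, eps, pen):
    """Minimize C(J, y) = ||y - P_J y||^2 + eps^2 * pen(|J|) over all subsets J
    of columns of X; return the best subset and the fit P_J y.
    Brute force: only feasible for small p."""
    n, p = X.shape
    best_C, best_fit, best_J = np.sum(y**2) + eps**2 * pen(0), np.zeros(n), ()
    for k in range(1, p + 1):
        for J in combinations(range(p), k):
            XJ = X[:, J]
            fit = XJ @ np.linalg.lstsq(XJ, y, rcond=None)[0]   # mu_hat_J = P_J y
            C = np.sum((y - fit)**2) + eps**2 * pen(k)
            if C < best_C:
                best_C, best_fit, best_J = C, fit, J
    return best_J, best_fit

# Illustrative example with a sparse truth
rng = np.random.default_rng(1)
n, p, eps = 30, 8, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:2] = 5.0
y = X @ beta + eps * rng.standard_normal(n)
J_hat, mu_hat = complexity_penalized_ls(y, X, eps, pen=lambda k: 2 * np.log(p) * k)
print("selected subset:", J_hat)
```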
The simplest penalty function is simply proportional to the number of variables in the model:
(13.5)   $\mathrm{pen}_0(k) = \lambda_p^2\, k,$
where we will take $\lambda_p^2$ to be roughly of order 2 log p. [The well-known AIC criterion would set $\lambda_p^2 = 2$: this is effective for selection among a nested sequence of models, but is known to overfit in all-subsets settings. [mention BIC $\lambda_p^2 = \log p$?]]
For this particular case, we describe the kind of oracle inequality to be proved in this chapter. First, note that for pen₀(k), minimal complexity and ideal risk are related:
$$\min_J C(J, \mu) = \min_J \big[\, \|\mu - P_J\mu\|^2 + \epsilon^2 \mathrm{pen}_0(n_J)\,\big] \le \lambda_p^2 \min_J \big[\, \|\mu - P_J\mu\|^2 + \epsilon^2 n_J\,\big] = \lambda_p^2 R(\mu, \epsilon).$$
Let $\lambda_p = \zeta(1 + \sqrt{2\log p})$ for ζ > 1 and $A(\zeta) = (1 - \zeta^{-1})^{-1}$. Then for the penalty function (13.5) and arbitrary µ,
$$E\|\hat\mu_{\mathrm{pen}} - \mu\|^2 \le A(\zeta)\,\lambda_p^2\,\big[\, C\epsilon^2 + R(\mu, \epsilon)\,\big].$$
Thus, the complexity penalized RSS estimator, for non-orthogonal and possibly
over-complete dictionaries, comes within a factor of order 2 log p of the ideal risk.
Remark. Another possibility is to use penalty functions that depend on the rank of the model, $\mathrm{pen}(\dim S_J)$. However, when $k \mapsto \mathrm{pen}(k)$ is strictly monotone, this yields the same models as minimizing (13.3), since a collinear model will always be rejected in favor of a sub-model with the same span.
13.3. Oracle inequalities for 2k log(p/k) penalties
Class P. Assume that $k \mapsto \mathrm{pen}(k)$ is strictly increasing and has the specific form
$$\mathrm{pen}(k) = \zeta^2 k\big(1 + \sqrt{2L_{p,k}}\big)^2 \qquad (\zeta > 1),$$
with $k \mapsto L_{p,k}$ non-increasing and
$$L_{p,k} \ge \log(p/k) + \gamma_k$$
for some sequence $\gamma_k$ not depending on p and satisfying $\gamma_k \ge \gamma > 1$ for all large k.
The class P is broad enough to include penalties proportional to model size, cf. (13.5), with $L_{p,k}$ constant in k (and at least log p). However, the possibility that $k \mapsto L_{p,k}$ be strictly decreasing is critical for the later application of the oracle inequalities to derive exact rates of convergence. Since the dominant terms in the two preceding displays are 2k log(p/k), we will loosely refer to P as a class of “2k log(p/k) penalties”.
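For concreteness, here is a minimal Python helper evaluating a penalty of this kind, under the illustrative choice $L_{p,k} = \log(p/k) + \gamma$ with γ > 1 and ζ > 1; the particular constants below are arbitrary.

```python
import numpy as np

def pen_2klogpk(k, p, zeta=1.5, gamma=1.5):
    """pen(k) = zeta^2 * k * (1 + sqrt(2 L_{p,k}))^2 with L_{p,k} = log(p/k) + gamma;
    the penalty for the empty model (k = 0) is taken to be 0."""
    k = np.asarray(k, dtype=float)
    L = np.log(p / np.maximum(k, 1)) + gamma
    return np.where(k > 0, zeta**2 * k * (1 + np.sqrt(2 * L))**2, 0.0)

p = 1000
print(pen_2klogpk(np.arange(0, 6), p))   # dominant term grows like 2 k log(p/k)
```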
Theorem 13.2. Let $\hat\mu$ be a penalized least squares estimate of the form (13.3)-(13.4) for a penalty of class P defined above. Then, for all µ,
$$E\|\hat\mu - \mu\|^2 \le (1 - \zeta^{-1})^{-1}\big[\, \zeta C_\gamma L_{p,1}\epsilon^2 + \min_J C(J, \mu)\,\big].$$
We will see that $\min_J C(J, \mu)$ is an extension of the notion of ideal risk, and describes the “intrinsic accuracy” of approximation of µ by the models $\{S_J : J \subset \{1, \ldots, p\}\}$.
Our key examples will have the form $L_{p,k} \approx \log(p/k) + \gamma$ for γ > 1. For example, in the orthogonal case, the False Discovery Rate thresholds satisfy
$$t_{n,k} \sim \sqrt{2\log(2n/(kq))},$$
corresponding roughly to $L_{p,k} = \log(n/k) + \gamma$ with $\gamma = \log(2/q) > 1$ for q < 2/e.
“Threshold” interpretation. In the orthogonal setting (n = p), suppose that
$$\mathrm{pen}(k) = \sum_{j=1}^k t_{n,j}^2,$$
so that $\hat\mu_{\mathrm{pen}}$ becomes thresholding:
$$\hat\mu_{\mathrm{pen},j} = \begin{cases} y_j & |y_j| \ge \epsilon t_{n,\hat k}, \\ 0 & \text{otherwise}, \end{cases}$$
where
$$\hat k = \operatorname{argmin}_{0 \le k \le n} \; \sum_{j=k+1}^n y_{(j)}^2 + \epsilon^2 \sum_{j=1}^k t_{n,j}^2.$$
Typically $\sum_{j=1}^k t_{n,j}^2 \sim k\, t_{n,k}^2$, at least for k = o(n), and so the thresholds covered by the Theorem have the form
$$t_{n,k} \approx \zeta\big(1 + \sqrt{2L_{p,k}}\big) \ge \zeta\big(1 + \sqrt{2\log(p/k) + 2\gamma_k}\big).$$
That ζ > 1 (rather than ζ = 1) and the extra “1” are needed for the proof, but they do make the thresholds somewhat larger than is desirable in practice.
The idea of using penalties of the general form 2ǫ²k log(n/k) arose among several authors more or less simultaneously:
• Foster & Stine (1997): $\mathrm{pen}(k) = \epsilon^2 \sum_{j=1}^k 2\log(n/j)$, via information theory.
• George & Foster (2000): an empirical Bayes approach [$\mu_i \overset{\text{i.i.d.}}{\sim} (1 - w)\delta_0 + w\,N(0, C)$, followed by estimation of (w, C)]. They argue that this approach penalizes the kth variable by about $2\epsilon^2\log\big((n+1)/k - 1\big)$.
• The covariance inflation criterion of Tibshirani & Knight (1999) in the orthogonal case leads to $\mathrm{pen}(k) = 2\epsilon^2 \sum_{j=1}^k 2\log(n/j)$.
• FDR, discussed above (?).
• Birgé & Massart (2001) contains a systematic study of complexity penalized model selection from the specific viewpoint of obtaining non-asymptotic bounds, using a penalty class similar to, but more general than, the one used here.
13.4. Proof of the oracle inequality
Penalized complexity and complexity functionals. The definition of the complexity penalized RSS estimator uses the minimum over submodels J of
$$C(J, y) = \|y - P_J y\|^2 + \epsilon^2\mathrm{pen}(n_J).$$
It will be helpful to introduce a related complexity functional K(µ, y), defined for all $\mu \in \mathbb{R}^n$. First, a definition. Given µ, the minimal dimension of a model containing µ is
$$N(\mu) = \inf\{\, n_J : \mu \in S_J \,\}.$$
Now define
$$K(\mu, y) = \|y - \mu\|^2 + \epsilon^2\mathrm{pen}(N(\mu)).$$
The criteria C and K yield the same estimators:
Lemma 13.3. Suppose that pen(k) is strictly increasing in k. Given data y, let $\hat J$ be a minimizer of C(J, y) and set $\hat\mu = P_{\hat J}\, y$. Then $\hat\mu$ minimizes K(µ, y) and
$$\inf_\mu K(\mu, y) = \min_J C(J, y).$$
Proof. We chase definitions resolutely. First, fix µ and let $J(\mu) = \operatorname{argmin}\{n_J : \mu \in S_J\}$. Consequently $\mu \in S_{J(\mu)}$, and hence both
$$\|y - \mu\|^2 \ge \|y - P_{J(\mu)}\, y\|^2 \qquad \text{and} \qquad N(\mu) = n_{J(\mu)},$$
so that for every µ,
$$K(\mu, y) \ge C(J(\mu), y) \ge \min_J C(J, y).$$
Now turn to $\hat J = \operatorname{argmin}_J C(J, y)$ and $\hat\mu = P_{\hat J}\, y$. Let us show that $N(\hat\mu) = n_{\hat J}$. Indeed, suppose to the contrary that there were a subset J with $n_J < n_{\hat J}$ and $\hat\mu \in S_J$. Then we would have both $\mathrm{pen}(n_J) < \mathrm{pen}(n_{\hat J})$ and
$$\|y - P_J y\|^2 \le \|y - \hat\mu\|^2 = \|y - P_{\hat J}\, y\|^2.$$
But this means that $C(J, y) < C(\hat J, y)$, in contradiction to the definition of $\hat J$. Hence $\mathrm{pen}(N(\hat\mu)) = \mathrm{pen}(n_{\hat J})$, and so
$$K(\hat\mu, y) = \|y - P_{\hat J}\, y\|^2 + \epsilon^2\mathrm{pen}(n_{\hat J}) = C(\hat J, y) = \min_J C(J, y).$$
Combining the second and fourth displays, we obtain the result.
It is now clear that Theorem 13.2 is a consequence of the following result.
Theorem 13.4. Let $K(\mu, y) = \|y - \mu\|^2 + \epsilon^2\mathrm{pen}(N(\mu))$ and assume that pen(k) satisfies assumptions P. If $\hat\mu_P = \operatorname{argmin}_{\tilde\mu} K(\tilde\mu, y)$ and $K_0(\mu) = \inf_{\tilde\mu} K(\tilde\mu, \mu)$, then
$$E K(\hat\mu_P, \mu) \le (1 - \zeta^{-1})^{-1}\big[\, \zeta C_\gamma L_{p,1}\epsilon^2 + K_0(\mu)\,\big].$$
Proof. 1°. Reduction to ǫ = 1. Showing the ǫ-dependence explicitly for now by writing $K_\epsilon(\mu, y)$, we see immediately that
$$K_\epsilon(\mu, y) = \epsilon^2 K_1(\mu/\epsilon, y/\epsilon),$$
and so it suffices to establish the inequality when ǫ = 1.
2°. A basic inequality. One reason for introducing the functional K(µ, y) is a useful inequality which we now derive. Indeed, using y = µ + ǫz and the expansion $\|y - \tilde\mu\|^2 = \|\epsilon z\|^2 + 2\langle \mu - \tilde\mu, \epsilon z\rangle + \|\mu - \tilde\mu\|^2$ leads to the identity
$$K(\tilde\mu, y) = \|\epsilon z\|^2 + 2\langle \mu - \tilde\mu, \epsilon z\rangle + K(\tilde\mu, \mu).$$
Use this identity for both $K(\hat\mu, y)$ and $K(\tilde\mu, y)$. Since $K(\hat\mu, y) \le K(\tilde\mu, y)$ for all $\tilde\mu$ by definition, we obtain by subtraction the basic inequality
(13.6)   $K(\hat\mu, \mu) \le K(\tilde\mu, \mu) + 2\langle \hat\mu - \tilde\mu, \epsilon z\rangle.$
The left side exceeds the quadratic loss $\|\hat\mu - \mu\|^2$, while on the right side $\tilde\mu$ can be chosen to minimize the theoretical complexity $K_0(\mu) = \inf_{\tilde\mu} K(\tilde\mu, \mu)$. If a minimizing value is $\mu_0$, the chief task in obtaining an upper bound for $E\|\hat\mu - \mu\|^2$ in terms of $K_0(\mu)$ becomes that of bounding the error term $\langle \hat\mu - \mu_0, \epsilon z\rangle$.
We return to the basic inequality (13.6). Inserting $\mu_0$ for $\tilde\mu$ and noting that $E\langle \mu - \mu_0, z\rangle = 0$, we have
(13.7)   $E K(\hat\mu, \mu) \le K(\mu_0, \mu) + 2E\langle \hat\mu - \mu, z\rangle.$
Our goal will be to derive an “a priori” bound for $E\langle \hat\mu - \mu, z\rangle$ in terms of $\zeta^{-1} E K(\hat\mu, \mu)$ and an error term. This will require control of the deviations of $\|P_S z\|^2$ as S ranges over the possible subset models.
3°. Rejection of individual models. Given a model S, we will say that it is rejected if
(13.8)   $\|P_S z\|^2 \ge \zeta^{-2}\mathrm{pen}(\dim S).$
In a ‘null hypothesis’ situation, this constitutes a type I error, and so the penalty should be large enough to make this type I error probability small, simultaneously over all sufficiently large models S.
With the choice $\mathrm{pen}(k) = \zeta^2 k(1 + \sqrt{2L_{p,k}})^2$, we find that when dim S = k, (13.8) is equivalent to
$$\|P_S z\| \ge \sqrt{k} + \sqrt{2kL_{p,k}}.$$
Now $E\|P_S z\|^2 = E\chi^2_{(k)} = k$, and so $E\|P_S z\| \le \sqrt{k}$. Consequently, from the concentration inequality (13.1),
(13.9)   $P\{\text{reject model } S\} \le e^{-kL_{p,k}}.$
4°. A priori bound. Fix µ, and for each model J, let $S_{J,\mu} = \mathrm{span}\{\mu, S_J\}$. Denote the full augmented collection of models by
$$\mathcal{M}_\mu = \{\, S_J,\, S_{J,\mu} : J \subset \{1, \ldots, p\} \,\}.$$
We need to track the size of the largest rejected model. So, let
$$\hat N = \begin{cases} \min\{\, N : \text{all models } S \in \mathcal{M}_\mu \text{ with } \dim S \ge N \text{ are ``accepted''}\,\}, \\ \infty \quad \text{if } \|z\|^2 \ge \zeta^{-2}\mathrm{pen}(n), \end{cases}$$
since in the latter case even the saturated model is not accepted. Correspondingly, set
$$\hat P(z, \mu) = \begin{cases} \mathrm{pen}(\hat N) & \text{if } \hat N \le n, \\ \|\zeta z\|^2 & \text{if } \hat N = \infty. \end{cases}$$
We will establish that
(13.10)   $2\zeta\langle z, \mu' - \mu\rangle \le \hat P(z, \mu) + K(\mu', \mu).$
Assuming for now both (13.10) and that $E\hat P < \infty$, we have, on substitution into (13.7),
$$E K(\hat\mu, \mu) \le K_0(\mu) + \zeta^{-1} E_\mu \hat P + \zeta^{-1} E K(\hat\mu, \mu).$$
Moving the unknown to the left side and then ignoring the penalty term in $K(\hat\mu, \mu) = \|\hat\mu - \mu\|^2 + \mathrm{pen}(N(\hat\mu))$, we get
$$E\|\hat\mu - \mu\|^2 \le (1 - \zeta^{-1})^{-1}\big[\, \zeta^{-1} E\hat P + K_0(\mu)\,\big].$$
5°. Proof of (13.10). As before, let $J(\mu') = \operatorname{argmin}\{n_J : \mu' \in S_J\}$ be the minimal model containing µ′, and set
$$S' = S_{J(\mu'),\mu} = \mathrm{span}\{\mu, S_{J(\mu')}\}.$$
Clearly $\mu' - \mu \in S'$, and so
$$2\zeta\langle z, \mu' - \mu\rangle \le 2\zeta\|P_{S'} z\|\,\|\mu' - \mu\| \le \|\zeta P_{S'} z\|^2 + \|\mu' - \mu\|^2.$$
If $\hat N = \infty$, the right side is bounded by $\|\zeta z\|^2 + K(\mu', \mu)$, so the claim is straightforward. So we now suppose that $\hat N \le n$, and propose to show that
$$\|\zeta P_{S'} z\|^2 \le \mathrm{pen}(\hat N) + \mathrm{pen}(N(\mu')),$$
which suffices for (13.10).
Two cases arise. If $\dim S' \le \hat N$, choose $S'' \supset S'$ with $\dim S'' = \hat N$. By the definition of $\hat N$,
$$\|\zeta P_{S'} z\|^2 \le \|\zeta P_{S''} z\|^2 \le \mathrm{pen}(\hat N).$$
In the second case, $\dim S' > \hat N$. Since $N(\mu') \le \dim S' \le N(\mu') + 1$, we know also that $N(\mu') \ge \hat N$. We remark in addition that for any r, $\mathrm{pen}(r + 1) \le \mathrm{pen}(r) + \mathrm{pen}(r)/r$, since pen(r)/r is decreasing. Since model S′ is accepted, we therefore find
$$\|\zeta P_{S'} z\|^2 \le \mathrm{pen}(\dim S') \le \mathrm{pen}(N(\mu')) + \mathrm{pen}(N(\mu'))/N(\mu') \le \mathrm{pen}(N(\mu')) + \mathrm{pen}(\hat N)/\hat N \le \mathrm{pen}(N(\mu')) + \mathrm{pen}(\hat N).$$
6°. Bounding $E\hat P$. By definition, we have
(13.11)   $E\hat P(z, \mu) = E\{\mathrm{pen}(\hat N),\; \hat N \le n\} + E\{\|\zeta z\|^2,\; \|\zeta z\|^2 \ge \mathrm{pen}(n)\},$
and, using the monotonicity of $k \mapsto L_{p,k}$ and setting $\lambda_{p,1} = \zeta(1 + \sqrt{2L_{p,1}})$,
$$E\{\mathrm{pen}(\hat N),\; \hat N \le n\} \le \lambda_{p,1}^2\Big\{\, P(\hat N = 1) + \sum_{k=1}^{n-1}(k+1)\,P(\hat N = k+1) \Big\}.$$
If $\hat N = k + 1$, there is some k-dimensional model in $\mathcal{M}_\mu$ that is rejected. For each individual model, the rejection probability is bounded by (13.9). We turn to bounding the number of such models.
Let $\mathcal{S}_k$ be the set of models $S_J$ with dimension k; clearly $|\mathcal{S}_k| \le \binom{p}{k}$. A model in $\mathcal{M}_\mu$ of dimension k is either from $\mathcal{S}_k$ or is of the form $S_J \oplus \mu$ for some $S_J \in \mathcal{S}_{k-1}$ which does not already contain µ. Hence
$$|\{S \in \mathcal{M}_\mu : \dim S = k\}| \le |\mathcal{S}_k| + |\mathcal{S}_{k-1}| \le \binom{p}{k} + \binom{p}{k-1} \le 2\,\frac{p^k}{k!}$$
since k < n ≤ p. Putting this together with (13.9) gives
$$P\{\hat N = k + 1\} \le 2\,\frac{p^k}{k!}\exp\{-kL_{p,k}\} \le \frac{2}{\sqrt{2\pi k}}\exp\{-k[L_{p,k} - \log(p/k) - 1]\},$$
where we have used Stirling's inequality $k! \ge \sqrt{2\pi}\, e^{-k} k^{k+1/2}$.
Now use the assumption $L_{p,k} \ge \log(p/k) + \gamma$ for some γ > 1. This yields
$$E[\hat N,\; \hat N \le n] \le 1 + \sum_{k=1}^{n-1} \frac{2(k+1)}{\sqrt{2\pi k}}\, e^{-k(\gamma - 1)} \le c(\gamma).$$
We must finally deal with the second term in (13.11). The Cauchy-Schwarz inequality shows it to be bounded by
$$\{E\|\zeta z\|^4\}^{1/2}\; P\big\{\|z\| \ge \sqrt{n} + \sqrt{2nL_{p,n}}\big\}^{1/2}.$$
Since $\|z\|^2 \sim \chi^2_n$, we have $E\|z\|^4 = 2n + n^2 \le (n + 1)^2$. Applying (13.1) to the second factor, we obtain the upper bound
$$\zeta^2(n + 1)e^{-\frac{1}{2}nL_{p,n}} \le \zeta^2(n + 1)e^{-n/2},$$
using again the assumption on $L_{p,k}$.
Finally, assembling the bounds, we have, so long as $L_{p,1} \ge 1$,
$$E\hat P(z, \mu) \le \zeta^2\big(1 + \sqrt{2L_{p,1}}\big)^2 c(\gamma) + \zeta^2(n + 1)e^{-n/2} \le \zeta^2\big(1 + \sqrt{2L_{p,1}}\big)^2 c'(\gamma) \le \zeta^2 c''(\gamma) L_{p,1}.$$
13.5. Aside: Stepwise methods vs. complexity penalization.
Stepwise methods have long been used as heuristic tools for model selection. In this aside, we explain a connection between such methods and a class of penalties for penalized least squares.
The basic idea of stepwise methods is to use a test statistic (in applications, often an F-test) and a threshold to decide whether to add or delete a variable from the current fitted model. Let $\hat J_k$ denote the best submodel of size k:
$$\hat J_k = \operatorname{argmax}_J\{\, \|P_J y\|^2 : n_J = k \,\},$$
and denote the resulting best k-variable estimator by $Q_k y = P_{\hat J_k}\, y$. The mapping $y \mapsto Q_k(y)$ is non-linear, since the optimal set $\hat J_k(y)$ will in general vary with y.
In the forward stepwise approach, the model size is progressively increased until a threshold criterion suggests that no further benefit will accrue from continuing. Thus, define
(13.12)   $\hat k_G = \text{first } k \text{ such that } \|Q_{k+1} y\|^2 - \|Q_k y\|^2 \le \epsilon^2 t_{p,k+1}^2.$
Note that we allow the threshold to depend on k: in practice it is often constant, but we wish to allow $k \mapsto t_{p,k}^2$ to be decreasing.
In contrast, the backward stepwise approach starts with a saturated model and gradually decreases the model size until there appears to be no further advantage in going on. So, define
(13.13)   $\hat k_F = \text{last } k \text{ such that } \|Q_k y\|^2 - \|Q_{k-1} y\|^2 \ge \epsilon^2 t_{p,k}^2.$
Remarks. 1. In the orthogonal case, $y_i = \mu_i + \epsilon z_i$, i = 1, ..., n, with order statistics $|y|_{(1)} \ge |y|_{(2)} \ge \cdots \ge |y|_{(n)}$, we find that
$$\|Q_k y\|^2 = \sum_{j=1}^k |y|_{(j)}^2,$$
so that
(13.14)   $\hat k_F = \max\{\, k : |y|_{(k)} \ge \epsilon t_{p,k} \,\},$
and $\hat k_F$ agrees with the FDR definition with $t_{p,k} = z(qk/2n)$. [Fuller reference??]
In this case, it is critical to the method that the thresholds $k \mapsto t_{p,k}$ be (slowly) decreasing.
2. In practice, for reasons of computational simplicity, the forward and backward
stepwise algorithms are often “greedy”, i.e., they look for the best variable to add
(or delete) without optimizing over all sets of size k.
The stepwise schemes are related to a penalized least squares estimator. Let
(13.15)   $S(k) = \|y - Q_k y\|^2 + \epsilon^2\sum_{j=1}^k t_{p,j}^2, \qquad \hat k_2 = \operatorname{argmin}_{0 \le k \le n} S(k).$
Thus the associated penalty function is $\mathrm{pen}(k) = \sum_{j=1}^k t_{p,j}^2$, and the corresponding estimator is given by (13.3) and (13.4). [Remark that pen(k) satisfies assumptions (P) if $j \mapsto t_{p,j}$ is decreasing with $\zeta^{-1} t_{p,k} \ge \log(p/k) + \gamma_k$ etc. ??]
The optimal model size for pen(k) is bracketed between the stepwise quantities.
Proposition 13.5. Let k̂G , k̂F be the forward and backward stepwise variable
numbers defined at (13.12) and (13.13) respectively, and let k̂2 be the global optimum
model size for pen(k) defined at (13.15). Then
k̂G ≤ k̂2 ≤ k̂F .
Proof. Since $\|y - Q_k y\|^2 = \|y\|^2 - \|Q_k y\|^2$,
$$S(k + 1) - S(k) = \|Q_k y\|^2 - \|Q_{k+1} y\|^2 + \epsilon^2 t_{p,k+1}^2.$$
Thus $S(k + 1) < S(k)$ or $S(k+1) > S(k)$ according as
$$\|Q_{k+1} y\|^2 - \|Q_k y\|^2 \;>\; \epsilon^2 t_{p,k+1}^2 \qquad \text{or} \qquad < \; \epsilon^2 t_{p,k+1}^2.$$
Thus, if it were the case that $\hat k_2 > \hat k_F$, then necessarily $S(\hat k_2) > S(\hat k_2 - 1)$, which would contradict the definition of $\hat k_2$ as a global minimum of S(k). Likewise, $\hat k_2 < \hat k_G$ is not possible, since it would imply that $S(\hat k_2 + 1) < S(\hat k_2)$.
[Include diagram, notes 11/26/02 p.8 showing local and global minima.]
[Remark?: Experience with FDR in orthogonal case shows that often
k̂G = k̂2 = k̂F , and that in sparse cases can bound k̂F − k̂G .]
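The bracketing $\hat k_G \le \hat k_2 \le \hat k_F$ is easy to check numerically in the orthogonal case, where $\|Q_k y\|^2$ is the sum of the k largest squared observations. Here is a short Python sketch; the decreasing threshold sequence of 2 log(n/k) type, the sparsity level, and the signal size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 200, 1.0
mu = np.zeros(n); mu[:10] = 4.0                       # a sparse mean vector
y = mu + eps * rng.standard_normal(n)

k_idx = np.arange(1, n + 1)
t = np.sqrt(2 * np.log(2 * n / k_idx))                # decreasing thresholds (illustrative)
y2 = np.sort(y**2)[::-1]                              # |y|_(1)^2 >= ... >= |y|_(n)^2

# S(k) = sum_{j>k} y_(j)^2 + eps^2 * sum_{j<=k} t_j^2, for k = 0..n
tail = np.concatenate(([y2.sum()], y2.sum() - np.cumsum(y2)))
penalty = np.concatenate(([0.0], np.cumsum(eps**2 * t**2)))
k2 = int(np.argmin(tail + penalty))                   # global minimizer k_hat_2

gains = y2 - eps**2 * t**2                            # ||Q_k y||^2 - ||Q_{k-1} y||^2 - eps^2 t_k^2
kG = int(np.argmax(gains <= 0)) if np.any(gains <= 0) else n   # forward stopping rule (13.12)
kF = int(np.max(np.where(gains >= 0)[0]) + 1) if np.any(gains >= 0) else 0   # backward rule (13.13)

print(kG, k2, kF)                                     # expect kG <= k2 <= kF
```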
13.6. A version for orthogonal regression
We now specialize to the n-dimensional white Gaussian sequence model³:
(13.16)   $y_i = \mu_i + \epsilon z_i, \qquad i = 1, \ldots, n, \qquad z_i \overset{\text{i.i.d.}}{\sim} N(0, 1).$
The columns of the design matrix implicit in (13.16) are the unit coordinate vectors $e_i$, consisting of zeros except for a 1 in the ith position. The least squares estimator corresponding to a subset $J \subset \{1, \ldots, n\}$ is simply given by coordinate projection:
$$\hat\mu_{J,k}(y) = \begin{cases} y_k & k \in J, \\ 0 & k \notin J. \end{cases}$$
In the orthogonal setting,
$$N(\mu) = \#\{\, i : \mu_i \ne 0 \,\}$$
simply counts the number of non-zero coordinates. To minimize the empirical complexity
$$K(\mu, y) = \|y - \mu\|^2 + \epsilon^2\mathrm{pen}(N(\mu)),$$
³Of course, this is the canonical form of the more general orthogonal regression setting $Y = X\beta + \epsilon Z$, with N-dimensional response and n-dimensional parameter vector β linked by an orthogonal design matrix X satisfying $X^T X = I_n$, and with noise $Z \sim N_N(0, I)$. This reduces to (13.16) after premultiplying by $X^T$ and setting $y = X^T Y$, µ = β and $z = X^T Z$.
we observe that the best fitting model of dimension k just picks off the k largest observations. Thus, if $y_{(1)}^2 \ge y_{(2)}^2 \ge \cdots \ge y_{(n)}^2$ denote the ordered squared observations,
$$\inf_\mu K(\mu, y) = \min_k \inf_{N(\mu) = k} \|y - \mu\|^2 + \epsilon^2\mathrm{pen}(k) = \min_k \sum_{j > k} y_{(j)}^2 + \epsilon^2\mathrm{pen}(k).$$
If $\hat k$ denotes the (data dependent!) minimizing value of k, then the complexity penalized least squares estimate $\hat\mu_P$ is given by hard thresholding at threshold $\hat t = |y|_{(\hat k)}$:
$$\hat\mu_{P,j}(y) = \begin{cases} y_j & \text{if } |y_j| \ge \hat t, \\ 0 & \text{otherwise}. \end{cases}$$
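A short Python sketch of this reduction: assuming the penalty is supplied as an array pen[0..n], the global minimizer $\hat k$ is found in a single pass over the ordered squared observations, and the resulting estimate is hard thresholding at $\hat t = |y|_{(\hat k)}$. The 2k log(n/k)-type penalty constants used in the example are arbitrary.

```python
import numpy as np

def penalized_estimate(y, eps, pen):
    """Orthogonal-case complexity-penalized least squares.
    pen is an array of length n+1, with pen[k] the penalty for a k-term model."""
    n = len(y)
    y2_sorted = np.sort(y**2)[::-1]
    tails = np.concatenate(([y2_sorted.sum()], y2_sorted.sum() - np.cumsum(y2_sorted)))
    K = tails + eps**2 * pen                       # K(k) = sum_{j>k} y_(j)^2 + eps^2 pen(k)
    k_hat = int(np.argmin(K))
    if k_hat == 0:
        return np.zeros(n), 0
    t_hat = np.sqrt(y2_sorted[k_hat - 1])          # threshold |y|_(k_hat)
    return np.where(np.abs(y) >= t_hat, y, 0.0), k_hat   # hard thresholding

# illustrative 2k log(n/k)-type penalty with arbitrary constants
n, eps = 500, 1.0
k = np.arange(n + 1)
L = np.log(n / np.maximum(k, 1)) + 1.5
pen = np.where(k > 0, 1.2**2 * k * (1 + np.sqrt(2 * L))**2, 0.0)

rng = np.random.default_rng(3)
mu = np.zeros(n); mu[:15] = 5.0
y = mu + eps * rng.standard_normal(n)
mu_hat, k_hat = penalized_estimate(y, eps, pen)
print(k_hat, np.sum((mu_hat - mu)**2))             # selected size and realized loss
```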
To state the oracle inequality established in the last section (?), recall the definition of theoretical complexity,
$$K_\epsilon(\mu) = \inf_{\tilde\mu}\, \|\mu - \tilde\mu\|^2 + \epsilon^2\mathrm{pen}(N(\tilde\mu)),$$
and the assumptions (P) on the penalty function, namely that
$$\mathrm{pen}(k) = \zeta^2 k\big(1 + \sqrt{2L_{n,k}}\big)^2, \qquad \zeta > 1,$$
and (i) $k \mapsto \mathrm{pen}(k)$ is strictly increasing, (ii) $k \mapsto L_{n,k}$ is non-increasing, and (iii) $L_{n,k} \ge \log(n/k) + \gamma_k$, with $\gamma_k \ge \gamma > 1$ for $k \ge k_0$.
Then the oracle inequality becomes
(13.17)   $E\|\hat\mu_P - \mu\|^2 \le c_1 L_{n,1}\epsilon^2 + c_2 K_\epsilon(\mu),$
with $c_1 = c_1(\zeta, \gamma)$ and $c_2 = c_2(\zeta)$.
A simple bound for theoretical complexity.
Lemma 13.6. If $\mathrm{pen}(k) = k\lambda_k^2$ with $k \mapsto \lambda_k$ non-increasing, then
(13.18)   $K_\epsilon(\mu) \le \sum_{k=1}^n \mu_{(k)}^2 \wedge \lambda_k^2\epsilon^2.$
This and the following lemma will be proved below. For now, look at the quantity
$$R_\lambda(\mu, \epsilon) = \sum_{k=1}^n \mu_{(k)}^2 \wedge \lambda_k^2\epsilon^2.$$
If $\lambda_k \equiv \lambda$, this reduces to the ideal risk $R(\mu, \lambda\epsilon)$ studied at length earlier; in fact, equality holds in (13.18) in this case. However, we are now more interested in cases in which $\lambda_k$ is strictly decreasing, for example like $k \mapsto \zeta(1 + \sqrt{2\log(n\beta/k)})$. Although the rate of decrease is slow, we will see that it suffices to remove spurious logarithmic terms from rates of convergence.
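The effect can be seen in a small computation, sketched below in Python, comparing $R_\lambda$ for constant thresholds $\lambda_k \equiv \sqrt{2\log n}$ with the bare decreasing choice $\lambda_k = \sqrt{2\log(n/k)}$; the factor ζ and the extra “1” are dropped here simply to isolate the logarithmic effect, and the sparse mean configuration is an arbitrary choice.

```python
import numpy as np

def R_lambda(mu, eps, lam):
    """R_lambda(mu, eps) = sum_k  mu_(k)^2 ∧ lam_k^2 eps^2  (mu ordered by magnitude)."""
    mu2 = np.sort(mu**2)[::-1]
    return np.sum(np.minimum(mu2, lam**2 * eps**2))

n, eps = 100000, 1.0
k = np.arange(1, n + 1)
lam_const = np.full(n, np.sqrt(2 * np.log(n)))     # constant 2 log n thresholds
lam_dec = np.sqrt(2 * np.log(n / k))               # bare decreasing 2 log(n/k) thresholds

mu = np.zeros(n); mu[:10000] = 10.0                # 10% of coordinates carry large signal
print(R_lambda(mu, eps, lam_const) / R_lambda(mu, eps, lam_dec))
# ratio is roughly log(n) / (log(n/k*) + 1): the decreasing thresholds shave the log factor
```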
We remark also that in typical cases, the inequality in (13.18) is sharp at the
level of rates:
Lemma 13.7. Suppose that $\lambda_k = \ell(k/n)$ for a function ℓ(x), positive and decreasing in $x \in [0, 1]$, that satisfies
$$\lim_{x \to 0} x\ell(x) = 0, \qquad \sup_{0 \le x \le 1} x|\ell'(x)| \le c_1.$$
Then
$$\sum_{k=1}^n \mu_{(k)}^2 \wedge \lambda_k^2\epsilon^2 \le c_2 K_\epsilon(\mu), \qquad c_2 \le 1 + c_1/\ell(1).$$
This leads immediately to an important bound.
Corollary 13.8. If $\hat\mu_P = \operatorname{argmin}_\mu \|y - \mu\|^2 + \epsilon^2\mathrm{pen}(N(\mu))$ for $\mathrm{pen}(k) = k\lambda_k^2$ and $\lambda_k = \zeta(1 + \sqrt{2L_{n,k}})$ satisfying assumptions (P), then
$$E\|\hat\mu_P - \mu\|^2 \le c_1 L_{n,1}\epsilon^2 + c_2 \sum_k \mu_{(k)}^2 \wedge \lambda_k^2\epsilon^2.$$
[Include proofs, pp 5 & 6.]
Lemma 13.9. (a) Suppose that $\{s_k\}_1^n$ and $\{\gamma_k\}_1^n$ are positive, non-increasing sequences. Then
$$\min_{0 \le k \le n}\Big[\, k s_k + \sum_{j=k+1}^n \gamma_j \,\Big] \le \sum_{k=1}^n s_k \wedge \gamma_k.$$
(b) Conversely, suppose that $s_k = \sigma(k/n)$ with σ(u) a positive, decreasing function on [0, 1] that satisfies
$$\lim_{u \to 0} u\sigma(u) = 0, \qquad \sup_{0 \le u \le 1} |u\sigma'(u)| \le c_1.$$
Then, with $c_2 \le 1 + c_1/\sigma(1)$,
$$\sum_{k=1}^n s_k \wedge \gamma_k \le c_2 \min_{0 \le k \le n}\Big[\, k s_k + \sum_{j=k+1}^n \gamma_j \,\Big].$$
Proof. (a) Let $\Gamma_k = \sum_{i=k+1}^n \gamma_i$, and let $\kappa = \max\{k \ge 1 : s_k \le \gamma_k\}$ if such an index exists; otherwise set κ = 0. Using the monotonicity of both sequences and the definition of κ, we have
(13.19)   $\sum_{i=1}^n s_i \wedge \gamma_i = \sum_{i=1}^\kappa s_i \wedge \gamma_i + \Gamma_\kappa \ge \kappa(s_\kappa \wedge \gamma_\kappa) + \Gamma_\kappa = \kappa s_\kappa + \Gamma_\kappa \ge \inf_k\, [k s_k + \Gamma_k].$
(b) For each k, including k = 0 and k = n, we have $\sum_{i=1}^n s_i \wedge \gamma_i \le \sum_{i=1}^k s_i + \sum_{i=k+1}^n \gamma_i$, so that
$$\sum_{i=1}^n s_i \wedge \gamma_i \le \min_{0 \le k \le n}\Big[\, \sum_{i=1}^k s_i + \Gamma_k \,\Big].$$
The result will follow if we show that $\sum_{i=1}^k s_i \le c_2 k s_k$ for k = 0, ..., n. But, since σ is decreasing,
$$\sum_{i=1}^k s_i = \sum_{i=1}^k \sigma(i/n) \le n\int_0^{k/n} \sigma(u)\,du.$$
Hence $(k s_k)^{-1}\sum_{i=1}^k s_i \le \big[(k/n)\sigma(k/n)\big]^{-1}\int_0^{k/n}\sigma(u)\,du$. The result now follows from the Remark below.
Remark. If σ(u) is a positive, decreasing function on [0, 1] that satisfies
$$\lim_{u \to 0} u\sigma(u) = 0, \qquad \sup_{0 \le u \le 1} |u\sigma'(u)| \le c_1,$$
then, with $c_2 \le 1 + c_1/\sigma(1)$ and $v \in [0, 1]$,
$$\int_0^v \sigma(u)\,du \le c_2\, v\sigma(v).$$
Indeed, by partial integration,
$$\int_0^v \sigma(u)\,du = v\sigma(v) + \int_0^v u|\sigma'(u)|\,du \le v[\sigma(v) + c_1] \le v\sigma(v)[1 + c_1/\sigma(1)].$$
13.7. False discovery rate estimation
The notion of False Discovery Rate (FDR) originated in the area of multiple
inference, which is concerned with the simultaneous testing of a possibly large number of null hypotheses. When cast as a prescription for estimation, the FDR point
of view leads to an estimator closely connected with the 2k log(n/k) penalty class.
We describe this connection, first by reviewing the simultaneous testing setting for
FDR.
Consider the orthogonal regression model y = µ + ǫz in Rn , and consider the n
separate null hypotheses Hi : µi = 0. Suppose that n independent test statistics are
given, which we here take to be simply the components yi of the data. Traditional
familywise error rate (FWER) control methods seek to bound the chance of even
one type I error among the n tests. The goal then is to guarantee that
P {reject ≥ 1 Hi | all Hi true} ≤ q,
for a specified value q. The standard way to achieve this is the Bonferroni approach: assign error probability q/n to each test and set the rejection regions as $\{|y_i| > t\}$, where t is chosen so that under the null distribution $P\{|z_i| > t\} = q/n$. As usual, let z(η) denote the upper 100(1 − η)% quantile of the standard Gaussian distribution. Thus $t = t_B$, where
$$t_B = z(q/2n) \sim \sqrt{2\log(2n/q)} \sim \sqrt{2\log n}$$
for n large. This shows the fundamental limitation of the FWER control approach: since the thresholds are chosen quite high, the overall procedure has insufficient power to detect false $H_i$.
Here is an alternative sequential procedure, originally due to Seeger (1968) and Simes (1986), for deciding which hypotheses to reject. Form the two-sided P-values $P_i = 2\tilde\Phi(|y_i|/\epsilon)$, order them $P_{(1)} \le P_{(2)} \le \cdots \le P_{(n)}$, and look for the last crossing time of a linear boundary with slope q/n:
(13.20)   $\hat k_F = \max\{\, i : P_{(i)} \le iq/n \,\}.$
INCLUDE DIAGRAM! Now reject all null hypotheses $H_{(i)}$ corresponding to $i = 1, \ldots, \hat k_F$. By contrast, the Bonferroni method rejects some hypotheses if and only if $P_{(1)} \le q/n$. If $\hat k_B = \#\{i : P_{(i)} \le q/n\}$ is the number of Bonferroni-rejected hypotheses, then necessarily $\hat k_F \ge \hat k_B$, so that FDR always generates at least as many “discoveries” as Bonferroni, and typically many more.
A major contribution of Benjamini & Hochberg (1995) was to provide a definition of false discovery rate that the procedure defining $\hat k_F$ controls. Let $N = \{k : \mu_k = 0\}$ be the set of “true” null hypotheses and $\hat D = \{k : H_k \text{ is rejected}\}$ the set of “discoveries” made from the data. Then set
$$FDR = |\hat D \cap N| \,/\, |\hat D|,$$
the fraction of discoveries that are false, i.e., that come from true null hypotheses. The key result of Benjamini & Hochberg (1995) is that the sequential procedure producing $\hat k_F$ guarantees that
$$E_\mu\{ FDR \} \le q \qquad \text{for all } \mu \in \mathbb{R}^n.$$
Thus, the expected fraction of spurious results is controlled, regardless of the configuration of the true means. [In fact, Benjamini & Hochberg (1995) prove a somewhat sharper result, with the upper bound being $n_0 q/n$, where $n_0 = \#\{i : \mu_i = 0\}$.]
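A minimal simulation sketch of this step-up rule and of the resulting false discovery proportion, in Python; the sparsity level, signal size, q, and number of replications below are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def bh_rejections(y, eps, q):
    """Step-up rule (13.20): two-sided p-values P_i = 2*Phi_bar(|y_i|/eps);
    reject the hypotheses with the k_F smallest p-values, where
    k_F = max{ i : P_(i) <= i q / n } (0 if no crossing)."""
    n = len(y)
    pvals = 2 * norm.sf(np.abs(y) / eps)
    order = np.argsort(pvals)
    crossed = np.where(pvals[order] <= q * np.arange(1, n + 1) / n)[0]
    k_F = 0 if crossed.size == 0 else int(crossed[-1]) + 1
    rejected = np.zeros(n, dtype=bool)
    rejected[order[:k_F]] = True
    return rejected

rng = np.random.default_rng(4)
n, eps, q = 1000, 1.0, 0.1
mu = np.zeros(n); mu[:50] = 4.0                 # 50 false null hypotheses
fdps = []
for _ in range(200):
    y = mu + eps * rng.standard_normal(n)
    rej = bh_rejections(y, eps, q)
    n_rej = rej.sum()
    fdps.append(0.0 if n_rej == 0 else np.sum(rej & (mu == 0)) / n_rej)
print(np.mean(fdps))   # average false discovery proportion: should be at most about (n0/n) q
```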
FDR estimation. As proposed by Abramovich & Benjamini (1995), the definition (13.20) can be converted into a prescription for estimation in the sequence model via the switching relation
(13.21)   $P_{(k)} = 2\tilde\Phi(|y|_{(k)}/\epsilon) \le kq/n \iff |y|_{(k)}/\epsilon \ge z(kq/2n).$
(Here $|y|_{(1)} \ge \cdots \ge |y|_{(n)}$ are the ordered absolute values of the data y.) This suggests defining a boundary sequence $k \mapsto t_{n,k}$ by $t_{n,k} = z(kq/2n)$. This sequence is decreasing and satisfies
(13.22)   $t_{n,k}^2 \sim 2\log(2n/kq)$
as $k/n \to 0$. From (13.20) and (13.21), we see that the FDR index $\hat k_F$ is the last crossing time
(13.23)   $\hat k_F = \max\{\, k : |y|_{(k)} \ge \epsilon t_{n,k} \,\}.$
The corresponding estimator is just hard thresholding with a data-determined threshold $\hat t_F = \epsilon\, t_{n,\hat k_F}$:
$$\hat\mu_{F,k} = \eta_H(y_k; \hat t_F) = \begin{cases} y_k & |y_k| \ge \hat t_F, \\ 0 & \text{otherwise}. \end{cases}$$
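In estimation form, the same computation becomes hard thresholding at the data-determined level; here is a Python sketch, using the normal upper-quantile function for z(·) and arbitrary choices of sparsity, signal size, and q.

```python
import numpy as np
from scipy.stats import norm

def fdr_threshold_estimate(y, eps, q):
    """FDR-based hard thresholding: with t_{n,k} = z(kq/2n), set
    k_F = max{ k : |y|_(k) >= eps * t_{n,k} } and threshold at eps * t_{n,k_F}."""
    n = len(y)
    t = norm.isf(q * np.arange(1, n + 1) / (2 * n))   # t_{n,k} = z(kq/2n), decreasing in k
    y_sorted = np.sort(np.abs(y))[::-1]               # |y|_(1) >= ... >= |y|_(n)
    crossings = np.where(y_sorted >= eps * t)[0]
    if crossings.size == 0:
        return np.zeros(n)                            # no discoveries: estimate zero
    t_hat = eps * t[crossings[-1]]                    # threshold at the last crossing
    return np.where(np.abs(y) >= t_hat, y, 0.0)

rng = np.random.default_rng(5)
n, eps, q = 1000, 1.0, 0.1
mu = np.zeros(n); mu[:25] = 5.0
y = mu + eps * rng.standard_normal(n)
mu_hat = fdr_threshold_estimate(y, eps, q)
print(np.count_nonzero(mu_hat), np.sum((mu_hat - mu)**2))
```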
Now the connection with penalized estimation is apparent: (13.23) is just the backward stepwise model selection rule discussed in Section 13.5, for the threshold sequence $t_{n,k} = z(kq/2n)$. The associated penalized least squares criterion is given by (13.15), and the relation (13.22) indicates that the penalty may belong to the 2k log(n/k) class. FIX UP!