
Optimal Rates of Aggregation
Alexandre B. Tsybakov
Laboratoire de Probabilités et Modèles Aléatoires, Université Paris 6, 4 pl. Jussieu,
75252 Paris Cedex 05, France,
[email protected]
Abstract. We study the problem of aggregation of M arbitrary estimators of a regression function with respect to the mean squared risk.
Three main types of aggregation are considered: model selection, convex
and linear aggregation. We define the notion of optimal rate of aggregation in an abstract context and prove lower bounds valid for any method
of aggregation. We then construct procedures that attain these bounds,
thus establishing optimal rates of linear, convex and model selection type
aggregation.
1 Introduction
Consider the regression model
$$Y_i = f(X_i) + \xi_i, \qquad i = 1, \dots, n, \qquad (1)$$
where $X_1, \dots, X_n$ are i.i.d. random vectors with values in a Borel subset $\mathcal{X}$ of $\mathbb{R}^d$, the $\xi_i$ are i.i.d. zero-mean random variables in $\mathbb{R}$ such that $(\xi_1, \dots, \xi_n)$ is independent of $(X_1, \dots, X_n)$, and $f : \mathcal{X} \to \mathbb{R}$ is an unknown regression function. The problem is to estimate the function $f$ from the data $D_n = ((X_1, Y_1), \dots, (X_n, Y_n))$.
Denote by $P_f$ and $P^X$ the probability distributions of $D_n$ and of $X_1$ respectively. For an estimator $\hat f_n$ of $f$ based on the sample $D_n$, define the $L_2$-risk
$$R(\hat f_n, f) = E_f \|\hat f_n - f\|^2,$$
where $E_f$ denotes the expectation w.r.t. the measure $P_f$ and, for a Borel function $g : \mathcal{X} \to \mathbb{R}$,
$$\|g\| = \Big( \int_{\mathcal{X}} g^2(x)\, P^X(dx) \Big)^{1/2}.$$
Suppose that we have $M \ge 2$ arbitrary estimators $f_{n,1}, \dots, f_{n,M}$ of the function $f$ based on the sample $D_n$. The aim of aggregation is to construct a new estimate of $f$ (called an aggregate) that mimics in a certain sense the behavior of the best among the estimators $f_{n,j}$. We will consider the following three well-known aggregation problems (cf. Nemirovski (2000)).
Problem (L). (Linear aggregation.) Find an aggregate estimator $\tilde f_n$ which is at least as good as the best linear combination of $f_{n,1}, \dots, f_{n,M}$, up to a small remainder term, i.e.
$$R(\tilde f_n, f) \le \inf_{\lambda \in \mathbb{R}^M} R(f^*_\lambda, f) + \Delta^{L}_{n,M}$$
for every $f$ belonging to a large class of functions $\mathcal{F}$, where
$$f^*_\lambda = \sum_{j=1}^M \lambda_j f_{n,j}, \qquad \lambda = (\lambda_1, \dots, \lambda_M),$$
and $\Delta^{L}_{n,M}$ is a remainder term that does not depend on $f$.
Problem (C). (Convex aggregation.) Find an aggregate estimator $\tilde f_n$ which is at least as good as the best convex combination of $f_{n,1}, \dots, f_{n,M}$, up to a small remainder term, i.e.
$$R(\tilde f_n, f) \le \inf_{\lambda \in \Lambda^M} R(f^*_\lambda, f) + \Delta^{C}_{n,M}$$
for every $f$ belonging to a large class of functions $\mathcal{F}$, where
$$\Lambda^M = \Big\{ \lambda \in \mathbb{R}^M : \lambda_j \ge 0,\ \sum_{j=1}^M \lambda_j \le 1 \Big\}$$
and $\Delta^{C}_{n,M}$ is a remainder term that does not depend on $f$.
Problem (MS). (Model selection aggregation.) Find an aggregate estimator $\tilde f_n$ which is at least as good as the best among $f_{n,1}, \dots, f_{n,M}$, up to a small remainder term, i.e.
$$R(\tilde f_n, f) \le \min_{1 \le j \le M} R(f_{n,j}, f) + \Delta^{MS}_{n,M}$$
for every $f$ belonging to a large class of functions $\mathcal{F}$, where $\Delta^{MS}_{n,M}$ is a remainder term that does not depend on $f$.
Clearly,
$$\min_{1 \le j \le M} R(f_{n,j}, f) \ge \inf_{\lambda \in \Lambda^M} R(f^*_\lambda, f) \ge \inf_{\lambda \in \mathbb{R}^M} R(f^*_\lambda, f). \qquad (2)$$
The smallest possible remainder terms $\Delta^{L}_{n,M}$, $\Delta^{C}_{n,M}$ and $\Delta^{MS}_{n,M}$ characterize the price to pay for aggregation. We will see that they satisfy a relation that is in a sense inverse to (2): the largest price $\Delta_{n,M}$ is to be paid for linear aggregation and the smallest one for model selection aggregation. Convex aggregation has an intermediate price.
Aggregation of arbitrary estimators for regression with random design (1) under the $L_2$-risk has been studied by several authors, mostly in the case of model selection (Yang (2000), Catoni (2001), Wegkamp (2000), Györfi, Kohler, Krzyżak and Walk (2002), Birgé (2002)) and convex aggregation (Nemirovski (2000), Juditsky and Nemirovski (2000), Yang (2001)). Linear aggregation for the Gaussian white noise model is discussed by Nemirovski (2000). Aggregation procedures are typically based on sample splitting. The initial sample $D_n$ is divided into two independent subsamples $D^1_m$ and $D^2_l$ of sizes $m$ and $l$ respectively, where $m \gg l$ and $m + l = n$. The first subsample $D^1_m$ is used to construct the estimators $f_{n,1}, \dots, f_{n,M}$ and the second subsample $D^2_l$ is used to aggregate them, i.e. to construct $\tilde f_n$ (thus, $\tilde f_n$ is measurable w.r.t. the whole sample $D_n$). In this paper we will not consider sample splitting schemes but rather deal with an idealized framework (following Nemirovski (2000), Juditsky and Nemirovski (2000)) where the first subsample is fixed, so that instead of the estimators $f_{n,1}, \dots, f_{n,M}$ we have fixed functions $f_1, \dots, f_M$. The problem is to find linear, convex and model selection aggregates of $f_1, \dots, f_M$ based on the sample $D_n$ that converge with the fastest possible rate (i.e. with the smallest possible remainder terms $\Delta_{n,M}$) in a minimax sense. A partial solution of this problem for the case of convex aggregation has been given by Juditsky and Nemirovski (2000) and Yang (2001). Here we solve the problem for all three types of aggregation, in particular improving the existing results on convex
aggregation. The main goal of this paper is to find optimal rates of aggregation
in the sense of a general definition given below.
2 Main definition and lower bounds
We start with the following definition, which covers a more general framework than the one considered in the paper.
Definition 1. Let $H$ be a given abstract index set and let $\mathcal{F}$, $\mathcal{F}'$ be given classes of Borel functions on $\mathcal{X}$. A sequence of positive numbers $\psi_n$ is called an optimal rate of aggregation for $(H, \mathcal{F}, \mathcal{F}')$ if
– for any family of Borel functions $\{f_\lambda,\ \lambda \in H\}$ indexed by $H$ and contained in $\mathcal{F}'$ there exists an estimator $\tilde f_n$ of $f$ (aggregate) such that
$$\sup_{f \in \mathcal{F}} \Big[ R(\tilde f_n, f) - \inf_{\lambda \in H} \|f_\lambda - f\|^2 \Big] \le C \psi_n, \qquad (3)$$
for some constant $C < \infty$ and any integer $n$,
and
– there exists a family of Borel functions $\{f_\lambda,\ \lambda \in H\}$ indexed by $H$ and contained in $\mathcal{F}'$ such that for all estimators $T_n$ of $f$ we have
$$\sup_{f \in \mathcal{F}} \Big[ R(T_n, f) - \inf_{\lambda \in H} \|f_\lambda - f\|^2 \Big] \ge c \psi_n, \qquad (4)$$
for some constant $c > 0$ and any integer $n$.
In this paper we are interested in the following index sets:
$$H = \begin{cases} \{1, \dots, M\} & \text{for Problem (MS)}, \\ \Lambda^M & \text{for Problem (C)}, \\ \mathbb{R}^M & \text{for Problem (L)}, \end{cases}$$
and we consider $\mathcal{F} = \mathcal{F}_0$ defined by
$$\mathcal{F}_0 = \{f : \|f\|_\infty \le L\}, \qquad (5)$$
where $\|\cdot\|_\infty$ denotes the $L_\infty$ norm associated with the measure $P^X$ and $L < \infty$ is an unknown constant. We also take $\mathcal{F}' = \mathcal{F}_0$ for Problems (MS) and (C), and $\mathcal{F}' = L_2(\mathcal{X}, P^X)$ for Problem (L).
Optimal rates of aggregation for $(\{1, \dots, M\}, \mathcal{F}_0, \mathcal{F}_0)$, for $(\Lambda^M, \mathcal{F}_0, \mathcal{F}_0)$ and for $(\mathbb{R}^M, \mathcal{F}_0, L_2(\mathcal{X}, P^X))$ will be called, for brevity, optimal rates of model selection, convex and linear aggregation, respectively.
In the rest of the paper the notation $f_\lambda$ for a vector $\lambda = (\lambda_1, \dots, \lambda_M) \in \mathbb{R}^M$ is understood in the following sense:
$$f_\lambda = \sum_{j=1}^M \lambda_j f_j.$$
In this section we prove lower bounds of the type (4) for model selection,
convex and linear aggregation. The proofs will be based on the following lemma
on minimax lower bounds which can be obtained, for example, by combining
Theorems 2.2 and 2.5 in Tsybakov (2003).
Lemma 1. Let $\mathcal{C}$ be a finite set of functions on $\mathcal{X}$ such that $N = \mathrm{card}(\mathcal{C}) \ge 2$,
$$\|f - g\|^2 \ge 4\psi_n > 0, \qquad \forall\, f, g \in \mathcal{C},\ f \ne g,$$
and the Kullback divergences $K(P_f, P_g)$ between the measures $P_f$ and $P_g$ satisfy
$$K(P_f, P_g) \le (1/16) \log N, \qquad \forall\, f, g \in \mathcal{C}.$$
Then
$$\inf_{T_n} \sup_{f \in \mathcal{C}} R(T_n, f) \ge c_1 \psi_n,$$
where $\inf_{T_n}$ denotes the infimum over all estimators and $c_1 > 0$ is a constant.
Throughout the paper we denote by $c_i$ finite positive constants. Introduce the following assumptions.
(A1) The errors $\xi_i$ are i.i.d. Gaussian $N(0, \sigma^2)$ random variables, $0 < \sigma < \infty$.
(A2) There exists a cube $S \subset \mathcal{X}$ such that $P^X$ admits a bounded density $\mu(\cdot)$ on $S$ w.r.t. the Lebesgue measure, and $\mu(x) \ge \mu_0 > 0$ for all $x \in S$.
(A3) There exists a constant $c_0$ such that $\log M \le c_0 n$.
(A4) There exists a constant $c_0$ such that $M \le c_0 n$.
Theorem 1. Under assumptions (A1)–(A3) we have
$$\sup_{f_1, \dots, f_M \in \mathcal{F}_0}\ \inf_{T_n}\ \sup_{f \in \mathcal{F}_0} \Big[ R(T_n, f) - \min_{1 \le j \le M} \|f_j - f\|^2 \Big] \ge c\, \psi_n^{MS}(M)$$
for some constant $c > 0$ and any integer $n$, where $\inf_{T_n}$ denotes the infimum over all estimators and
$$\psi_n^{MS}(M) = \frac{\log M}{n}.$$
Proof. Let $\{\varphi_j\}_{j=1}^M$ be an orthogonal system of functions in $L_2(S, dx)$ for the cube $S$ given in assumption (A2), satisfying $\|\varphi_j\|_\infty \le A < \infty$ for $j = 1, \dots, M$. Such functions can be constructed, for example, by taking $\varphi_j(x) = A_1 \cos(a j x_1 + b)$ for $x \in S$ and suitably chosen constants $A_1$, $a$ and $b$, where $x_1$ is the first coordinate of $x$. Define the functions
$$f_j(x) = \gamma \sqrt{\frac{\log M}{n}}\, \varphi_j(x)\, I(x \in S), \qquad j = 1, \dots, M,$$
where $I(\cdot)$ denotes the indicator function and $\gamma$ is a positive constant to be chosen. In view of assumption (A3), $\{f_1, \dots, f_M\} \subset \mathcal{F}_0$ if $\gamma$ is small enough. Thus, it suffices to prove the lower bound of the theorem for $f \in \{f_1, \dots, f_M\}$. But for such $f$ we have $\min_{1 \le j \le M} \|f_j - f\|^2 = 0$, and to finish the proof of the theorem it is sufficient to bound $\sup_{f \in \{f_1, \dots, f_M\}} R(T_n, f)$ from below by $c\,\psi_n^{MS}(M)$ uniformly over all estimators $T_n$. This is done by applying Lemma 1. Using assumption (A2) and the orthogonality of the system $\{f_j\}_{j=1}^M$ on $S$ we get, for $j \ne k$,
$$\|f_j - f_k\|^2 \asymp \int_S (f_j(x) - f_k(x))^2\, dx = \int_S f_j^2(x)\, dx + \int_S f_k^2(x)\, dx \asymp \frac{\gamma^2 \log M}{n}. \qquad (6)$$
Since the $\xi_i$'s are $N(0, \sigma^2)$ random variables, the Kullback divergence $K(P_{f_j}, P_{f_k})$ between $P_{f_j}$ and $P_{f_k}$ satisfies
$$K(P_{f_j}, P_{f_k}) = \frac{n}{2\sigma^2}\, \|f_j - f_k\|^2, \qquad j, k = 1, \dots, M. \qquad (7)$$
In view of (6) and (7), one can choose $\gamma$ small enough to have $K(P_{f_j}, P_{f_k}) \le (1/16) \log M$ for $j, k = 1, \dots, M$. To finish the proof it remains to use this inequality, (6) and Lemma 1.
Theorem 2. Under assumptions (A1)–(A3) we have
$$\sup_{f_1, \dots, f_M \in \mathcal{F}_0}\ \inf_{T_n}\ \sup_{f \in \mathcal{F}_0} \Big[ R(T_n, f) - \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 \Big] \ge c\, \psi_n^{C}(M)$$
for some constant $c > 0$ and any integer $n$, where $\inf_{T_n}$ denotes the infimum over all estimators and
$$\psi_n^{C}(M) = \begin{cases} M/n & \text{if } M \le \sqrt{n}, \\[4pt] \sqrt{\dfrac{1}{n} \log\Big( \dfrac{M}{\sqrt{n}} + 1 \Big)} & \text{if } M > \sqrt{n}. \end{cases}$$
Proof. Consider first the case where $M > \sqrt{n}$. Let the functions $\{\varphi_j\}_{j=1}^M$ be as in the proof of Theorem 1. Set
$$f_j(x) = \gamma\, \varphi_j(x)\, I(x \in S), \qquad j = 1, \dots, M, \qquad (8)$$
for some constant $\gamma$ to be chosen later. Define an integer
$$m = \left\lceil \Big[ c_2\, n \big/ \log\Big( \tfrac{M}{\sqrt{n}} + 1 \Big) \Big]^{1/2} \right\rceil \qquad (9)$$
for a constant $c_2 > 0$ chosen in such a way that $M \ge 6m$. Denote by $\mathcal{C}$ the finite set of those convex combinations of $f_1, \dots, f_M$ in which $m$ of the coefficients $\lambda_j$ are equal to $1/m$ and the remaining $M - m$ coefficients are zero. For every pair of functions $g_1, g_2 \in \mathcal{C}$ we have
$$\|g_1 - g_2\|^2 \le c_3 \gamma^2 / m. \qquad (10)$$
Clearly, $\mathcal{C} \subset \mathcal{F}_0$ for $\gamma$ small enough, and $\min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 = 0$ for any $f \in \mathcal{C}$. Therefore, to prove the theorem for $M > \sqrt{n}$ it is sufficient to bound the supremum $\sup_{f \in \mathcal{C}} R(T_n, f)$ from below by $c \sqrt{\frac{1}{n} \log\big( \frac{M}{\sqrt{n}} + 1 \big)}$ uniformly over all estimators $T_n$. In fact, we will show that the required lower bound holds already for the quantity $\sup_{f \in \mathcal{N}} R(T_n, f)$ where $\mathcal{N}$ is a subset of $\mathcal{C}$ of cardinality $\mathrm{card}(\mathcal{N})$ satisfying
$$\log(\mathrm{card}(\mathcal{N})) \ge c_4\, m \log\Big( \frac{M}{m} + 1 \Big) \qquad (11)$$
and such that for every two functions $g_1, g_2 \in \mathcal{N}$ we have
$$\|g_1 - g_2\|^2 \ge c_5 \gamma^2 / m.$$
The existence of such a subset $\mathcal{N}$ of $\mathcal{C}$ follows, for example, from Lemma 4 of Birgé and Massart (2001). Now, using (7)–(11) and the definition of $m$ we get that, for any $g_1, g_2 \in \mathcal{N}$,
$$K(P_{g_1}, P_{g_2}) \le c_6 \gamma^2 n / m \le c_7 \gamma^2 \log(\mathrm{card}(\mathcal{N})).$$
Finally, we choose $\gamma$ small enough to have $c_7 \gamma^2 < 1/16$ and apply Lemma 1 to get the result.
Consider now the case $M \le \sqrt{n}$. Define the functions $f_j$ by (8) and introduce a finite set of functions
$$\mathcal{C}_1 = \Big\{ f = \frac{1}{\sqrt{n}} \sum_{j=1}^M \omega_j f_j :\ \omega \in \Omega \Big\} \qquad (12)$$
where $\Omega$ is the set of all vectors $\omega$ of length $M$ with binary coordinates $\omega_j \in \{0, 1\}$. Since $M \le \sqrt{n}$ we have $\mathcal{C}_1 \subset \mathcal{F}_0$ for $\gamma$ small enough and $\mathcal{C}_1 \subset \{f_\lambda : \lambda \in \Lambda^M\}$. Therefore, similarly to the previous proofs, it is sufficient to bound $\inf_{T_n} \sup_{f \in \mathcal{C}_1} R(T_n, f)$ from below. Using assumption (A2) we get that, for any $g_1, g_2 \in \mathcal{C}_1$,
$$\|g_1 - g_2\|^2 \le c_8 \gamma^2 M / n. \qquad (13)$$
If $M < 8$ we have $\psi_n^C(M) \asymp 1/n$, and the lower bound of the theorem can be easily deduced from testing between two hypotheses: $f_1 \equiv 0$ and $f_2(x) = n^{-1/2} I(x \in S)$. For $M \ge 8$ it follows from the Varshamov-Gilbert bound (see e.g. Tsybakov (2003), Ch. 2) that there exists a subset $\mathcal{N}_1$ of $\mathcal{C}_1$ such that $\mathrm{card}(\mathcal{N}_1) \ge 2^{M/8}$ and
$$\|g_1 - g_2\|^2 \ge c_9 \gamma^2 M / n \qquad (14)$$
for any $g_1, g_2 \in \mathcal{N}_1$. Using (7) and (13) we get, for any $g_1, g_2 \in \mathcal{N}_1$,
$$K(P_{g_1}, P_{g_2}) \le c_{10} \gamma^2 M \le c_{11} \gamma^2 \log(\mathrm{card}(\mathcal{N}_1)),$$
and by choosing $\gamma$ small enough, we can finish the proof in the same way as in the case $M > \sqrt{n}$.
Note that Theorem 2 generalizes the lower bounds for convex aggregation given by Juditsky and Nemirovski (2000) and Yang (2001). Juditsky and Nemirovski (2000) considered the case of very large $M$ (satisfying $M \ge n/\log n$) and proved the lower bound with the rate $\sqrt{n^{-1} \log M}$, which coincides in order with $\psi_n^C(M)$ in this zone. Yang (2001) obtained lower bounds for convex aggregation with polynomial $M$, i.e. $M \asymp n^{\tau}$ for $0 < \tau < \infty$. His bounds also follow as a special case from Theorem 2.
Theorem 3. Under assumptions (A1), (A2), (A4) we have
$$\sup_{f_1, \dots, f_M \in \mathcal{F}_0}\ \inf_{T_n}\ \sup_{f \in \mathcal{F}_0} \Big[ R(T_n, f) - \min_{\lambda \in \mathbb{R}^M} \|f_\lambda - f\|^2 \Big] \ge c\, \psi_n^{L}(M)$$
for some constant $c > 0$ and any integer $n$, where $\inf_{T_n}$ denotes the infimum over all estimators and
$$\psi_n^{L}(M) = \frac{M}{n}.$$
Proof. Assume w.l.o.g. that there exist disjoint subsets $S_1, \dots, S_M$ of $S$ such that the Lebesgue measure of $S_j$ is $1/M$. Define the functions $f_j(x) = \gamma I(x \in S_j)$, $j = 1, \dots, M$, for a constant $\gamma > 0$, and the set
$$\mathcal{C}_2 = \Big\{ f = \sqrt{\frac{M}{n}} \sum_{j=1}^M \omega_j f_j :\ \omega \in \Omega \Big\}$$
with $\Omega$ as in (12). Assumption (A4) guarantees that $\mathcal{C}_2 \subset \mathcal{F}_0$ for $\gamma$ small enough. Since the functions $f_j$ are mutually orthogonal and $\int f_j^2(x)\, dx = \gamma^2 / M$, the rest of the proof is identical to the part of the proof of Theorem 2 after (13) (with $\mathcal{C}_1$ replaced by $\mathcal{C}_2$), and it is therefore omitted.
3 Attainability of the lower bounds
In this section we show that the lower bounds of Theorems 1–3 give optimal rates of aggregation. We start with the problem of linear aggregation (Problem (L)), for which we construct an aggregate attaining the lower bound of Theorem 3 in order.
Denote by $\mathcal{L}$ the linear span of $f_1, \dots, f_M$. Let $\varphi_1, \dots, \varphi_{M'}$ with $M' \le M$ be an orthonormal basis of $\mathcal{L}$ in $L_2(\mathcal{X}, P^X)$. Consider a linear aggregate
$$\tilde f_n^L(x) = \sum_{j=1}^{M'} \hat\lambda_j \varphi_j(x), \qquad x \in \mathcal{X}, \qquad (15)$$
where
$$\hat\lambda_j = \frac{1}{n} \sum_{i=1}^{n} Y_i \varphi_j(X_i).$$
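To make the construction concrete, here is a minimal numerical sketch of the linear aggregate (15). It assumes that the $P^X$-orthonormal basis functions $\varphi_1, \dots, \varphi_{M'}$ are available as callables (in practice they could be obtained by orthonormalizing $f_1, \dots, f_M$ when $P^X$ is known); the function names and the toy data below are illustrative only, not part of the paper.

```python
import numpy as np

def linear_aggregate(X, Y, basis):
    """Linear aggregate (15): coefficients hat_lambda_j = (1/n) sum_i Y_i phi_j(X_i).

    X     : (n,) array of design points (one-dimensional here for simplicity)
    Y     : (n,) array of responses
    basis : list of callables phi_1, ..., phi_{M'}, assumed orthonormal in L2(P^X)
    Returns a callable x -> f_tilde_n^L(x).
    """
    lam = np.array([np.mean(Y * phi(X)) for phi in basis])  # empirical coefficients
    return lambda x: sum(l * phi(x) for l, phi in zip(lam, basis))

# Illustrative usage with P^X = Uniform[0, 1]; the cosine system below is
# orthonormal in L2([0, 1], dx).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 1.0, size=n)
Y = 0.5 * np.cos(2 * np.pi * X) + 0.1 * rng.standard_normal(n)
basis = [lambda x: np.ones_like(x),
         lambda x: np.sqrt(2.0) * np.cos(np.pi * x),
         lambda x: np.sqrt(2.0) * np.cos(2.0 * np.pi * x)]
f_hat = linear_aggregate(X, Y, basis)
```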
Theorem 4. Let $E(\xi_i) = 0$ and $E(\xi_i^2) \le \sigma^2 < \infty$. Then
$$R(\tilde f_n^L, f) - \min_{\lambda \in \mathbb{R}^M} \|f_\lambda - f\|^2 \le \frac{(\sigma^2 + L^2) M}{n}$$
for any integers $M \ge 2$, $n \ge 1$ and any $f, f_1, \dots, f_M \in \mathcal{F}_0$, where $L$ is the constant in (5).
Proof. We have $\min_{\lambda \in \mathbb{R}^M} \|f_\lambda - f\|^2 = \|f_{\lambda^*} - f\|^2$ where $f_{\lambda^*} = \sum_{j=1}^{M'} \lambda_j^* \varphi_j$, $\lambda_j^* = (f, \varphi_j)$, and $(\cdot, \cdot)$ is the scalar product in $L_2(\mathcal{X}, P^X)$. Now,
$$\|\tilde f_n^L - f\|^2 = \sum_{j=1}^{M'} (\hat\lambda_j - \lambda_j^*)^2 + \|f_{\lambda^*} - f\|^2,$$
and to finish the proof it suffices to note that $E(\hat\lambda_j) = \lambda_j^*$ and $E\big[ (\hat\lambda_j - \lambda_j^*)^2 \big] = \mathrm{Var}(\hat\lambda_j) \le (\sigma^2 + L^2)/n$.
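For completeness, the variance bound used in the last step follows from a short calculation: since $\xi_1$ is independent of $X_1$, $E(\xi_1) = 0$, $E(\xi_1^2) \le \sigma^2$, $\|f\|_\infty \le L$ and $E\,\varphi_j^2(X_1) = 1$,
$$\mathrm{Var}(\hat\lambda_j) \le \frac{1}{n}\, E\big[ Y_1^2 \varphi_j^2(X_1) \big] = \frac{1}{n} \Big( E\big[ f^2(X_1) \varphi_j^2(X_1) \big] + E(\xi_1^2)\, E\big[ \varphi_j^2(X_1) \big] \Big) \le \frac{L^2 + \sigma^2}{n}.$$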
Theorems 3 and 4 imply the following result.
Corollary 1. Under assumptions (A1), (A2), (A4) the sequence $\psi_n^L(M)$ is an optimal rate of linear aggregation.
Consider now the problem of convex aggregation (Problem (C)). If $M \le \sqrt{n}$ the lower bound of Theorem 2 is identical to the linear aggregation case, so we can use the linear aggregate $\tilde f_n^L$ defined in (15), which attains this bound in view of Theorem 4. For $M > \sqrt{n}$ we use a different procedure. To define this procedure, consider first the Kullback divergence based model selection aggregate (Catoni (2001), Yang (2000)). This aggregate, for the problem of model selection with $N$ Borel functions $g_1, \dots, g_N$ on $\mathcal{X}$, is defined by
$$\tilde g_{n,N}(x) = \frac{1}{n+1} \sum_{k=0}^{n} p_{k,N}(x)$$
where
$$p_{0,N}(x) = \frac{1}{N} \sum_{j=1}^N g_j(x)$$
and, for $k = 1, \dots, n$,
$$p_{k,N}(x) = \frac{\sum_{j=1}^N g_j(x) \prod_{i=1}^k \exp\big( -(Y_i - g_j(X_i))^2 / (2\sigma^2) \big)}{\sum_{j=1}^N \prod_{i=1}^k \exp\big( -(Y_i - g_j(X_i))^2 / (2\sigma^2) \big)}.$$
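In words, this is a progressive mixture: for each $k$ the functions $g_j$ are mixed with exponential weights based on the first $k$ observations, and the mixtures are averaged over $k = 0, \dots, n$. A minimal numerical sketch, assuming the noise level $\sigma$ is known and that the data are NumPy arrays (the function name and vectorized layout are ours):

```python
import numpy as np

def progressive_mixture(X, Y, funcs, sigma):
    """Kullback-type model selection aggregate g_tilde_{n,N}: the average over
    k = 0, ..., n of the exponentially weighted mixtures p_{k,N}.

    X, Y  : arrays of the observations (X_i, Y_i), i = 1, ..., n
    funcs : list of N callables g_1, ..., g_N
    sigma : noise standard deviation (assumed known)
    Returns a callable x -> g_tilde_{n,N}(x).
    """
    N = len(funcs)
    G = np.column_stack([g(X) for g in funcs])                # shape (n, N)
    r = (Y[:, None] - G) ** 2                                  # squared residuals
    # log-weights of p_{k,N} for k = 0, ..., n (row k = partial sums up to i = k)
    logw = np.vstack([np.zeros(N), np.cumsum(-r / (2.0 * sigma**2), axis=0)])
    logw -= logw.max(axis=1, keepdims=True)                    # numerical stabilization
    W = np.exp(logw)
    W /= W.sum(axis=1, keepdims=True)                          # each row sums to one
    coef = W.mean(axis=0)                                      # average over k = 0, ..., n
    return lambda x: sum(c * g(x) for c, g in zip(coef, funcs))
```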
As shown by Catoni (2001), for any integers $N \ge 2$, $n \ge 1$ and any functions $f, g_1, \dots, g_N \in \mathcal{F}_0$,
$$R(\tilde g_{n,N}, f) \le \min_{1 \le j \le N} \|g_j - f\|^2 + C_0\, \frac{\log N}{n} \qquad (16)$$
where $C_0 < \infty$ is a constant that depends only on $L$ and $\sigma^2$.
Now, define $m$ by (9) and denote by $\mathcal{C}'$ the set of all sub-convex combinations of $f_1, \dots, f_M$ with weights that are integer multiples of $1/m$. We have
$$\mathrm{card}(\mathcal{C}') = \sum_{j=1}^m \binom{M + j - 1}{j} \le \left( \frac{e(M + m)}{m} \right)^m \qquad (17)$$
(cf., e.g., Devroye, Györfi and Lugosi (1996, p. 218)). Let $N = \mathrm{card}(\mathcal{C}')$, let $g_1, \dots, g_N$ be the elements of $\mathcal{C}'$, and denote by $\tilde f_{n,m}^{MS}$ the corresponding model selection aggregate $\tilde g_{n,N}$. Then (16) takes the form
$$R(\tilde f_{n,m}^{MS}, f) \le \min_{g \in \mathcal{C}'} \|g - f\|^2 + C_0\, \frac{\log(\mathrm{card}(\mathcal{C}'))}{n}, \qquad (18)$$
which holds for every $f \in \mathcal{F}_0$.
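Taking logarithms in (17) and using $M \ge 6m$ gives
$$\log(\mathrm{card}(\mathcal{C}')) \le m \log\frac{e(M+m)}{m} = m \Big[ 1 + \log\Big( \frac{M}{m} + 1 \Big) \Big] \le c\, m \Big[ \log\frac{M}{m} + 1 \Big]$$
for an absolute constant $c$; this is the form of the remainder used when combining (17)–(19) in the proof of Theorem 5 below.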
Finally, define a compound method of convex aggregation by
$$\tilde f_n^C = \begin{cases} \tilde f_n^L & \text{if } M \le \sqrt{n}, \\ \tilde f_{n,m}^{MS} & \text{if } M > \sqrt{n}. \end{cases}$$
This definition emphasizes the intermediate character of convex aggregation: it switches from linear to model selection aggregates. If $M > \sqrt{n}$ we are in a “sparse case”: the convex aggregation oracle concentrates only on a relatively small number of functions $f_j$, and the optimal rate is attained by a model selection type procedure. On the contrary, for $M \le \sqrt{n}$ the convex aggregation oracle does not concentrate on the boundary of the set $\Lambda^M$, and therefore it essentially behaves as a linear oracle solving the unrestricted minimization problem.
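Purely for illustration (and feasible only when $M$ and $m$ are very small, since $\mathrm{card}(\mathcal{C}')$ grows as in (17)), the following sketch enumerates the grid $\mathcal{C}'$ and feeds it to the `progressive_mixture` function from the sketch above; the constant `c2` in (9) and the function names are ours, not specified in the paper.

```python
from itertools import product

import numpy as np

# progressive_mixture is defined in the sketch following the definition of p_{k,N} above.

def subconvex_grid(funcs, m):
    """The set C': sub-convex combinations of f_1, ..., f_M whose weights are
    nonnegative integer multiples of 1/m summing to at most 1."""
    grid = []
    for ks in product(range(m + 1), repeat=len(funcs)):
        if 0 < sum(ks) <= m:
            w = np.array(ks) / m
            grid.append(lambda x, w=w: sum(wj * g(x) for wj, g in zip(w, funcs)))
    return grid

def convex_aggregate_large_M(X, Y, funcs, sigma, c2=1.0):
    """Compound convex aggregate f_tilde_n^C in the regime M > sqrt(n):
    exponential-weights model selection over the grid C', with m as in (9).
    (For M <= sqrt(n) the linear aggregate (15) would be used instead.)"""
    n, M = len(Y), len(funcs)
    m = int(np.ceil(np.sqrt(c2 * n / np.log(M / np.sqrt(n) + 1.0))))  # definition (9)
    return progressive_mixture(X, Y, subconvex_grid(funcs, m), sigma)
```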
Theorem 5. Under assumption (A1) we have
$$R(\tilde f_n^C, f) - \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 \le C\, \psi_n^C(M)$$
for some constant $C < \infty$, any integers $M \ge 2$, $n \ge 1$ and any $f, f_1, \dots, f_M \in \mathcal{F}_0$.
Proof. In view of Theorem 4, it suffices to consider the case $M > \sqrt{n}$. Let $\lambda^0 = (\lambda^0_1, \dots, \lambda^0_M)$ be the weights of a convex aggregation oracle, i.e. a vector $\lambda^0$ satisfying $\|f_{\lambda^0} - f\| = \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|$. Applying the argument of Nemirovski (2000, p. 192–193) to the vector $\lambda^0$ instead of $\lambda(z)$, and putting there $K = m$, we find
$$\sum_{i=1}^{J} p_i \|h_i - f\|^2 \le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + \frac{L^2}{m}$$
where $J = \mathrm{card}(\mathcal{C}')$, $(p_1, \dots, p_J)$ is a probability vector (i.e. $p_1 + \cdots + p_J = 1$, $p_j \ge 0$) that depends on $\lambda^0$, and the functions $h_i$ are the elements of the set $\mathcal{C}'$. This immediately implies that
$$\min_{g \in \mathcal{C}'} \|g - f\|^2 \le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + \frac{L^2}{m}. \qquad (19)$$
Combining (17)–(19) we obtain
$$\begin{aligned} R(\tilde f_{n,m}^{MS}, f) &\le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + \frac{L^2}{m} + c_{12}\, \frac{m}{n} \Big[ \log\Big( \frac{M}{m} \Big) + 1 \Big] \\ &\le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + C \sqrt{\frac{1}{n} \log\Big( \frac{M}{\sqrt{n}} + 1 \Big)} \end{aligned}$$
for a constant $C < \infty$.
Theorems 2 and 5 imply the following result.
Corollary 2. Under assumptions (A1)–(A3) the sequence $\psi_n^C(M)$ is an optimal rate of convex aggregation.
Finally, for Problem (MS), the attainability of the lower bound of Theorem 1 follows immediately from (16) with $N = M$ and $g_j = f_j$. Thus, we have the following corollary.
Corollary 3. Under assumptions (A1)–(A3) the sequence $\psi_n^{MS}(M)$ is an optimal rate of model selection aggregation.
References
1. Birgé, L.: Model selection for Gaussian regression with random design.
Prépublication n. 783, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 - Paris 7 (2002).
2. Birgé, L., Massart, P.: Gaussian model selection. J. Eur. Math. Soc. 3 (2001) 203–
268.
3. Catoni, O.: Statistical Learning Theory and Stochastic Optimization. Ecole d’Eté
de Probabilités de Saint-Flour 2001, Lecture Notes in Mathematics, Springer, N.Y.
(to appear).
4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition.
Springer, N.Y. (1996).
5. Györfi, L., Kohler, M., Krzyżak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer, N.Y. (2002).
6. Juditsky, A., Nemirovski, A.: Functional aggregation for nonparametric estimation.
Annals of Statistics 28 (2000) 681–712.
7. Nemirovski, A.: Topics in Non-parametric Statistics. Ecole d’Eté de Probabilités
de Saint-Flour XXVIII - 1998, Lecture Notes in Mathematics, v. 1738, Springer,
N.Y. (2000).
8. Tsybakov, A.: Introduction à l’estimation non-paramétrique. (2003) Springer (to
appear).
9. Wegkamp, M.: Model selection in nonparametric regression. Annals of Statistics
31 (2003) (to appear).
10. Yang, Y.: Combining different procedures for adaptive regression. J. of Multivariate
Analysis 74 (2000) 135–161.
11. Yang, Y.: Aggregating regression procedures for a better performance (2001).
Manuscript.