Optimal Rates of Aggregation

Alexandre B. Tsybakov
Laboratoire de Probabilités et Modèles Aléatoires, Université Paris 6, 4 pl. Jussieu, 75252 Paris Cedex 05, France, [email protected]

Abstract. We study the problem of aggregation of M arbitrary estimators of a regression function with respect to the mean squared risk. Three main types of aggregation are considered: model selection, convex and linear aggregation. We define the notion of optimal rate of aggregation in an abstract context and prove lower bounds valid for any method of aggregation. We then construct procedures that attain these bounds, thus establishing optimal rates of linear, convex and model selection type aggregation.

1 Introduction

Consider the regression model

$$Y_i = f(X_i) + \xi_i, \qquad i = 1, \dots, n, \qquad (1)$$

where $X_1, \dots, X_n$ are i.i.d. random vectors with values in a Borel subset $\mathcal{X}$ of $\mathbf{R}^d$, the $\xi_i$ are i.i.d. zero-mean random variables in $\mathbf{R}$ such that $(\xi_1, \dots, \xi_n)$ is independent of $(X_1, \dots, X_n)$, and $f : \mathcal{X} \to \mathbf{R}$ is an unknown regression function. The problem is to estimate the function $f$ from the data $D_n = ((X_1, Y_1), \dots, (X_n, Y_n))$. Denote by $P_f$ and $P^X$ the probability distributions of $D_n$ and of $X_1$ respectively. For an estimator $\hat{f}_n$ of $f$ based on the sample $D_n$, define the $L_2$-risk

$$R(\hat{f}_n, f) = \mathbf{E}_f \|\hat{f}_n - f\|^2,$$

where $\mathbf{E}_f$ denotes the expectation w.r.t. the measure $P_f$ and, for a Borel function $g : \mathcal{X} \to \mathbf{R}$,

$$\|g\| = \left( \int_{\mathcal{X}} g^2(x)\, P^X(dx) \right)^{1/2}.$$

Suppose that we have $M \ge 2$ arbitrary estimators $f_{n,1}, \dots, f_{n,M}$ of the function $f$ based on the sample $D_n$. The aim of aggregation is to construct a new estimator of $f$ (called an aggregate) that mimics, in a certain sense, the behavior of the best among the estimators $f_{n,j}$. We will consider the following three well-known aggregation problems (cf. Nemirovski (2000)).

Problem (L). (Linear aggregation.) Find an aggregate estimator $\tilde{f}_n$ which is at least as good as the best linear combination of $f_{n,1}, \dots, f_{n,M}$, up to a small remainder term, i.e.

$$R(\tilde{f}_n, f) \le \inf_{\lambda \in \mathbf{R}^M} R(f_\lambda^*, f) + \Delta_{n,M}^{L}$$

for every $f$ belonging to a large class of functions $\mathcal{F}$, where

$$f_\lambda^* = \sum_{j=1}^M \lambda_j f_{n,j}, \qquad \lambda = (\lambda_1, \dots, \lambda_M),$$

and $\Delta_{n,M}^{L}$ is a remainder term that does not depend on $f$.

Problem (C). (Convex aggregation.) Find an aggregate estimator $\tilde{f}_n$ which is at least as good as the best convex combination of $f_{n,1}, \dots, f_{n,M}$, up to a small remainder term, i.e.

$$R(\tilde{f}_n, f) \le \inf_{\lambda \in \Lambda^M} R(f_\lambda^*, f) + \Delta_{n,M}^{C}$$

for every $f$ belonging to a large class of functions $\mathcal{F}$, where

$$\Lambda^M = \Big\{ \lambda \in \mathbf{R}^M : \lambda_j \ge 0, \ \sum_{j=1}^M \lambda_j \le 1 \Big\}$$

and $\Delta_{n,M}^{C}$ is a remainder term that does not depend on $f$.

Problem (MS). (Model selection aggregation.) Find an aggregate estimator $\tilde{f}_n$ which is at least as good as the best among $f_{n,1}, \dots, f_{n,M}$, up to a small remainder term, i.e.

$$R(\tilde{f}_n, f) \le \min_{1 \le j \le M} R(f_{n,j}, f) + \Delta_{n,M}^{MS}$$

for every $f$ belonging to a large class of functions $\mathcal{F}$, where $\Delta_{n,M}^{MS}$ is a remainder term that does not depend on $f$.

Clearly,

$$\min_{1 \le j \le M} R(f_{n,j}, f) \ge \inf_{\lambda \in \Lambda^M} R(f_\lambda^*, f) \ge \inf_{\lambda \in \mathbf{R}^M} R(f_\lambda^*, f). \qquad (2)$$

The smallest possible remainder terms $\Delta_{n,M}^{L}$, $\Delta_{n,M}^{C}$ and $\Delta_{n,M}^{MS}$ characterize the price to pay for aggregation. We will see that they satisfy a relation that is, in a sense, inverse to (2): the largest price $\Delta_{n,M}$ is paid for linear aggregation and the smallest one for model selection aggregation; convex aggregation has an intermediate price.
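To make the three oracle benchmarks in (2) concrete, the following small numerical sketch approximates $\min_j \|f_{n,j} - f\|^2$, $\inf_{\lambda \in \Lambda^M} \|f_\lambda^* - f\|^2$ and $\inf_{\lambda \in \mathbf{R}^M} \|f_\lambda^* - f\|^2$ by Monte Carlo for one arbitrary choice of $f$ and $f_{n,1}, \dots, f_{n,M}$. It is purely illustrative and not part of the paper's argument; the choice of $P^X$, of the functions, and the use of a Frank-Wolfe iteration for the convex oracle are assumptions made only for the example.

```python
# Illustrative sketch (not from the paper): Monte Carlo approximation of the three
# oracle risks appearing in (2) for fixed functions f, f_1, ..., f_M.
# Assumptions: P^X = Uniform(0, 1); the functions below and the Frank-Wolfe solver
# for the convex oracle are hypothetical choices made for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_mc, M = 100_000, 5
x = rng.uniform(size=n_mc)                     # Monte Carlo sample from P^X = U(0, 1)

f = np.sin(2 * np.pi * x)                      # values of the target f on the sample
F = np.stack([np.cos(np.pi * j * x) for j in range(1, M + 1)], axis=1)  # f_{n,j}(x)

def risk(pred):
    """Monte Carlo approximation of ||pred - f||^2 under P^X."""
    return float(np.mean((pred - f) ** 2))

# Model selection oracle: the best single f_{n,j}.
ms_oracle = min(risk(F[:, j]) for j in range(M))

# Linear oracle: unrestricted least squares over lambda in R^M.
lam_lin, *_ = np.linalg.lstsq(F, f, rcond=None)
lin_oracle = risk(F @ lam_lin)

# Convex oracle: Frank-Wolfe over Lambda^M = {lambda_j >= 0, sum_j lambda_j <= 1},
# whose vertices are 0 and the unit coordinate vectors.
lam = np.zeros(M)
for k in range(500):
    grad = 2.0 * F.T @ (F @ lam - f) / n_mc
    j = int(np.argmin(grad))
    vertex = np.zeros(M)
    if grad[j] < 0:                            # otherwise the minimizing vertex is 0
        vertex[j] = 1.0
    lam += (2.0 / (k + 2)) * (vertex - lam)
conv_oracle = risk(F @ lam)

# Up to Monte Carlo and optimization error these satisfy inequality (2):
print(ms_oracle, ">=", conv_oracle, ">=", lin_oracle)
```

Up to Monte Carlo and optimization error, the three printed values decrease in the order stated in (2).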
Aggregation of arbitrary estimators for regression with random design (1) under the $L_2$-risk has been studied by several authors, mostly in the case of model selection (Yang (2000), Catoni (2001), Wegkamp (2000), Györfi, Kohler, Krzyżak and Walk (2002), Birgé (2002)) and convex aggregation (Nemirovski (2000), Juditsky and Nemirovski (2000), Yang (2001)). Linear aggregation for the Gaussian white noise model is discussed by Nemirovski (2000).

Aggregation procedures are typically based on sample splitting. The initial sample $D_n$ is divided into two independent subsamples $D_m^1$ and $D_l^2$ of sizes $m$ and $l$ respectively, where $m \gg l$ and $m + l = n$. The first subsample $D_m^1$ is used to construct the estimators $f_{n,1}, \dots, f_{n,M}$ and the second subsample $D_l^2$ is used to aggregate them, i.e. to construct $\tilde{f}_n$ (thus, $\tilde{f}_n$ is measurable w.r.t. the whole sample $D_n$). In this paper we will not consider sample splitting schemes but rather deal with an idealized framework (following Nemirovski (2000), Juditsky and Nemirovski (2000)) where the first subsample is fixed, and thus instead of the estimators $f_{n,1}, \dots, f_{n,M}$ we have fixed functions $f_1, \dots, f_M$. The problem is to find linear, convex and model selection aggregates of $f_1, \dots, f_M$ based on the sample $D_n$ that converge with the fastest possible rate (i.e. with the smallest possible remainder terms $\Delta_{n,M}$) in a minimax sense. A partial solution of this problem for the case of convex aggregation has been given by Juditsky and Nemirovski (2000) and Yang (2001). Here we solve the problem for all three types of aggregation, in particular improving these results on convex aggregation. The main goal of this paper is to find optimal rates of aggregation in the sense of a general definition given below.

2 Main definition and lower bounds

We start with the following definition, which covers a more general framework than the one considered in the paper.

Definition 1. Let $\mathcal{H}$ be a given abstract index set and let $\mathcal{F}$, $\mathcal{F}'$ be given classes of Borel functions on $\mathcal{X}$. A sequence of positive numbers $\psi_n$ is called an optimal rate of aggregation for $(\mathcal{H}, \mathcal{F}, \mathcal{F}')$ if

– for any family of Borel functions $\{f_\lambda, \lambda \in \mathcal{H}\}$ indexed by $\mathcal{H}$ and contained in $\mathcal{F}'$ there exists an estimator $\tilde{f}_n$ of $f$ (an aggregate) such that

$$\sup_{f \in \mathcal{F}} \Big[ R(\tilde{f}_n, f) - \inf_{\lambda \in \mathcal{H}} \|f_\lambda - f\|^2 \Big] \le C \psi_n \qquad (3)$$

for some constant $C < \infty$ and any integer $n$, and

– there exists a family of Borel functions $\{f_\lambda, \lambda \in \mathcal{H}\}$ indexed by $\mathcal{H}$ and contained in $\mathcal{F}'$ such that for all estimators $T_n$ of $f$ we have

$$\sup_{f \in \mathcal{F}} \Big[ R(T_n, f) - \inf_{\lambda \in \mathcal{H}} \|f_\lambda - f\|^2 \Big] \ge c \psi_n \qquad (4)$$

for some constant $c > 0$ and any integer $n$.

In this paper we are interested in the following index sets:

$$\mathcal{H} = \begin{cases} \{1, \dots, M\} & \text{for Problem (MS)}, \\ \Lambda^M & \text{for Problem (C)}, \\ \mathbf{R}^M & \text{for Problem (L)}, \end{cases}$$

and we consider $\mathcal{F} = \mathcal{F}_0$ defined by

$$\mathcal{F}_0 = \{ f : \|f\|_\infty \le L \}, \qquad (5)$$

where $\|\cdot\|_\infty$ denotes the $L_\infty$ norm associated with the measure $P^X$ and $L < \infty$ is an unknown constant. We take also $\mathcal{F}' = \mathcal{F}_0$ for Problems (MS) and (C), and $\mathcal{F}' = L_2(\mathcal{X}, P^X)$ for Problem (L). Optimal rates of aggregation for $(\{1, \dots, M\}, \mathcal{F}_0, \mathcal{F}_0)$, for $(\Lambda^M, \mathcal{F}_0, \mathcal{F}_0)$ and for $(\mathbf{R}^M, \mathcal{F}_0, L_2(\mathcal{X}, P^X))$ will be called, for brevity, optimal rates of model selection, convex and linear aggregation respectively.

In the rest of the paper the notation $f_\lambda$ for a vector $\lambda = (\lambda_1, \dots, \lambda_M) \in \mathbf{R}^M$ is understood in the following sense:

$$f_\lambda = \sum_{j=1}^M \lambda_j f_j.$$
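As a purely illustrative companion to Definition 1, the sketch below simulates the excess risk appearing in (3) for Problem (MS), i.e. $\mathcal{H} = \{1, \dots, M\}$, with data generated from model (1). The aggregate used is a naive empirical risk minimizer over the family, chosen only as a simple stand-in; it is not the procedure of Section 3 and is not claimed to attain the optimal rate. The regression function, the family $f_1, \dots, f_M$ and all constants are arbitrary assumptions.

```python
# Illustrative simulation (not from the paper) of the excess risk bounded in (3)
# for Problem (MS). The aggregate is a naive empirical-risk-minimizing selector,
# used only as a stand-in; all functions and constants are arbitrary choices.
import numpy as np

rng = np.random.default_rng(1)
n, M, sigma = 200, 20, 0.5
f = lambda x: np.sin(2 * np.pi * x)
family = [lambda x, j=j: np.sin(2 * np.pi * x + j / M) for j in range(1, M + 1)]

x_mc = rng.uniform(size=100_000)                 # Monte Carlo sample from P^X = U(0, 1)
def sq_norm(g):
    """Monte Carlo approximation of ||g||^2 w.r.t. P^X."""
    return float(np.mean(g(x_mc) ** 2))

oracle = min(sq_norm(lambda x, fj=fj: fj(x) - f(x)) for fj in family)

def ms_erm_aggregate(X, Y):
    """Select the f_j with the smallest empirical squared error on the sample."""
    errs = [np.mean((Y - fj(X)) ** 2) for fj in family]
    return family[int(np.argmin(errs))]

# Monte Carlo estimate of R(f~_n, f) - min_j ||f_j - f||^2, printed next to
# psi_n^{MS}(M) = log(M)/n for comparison.
excess = []
for _ in range(200):
    X = rng.uniform(size=n)
    Y = f(X) + sigma * rng.normal(size=n)
    fhat = ms_erm_aggregate(X, Y)
    excess.append(sq_norm(lambda x: fhat(x) - f(x)) - oracle)
print(np.mean(excess), np.log(M) / n)
```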
In this section we prove lower bounds of the type (4) for model selection, convex and linear aggregation. The proofs will be based on the following lemma on minimax lower bounds, which can be obtained, for example, by combining Theorems 2.2 and 2.5 in Tsybakov (2003).

Lemma 1. Let $\mathcal{C}$ be a finite set of functions on $\mathcal{X}$ such that $N = \mathrm{card}(\mathcal{C}) \ge 2$,

$$\|f - g\|^2 \ge 4\psi_n > 0, \qquad \forall\, f, g \in \mathcal{C},\ f \ne g,$$

and the Kullback divergences $K(P_f, P_g)$ between the measures $P_f$ and $P_g$ satisfy

$$K(P_f, P_g) \le (1/16) \log N, \qquad \forall\, f, g \in \mathcal{C}.$$

Then

$$\inf_{T_n} \sup_{f \in \mathcal{C}} R(T_n, f) \ge c_1 \psi_n,$$

where $\inf_{T_n}$ denotes the infimum over all estimators and $c_1 > 0$ is a constant.

Throughout the paper we denote by $c_i$ finite positive constants. Introduce the following assumptions.

(A1) The errors $\xi_i$ are i.i.d. Gaussian $N(0, \sigma^2)$ random variables, $0 < \sigma < \infty$.
(A2) There exists a cube $S \subset \mathcal{X}$ such that $P^X$ admits a bounded density $\mu(\cdot)$ on $S$ w.r.t. the Lebesgue measure and $\mu(x) \ge \mu_0 > 0$ for all $x \in S$.
(A3) There exists a constant $c_0$ such that $\log M \le c_0 n$.
(A4) There exists a constant $c_0$ such that $M \le c_0 n$.

Theorem 1. Under assumptions (A1)–(A3) we have

$$\sup_{f_1, \dots, f_M \in \mathcal{F}_0} \inf_{T_n} \sup_{f \in \mathcal{F}_0} \Big[ R(T_n, f) - \min_{1 \le j \le M} \|f_j - f\|^2 \Big] \ge c\, \psi_n^{MS}(M)$$

for some constant $c > 0$ and any integer $n$, where $\inf_{T_n}$ denotes the infimum over all estimators and

$$\psi_n^{MS}(M) = \frac{\log M}{n}.$$

Proof. Let $\{\varphi_j\}_{j=1}^M$ be an orthogonal system of functions in $L_2(S, dx)$ for the cube $S$ given in assumption (A2), satisfying $\|\varphi_j\|_\infty \le A < \infty$ for $j = 1, \dots, M$. Such functions can be constructed, for example, by taking $\varphi_j(x) = A_1 \cos(a j x_1 + b)$ for $x \in S$ and suitably chosen constants $A_1$, $a$ and $b$, where $x_1$ is the first coordinate of $x$. Define the functions

$$f_j(x) = \gamma \sqrt{\frac{\log M}{n}}\, \varphi_j(x) I(x \in S), \qquad j = 1, \dots, M,$$

where $I(\cdot)$ denotes the indicator function and $\gamma$ is a positive constant to be chosen. In view of assumption (A3), $\{f_1, \dots, f_M\} \subset \mathcal{F}_0$ if $\gamma$ is small enough. Thus, it suffices to prove the lower bound of the theorem for $f \in \{f_1, \dots, f_M\}$. But for such $f$ we have $\min_{1 \le j \le M} \|f_j - f\|^2 = 0$, and to finish the proof of the theorem it is sufficient to bound from below by $c\, \psi_n^{MS}(M)$ the quantity $\sup_{f \in \{f_1, \dots, f_M\}} R(T_n, f)$ uniformly over all estimators $T_n$. This is done by applying Lemma 1. Using assumption (A2) and the orthogonality of the system $\{f_j\}_{j=1}^M$ on $S$ we get, for $j \ne k$,

$$\|f_j - f_k\|^2 \asymp \int_S (f_j(x) - f_k(x))^2\, dx = \int_S f_j^2(x)\, dx + \int_S f_k^2(x)\, dx \asymp \frac{\gamma^2 \log M}{n}. \qquad (6)$$

Since the $\xi_i$'s are $N(0, \sigma^2)$ random variables, the Kullback divergence $K(P_{f_j}, P_{f_k})$ between $P_{f_j}$ and $P_{f_k}$ satisfies

$$K(P_{f_j}, P_{f_k}) = \frac{n}{2\sigma^2}\, \|f_j - f_k\|^2, \qquad j, k = 1, \dots, M. \qquad (7)$$

In view of (6) and (7), one can choose $\gamma$ small enough to have $K(P_{f_j}, P_{f_k}) \le (1/16) \log M$, $j, k = 1, \dots, M$. To finish the proof it remains to use this inequality, (6) and Lemma 1.
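The construction used in this proof is easy to check numerically. The sketch below builds the hypotheses $f_j$ on $S = [0, 1]$ with $\mu$ the Lebesgue measure, evaluates the squared distances in (6), and converts them into Kullback divergences via (7). The constants $A_1$, $a$, $b$, $\gamma$, $\sigma$ are arbitrary illustrative choices, not the ones required by the proof.

```python
# Numerical sanity check (not from the paper) of the hypothesis family in the proof
# of Theorem 1, with S = [0, 1], mu = Lebesgue, and arbitrary illustrative constants.
import numpy as np

n, M = 1000, 50
A1, a, b = 1.0, 2.0 * np.pi, 0.0
gamma, sigma = 0.1, 1.0

x = np.linspace(0.0, 1.0, 10_000)                # fine grid on S = [0, 1]
phi = np.stack([A1 * np.cos(a * j * x + b) for j in range(1, M + 1)])
f = gamma * np.sqrt(np.log(M) / n) * phi         # hypotheses f_j, j = 1, ..., M

def sq_dist(u, v):
    """Approximate int_S (u - v)^2 dx on S = [0, 1] by an average over the grid."""
    return float(np.mean((u - v) ** 2))

dists = [sq_dist(f[j], f[k]) for j in range(M) for k in range(j + 1, M)]
kls = [n * d / (2.0 * sigma ** 2) for d in dists]   # Kullback divergences from (7)

# The distances scale like gamma^2 log(M) / n, as in (6), and for gamma small
# enough max(kls) <= (1/16) log M, which is what Lemma 1 requires.
print(max(dists), gamma ** 2 * np.log(M) / n, max(kls), np.log(M) / 16.0)
```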
Theorem 2. Under assumptions (A1)–(A3) we have

$$\sup_{f_1, \dots, f_M \in \mathcal{F}_0} \inf_{T_n} \sup_{f \in \mathcal{F}_0} \Big[ R(T_n, f) - \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 \Big] \ge c\, \psi_n^{C}(M)$$

for some constant $c > 0$ and any integer $n$, where $\inf_{T_n}$ denotes the infimum over all estimators and

$$\psi_n^{C}(M) = \begin{cases} M/n & \text{if } M \le \sqrt{n}, \\[4pt] \sqrt{\dfrac{1}{n} \log\Big(\dfrac{M}{\sqrt{n}} + 1\Big)} & \text{if } M > \sqrt{n}. \end{cases}$$

Proof. Consider first the case where $M > \sqrt{n}$. Let the functions $\{\varphi_j\}_{j=1}^M$ be as in the proof of Theorem 1. Set

$$f_j(x) = \gamma\, \varphi_j(x) I(x \in S), \qquad j = 1, \dots, M, \qquad (8)$$

for some constant $\gamma$ to be chosen later. Define an integer

$$m = \left\lceil \Big[ c_2\, n \big/ \log\Big(\frac{M}{\sqrt{n}} + 1\Big) \Big]^{1/2} \right\rceil \qquad (9)$$

for a constant $c_2 > 0$ chosen in such a way that $M \ge 6m$. Denote by $\mathcal{C}$ the finite set of convex combinations of $f_1, \dots, f_M$ in which $m$ of the coefficients $\lambda_j$ are equal to $1/m$ and the remaining $M - m$ coefficients are zero. For every pair of functions $g_1, g_2 \in \mathcal{C}$ we have

$$\|g_1 - g_2\|^2 \le c_3 \gamma^2 / m. \qquad (10)$$

Clearly, $\mathcal{C} \subset \mathcal{F}_0$ for $\gamma$ small enough and $\min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 = 0$ for any $f \in \mathcal{C}$. Therefore, to prove the theorem for $M > \sqrt{n}$ it is sufficient to bound from below by $c \sqrt{\frac{1}{n} \log\big(\frac{M}{\sqrt{n}} + 1\big)}$ the supremum $\sup_{f \in \mathcal{C}} R(T_n, f)$ uniformly over all estimators $T_n$. In fact, we will show that the required lower bound holds already for the quantity $\sup_{f \in \mathcal{N}} R(T_n, f)$ where $\mathcal{N}$ is a subset of $\mathcal{C}$ of cardinality $\mathrm{card}(\mathcal{N})$ satisfying

$$\log(\mathrm{card}(\mathcal{N})) \ge c_4\, m \log\Big(\frac{M}{m} + 1\Big) \qquad (11)$$

and such that for every two functions $g_1, g_2 \in \mathcal{N}$ we have $\|g_1 - g_2\|^2 \ge c_5 \gamma^2 / m$. The existence of such a subset $\mathcal{N}$ of $\mathcal{C}$ follows, for example, from Lemma 4 of Birgé and Massart (2001). Now, using (7)–(11) and the definition of $m$ we get that, for any $g_1, g_2 \in \mathcal{N}$,

$$K(P_{g_1}, P_{g_2}) \le c_6 \gamma^2 n / m \le c_7 \gamma^2 \log(\mathrm{card}(\mathcal{N})).$$

Finally, we choose $\gamma$ small enough to have $c_7 \gamma^2 < 1/16$ and we apply Lemma 1 to get the result.

Consider now the case $M \le \sqrt{n}$. Define the functions $f_j$ by (8) and introduce a finite set of functions

$$\mathcal{C}_1 = \Big\{ f = \frac{1}{\sqrt{n}} \sum_{j=1}^M \omega_j f_j : \ \omega \in \Omega \Big\} \qquad (12)$$

where $\Omega$ is the set of all vectors $\omega$ of length $M$ with binary coordinates $\omega_j \in \{0, 1\}$. Since $M \le \sqrt{n}$ we have $\mathcal{C}_1 \subset \mathcal{F}_0$ for $\gamma$ small enough and $\mathcal{C}_1 \subset \{f_\lambda : \lambda \in \Lambda^M\}$. Therefore, similarly to the previous proofs, it is sufficient to bound from below $\inf_{T_n} \sup_{f \in \mathcal{C}_1} R(T_n, f)$. Using assumption (A2) we get that, for any $g_1, g_2 \in \mathcal{C}_1$,

$$\|g_1 - g_2\|^2 \le c_8 \gamma^2 M / n. \qquad (13)$$

If $M < 8$ we have $\psi_n^C(M) \asymp 1/n$, and the lower bound of the theorem can be easily deduced from testing between two hypotheses: $f_1 \equiv 0$ and $f_2(x) = n^{-1/2} I(x \in S)$. For $M \ge 8$ it follows from the Varshamov-Gilbert bound (see e.g. Tsybakov (2003), Ch. 2) that there exists a subset $\mathcal{N}_1$ of $\mathcal{C}_1$ such that $\mathrm{card}(\mathcal{N}_1) \ge 2^{M/8}$ and

$$\|g_1 - g_2\|^2 \ge c_9 \gamma^2 M / n \qquad (14)$$

for any $g_1, g_2 \in \mathcal{N}_1$. Using (7) and (13) we get, for any $g_1, g_2 \in \mathcal{N}_1$,

$$K(P_{g_1}, P_{g_2}) \le c_{10} \gamma^2 M \le c_{11} \gamma^2 \log(\mathrm{card}(\mathcal{N}_1)),$$

and by choosing $\gamma$ small enough we can finish the proof in the same way as in the case $M > \sqrt{n}$.

Note that Theorem 2 generalizes the lower bounds for convex aggregation given by Juditsky and Nemirovski (2000) and Yang (2001). Juditsky and Nemirovski (2000) considered the case of very large $M$ (satisfying $M \ge n/\log n$) and proved the lower bound with the rate $\sqrt{n^{-1} \log M}$, which coincides in order with $\psi_n^C(M)$ in this zone. Yang (2001) obtained lower bounds for convex aggregation with polynomial $M$, i.e. $M \asymp n^\tau$ for $0 < \tau < \infty$. His bounds also follow as a special case from Theorem 2.

Theorem 3. Under assumptions (A1), (A2), (A4) we have

$$\sup_{f_1, \dots, f_M \in \mathcal{F}_0} \inf_{T_n} \sup_{f \in \mathcal{F}_0} \Big[ R(T_n, f) - \min_{\lambda \in \mathbf{R}^M} \|f_\lambda - f\|^2 \Big] \ge c\, \psi_n^{L}(M)$$

for some constant $c > 0$ and any integer $n$, where $\inf_{T_n}$ denotes the infimum over all estimators and

$$\psi_n^{L}(M) = \frac{M}{n}.$$

Proof. Assume w.l.o.g. that there exist disjoint subsets $S_1, \dots, S_M$ of $S$ such that the Lebesgue measure of $S_j$ is $1/M$. Define the functions $f_j(x) = \gamma I(x \in S_j)$, $j = 1, \dots, M$, for a constant $\gamma > 0$, and the set

$$\mathcal{C}_2 = \Big\{ f = \sqrt{\frac{M}{n}} \sum_{j=1}^M \omega_j f_j : \ \omega \in \Omega \Big\}$$

with $\Omega$ as in (12). Assumption (A4) guarantees that $\mathcal{C}_2 \subset \mathcal{F}_0$ for $\gamma$ small enough. Since the functions $f_j$ are mutually orthogonal and $\int f_j^2(x)\, dx = \gamma^2 / M$, the rest of the proof is identical to the part of the proof of Theorem 2 after (13) (with $\mathcal{C}_1$ replaced by $\mathcal{C}_2$), and it is therefore omitted.
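The proofs of Theorems 2 and 3 rely on the existence of a large, well-separated subset of binary vectors (Lemma 4 of Birgé and Massart (2001), the Varshamov-Gilbert bound). The toy greedy search below only illustrates this combinatorial fact for a small $M$; it is not the argument used in the paper and, because it enumerates all of $\{0, 1\}^M$, it works only in modest dimensions.

```python
# Illustrative greedy construction (not the paper's argument) of a subset of
# {0,1}^M whose elements pairwise differ in more than M/8 coordinates, in the
# spirit of the Varshamov-Gilbert bound used after (13). Exhaustive enumeration
# limits this toy version to small M.
import itertools
import numpy as np

def greedy_packing(M, min_hamming):
    """Greedily collect binary vectors at pairwise Hamming distance >= min_hamming."""
    kept = np.zeros((0, M), dtype=int)
    for omega in itertools.product((0, 1), repeat=M):
        omega = np.asarray(omega)
        if kept.shape[0] == 0 or (kept != omega).sum(axis=1).min() >= min_hamming:
            kept = np.vstack([kept, omega[None, :]])
    return kept

M = 12
subset = greedy_packing(M, M // 8 + 1)      # pairwise distance strictly larger than M/8
# The Varshamov-Gilbert bound guarantees a subset of cardinality at least 2^{M/8};
# the greedy set found here is far larger than that guarantee.
print(len(subset), 2.0 ** (M / 8))
```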
3 Attainability of the lower bounds

In this section we show that the lower bounds of Theorems 1–3 give optimal rates of aggregation. We start with the problem of linear aggregation (Problem (L)), in which case we construct an aggregate attaining in order the lower bound of Theorem 3.

Denote by $\mathcal{L}$ the linear span of $f_1, \dots, f_M$. Let $\varphi_1, \dots, \varphi_{M'}$ with $M' \le M$ be an orthonormal basis of $\mathcal{L}$ in $L_2(\mathcal{X}, P^X)$. Consider the linear aggregate

$$\tilde{f}_n^{L}(x) = \sum_{j=1}^{M'} \hat{\lambda}_j \varphi_j(x), \qquad x \in \mathcal{X}, \qquad (15)$$

where

$$\hat{\lambda}_j = \frac{1}{n} \sum_{i=1}^n Y_i \varphi_j(X_i).$$

Theorem 4. Let $\mathbf{E}(\xi_i) = 0$, $\mathbf{E}(\xi_i^2) \le \sigma^2 < \infty$. Then

$$R(\tilde{f}_n^{L}, f) - \min_{\lambda \in \mathbf{R}^M} \|f_\lambda - f\|^2 \le \frac{(\sigma^2 + L^2) M}{n}$$

for any integers $M \ge 2$, $n \ge 1$ and any $f, f_1, \dots, f_M \in \mathcal{F}_0$, where $L$ is the constant in (5).

Proof. We have $\min_{\lambda \in \mathbf{R}^M} \|f_\lambda - f\|^2 = \|f_{\lambda^*} - f\|^2$ where $f_{\lambda^*} = \sum_{j=1}^{M'} \lambda_j^* \varphi_j$, $\lambda_j^* = (f, \varphi_j)$, and $(\cdot, \cdot)$ is the scalar product in $L_2(\mathcal{X}, P^X)$. Now,

$$\|\tilde{f}_n^{L} - f\|^2 = \sum_{j=1}^{M'} (\hat{\lambda}_j - \lambda_j^*)^2 + \|f_{\lambda^*} - f\|^2,$$

and to finish the proof it suffices to note that $\mathbf{E}(\hat{\lambda}_j) = \lambda_j^*$ and $\mathbf{E}\big[(\hat{\lambda}_j - \lambda_j^*)^2\big] = \mathrm{Var}(\hat{\lambda}_j) \le (\sigma^2 + L^2)/n$.

Theorems 3 and 4 imply the following result.

Corollary 1. Under assumptions (A1), (A2), (A4) the sequence $\psi_n^{L}(M)$ is an optimal rate of linear aggregation.

Consider now the problem of convex aggregation (Problem (C)). If $M \le \sqrt{n}$ the lower bound of Theorem 2 is identical to the linear aggregation case, so we can use the linear aggregate $\tilde{f}_n^{L}$ defined in (15), which attains this bound in view of Theorem 4. For $M > \sqrt{n}$ we use a different procedure. To define this procedure, consider first the Kullback divergence based model selection aggregate (Catoni (2001), Yang (2000)). This aggregate, for the problem of model selection with $N$ Borel functions $g_1, \dots, g_N$ on $\mathcal{X}$, is defined by

$$\tilde{g}_{n,N}(x) = \frac{1}{n+1} \sum_{k=0}^n p_{k,N}(x)$$

where

$$p_{0,N}(x) = \frac{1}{N} \sum_{j=1}^N g_j(x)$$

and, for $k = 1, \dots, n$,

$$p_{k,N}(x) = \frac{\sum_{j=1}^N g_j(x) \prod_{i=1}^k \exp\big(-(Y_i - g_j(X_i))^2 / 2\sigma^2\big)}{\sum_{j=1}^N \prod_{i=1}^k \exp\big(-(Y_i - g_j(X_i))^2 / 2\sigma^2\big)}.$$

As shown by Catoni (2001), for any integers $N \ge 2$, $n \ge 1$ and any functions $f, g_1, \dots, g_N \in \mathcal{F}_0$,

$$R(\tilde{g}_{n,N}, f) \le \min_{1 \le j \le N} \|g_j - f\|^2 + C_0 \frac{\log N}{n} \qquad (16)$$

where $C_0 < \infty$ is a constant that depends only on $L$ and $\sigma^2$.

Now, define $m$ by (9) and denote by $\mathcal{C}'$ the set of all sub-convex combinations of $f_1, \dots, f_M$ with weights equal to integer multiples of $1/m$. We have

$$\mathrm{card}(\mathcal{C}') = \sum_{j=1}^m \binom{M + j - 1}{j} \le \Big(\frac{e(M+m)}{m}\Big)^{m} \qquad (17)$$

(cf., e.g., Devroye, Györfi and Lugosi (1996, p. 218)). Let $N = \mathrm{card}(\mathcal{C}')$, let $g_1, \dots, g_N$ be the elements of $\mathcal{C}'$, and denote by $\tilde{f}_{n,m}^{MS}$ the corresponding model selection aggregate $\tilde{g}_{n,N}$. Then (16) takes the form

$$R(\tilde{f}_{n,m}^{MS}, f) \le \min_{g \in \mathcal{C}'} \|g - f\|^2 + C_0 \frac{\log(\mathrm{card}(\mathcal{C}'))}{n}, \qquad (18)$$

which holds for every $f \in \mathcal{F}_0$.

Finally, define a compound method of convex aggregation by

$$\tilde{f}_n^{C} = \begin{cases} \tilde{f}_n^{L} & \text{if } M \le \sqrt{n}, \\ \tilde{f}_{n,m}^{MS} & \text{if } M > \sqrt{n}. \end{cases}$$

This definition emphasizes the intermediate character of convex aggregation: it switches from linear to model selection aggregates. If $M > \sqrt{n}$ we are in a "sparse case": the convex aggregation oracle concentrates only on a relatively small number of functions $f_j$, and the optimal rate is attained by a model selection type procedure. On the contrary, for $M \le \sqrt{n}$ the convex aggregation oracle does not concentrate on the boundary of the set $\Lambda^M$, and therefore it essentially behaves as a linear oracle, giving the solution to an unrestricted minimization problem.
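For concreteness, here is a compact sketch of the two building blocks used in this section: the projection-type linear aggregate (15) and the cumulative aggregate $\tilde{g}_{n,N}$. It only transcribes the displayed formulas, under the assumptions that the functions passed in as `phi` are already orthonormal in $L_2(\mathcal{X}, P^X)$ and that $\sigma^2$ is known; the log-sum-exp stabilization is an implementation choice, not part of the definitions.

```python
# Illustrative sketch of the aggregates of Section 3 (not a reference implementation).
# Assumptions: phi is a list of functions already orthonormal in L2(X, P^X), as (15)
# requires; sigma2 is known; log-sum-exp stabilization is an implementation choice.
import numpy as np

def linear_aggregate(phi, X, Y):
    """Linear aggregate (15): hat(lambda)_j = (1/n) sum_i Y_i phi_j(X_i)."""
    lam_hat = np.array([np.mean(Y * p(X)) for p in phi])
    return lambda x: sum(lam * p(x) for lam, p in zip(lam_hat, phi))

def progressive_mixture(g, X, Y, sigma2):
    """Cumulative aggregate g~_{n,N}(x) = (1/(n+1)) sum_{k=0}^{n} p_{k,N}(x)."""
    n, N = len(Y), len(g)
    G = np.stack([gj(X) for gj in g], axis=1)            # G[i, j] = g_j(X_i)
    losses = (Y[:, None] - G) ** 2 / (2.0 * sigma2)      # (Y_i - g_j(X_i))^2 / (2 sigma^2)
    cum = np.vstack([np.zeros((1, N)), np.cumsum(losses, axis=0)])  # row k: sum over i <= k
    logw = -cum - np.max(-cum, axis=1, keepdims=True)    # log-sum-exp stabilization
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)                    # w[k, j]: weight of g_j in p_{k,N}
    def aggregate(x):
        gx = np.array([gj(x) for gj in g])               # values g_1(x), ..., g_N(x)
        return float(np.mean(w @ gx))                    # average of p_{k,N}(x), k = 0..n
    return aggregate
```

The compound convex aggregate $\tilde{f}_n^{C}$ would then return `linear_aggregate` when $M \le \sqrt{n}$ and apply `progressive_mixture` to the finite grid $\mathcal{C}'$ of sub-convex combinations when $M > \sqrt{n}$.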
Theorem 5. Under assumption (A1) we have

$$R(\tilde{f}_n^{C}, f) - \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 \le C \psi_n^{C}(M)$$

for some constant $C < \infty$, any integers $M \ge 2$, $n \ge 1$ and any $f, f_1, \dots, f_M \in \mathcal{F}_0$.

Proof. In view of Theorem 4, it suffices to consider the case $M > \sqrt{n}$. Let $\lambda^0 = (\lambda_1^0, \dots, \lambda_M^0)$ be the weights of a convex aggregation oracle, i.e. a vector $\lambda^0$ satisfying $\|f_{\lambda^0} - f\| = \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|$. Applying the argument of Nemirovski (2000, p. 192–193) to the vector $\lambda^0$ instead of $\lambda(z)$ and putting there $K = m$ we find

$$\sum_{i=1}^J p_i \|h_i - f\|^2 \le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + \frac{L^2}{m},$$

where $J = \mathrm{card}(\mathcal{C}')$, $(p_1, \dots, p_J)$ is a probability vector (i.e. $p_1 + \dots + p_J = 1$, $p_j \ge 0$) that depends on $\lambda^0$, and the functions $h_i$ are the elements of the set $\mathcal{C}'$. This immediately implies that

$$\min_{g \in \mathcal{C}'} \|g - f\|^2 \le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + \frac{L^2}{m}. \qquad (19)$$

Combining (17)–(19) we obtain

$$R(\tilde{f}_{n,m}^{MS}, f) \le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + \frac{L^2}{m} + c_{12} \frac{m}{n} \log\Big(\frac{M}{m} + 1\Big) \le \min_{\lambda \in \Lambda^M} \|f_\lambda - f\|^2 + C \sqrt{\frac{1}{n} \log\Big(\frac{M}{\sqrt{n}} + 1\Big)}$$

for a constant $C < \infty$.

Theorems 2 and 5 imply the following result.

Corollary 2. Under assumptions (A1)–(A3) the sequence $\psi_n^{C}(M)$ is an optimal rate of convex aggregation.

Finally, for Problem (MS), the attainability of the lower bound of Theorem 1 follows immediately from (16) with $N = M$ and $g_j = f_j$. Thus, we have the following corollary.

Corollary 3. Under assumptions (A1)–(A3) the sequence $\psi_n^{MS}(M)$ is an optimal rate of model selection aggregation.

References

1. Birgé, L.: Model selection for Gaussian regression with random design. Prépublication n. 783, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 - Paris 7 (2002).
2. Birgé, L., Massart, P.: Gaussian model selection. J. Eur. Math. Soc. 3 (2001) 203–268.
3. Catoni, O.: Statistical Learning Theory and Stochastic Optimization. Ecole d'Eté de Probabilités de Saint-Flour 2001, Lecture Notes in Mathematics, Springer, N.Y. (to appear).
4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, N.Y. (1996).
5. Györfi, L., Kohler, M., Krzyżak, A., Walk, H.: A Distribution-Free Theory of Nonparametric Regression. Springer, N.Y. (2002).
6. Juditsky, A., Nemirovski, A.: Functional aggregation for nonparametric estimation. Annals of Statistics 28 (2000) 681–712.
7. Nemirovski, A.: Topics in Non-parametric Statistics. Ecole d'Eté de Probabilités de Saint-Flour XXVIII - 1998, Lecture Notes in Mathematics, v. 1738, Springer, N.Y. (2000).
8. Tsybakov, A.: Introduction à l'estimation non-paramétrique. Springer (2003) (to appear).
9. Wegkamp, M.: Model selection in nonparametric regression. Annals of Statistics 31 (2003) (to appear).
10. Yang, Y.: Combining different procedures for adaptive regression. J. of Multivariate Analysis 74 (2000) 135–161.
11. Yang, Y.: Aggregating regression procedures for a better performance. Manuscript (2001).