Entropy Numbers of Linear Function Classes

Robert C. Williamson
Department of Engineering, Australian National University, Canberra, ACT 0200, Australia

Alex J. Smola
Department of Engineering, Australian National University, Canberra, ACT 0200, Australia

Abstract

This paper collects together a miscellany of results originally motivated by the analysis of the generalization performance of the "maximum-margin" algorithm due to Vapnik and others. The key feature of the paper is its operator-theoretic viewpoint. New bounds on covering numbers for classes related to maximum margin classes are derived directly, without making use of a combinatorial dimension such as the VC-dimension. Specific contents of the paper include: a new and self-contained proof of Maurey's theorem and some generalizations, with small explicit values of the constants; bounds on the covering numbers of maximum margin classes suitable for the analysis of their generalization performance; the extension of such classes to those induced by balls in quasi-Banach spaces (such as $\ell_p$ "norms" with $0 < p < 1$); the extension of results on the covering numbers of convex hulls of basis functions to $p$-convex hulls ($0 < p \le 1$); and an appendix containing the tightest known bounds on the entropy numbers of the identity operator between $\ell_p^d$ and $\ell_q^d$ ($0 < p < q \le \infty$).

1 Introduction

Linear classifiers have had a resurgence of interest in recent years because of the development of Support Vector machines [22, 24], which are based on maximum margin hyperplanes [25]. The generalization performance of support vector machines is becoming increasingly well understood via an analysis of the covering numbers of the classes of functions they induce. Some of this analysis has made use of entropy number techniques. In this paper we focus more closely on the simple maximum margin case and do not consider kernel mappings at all.
The effect of the kernel used in support vector machines has been analysed using similar techniques in [32, 11].

Bernhard Schölkopf
Microsoft Research Limited, St. George House, 1 Guildhall Street, Cambridge CB2 3NH, UK

The classical maximum margin algorithm effectively works with the class of linear functions $x \mapsto \langle w, x \rangle$ with a bound on the norm of the weight vector $w$. (The standard notation used here is defined precisely below in Definition 3.) The focus of the present paper is to consider what happens when different norms are used in the definition of the class. Apart from the purely mathematical interest in developing the connection between problems of determining covering numbers of function classes and those of determining entropy numbers of operators, the results in the paper indicate the effect to be expected from using different norms to define linear function classes in practical learning algorithms. There is a considerable body of work in the mistake-bounded framework for analysing learning algorithms for linear function classes that explores the effect of different norms. The present paper can be considered a similar exercise in the statistical learning theory framework.

The following section collects all the definitions we need. All proofs in the paper are relegated to the appendix.

2 Definitions

We will make use of several notions from the theory of Banach spaces and a generalization of these called quasi-Banach spaces. A nice general reference for Banach spaces is [33]; for quasi-Banach spaces see [8].

Definition 1 (Banach space) A Banach space $E$ is a complete normed linear space with a norm $\|\cdot\|$ on $E$, i.e. a map $\|\cdot\|\colon E \to [0, \infty)$ that satisfies

1. $\|x\| = 0$ if and only if $x = 0$;
2. $\|\alpha x\| = |\alpha|\,\|x\|$ for scalars $\alpha \in \mathbb{R}$ and all $x \in E$;
3. $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in E$.

Definition 2 (Quasi-norm) A quasi-norm is a map like a norm which, instead of satisfying the triangle inequality (3 above), satisfies

3'. There exists a constant $C \ge 1$ such that for all $x, y \in E$, $\|x + y\| \le C(\|x\| + \|y\|)$.

All of the spaces considered in this paper are real. A norm (quasi-norm) induces a metric (quasi-metric) via $d(x, y) := \|x - y\|$.
We will use $\|\cdot\|$ to denote both the norm and the induced metric, and use $(E, \|\cdot\|)$ to denote the induced metric space. A quasi-Banach space is a complete quasi-normed linear space.

Definition 3 ($\ell_p$ norms) Suppose $x \in \mathbb{R}^d$ with $d \in \mathbb{N}$, or $x \in \mathbb{R}^{\mathbb{N}}$ (for infinite dimensional spaces). Then for $0 < p < \infty$,
$$\|x\|_p := \Bigl(\sum_i |x_i|^p\Bigr)^{1/p}$$
(provided the sum converges). For $p = \infty$, $\|x\|_\infty := \sup_i |x_i|$. If $x \in \mathbb{R}^d$, we often write $\ell_p^d$ to explicitly indicate the dimension. The space $\ell_p$ is defined as the set of sequences for which $\|x\|_p$ is finite. Given a sample $X = (x_1, \dots, x_m)$, write $\ell_p^X$ for the associated empirical (quasi-)metric: if $F$ is a class of functions defined on $\mathbb{R}^d$, the $\ell_p^X$ norm of $f \in F$ with respect to $X$ is defined as
$$\|f\|_{\ell_p^X} := \Bigl(\sum_{i=1}^m |f(x_i)|^p\Bigr)^{1/p}. \qquad (1)$$
For $p \ge 1$ this is a norm, and for $0 < p < 1$ it is a quasi-norm. Note that a different (normalised) definition of the $\ell_p^X$ norm is used in some papers in learning theory, e.g. [28, 34]. A useful inequality in this context is the following (cf. e.g. [16, p. 21]):

Theorem 4 (Hölder's inequality) Suppose $1 \le p, q \le \infty$ satisfy $\frac{1}{p} + \frac{1}{q} = 1$ and that $x \in \ell_p$ and $y \in \ell_q$. Then $\sum_i |x_i y_i| \le \|x\|_p \|y\|_q$.

Suppose $(E, d)$ is a metric space, let $S \subseteq E$ and let $\varepsilon > 0$. We say $\hat{S} \subseteq E$ is an $\varepsilon$-cover for $S$ if for all $s \in S$ there is an $\hat{s} \in \hat{S}$ such that $d(s, \hat{s}) \le \varepsilon$.

Definition 5 (Covering and entropy numbers) The $\varepsilon$-covering number of $S$, denoted by $N(\varepsilon, S)$, is the size of the smallest $\varepsilon$-cover of $S$. The $n$th entropy number of a set $S$ is defined by
$$e_n(S) := \inf\{\varepsilon > 0 \colon N(\varepsilon, S) \le 2^{n-1}\}. \qquad (2)$$
Given a class of functions $F$ on $\mathbb{R}^d$, the uniform covering number or $\varepsilon$-growth function is
$$N_m(\varepsilon, F) := \sup_{X = (x_1, \dots, x_m)} N(\varepsilon, F, \ell_\infty^X). \qquad (3)$$
Covering numbers are of considerable interest to learning theory because generalization bounds can be stated in terms of them [1, 30].

Definition 6 (Operator norm) Denote by $E$ a (quasi-)normed space and by $U_E$ its (closed) unit ball $U_E := \{x \in E \colon \|x\| \le 1\}$. Suppose $E$ and $F$ are (quasi-)Banach spaces and $T$ is a linear operator mapping from $E$ to $F$. Then the operator norm of $T$ is defined by
$$\|T\| := \sup_{x \in U_E} \|Tx\|, \qquad (4)$$
and $T$ is called bounded if $\|T\| < \infty$. We denote by $\mathfrak{L}(E, F)$ the set of all bounded linear operators from $E$ to $F$.

Definition 7 (Entropy numbers of operators) Suppose $T \in \mathfrak{L}(E, F)$. The entropy numbers of the operator $T$ are defined by
$$\varepsilon_n(T) := \inf\{\varepsilon > 0 \colon N(\varepsilon, T(U_E)) \le n\}. \qquad (5)$$
The dyadic entropy numbers are defined by
$$e_n(T) := \varepsilon_{2^{n-1}}(T) \quad \text{for } n \in \mathbb{N}. \qquad (6)$$
The main reference for entropy numbers of operators is [7]. Many of the properties shown there for Banach spaces actually carry over to quasi-Banach spaces; see e.g. [8].
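Definition 5 can be made concrete numerically. The following sketch (a hypothetical illustration, not code from the paper) greedily constructs an $\varepsilon$-cover of a finite point set in the $\ell_\infty$ metric; since $N(\varepsilon, S)$ is the size of the *smallest* cover, the size of the greedy cover is an upper bound on it.

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedily build an eps-cover of a finite point set in the l_inf metric.

    Repeatedly pick an uncovered point as a new centre and mark every point
    within l_inf distance eps of it as covered.  The returned centres form a
    valid eps-cover, so their count upper-bounds N(eps, S).
    """
    pts = np.asarray(points, dtype=float)
    uncovered = np.ones(len(pts), dtype=bool)
    centres = []
    while uncovered.any():
        i = int(np.argmax(uncovered))               # first uncovered point
        centres.append(pts[i])
        dists = np.abs(pts - pts[i]).max(axis=1)    # l_inf distances to centre
        uncovered &= dists > eps                    # drop newly covered points
    return centres

# Example: a regular grid in [0,1]^2 needs only a few centres at a coarse scale.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
cover = greedy_cover(grid, eps=0.25)
```

Replacing the `max` in the distance computation changes the metric; the same loop estimates covering numbers in any (quasi-)metric of Definition 2.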
The factorization theorem for entropy numbers is of considerable use:

Lemma 8 (Edmunds and Triebel [8, p. 7]) Let $E, F, G$ be quasi-Banach spaces and let $S \in \mathfrak{L}(F, G)$ and $T \in \mathfrak{L}(E, F)$. Then

1. $e_{m+n-1}(S \circ T) \le e_m(S)\, e_n(T)$ for all $m, n \in \mathbb{N}$;
2. $e_n(S \circ T) \le \|S\|\, e_n(T)$ and $e_n(S \circ T) \le e_n(S)\, \|T\|$ for all $n \in \mathbb{N}$.

Finite rank operators have exponentially decaying entropy numbers:

Lemma 9 (Carl and Stephani [7, pp. 14, 21]) Denote by $E, F$ Banach spaces, let $T \in \mathfrak{L}(E, F)$ and suppose that $\operatorname{rank} T \le d$. Then for all $n \in \mathbb{N}$,
$$e_n(T) \le 4\,\|T\|\, 2^{-(n-1)/d}. \qquad (7)$$
In fact this bound is tight to within a constant factor of 4.

One operator that we will make repeated use of is the identity operator: if $E$ and $F$ are (quasi-)Banach spaces on the same underlying linear space, $\mathrm{id}\colon E \to F$ is defined by
$$\mathrm{id}(x) = x. \qquad (8)$$
This seemingly trivial operator is of interest when $E \ne F$ because of the definition of the operator norm. As we shall see below, the entropy numbers of $\mathrm{id}$ play a central role in many of the results we develop. In the following we will use $c$ to denote a positive constant, and $\log$ is logarithm base 2.

3 Entropy Numbers of Linear Function Classes

Since we are mainly concerned with the capacity of linear functions of the form
$$x \mapsto \langle w, x \rangle, \qquad (9)$$
with $w$ and $x$ each constrained in some norm or quasi-norm, we will consider bounds on entropy numbers of linear operators. This will allow us to deal with function classes derived from (9). In particular we will analyze the class of functions
$$F := \{\, x \mapsto \langle w, x \rangle \colon \|w\|_p \le 1 \,\}, \qquad (10)$$
where $w, x \in \mathbb{R}^d$, $d \in \mathbb{N}$. More specifically we will look at the evaluation of $F$ on an $m$-sample $X = (x_1, \dots, x_m)$ and the entropy numbers of the evaluation map in terms of the $\ell_\infty^m$ metric. Formally, we will study the entropy numbers of the operator $S_X$ defined as
$$S_X \colon w \mapsto \bigl(\langle w, x_1 \rangle, \dots, \langle w, x_m \rangle\bigr).$$
The connection between $e_n(S_X)$ and $N_m(\varepsilon, F)$ is given in Lemma 11 below. Since one cannot expect that all problems can be cast into (10) without proper rescaling, we will be interested in constraints on $w$ and $x$ such that $|\langle w, x \rangle| \le 1$ (and rescale later). It is of interest to determine an explicit value for the constant involved; Carl [3] proved an explicit bound of this type, referred to as (15) below. This leads to the following straightforward corollary:

Lemma 10 (Product bounds from Hölder's inequality) Suppose $0 < p, q \le \infty$ with $\frac{1}{p} + \frac{1}{q} \ge 1$. Furthermore suppose $d, m \in \mathbb{N}$ and $x_1, \dots, x_m \in \mathbb{R}^d$ with $\|x_i\|_q \le 1$ for all $i$. Then
$$\|S_X \colon \ell_p^d \to \ell_\infty^m\| \le 1. \qquad (11)$$

In order to avoid tedious notation we will assume that $\|x_i\|_q \le 1$ throughout.
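To make the evaluation operator concrete: for a sample $X = (x_1, \dots, x_m)$, $S_X$ is just the $m \times d$ matrix with rows $x_i$, and the $\ell_\infty^X$ distance between two functions $f_w$ and $f_v$ equals the $\ell_\infty^m$ norm of the image of $w - v$. A hypothetical sketch (names chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 5, 8
X = rng.normal(size=(m, d))          # sample points x_1, ..., x_m as rows

def S_X(w):
    """Evaluation operator: w -> (<w, x_1>, ..., <w, x_m>)."""
    return X @ w

w, v = rng.normal(size=d), rng.normal(size=d)

# By linearity, the l_inf^X distance between f_w and f_v is the l_inf norm
# of S_X applied to the difference of the weight vectors.
lhs = np.max(np.abs(S_X(w) - S_X(v)))
rhs = np.max(np.abs(S_X(w - v)))
```

This identity is exactly why covering the function class in $\ell_\infty^X$ reduces to covering the image of a weight-vector ball under a linear operator.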
This is no major restriction, since the general results follow simply by rescaling. Our interest in the operator $S_X$ and its entropy numbers is explained by the following lemma, which connects $e_n(S_X)$ to the uniform covering numbers of $F$.

Lemma 11 (Entropy and covering numbers) Let $n, m \in \mathbb{N}$. If for all $X = (x_1, \dots, x_m)$ with $\|x_i\|_q \le 1$ we have $e_n(S_X) \le \varepsilon$, then
$$N_m(\varepsilon, F) \le 2^{n-1}. \qquad (12)$$

3.1 The Maurey-Carl Theorem

In this section we present a special case of the famous Maurey-Carl theorem. The proof presented (in the appendix) provides a (small) explicit constant. The result is not only of fundamental importance in statistical learning theory; it is of central importance in pure mathematics. Carl and Pajor [6] prove Maurey's theorem via the "little Grothendieck theorem", which is related to Grothendieck's "fundamental theorem of the metric theory of tensor products". Furthermore, the little Grothendieck theorem can be proved in terms of Maurey's theorem (and thus they are formally equivalent); see [7, pages 254-267] for details. The version proved by Carl [3] (following Maurey's proof) uses a characterization of Banach spaces in terms of their Rademacher type. The latter is defined as follows.

Definition 12 (Rademacher type of Banach spaces) A Banach space $E$ is of Rademacher type $p$, $1 \le p \le 2$, if there is a constant $c$ such that for every finite sequence $x_1, \dots, x_n \in E$ we have
$$\Bigl(\int_0^1 \Bigl\|\sum_{i=1}^n r_i(t)\, x_i\Bigr\|^2 dt\Bigr)^{1/2} \le c\, \Bigl(\sum_{i=1}^n \|x_i\|^p\Bigr)^{1/p}. \qquad (13)$$
Here $r_i(t) = \operatorname{sign}(\sin(2^i \pi t))$ is the $i$th Rademacher function on $[0, 1]$. The Rademacher type constant is the smallest constant $c$ satisfying (13).

Theorem 13 (Maurey-Carl) Let $E$ be a Banach space of Rademacher type $p$, $1 < p \le 2$. Let $m \in \mathbb{N}$ and let $T \in \mathfrak{L}(\ell_1^m, E)$. Then there exists a constant $c$ such that for all $n \in \mathbb{N}$,
$$e_n(T) \le c\, \|T\|\, \bigl(n^{-1} \log(1 + m/n)\bigr)^{1 - 1/p}. \qquad (14)$$

Corollary 14 (Small constants for Maurey's theorem) If $p = 2$ and $E$ is a Hilbert space, (14) holds with a small explicit constant, computed in the proof in Appendix B.

The dual version of Theorem 13, i.e. a bound for operators from a Hilbert space into $\ell_\infty^m$, has identical formal structure to Theorem 13. We only state the Hilbert space case here.

Theorem 15 (Dual version of the Maurey-Carl theorem) Suppose $H$ is a Hilbert space, $m \in \mathbb{N}$, and $T$ is a linear operator in $\mathfrak{L}(H, \ell_\infty^m)$. Then
$$e_n(T) \le c\, \|T\|\, \bigl(n^{-1} \log(1 + m/n)\bigr)^{1/2}, \qquad (16)$$
where $c$ is the explicit constant computed in Appendix B.
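The probabilistic core of Maurey's argument, the "empirical method", is easy to illustrate numerically: any point $Tw$ with $\|w\|_1 \le 1$ lies in the absolutely convex hull of the columns of $T$, and averaging $k$ columns sampled i.i.d. according to $|w|$ approximates it with expected squared Euclidean error at most $\max_i \|Te_i\|^2 / k$. A hypothetical sketch (not the paper's proof, just the variance computation behind it):

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns of T: m points in the unit ball of R^d, so ||T: l_1^m -> l_2^d|| <= 1.
d, m, k = 20, 500, 400
cols = rng.normal(size=(m, d))
cols /= np.linalg.norm(cols, axis=1, keepdims=True)   # unit-norm columns

# A target point T w with w a probability vector (so ||w||_1 = 1).
w = rng.random(m)
w /= w.sum()
target = w @ cols

# Maurey's empirical method: average k columns drawn i.i.d. with probabilities w.
idx = rng.choice(m, size=k, p=w)
approx = cols[idx].mean(axis=0)

err = np.linalg.norm(approx - target)
bound = 1.0 / np.sqrt(k)   # since E||error||^2 <= max_i ||col_i||^2 / k = 1/k
```

Averaging over many draws, the mean squared error stays below $1/k$; turning this sparse-approximation fact into the entropy number bound (14) is what the counting argument in the appendix does.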
As explained in Appendix B, we suspect that a smaller value of $c$ is possible (we conjecture 1.86). Theorem 15 can be improved by taking advantage of operators with low rank via Lemma 9.

Lemma 16 (Improved dual Maurey-Carl theorem) Let $H$ be a Hilbert space, and suppose $m, d \in \mathbb{N}$ with $\operatorname{rank} T \le d$ and $T$ a linear operator in $\mathfrak{L}(H, \ell_\infty^m)$. Then $e_n(T)$ obeys a bound (17) with three regimes: $e_n(T) \le \|T\|$ for small $n$; $e_n(T) \le c\,\|T\|\,(n^{-1}\log(1 + m/n))^{1/2}$ in an intermediate range of $n$; and, for $n$ large relative to the rank $d$, a bound decaying exponentially in $n/d$. Here $c$ is the constant of Theorem 15.

4 Dimensionality and Sample Size

Lemma 16 already indicates that the size $m$ of the sample generating the evaluation operator plays a crucial role in the scaling behaviour of entropy numbers. The dimensionality $d$ (i.e. the dimension of the sample points $x_i$) also comes into play. This guides our analysis in the present section. In section 4.1 we deal with the case where $d = m$; section 4.2 deals with the situation where $d$ is polynomial in $m$. Depending on the setting of the learning problem we need bounds for the entropy numbers of the identity map between $\ell_p^d$ and $\ell_q^d$. We will use such bounds repeatedly below; a number of them are collected in Appendix C.

4.1 Dimensionality and Sample Size are Equal

We begin with the simplest case: the weight space is a finite dimensional Hilbert space of dimensionality $d = m$, hence we will be dealing with $\ell_2^m$. Lemma 16 applies. It is instructive to restate this result in terms of covering numbers.

Theorem 17 (Covering numbers for $d = m$) There exist constants $c, c' > 0$ such that for all $m \in \mathbb{N}$ and all $\varepsilon > 0$, the uniform covering number $N_m(\varepsilon, F)$ obeys a bound (18) with two regimes, one for large $\varepsilon$ and one for small $\varepsilon$.

It is interesting to note the analogy with the Sauer-Vapnik-Chervonenkis lemma [31, 20, 1], which shows that the growth function has two regimes.

We will now develop generalizations of the above result for $\ell_p$ spaces with $p \ne 2$. An existing result in this direction is:

Lemma 18 (Carl [3, p. 94]) Let $1 \le p \le 2$ and $m \in \mathbb{N}$, and let $T \in \mathfrak{L}(\ell_p^m, H)$. Then there exists a constant $c$ such that for all $n \in \mathbb{N}$ a bound of the same form as (14) holds, with an exponent determined by $p$ (19).

(For $p = 2$ one can get a better bound along the lines of Lemma 16.) This leads to the following theorem:

Theorem 19 (Slack in Hölder's inequality) Let $p, q$ satisfy $\frac{1}{p} + \frac{1}{q} > 1$, with $d = m$ as above. Then there exist constants such that, with a rate reflecting the slack in the constraint $\frac{1}{p} + \frac{1}{q} \ge 1$, corresponding bounds hold on the entropy numbers of $S_X$ and on the covering numbers of $F$.

4.2 Dimensionality of the Data is Polynomial in the Sample Size

Now we will consider the case $d \ne m$.
We will derive results that are useful when $d$ is polynomial in $m$. With $S_X$ defined as before, we proceed to bound its entropy numbers.

Lemma 20 (Slack in Hölder's inequality, $d \ne m$) Let $p, q$ be as above and let $d, m, n \in \mathbb{N}$. Then there exists a constant $c$ such that bounds of the form (20) and (21) hold on the entropy numbers of $S_X$.

Consider now the situation where $d$ is polynomial in $m$. From Lemma 20, regression with such a class has a sample complexity of the corresponding order, ignoring logarithmic factors. Interestingly, Zhang [34] has developed a result going in the other direction: he makes use of mistake bounds to determine similar covering numbers, whereas our covering number bounds (computed directly) recover the general form of the mistake bounds when turned into batch learning results. (Note, too, that Zhang uses a normalised definition of the $\ell_p^X$ norm, so care needs to be taken in comparing his results to ours.)

5 Covering Numbers of $p$-Convex Hulls

Adaboost [10] is an algorithm related to the variants of the maximum margin algorithm considered in this paper. It outputs a hypothesis in the convex hull of the set of weak learners, and its generalization performance can be expressed in terms of the covering number of the convex hull of the weak learners, at a scale related to the margin achieved [21]. In this section we consider what effect the use of the $p$-convex hull (with $0 < p \le 1$) has on the covering numbers of the class used, and hence on the generalization bounds. Variants of Adaboost can be developed which use the $p$-convex hull, and experimental results indicate that $p$ affects generalization in a manner consistent with what is suggested by the theory below [2]. The argument below is inspired by the results in [4, 5].

We are interested in $N(\varepsilon, \Gamma_p(W))$ when $W$ is a subset of a Hilbert space $H$ and $\Gamma_p(W)$ denotes the $p$-convex hull of $W$ (see Definition 21). Recalling the definition of the $\ell_\infty^X$ norm, for all $f$ we have $\|f\|_{\ell_\infty^X} \le \|f\|_H$, and thus $N(\varepsilon, \Gamma_p(W), \ell_\infty^X) \le N(\varepsilon, \Gamma_p(W), \|\cdot\|_H)$. We will therefore bound the covering numbers of $\Gamma_p(W)$ with respect to the Hilbert space norm; since the results hold regardless of $X$, this in fact bounds the uniform covering number of $\Gamma_p(W)$.

Definition 21 ($p$-convex hull) Suppose $0 < p \le 1$ and $W \subseteq H$ is a set.
Then the $p$-convex hull of $W$ (strictly speaking, the absolutely $p$-convex hull) is defined by [13, chapter 6]
$$\Gamma_p(W) := \Bigl\{ \sum_{i=1}^k a_i w_i \;\colon\; k \in \mathbb{N},\ w_i \in W,\ a_i \in \mathbb{R},\ \sum_{i=1}^k |a_i|^p \le 1 \Bigr\}.$$

As an example, consider the $p$-convex hull of the set of Heaviside functions on $[0, 1]$. It is well known that the convex hull ($p = 1$) is the set of functions of bounded variation. In Appendix D we explore the analogous situation for $p < 1$.

Lemma 22 Let $0 < p \le 1$, let $W$ be a set and let $\hat{W}$ be an $\varepsilon$-cover of $W$. Then $\Gamma_p(\hat{W})$ is an $\varepsilon$-cover of $\Gamma_p(W)$.

Lemma 23 Let $0 < p \le 1$ and $\varepsilon, \delta > 0$. If $\hat{W}$ is an $\varepsilon$-cover of $W$, then any $\delta$-cover of $\Gamma_p(\hat{W})$ is an $(\varepsilon + \delta)$-cover of $\Gamma_p(W)$.

Lemma 24 Let $W = \{w_1, \dots, w_k\} \subseteq H$, where $H$ is a Hilbert space. Define the linear operator $T \colon \ell_p^k \to H$ by $T e_i = w_i$, where $(e_i)$ is the canonical basis of $\ell_p^k$ (e.g. $e_1 = (1, 0, 0, \dots)$, $e_2 = (0, 1, 0, \dots)$, etc.). Then $T(U_{\ell_p^k}) = \Gamma_p(W)$.

Theorem 25 (Covering numbers of $p$-convex hulls) Let $H$ be a Hilbert space and let $W \subseteq H$ be compact. Then the covering numbers of $\Gamma_p(W)$ can be bounded in terms of the covering numbers of $W$ and the entropy numbers of the operator of Lemma 24 (45).

Suppose $N(\varepsilon, W)$ grows polynomially in $1/\varepsilon$. We can determine the rate of growth of $\log N(\varepsilon, \Gamma_p(W))$ as follows. Neglecting the term inside the log in (45), we can explicitly solve the equation. Numerical evidence suggests that the dependence on the free parameter in (45) is not very strong, so we choose it as a fixed function of $\varepsilon$. Then a simple approximate calculation yields:

Corollary 26 Suppose $W \subseteq H$ is such that $N(\varepsilon, W)$ grows polynomially in $1/\varepsilon$. Then the log covering numbers of $\Gamma_p(W)$ grow at a polynomial rate in $1/\varepsilon$ with exponent depending on $p$.

For $p = 1$ the rate so obtained exceeds the rate known from [4] by a logarithmic factor. Of course, for large exponents the difference is negligible (asymptotically in $1/\varepsilon$).

[4] B. Carl. Metric entropy of convex hulls in Hilbert spaces. Bulletin of the London Mathematical Society, 29:452-458, 1997.
[5] B. Carl, I. Kyrezi, and A. Pajor. Metric entropy of convex hulls in Banach spaces. Proceedings of the London Mathematical Society, 1999. To appear.

6 Conclusions

We have computed covering numbers for a range of variants of the maximum margin algorithm. In doing so we made explicit use of an operator-theoretic viewpoint already used fruitfully in analysing the effect of the kernel in SV machines. We also analysed the covering numbers of $p$-convex hulls of simple classes of functions.
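The definition implies a nesting $\Gamma_p(W) \subseteq \Gamma_1(W)$ for $p \le 1$: if $\sum_i |a_i|^p = 1$ then each $|a_i| \le 1$, hence $|a_i| \le |a_i|^p$ and so $\sum_i |a_i| \le 1$. Smaller $p$ thus forces effectively sparser combinations. A hypothetical numerical check (illustration only; `sample_p_convex` is not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_p_convex(points, p, n_terms, rng):
    """Draw a random element of the p-convex hull of `points`.

    Coefficients are rescaled so that sum_i |a_i|^p = 1, matching the
    constraint in Definition 21.
    """
    a = rng.normal(size=n_terms)
    a /= np.sum(np.abs(a) ** p) ** (1.0 / p)        # enforce sum |a_i|^p = 1
    idx = rng.choice(len(points), size=n_terms)
    return a, points[idx], a @ points[idx]

points = rng.normal(size=(50, 3))
for p in (0.25, 0.5, 1.0):
    a, _, f = sample_p_convex(points, p, n_terms=10, rng=rng)
    # For p <= 1, sum |a_i|^p = 1 forces each |a_i| <= 1, hence
    # |a_i| <= |a_i|^p and the l_1 mass is at most 1.
    assert np.sum(np.abs(a)) <= 1.0 + 1e-9
```

The assertion inside the loop is exactly the containment used in the proof of Lemma 22, where the coefficient mass $\sum_i |a_i| \le 1$ controls the covering error.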
We have seen how the now-classical results for maximum margin hyperplanes can be generalized to function classes induced by different norms. The scaling behaviour of the resulting covering number bounds gives some insight into how related algorithms will perform in terms of their generalization performance. In other work [32, 11, 26] we have explored, for instance, the effect of the kernel used in support vector machines; in that case the eigenvalues of the kernel play a key role. In all of this work we have used the viewpoint that the function class is the image, under the multiple evaluation map (considered as a linear operator), of a ball induced by a norm.

Gurvits [12] asked (effectively) what learning theory can do for the geometric theory of Banach spaces. It seems the assistance flows more readily in the other direction. Perhaps the one contribution learning theory has made is pointing out an interesting research direction [27] by giving an answer to Pietsch's implicit question, where he said of entropy numbers [19, p. 311] that "at present we do not know any application in the real world." Now at least there is one!

[6] B. Carl and A. Pajor. Gelfand numbers of operators with values in a Hilbert space. Inventiones Mathematicae, 94:479-504, 1988.
[7] B. Carl and I. Stephani. Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, Cambridge, UK, 1990.
[8] D. E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, Cambridge, 1996.
[9] D. E. Edmunds and H. Triebel. Entropy numbers and approximation numbers in function spaces. Proceedings of the London Mathematical Society, 58:137-152, 1989.
[10] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, 1996.
[11] Y. G. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines.
In Proceedings of COLT99, 1999.

Acknowledgements

We would like to thank Bernd Carl, Anja Westerhoff and Ingo Steinwart for detailed comments and much help. This work was supported by the Australian Research Council. AS was supported by the DFG, grant SM 62/1-1. Parts of this work were done while BS was visiting the Australian National University.

References

[1] M. Anthony and P. Bartlett. A Theory of Learning in Artificial Neural Networks. Cambridge University Press, 1999.
[2] J. Barnes. Capacity control in boosting using a convex hull. Master's thesis, Department of Engineering, Australian National University, 1999.
[3] B. Carl. Inequalities of Bernstein-Jackson-type and the degree of compactness of operators in Banach spaces. Annales de l'Institut Fourier, 35(3):79-118, 1985.
[12] L. Gurvits. A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In M. Li and A. Maruoka, editors, Algorithmic Learning Theory ALT-97, LNAI-1316, pages 352-363, Berlin, 1997. Springer.
[13] H. Jarchow. Locally Convex Spaces. B. G. Teubner, 1981.
[14] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.
[15] M. Laczkovich and D. Preiss. $\alpha$-Variation and transformation into $C^n$ functions. Indiana University Mathematics Journal, 34(2):405-424, 1985.
[16] I. J. Maddox. Elements of Functional Analysis. Cambridge University Press, Cambridge, 1970.
[17] A. M. Olevskiĭ. Homeomorphisms of the circle, modifications of functions, and Fourier series. American Mathematical Society Translations (2), 147:51-64, 1990.
[18] A. Pietsch. Operator Ideals. North-Holland, Amsterdam, 1980.
[19] A. Pietsch. Eigenvalues and s-Numbers. Cambridge University Press, Cambridge, 1987.
[20] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13:145-147, 1972.
[21] R. Schapire, Y. Freund, P. L. Bartlett, and W. Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 1998.
[22] B.
Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods — Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[23] C. Schütt. Entropy numbers of diagonal operators between symmetric Banach spaces. Journal of Approximation Theory, 40:121-128, 1984.
[24] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 349-358, Cambridge, MA, 2000. MIT Press.
[25] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[26] A. J. Smola, A. Elisseeff, B. Schölkopf, and R. C. Williamson. Entropy numbers for convex combinations and MLPs. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 369-387, Cambridge, MA, 2000. MIT Press.
[27] I. Steinwart. Some estimates for the entropy numbers of convex hulls with finitely many extreme points. Technical report, University of Jena, 1999.
[28] M. Talagrand. The Glivenko-Cantelli problem, ten years later. Journal of Theoretical Probability, 9(2):371-384, 1996.
[29] H. Triebel. Interpolation Theory, Function Spaces, Differential Operators. North-Holland, Amsterdam, 1978.
[30] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[31] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.
[32] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com, 1998. Accepted for publication in IEEE Transactions on Information Theory.
[33] P. Wojtaszczyk. Banach Spaces for Analysts. Cambridge University Press, 1991.
[34] T. Zhang.
Analysis of regularised linear functions for classification problems. IBM Research Report RC21572, 1999.

A Proofs

Proof (Lemma 10) If $\frac{1}{p} + \frac{1}{q} > 1$, we can always find $p' \ge p$ and $q' \ge q$ with $\frac{1}{p'} + \frac{1}{q'} = 1$; otherwise simply set $p' = p$ and $q' = q$. Note that by convexity $\|w\|_{p'} \le \|w\|_p$ and $\|x_i\|_{q'} \le \|x_i\|_q$. Thus we may apply Hölder's inequality with $p'$ and $q'$ to obtain
$$|\langle w, x_i \rangle| \le \|w\|_{p'} \|x_i\|_{q'} \le \|w\|_p \|x_i\|_q \le 1, \qquad (22)$$
which proves (11).

Proof (Lemma 11) Fix $X = (x_1, \dots, x_m)$. We have
$$\|f_w\|_{\ell_\infty^X} = \max_{1 \le i \le m} |\langle w, x_i \rangle| = \|S_X w\|_{\ell_\infty^m}, \qquad (23)$$
where the last equality follows from the definition of $S_X$. Since $e_n(S_X) \le \varepsilon$, the image of the unit ball under $S_X$ has an $\varepsilon$-cover of size at most $2^{n-1}$, and thus $N(\varepsilon, F, \ell_\infty^X) \le 2^{n-1}$. Since $X$ was arbitrary, we conclude that $N_m(\varepsilon, F) \le 2^{n-1}$.

Proof (Corollary 14) The fact that a Hilbert space has Rademacher type 2 with type constant 1 follows from the construction of Hilbert spaces [33]. What remains to be shown is the bound on the constant. Specializing (15) for $p = 2$ (and Hilbert spaces) we obtain an expression (24) for the constant. The next step is to find a (simple) function dominating this expression for all admissible arguments (25). One can readily check that the function involved attains its maximum at an interior point and is monotonically non-decreasing beyond it; taking the limit yields the required bound (26). Resubstitution yields the stated value of the constant, which completes the proof.¹

¹ A (marginally) tighter version of the theorem could be stated by using the intermediate bound (24) directly.

Proof (Lemma 16) The first bound (for small $n$) follows immediately from the definition of entropy numbers, since $e_n(T) \le e_1(T) = \|T\|$. The second line of (17) is a direct consequence of Theorem 15. All that remains is the third line: for large $n$ we factorise $T$ through its (at most $d$-dimensional) range (27) and apply Lemma 9 to the finite rank factor, while Theorem 15 bounds the other factor. Substituting, we are done.

Proof (Theorem 17) We distinguish between the two last bounds of Lemma 16. For the intermediate regime we bound the covering numbers using Lemmas 11, 31 and 18; rewriting the conditions on $n$ in terms of $\varepsilon$ and collecting all remaining terms into the constants $c$ and $c'$ completes the proof. Application of Lemma 11 proves the first inequality of (18). For the second inequality simply note the third line of Lemma 16. A slightly more involved and longer argument gives a better bound: there are constants such that a sharper estimate (32) holds.

Proof (Lemma 20) Similar to Theorem 17, consider the factorization of $S_X$ through an identity operator:

[Factorization diagram]

By the product inequality for entropy numbers (Lemma 8), exploiting the factorization we can bound $e_n(S_X)$ by the product of the entropy numbers of the factors, which proves (20). Next we have to bound the individual terms separately.
For the first factor Lemma 31 can be applied; the dual version of the Maurey-Carl theorem applies to the second term. What remains to be shown is the bound on the operator norm of the second factor; this, however, follows immediately from Lemma 10, in analogy to the previous section. Solving for $n$ then immediately gives the bound on the covering numbers.

Proof (Theorem 19) As before we observe that $S_X$ can be factorised through an identity operator:

[Factorization diagram]

The idea is that the identity operator uses up the "slack" between $\frac{1}{p} + \frac{1}{q}$ and 1 implicit in the constraint $\frac{1}{p} + \frac{1}{q} \ge 1$. As will be seen, a better rate is achieved the larger the intermediate exponent is, but that exponent can be no larger than allowed by Lemma 10. If equality is achieved in the constraint, then the identity maps the space to itself and of course nothing (additional) is gained. We will show a bound on the norm of the restricted operator and then use Lemma 8 to bound the entropy numbers (28). The restricted operator is identical to $S_X$ except for its domain; thus, by Lemma 10, since $\frac{1}{p} + \frac{1}{q} \ge 1$ and $\|x_i\|_q \le 1$, its norm is bounded by 1.

Proof (Lemma 22) Let $f \in \Gamma_p(W)$ be arbitrary, $f = \sum_i a_i w_i$ with $\sum_i |a_i|^p \le 1$, where the number of terms may be infinite. For each $i$, let $\hat{w}_i \in \hat{W}$ be such that $\|w_i - \hat{w}_i\| \le \varepsilon$. (Such $\hat{w}_i$ exist by the definition of an $\varepsilon$-cover.) Let $\hat{f} := \sum_i a_i \hat{w}_i \in \Gamma_p(\hat{W})$. Then
$$\|f - \hat{f}\| = \Bigl\|\sum_i a_i (w_i - \hat{w}_i)\Bigr\| \le \sum_i |a_i|\, \varepsilon \le \varepsilon, \qquad (38)$$
since $\sum_i |a_i| \le \sum_i |a_i|^p \le 1$ for $p \le 1$.

Proof (Lemma 23) Suppose $G$ is a $\delta$-cover of $\Gamma_p(\hat{W})$. From Lemma 22, for any $f \in \Gamma_p(W)$ there exists an $\hat{f} \in \Gamma_p(\hat{W})$ such that $\|f - \hat{f}\| \le \varepsilon$. But for any such $\hat{f}$ there exists $g \in G$ such that $\|\hat{f} - g\| \le \delta$. By the triangle inequality, $\|f - g\| \le \varepsilon + \delta$. Thus $G$ is an $(\varepsilon + \delta)$-cover of $\Gamma_p(W)$.

B Maurey's Theorem

In this appendix we provide a proof of Maurey's theorem for operators $T \in \mathfrak{L}(H, \ell_\infty^m)$ which gives an explicit value for the constant.
This is considerably more work, and we get a correspondingly poorer estimate, than in the dual case of Theorem 13 for operators in $\mathfrak{L}(\ell_1^m, H)$. The argument of this section is due to Professor Bernd Carl.

Theorem 27 The bound (16) of Theorem 15 holds with the explicit constant computed below.

Proof (Lemma 24) Let $f \in T(U_{\ell_p^k})$. Write $f = Ta$ with $\|a\|_p \le 1$; then $f = \sum_i a_i w_i$ with $\sum_i |a_i|^p \le 1$, so $f \in \Gamma_p(W)$. Likewise, for any $f \in \Gamma_p(W)$, $f = \sum_i a_i w_i$ with $\sum_i |a_i|^p \le 1$, so $f = Ta$ with $a \in U_{\ell_p^k}$. Thus $T(U_{\ell_p^k}) = \Gamma_p(W)$.

Proof (Theorem 25) Let $\hat{W}$ be an $\varepsilon$-cover of $W$. By Lemma 23, if we can cover $\Gamma_p(\hat{W})$ at scale $\delta$ (39), then we obtain an $(\varepsilon + \delta)$-cover of $\Gamma_p(W)$ (40). By Lemma 24, $\Gamma_p(\hat{W})$ is the image of the unit ball of $\ell_p^k$ (with $k = |\hat{W}|$) under a linear operator into $H$. In order to compute its entropy numbers we factorise this operator:

[Factorization diagram] (41)-(42)

By Lemma 8, the entropy numbers of the composition are bounded by the product of the entropy numbers of the factors (43). Choosing the intermediate space suitably and using Corollary 14 and Lemma 31, we obtain (44). Combining (40) and (45) concludes the proof.

Proof (Theorem 27) We compute the entropy numbers of an operator $T \in \mathfrak{L}(H, \ell_\infty^m)$ by factorizing it as

[Factorization diagram] (46)-(47)

where the intermediate dimension is a free parameter that we will optimize over at the end. The two factors will be dealt with by using the following two propositions, respectively.

Definition 28 ($\ell$-norm) $\ell(T)$ denotes the average
$$\ell(T) := \int_{\mathbb{R}^k} \|Tx\| \, d\gamma_k(x) \qquad (48)$$
of $\|Tx\|$ over the $k$-dimensional Gaussian measure $\gamma_k$ (49).

Proposition 29 (Pajor, Talagrand) The dyadic entropy numbers of $T$ are bounded in terms of $\ell(T)$ (50).

We exploit the fact that the entropy numbers of the identity operator from $\ell_2^k$ to $\ell_\infty^k$ are known. They take the form:

Proposition 30 (Schütt) There exists a constant $c$ such that for all $n \in \mathbb{N}$ a bound of the form $e_n(\mathrm{id}\colon \ell_2^k \to \ell_\infty^k) \le c\,(n^{-1}\log(1 + k/n))^{1/2}$ holds (51). Here and below $\log$ is to base 2. Schütt [23] did not provide an explicit value for $c$; we compute one below in Lemma 32.

In order to use Proposition 29 to bound the first factor, we need to upper bound $\ell(T)$. Using the coordinates of $Tx$ in an orthonormal basis and the rotation invariance of the Gaussian measure, the computation reduces to estimating Gaussian averages of maxima. The Khinchin inequality [33], which states that the first and second moments of a Gaussian linear combination are equivalent up to an explicit constant, then yields an explicit bound on $\ell(T)$ in terms of $\|T\|$, $m$ and the free parameter. Now choose the free parameter to balance the two factors. One can check that the resulting function of the parameter has a unique maximum; computing its derivative, setting it equal to zero and solving numerically, one finds the maximum and hence the numerical value of the constant in Theorem 27 (52)-(53).

Conjecture on the best value of the Maurey constant. A value of the size obtained is not particularly satisfying; we believe in fact it is quite loose. Our reasoning is as follows. Consider all operators $T$ such that $\|T\| \le 1$. By definition $T(U_H) \subseteq U_{\ell_\infty^m}$.
Observe that the unit ball of $\ell_2^m$ is the ellipsoid of maximum volume contained inside $U_{\ell_\infty^m}$. Since covering numbers are bounded below by volume ratios,
$$N(\varepsilon, T(U_H)) \ge \frac{\operatorname{vol}(T(U_H))}{\operatorname{vol}(\varepsilon\, U_{\ell_\infty^m})},$$
and the volume of $T(U_H)$ is maximized, over all $T$ with $\|T\| \le 1$, by taking $T(U_H)$ to be that maximal ellipsoid. Next, we combine the obtained bound with Schütt's result; by Proposition 29, and noting that the argument applies for even as well as odd $n$, we obtain a lower bound on the best possible constant. To infer a bound on $e_n(T)$ for all $n \in \mathbb{N}$, we set the free parameter accordingly (52), and all that remains is to bound the resulting expression, which we now do. From (66), substituting our choice of parameter into (52) along with the numerical values, one can check that the resulting function has a unique maximum; computing its derivative, setting it equal to zero and solving numerically, one finds the maximum, at which point the expression evaluates to the constant stated above.

We conjecture that for all $n \in \mathbb{N}$ and all $T$ with $\|T\| \le 1$, the volume-based lower bound is attained up to the constant (55). If this were true, the Maurey constant would be 1.86.

C Bounds on $e_n(\mathrm{id}\colon \ell_p^d \to \ell_q^d)$

We now determine bounds on $e_n(\mathrm{id}\colon \ell_p^d \to \ell_q^d)$. These have been given in [29, 4.10.3], [18, page 172], [14, 3.c.8], [23], and [9, p. 141] (see also [8, page 101]). All but the last two references only considered $p, q \ge 1$. The most recent contribution by Edmunds and Triebel [9, page 141], [8, page 101] subsumes all of the others and is summarized in the lemma below. By an argument of Schütt [23] it is asymptotically optimal in $n$ and $d$.

Lemma 31 Let $0 < p < q \le \infty$ and $d \in \mathbb{N}$. Then there is a constant $c = c(p, q)$ such that
$$e_n(\mathrm{id}\colon \ell_p^d \to \ell_q^d) \le c \cdot \begin{cases} 1 & \text{if } 1 \le n \le \log d, \\ \bigl(n^{-1}\log(1 + d/n)\bigr)^{1/p - 1/q} & \text{if } \log d \le n \le d, \\ 2^{-n/d}\, d^{1/q - 1/p} & \text{if } n \ge d. \end{cases}$$

We can find a nice (small) explicit value for the constant for the case $p = 2$, $q = \infty$, which is of interest in its own right. We proceed with that case now.

C.1 When $p = 2$ and $q = \infty$

Since we need an explicit value of the constant, we provide the explicit proof below.

Lemma 32 For all $n \in \mathbb{N}$ and all $d$,
$$e_n(\mathrm{id}\colon \ell_2^d \to \ell_\infty^d) \le c\, \bigl(n^{-1}\log(1 + d/n)\bigr)^{1/2} \qquad (54)$$
for an explicit constant $c$ (55).

Proof Let $B_p^d$ denote an $\ell_p$-ball in $d$ dimensions. For a given number of dimensions $d$, we determine the smallest radius $r$ so that $B_2^d$ can be covered by cubes (that is, $\ell_\infty$ balls) of radius $r$ centred at the points of a regular grid spanned by the canonical basis vectors $e_i$ (56). Setting the grid spacing to twice the radius of the cubes determines the number of cubes per axis. We will now count the cubes.
Along each of the axes of $B_2^d$ we use a number of cubes proportional to $1/r$; we add and subtract one here so as not to count twice the cube that sits at the very centre of $B_2^d$. Therefore, taking all axes together, of the order of $(1/r)^d$ cubes cover the central part of $B_2^d$ (57). Observe that, by construction, any point of $B_2^d$ which is not covered by these cubes must lie within a lower-dimensional ball on one of the principal coordinate subspaces. Thus, by separately covering all possible choices of axes in the same manner, we cover all of $B_2^d$; since there are $\binom{d}{j}$ ways to choose $j$ axes, we obtain a bound of the form (58)-(60) on the total number of cubes of radius $r$ needed. Choosing the free parameter in this bound (61) and bounding the two resulting factors numerically (each attains a unique maximum, determined numerically), we conclude that for all $n \in \mathbb{N}$ and all $d$ the bound of Lemma 32 holds with the stated constant (62).

The following interpolation lemma follows immediately from [18, p. 173].

Lemma 33 Let $0 < p < q \le \infty$. Then for all $n \in \mathbb{N}$, the entropy numbers $e_n(\mathrm{id}\colon \ell_p^d \to \ell_q^d)$ can be bounded by interpolation through an intermediate exponent (63)-(64). Combining this lemma with (54) gives, for the cases of interest, bounds of the same form as (54) with constants depending on $p$ and $q$ (65)-(66).

C.2 When $p \ne 2$

Lemma 31 and the proof of Lemma 32 suggest that a result of a similar form to Lemma 32 should be obtainable when $p \ne 2$. However, it turns out that in this case the value of the constant obtained is quite unsatisfactory, because its value is dominated by the behaviour of the bound for very small $n$ and $d$. For learning applications these are uninteresting values of $n$ and $d$. Furthermore, for some applications we are actually interested in how the bound behaves as a function of $p$ for fixed $n$ and $d$, for example if we wanted to use $p$ as a "capacity control knob." In that case it is necessary to determine the dependence of the constant on $p$ explicitly.
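The grid-counting idea in the proof of Lemma 32 can be mimicked numerically: cover the unit $\ell_2$ ball in $\mathbb{R}^d$ by axis-aligned cubes of $\ell_\infty$-radius $r$ centred on the grid $(2r)\mathbb{Z}^d$, and count only cubes whose centres can meet the ball. A hypothetical illustration (the constants here are crude and are not those of Lemma 32):

```python
import itertools
import numpy as np

def cube_cover_count(d, r):
    """Count cubes of l_inf-radius r, centred on the grid (2r)Z^d, whose
    centres lie within l_2 distance 1 + r*sqrt(d) of the origin.

    Any point p of the unit l_2 ball lies in the grid cube whose centre c
    satisfies ||p - c||_inf <= r, hence ||c||_2 <= 1 + r*sqrt(d); so these
    cubes cover the ball, and the count upper-bounds the covering number.
    """
    k = int(np.ceil((1 + r * np.sqrt(d)) / (2 * r)))   # grid range per axis
    count = 0
    for idx in itertools.product(range(-k, k + 1), repeat=d):
        centre = 2 * r * np.array(idx)
        if np.linalg.norm(centre) <= 1 + r * np.sqrt(d):
            count += 1
    return count

n_cubes = cube_cover_count(2, 0.5)
```

As the proof above notes, refining the count by treating lower-dimensional faces separately is what removes the crude $\sqrt{d}$ slack and produces the explicit constant.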
In doing so it turns out that in the case of $p \ne 2$ one has to pay too high a price for the elegance of an expression of the form (54), and we end up being better served by an implicit formula, which nevertheless can be easily computed. This implicit formula does exhibit the expected behaviour in $p$ for fixed $n$ and $d$. Treating the covering radius as the unknown, we have from (60) an implicit equation (67). For a given $n$ we can easily determine the radius numerically: the defining function is monotone, so the solution is unique. Using this method one can plot the log covering numbers as a function of $p$ for various $n$ and $d$. As one would expect, the log covering numbers decrease with decreasing $p$.

D $p$-Convex Hull of Heavisides

We take the following definitions from [15]. Suppose $f \colon [a, b] \to \mathbb{R}$ and $0 < p \le 1$. The $p$-variation of $f$ is
$$V_p(f) := \sup \sum_i |f(\beta_i) - f(\alpha_i)|^p,$$
where the supremum is over arbitrary finite systems of non-overlapping intervals $[\alpha_i, \beta_i] \subseteq [a, b]$. Suppose $f$ is continuous on $[a, b]$ and let $G$ be the union of all open subintervals on which $f$ is either strictly monotonic or constant. Then $K(f) := [a, b] \setminus G$ is the set of points of varying monotonicity. If $f$ is not continuous everywhere, we let $D$ denote its set of points of discontinuity and set $K(f) := ([a, b] \setminus G) \cup D$. If $V_p(f) < \infty$, $f$ is said to be of bounded $p$-variation; we write $f \in \mathrm{BV}_p$, and $f \in \mathrm{BV}_p(1)$ if moreover $V_p(f) \le 1$.

Let $\theta_s(x) := \theta(x - s)$, where $\theta$ is the Heaviside step function.

Lemma 34 For any $0 < p \le 1$, $\Gamma_p(\{\theta_s \colon s \in [a, b]\}) \subseteq \mathrm{BV}_p(1)$.

Proof Any $f \in \Gamma_p(\{\theta_s\})$ can be written as $f = \sum_i a_i \theta_{s_i}$ with $\sum_i |a_i|^p \le 1$, where we may assume $s_1 \le s_2 \le \cdots$. On any finite system of non-overlapping intervals, each increment of $f$ is a sum of those $a_i$ whose thresholds $s_i$ fall inside the corresponding interval; since $|\cdot|^p$ is subadditive for $p \le 1$,
$$V_p(f) \le \sum_i |a_i|^p \le 1.$$

As a way of illustrating the "size" of $\mathrm{BV}_p(1)$, in a fashion that gives some additional intuition for our entropy number results determined directly, we will now compute the fat-shattering dimension of $\mathrm{BV}_p(1)$.

Proposition 35 Suppose $0 < p \le 1$ and $\gamma > 0$. Then
$$\mathrm{fat}_{\mathrm{BV}_p(1)}(\gamma) \le 1 + 2\,(4\gamma)^{-p}.$$

Proof Suppose $\mathrm{BV}_p(1)$ $\gamma$-shatters the points $x_1 < \cdots < x_k$ with respect to levels $r_1, \dots, r_k$. Then there are $f, g \in \mathrm{BV}_p(1)$ realizing the alternating sign assignment and its negation: $f(x_j) \ge r_j + \gamma$ and $g(x_j) \le r_j - \gamma$ for odd $j$, and with the roles of $f$ and $g$ interchanged for even $j$. The difference $h := f - g$ then satisfies $h(x_j) \ge 2\gamma$ for odd $j$ and $h(x_j) \le -2\gamma$ for even $j$, so $|h(x_{j+1}) - h(x_j)| \ge 4\gamma$ for $j = 1, \dots, k-1$, whence $V_p(h) \ge (k-1)(4\gamma)^p$. On the other hand, by the subadditivity of $|\cdot|^p$, $V_p(h) \le V_p(f) + V_p(g) \le 2$. Combining the two estimates, $k \le 1 + 2(4\gamma)^{-p}$.

The smallness of $\mathrm{BV}_p(1)$ is well illustrated by the following theorem from [15].
For $\beta > 0$, write $\beta = n + \alpha$ with $n \in \mathbb{N}_0$ and $0 < \alpha \le 1$, and define
$$C^\beta := \bigl\{ g \colon [a, b] \to \mathbb{R} \;\colon\; g \text{ is } n\text{-times differentiable and } |g^{(n)}(s) - g^{(n)}(t)| \le c\,|s - t|^\alpha \text{ for some } c < \infty \bigr\}.$$

Theorem 36 (Laczkovich and Preiss) Let $0 < p \le 1$ and $\beta = 1/p$. Then $f \in \mathrm{BV}_p$ if and only if there exists a homeomorphism $h$ of $[a, b]$ onto itself such that $f \circ h \in C^\beta$.

(For related results see [17] and references therein.)
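Lemma 34 can be checked numerically for piecewise-constant functions: for $p \le 1$, the subadditivity of $|\cdot|^p$ means the supremum in $V_p$ is attained by the finest partition, so the $p$-variation of a step function is just the sum of $|\text{jump}|^p$ over its jumps, and a $p$-convex combination of Heavisides has $p$-variation at most 1. A hypothetical sketch (not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def p_variation(values, p):
    """p-variation of a piecewise-constant function given by its values on
    a fine grid.  For p <= 1, splitting an increment only increases
    sum |increment|^p, so the sup over interval systems equals the sum of
    |jump|^p over the individual jumps.
    """
    jumps = np.diff(values)
    return np.sum(np.abs(jumps[jumps != 0]) ** p)

# A p-convex combination of Heaviside steps theta(x - s_i), sampled on a grid.
p = 0.5
grid = np.linspace(0.0, 1.0, 1001)
a = rng.normal(size=20)
a /= np.sum(np.abs(a) ** p) ** (1.0 / p)    # enforce sum |a_i|^p = 1
s = rng.random(20)                          # step locations in [0, 1)
f = sum(ai * (grid >= si) for ai, si in zip(a, s))

vp = p_variation(f, p)                      # Lemma 34 predicts vp <= 1
```

When two steps fall in the same grid cell their jumps merge, but $|a_i + a_j|^p \le |a_i|^p + |a_j|^p$ keeps the bound intact.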