Entropy Numbers of Linear Function Classes

Robert C. Williamson
Department of Engineering
Australian National University
Canberra, ACT 0200, Australia
Alex J. Smola
Department of Engineering
Australian National University
Canberra, ACT 0200, Australia
Bernhard Schölkopf
Microsoft Research Limited
St. George House, 1 Guildhall Street
Cambridge CB2 3NH, UK
Abstract
This paper collects together a miscellany of results originally motivated by the analysis of the generalization performance of the "maximum margin" algorithm due to Vapnik and others. The key feature of the paper is its operator-theoretic viewpoint. New bounds on covering numbers for classes related to maximum margin classes are derived directly, without making use of a combinatorial dimension such as the VC dimension. Specific contents of the paper include:
- a new and self-contained proof of Maurey's theorem and some generalizations, with small explicit values of the constants;
- bounds on the covering numbers of maximum margin classes suitable for the analysis of their generalization performance;
- the extension of such classes to those induced by balls in quasi-Banach spaces (such as ℓ_p norms with 0 < p < 1);
- the extension of results on the covering numbers of convex hulls of basis functions to p-convex hulls (0 < p ≤ 1);
- an appendix containing the tightest known bounds on the entropy numbers of the identity operator between ℓ_p^n and ℓ_q^n (0 < p < q ≤ ∞).
1 Introduction
Linear classifiers have had a resurgence of interest in recent years because of the development of Support Vector machines [22, 24], which are based on maximum margin hyperplanes [25]. The generalization performance of support vector machines is becoming increasingly well understood via analysis of the covering numbers of the classes of functions they induce. Some of this analysis has made use of entropy number techniques.
In this paper we focus more closely on the simple maximum margin case and do not consider kernel mappings at all. The effect of the kernel used in support vector machines has been analysed using similar techniques in [32, 11]. The classical maximum margin algorithm effectively works with
the class of linear functions x ↦ ⟨w, x⟩ with the Euclidean norm of the weight vector w bounded. (The standard notation used here is defined precisely below in Definition 3.) The focus of the present paper is to consider what happens when different norms are used in the definition of this class.
Apart from the purely mathematical interest in developing the connection between problems of determining covering numbers of function classes and those of determining entropy numbers of operators, the results in the paper indicate the effect to be expected from using different norms to define linear function classes in practical learning algorithms. There is a considerable body of work in the mistake-bound framework for analysing learning algorithms for linear function classes that explores the effect of different norms. The present paper can be considered a similar exercise in the statistical learning theory framework.
The following section collects all the definitions we need.
All proofs in the paper are relegated to the appendix.
2 Definitions
We will make use of several notions from the theory of Banach spaces and a generalization of these called quasi-Banach spaces. A nice general reference for Banach spaces is [33]; for quasi-Banach spaces see [8].
Definition 1 (Banach space) A Banach space (E, ‖·‖) is a complete normed linear space with a norm ‖·‖ on E, i.e. a map ‖·‖: E → [0, ∞) that satisfies
1. ‖x‖ = 0 if and only if x = 0;
2. ‖λx‖ = |λ| ‖x‖ for scalars λ ∈ ℝ and all x ∈ E;
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ E.
Definition 2 (Quasi-norm) A quasi-norm is a map like a norm which, instead of satisfying the triangle inequality (3 above), satisfies
3'. There exists a constant C ≥ 1 such that for all x, y ∈ E, ‖x + y‖ ≤ C(‖x‖ + ‖y‖).
All of the spaces considered in this paper are real. A norm (quasi-norm) induces a metric (quasi-metric) via d(x, y) := ‖x − y‖. We will use ‖·‖ to denote both the norm and the induced metric, and use (E, ‖·‖) to denote the induced metric space. A quasi-Banach space is a complete quasi-normed linear space.
Definition 3 (ℓ_p norms) Suppose x = (x_1, x_2, …) ∈ ℝ^m with m ∈ ℕ, or x ∈ ℝ^ℕ (for infinite dimensional spaces). Then
for 0 < p < ∞, ‖x‖_p := (Σ_i |x_i|^p)^{1/p} (provided the sum converges);
for p = ∞, ‖x‖_∞ := sup_i |x_i|.
If m < ∞, we often write ‖·‖_{ℓ_p^m} to explicitly indicate the dimension. The space ℓ_p^m is defined as (ℝ^m, ‖·‖_p). Given a sample X = (x_1, …, x_n), we write ℓ_p^n(X) for ℝ^n equipped with the norm (1) below.
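As a concrete illustration (added here, not part of the original text), the following Python sketch checks that for p = 1/2 the triangle inequality fails, while the quasi-norm inequality holds with the standard constant C = 2^(1/p − 1):

```python
import math

def lp_norm(x, p):
    # l_p (quasi-)norm of a finite sequence; p = float('inf') gives the sup norm.
    if math.isinf(p):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

x, y = [1.0, 0.0], [0.0, 1.0]

# For p >= 1 the triangle inequality holds:
assert lp_norm([a + b for a, b in zip(x, y)], 2.0) <= lp_norm(x, 2.0) + lp_norm(y, 2.0)

# For p = 1/2 it fails: ||x + y||_p = 4 > 2 = ||x||_p + ||y||_p ...
p = 0.5
lhs = lp_norm([a + b for a, b in zip(x, y)], p)
assert lhs > lp_norm(x, p) + lp_norm(y, p)

# ... but the quasi-norm constant C = 2^(1/p - 1) restores a weakened version.
C = 2.0 ** (1.0 / p - 1.0)
assert lhs <= C * (lp_norm(x, p) + lp_norm(y, p)) + 1e-12
```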
Suppose F is a class of functions defined on ℝ^d. The ℓ_p^n norm with respect to X = (x_1, …, x_n) of f ∈ F is defined as
‖f‖_{ℓ_p^n(X)} := (Σ_{i=1}^n |f(x_i)|^p)^{1/p}.  (1)
For 1 ≤ p ≤ ∞, ‖·‖_{ℓ_p^n(X)} is a norm, and for 0 < p < 1 it is a quasi-norm. Note that a different definition of the ℓ_p^n norm is used in some papers in learning theory, e.g. [28, 34]. A useful inequality in this context is the following (cf. e.g. [16, p.21]):
Theorem 4 (Hölder's inequality) Suppose p, q ≥ 1 satisfy 1/p + 1/q = 1 and that x ∈ ℓ_p and y ∈ ℓ_q. Then Σ_i |x_i y_i| ≤ ‖x‖_p ‖y‖_q.
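A quick numerical sanity check of Hölder's inequality (an illustration added here, using arbitrary random test data):

```python
import random

def lp_norm(x, p):
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

random.seed(0)
p = 3.0
q = p / (p - 1.0)  # conjugate exponent: 1/p + 1/q = 1
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(5)]
    y = [random.uniform(-1, 1) for _ in range(5)]
    lhs = sum(abs(a * b) for a, b in zip(x, y))
    assert lhs <= lp_norm(x, p) * lp_norm(y, q) + 1e-9
```

Equality is attained e.g. when |x_i|^p and |y_i|^q are proportional.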
Suppose (X, d) is a metric space and let A, B ⊆ X and ε > 0. We say B is an ε-cover for A if for all a ∈ A there is a b ∈ B such that d(a, b) ≤ ε.
Definition 5 (Covering and entropy numbers) The ε-covering number of A, denoted by N(ε, A), is the size of the smallest ε-cover of A. The nth entropy number of a set A ⊆ X is defined by
ε_n(A) := inf{ε > 0 : N(ε, A) ≤ n}.  (2)
Given a class of functions F on ℝ^d, the uniform covering number or ε-growth function is
N_p(ε, F, n) := sup_{X ∈ (ℝ^d)^n} N(ε, F, ℓ_p^n(X)).  (3)
Covering numbers are of considerable interest to learning
theory because generalization bounds can be stated in terms
of them [1, 30].
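Covering numbers are rarely computable exactly, but a greedy construction gives effective two-sided bounds: a maximal ε-separated subset is automatically an ε-cover, so its size upper-bounds N(ε, A) and, by the standard packing argument, lower-bounds N(ε/2, A). A minimal sketch (an illustration added here, not from the paper):

```python
def maximal_separated_cover(points, eps, dist):
    # Greedily build a maximal eps-separated subset of `points`.  Maximality
    # means every point is within eps of some chosen center, so the result is
    # an eps-cover; and since an (eps/2)-ball contains at most one point of an
    # eps-separated set, its size also lower-bounds N(eps/2).
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

def d_inf(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Discretized unit interval, eps = 1/4.
pts = [(i / 100.0,) for i in range(101)]
cover = maximal_separated_cover(pts, 0.25, d_inf)
assert all(min(d_inf(p, c) for c in cover) <= 0.25 for p in pts)
```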
Definition 6 (Operator norm) Denote by (E, ‖·‖_E) a (quasi-)normed space; U_E is the (closed) unit ball: U_E := {x ∈ E : ‖x‖_E ≤ 1}. Suppose E and F are (quasi-)Banach spaces and T is a linear operator mapping from E to F. Then the operator norm of T is defined by
‖T‖ := sup_{x ∈ U_E} ‖Tx‖_F,  (4)
and T is called bounded if ‖T‖ < ∞. We denote by L(E, F) the set of all bounded linear operators from E to F.
Definition 7 (Entropy numbers of operators) Suppose T ∈ L(E, F). The entropy numbers of the operator T are defined by
ε_n(T) := ε_n(T(U_E)).  (5)
The dyadic entropy numbers are defined by
e_n(T) := ε_{2^{n−1}}(T) for n ∈ ℕ.  (6)
The main reference for entropy numbers of operators is [7].
Many of the properties shown there for Banach spaces actually carry over to quasi-Banach spaces — see e.g. [8].
The factorization theorem for entropy numbers is of considerable use:
Lemma 8 (Edmunds and Triebel [8, p.7]) Let E, F, G be quasi-Banach spaces and let S ∈ L(E, F) and T ∈ L(F, G). Then
1. ‖TS‖ ≤ ‖T‖ ‖S‖;
2. e_{m+n−1}(TS) ≤ e_m(T) e_n(S) for m, n ∈ ℕ.
Finite rank operators have exponentially decaying entropy
numbers:
Lemma 9 (Carl and Stephani [7, pp.14, 21]) Denote by E, F Banach spaces, let T ∈ L(E, F) and suppose that rank(T) ≤ m. Then for all n ∈ ℕ,
e_n(T) ≤ 4 ‖T‖ 2^{−(n−1)/m}.  (7)
In fact this bound is tight to within a constant factor.
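The exponential decay in (7) can be made plausible by the textbook volume argument (an added illustration, not the proof used in [7]): the unit ball of an m-dimensional normed space admits an ε-cover of size at most (1 + 2/ε)^m, so 2^{n−1} balls suffice at scale roughly 3 · 2^{−(n−1)/m}:

```python
def cover_bound(eps, m):
    # Standard volumetric estimate: the unit ball of an m-dimensional normed
    # space has an eps-cover of size at most (1 + 2/eps)^m.
    return (1.0 + 2.0 / eps) ** m

def entropy_bound(n, m):
    # Inversion with 2^(n-1) balls: for eps = 3 * 2^(-(n-1)/m) <= 1 we have
    # (1 + 2/eps)^m <= (3/eps)^m = 2^(n-1), so e_n <= 3 * 2^(-(n-1)/m).
    return 3.0 * 2.0 ** (-(n - 1) / m)

m = 4
for n in range(1, 60):
    eps = entropy_bound(n, m)
    if eps <= 1.0:
        assert cover_bound(eps, m) <= 2.0 ** (n - 1) * (1.0 + 1e-9)
```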
One operator that we will make repeated use of is the identity operator: if E and F are (quasi-)Banach spaces with the same underlying vector space, id: E → F is defined by
id(x) := x.  (8)
This seemingly trivial operator is of interest when E ≠ F because of the definition of the operator norm. As we shall see below, the entropy numbers of id: ℓ_p^n → ℓ_q^n play a central role in many of the results we develop. In the following we will use c to denote a positive constant, and log is the logarithm to base 2.
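For the identity between finite dimensional ℓ_p spaces the operator norm is explicit (a standard fact, added here as an illustration): it is 1 when p ≤ q, and n^{1/q − 1/p}, attained at the constant vector, when p > q. A small numerical check:

```python
import math, random

def lp_norm(x, p):
    if math.isinf(p):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def id_norm(n, p, q, trials=20000):
    # Monte Carlo lower estimate of ||id : l_p^n -> l_q^n|| = sup ||x||_q / ||x||_p.
    random.seed(1)
    best = 0.0
    for _ in range(trials):
        x = [random.gauss(0, 1) for _ in range(n)]
        best = max(best, lp_norm(x, q) / lp_norm(x, p))
    return best

n = 5
# p <= q: the norm is exactly 1 (||x||_q <= ||x||_p), attained at a basis vector.
assert id_norm(n, 1, 2) <= 1.0 + 1e-9
# p > q: the norm is n^(1/q - 1/p), attained at the constant vector.
ones = [1.0] * n
assert abs(lp_norm(ones, 1) / lp_norm(ones, 2) - n ** 0.5) < 1e-9
```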
3 Entropy Numbers of Linear Function
Classes
Since we are mainly concerned with the capacity of linear functions of the form
f(x) = ⟨w, x⟩,  (9)
we will consider bounds on entropy numbers of linear operators. (Here the norms involved may be norms or quasi-norms.) This will allow us to deal with function classes derived from (9). In particular we will analyze the class of functions obtained by bounding the norm of the weight vector w,  (10)
where n ∈ ℕ. More specifically, we will look at the evaluation of this class on an n-sample X = (x_1, …, x_n) and the entropy numbers of the evaluation map S_X: w ↦ (f_w(x_1), …, f_w(x_n)) in terms of the ℓ_p^n metric. The connection between the entropy numbers of S_X and the covering numbers of the function class is given in Lemma 11 below. Since one cannot expect that all problems can be cast into (10) without proper rescaling, we will be interested in constraints on w and x such that the relevant norms are bounded by one (and rescale later).
It is of interest to determine an explicit value for the constant. Carl [3] proved the bound
(15)
This leads to the following straightforward result:
Lemma 10 (Product bounds from Hölder's inequality) Suppose p, q, r > 0 with 1/r = 1/p + 1/q. Furthermore suppose n ∈ ℕ and x ∈ ℓ_p^n, y ∈ ℓ_q^n. Then
‖x · y‖_r ≤ ‖x‖_p ‖y‖_q,  (11)
where x · y denotes the coordinatewise product.
(11)
In order to avoid tedious notation we will assume that . This is no major restriction since the general results
follow simply by rescaling by .
Our interest in the operator and its entropy numbers is explained by the following lemma which connects
to the uniform covering numbers of .
Lemma 11 (Entropy and Covering Numbers) Let Æ .
If for all , , then
(12)
3.1 The Maurey-Carl Theorem
In this section we present a special case of the famous Maurey-Carl theorem. The proof presented (in the appendix) provides a (small) explicit constant.
The result is not only of fundamental importance in statistical learning theory; it is also of central importance in pure mathematics. Carl and Pajor [6] prove Maurey's theorem via the "little Grothendieck theorem", which is related to Grothendieck's "fundamental theorem of the metric theory of tensor products". Furthermore, the little Grothendieck theorem can be proved in terms of Maurey's theorem (and thus they are formally equivalent). See [7, pages 254–267] for details. The version proved by Carl [3] (following Maurey's proof) uses a characterization of Banach spaces in terms of their Rademacher type. The latter is defined as follows.
Definition 12 (Rademacher type of Banach spaces) A Banach space E is of Rademacher type p, 1 ≤ p ≤ 2, if there is a constant C such that for every finite sequence x_1, …, x_m ∈ E we have
∫_0^1 ‖ Σ_{i=1}^m r_i(t) x_i ‖ dt ≤ C ( Σ_{i=1}^m ‖x_i‖^p )^{1/p}.  (13)
Here r_i(t) := sgn(sin(2^i π t)) is the ith Rademacher function on [0, 1]. The Rademacher type p constant is the smallest constant C satisfying (13).
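Every Hilbert space has Rademacher type 2 with constant 1, since the cross terms in ‖Σ_i ε_i x_i‖² average to zero over the signs. The following sketch (an added illustration; the vectors are arbitrary) verifies this by exact enumeration of all sign patterns:

```python
from itertools import product

def sq_norm(v):
    return sum(t * t for t in v)

xs = [(1.0, 2.0), (0.5, -1.0), (3.0, 0.0)]  # arbitrary vectors in R^2
m = len(xs)

avg = 0.0
for signs in product((-1.0, 1.0), repeat=m):
    s = [sum(e * x[j] for e, x in zip(signs, xs)) for j in range(2)]
    avg += sq_norm(s)
avg /= 2 ** m

# Cross terms <x_i, x_j>, i != j, average to zero over the signs, leaving
# E ||sum_i eps_i x_i||^2 = sum_i ||x_i||^2: type 2 with constant 1.
assert abs(avg - sum(sq_norm(x) for x in xs)) < 1e-9
```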
Theorem 13 (Maurey-Carl) Let E be a Banach space of Rademacher type p, 1 ≤ p ≤ 2. Let n ∈ ℕ and let T be an operator from ℓ_1^n into E. Then there exists a constant c such that for all k ∈ ℕ,
(14)
Corollary 14 (Small Constants for Maurey’s theorem) If
, $ , and
is a Hilbert space, (14) holds with .
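In its commonly cited form (an assumption here, since the symbols in (14) were lost in this copy), the Maurey-Carl bound for operators from ℓ_1^n into a Hilbert space decays like (k^{−1} log(1 + n/k))^{1/2}. The sketch below evaluates this rate, omitting the constant that Corollary 14 pins down:

```python
import math

def maurey_rate(k, n):
    # Rate (log2(1 + n/k) / k)^(1/2); the multiplicative constant is the
    # subject of Corollary 14 and is omitted here.
    return math.sqrt(math.log2(1.0 + n / k) / k)

n = 1024
rates = [maurey_rate(k, n) for k in range(1, n + 1)]
assert all(a >= b for a, b in zip(rates, rates[1:]))  # decreasing in k
assert abs(maurey_rate(n, n) - math.sqrt(1.0 / n)) < 1e-12  # log2(2) = 1
```

Note the only dimension dependence is logarithmic, which is what makes the bound useful for large n.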
The dual version of Theorem 13, i.e. bounds on entropy numbers of operators from a Hilbert space into ℓ_∞^n, has an identical formal structure to Theorem 13. We only state the Hilbert space case here.
Theorem 15 (Dual version of the Maurey-Carl theorem) Suppose H is a Hilbert space, n ∈ ℕ, and T: H → ℓ_∞^n is a linear operator. Then
(16)
where c is an explicit constant.
As explained in Appendix B, we suspect that a smaller value of the constant is possible (we conjecture 1.86). Theorem 15 can be improved by taking advantage of operators with low rank, via Lemma 9.
Lemma 16 (Improved dual Maurey-Carl theorem) Let H be a Hilbert space, and suppose n, k ∈ ℕ and T: H → ℓ_∞^n is a linear operator. Then
(17)
where the bound has three regimes, one each for small, intermediate and large k.
4 Dimensionality and Sample Size
Lemma 16 already indicates that the size n of the sample generating the evaluation operator plays a crucial role in the scaling behaviour of entropy numbers. The dimensionality of the data (i.e. the dimension of the samples x_i) also comes into play. This guides our analysis in the present section. In Section 4.1 we deal with the case where the dimensionality equals the sample size; Section 4.2 deals with the situation where the dimensionality is polynomial in the sample size.
Depending on the setting of the learning problem we need bounds for the entropy numbers of the identity map between ℓ_p^n and ℓ_q^n. We will use such bounds repeatedly below. We have collected together a number of such bounds in Appendix C.
4.1 Dimensionality and Sample Size are Equal
We begin with the simplest case: a finite dimensional Hilbert space whose dimensionality equals the sample size n, hence we will be dealing with ℓ_2^n. Lemma 16 applies. It is instructive to restate this result in terms of covering numbers.
Theorem 17 (Covering numbers, equal dimension and sample size) There exist constants such that for all n ∈ ℕ and all ε > 0 the covering number obeys the two-regime bound (18).
It is interesting to note the analogy with the Sauer-Vapnik-Chervonenkis lemma [31, 20, 1], which shows that the growth function has two regimes. We will now develop generalizations of the above result to other ℓ_p norms. An existing result in this direction is
Lemma 18 (Carl [3, p. 94]) Let , Æ
and let . Then there exists a constant such that for all Æ ,
(19)
(For one can get a better bound along the lines of
Lemma 16.) This leads to the following theorem:
Theorem 19 (Slack in ) Let , , and . Then there exist constants such that with & we have and
4.2 Dimensionality is Polynomial in the Sample Size
Now we consider the case where the dimensionality differs from the sample size n; we derive results that are useful when the dimensionality is polynomial in n. With the evaluation operator defined as before, we proceed to bound its entropy numbers.
Lemma 20 (Slack in ) Let , , , and Æ . Then there exists a constant such
that with
(20)
(21)
Consider now a particular choice of the exponents. From Lemma 20 it follows that regression in this setting has a sample complexity that can be read off the bound, ignoring logarithmic factors. Interestingly, Zhang [34] has developed a result going in the other direction: he makes use of mistake bounds to determine similar covering numbers, whereas our covering number bounds (computed directly) recover the general form of the mistake bounds when turned into batch learning results. (Note, too, that Zhang uses a normalised definition of the ℓ_p^n norm, so care needs to be taken in comparing his results to ours.)
5 Covering Numbers of p-Convex Hulls
Adaboost [10] is an algorithm related to the variants on the maximum margin algorithm considered in this paper. It outputs a hypothesis in the convex hull of the set of weak learners, and its generalization performance can be expressed in terms of the covering number of the convex hull of weak learners at a scale related to the margin achieved [21]. In this section we consider the effect on the covering numbers of the class used (and hence on the generalization bounds) when the p-convex hull is used (with 0 < p ≤ 1). Variants of Adaboost can be developed which use the p-convex hull, and experimental results indicate that p affects the generalization in a manner consistent with what is suggested by the theory below [2]. The argument below is inspired by the results in [4, 5].
We are interested in the covering numbers of co_p(W) when W is a subset of a Hilbert space H and co_p(W) denotes the p-convex hull of W (see Definition 21). Recalling the definition of the evaluation operator, evaluation commutes with taking p-convex hulls. Since the evaluation induces a Hilbert space norm, we will bound the covering numbers of co_p(W) in terms of covering numbers with respect to the Hilbert space norm. Since the results will hold regardless of the sample, in fact we will bound the uniform covering number of co_p(W).
Definition 21 (p-convex hull) Suppose 0 < p ≤ 1 and W is a set. Then the p-convex hull of W (strictly speaking the absolutely p-convex hull) is defined by [13, chapter 6]
co_p(W) := { Σ_{i=1}^m α_i w_i : m ∈ ℕ, w_i ∈ W, α_i ∈ ℝ, Σ_i |α_i|^p ≤ 1 }.
As an example, consider the p-convex hull of the set of Heaviside functions on [0, 1]. It is well known that the convex hull (p = 1) is the set of functions of bounded variation. In Appendix D we explore the analogous situation for p < 1.
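The constraint Σ_i |α_i|^p ≤ 1 with p ≤ 1 forces the coefficients to be few and small: it implies Σ_i |α_i| ≤ 1, and at most t^{−p} coefficients can exceed t in magnitude. A short check of these two consequences (an illustration added here):

```python
def check_p_convex(alphas, p, t):
    # Coefficients of a p-convex combination: sum |a|^p <= 1.
    assert sum(abs(a) ** p for a in alphas) <= 1.0 + 1e-12
    # (i) each |a| <= 1, and for p <= 1, |a| <= |a|^p, hence sum |a| <= 1:
    assert sum(abs(a) for a in alphas) <= 1.0 + 1e-12
    # (ii) each coefficient with |a| >= t contributes at least t^p, so at
    # most t^(-p) of them can exceed t:
    big = sum(1 for a in alphas if abs(a) >= t)
    assert big <= t ** (-p) + 1e-12
    return big

# p = 1/2: four coefficients of 1/16 give sum |a|^p = 4 * (1/4) = 1.
assert check_p_convex([1.0 / 16] * 4, 0.5, 0.1) == 0
```

This sparsity of the coefficient sequence is the mechanism by which p < 1 shrinks the class relative to the ordinary convex hull.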
Lemma 22 Let 0 < p ≤ 1, let A be a set and let B be an ε-cover of A. Then co_p(B) is an ε-cover of co_p(A).
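Assuming Lemma 22 states that the p-convex hull of an ε-cover of A is an ε-cover of co_p(A) (which is what the proof in the appendix establishes), the following sketch verifies the mechanism numerically: a cover is built by rounding to a grid, and arbitrary p-convex combinations stay within ε because Σ|α_i| ≤ 1:

```python
import random

def d2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

random.seed(2)
eps, p = 0.05, 0.5
A = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(20)]
# Round each coordinate to a grid of mesh eps; the per-coordinate error is at
# most eps/2, so the l_2 error is at most eps/sqrt(2) < eps: B is an eps-cover.
B = [tuple(round(t / eps) * eps for t in a) for a in A]
assert all(d2(a, b) <= eps for a, b in zip(A, B))

for _ in range(200):
    k = random.randint(1, 5)
    idx = [random.randrange(len(A)) for _ in range(k)]
    raw = [random.uniform(0.1, 1.0) for _ in range(k)]
    scale = sum(r ** p for r in raw) ** (1.0 / p)  # now sum |alpha|^p = 1
    alphas = [r / scale for r in raw]
    ca = [sum(al * A[i][j] for al, i in zip(alphas, idx)) for j in range(2)]
    cb = [sum(al * B[i][j] for al, i in zip(alphas, idx)) for j in range(2)]
    # Since sum |alpha| <= 1 for p <= 1, the combinations stay eps-close.
    assert d2(ca, cb) <= eps + 1e-9
```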
Lemma 23 Let , Æ , and Then Æ .
Æ.
Lemma 24 Let w_1, …, w_m ∈ H where H is a Hilbert space. Define the linear operator T: ℓ_p^m → H such that T e_i = w_i for i = 1, …, m, where (e_i) is the canonical basis of ℓ_p^m (e.g. e_1 = (1, 0, 0, …), e_2 = (0, 1, 0, …), etc.). Then T maps the unit ball of ℓ_p^m onto co_p({w_1, …, w_m}).
Theorem 25 (Covering numbers of p-convex hulls) Let H be a Hilbert space and let W ⊆ H be compact. Then the uniform covering number of co_p(W) obeys the bound (45).
Suppose the entropy numbers of W decay polynomially. We can determine the rate of growth of the entropy numbers of co_p(W) as follows. Neglecting the additive term inside the log in (45), we can explicitly solve the resulting equation. Numerical evidence suggests that the dependence on the value of the free parameter in (45) is not very strong, so we fix a convenient choice. Then a simple approximate calculation yields:
Corollary 26 Suppose W ⊆ H is such that its entropy numbers decay polynomially. Then the entropy numbers of co_p(W) obey the bound obtained from (45) as above.
For p = 1 this is slightly weaker than the rate known from [4]; of course for large scales the difference is negligible (asymptotically).
6 Conclusions
We have computed covering numbers for a range of variants on the maximum margin algorithm. In doing so we made explicit use of an operator-theoretic viewpoint already used fruitfully in analysing the effect of the kernel in SV machines. We also analysed the covering numbers of p-convex hulls of simple classes of functions.
We have seen how the now classical results for maximum margin hyperplanes can be generalized to function classes induced by different norms. The scaling behaviour of the resulting covering number bounds gives some insight into how related algorithms will perform in terms of their generalization performance. In other work [32, 11, 26] we have explored the effect of the kernel used in support vector machines, for instance. In that case the eigenvalues of the kernel play a key role. In all this work we have used the viewpoint that the function class is the image, under the multiple evaluation map (considered as a linear operator), of a ball induced by a norm.
Gurvits [12] asked (effectively) what learning theory can do for the geometric theory of Banach spaces. It seems the assistance flows more readily in the other direction. Perhaps the one contribution learning theory has made is pointing out an interesting research direction [27] by giving an answer to Pietsch's implicit question, where he said of entropy numbers [19, p.311] that "at present we do not know any application in the real world." Now at least there is one!
Acknowledgements We would like to thank Bernd Carl, Anja Westerhoff and Ingo Steinwart for detailed comments and much help. This work was supported by the Australian Research Council. AS was supported by the DFG, grant SM 62/1-1. Parts of this work were done while BS was visiting the Australian National University.
References
[1] M. Anthony and P. Bartlett. A Theory of Learning in Artificial Neural Networks. Cambridge University Press, 1999.
[2] J. Barnes. Capacity control in boosting using a convex hull. Master's thesis, Department of Engineering, Australian National University, 1999.
[3] B. Carl. Inequalities of Bernstein-Jackson-type and the degree of compactness of operators in Banach spaces. Annales de l'Institut Fourier, 35(3):79–118, 1985.
[4] B. Carl. Metric entropy of convex hulls in Hilbert spaces. Bulletin of the London Mathematical Society, 29:452–458, 1997.
[5] B. Carl, I. Kyrezi, and A. Pajor. Metric entropy of convex hulls in Banach spaces. Proceedings of the London Mathematical Society, 1999. To appear.
[6] B. Carl and A. Pajor. Gelfand numbers of operators with values in a Hilbert space. Inventiones Mathematicae, 94:479–504, 1988.
[7] B. Carl and I. Stephani. Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, Cambridge, UK, 1990.
[8] D. E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, Cambridge, 1996.
[9] D. E. Edmunds and H. Triebel. Entropy numbers and approximation numbers in function spaces. Proceedings of the London Mathematical Society, 58:137–152, 1989.
[10] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.
[11] Y. G. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. In Proceedings of COLT99, 1999.
[12] L. Gurvits. A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In M. Li and A. Maruoka, editors, Algorithmic Learning Theory ALT-97, LNAI-1316, pages 352–363, Berlin, 1997. Springer.
[13] H. Jarchow. Locally Convex Spaces. B. G. Teubner, 1981.
[14] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.
[15] M. Laczkovich and D. Preiss. Φ-variation and transformation into Cⁿ functions. Indiana University Mathematics Journal, 34(2):405–424, 1985.
[16] I. J. Maddox. Elements of Functional Analysis. Cambridge University Press, Cambridge, 1970.
[17] A. M. Olevskiĭ. Homeomorphisms of the circle, modifications of functions, and Fourier series. American Mathematical Society Translations (2), 147:51–64, 1990.
[18] A. Pietsch. Operator Ideals. North-Holland, Amsterdam, 1980.
[19] A. Pietsch. Eigenvalues and s-Numbers. Cambridge University Press, Cambridge, 1987.
[20] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13:145–147, 1972.
[21] R. Schapire, Y. Freund, P. L. Bartlett, and W. Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998.
[22] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods — Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[23] C. Schütt. Entropy numbers of diagonal operators between symmetric Banach spaces. Journal of Approximation Theory, 40:121–128, 1984.
[24] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 349–358, Cambridge, MA, 2000. MIT Press.
[25] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[26] A. J. Smola, A. Elisseeff, B. Schölkopf, and R. C. Williamson. Entropy numbers for convex combinations and MLPs. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 369–387, Cambridge, MA, 2000. MIT Press.
[27] I. Steinwart. Some estimates for the entropy numbers of convex hulls with finitely many extreme points. Technical report, University of Jena, 1999.
[28] M. Talagrand. The Glivenko–Cantelli problem, ten years later. Journal of Theoretical Probability, 9(2):371–384, 1996.
[29] H. Triebel. Interpolation Theory, Function Spaces, Differential Operators. North-Holland, Amsterdam, 1978.
[30] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[31] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
[32] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com, 1998. Accepted for publication in IEEE Transactions on Information Theory.
[33] P. Wojtaszczyk. Banach Spaces for Analysts. Cambridge University Press, 1991.
[34] T. Zhang. Analysis of regularized linear functions for classification problems. IBM Research Report RC21572, 1999.
A Proofs
Proof (Lemma 10) If choose and ,
otherwise simply set and . Since
we can always find two positive numbers such that and ¼ ¼ . Note that by convexity
, , ¼ for ! ! and therefore ¼ . Thus
we may apply Hölder’s inequality with and to obtain
¼ ¼ (22)
which proves (11).
Proof (Lemma 11) Fix . We have
(23)
where the last equality follows from the definition of .
Since , we have . Thus
and thus . Since was arbitrary, we conclude that
.
Proof (Corollary 14) The fact that and $ follows from the construction of Hilbert spaces [33]. What
remains to be shown is the bound on . Specializing
(15) for
(and
Hilbert spaces, rewriting
conversely ) we obtain
-
(24)
The next step is to find a (simple) function such that
or equivalently * for all . We choose * . Next we have to bound . Set
. (25)
One can readily check that for any Æ , . attains
its maximum value at a critical point, and that it is monotonically non-decreasing. Thus taking the limit we take
. !
(26)
Resubstitution yields ! "" which completes the proof.1
Proof (Lemma 16) The first bound (for ) follows immediately from the definition of entropy numbers by
. The second line of (17) is a direct consequence of Theorem 15. All that remains is the third line:
for we factorise as Æ and
subsequently
(27)
1
A (marginally) tighter version of the theorem
could be stated by using
directly to bound .
By Lemma 9, and using Lemmas 11, 31 and 18, we obtain the required intermediate bounds. Theorem 15 supplies the remaining estimate. Substituting, we are done.
Proof (Theorem 17) We distinguish between the two last
bounds of Lemma 16. For we can bound
Rewriting the conditions on in terms of and collecting
all remaining terms into the constants and completes the
proof.
[Factorization diagrams; displays (29)–(37)]
Exploiting the factorization we can bound
which proves (20). Next we have to bound the individual
terms separately. For the first factor Lemma 31 can be applied. The dual version of the Maurey-Carl theorem applies to the second term. What remains to be shown is that
. This, however, follows immediately from Lemma
10 in analogy to the previous section.
Proof (Lemma 22) Let an arbitrary element of co_p(A) be given, say a combination of elements a_j of A with coefficients α_j, where the number of terms may be infinite. For each j, let b_j ∈ B be such that the distance from a_j to b_j is at most ε. (Such b_j exist by definition of the cover B.) Consider the corresponding combination of the b_j. Then
( ( # A slightly more involved and longer argument gives a better bound: there are constants such that
for .
[Display (32)]
Proof (Lemma 20) Similar to Theorem 17 consider the factorization
We have
½
By Lemma 8
Here we used the product inequality for entropy numbers in (33); (35) holds by the choice of the intermediate exponent; and moreover there exists a constant making the last step valid. Solving for the scale then immediately gives the bound on the covering numbers.
The idea is that the identity operator uses up the “slack” between the exponents implicit in the constraint. As will be seen, a smaller exponent in the bound is achieved for larger values of the intermediate exponent, but the intermediate exponent can be no larger than the constraint allows in order for us to be able to use Lemma 10. If equality is achieved in the constraint, then the identity maps the space to itself and of course nothing (additional) is gained.
We will show that and then use Lemma 8
to bound . The operator is identical to except its domain is . Thus . By Lemma 10, since , and , we
have
Proof (Theorem 19) As before we observe that and we will factorise the operator as in the diagram (with ):
(28)
and hence . Application of Lemma 11
proves the first inequality of (18)2 . For the second inequality simply note that the third line of Lemma 16 states (for
)
( # ( (38)
B Maurey’s Theorem
Proof (Lemma 23) Suppose 0 is an -cover of .
From Lemma 22, for any , there exists a # such that # . But for any # #
#
, there exists # 0 such that # # .
#
#
By the triangle inequality # # # # Æ . Thus 0 is an Æ -cover of .
In this appendix we provide a proof of Maurey's theorem for operators into a Hilbert space which gives an explicit value for the constant. This is considerably more work, and we get a correspondingly poorer estimate, than in the dual case of Theorem 13. The argument of this section is due to Professor Bernd Carl.
Theorem 27
Proof (Lemma 24) Let . Write . Then ,
.
Thus
. Since . Likewise, for any , ,
with . Thus .
We compute the entropy of an operator by
factorizing it as
Proof (Theorem 25) Let ' and let ' . Thus we have ' and by Lemma 23, if
Æ
(39)
then Æ ' . Thus
Æ
[Factorization diagram; displays (41), (42)]
By 8,
% (43)
Choosing and using Corollary 14 and Lemma 31 we
obtain
"" " (44)
Furthermore we have that .
Thus we obtain
[Factorization diagram; display (47)]
Ê
1 (48)
of over the -dimensional Gaussian measure
1 (49)
# Proposition 29 (Pajor, Talagrand)
(50)
We exploit the fact that the entropy numbers of the identity
operator from to are known. They take the form:
Recalling the definition of , and observing that for , we obtain
[Factorization diagram; display (46)]
where is a free parameter that we will optimize
over at the end. The two factors will be dealt with by using
the following two propositions, respectively.
(40)
By Lemma 24, . In order to compute we factorise as follows
[Factorization diagram]
Definition 28 denotes the average
Æ ' " holds
Combining (40) and (45) concludes the proof.
(45)
Proposition 30 (Schütt) For , there exists a
constant such that for all Æ ,
* (51)
Here and below log is to base 2. Schütt [23] did not provide an explicit value for the constant. We compute one below in Lemma 32.
In order to use Proposition 29 to bound , we need to upper bound . Using , ,
to denote the coordinates of in the orthonormal basis e , we have
Ê
1 Ê
1 e e Ê e e Ê
1 1 The Khinchin inequality [33] states that
Ê
( 1 where
Hence
$ *
$*
( e e to obtain * Now choose . Observe for .
Thus
where
% %
with * . Note . Let
so . Then and
% %" ! ! Thus
,
for ,
(53)
Conjecture on Best Value of Maurey Constant
A value of is not particularly satisfying. We believe
in fact it is quite loose. Our reasoning is as follows. Consider
all operators such that . By definition
. Observe that the unit ball is the ellipsoid of
.
Since
maximum volume contained inside ½
vol vol
½
where is the covering number of we have that
vol vol ½ But vol is maximized over all such that by choosing . Thus we have for and
Next, we combine the obtained bound with Schütt’s result,
using
Now for we have
By Proposition 29, we get
Noting that , we
obtain a statement valid for even as well as odd numbers. To
infer a bound on , Æ , we set , hence *.
Then
(52)
and all that remains is to bound , which we now do.
From (66) " . But our choice of and hence " ! . Substituting
this value of into (52) along with the numerical value of
we get
e e e e One can check that % and % has a
unique maximum for . Computing , setting
it equal to zero and solving numerically, one finds the maximum occurs for " " at which point
% "! . Thus
We conjecture that for all n ∈ ℕ and all T satisfying the stated normalization the displayed inequality holds. If this were true, by (55) the Maurey constant would be 1.86.
C Bounds on Entropy Numbers of the Identity Operator
We now determine bounds on the entropy numbers of the identity operator between ℓ_p^n and ℓ_q^n. These have been given in [29, 4.10.3], [18, page 172], [14, 3.c.8], [23], and [9, p.141] (see also [8, page 101]). All but the last two references only considered exponents of at least one. The most recent contribution, by Edmunds and Triebel [9, page 141], [8, page 101], subsumes all of the others and is summarized in the lemma below. By an argument of Schütt [23] it is asymptotically optimal in the dimension and the index.
Lemma 31 Let 0 < p < q ≤ ∞. Then the entropy numbers of the identity operator obey a three-regime bound (one regime each for small, intermediate and large index).
We can find a nice (small) explicit value for the constant for the case of principal interest, which is of interest in its own right. We proceed with that case now.
C.1 When q = ∞
By an argument of Schütt [23] the bound is asymptotically optimal in the dimension and the index. Since we need an explicit value of the constant we provide the explicit proof below.
" (55)
Proof Let denote a -ball in dimensions. Let
. For a given number of dimensions , we determine
the smallest so that can be covered by
2e
(56)
where e is the th canonical basis vector. For we
have . Setting
(the radius of ) gives * .
We will now set . Along each of the axes of we
have used ** cubes. We have added
and subtracted one here so we do not multiply count the box
that will live at the very center of . Therefore along all axes, we have
* * * - -balls cover , where (57)
. In other words
(60)
for ) (61)
* . Thus we need to show
Choose )
. 3 )
(62)
* . Numerical calculations indicate that
) )
. achieves a maximum value of "
at . Similarly, we can numerically determine that . achieves a unique maximum value of
!" at . Thus for ,
*
) )
,
will do. We then have from (61) that for
)
!" !""
The following interpolation lemma follows immediately from
[18, p.173].
Lemma 33 Let . Then for all Æ
(63)
(64)
Combining this Lemma with (54) gives for ,
Thus * . Observe that by construction
of , any point in which is not covered by must lie within a -dimension -ball on one of the
principal axes of not contained in . Thus
by separately covering all possible choices of axes
in the
manner above, we cover . Since there are ways to
choose the axes, we have that
and so
* for some constants depending on . We will then set
(54)
(59)
Let 3 and so . Suppose ) is a function such ) 3 for . Then Lemma 32 For all Æ , and all ,
Thus
for Æ where is a positive constant independent of and depends on and .
3 (58)
" (65)
(66)
C.2 When q < ∞
Lemma 31 and the proof of Lemma 32 suggest that a similar form of the result in Lemma 32 should be obtainable when q < ∞. However it turns out that the value of the constant obtained is then quite unsatisfactory, because its value is dominated by the behaviour of the bound for very small scales and indices. For learning applications these are uninteresting values. Furthermore, for some applications we are actually interested in how the bound behaves as a function of p for a fixed scale; for example, if we wanted to use p as a "capacity control knob", it is necessary to determine the dependence on p explicitly. In doing so it turns out that one has to pay too high a price for the elegance of an expression of the form (54), and we end up being better served by an implicit formula, which nevertheless can be easily computed. This implicit formula does exhibit the expected behaviour in p for a fixed scale.
Setting 2 we have from (60) that
2
(67)
For a given we can easily determine numerically: Let
2 3 2 . Such a 2 is unique. Then
Using this method one can plot the bound as a function of the scale for various p. As one would expect, the log covering numbers decrease with decreasing p.
2
D p-Convex Hull of Heavisides
We take the following definitions from [15]. Suppose 4 Ê
and 4 Ê . For ( ,
0! ' 4 5 !
where 5 is an arbitrary finite system of non-overlapping intervals with 5 4 for .
Suppose is continuous on 5. Let 6 be the union of
all open subintervals on which is either strictly monotonic
or constant. Then
7
7" 5 6
is the set of points of varying monotonicity. If is not continuous everywhere, we let 8 denote the set of points of dis& 7 8.
continuity and set 7
If ( , is said to be of bounded (-variation and we
& . If 0! 7
& we
write '()! if 0! 7
say '()! .
Let % 9 9 5 where % is the
Heaviside (step function).
Lemma 34 For any / Æ , for
'()! .
(
, ! Proof For any ! , we can write as
& % 9 !
& % 9 9 & % 9 9 !
As a way of illustrating the “size” of the class of functions of bounded p-variation, in a fashion that gives some additional intuition for our entropy number results determined directly, we will now compute its fat-shattering dimension.
Proposition 35 Suppose
(
and
1
. Then
*+ 1 ! 1
Proof Suppose '()! 1 -shatter the
points 5 with respect to ! ! .
We will show that 1 ! . For , we have
7" . There must always be a sign assignment 5 which is realized w.r.t. ! ! by
such that
1
for 2 . Thus
&" 0! # 7
! 1 ! & " and so 1 ! But by hypothesis 0! 7
. Thus
! 1 ! 1 !
for and 1 .
The smallness of '()! is well illustrated by the following
theorem from [15]. For - define
,
- 5 Ê 7 : 5
: 7 : $ For Æ and - ,
,
- is -times differentiable and
,
- Theorem 36 (Laczkovich and Preiss) Let - ( , - *(. Then '()! if and only if there exists a homeomorphism ; of 5 into itself such that Æ ; ,
-.
(For related results see [17] and references therein.)
where we will assume that 9 9 9 . Observe that
& " 9 / . Thus 0! 7
&" 7
Since ! '()! for all / Æ , we have for
(