
Journal of Computational Geometry
jocg.org
ON PROJECTIONS OF METRIC SPACES
Mark Kozdoba∗
Abstract. Let X be a metric space and let µ be a probability measure on it. Consider a Lipschitz map $T : X \to \mathbb{R}^n$ with Lipschitz constant at most 1. One can then ask whether the image $TX$ can have large projections onto many directions. For a large class of spaces X, we show that there are directions $\phi \in S^{n-1}$ onto which the projection of the image $TX$ is small on average (in $L^2(\mu)$), with bounds depending on the dimension n and the eigenvalues of the Laplacian on X. Our results can be viewed as a multidimensional extension of the concentration of measure principle.
1 Introduction
Let (X, m) be a metric space and let µ be a probability measure on X. Consider a Lipschitz map $T : X \to \mathbb{R}^n$, with $\|T\|_{\mathrm{Lip}} \le 1$, where $\mathbb{R}^n$ is taken with the standard inner product $\langle\cdot,\cdot\rangle$ and the corresponding Euclidean norm $|\cdot|$.
Let us define another inner product on $\mathbb{R}^n$ by setting
$$[\phi, \theta] = \int \langle \phi, Tx\rangle\, \langle \theta, Tx\rangle \, d\mu(x) \quad (1)$$
for $\theta, \phi \in \mathbb{R}^n$ and denote the associated norm by
$$\|\theta\|_{L^2} := \left( \int \langle Tx, \theta\rangle^2 \, d\mu(x) \right)^{1/2}. \quad (2)$$
This norm is known as the covariance structure of the push-forward measure $T\mu$, and we shall regard $\|\theta\|_{L^2}$ as measuring the size of the projection of the image of X onto the direction θ. A similar notion of size is used, for instance, in Principal Component Analysis (PCA), the archetypal dimensionality reduction technique in machine learning (see [3] for an exposition). In PCA, the data is projected onto a subspace $V \subset \mathbb{R}^n$ of a fixed dimension k, such that the variation of the data on V is maximal among all subspaces of that dimension. Ideally, one then expects that the projection onto V captures the real source of the data, while the projection onto $V^\perp$ comes from noise, with every $\theta \in S^{n-1} \cap V^\perp$ having small $\|\theta\|_{L^2}$ norm.
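For a finite sample with uniform µ, the norm (2) is the quadratic form of the sample covariance matrix, and the direction with the smallest projection is its bottom eigenvector. The following sketch (our own illustration, on arbitrary synthetic data) makes the connection to PCA concrete.

```python
import numpy as np

# Minimal sketch: for a finite sample {x_i} with uniform weights mu,
# ||theta||_{L2}^2 = <Cov theta, theta>, where Cov = (1/N) sum x_i x_i^T.
rng = np.random.default_rng(0)
N, n = 1000, 5
X = rng.normal(size=(N, n)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])  # anisotropic cloud
X -= X.mean(axis=0)                        # center, as in assumption (6) below

cov = X.T @ X / N                          # sample covariance matrix, cf. (5)

def proj_size(theta):
    """||theta||_{L2} = sqrt( E <x, theta>^2 ) = sqrt(<Cov theta, theta>)."""
    return np.sqrt(theta @ cov @ theta)

# The direction with the smallest projection is the bottom eigenvector of Cov.
w, V = np.linalg.eigh(cov)                 # eigenvalues in ascending order
theta_min = V[:, 0]
assert abs(proj_size(theta_min) - np.sqrt(w[0])) < 1e-10
```

In PCA terms, $\theta_{\min}$ spans the noise subspace $V^\perp$ for $k = n-1$.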
In this article we show that when a metric space is mapped into Rn , the geometry of
X can provide bounds on the size of the projections of the image. To see why this might be
the case, it is instructive first to take a look at an example. Take X to be the unit interval
[0, 1], with Lebesgue measure, and let T : [0, 1] → Rn be a Lipschitz map, i.e. we consider
∗ [email protected]
JoCG 5(1), 275–283, 2014
275
curves of length at most one in Rn (here and in what follows, Lipschitz will mean “having
Lipschitz constant ≤ 1”). By considering a few pictures, it is clear that if n is large then
there should be a direction with a small projection and we are interested in quantifying
this phenomenon. It turns out that this particular situation was investigated already by
Gohberg and Krein, in a different formulation and context (related to spectral properties
of smooth kernels), see the book [7], Chapter III.10.3.
Theorem 1 ([7]). There is an absolute constant $c > 0$ such that for every $n > 1$ and every $T : [0,1] \to \mathbb{R}^n$ with $\|T\|_{\mathrm{Lip}} \le 1$, there is $\theta \in S^{n-1}$ such that $\|\theta\|_{L^2} \le c \cdot n^{-3/2}$.
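The $n^{-3/2}$ rate can be observed numerically. The sketch below (our own test case, not from [7]) discretizes a particular 1-Lipschitz winding curve, with amplitudes $c_k = 1/(2\pi k\sqrt{m})$ for $n = 2m$ chosen so that $|T'(t)| = 1$, and computes the smallest projection as the square root of the smallest covariance eigenvalue.

```python
import numpy as np

def winding_curve(n, N=20000):
    """Discretize a 1-Lipschitz curve T : [0,1] -> R^n (n = 2m even): pair k
    winds at frequency k with amplitude c_k = 1/(2*pi*k*sqrt(m)), so |T'| = 1."""
    m = n // 2
    t = (np.arange(N) + 0.5) / N
    cols = []
    for k in range(1, m + 1):
        ck = 1.0 / (2 * np.pi * k * np.sqrt(m))
        cols.append(ck * np.cos(2 * np.pi * k * t))
        cols.append(ck * np.sin(2 * np.pi * k * t))
    return np.column_stack(cols)

def smallest_projection(points):
    """min over theta in S^{n-1} of ||theta||_{L2}: sqrt of the smallest
    eigenvalue of the covariance matrix of the discretized image."""
    P = points - points.mean(axis=0)
    cov = P.T @ P / len(P)
    return np.sqrt(np.linalg.eigvalsh(cov)[0])

# For this particular curve the smallest projection works out to n^{-3/2}/pi,
# matching the n^{-3/2} rate of Theorem 1 (the constant is curve-specific).
for n in (4, 8, 16):
    s = smallest_projection(winding_curve(n))
    assert abs(s - n ** (-1.5) / np.pi) < 0.05 * n ** (-1.5)
```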
Our objective is to extend this result to a more general class of metric spaces with measure, on which an appropriate notion of a derivative can be defined. Namely, we shall discuss finite spaces X which are graphs with the shortest path metric, equipped with the stationary measure of the random walk, as described below. While this setting makes the presentation simpler, the statement and the proof of our result, Theorem 2, can be repeated with straightforward modifications in the setting of Riemannian manifolds. In addition, the approach we take does not explicitly use the stationarity of the measure. This assumption is made mainly because in the stationary case the associated Laplacians are well-investigated operators, for which explicit bounds are available. Details are given in Section 2.
Let X be a finite set and let E ⊂ X × X be a set of edges such that (X, E) is a
connected undirected graph without loops. For x, y ∈ X, write x ∼ y iff (x, y) ∈ E and
let d(x) denote the degree of x. We endow X with the shortest path metric and we set
µ to be the stationary distribution of a simple nearest neighbour random walk on (X, E),
$\mu(x) = \frac{d(x)}{2|E|}$. Denote by L(X) the space of real-valued functions on X. The graph Laplacian on L(X) is defined by
$$\Delta(f)(x) = 2\left( f(x) - \frac{1}{\deg(x)} \sum_{y \sim x} f(y) \right). \quad (3)$$
For our purposes, it suffices to note that this is a self-adjoint non-negative operator, with a one-dimensional kernel consisting of the constant functions. Some of these properties will be derived in Section 2, while full details on the graph Laplacian can be found, for instance, in [4]. We denote by $\{\lambda_i\}_{i=1}^{|X|-1}$ the sequence of non-zero eigenvalues of $\Delta$ in non-decreasing order, including multiplicities.
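For concreteness, the operator (3) and its spectrum are easy to compute numerically. The sketch below (an illustration only, not part of the argument) builds $\Delta$ for a small path graph, and uses the fact that $\Delta$ is similar to the symmetric matrix $2(I - D^{-1/2} A D^{-1/2})$ to extract its real, non-negative eigenvalues.

```python
import numpy as np

def laplacian(adj):
    """The graph Laplacian (3): Delta f(x) = 2( f(x) - mean of f over neighbours )."""
    deg = adj.sum(axis=1)
    return 2 * (np.eye(len(adj)) - adj / deg[:, None])

def nonzero_eigs(adj):
    """Eigenvalues of Delta via the similar symmetric matrix 2(I - D^{-1/2} A D^{-1/2}),
    dropping the zero eigenvalue whose eigenspace is the constants."""
    deg = adj.sum(axis=1)
    Dh = np.diag(deg ** -0.5)
    sym = 2 * (np.eye(len(adj)) - Dh @ adj @ Dh)
    return np.sort(np.linalg.eigvalsh(sym))[1:]

# Path graph on 4 vertices: 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
assert np.allclose(laplacian(A) @ np.ones(4), 0)   # constants are in the kernel
assert (nonzero_eigs(A) > 1e-12).all()             # remaining spectrum is positive
```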
Theorem 2. There is an absolute constant $c > 0$ such that for every graph X as above, every $n > 1$ and every Lipschitz map $T : X \to \mathbb{R}^n$, there is a direction $\theta \in S^{n-1}$ such that
$$\|\theta\|_{L^2} \le c \cdot n^{-1/2}\, \lambda_{\lfloor n/2 \rfloor - 1}^{-1/2}.$$
Example 3 (Discrete Space). Let (X, m) be a metric space such that $m(x, y) = 1$ for all $x \neq y$, and let µ be the uniform probability on X. Then for every Lipschitz $T : X \to \mathbb{R}^n$ there is $\theta \in S^{n-1}$ such that
$$\|\theta\|_{L^2} \le \frac{c}{\sqrt{n}}. \quad (4)$$
Note that, interestingly enough, the size of the space, |X|, does not appear in this bound (except that, of course, for $n > |X|$ there is always $\theta \in S^{n-1}$ with $\|\theta\|_{L^2} = 0$). That is, one cannot increase the minimal projection size of X in, say, $\mathbb{R}^{20}$ by adding more points. To prove (4), note that (X, m) corresponds to a clique graph, whose Laplacian has a single non-zero eigenvalue, which is of the order of a constant and has multiplicity $|X| - 1$.
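This spectral picture is easy to verify numerically. For the clique on m points the single non-zero eigenvalue works out to $2m/(m-1)$ (our computation from definition (3); the text above only asserts it is of constant order), and the following sketch checks this for several sizes.

```python
import numpy as np

# Example 3 check: the clique on |X| = m points has a single non-zero
# Laplacian eigenvalue 2m/(m-1) ~ 2, with multiplicity m-1, so the bound (4)
# does not deteriorate as m grows.
for m in (5, 50, 500):
    A = np.ones((m, m)) - np.eye(m)            # clique adjacency; deg = m-1
    Delta = 2 * (np.eye(m) - A / (m - 1))      # definition (3); symmetric (regular graph)
    w = np.sort(np.linalg.eigvalsh(Delta))
    assert np.allclose(w[0], 0, atol=1e-8)     # kernel: constants
    assert np.allclose(w[1:], 2 * m / (m - 1)) # one non-zero value, multiplicity m-1
```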
Example 4 (Combinatorial Cube). Here X is the set $C^d = \{0, 1\}^d$, with the Hamming metric
$$m(x, y) = |\{\, i \in \{1, \dots, d\} \mid x_i \neq y_i \,\}|,$$
where $x = x_1 \cdots x_d,\ y = y_1 \cdots y_d \in C^d$, and µ is again taken to be the uniform probability measure. The non-zero eigenvalues of the Laplacian on X are $\frac{4k}{d}$ with multiplicity $\binom{d}{k}$, $k = 1, \dots, d$ (see [5]). The n-th smallest eigenvalue is therefore of the order $\frac{\log(n)}{d}$, and we
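The stated spectrum of the cube can be checked directly for a small d; the sketch below builds the Laplacian (3) for $C^6$ and compares its eigenvalues with the values $4k/d$ and multiplicities $\binom{d}{k}$.

```python
import numpy as np
from itertools import product
from math import comb

# Example 4 check: for the cube {0,1}^d, the non-zero eigenvalues of (3)
# are 4k/d with multiplicity binom(d, k), k = 1..d.
d = 6
verts = np.array(list(product([0, 1], repeat=d)))
H = (verts[:, None, :] != verts[None, :, :]).sum(axis=2)   # Hamming distances
A = (H == 1).astype(float)                                 # adjacency; d-regular
Delta = 2 * (np.eye(2 ** d) - A / d)                       # regular graph: symmetric
w = np.sort(np.linalg.eigvalsh(Delta))

expected = [0.0] + [4 * k / d for k in range(1, d + 1) for _ in range(comb(d, k))]
assert np.allclose(w, np.sort(expected), atol=1e-8)
```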
We shall in fact prove a slightly stronger statement than Theorem 2. To state it, we require an additional bit of notation. Define the covariance matrix of the measure $T\mu$ in $\mathbb{R}^n$ to be the operator $\mathrm{Cov}_{T\mu} : \mathbb{R}^n \to \mathbb{R}^n$,
$$\mathrm{Cov}_{T\mu} = \int Tx \otimes Tx \, d\mu(x), \quad (5)$$
where for $y \in \mathbb{R}^n$, $y \otimes y$ denotes the operator on $\mathbb{R}^n$ which acts by $(y \otimes y)(x) = \langle x, y\rangle y$. Note that for every $\theta, \phi \in \mathbb{R}^n$,
$$[\theta, \phi] = \langle \mathrm{Cov}_{T\mu}\,\theta, \phi\rangle.$$
Hence $\mathrm{Cov}_{T\mu}$ is the non-negative matrix which connects the bilinear form $[\cdot,\cdot]$ with the natural inner product $\langle\cdot,\cdot\rangle$. Let $\{\alpha_i\}_{i=1}^n$ be the eigenvalues of $\mathrm{Cov}_{T\mu}$ in non-increasing order. Then, with the rest of the notation as in Theorem 2, we have:
Theorem 5. There is an absolute constant $c > 0$ such that for every graph X as above, every $n > 1$ and every Lipschitz map $T : X \to \mathbb{R}^n$ which satisfies in addition
$$\int Tx \, d\mu(x) = 0, \quad (6)$$
we have, for every $i = 1, \dots, n$,
$$\alpha_i \le c \cdot i^{-1} \lambda_{\lceil i/2 \rceil}^{-1}. \quad (7)$$
Note that for T satisfying the mean-zero assumption (6), Theorem 5 indeed implies
Theorem 2. This can be seen by choosing θ of Theorem 2 to be the eigenvector of CovT µ
corresponding to the eigenvalue αn . The removal of assumption (6) is simple and will be
discussed in Section 2.
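The decay of the covariance eigenvalues $\alpha_i$ in Theorem 5 can be observed on an explicit example. The sketch below (our own test case: a Lipschitz "winding" map of the N-cycle into $\mathbb{R}^n$, with amplitudes of our choosing) computes the $\alpha_i$ and the cycle Laplacian spectrum, and checks that $\alpha_i \cdot i \cdot \lambda_{\lceil i/2\rceil}$ stays bounded by a modest constant.

```python
import numpy as np

# A Lipschitz map of the N-cycle into R^n: pair k has amplitude N*c_k and
# frequency k, with c_k = 1/(2*pi*k*sqrt(m)) so adjacent vertices move by <= 1.
N, n = 2048, 16
m = n // 2
j = np.arange(N)
cols = []
for k in range(1, m + 1):
    ck = 1.0 / (2 * np.pi * k * np.sqrt(m))
    cols.append(N * ck * np.cos(2 * np.pi * k * j / N))
    cols.append(N * ck * np.sin(2 * np.pi * k * j / N))
T = np.column_stack(cols)

# Lipschitz check: consecutive vertices (graph distance 1) move by at most 1.
steps = np.linalg.norm(np.diff(T, axis=0), axis=1)
assert steps.max() <= 1.0 + 1e-6

alpha = np.sort(np.linalg.eigvalsh(T.T @ T / N))[::-1]    # covariance eigenvalues
lam = np.sort(2 * (1 - np.cos(2 * np.pi * np.arange(1, N) / N)))  # cycle spectrum

# alpha_i * i * lambda_{ceil(i/2)} remains bounded by a modest constant.
ratios = [alpha[i - 1] * i * lam[(i + 1) // 2 - 1] for i in range(1, n + 1)]
assert max(ratios) < 10
```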
Our approach to Theorem 2 extends the argument in [7] and, naturally, Theorem
1 can also be seen as a consequence of the principle of Theorem 2. Indeed, consider the
Laplacian on $[0,1]$, $\Delta(f) = -\frac{\partial^2 f}{\partial x^2}$. The minus sign in front of $\frac{\partial^2 f}{\partial x^2}$ is taken for consistency with definition (3), making $\Delta(f)$ a non-negative operator. This Laplacian has eigenspaces spanned by $\{\sin 2\pi kx, \cos 2\pi kx \mid k \in \mathbb{N}\}$, with eigenvalues $\lambda_n \approx n^2$, up to multiplicative constants. Hence the decay rate of $n^{-1/2} \cdot (n^2)^{-1/2} = n^{-3/2}$ in Theorem 1.
Finally, we note that Theorem 5 can be viewed as a multidimensional extension of the (weak) concentration of measure phenomenon. The concentration of measure phenomenon states, roughly, that the distribution of a Lipschitz function on a metric measure space tends to be more concentrated around its mean than what would be suggested by the space's diameter. Many results, utilising different geometric parameters of the space, were developed to quantify this phenomenon. We refer to [10] and [9] for further details on concentration of measure. For the purposes of the present article, let us state a classic result of Alon and Milman, [1], which derives concentration of measure on a graph in terms of the first eigenvalue of the Laplacian.
Theorem 6 (Alon and Milman, 1985). There exist constants c, C > 0 such that for every graph X, every Lipschitz $f : X \to \mathbb{R}$, and every $t > 0$:
$$\mu\left( |f - \mathbb{E}f| > \sqrt{\lambda_1^{-1}} \cdot t \right) \le C e^{-c \cdot t}. \quad (8)$$
The sub-exponential inequality (8) implies in particular the following, weaker, $L^2$ form: there is a constant $C > 0$ such that for every X and every Lipschitz $f : X \to \mathbb{R}$,
$$(\mathrm{Var}_\mu(f))^{1/2} \le C \cdot \sqrt{\lambda_1^{-1}}. \quad (9)$$
In Riemannian manifolds, this inequality is a special case of the Poincaré inequality (see [6]), and it has a short independent proof. We give a short argument for the graph case in Section 2.
In M. Gromov's terminology, [8], both inequalities (8) and (9) state that the observable diameter of X, viewed through Lipschitz functions, is $\sqrt{\lambda_1^{-1}}$. In other words, a Lipschitz image of a space X in $\mathbb{R}$ will have most of its points within an interval of diameter of order $\sqrt{\lambda_1^{-1}}$, which is typically much smaller than the diameter of X. In this light, one can ask what can be said about observations of X with values in higher-dimensional Euclidean spaces. Theorem 5 provides one answer to this question, in terms of the Euclidean structure of the image. Thus Theorem 5 can be viewed as an extension of inequality (9) to observations with higher-dimensional values.
Let us illustrate this on the example of the cube $C^d$. The diameter of $C^d$ is $d$, and we have $\sqrt{\lambda_1^{-1}} = \sqrt{d}$. Let $T : X \to \mathbb{R}^n$ be a Lipschitz mapping and assume for simplicity that $\int Tx \, d\mu(x) = 0$. For every $\theta \in S^{n-1}$, define a function $f_\theta : X \to \mathbb{R}$ by
$$f_\theta(x) = \langle Tx, \theta\rangle.$$
Since for every $\theta \in S^{n-1}$, $f_\theta$ is a Lipschitz function, we can conclude from inequality (9) that
$$\|\theta\|_{L^2} = \mathrm{Var}_\mu^{1/2} f_\theta \le C\sqrt{d}$$
and this is the best possible bound which is uniform over all θ. However, from Theorem 5 we know, for instance, that this bound cannot be sharp for all θ. In particular, from Theorem 2, there exists $\theta \in S^{n-1}$ with $\mathrm{Var}_\mu^{1/2} f_\theta \le c \cdot \frac{\sqrt{d}}{\sqrt{n}\sqrt{\log(n)}}$.
Our results in this article deal with the $L^2$ theory. That is, we give bounds on the second moments of functions of the form $f_\theta$ under the distribution induced by T, and these bounds are better than what can be obtained by an application of the usual, one-dimensional concentration inequalities. Whether similar results can be obtained for higher moments, or for full sub-exponential inequalities of the form (8), remains an interesting open question.
2 Proof
Let us begin by introducing some additional notions and notation used in the proof of Theorem 2. Fix a graph (X, E) with the stationary measure µ on it. The notion of a gradient on a graph is folklore, although not particularly frequent in the literature, and the notation varies considerably. We thus recall the definitions and some important properties.
Define an inner product on L(X) by
$$[f, g]_X = \int f(x)\, g(x)\, d\mu(x).$$
Denote by L(E) the set of real-valued functions on the set of edges E, and let ν be the uniform probability measure on E. We equip L(E) with the inner product
$$[f, g]_E = \int f(e)\, g(e)\, d\nu(e).$$
Fix an arbitrary orientation on E, i.e. for each edge $e = \{x, y\}$ choose in an arbitrary way an enumeration $v_e^1, v_e^2$ of the vertices of the edge. We can express this enumeration as a $\{+1, -1\}$-valued function o,
$$o(x, y) = \begin{cases} +1 & \text{if } x = v_e^1 \text{ and } y = v_e^2, \\ -1 & \text{if } x = v_e^2 \text{ and } y = v_e^1, \end{cases}$$
defined for pairs of vertices x, y which are connected by an edge e. The gradient operator on X is defined by $\nabla : L(X) \to L(E)$,
$$(\nabla f)(e) = f(v_e^1) - f(v_e^2).$$
Equivalently, for $e = \{x, y\}$,
$$(\nabla f)(e) = o(x, y)\,(f(x) - f(y)).$$
The Laplacian on the graph (X, E) is defined as $\Delta = \nabla^* \circ \nabla$, where $\nabla^*$ is the Hilbert space adjoint of $\nabla$. Let us first compute the explicit expression for the Laplacian for arbitrary
measures µ and ν on X and E. By the definition of adjoints, the Laplacian needs to satisfy
$$[\nabla^* \circ \nabla f, g]_X = [\nabla f, \nabla g]_E \quad (10)$$
for every $f, g \in L(X)$. By choosing g to be the delta function at a vertex x, we get
$$(\Delta f)(x) \cdot \mu(x) = \sum_e o(x, y)\,(f(x) - f(y)) \cdot o(x, y)\,\nu(e) = \sum_e (f(x) - f(y))\,\nu(e),$$
where the sum is over all edges $e = \{x, y\}$ containing the vertex x. Note, in particular, that the Laplacian does not depend on the orientation, but does depend on the choice of the measures µ and ν. We shall be interested in the case where $\mu(x) = \frac{\deg(x)}{2|E|}$ is the stationary measure of the graph, and $\nu(e) = \frac{1}{|E|}$ is the uniform measure. These measures produce the classical definition (3), which has the advantage that this particular Laplacian is well studied and its eigenvalues are known in many cases (see, for instance, [4] and [5]). Our proof of Theorem 2, however, uses the Laplacian only through equation (10), and does not explicitly assume stationarity.
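The adjoint identity (10), with these choices of µ and ν, can be verified numerically; the sketch below (an illustration on an arbitrary small graph of our choosing) checks $[\nabla f, \nabla g]_E = [\Delta f, g]_X$ for random test functions.

```python
import numpy as np

# Check of (10): with mu(x) = deg(x)/2|E| and nu uniform on E, the operator (3)
# satisfies [grad f, grad g]_E = [Delta f, g]_X.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]           # an arbitrary small graph
V = 4
deg = np.zeros(V)
for x, y in edges:
    deg[x] += 1
    deg[y] += 1
mu = deg / (2 * len(edges))                                # stationary measure
nu = np.full(len(edges), 1.0 / len(edges))                 # uniform measure on E

A = np.zeros((V, V))
for x, y in edges:
    A[x, y] = A[y, x] = 1
Delta = 2 * (np.eye(V) - A / deg[:, None])                 # definition (3)

def grad(f):
    # fixed orientation: each edge (x, y) as written in the list
    return np.array([f[x] - f[y] for x, y in edges])

rng = np.random.default_rng(2)
for _ in range(5):
    f, g = rng.normal(size=V), rng.normal(size=V)
    lhs = np.sum(grad(f) * grad(g) * nu)                   # [grad f, grad g]_E
    rhs = np.sum((Delta @ f) * g * mu)                     # [Delta f, g]_X
    assert abs(lhs - rhs) < 1e-10
```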
Denote by $L_0(X)$ the subspace of L(X) that is orthogonal to the constant functions. Since $\mathrm{Ker}\,\Delta$ is the space of constant functions, the Laplacian is invertible on $L_0(X)$, and hence the gradient is also an invertible operator from $L_0(X)$ onto its image. The inverse of the gradient will be of importance in what follows, and we denote it by $\Gamma : \mathrm{Im}(\nabla) \to L_0(X)$.
With the preliminaries on the Laplacian in place, we give a short argument for the proof of inequality (9). This inequality, however, will not be used in the rest of this section.
First note that for every $f \in L(X)$,
$$\mathrm{Var}_\mu f = [(f - \mathbb{E}f), (f - \mathbb{E}f)]_X \le \lambda_1^{-1}\, [\Delta f, f]_X. \quad (11)$$
Indeed, write f as an orthogonal decomposition $f = \sum_{i \ge 0} f_i$, where $f_i$ is contained in the eigenspace of the eigenvalue $\lambda_i$ of $\Delta$ (with $f_0 = \mathbb{E}f$ the mean, corresponding to the eigenvalue 0). Then the left-hand side of (11) is $\sum_{i > 0} \langle f_i, f_i\rangle$ and the right-hand side is $\lambda_1^{-1} \sum_{i > 0} \lambda_i \langle f_i, f_i\rangle$. Since $\lambda_1$ is the smallest non-zero eigenvalue, (11) follows. Now, since the norm of the derivative satisfies, by definition,
$$[\nabla f, \nabla f]_E = [\Delta f, f]_X \quad (12)$$
and since $[\nabla f, \nabla f]_E \le 1$ for Lipschitz functions, inequality (9) follows.
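The Poincaré-type bound (11) can also be checked numerically; the sketch below (an illustration on a random connected graph of our own construction) compares $\mathrm{Var}_\mu f$ with $\lambda_1^{-1}[\Delta f, f]_X$ for random functions f.

```python
import numpy as np

# Check of (11): Var_mu(f) <= lambda_1^{-1} [Delta f, f]_X on a random connected graph.
rng = np.random.default_rng(3)
V = 8
A = np.zeros((V, V))
for x in range(1, V):                       # random tree: guarantees connectivity
    y = rng.integers(0, x)
    A[x, y] = A[y, x] = 1
A[0, V - 1] = A[V - 1, 0] = 1               # one extra edge
deg = A.sum(axis=1)
E = A.sum() / 2
mu = deg / (2 * E)                          # stationary measure
Delta = 2 * (np.eye(V) - A / deg[:, None])  # definition (3)

# lambda_1: smallest non-zero eigenvalue of Delta, via the similar symmetric matrix.
Dh = np.diag(deg ** -0.5)
lam1 = np.sort(np.linalg.eigvalsh(2 * (np.eye(V) - Dh @ A @ Dh)))[1]

for _ in range(20):
    f = rng.normal(size=V)
    var = np.sum(mu * f ** 2) - np.sum(mu * f) ** 2
    dirichlet = np.sum((Delta @ f) * f * mu)    # [Delta f, f]_X = [grad f, grad f]_E
    assert var <= dirichlet / lam1 + 1e-10
```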
We now briefly recall the notion of singular values of an operator and some related singular value inequalities. We refer to [2] or [7] for full details and proofs. If V and W are (say, finite-dimensional) Hilbert spaces, and $A : V \to W$ is a linear operator, then $A^*A$ is a non-negative self-adjoint operator. Let $\{\lambda_i(A^*A)\}_{i=1}^{\dim V}$ be the eigenvalues of $A^*A$, in non-increasing order, with multiplicities. Then the singular values of the operator A are the non-increasing sequence $s_i(A) = \sqrt{\lambda_i(A^*A)}$.
The singular values of A have a simple geometric interpretation. If $B_2(V)$ denotes the unit ball of V, then the image $A(B_2(V))$ is an ellipsoid in W and the $s_i(A)$ are precisely the lengths of the principal axes of this ellipsoid.
The singular values satisfy $s_i(A) = s_i(A^*)$, and the singular values of a composition can be bounded by the following inequality due to Ky Fan. Let $A : V \to W$ and $B : W \to U$ be two operators on Hilbert spaces. Then, for every $i, j \ge 1$,
$$s_{i+j-1}(BA) \le s_i(A)\, s_j(B). \quad (13)$$
Finally, the Hilbert–Schmidt norm of an operator is defined by
$$\|A\|_{HS} = \sqrt{\mathrm{tr}\, A^*A} = \left( \sum_i s_i^2(A) \right)^{1/2}.$$
Note that since $s_i$ is a non-increasing sequence,
$$s_i(A) \le \frac{\|A\|_{HS}}{\sqrt{i}} \quad (14)$$
for all i.
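Both ingredients, Ky Fan's inequality (13) and the Hilbert–Schmidt bound (14), are easy to test numerically on random matrices:

```python
import numpy as np

def sv(M):
    """Singular values of M, in non-increasing order."""
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(4)
for _ in range(10):
    A = rng.normal(size=(7, 5))                 # A : V -> W
    B = rng.normal(size=(6, 7))                 # B : W -> U
    sA, sB, sBA = sv(A), sv(B), sv(B @ A)
    # (13): s_{i+j-1}(BA) <= s_i(A) s_j(B), with 1-based indices
    for i in range(1, len(sA) + 1):
        for j in range(1, len(sB) + 1):
            if i + j - 1 <= len(sBA):
                assert sBA[i + j - 2] <= sA[i - 1] * sB[j - 1] + 1e-9
    # (14): s_i(A) <= ||A||_HS / sqrt(i)
    hs = np.sqrt((sA ** 2).sum())
    for i in range(1, len(sA) + 1):
        assert sA[i - 1] <= hs / np.sqrt(i) + 1e-9
```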
Before proceeding with the proof of Theorem 2, it will be convenient to introduce an additional assumption on the Lipschitz map $T : X \to \mathbb{R}^n$, namely that
$$\int T(x)\, d\mu = 0. \quad (15)$$
Under the conditions of Theorem 2 and this assumption, we will show that there is $\theta \in S^{n-1}$ such that
$$\|\theta\|_{L^2} \le c \cdot n^{-1/2}\, \lambda_{\lfloor n/2 \rfloor}^{-1/2}. \quad (16)$$
This implies the result for arbitrary Lipschitz $T : X \to \mathbb{R}^n$. Indeed, assume that $v := \int T(x)\, d\mu \neq 0$. Denote by P the orthogonal projection onto the $(n-1)$-dimensional space orthogonal to v. Then the composition $P \circ T$ is a Lipschitz map into that space and $\int (P \circ T)(x)\, d\mu = 0$, so (16) can be applied in dimension $n - 1$.
Proof of Theorem 2. As mentioned above, we assume that condition (15) holds. Let DT denote the gradient of the map T, i.e.
$$DT(e) = T(v_e^1) - T(v_e^2).$$
Since T is Lipschitz, DT is bounded: $|DT(e)| \le 1$ for all $e \in E$.
For every $\theta \in \mathbb{R}^n$ consider the function on X, $f_\theta(x) = \langle\theta, Tx\rangle$. By (15), $f_\theta \in L_0(X)$, and clearly $(\nabla f_\theta)(e) = \langle\theta, DT(e)\rangle$ and
$$f_\theta = \Gamma(\langle\theta, DT\rangle).$$
Let
$$B_{L^2}(X) = \left\{\, f : X \to \mathbb{R} \;\middle|\; [f, f]_X \le 1 \,\right\}$$
denote the unit ball in L(X), and write $\|\theta\|_{L^2}$ in a dual form:
$$\|\theta\|_{L^2} = \sup_{g \in B_{L^2}(X)} \int g(x)\,\langle\theta, Tx\rangle\, d\mu.$$
Then
$$\int g(x)\,\langle\theta, Tx\rangle\, d\mu = [g, f_\theta]_X = [g, \Gamma \circ \nabla f_\theta]_X = [\Gamma^* g, \nabla f_\theta]_E.$$
Next, write
$$[\Gamma^* g, \nabla f_\theta]_E = \int (\Gamma^* g)(e) \cdot \langle\theta, DT(e)\rangle\, d\nu(e) = \left\langle \theta,\ \int (\Gamma^* g)(e) \cdot DT(e)\, d\nu(e) \right\rangle.$$
Denote by $\widetilde{D} : L(E) \to \mathbb{R}^n$ the operator that acts by
$$\widetilde{D} u = \int u(e) \cdot DT(e)\, d\nu.$$
With this notation,
$$\|\theta\|_{L^2} = \sup_{g \in B_{L^2}(X)} \langle\theta, \widetilde{D} \circ \Gamma^* g\rangle = \sup_{\phi \in \mathcal{E}} \langle\theta, \phi\rangle, \quad (17)$$
where $\mathcal{E} = \widetilde{D} \circ \Gamma^*\big(B_{L^2}(X)\big)$ is the dual (in $\mathbb{R}^n$) ellipsoid of the $\|\cdot\|_{L^2}$ norm.
Our aim is to bound the quantity
$$\inf_{\theta \in S^{n-1}} \|\theta\|_{L^2}.$$
By (17), this quantity equals the length of the smallest principal axis of the ellipsoid $\mathcal{E}$. Since the lengths of the principal axes are the singular values of $\widetilde{D} \circ \Gamma^*$, we bound the singular values of this operator.
The singular values of $\Gamma^*$ are given by $s_i(\Gamma^*) = \lambda_i^{-1/2}$, where $\lambda_i$ are the non-zero eigenvalues of the Laplacian (in non-decreasing order, so that the $s_i(\Gamma^*)$ do not increase). To bound the singular values of $\widetilde{D}$, we show that
$$\|\widetilde{D}\|_{HS} \le 1. \quad (18)$$
Indeed, one readily verifies that
$$\widetilde{D} \circ \widetilde{D}^*(\theta) = \int \langle\theta, DT(e)\rangle\, DT(e)\, d\nu.$$
For fixed $e \in E$, the trace of the operator $\theta \mapsto \langle\theta, DT(e)\rangle\, DT(e)$ is $|DT(e)|^2$, which satisfies
$$|DT(e)| \le 1 \quad (19)$$
by the Lipschitz assumption. Therefore (18) follows by averaging. Now, by (14), $s_i(\widetilde{D}) \le \frac{1}{\sqrt{i}}$, and an application of (13) (with $i = j = n/2$) completes the proof.
References
[1] N. Alon and V. D. Milman. λ1, isoperimetric inequalities for graphs, and superconcentrators. Journal of Combinatorial Theory, Series B, 38(1):73–88, 1985.
[2] R. Bhatia. Matrix Analysis. Graduate Texts in Mathematics, 169. Springer-Verlag, New York, 1997.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[4] F. R. Chung. Spectral Graph Theory. Vol. 92. American Mathematical Society, 1997.
[5] P. Diaconis. Group Representations in Probability and Statistics. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 11. Institute of Mathematical Statistics, Hayward, CA, 1988.
[6] L. C. Evans. Partial Differential Equations. Graduate Studies in Mathematics. American Mathematical Society, 1998.
[7] I. C. Gohberg and M. G. Kreĭn. Introduction to the Theory of Linear Nonselfadjoint Operators. Translations of Mathematical Monographs, Vol. 18. American Mathematical Society, 1969.
[8] M. Gromov. Metric Structures for Riemannian and Non-Riemannian Spaces. Based on the 1981 French original. With appendices by M. Katz, P. Pansu and S. Semmes. Translated from the French by Sean Michael Bates. Progress in Mathematics, 152. Birkhäuser Boston, Inc., Boston, MA, 1999.
[9] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs, Vol. 89. American Mathematical Society, 2005.
[10] V. Milman and G. Schechtman. Asymptotic Theory of Finite-Dimensional Normed Spaces. With an appendix by M. Gromov. Lecture Notes in Mathematics, 1200. Springer-Verlag, Berlin, 1986.