The Johnson-Lindenstrauss Lemma
Suppose I have a subset X of R^n given by {x_1, . . . , x_k}. Suppose that the thing I care most about in this subset is the distances between the points, e.g., I don't really care where the points lie in R^n, but I do care about d(x_i, x_j), where d is the Euclidean distance.
The Johnson-Lindenstrauss Lemma says that I can take a linear map T : R^n → R^l such that (1 − ε)d(x_i, x_j) ≤ d(T(x_i), T(x_j)) ≤ (1 + ε)d(x_i, x_j), where l ≥ K log k and K = K(ε).¹ We note that the dependence upon ε is such that this is only going to be useful for a large number of points.
This is roughly (very roughly) saying that to represent the Euclidean distances between n points you only need O(log n) co-ordinates of data, and can thus ignore the rest. The question is: which O(log n) co-ordinates of data do we take? It is pretty obvious we can't just take the first O(log n) co-ordinates; this is certainly not going to work (all the interesting distance information could be hidden in the remaining co-ordinates). However, this is basically what we do: we choose a random l-dimensional subspace of R^n and then project onto it. We will then argue that this is sufficient. The main idea is that the length of a unit vector, when projected onto a random k-dimensional subspace, is sharply concentrated around its mean.
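As a concrete illustration of this procedure (a minimal sketch, not part of the notes; NumPy and SciPy are assumed, and the sizes are chosen only for the example), the following projects a point set onto a random l-dimensional subspace, obtained from the QR factorisation of a Gaussian matrix, rescales by √(n/l) so that squared lengths are preserved in expectation, and reports the extreme pairwise distortions.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def project_random_subspace(points, l):
    # points has shape (k, n); project the k points onto a uniformly random
    # l-dimensional subspace of R^n and rescale by sqrt(n/l).
    k, n = points.shape
    Q, _ = np.linalg.qr(rng.standard_normal((n, l)))  # orthonormal basis of a random subspace
    return points @ Q * np.sqrt(n / l)

points = rng.standard_normal((50, 1000))    # 50 points in R^1000, for illustration
proj = project_random_subspace(points, 200)

ratios = pdist(proj, "sqeuclidean") / pdist(points, "sqeuclidean")
print(ratios.min(), ratios.max())            # typically both close to 1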
Some Notes on the Proof
The proof we present here is due to Sanjoy Dasgupta and Anupam Gupta. It was an iterative refinement of previous proofs, so even though some of its steps look unexpected or unusual, there is a logical reason why someone thought of each of its main features.
The main proof idea has always been to pick a random linear map and apply it. Some proofs have taken the linear map to be a matrix of random ±1 entries, which is then scaled. It is intuitively clear that if we take two fixed-length sequences of ±1s at random, the resulting vectors are expected to be orthogonal (or almost orthogonal), hence the map is 'almost' a projection.²
Other proofs have taken k random vectors U_i on the unit sphere and then taken f(x) = (1/√d)(⟨U_i, x⟩)_i. Again, when we choose k random vectors on the unit sphere, with k < d, the resulting vectors are 'almost orthogonal,' thus we are doing something similar: choosing a random k-dimensional space and projecting onto it.
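As a rough sketch of the ±1 construction just described (again not from the notes; NumPy is assumed, and the 1/√k scaling is one common normalisation chosen here only for illustration):

import numpy as np

rng = np.random.default_rng(0)

def sign_map(points, k):
    # points has shape (m, n); apply a random matrix of independent ±1 entries,
    # scaled by 1/sqrt(k), mapping R^n into R^k.
    n = points.shape[1]
    R = rng.choice([-1.0, 1.0], size=(n, k))
    return points @ R / np.sqrt(k)

The appeal of this variant (compare footnote 2) is computational: the entries are cheap to generate and applying the map is a single matrix multiplication.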
The first proof, due to Johnson and Lindenstrauss (for whom the lemma is named), actually followed very similar lines to the one we present: they chose a random k-dimensional subspace and projected onto it. However, their proof was somewhat more complicated: they defined a random k-dimensional projection as UAU*, where A is the projection onto the first k coordinate vectors and U is a random orthogonal matrix. This then had to be coupled with computing the measure of the set of orthogonal matrices for which the result is true.
The techniques used to prove the inequalities for random variables are standard: they are
similar to the techniques used in proving the Chernoff bound for binomial distributions.
¹ We take ε ∈ (0, 1). For ε close to 1 this is not very useful.
² Note that this implementation is particularly amenable to computation.
The proof
Theorem. For any ε ∈ (0, 1) and any integer n, let k be a positive integer such that k ≥ 4(ε²/2 − ε³/3)^{−1} log n. Then for any set V of n points in R^d there is a map f : R^d → R^k such that for all u and v in V we have (1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².
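For a rough sense of scale (the numbers here are purely illustrative, and log is taken to be the natural logarithm, as in the proof below): with ε = 0.1 and n = 10^6 points we have ε²/2 − ε³/3 ≈ 0.00467, so the theorem asks for k ≥ 4 log(10^6)/0.00467 ≈ 4 · 13.8/0.00467 ≈ 11,800, independently of the ambient dimension d. The reduction is therefore only a saving when d is much larger than this.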
Let X_1, . . . , X_d be d independent Gaussian N(0, 1) variables and let
Y = (1/‖X‖)(X_1, . . . , X_d).
This point Y is chosen uniformly at random from the surface of the d-dimensional sphere S^{d−1}.³
Let Z ∈ R^k be the projection of Y onto its first k co-ordinates and let L = ‖Z‖². We note that μ = E(L) = k/d: by symmetry each E(Y_i²) is the same, and they sum to E(‖Y‖²) = 1, so E(Y_i²) = 1/d. We argue that L is tightly concentrated around its mean.
Lemma. Let k < d. Then if β < 1, we have that
P(L ≤ βk/d) ≤ β^{k/2} (1 + k(1 − β)/(d − k))^{(d−k)/2} ≤ exp((k/2)(1 − β + log β)),
and if β > 1 we have that
P(L ≥ βk/d) ≤ β^{k/2} (1 + k(1 − β)/(d − k))^{(d−k)/2} ≤ exp((k/2)(1 − β + log β)).
Proof. We prove the first of these results, the second being computed identically.
We first note that if X ∼ N(0, 1) then, for s < 1/2,
E(exp(sX²)) = (1/√(2π)) ∫ e^{sx²} e^{−x²/2} dx = (1/√(2π)) ∫ e^{−(1−2s)x²/2} dx = 1/√(1 − 2s).
Now, we aim to estimate P(d(X_1² + ... + X_k²) ≤ βk(X_1² + ... + X_d²)), since this is an elementary rewriting of the probability in the lemma. Now, P(d(X_1² + ... + X_k²) ≤ βk(X_1² + ... + X_d²)) = P(βk(X_1² + ... + X_d²) − d(X_1² + ... + X_k²) ≥ 0). Applying the exponential (with a parameter t > 0) to each side, this equals P[exp(t(kβ(X_1² + ... + X_d²) − d(X_1² + ... + X_k²))) ≥ 1]. Applying Markov's inequality we get that this is ≤ E[exp(t(kβ(X_1² + ... + X_d²) − d(X_1² + ... + X_k²)))].
Now we note that since all of the variables X_1, . . . , X_d are independent, E(f(X_i)g(X_j)) = E(f(X_i))E(g(X_j)). Applying this to the expectation above, we get that it is equal to E[exp(tkβX²)]^{d−k} E[exp(t(kβ − d)X²)]^k. We must have tkβ < 1/2 and t(kβ − d) < 1/2 for these expectations to be finite; since kβ − d < 0, this amounts to 0 < t < 1/(2kβ).
However, we have computed what these are, and we get that this equals (1 − 2tkβ)^{−(d−k)/2} (1 − 2t(kβ − d))^{−k/2}. We then wish to minimize this over t, a parameter we have introduced. Instead of minimizing this over t we instead maximise f(t) = (1 − 2tkβ)^{d−k} (1 − 2t(kβ − d))^k over t, and then use such a value of t. Differentiating f and setting the derivative equal to zero we find that the maximum is at t = (1 − β)/(2β(d − kβ)), which is certainly in 0 < t < 1/(2kβ). Setting t equal to this in the derived expression, the first inequality follows.
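Carrying out this substitution explicitly: with t = (1 − β)/(2β(d − kβ)) we get 1 − 2tkβ = (d − k)/(d − kβ) and 1 − 2t(kβ − d) = 1/β, so the bound (1 − 2tkβ)^{−(d−k)/2} (1 − 2t(kβ − d))^{−k/2} becomes β^{k/2} ((d − kβ)/(d − k))^{(d−k)/2} = β^{k/2} (1 + k(1 − β)/(d − k))^{(d−k)/2}.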
We have that
P(L ≤ βk/d) ≤ β^{k/2} (1 + k(1 − β)/(d − k))^{(d−k)/2}.
To show that the right hand side of this inequality is ≤ exp((k/2)(1 − β + log β)) we note the following: β^{k/2} (1 + k(1 − β)/(d − k))^{(d−k)/2} = β^{k/2} (1 + (k(1 − β)/2)/((d − k)/2))^{(d−k)/2}. We now use the fact that (1 + x/n)^n ≤ e^x to deduce that this is ≤ β^{k/2} exp((k/2)(1 − β)). But then the inequality follows.
³ Gaussians are rotation invariant, thus when we rotate around the sphere we do not affect the probability.
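As a quick numerical sanity check of the lemma (not part of the original argument; NumPy is assumed, and d, k, β below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
d, k, beta = 500, 20, 0.5
trials = 100_000

# Sample uniform points on S^{d-1} as normalised Gaussians, project onto the
# first k coordinates, and record L, the squared length of the projection.
X = rng.standard_normal((trials, d))
L = (X[:, :k] ** 2).sum(axis=1) / (X ** 2).sum(axis=1)

empirical = (L <= beta * k / d).mean()
bound = np.exp(0.5 * k * (1 - beta + np.log(beta)))
print(empirical, bound)   # the empirical tail probability should lie below the bound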
Proof of JL Lemma. If d ≤ k the result is trivial. Otherwise take a random k-dimensional subspace S and let v_i' be the projection of v_i onto S. Then setting L = ‖v_i' − v_j'‖² and μ = (k/d)‖v_i − v_j‖² and applying the lemma, we get that P(L ≤ (1 − ε)μ) ≤ exp((k/2)(1 − (1 − ε) + log(1 − ε))). Using the inequality log(1 − x) ≤ −x − x²/2 we get that this is ≤ exp(−kε²/4). Now, since k ≥ 4(ε²/2 − ε³/3)^{−1} log n, we get that this is ≤ exp(−2 log n) = 1/n².
Similarly, applying the second part of the lemma and the fact that log(1 + x) ≤ x − x²/2 + x³/3, we get that P(L ≥ (1 + ε)μ) ≤ 1/n².
We now set the map f(v_i) = √(d/k) v_i'. By the above calculation, for some fixed pair i, j the probability that the distortion ‖f(v_i) − f(v_j)‖²/‖v_i − v_j‖² is not in the range (1 − ε, 1 + ε) is at most 2/n². Using the trivial union bound, the probability that some pair of points suffers distortion outside this range is at most (n choose 2) · 2/n² = 1 − 1/n. Hence f has the desired properties with probability at least 1/n, and repeating this algorithm O(n) times gives us an algorithm to determine such an f.
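A minimal sketch of this repeat-until-success procedure (not from the notes; NumPy is assumed and max_tries is an arbitrary cap added only for illustration):

import numpy as np

rng = np.random.default_rng(0)

def jl_embedding(points, k, eps, max_tries=1000):
    # points has shape (n, d). Repeatedly project onto a random k-dimensional
    # subspace, rescale by sqrt(d/k), and accept the first map whose pairwise
    # squared-distance distortions all lie in (1 - eps, 1 + eps).
    n, d = points.shape
    orig = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    i, j = np.triu_indices(n, 1)
    for _ in range(max_tries):
        Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # basis of a random subspace
        mapped = points @ Q * np.sqrt(d / k)
        new = np.sum((mapped[:, None, :] - mapped[None, :, :]) ** 2, axis=2)
        ratios = new[i, j] / orig[i, j]
        if np.all((ratios > 1 - eps) & (ratios < 1 + eps)):
            return mapped
    raise RuntimeError("no good projection found within max_tries attempts")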