CSIS8601: Probabilistic Method & Randomized Algorithms
Lecture 6: Locality Sensitive Hashing: Nearest Neighbor Search
Lecturer: Hubert Chan
Date: 7 Oct 2009
These lecture notes are supplementary materials for the lectures. They are by no means a substitute for attending lectures or a replacement for your own notes!
1 Nearest Neighbor Search in Hamming Space
Definition 1.1 Let $H = \{0, 1\}$. The $d$-dimensional Hamming space $H^d$ consists of bit strings of length $d$. Each point $x \in H^d$ is a string $x = (x_0, x_1, \ldots, x_{d-1})$ of zeros and ones. Given two points $x, y \in H^d$, the Hamming distance $d_H(x, y)$ between them is the number of positions at which the corresponding strings differ, i.e.,
$$d_H(x, y) := |\{i : x_i \neq y_i\}|.$$
Example.
The DNA double helix consists of two strands. The four bases found in DNA are A, C, G and T.
Each type of base on one strand forms a bond with just one type of base on the other strand, with
A bonding only to T, and C bonding only to G. If we use 0 to represent an A-T bond and 1 to represent a C-G bond, each DNA molecule can be represented as a point in the Hamming space.
Remark 1.2 Given two points $x, y \in H^d$, their Hamming distance $d_H(x, y)$ can be computed in $O(d)$ time.
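For concreteness, here is a minimal Python sketch of this computation; representing points as tuples of bits is our own choice.

    def hamming_distance(x, y):
        # Number of positions at which two equal-length bit strings differ, in O(d) time.
        assert len(x) == len(y)
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    # The strings 0110 and 1100 differ in positions 0 and 2.
    assert hamming_distance((0, 1, 1, 0), (1, 1, 0, 0)) == 2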
Definition 1.3 (Nearest Neighbor Search) Given a set $S$ of $n$ points in Hamming space $H^d$ and a query point $q \in H^d$, a nearest neighbor for $q$ in $S$ is a point $p \in S$ such that the Hamming distance $d_H(p, q)$ is minimized.
Remark 1.4 (Naive Method) Given a query point $q$, the Hamming distance $d_H(p, q)$ for each $p \in S$ can be computed in time $O(d)$. Hence, it takes $O(nd)$ time to find a nearest neighbor. Observe that the running time is linear in the number $n$ of points in $S$.
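A direct Python sketch of the naive method, reusing hamming_distance from the snippet above:

    def nearest_neighbor(S, q):
        # Linear scan in O(nd) time: return a point of S closest to q.
        return min(S, key=lambda p: hamming_distance(p, q))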
Definition 1.5 (Approximate Nearest Neighbor Search) Given a set $S$ of $n$ points in Hamming space $H^d$, an approximation ratio $c > 1$ and a query point $q \in H^d$, a $c$-nearest neighbor for $q$ in $S$ is a point $p \in S$ such that the Hamming distance $d_H(p, q) \leq c \cdot d_H(p^*, q)$, where $p^*$ is a nearest neighbor of $q$.
Given a set $S$ of $n$ points in $H^d$, the goal is to design a data structure to store $S$ such that given a query point $q$, an approximate nearest neighbor can be returned in sublinear time $o(n)$. We next look at a sub-problem that can help us achieve this goal.
Definition 1.6 (Approximate Range Search) Suppose $S$ is a set of $n$ points in Hamming space $H^d$. A range parameter $r > 0$ and an approximation ratio $c > 1$ are given. Given a query point $q \in H^d$, the search with specified range $r$ and approximation ratio $c$ does the following: if there is some point $p^* \in S$ such that $d_H(p^*, q) \leq r$, return a point $p \in S$ that satisfies $d_H(p, q) \leq cr$.
Remark 1.7 If there is no point $p^* \in S$ such that $d_H(p^*, q) \leq r$, the approximate range search could still return a point $p \in S$ that satisfies $r < d_H(p, q) \leq cr$.
2 Locality Sensitive Hashing
The algorithm that we describe is due to Indyk and Motwani. Suppose $\mathcal{F}$ is a family of hash functions of the form $h : H^d \to \{0, 1\}$. A hash function $h$ is picked uniformly at random from $\mathcal{F}$. The idea is that if two points $x$ and $y$ in $H^d$ are close with respect to their Hamming distance, then the probability that $h(x) = h(y)$ should be high, and if they are far apart, the probability should be low.
Definition 2.1 For $r_1 < r_2$ and $p_1 > p_2$, a family $\mathcal{F}$ of hash functions is $(r_1, r_2, p_1, p_2)$-sensitive for $H^d$ if for any $x, y \in H^d$,
1. if $d_H(x, y) \leq r_1$, then $\Pr_{h \in \mathcal{F}}[h(x) = h(y)] \geq p_1$;
2. if $d_H(x, y) > r_2$, then $\Pr_{h \in \mathcal{F}}[h(x) = h(y)] \leq p_2$.
We next define a family $\mathcal{F}_d$ of $d$ hash functions. Each is of the form $h^{(i)} : H^d \to \{0, 1\}$, where $h^{(i)}(x) := x_i$. In other words, $h^{(i)}(x)$ returns the $i$th position of a point $x$.
Lemma 2.2 For $x, y \in H^d$, $\Pr_{h \in \mathcal{F}_d}[h(x) = h(y)] = 1 - \frac{d_H(x, y)}{d}$.
Proof: Let $\delta := d_H(x, y)$. This means $x$ and $y$ differ at exactly $\delta$ positions. Hence, when a position $i$ is chosen uniformly at random, the probability that $x_i = y_i$ is $1 - \frac{\delta}{d}$. It follows that $\Pr_{h \in \mathcal{F}_d}[h(x) = h(y)] = 1 - \frac{d_H(x, y)}{d}$.
Corollary 2.3 For $r > 0$, $c > 1$ and $cr \leq d$, the family $\mathcal{F}_d$ is $(r, cr, 1 - \frac{r}{d}, 1 - \frac{cr}{d})$-sensitive for the Hamming space $H^d$.

We take $r_1 := r$, $r_2 := cr$, $p_1 := 1 - \frac{r}{d}$ and $p_2 := 1 - \frac{cr}{d}$. Since $1 > p_1 > p_2$, we have $\rho := \frac{\log p_1}{\log p_2} < 1$.
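As a quick sanity check on these quantities, an illustrative computation in Python; the values of $d$, $r$ and $c$ are our own choice, subject to $cr \leq d$.

    import math

    d, r, c = 128, 8, 2
    p1 = 1 - r / d        # = 0.9375, collision probability bound for close pairs
    p2 = 1 - c * r / d    # = 0.875,  collision probability bound for far pairs
    rho = math.log(p1) / math.log(p2)
    print(rho)            # approximately 0.483, indeed less than 1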
2.1 Hashing by Dimension Reduction
We use the family $\mathcal{F}_d$ of hash functions to build another family $\mathcal{G}_K$ of hash functions, where $K$ is a width parameter to be determined later. A hash function from $\mathcal{G}_K$ is of the form $g : H^d \to \{0, 1\}^K$. A hash function $g$ can be picked uniformly at random from $\mathcal{G}_K$ in the following way.
1. For each $i \in [K]$, pick $h_i$ uniformly at random from $\mathcal{F}_d$, independently.
2. Let $g := (h_i)_{i \in [K]}$ be the concatenation of the $h_i$'s.
For $x \in H^d$, $g(x) := (h_0(x), h_1(x), \ldots, h_{K-1}(x))$.
Hash Table $T$ with Key $g$. For each point $x \in H^d$, a pointer to the point $x$ can be stored in a hash table $T$ with the key $g(x) \in \{0, 1\}^K$. The space for the hash table is proportional to the number of pointers stored. Hence, if there are $n$ points, the space for the hash table $T$ is $O(n)$. Each computation of $g(x)$ takes $O(K)$ time. Hence, the time for constructing a hash table is $O(nK)$.
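A minimal Python sketch of $\mathcal{G}_K$ and of a single hash table; the function names are our own, and dictionary buckets stand in for the stored pointers.

    import random

    def sample_g(d, K, rng=random):
        # Pick g from G_K: K coordinate indices, each uniform over [d], independently.
        return [rng.randrange(d) for _ in range(K)]

    def apply_g(g, x):
        # g(x) = (x_{i_0}, ..., x_{i_{K-1}}), computed in O(K) time.
        return tuple(x[i] for i in g)

    def build_table(S, g):
        # Hash table T keyed by g(x); total space O(n) for n points.
        T = {}
        for x in S:
            T.setdefault(apply_g(g, x), []).append(x)
        return T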
2.2 Data Structure for Approximate Range Search
Recall that we are given a set $S$ of $n$ points in $H^d$, a range $r > 0$ and an approximation ratio $c > 1$. Let $p_1 := 1 - \frac{r}{d}$ and $p_2 := 1 - \frac{cr}{d}$. Define $\rho := \frac{\log p_1}{\log p_2} < 1$.

Let $K := \log_{1/p_2} n$ be the width of the hash family $\mathcal{G}_K$ and $L := n^\rho$ be the number of hash tables in our data structure. For ease of presentation, we assume that $K$ and $L$ are integers and do not worry about rounding off fractions.
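Continuing the illustrative numbers from Corollary 2.3, the parameters work out roughly as follows (the value of $n$ is our own choice).

    import math

    n, p2, rho = 10**6, 0.875, 0.483
    K = round(math.log(n) / math.log(1 / p2))   # K = log_{1/p2} n, about 103
    L = round(n ** rho)                         # L = n^rho, about 790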
Construction of Data Structure. For each $i \in [L]$, a hash table $T_i$ storing pointers to points in $S$ is constructed as follows.
1. Pick a function $g_i$ from $\mathcal{G}_K$ uniformly at random. This requires $O(K \log d)$ random bits.
2. For each point $x \in S$, store the pointer to the point $x$ in the hash table $T_i$ using key $g_i(x)$.
The space used for the $L$ hash tables is $O(Ln)$ words and the space to store the original $n$ points is $O(nd)$. The construction time is $O(LnK)$. The total space is $O(n(n^\rho + d))$ and the time is $\tilde{O}(n^{1+\rho})$. The notation $\tilde{O}$ hides factors logarithmic in $n$.
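Putting the pieces together, a sketch of the construction that reuses sample_g and build_table from above; the seed argument is our own addition for reproducibility.

    import random

    def build_data_structure(S, d, K, L, seed=0):
        # Build the L hash tables T_0, ..., T_{L-1}, each with its own g_i.
        rng = random.Random(seed)
        gs = [sample_g(d, K, rng) for _ in range(L)]   # L independent draws from G_K
        tables = [build_table(S, g) for g in gs]       # O(LnK) construction time
        return gs, tables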
Query Algorithm. Given a query point $q \in H^d$, a point $p \in S$ is returned by the following algorithm.
1. For each $i \in [L]$, compute the key $g_i(q)$ and use the key to retrieve the points stored in the hash table $T_i$.
2. For each retrieved point $p$, if $d_H(p, q) \leq cr$, return the point $p$ and terminate.
3. At most $2L$ retrieved points are examined. If after $2L$ points the algorithm still has not returned a point, the algorithm terminates without returning a point.
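A sketch of this query procedure under the same assumptions; the name range_query is our own.

    def range_query(q, gs, tables, c, r):
        # Return some p with d_H(p, q) <= c*r, or None; inspects at most 2L points.
        L = len(tables)
        examined = 0
        for g, T in zip(gs, tables):
            for p in T.get(apply_g(g, q), ()):
                if hamming_distance(p, q) <= c * r:
                    return p
                examined += 1          # a far point that collided with q
                if examined >= 2 * L:
                    return None        # give up after 2L bad candidates
        return None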
Running time for Query. Since each computation $g_i(q)$ takes $O(K)$ time, and each computation $d_H(p, q)$ takes $O(d)$ time, the total time of the algorithm is $O(L(K + d)) = \tilde{O}(n^\rho d)$. Observe that $\rho < 1$.
2.3 Correctness
We show that for each query point $q$, the algorithm performs correctly with probability at least $\frac{1}{2} - \frac{1}{e}$. Using a standard repetition technique, the failure probability can be reduced to $\delta$ if we repeat the whole data structure $O(\log \frac{1}{\delta})$ times.
That the algorithm behaves correctly means that if there is a point $p^* \in S$ such that $d_H(p^*, q) \leq r$, then a point $p$ is returned such that $d_H(p, q) \leq cr$.
Suppose $p^* \in S$ is a point such that $d_H(p^*, q) \leq r$. The algorithm behaves correctly if both of the following events happen.
1. $E_1$: there exists some $i \in [L]$ such that $g_i(p^*) = g_i(q)$.
2. $E_2$: there exist fewer than $2L$ points $p$ such that $d_H(p, q) > cr$ and $g_i(p) = g_i(q)$ for some $i \in [L]$.
The event $E_1$ ensures that the point $p^*$ would be a candidate point for consideration. The event $E_2$ ensures that there are not too many bad collision points, so that the algorithm does not terminate before the point $p^*$ (or some other legitimate point) can be found.
We show that $\Pr[\overline{E_1}] \leq \frac{1}{e}$ and $\Pr[\overline{E_2}] \leq \frac{1}{2}$. By the union bound, we can conclude that $\Pr[E_1 \cap E_2] \geq \frac{1}{2} - \frac{1}{e}$.

Since $d_H(p^*, q) \leq r$, for a fixed $i$, the probability that $g_i(p^*) = g_i(q)$ is at least
$$p_1^K = p_1^{\log_{1/p_2} n} = n^{-\frac{\log 1/p_1}{\log 1/p_2}} = n^{-\rho} = \frac{1}{L}.$$
Hence, $\Pr[\overline{E_1}] \leq (1 - \frac{1}{L})^L \leq \frac{1}{e}$.
Fix $i \in [L]$ and a point $p$ such that $d_H(p, q) > cr$. Then the probability that $g_i(p) = g_i(q)$ is at most $p_2^K = \frac{1}{n}$. Hence, it follows that the expected number of points $p$ such that $d_H(p, q) > cr$ and $g_i(p) = g_i(q)$ is at most $1$.
Hence, the expected number of points $p$ such that $d_H(p, q) > cr$ and $g_i(p) = g_i(q)$ for some $i$ is at most $L$. By Markov's inequality, the probability that there are at least $2L$ such points is at most $\frac{L}{2L} = \frac{1}{2}$, i.e., $\Pr[\overline{E_2}] \leq \frac{1}{2}$, as required.
3 From Approximate Range Query to Approximate Nearest Neighbor
The data structure described above can be repeated for different values of $r$. In particular, define $I := \lceil \log_{1+\epsilon} \Delta \rceil$, where $\Delta := d = \max_{x,y} d_H(x, y)$. For each $i \in [I]$, let $r_i := (1 + \epsilon)^i$, and we build a data structure $D_i$ with range $r_i$ and approximation ratio $c = 1 + \epsilon$.

For the query algorithm, we can use binary search to find the smallest $i$ for which the corresponding data structure $D_i$ returns a point.
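A sketch of this binary search; the query interface on each $D_i$ is a hypothetical wrapper around the structure built above, since the notes do not fix one, and the list structures is assumed to be ordered by increasing range $r_i$.

    def approx_nearest_neighbor(q, structures):
        # Binary search for the smallest i whose structure D_i returns a point.
        lo, hi, best = 0, len(structures) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            p = structures[mid].query(q)
            if p is not None:
                best, hi = p, mid - 1   # point found: try a smaller range
            else:
                lo = mid + 1            # no point within c * r_mid: enlarge the range
        return best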