Optimal Data-Dependent Hashing for Nearest Neighbor Search

Alex Andoni (Columbia University)
Joint work with: Ilya Razenshteyn

Nearest Neighbor Search (NNS)

• Preprocess: a set P of points
• Query: given a query point q, report a point p* ∈ P with the smallest distance to q

Motivation

• Generic setup:
  - Points model objects (e.g., images)
  - Distance models a (dis)similarity measure
• Application areas:
  - machine learning: k-NN rule
  - speech/image/video/music recognition, vector quantization, bioinformatics, etc.
• Distances: Hamming, Euclidean, edit distance, earth-mover distance, etc.
• Core primitive for: closest pair, clustering, ...

[Figure: example points as binary strings (000000, 011100, ...) with query q and nearest neighbor p*]

Curse of Dimensionality

• All exact algorithms degrade rapidly with the dimension d

  Algorithm                  | Query time    | Space
  Full indexing              | O(d · log n)  | n^O(d) (Voronoi diagram size)
  No indexing – linear scan  | O(n · d)      | O(n · d)

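For concreteness, here is a minimal brute-force sketch of the "no indexing" row (illustrative code, not from the talk): each query pays O(n · d).

```python
import numpy as np

def linear_scan_nn(P: np.ndarray, q: np.ndarray) -> int:
    """Exact NNS by linear scan: O(n * d) time per query, O(n * d) space.

    P is an (n, d) array of data points, q a length-d query; returns the index
    of the closest point in Euclidean distance.
    """
    dists = np.linalg.norm(P - q, axis=1)   # one distance per data point
    return int(np.argmin(dists))
```
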
Approximate NNS

• r-near neighbor: given a query point q, report a point p′ ∈ P s.t. ‖p′ − q‖ ≤ cr
  - as long as there is some point within distance r
• Practice: use for exact NNS
  - Filtering: gives a set of candidates (hopefully small)

[Figure: balls of radius r and cr around q, containing p* and p′]

NNS algorithms

• Exponential dependence on dimension:
  [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], ...
• Linear dependence on dimension:
  [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [A.-Indyk-Nguyen-Razenshteyn'14], [A.-Razenshteyn'15], [Pagh'16], ...

Locality-Sensitive Hashing [Indyk-Motwani'98]

• Random hash function h on R^d satisfying:
  - for a close pair, when ‖q − p‖ ≤ r: P1 = Pr[h(q) = h(p)] is "not-so-small"
  - for a far pair, when ‖q − p′‖ > cr: P2 = Pr[h(q) = h(p′)] is "small"
• Use several hash tables: n^ρ of them, where ρ = log(1/P1) / log(1/P2)

[Figure: Pr[g(q) = g(p)] as a function of ‖q − p‖, dropping from P1 at distance r to P2 at distance cr]

LSH Algorithms

                    Space     Time   Exponent    c = 2     Reference
  Hamming space     n^(1+ρ)   n^ρ    ρ = 1/c     ρ = 1/2   [IM'98]
                                     ρ ≥ 1/c               [MNP'06, OWZ'11]
  Euclidean space   n^(1+ρ)   n^ρ    ρ = 1/c     ρ = 1/2   [IM'98, DIIM'04]
                                     ρ ≈ 1/c²    ρ = 1/4   [AI'06]
                                     ρ ≥ 1/c²              [MNP'06, OWZ'11]

LSH is tight. Are we really done with basic NNS algorithms?

Detour: Closest Pair Problem

• Closest Pair Problem (bichromatic) in {0,1}^d:
  - given A, B, find a ∈ A, b ∈ B at distance ≤ cr
  - assuming there is a pair at distance ≤ r
• Via LSH, with c = 1 + ε:
  - Preprocess A, and then query each b ∈ B
  - Gives n^(2−Θ(ε)) time
• [Alman-Williams'15]:
  - exact in O(n^(2−δ)) time for d < (O(log² 1/δ) / δ) · log n
• [Valiant'12, Karppa-Kaski-Kohonen'16]:
  (points random in {0,1}^d at distance cr = d/2; close pair random at distance r = (d/2)/c)
  - Average case: n^(2ω/3 + o_ε(1)) for d ≫ log n
  - Worst case: n^(2−Θ(√ε))
• Much more efficient algorithms for d = O(log n) or for the average case

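The "via LSH" reduction above is just: index A, then query each b ∈ B and check the candidates. A sketch reusing the BitSamplingLSH class from the earlier slide (all parameter names illustrative):

```python
def closest_pair_via_lsh(A, B, d, r, c, k, L):
    """Bichromatic closest pair via LSH: preprocess A, query each b in B.

    With c = 1 + eps and suitably chosen k, L this runs in roughly
    n^(2 - Theta(eps)) time; returns (distance, i, j) for a pair within cr,
    or None if no candidate pair qualifies.
    """
    index = BitSamplingLSH(d=d, k=k, L=L)
    for i, a in enumerate(A):
        index.insert(i, a)
    best = None
    for j, b in enumerate(B):
        for i in index.query(b):
            dist = sum(x != y for x, y in zip(A[i], b))   # Hamming distance
            if dist <= c * r and (best is None or dist < best[0]):
                best = (dist, i, j)
    return best
```
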
Beyond Locality Sensitive Hashing

                    Space     Time   Exponent          c = 2        Reference
  Hamming space     n^(1+ρ)   n^ρ    ρ = 1/c           ρ = 1/2      [IM'98]           (LSH)
                                     ρ ≥ 1/c                        [MNP'06, OWZ'11]  (LSH)
                    n^(1+ρ)   n^ρ    complicated       ρ = 1/2 − ε  [AINR'14]
                                     ρ ≈ 1/(2c − 1)    ρ = 1/3      [AR'15]
  Euclidean space   n^(1+ρ)   n^ρ    ρ ≈ 1/c²          ρ = 1/4      [AI'06]           (LSH)
                                     ρ ≥ 1/c²                       [MNP'06, OWZ'11]  (LSH)
                    n^(1+ρ)   n^ρ    complicated       ρ = 1/4 − ε  [AINR'14]
                                     ρ ≈ 1/(2c² − 1)   ρ = 1/7      [AR'15]

New approach?

Data-dependent hashing

• A random hash function, chosen after seeing the given dataset
• Efficiently computable

A look at LSH lower bounds

• LSH lower bounds in Hamming space (Fourier analytic):
  - [Motwani-Naor-Panigrahy'06]:
    - H a distribution over hash functions h: {0,1}^d → U
    - Far pair: p, q random at distance δd
    - Close pair: p, q random at distance δd/c
    - Get ρ ≥ 0.5/c
  - [O'Donnell-Wu-Zhou'11]: ρ ≥ 1/c
    (points random in {0,1}^d at distance cr = d/2; close pair at distance r = (d/2)/c)

Why not an NNS lower bound?

• Suppose we try to apply [OWZ'11] to NNS:
  - Pick a random q
  - All the "far" points are concentrated in a small ball of radius εd
  - Easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball

Construction of the hash function

• Two components:
  - Average-case: nicer structure (has better LSH)
  - Reduction to such structure (data-dependent)
• Like a (very weak) "regularity lemma" for a set of points

Average-case: nicer structure [A.-Indyk-Nguyen-Razenshteyn'14]

• Think: a random dataset on a sphere of radius cr/√2
  - vectors perpendicular to each other
  - so that two random points are at distance ≈ cr
• Lemma: ρ = 1/(2c² − 1), via Cap Carving

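One concrete way to picture "cap carving" is partitioning the sphere into random spherical caps. The sketch below is an illustrative simplification (assign each unit vector to the first random Gaussian direction whose cap contains it), not the exact partition analyzed in [AINR'14]/[AR'15]; the number of caps T and the threshold eta are hypothetical parameters.

```python
import numpy as np

def cap_carving_labels(points: np.ndarray, T: int, eta: float, seed: int = 0) -> np.ndarray:
    """Partition unit vectors into spherical caps (simplified cap carving).

    Draw T random Gaussian directions g_1, ..., g_T; a point x is assigned to
    the first cap {x : <g_t, x> >= eta} containing it (label T if none does).
    Close points tend to land in the same cap, while near-orthogonal points
    rarely do -- the intuition behind rho = 1/(2c^2 - 1) on a random dataset.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((T, points.shape[1]))   # random cap directions
    scores = points @ G.T                           # <x, g_t> for every point and cap
    hit = scores >= eta
    return np.where(hit.any(axis=1), hit.argmax(axis=1), T)
```
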
Reduction to nice structure

• Idea: iteratively decrease the radius of the minimum enclosing ball
• Algorithm (sketched below):
  - find dense clusters
    - with smaller radius
    - containing a large fraction of points
    - recurse on the dense clusters
  - apply cap carving on the rest
    - recurse on each "cap"
    - e.g., dense clusters might reappear
• Why is the rest OK?
  - no dense clusters
  - like a "random dataset" with radius = 100cr
  - even better!

[Figure: ball of radius 100cr with dense clusters of radius 99cr; picture not to scale & dimension]

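A high-level sketch of that recursion. This is schematic, assumed structure rather than the talk's actual implementation: find_dense_clusters and make_cap_hash are pluggable subroutines (e.g., the quadratic cluster finder after the "Fast preprocessing" slide and a per-node variant of the cap-carving sketch above), and max_depth is only a safety cutoff for the sketch.

```python
def build_tree(points, radius, eps, find_dense_clusters, make_cap_hash,
               depth=0, max_depth=50):
    """One node of the data-dependent hash tree (schematic).

    find_dense_clusters(points, radius) -> (clusters, rest), where each cluster
    is a ball of radius (1 - Omega(eps^2)) * radius holding many points;
    make_cap_hash(radius) -> a fresh random cap-carving hash for this node.
    """
    if len(points) <= 1 or depth >= max_depth:
        return {"leaf": list(points)}
    clusters, rest = find_dense_clusters(points, radius)
    node = {"clusters": [], "cap_hash": make_cap_hash(radius), "caps": {}}
    for cluster in clusters:
        # dense cluster: recurse with a geometrically smaller radius
        node["clusters"].append(build_tree(cluster, (1 - eps ** 2) * radius, eps,
                                           find_dense_clusters, make_cap_hash,
                                           depth + 1, max_depth))
    buckets = {}
    for p in rest:
        buckets.setdefault(node["cap_hash"](p), []).append(p)
    for label, bucket in buckets.items():
        # recurse on each cap; dense clusters may reappear inside it
        node["caps"][label] = build_tree(bucket, radius, eps,
                                         find_dense_clusters, make_cap_hash,
                                         depth + 1, max_depth)
    return node
```
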
Hash function

• Described by a tree (like a hash table)

[Figure: the recursion tree of hash tables, root at radius = 100cr; picture not to scale & dimension]

2βˆ’πœ– 𝑅
Dense clusters


Current dataset: radius 𝑅
A dense cluster:



After we remove all clusters:



Contains 𝑛1βˆ’π›Ώ points
Smaller radius: 1 βˆ’ Ξ© πœ– 2 𝑅
For any point on the surface, there are at most 𝑛1βˆ’π›Ώ points
within distance 2 βˆ’ πœ– 𝑅
πœ– trade-off
𝛿 trade-off
The other points are essentially orthogonal !
When applying Cap Carving with parameters (𝑃1 , 𝑃2 ):


+ 𝑛1βˆ’π›Ώ
Empirical number of far pts colliding with query: 𝑛𝑃2
As long as 𝑛𝑃2 ≫ 𝑛1βˆ’π›Ώ , the β€œimpurity” doesn’t matter!
?
Tree recap

• During a query:
  - recurse into all clusters
  - but into just one bucket in CapCarving
  - will look in more than one leaf!
• How much branching?
  - Claim: at most (n^δ + 1)^O(1/ε²)    (δ trade-off)
  - Each time we branch:
    - at most n^δ clusters (+1)
    - a cluster reduces the radius by Ω(ε²)
    - so the cluster-depth is at most 100/Ω(ε²)
• Progress in two ways:
  - clusters reduce the radius
  - CapCarving nodes reduce the number of far points (empirical P2)
• A tree succeeds with probability ≥ n^(−1/(2c²−1) − o(1))

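A query walk over the tree from the earlier sketch makes the ">1 leaf" point concrete: it descends into every dense-cluster child but only the single cap bucket the query hashes to, and the candidates collected at the leaves are then filtered by exact distance. This assumes the node dictionaries produced by the build_tree sketch above.

```python
def query_tree(node, q):
    """Collect candidate points for query q (schematic).

    Follows every dense-cluster child (hence more than one leaf may be visited),
    but only the single cap-carving bucket that q falls into.
    """
    if "leaf" in node:
        return list(node["leaf"])
    candidates = []
    for child in node["clusters"]:               # recurse into all clusters
        candidates.extend(query_tree(child, q))
    bucket = node["cap_hash"](q)                 # just one bucket in cap carving
    if bucket in node["caps"]:
        candidates.extend(query_tree(node["caps"][bucket], q))
    return candidates
```
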
Fast preprocessing

• How to find the dense clusters fast?
• Step 1: reduce to O(n²) time.
  - Enough to consider centers that are data points (see the sketch below)
• Step 2: reduce to near-linear time.
  - Down-sample!
  - OK because we want clusters of size n^(1−δ)
  - After downsampling by a factor of n^δ, a cluster is still somewhat heavy

Even more details

• Analysis:
  - Instead of working with the "probability of collision with a far point", P2, work with an "empirical estimate" (the actual number)
  - A little delicate: interplay with the "probability of collision with the close point", P1
    - the empirical P2 matters only for the bucket the query falls into
    - need to condition on collision with the close point
    - ⇒ 3-point collision analysis for LSH
• In dense clusters, points may appear inside the balls
  - whereas CapCarving works for points on the sphere
  - need to partition balls into thin shells (introduces more branching)

Data-dependent hashing: beyond LSH

• A reduction from worst-case to average-case
  - ⇒ NNS with query time n^ρ, where ρ = 1/(2c² − 1)
• Better bounds?
  - For dimension d = O(log n), can get better ρ! [Laarhoven'16]
    - just as for the closest pair problem [Alman-Williams'15]
  - For d > log^(1+δ) n: the above ρ is optimal even for data-dependent hashing! [A.-Razenshteyn'??]
    - in the right formalization (to rule out the Voronoi diagram): the description complexity of the hash function is n^(1−Ω(1))
• Better reduction?
  - NNS for ℓ∞:
    - [Indyk'98] gets approximation O(log log d) (polynomial space, sublinear query time)
    - Matching lower bounds in the relevant model [ACP'08, KP'12]
    - Can be thought of as data-dependent hashing
  - NNS for any norm (e.g., matrix norms, EMD)?

Lower bounds: an opportunity to discover new algorithmic techniques