Optimal Hashing for High Dimensions:
Beyond Locality-Sensitive Hashing

Alex Andoni (Columbia University)

Based on joint work with: Piotr Indyk, Thijs Laarhoven, Huy Nguyen, Sasho Nikolov, Ilya Razenshteyn, Ludwig Schmidt, Negev Shekel-Nosatzki, Erik Waingarten
Nearest Neighbor Search (NNS)

• Preprocess: a set P of points
• Query: given a query point q, report a point p* ∈ P with the smallest distance to q
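As a baseline for the definition above, a linear scan answers exact NNS in O(n·d) time; the goal of everything that follows is to beat it. A minimal sketch, using Hamming distance over hypothetical 6-bit points like the bit-vectors shown on the next slide:

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit-strings."""
    return sum(a != b for a, b in zip(p, q))

def nearest_neighbor(P, q):
    """Linear scan: return the point p* in P with smallest distance to q."""
    return min(P, key=lambda p: hamming(p, q))

# Hypothetical 6-bit points, echoing the bit-vectors in the slides.
P = ["000000", "011100", "010100", "000100", "110100", "111111"]
print(nearest_neighbor(P, "000101"))  # "000100" (distance 1)
```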
Motivation

• Generic setup:
  • Points model objects (e.g. images as bit-vectors: 000000, 011100, 010100, …)
  • Distance models a (dis)similarity measure
• Applications:
  • machine learning: the k-NN rule
  • in many other areas…
• Goal: sublinear query time with polynomial space
• Distances: Hamming, Euclidean, ℓ∞, edit distance, earthmover distance, etc.
• Core primitive for: closest pair, clustering, etc.
Near Neighbor Search

• r-near neighbor: given a query point q, report a point p′ ∈ P s.t. ‖p′ − q‖ ≤ cr
  • as long as there is some point within distance r
It's all about space decompositions

• Approximate space decompositions: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …
  • ~2^d factors
• Random decompositions == hashing: [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [Terasawa-Tanaka'07], [Kapralov'15], [Pagh'16], [Becker-Ducas-Gama-Laarhoven'16], [Christiani'16], [Ahle-Aumüller-Pagh'17], …
Locality Sensitive Hashing [IM'98]

• Hash: project on random bits!
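The [IM'98] Hamming-space scheme can be sketched directly: one hash function is a projection onto k random coordinates, so two points collide under it with probability (1 − dist/d)^k. A minimal sketch; the parameters here are illustrative, not the tuned values from the analysis:

```python
import random

def make_bit_sampling_hash(d, k, seed=None):
    """One LSH function for Hamming space: project onto k random coordinates."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in coords)

# Two strings at Hamming distance 1 collide with probability (1 - 1/d)^k.
d, k = 8, 3
x, y = "00110110", "00110111"
hits = sum(make_bit_sampling_hash(d, k, seed=t)(x) ==
           make_bit_sampling_hash(d, k, seed=t)(y) for t in range(10000))
print(hits / 10000)  # close to (7/8)**3 ≈ 0.67
```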
LSH until ~'14 (space n^(1+ρ), query time n^ρ):

• Hamming space:
  • upper bound: ρ = 1/c (ρ = 1/2 for c = 2) [IM'98]
  • lower bound: ρ ≥ 1/c [MNP'06, OWZ'11]
• Euclidean space:
  • upper bounds: ρ = 1/c (= 1/2 for c = 2) [IM'98, DIIM'04]; ρ ≈ 1/c² (= 1/4 for c = 2) [AI'06]
  • lower bound: ρ ≥ 1/c² [MNP'06, OWZ'11]
• Lower bounds (cell probe): [A.-Indyk-Patrascu'06, Panigrahy-Talwar-Wieder'08,'10, Kapralov-Panigrahy'12]
• Space-time trade-offs: [Panigrahy'06, A.-Indyk'06, Kapralov'15]
• Datasets with additional structure: [Clarkson'99, Karger-Ruhl'02, Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07, Dasgupta-Sinha'13, Abdullah-A.-Krauthgamer-Kannan'14, …]
Post-LSH (space n^(1+ρ), query time n^ρ):

• Hamming space:
  • LSH: ρ = 1/c (= 1/2 for c = 2) [IM'98]; lower bound ρ ≥ 1/c [MNP'06, OWZ'11]
  • complicated: ρ = 1/2 − ε for c = 2 [AINR'14]
  • ρ ≈ 1/(2c − 1) (ρ = 1/3 for c = 2) [AR'15]
• Euclidean space:
  • LSH: ρ ≈ 1/c² (= 1/4 for c = 2) [AI'06]; lower bound ρ ≥ 1/c² [MNP'06, OWZ'11]
  • complicated: ρ = 1/4 − ε for c = 2 [AINR'14]
  • ρ ≈ 1/(2c² − 1) (ρ = 1/7 for c = 2) [AR'15]

Random hashing is not optimal!
Data-dependent hashing

• A random space partition, chosen after seeing the given dataset
• Efficient…
Optimal hashing* algorithm for the time-space trade-off [A-Laarhoven-Razenshteyn-Waingarten'17]

• Algorithm: achieves c-approximate NNS with:
  • Space: n^(1+ρ_u+o(1))
  • Query time: n^(ρ_q+o(1))
  • as long as c²√ρ_q + (c² − 1)√ρ_u = √(2c² − 1)
• For c = 2 this gives, e.g.:
  • ρ_u = (2c² − 1)/(c² − 1)²: n^1.77… space with n^o(1) query time
  • ρ_u = 0: n^(1+o(1)) space with n^0.43… query time
  • best LSH [A-Indyk'06]: n^1.14… space with n^0.14… query time
• Proof plan:
  • Basic algorithm (= LSH)
  • Data-dependent hashing
  • Trading off space & time
  • Lower bounds: optimal if hashing*; unconditional for small query time
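The trade-off curve is easy to evaluate numerically. A small sketch, assuming the [ALRW'17] form c²√ρ_q + (c² − 1)√ρ_u = √(2c² − 1), which recovers the c = 2 exponents quoted on this slide:

```python
import math

def rho_u(rho_q, c):
    """Space exponent rho_u paired with query exponent rho_q on the curve
    c^2 * sqrt(rho_q) + (c^2 - 1) * sqrt(rho_u) = sqrt(2c^2 - 1)."""
    rest = math.sqrt(2 * c * c - 1) - c * c * math.sqrt(rho_q)
    return (rest / (c * c - 1)) ** 2

c = 2
print(rho_u(0, c))                    # 7/9 ≈ 0.778: n^1.77... space, n^o(1) query time
print(rho_u(7 / 16, c))               # ≈ 0: near-linear space at query time n^0.43...
print(rho_u(1 / (2 * c * c - 1), c))  # 1/7 ≈ 0.143: the balanced LSH point
```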
Setup for algorithm

• All points (dataset, query) are on the surface of a unit sphere
  • Wlog: look at the dataset from far away
• Dimension: d = log^1.1 n
  • Wlog: [Johnson-Lindenstrauss'84]
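The two "wlog" steps above can be mimicked in code: a random Gaussian projection to a small dimension, followed by renormalizing each image onto the unit sphere. A rough sketch only, which ignores the distortion bookkeeping that [Johnson-Lindenstrauss'84] makes precise:

```python
import math
import random

def on_sphere_in_low_dim(points, d_new, seed=0):
    """Project points to d_new dimensions with a random Gaussian matrix,
    then renormalize each image onto the unit sphere."""
    rng = random.Random(seed)
    d_old = len(points[0])
    G = [[rng.gauss(0, 1) for _ in range(d_old)] for _ in range(d_new)]
    out = []
    for p in points:
        y = [sum(g * x for g, x in zip(row, p)) / math.sqrt(d_new) for row in G]
        norm = math.sqrt(sum(v * v for v in y))
        out.append([v / norm for v in y])
    return out

pts = on_sphere_in_low_dim([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], d_new=3)
print(all(abs(sum(v * v for v in p) - 1.0) < 1e-9 for p in pts))  # True
```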
Basic Algorithm (with oracle)

• Preprocessing:
  • Pick T random parts A_i
  • Form subsets P_i = P ∩ A_i
  • Store non-empty P_i's
• How to pick the A_i's?
  • Sample a Gaussian vector z_i
  • A_i = {p : ⟨p, z_i⟩ ≥ η}
  • T and η: parameters to be chosen later
    • η: s.t. Pr[far p ∈ A_i | q ∈ A_i] = 1/n
• Query:
  • Retrieve all the sets P_i s.t. q ∈ A_i (i.e., ⟨q, z_i⟩ ≥ η)
  • Inside the retrieved P_i's, search for a point within cr of q
• Space: n · (expected number of parts containing each point)
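A toy version of this preprocessing/query loop, with the oracle implemented by brute force over the T Gaussians; the real algorithm sets T and η per the analysis, while the values below are arbitrary:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def preprocess(P, T, eta, d, seed=0):
    """Sample T Gaussian vectors z_i; part A_i = {p : <p, z_i> >= eta}."""
    rng = random.Random(seed)
    zs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
    parts = [[p for p in P if dot(p, z) >= eta] for z in zs]
    return zs, parts

def query(q, zs, parts, eta):
    """Collect candidates from exactly the parts whose cap contains q."""
    return [p for z, part in zip(zs, parts) if dot(q, z) >= eta for p in part]

# Toy dataset on the unit circle.
P = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
zs, parts = preprocess(P, T=50, eta=0.5, d=2)
candidates = query([1.0, 0.0], zs, parts, eta=0.5)
```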
Analysis

• Key quantity: Pr[q ∈ A_i | p ∈ A_i] for a close pair p, q
• For a close pair, Pr[q ∈ A_i | p ∈ A_i] = Pr[p ∈ A_i | q ∈ A_i] = 1/n^ρ, where ρ(r, c) ≤ 1/c² + o(1)
• Matches best LSH [AI'06]: space n^(1+1/c²+o(1)), query time n^(1/c²+o(1))
Related partitions

• Similar to:
  • Locality Sensitive Filters [Becker, Ducas, Gama, Laarhoven 2016]
  • the cell-probe algorithm of [Panigrahy, Talwar, Wieder 2010]
• Related (or sometimes equivalent) partitioning procedures:
  • Approximation algorithms [Goemans, Williamson 1995], [Karger, Motwani, Sudan 1995], [Charikar, Chekuri, Goel, Guha, Plotkin 1998], [Chlamtac, Makarychev, Makarychev 2006], [Louis, Makarychev 2014]
  • Spectral graph partitioning [Lee, Oveis Gharan, Trevisan 2012], [Louis, Raghavendra, Tetali, Vempala 2012]
  • Spherical cubes [Kindler, O'Donnell, Rao, Wigderson 2008]
  • Metric embeddings [Fakcharoenphol, Rao, Talwar 2003], [Mendel, Naor 2005]
  • Communication complexity [Bogdanov, Mossel 2011], [Canonne, Guruswami, Meka, Sudan 2015]
How to implement the oracle?

• Goal: retrieve all the caps such that ⟨q, z_i⟩ ≥ η
• Idea: "gradual" partitioning, like in block coding
  • "cap A_i" = intersection of K caps
• Algorithm (on dataset P):
  • Sample T Gaussians z_1, z_2, …, z_T
  • Form P_i = {p ∈ P | ⟨p, z_i⟩ ≥ η′}
  • Recurse on the non-empty P_i's (e.g., P_1, P_2, P_3 for T = 3)
  • For K levels (figure: T = 3, K = 2)
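The gradual K-level recursion might look as follows: a leaf's cap is the intersection of the K caps on its root-to-leaf path, and a query descends only into caps that contain it. T, K, and η′ below are toy values, not the tuned ones:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build(P, T, K, eta, d, rng):
    """K-level gradual partitioning: each node samples T Gaussians and
    recurses on the non-empty filtered subsets P_i = {p : <p, z_i> >= eta}."""
    if K == 0 or not P:
        return ("leaf", P)
    children = []
    for _ in range(T):
        z = [rng.gauss(0, 1) for _ in range(d)]
        Pi = [p for p in P if dot(p, z) >= eta]
        if Pi:  # keep non-empty parts only
            children.append((z, build(Pi, T, K - 1, eta, d, rng)))
    return ("node", children)

def search(node, q, eta):
    """Descend only into caps that contain q; return all points reached."""
    kind, payload = node
    if kind == "leaf":
        return list(payload)
    return [p for z, child in payload if dot(q, z) >= eta
            for p in search(child, q, eta)]

rng = random.Random(1)
P = [[1.0, 0.0], [0.0, 1.0]]
tree = build(P, T=3, K=2, eta=0.1, d=2, rng=rng)
found = search(tree, [1.0, 0.0], eta=0.1)
```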
Data-dependent hashing [A-Razenshteyn'15]

• Two components:
  • Pseudo-random configuration: has better data-independent hashing
  • Reduction to such a configuration: data-dependent
Pseudo-random configuration

• Points on a unit sphere, where:
  • cr ≈ √2, i.e., a far pair is (near) orthogonal
    • this would be the distance if the dataset were random on the sphere
  • Close pair: r = √2/c
• Lemma (in the LSH regime): Cap Hashing gives ρ = 1/(2c² − 1)
Reduction to nice configuration

• Idea: iteratively make the pointset more isotropic by narrowing the ball around it
• Algorithm:
  • find dense clusters
    • with smaller radius
    • containing a large fraction of points
  • recurse on dense clusters
    • with radius reduced, e.g., from 100cr to 99cr
  • apply Cap Hashing on the rest
    • recurse on each "cap"
    • e.g., dense clusters might reappear
• Why ok?
  • no dense clusters ⇒ like a "random dataset" with radius = 100cr, or even better!
• *picture not to scale & dimension
Practice?

• Two components:
  • Nice geometric structure: spherical case
    • practical variant [A-Indyk-Laarhoven-Razenshteyn-Schmidt'15] (Ludwig's talk)
  • Reduction to such structure
    • harder: regularity lemma flavor
Partial progress: LSH Forest++ [A.-Razenshteyn-Shekel-Nosatzki'17]

• Hamming space: n^(1+ρ) space, n^ρ query time, ρ = 1/c [Indyk-Motwani'98]
• LSH Forest [Bawa-Condie-Ganesan'05]:
  • n^ρ random trees, of height k
  • each node: split by a random coordinate (figure: splits on coordinates 3, 2, 5, 7 for x = 00110110)
• LSH Forest++:
  • In a tree, split until O(1) bucket size
  • ++: store the t points closest to the dataset mean
    • the query is compared against these t points
• Theorem: obtains ρ ≈ 0.72/c for t, c large enough
  • = performance of IM'98 on random data
• Open: practical algorithm with optimal bounds?
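The tree-building part of LSH Forest can be sketched as below: split on random coordinates until buckets have O(1) size. The "++" modification (comparing the query against the t points nearest the dataset mean) is omitted here, and the guard restricting splits to coordinates where the bucket actually differs is my addition, to avoid looping on duplicate points:

```python
import random

def build_tree(P, rng, max_bucket=1):
    """One LSH Forest tree over bit-strings: split on a random coordinate
    until buckets have O(1) size."""
    if len(P) <= max_bucket:
        return ("leaf", P)
    d = len(P[0])
    # Only split on coordinates where the current bucket actually differs.
    splitting = [i for i in range(d) if len({p[i] for p in P}) > 1]
    if not splitting:
        return ("leaf", P)
    coord = rng.choice(splitting)
    zeros = [p for p in P if p[coord] == "0"]
    ones = [p for p in P if p[coord] == "1"]
    return ("node", coord, build_tree(zeros, rng, max_bucket),
            build_tree(ones, rng, max_bucket))

def query_tree(node, q):
    """Descend following q's bits; return the bucket reached."""
    while node[0] == "node":
        _, coord, zero_child, one_child = node
        node = zero_child if q[coord] == "0" else one_child
    return node[1]

rng = random.Random(7)
P = ["00110110", "00110111", "11001000", "10101010"]
forest = [build_tree(P, rng) for _ in range(5)]
candidates = {p for t in forest for p in query_tree(t, "00110100")}
```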
Time-space trade-offs

• So far, all algorithms:
  • n^(1+ρ) space
  • n^ρ query time
  • for some constant ρ
• Given a space budget, what's the best query time?
  • Studied in [KOR'98, Ind'02, Pan'06, Kap'15]
  • No linear-space algorithm for c = 1.5
Trade-off: e.g., smaller query time?

• Recall η: s.t. Pr[far p ∈ A_i | q ∈ A_i] = 1/n
• Query time: 1/Pr[close p ∈ A_i | q ∈ A_i]; can we increase this probability?
• Consider a point p captured by A_i iff ⟨p, z_i⟩ ≥ η_u, where η_u < η
• Need to match q to A_i only when ⟨q, z_i⟩ ≥ η_q > η
• Space is affected accordingly
Trade-off algo. (with oracle)

• Parameters: T, η_u and η_q
• Preprocessing:
  • Pick T random pairs of parts (A_i, B_i)
  • Form subsets P_i = P ∩ A_i
  • Store non-empty P_i's
• Picking the (A_i, B_i)'s?
  • Sample a Gaussian vector z_i
  • A_i = {p : ⟨p, z_i⟩ ≥ η_u} and B_i = {p : ⟨p, z_i⟩ ≥ η_q}
• Query:
  • Retrieve all the sets P_i such that q ∈ B_i
  • Inside the retrieved P_i's, search for a point within cr of q
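The effect of the two thresholds can be checked numerically: for unit vectors p, q at angle θ, the inner products ⟨p, z⟩ and ⟨q, z⟩ with a Gaussian z are jointly Gaussian with correlation cos θ, so Pr[p ∈ A_i | q ∈ B_i] can be estimated by Monte Carlo. A sketch; the angle and threshold values are chosen arbitrarily for illustration:

```python
import math
import random

def capture_prob(cos_theta, eta_u, eta_q, trials=200000, seed=0):
    """Monte-Carlo estimate of Pr[<p,z> >= eta_u | <q,z> >= eta_q] for unit
    vectors p, q with <p,q> = cos_theta and Gaussian z: the two inner
    products are jointly Gaussian with correlation cos_theta."""
    rng = random.Random(seed)
    s = math.sqrt(1 - cos_theta * cos_theta)
    captured = retrieved = 0
    for _ in range(trials):
        y = rng.gauss(0, 1)                      # <q, z>
        x = cos_theta * y + s * rng.gauss(0, 1)  # <p, z>
        if y >= eta_q:
            retrieved += 1
            if x >= eta_u:
                captured += 1
    return captured / retrieved

# Close pair at angle 45 degrees, with toy thresholds.
sym = capture_prob(math.cos(math.pi / 4), 1.0, 1.0)   # eta_u = eta_q (LSH regime)
asym = capture_prob(math.cos(math.pi / 4), 0.5, 1.0)  # eta_u < eta_q
# Lowering eta_u raises the capture probability: stored points are easier
# to find, at the cost of each point landing in more parts (more space).
```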
Overall Algorithm [ALRW'17]

• Can achieve c-approximate NNS with:
  • Query time: n^(ρ_q+o(1))
  • Space: n^(1+ρ_u+o(1))
• for ρ_q, ρ_u s.t. c²√ρ_q + (c² − 1)√ρ_u = √(2c² − 1)
• The LSH regime corresponds to ρ_u = ρ_q; the data-dependent curve improves on the data-independent one
Lower Bounds

• Trade-off is tight! [ALRW'17, Christiani'17]
  • Since caps are near-optimal [O'Donnell'14]
• Cell-probe lower bounds:
  • 1 cell probe: tight space!
    • Using [Panigrahy-Talwar-Wieder'10]
  • 2 cell probes: also!
    • Using Locally-Decodable Codes lower bounds [Kerenidis, de Wolf '03]
    • first such lower bound for static data structures
• Lower bounds for data-dependent hashing?
  • Need to restrict computational power of hashes?
  • [A-Razenshteyn'16]: restrict the description of the hash function
    • in the LSH regime only (ρ_u = ρ_q)
    • (likely extends to the entire trade-off)
Looking forward

• Data-dependent hashing: reduction in other settings?
  • E.g., closest pair can be solved much faster in the random case [Valiant'12, Karppa-Kaski-Kohonen'16]
• Go "beyond worst case"?
  • Instance-optimal partitions?
• Space partitions for harder metrics?
  • Classic hard distance: ℓ∞
    • No non-trivial sketch
    • But: [Indyk'98] gets efficient NNS with approximation O(log log d)
    • Tight approximation in the relevant models [ACP'08, PTW'10, KP'12]
    • Can be thought of as data-dependent hashing!
  • Other distances: EMD, edit distance, matrix norms, etc.?
NNS for any symmetric norm [A-Nguyen-Nikolov-Razenshteyn-Waingarten'17]

• Thm: for any symmetric norm (invariant under permutation of coordinates) with d < n^o(1), can get:
  • (log log n)^O(1) approximation
  • n^(1+o(1)) space
  • n^o(1) query time
  • (Ilya's talk)
• General norms? Is there an efficient ANN algorithm for general high-dimensional norms with approximation d^o(1)?
  • Conjectured lower bound if there is a family of spectral expanders that embeds with distortion O(1) into some log^O(1) n-dimensional norm, where n is the number of nodes [Naor '17]