Optimal Hashing for High Dimensions:
Beyond Locality-Sensitive Hashing

Alex Andoni (Columbia University)

Based on joint work with: Piotr Indyk, Thijs Laarhoven, Huy Nguyen, Sasho Nikolov, Ilya Razenshteyn, Ludwig Schmidt, Negev Shekel-Nosatzki, Erik Waingarten
Nearest Neighbor Search (NNS)

• Preprocess: a set P of points
• Query: given a query point q, report a point p* ∈ P with the smallest distance to q
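As a baseline for the definition above, a linear scan answers exact NNS in O(n·d) time; the goal of everything that follows is to beat it. A minimal sketch, using Hamming distance over hypothetical 6-bit points like the bit-vectors shown on the next slide:

```python
def hamming(p, q):
    """Hamming distance between two equal-length bit-strings."""
    return sum(a != b for a, b in zip(p, q))

def nearest_neighbor(P, q):
    """Linear scan: return the point p* in P with smallest distance to q."""
    return min(P, key=lambda p: hamming(p, q))

# Hypothetical 6-bit points, echoing the bit-vectors in the slides.
P = ["000000", "011100", "010100", "000100", "110100", "111111"]
print(nearest_neighbor(P, "000101"))  # "000100" (distance 1)
```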
Motivation

• Generic setup:
  • Points model objects (e.g. images as bit-vectors: 000000, 011100, 010100, …)
  • Distance models a (dis)similarity measure
• Applications:
  • machine learning: the k-NN rule
  • in many other areas…
• Goal: sublinear query time with polynomial space
• Distances: Hamming, Euclidean, ℓ∞, edit distance, earthmover distance, etc.
• Core primitive for: closest pair, clustering, etc.
Near Neighbor Search

• r-near neighbor: given a query point q, report a point p′ ∈ P s.t. ‖p′ − q‖ ≤ cr
  • as long as there is some point within distance r
It's all about space decompositions

• Approximate space decompositions: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …
  • ~2^d factors
• Random decompositions == hashing: [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [Terasawa-Tanaka'07], [Kapralov'15], [Pagh'16], [Becker-Ducas-Gama-Laarhoven'16], [Christiani'16], [Ahle-Aumüller-Pagh'17], …
Locality Sensitive Hashing [IM'98]

• Hash: project on random bits!
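The [IM'98] Hamming-space scheme can be sketched directly: one hash function is a projection onto k random coordinates, so two points collide under it with probability (1 − dist/d)^k. A minimal sketch; the parameters here are illustrative, not the tuned values from the analysis:

```python
import random

def make_bit_sampling_hash(d, k, seed=None):
    """One LSH function for Hamming space: project onto k random coordinates."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in coords)

# Two strings at Hamming distance 1 collide with probability (1 - 1/d)^k.
d, k = 8, 3
x, y = "00110110", "00110111"
hits = sum(make_bit_sampling_hash(d, k, seed=t)(x) ==
           make_bit_sampling_hash(d, k, seed=t)(y) for t in range(10000))
print(hits / 10000)  # close to (7/8)**3 ≈ 0.67
```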
LSH until ~'14 (space n^(1+ρ), query time n^ρ):

• Hamming space:
  • upper bound: ρ = 1/c (ρ = 1/2 for c = 2) [IM'98]
  • lower bound: ρ ≥ 1/c [MNP'06, OWZ'11]
• Euclidean space:
  • upper bounds: ρ = 1/c (= 1/2 for c = 2) [IM'98, DIIM'04]; ρ ≈ 1/c² (= 1/4 for c = 2) [AI'06]
  • lower bound: ρ ≥ 1/c² [MNP'06, OWZ'11]
• Lower bounds (cell probe): [A.-Indyk-Patrascu'06, Panigrahy-Talwar-Wieder'08,'10, Kapralov-Panigrahy'12]
• Space-time trade-offs: [Panigrahy'06, A.-Indyk'06, Kapralov'15]
• Datasets with additional structure: [Clarkson'99, Karger-Ruhl'02, Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07, Dasgupta-Sinha'13, Abdullah-A.-Krauthgamer-Kannan'14, …]
Post-LSH (space n^(1+ρ), query time n^ρ):

• Hamming space:
  • LSH: ρ = 1/c (= 1/2 for c = 2) [IM'98]; lower bound ρ ≥ 1/c [MNP'06, OWZ'11]
  • complicated: ρ = 1/2 − ε for c = 2 [AINR'14]
  • ρ ≈ 1/(2c − 1) (ρ = 1/3 for c = 2) [AR'15]
• Euclidean space:
  • LSH: ρ ≈ 1/c² (= 1/4 for c = 2) [AI'06]; lower bound ρ ≥ 1/c² [MNP'06, OWZ'11]
  • complicated: ρ = 1/4 − ε for c = 2 [AINR'14]
  • ρ ≈ 1/(2c² − 1) (ρ = 1/7 for c = 2) [AR'15]

Random hashing is not optimal!
Data-dependent hashing

• A random space partition, chosen after seeing the given dataset
• Efficient…
Optimal hashing* algorithm for the time-space trade-off [A-Laarhoven-Razenshteyn-Waingarten'17]

• Algorithm: achieves c-approximate NNS with:
  • Space: n^(1+ρ_u+o(1))
  • Query time: n^(ρ_q+o(1))
  • as long as c²√ρ_q + (c² − 1)√ρ_u = √(2c² − 1)
• For c = 2 this gives, e.g.:
  • ρ_u = (2c² − 1)/(c² − 1)²: n^1.77… space with n^o(1) query time
  • ρ_u = 0: n^(1+o(1)) space with n^0.43… query time
  • best LSH [A-Indyk'06]: n^1.14… space with n^0.14… query time
• Proof plan:
  • Basic algorithm (= LSH)
  • Data-dependent hashing
  • Trading off space & time
  • Lower bounds: optimal if hashing*; unconditional for small query time
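The trade-off curve is easy to evaluate numerically. A small sketch, assuming the [ALRW'17] form c²√ρ_q + (c² − 1)√ρ_u = √(2c² − 1), which recovers the c = 2 exponents quoted on this slide:

```python
import math

def rho_u(rho_q, c):
    """Space exponent rho_u paired with query exponent rho_q on the curve
    c^2 * sqrt(rho_q) + (c^2 - 1) * sqrt(rho_u) = sqrt(2c^2 - 1)."""
    rest = math.sqrt(2 * c * c - 1) - c * c * math.sqrt(rho_q)
    return (rest / (c * c - 1)) ** 2

c = 2
print(rho_u(0, c))                    # 7/9 ≈ 0.778: n^1.77... space, n^o(1) query time
print(rho_u(7 / 16, c))               # ≈ 0: near-linear space at query time n^0.43...
print(rho_u(1 / (2 * c * c - 1), c))  # 1/7 ≈ 0.143: the balanced LSH point
```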
Setup for algorithm

• All points (dataset, query) are on the surface of a unit sphere
  • Wlog: look at the dataset from far away
• Dimension: d = log^1.1 n
  • Wlog: [Johnson-Lindenstrauss'84]
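The two "wlog" steps above can be mimicked in code: a random Gaussian projection to a small dimension, followed by renormalizing each image onto the unit sphere. A rough sketch only, which ignores the distortion bookkeeping that [Johnson-Lindenstrauss'84] makes precise:

```python
import math
import random

def on_sphere_in_low_dim(points, d_new, seed=0):
    """Project points to d_new dimensions with a random Gaussian matrix,
    then renormalize each image onto the unit sphere."""
    rng = random.Random(seed)
    d_old = len(points[0])
    G = [[rng.gauss(0, 1) for _ in range(d_old)] for _ in range(d_new)]
    out = []
    for p in points:
        y = [sum(g * x for g, x in zip(row, p)) / math.sqrt(d_new) for row in G]
        norm = math.sqrt(sum(v * v for v in y))
        out.append([v / norm for v in y])
    return out

pts = on_sphere_in_low_dim([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], d_new=3)
print(all(abs(sum(v * v for v in p) - 1.0) < 1e-9 for p in pts))  # True
```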
Basic Algorithm (with oracle)

• Preprocessing:
  • Pick T random parts A_i
  • Form subsets P_i = P ∩ A_i
  • Store non-empty P_i's
• How to pick the A_i's?
  • Sample a Gaussian vector z_i
  • A_i = {p : ⟨p, z_i⟩ ≥ η}
  • T and η: parameters to be chosen later
    • η: s.t. Pr[far p ∈ A_i | q ∈ A_i] = 1/n
• Query:
  • Retrieve all the sets P_i s.t. q ∈ A_i (i.e., ⟨q, z_i⟩ ≥ η)
  • Inside the retrieved P_i's, search for a point within cr of q
• Space: n · (expected number of parts containing each point)
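A toy version of this preprocessing/query loop, with the oracle implemented by brute force over the T Gaussians; the real algorithm sets T and η per the analysis, while the values below are arbitrary:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def preprocess(P, T, eta, d, seed=0):
    """Sample T Gaussian vectors z_i; part A_i = {p : <p, z_i> >= eta}."""
    rng = random.Random(seed)
    zs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
    parts = [[p for p in P if dot(p, z) >= eta] for z in zs]
    return zs, parts

def query(q, zs, parts, eta):
    """Collect candidates from exactly the parts whose cap contains q."""
    return [p for z, part in zip(zs, parts) if dot(q, z) >= eta for p in part]

# Toy dataset on the unit circle.
P = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
zs, parts = preprocess(P, T=50, eta=0.5, d=2)
candidates = query([1.0, 0.0], zs, parts, eta=0.5)
```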
Analysis

• Key quantity: Pr[q ∈ A_i | p ∈ A_i] for a close pair p, q
• For a close pair, Pr[q ∈ A_i | p ∈ A_i] = Pr[p ∈ A_i | q ∈ A_i] = 1/n^ρ, where ρ(r, c) ≤ 1/c² + o(1)
• Matches best LSH [AI'06]: space n^(1+1/c²+o(1)), query time n^(1/c²+o(1))
Related partitions

• Similar to:
  • Locality Sensitive Filters [Becker, Ducas, Gama, Laarhoven 2016]
  • the cell-probe algorithm of [Panigrahy, Talwar, Wieder 2010]
• Related (or sometimes equivalent) partitioning procedures:
  • Approximation algorithms [Goemans, Williamson 1995], [Karger, Motwani, Sudan 1995], [Charikar, Chekuri, Goel, Guha, Plotkin 1998], [Chlamtac, Makarychev, Makarychev 2006], [Louis, Makarychev 2014]
  • Spectral graph partitioning [Lee, Oveis Gharan, Trevisan 2012], [Louis, Raghavendra, Tetali, Vempala 2012]
  • Spherical cubes [Kindler, O'Donnell, Rao, Wigderson 2008]
  • Metric embeddings [Fakcharoenphol, Rao, Talwar 2003], [Mendel, Naor 2005]
  • Communication complexity [Bogdanov, Mossel 2011], [Canonne, Guruswami, Meka, Sudan 2015]
How to implement the oracle?

• Goal: retrieve all the caps such that ⟨q, z_i⟩ ≥ η
• Idea: "gradual" partitioning, like in block coding
  • "cap A_i" = intersection of K caps
• Algorithm (on dataset P):
  • Sample T Gaussians z_1, z_2, …, z_T
  • Form P_i = {p ∈ P | ⟨p, z_i⟩ ≥ η′}
  • Recurse on the non-empty P_i's (e.g., P_1, P_2, P_3 for T = 3)
  • For K levels (figure: T = 3, K = 2)
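The gradual K-level recursion might look as follows: a leaf's cap is the intersection of the K caps on its root-to-leaf path, and a query descends only into caps that contain it. T, K, and η′ below are toy values, not the tuned ones:

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build(P, T, K, eta, d, rng):
    """K-level gradual partitioning: each node samples T Gaussians and
    recurses on the non-empty filtered subsets P_i = {p : <p, z_i> >= eta}."""
    if K == 0 or not P:
        return ("leaf", P)
    children = []
    for _ in range(T):
        z = [rng.gauss(0, 1) for _ in range(d)]
        Pi = [p for p in P if dot(p, z) >= eta]
        if Pi:  # keep non-empty parts only
            children.append((z, build(Pi, T, K - 1, eta, d, rng)))
    return ("node", children)

def search(node, q, eta):
    """Descend only into caps that contain q; return all points reached."""
    kind, payload = node
    if kind == "leaf":
        return list(payload)
    return [p for z, child in payload if dot(q, z) >= eta
            for p in search(child, q, eta)]

rng = random.Random(1)
P = [[1.0, 0.0], [0.0, 1.0]]
tree = build(P, T=3, K=2, eta=0.1, d=2, rng=rng)
found = search(tree, [1.0, 0.0], eta=0.1)
```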
Data-dependent hashing [A-Razenshteyn'15]

• Two components:
  • Pseudo-random configuration: has better data-independent hashing
  • Reduction to such a configuration: data-dependent
Pseudo-random configuration

• Points on a unit sphere, where:
  • cr ≈ √2, i.e., a far pair is (near) orthogonal
    • this would be the distance if the dataset were random on the sphere
  • Close pair: r = √2/c
• Lemma (in the LSH regime): Cap Hashing gives ρ = 1/(2c² − 1)
Reduction to nice configuration

• Idea: iteratively make the pointset more isotropic by narrowing the ball around it
• Algorithm:
  • find dense clusters
    • with smaller radius
    • containing a large fraction of points
  • recurse on dense clusters
    • with radius reduced, e.g., from 100cr to 99cr
  • apply Cap Hashing on the rest
    • recurse on each "cap"
    • e.g., dense clusters might reappear
• Why ok?
  • no dense clusters ⇒ like a "random dataset" with radius = 100cr, or even better!
• *picture not to scale & dimension
Practice?

• Two components:
  • Nice geometric structure: spherical case
    • practical variant [A-Indyk-Laarhoven-Razenshteyn-Schmidt'15] (Ludwig's talk)
  • Reduction to such structure
    • harder: regularity lemma flavor
Partial progress: LSH Forest++ [A.-Razenshteyn-Shekel-Nosatzki'17]

• Hamming space: n^(1+ρ) space, n^ρ query time, ρ = 1/c [Indyk-Motwani'98]
• LSH Forest [Bawa-Condie-Ganesan'05]:
  • n^ρ random trees, of height k
  • each node: split by a random coordinate (figure: splits on coordinates 3, 2, 5, 7 for x = 00110110)
• LSH Forest++:
  • In a tree, split until O(1) bucket size
  • ++: store the t points closest to the dataset mean
    • the query is compared against these t points
• Theorem: obtains ρ ≈ 0.72/c for t, c large enough
  • = performance of IM'98 on random data
• Open: practical algorithm with optimal bounds?
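The tree-building part of LSH Forest can be sketched as below: split on random coordinates until buckets have O(1) size. The "++" modification (comparing the query against the t points nearest the dataset mean) is omitted here, and the guard restricting splits to coordinates where the bucket actually differs is my addition, to avoid looping on duplicate points:

```python
import random

def build_tree(P, rng, max_bucket=1):
    """One LSH Forest tree over bit-strings: split on a random coordinate
    until buckets have O(1) size."""
    if len(P) <= max_bucket:
        return ("leaf", P)
    d = len(P[0])
    # Only split on coordinates where the current bucket actually differs.
    splitting = [i for i in range(d) if len({p[i] for p in P}) > 1]
    if not splitting:
        return ("leaf", P)
    coord = rng.choice(splitting)
    zeros = [p for p in P if p[coord] == "0"]
    ones = [p for p in P if p[coord] == "1"]
    return ("node", coord, build_tree(zeros, rng, max_bucket),
            build_tree(ones, rng, max_bucket))

def query_tree(node, q):
    """Descend following q's bits; return the bucket reached."""
    while node[0] == "node":
        _, coord, zero_child, one_child = node
        node = zero_child if q[coord] == "0" else one_child
    return node[1]

rng = random.Random(7)
P = ["00110110", "00110111", "11001000", "10101010"]
forest = [build_tree(P, rng) for _ in range(5)]
candidates = {p for t in forest for p in query_tree(t, "00110100")}
```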
Time-space trade-offs

• So far, all algorithms:
  • n^(1+ρ) space
  • n^ρ query time
  • for some constant ρ
• Given a space budget, what's the best query time?
  • Studied in [KOR'98, Ind'02, Pan'06, Kap'15]
  • No linear-space algorithm for c = 1.5
Trade-off: e.g., smaller query time?

• Recall η: s.t. Pr[far p ∈ A_i | q ∈ A_i] = 1/n
• Query time: 1/Pr[close p ∈ A_i | q ∈ A_i]; can we increase this probability?
• Consider a point p captured by A_i iff ⟨p, z_i⟩ ≥ η_u, where η_u < η
• Need to match q to A_i only when ⟨q, z_i⟩ ≥ η_q > η
• Space is affected accordingly
Trade-off algo. (with oracle)

• Parameters: T, η_u and η_q
• Preprocessing:
  • Pick T random pairs of parts (A_i, B_i)
  • Form subsets P_i = P ∩ A_i
  • Store non-empty P_i's
• Picking the (A_i, B_i)'s?
  • Sample a Gaussian vector z_i
  • A_i = {p : ⟨p, z_i⟩ ≥ η_u} and B_i = {p : ⟨p, z_i⟩ ≥ η_q}
• Query:
  • Retrieve all the sets P_i such that q ∈ B_i
  • Inside the retrieved P_i's, search for a point within cr of q
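The effect of the two thresholds can be checked numerically: for unit vectors p, q at angle θ, the inner products ⟨p, z⟩ and ⟨q, z⟩ with a Gaussian z are jointly Gaussian with correlation cos θ, so Pr[p ∈ A_i | q ∈ B_i] can be estimated by Monte Carlo. A sketch; the angle and threshold values are chosen arbitrarily for illustration:

```python
import math
import random

def capture_prob(cos_theta, eta_u, eta_q, trials=200000, seed=0):
    """Monte-Carlo estimate of Pr[<p,z> >= eta_u | <q,z> >= eta_q] for unit
    vectors p, q with <p,q> = cos_theta and Gaussian z: the two inner
    products are jointly Gaussian with correlation cos_theta."""
    rng = random.Random(seed)
    s = math.sqrt(1 - cos_theta * cos_theta)
    captured = retrieved = 0
    for _ in range(trials):
        y = rng.gauss(0, 1)                      # <q, z>
        x = cos_theta * y + s * rng.gauss(0, 1)  # <p, z>
        if y >= eta_q:
            retrieved += 1
            if x >= eta_u:
                captured += 1
    return captured / retrieved

# Close pair at angle 45 degrees, with toy thresholds.
sym = capture_prob(math.cos(math.pi / 4), 1.0, 1.0)   # eta_u = eta_q (LSH regime)
asym = capture_prob(math.cos(math.pi / 4), 0.5, 1.0)  # eta_u < eta_q
# Lowering eta_u raises the capture probability: stored points are easier
# to find, at the cost of each point landing in more parts (more space).
```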
Overall Algorithm [ALRW'17]

• Can achieve c-approximate NNS with:
  • Query time: n^(ρ_q+o(1))
  • Space: n^(1+ρ_u+o(1))
• for ρ_q, ρ_u s.t. c²√ρ_q + (c² − 1)√ρ_u = √(2c² − 1)
• The LSH regime corresponds to ρ_u = ρ_q; the data-dependent curve improves on the data-independent one
Lower Bounds

• Trade-off is tight! [ALRW'17, Christiani'17]
  • Since caps are near-optimal [O'Donnell'14]
• Cell-probe lower bounds:
  • 1 cell probe: tight space!
    • Using [Panigrahy-Talwar-Wieder'10]
  • 2 cell probes: also!
    • Using Locally-Decodable Codes lower bounds [Kerenidis, de Wolf '03]
    • first such lower bound for static data structures
• Lower bounds for data-dependent hashing?
  • Need to restrict computational power of hashes?
  • [A-Razenshteyn'16]: restrict the description of the hash function
    • in the LSH regime only (ρ_u = ρ_q)
    • (likely extends to the entire trade-off)
Looking forward

• Data-dependent hashing: reduction in other settings?
  • E.g., closest pair can be solved much faster in the random case [Valiant'12, Karppa-Kaski-Kohonen'16]
• Go "beyond worst case"?
  • Instance-optimal partitions?
• Space partitions for harder metrics?
  • Classic hard distance: ℓ∞
    • No non-trivial sketch
    • But: [Indyk'98] gets efficient NNS with approximation O(log log d)
    • Tight approximation in the relevant models [ACP'08, PTW'10, KP'12]
    • Can be thought of as data-dependent hashing!
  • Other distances: EMD, edit distance, matrix norms, etc.?
NNS for any symmetric norm [A-Nguyen-Nikolov-Razenshteyn-Waingarten'17]

• Thm: for any symmetric norm (invariant under permutation of coordinates) with d < n^o(1), can get:
  • (log log n)^O(1) approximation
  • n^(1+o(1)) space
  • n^o(1) query time
  • (Ilya's talk)
• General norms? Is there an efficient ANN algorithm for general high-dimensional norms with approximation d^o(1)?
  • Conjectured lower bound if there is a family of spectral expanders that embeds with distortion O(1) into some log^O(1) n-dimensional norm, where n is the number of nodes [Naor '17]