
Sketching in Adversarial Environments
Or
Sublinearity and Cryptography
Moni Naor
Joint work with: Ilya Mironov and Gil Segev
1
Comparing Streams
• How to compare data streams without storing them?
[Figure: two data streams S_A and S_B]
Step 1: Compress the data on-line into sketches
Step 2: Interact using only the sketches
Goal: Minimize sketch size, update time, and communication
2
Comparing Streams
• How to compare data streams that cannot be stored?
$ Shared randomness $
• Real-life applications: massive data sets, on-line data, ...
• Highly efficient solutions assuming shared randomness
3
Comparing Streams
• How to compare data streams that cannot be stored?
$ Shared randomness $
• Is shared randomness a reasonable assumption?
• Example: plagiarism detection
• No guarantees when the inputs are set adversarially
• Inputs may be adversarially chosen depending on the randomness
4
The Adversarial Sketch Model
"Adversarial" factors (from communication complexity):
• No secrets
• Adversarially-chosen inputs
Massive data sets:
• Sketching, streaming
[Diagram: the adversarial sketch model combines the two settings]
5
The Adversarial Sketch Model
• Goal: Compute f(A,B)
• Sketch phase (small sketches, fast updates):
  – An adversary chooses the inputs of the parties
  – Inputs are provided as on-line sequences of insert and delete operations
  – No shared secrets
  – The parties are not allowed to communicate
  – Any public information is known to the adversary in advance
  – The adversary is computationally all-powerful
• Interaction phase (low communication & computation)
6
Our Results
• Equality testing
  – A, B ⊆ [N] of size at most K
  – Error probability ε
  – If we had public randomness…
    – Sketches of size O(log(1/ε))
    – Similar update time, communication and computation
Lower Bound: Equality testing in the adversarial sketch model requires
sketches of size Ω((K·log(N/K))^(1/2))
7
Our Results
• Equality testing
  – A, B ⊆ [N] of size at most K
  – Error probability ε
Lower Bound: Equality testing in the adversarial sketch model requires
sketches of size Ω((K·log(N/K))^(1/2))
Upper Bound: Explicit and efficient protocol:
  – Sketches of size (K·polylog(N)·log(1/ε))^(1/2)
  – Update time, communication and computation polylog(N)
8
Our Results
• Symmetric difference approximation
  – A, B ⊆ [N] of size at most K
  – Goal: approximate |A Δ B| with error probability ε
Upper Bound:
  – (1 + ρ)-approximation for any constant ρ
  – Sketches of size (K·polylog(N)·log(1/ε))^(1/2)
  – Update time, communication and computation polylog(N)
  – Explicit construction: polylog(N)-approximation
9
Outline
• Lower bound
• Equality testing
  – Main tool: Incremental encoding
  – Explicit construction using dispersers
• Symmetric difference approximation
• Summary & open problems
10
Simultaneous Messages Model
[Figure: two parties hold inputs x and y; each sends a single message to a referee, who outputs f(x,y)]
11
Simultaneous Messages Model
[Figure: the simultaneous messages setting with inputs x and y; in the adversarial sketch model the sketches play the role of the messages]
Lower Bound: Equality testing in the private-coin SM model requires
communication Ω((K·log(N/K))^(1/2)) [NS96, BK97]
12
Outline
• Lower bound
• Equality testing
  – Main tool: Incremental encoding
  – Explicit construction using dispersers
• Symmetric difference approximation
• Summary & open problems
Summary & open problems
13
Simultaneous Equality Testing
[Figure: inputs x and y of length K are encoded as C(x) and C(y), each arranged as a K^(1/2) × K^(1/2) matrix]
Communication: K^(1/2)
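A minimal simulation of this style of protocol, assuming a Reed-Solomon-like encoding (the concrete code, field, and parameters below are our assumptions, not from the slides): one party sends a random row of its encoded matrix, the other a random column, and the referee compares the crossing entry.

```python
import random

P = 1_000_003   # prime modulus for the evaluations (assumption)

def encode(coeffs, side):
    # Reed-Solomon-style encoding: view the input as polynomial coefficients over GF(P)
    # and evaluate at side*side points, arranged as a side x side matrix.
    m = side * side
    evals = [sum(c * pow(a, j, P) for j, c in enumerate(coeffs)) % P for a in range(m)]
    return [evals[r * side:(r + 1) * side] for r in range(side)]

def referee(row_idx, row, col_idx, col):
    # Compare the single entry where the sent row and column cross.
    return row[col_idx] == col[row_idx]

side = 4                                                      # matrices of side K^(1/2)
x = [random.randrange(P) for _ in range(side * side // 2)]    # degree < m/2: relative distance > 1/2
y = list(x); y[0] = (y[0] + 1) % P                            # a different input
Cx, Cy = encode(x, side), encode(y, side)

r, c = random.randrange(side), random.randrange(side)
alice_msg = (r, Cx[r])                                        # one row: K^(1/2) entries
bob_msg = (c, [Cy[i][c] for i in range(side)])                # one column: K^(1/2) entries
print(referee(alice_msg[0], alice_msg[1], bob_msg[0], bob_msg[1]))  # usually False here
```

Equal inputs always make the referee accept; distinct inputs are caught with constant probability, since two distinct low-degree polynomials disagree on most evaluation points.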
14
First Attempt
[Figure: each sketch keeps part of the codeword, e.g., row 3 of C(A) and column 2 of C(B), compared at the crossing entry C(B)_{3,2}]
Sketches of size K^(1/2)
Problem: update time K^(1/2)
15
Incrementality vs. Distance
• Incrementality:
  Given C(S) and x ∈ [N], the encodings of S ∪ {x} and S \ {x} are
  obtained by modifying very few entries (logarithmically many)
• High distance:
  For every distinct A, B ⊆ [N] of size at most K, d(C(A),C(B)) > 1 − ε (constant)
• Impossible to achieve both properties simultaneously with the
  Hamming distance
16
Incremental Encoding
S  C(S)1, ... , C(S)r
d(C(A),C(B)) = 1 


r
{1 – dH(C(A)i,C(B)i)}
i=1
Normalized
r=1: Hamming distance
Hamming distance
Hope: Larger r will enable fast updates
r corresponds to the communication complexity of our protocol
 Want to keep r as small as possible
Explicit construction with r = logK:
 Codeword size K¢polylog(N)
 Update time polylog(N)
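As a small illustration (plain Python, names assumed), the distance above can be computed directly from the r codeword pairs:

```python
# d(C(A), C(B)) = 1 - prod_i (1 - d_H(C(A)_i, C(B)_i)), where d_H is the
# normalized Hamming distance of the i-th codeword pair.
def normalized_hamming(u, v):
    return sum(a != b for a, b in zip(u, v)) / len(u)

def distance(CA, CB):
    # CA and CB are lists of r codewords (equal-length sequences)
    prod = 1.0
    for u, v in zip(CA, CB):
        prod *= 1.0 - normalized_hamming(u, v)
    return 1.0 - prod

# r = 2 codeword pairs: the first differs on half its entries, the second on all of them.
print(distance([[0, 1, 2, 3], [5, 5]], [[0, 9, 2, 8], [6, 7]]))   # 1 - 0.5*0.0 = 1.0
```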
17
Equality Protocol
[Figure: the parties exchange rows (3,1,1) of C(A)_1, C(A)_2, C(A)_3 and columns (2,3,1) together with the corresponding values from C(B)_1, C(B)_2, C(B)_3]
Error probability:
∏_{i=1}^{r} [1 − d_H(C(A)_i, C(B)_i)] = 1 − d(C(A), C(B)) < ε
18
The Encoding
• Global encoding
  – Map each element to several entries of each codeword
  – Exploit "random-looking" graphs
• Local encoding
  – Resolve collisions separately in each entry
  – A simple solution when |A Δ B| is guaranteed to be small
19
The Local Encoding
• Suppose that |A Δ B| ≤ ℓ
20
Missing Number Puzzle
• Let S = {1,...,N} \ {i}: one number i is missing
• π – a random permutation over S: π(1), ..., π(N−1) arrives as a one-way stream
• Goal: determine the missing number i using O(log N) bits
• What if there are ℓ missing numbers?
  – Can it be done using O(ℓ·log N) bits? (see the sketch below)
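A minimal sketch of the standard solution (plain Python, not from the slides): maintain only the running sum, which fits in O(log N) bits.

```python
# The missing number is the difference between the full sum 1 + 2 + ... + N
# and the sum of the stream, so a single counter suffices.
def missing_number(stream, N):
    total = N * (N + 1) // 2          # sum of 1..N
    seen = 0
    for x in stream:                   # one pass over the stream
        seen += x
    return total - seen                # the single missing number i

# Example: N = 6, the number 4 is missing from the stream.
assert missing_number([3, 1, 6, 2, 5], 6) == 4

# For l missing numbers one can keep the first l power sums (sum of x, x^2, ..., x^l),
# using O(l * log N) bits, and solve for the missing elements afterwards.
```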
21
The Local Encoding
• Suppose that |A Δ B| ≤ ℓ
• A simple & well-known solution:
  – Associate each x ∈ [N] with v(x) such that for any distinct x_1,...,x_ℓ
    the vectors v(x_1),...,v(x_ℓ) are linearly independent
  – C(S) = Σ_{x∈S} v(x)
  – If 1 ≤ |A Δ B| ≤ ℓ then C(A) ≠ C(B)
  – For example v(x) = (1, x, ..., x^(ℓ−1))
  – Size & update time O(ℓ·log N), independent of the size of the sets
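A small illustration of this local encoding, assuming the Vandermonde choice v(x) = (1, x, ..., x^(ℓ−1)) over a prime field GF(p) with p > N (the concrete prime and parameters below are assumptions):

```python
P = 1_000_003                     # a prime larger than the universe size N (assumption)

def v(x, l):
    # Vandermonde vector (1, x, x^2, ..., x^(l-1)) over GF(P); any l of these,
    # for distinct x < P, are linearly independent.
    return [pow(x, j, P) for j in range(l)]

def update(sketch, x, sign, l):
    # Insert (sign = +1) or delete (sign = -1) element x in O(l) field operations.
    vx = v(x, l)
    for j in range(l):
        sketch[j] = (sketch[j] + sign * vx[j]) % P
    return sketch

l = 4
A, B = {7, 42, 99}, {7, 42, 100}           # |A Δ B| = 2 <= l
CA, CB = [0] * l, [0] * l
for x in A: update(CA, x, +1, l)
for x in B: update(CB, x, +1, l)
assert CA != CB                            # a small symmetric difference is detected
```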
22
The Global Encoding
• Each element is mapped into several entries of each codeword
• The content of each entry is locally encoded
[Figure: each element of the universe of size N is mapped to entries of the codewords C_1, C_2, C_3]
23
The Global Encoding
• Each element is mapped into several entries of each codeword
• The content of each entry is locally encoded
• The local guarantee:
  If 1 ≤ |C_i[y] ∩ (A Δ B)| ≤ ℓ then C(A) and C(B) differ on C_i[y]
[Figure: sets A and B in a universe of size N mapped into the codeword entries (here ℓ = 1); C(A) and C(B) differ at least on the entries, such as C_1[2], that see exactly one element of A Δ B]
24
The Global Encoding
• Identify each codeword with a bipartite graph G = ([N], R, E)
• For S ⊆ [N] define Γ(S,ℓ) ⊆ R as the set of all y ∈ R for which
  1 ≤ |Γ(y) ∩ S| ≤ ℓ
(K, ε, ℓ)-Bounded-Neighbor Disperser:
For any S ⊂ [N] such that K ≤ |S| ≤ 2K it holds that
|Γ(S,ℓ)| > (1 − ε)|R|
[Figure: a set S in the universe of size N and its neighborhoods on the right-hand side R]
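An illustrative brute-force check of this definition on tiny parameters (the graph and numbers below are assumptions for illustration, not a construction from the talk):

```python
# For every S with K <= |S| <= 2K, at least a (1 - eps) fraction of right vertices y
# must satisfy 1 <= |Gamma(y) ∩ S| <= l.
from itertools import combinations

def is_bnd(neighbors_of_right, N, K, eps, l):
    # neighbors_of_right[y] = set of left vertices adjacent to right vertex y
    R = len(neighbors_of_right)
    for size in range(K, 2 * K + 1):
        for S in combinations(range(N), size):
            S = set(S)
            good = sum(1 for nb in neighbors_of_right if 1 <= len(nb & S) <= l)
            if good <= (1 - eps) * R:
                return False
    return True

# Toy graph: 6 left vertices, 3 right vertices, each right vertex sees 2 elements.
graph = [{0, 1}, {2, 3}, {4, 5}]
print(is_bnd(graph, N=6, K=1, eps=0.7, l=2))   # True for these toy parameters
```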
25
The Global Encoding
• r = log K codewords; each C_i is identified with a (2^i, ε, ℓ)-Bounded-Neighbor Disperser
• For i = log_2|A Δ B| we have d_H(C(A)_i, C(B)_i) > 1 − ε
• In particular
  d(C(A),C(B)) = 1 − ∏_{i=1}^{r} [1 − d_H(C(A)_i, C(B)_i)] > 1 − ε
[Figure: sets A and B in the universe of size N, mapped into the codewords C_1, C_2, C_3]
26
Constructing BNDs
(K, ε, ℓ)-Bounded-Neighbor Disperser:
For any S ⊂ [N] such that K ≤ |S| ≤ 2K it holds that
|Γ(S,ℓ)| > (1 − ε)|R|
• Given N and K, want to optimize M, ℓ, ε and the left-degree D

        Optimal         Extractor               Disperser
  ℓ     1               O(1)                    polylog(N)
  D     log(N/K)        2^((loglog N)^2)        polylog(N)
  M     K·log(N/K)      K·2^((loglog N)^2)      K

[Figure: a bipartite graph between the universe of size N (left) and a codeword of length M (right)]
27
Outline
• Lower bound
• Equality testing
  – Main tool: Incremental encoding
  – Explicit construction using dispersers
• Symmetric difference approximation
• Summary & open problems
Summary & open problems
28
Symmetric Difference Approximation
1. Sketch the input streams into codewords: A ↦ C(A)_1, ..., C(A)_k and B ↦ C(B)_1, ..., C(B)_k
2. Compare s entries from each pair of codewords
   – d_i = # of differing entries sampled from the i-th pair
3. Output APX = (1 + ρ)^i for the maximal i s.t. d_i ≥ (1 − ε)·s
Guarantee:
|A Δ B| ≤ APX ≤ (1 + ρ) · (KD / ((1 − ε)M)) · |A Δ B|
where the factor KD / ((1 − ε)M) is ≈ 1 for the non-explicit construction and polylog(N) for the explicit one.
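A minimal sketch (parameter names assumed) of step 3, given the differing-entry counts d_i from step 2:

```python
def approximate_sym_diff(d, s, rho, eps):
    # d[i-1] = number of differing entries among the s sampled from the i-th codeword pair
    best = 0
    for i, di in enumerate(d, start=1):
        if di >= (1 - eps) * s:        # the i-th pair differs on almost all sampled entries
            best = i
    return (1 + rho) ** best           # APX = (1 + rho)^i for the maximal such i

# Hypothetical counts for 8 codeword pairs with s = 100 samples each:
d = [100, 99, 98, 97, 96, 40, 5, 0]
print(approximate_sym_diff(d, s=100, rho=1.0, eps=0.1))    # (1 + 1)^5 = 32
```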
29
Outline
• Lower bound
• Equality testing
  – Main tool: Incremental encoding
  – Explicit construction using dispersers
• Symmetric difference approximation
• Summary & open problems
Summary & open problems
30
Summary
• Formalized a realistic model for computation over massive data sets
"Adversarial" factors (from communication complexity):
• No secrets
• Adversarially-chosen inputs
Massive data sets:
• Sketching, streaming
[Diagram: the adversarial sketch model combines the two settings]
31
Summary
• Formalized a realistic model for computation over massive data sets
• Incremental encoding
  – Main technical contribution
  – Additional applications?
  S ↦ C(S)_1, ..., C(S)_r
  d(C(A),C(B)) = 1 − ∏_{i=1}^{r} [1 − d_H(C(A)_i, C(B)_i)]
• Determined the complexity of two fundamental tasks
  – Equality testing
  – Symmetric difference approximation
32
Open Problems
• Better explicit approximation for symmetric difference
  – Our (1 + ρ)-approximation is non-explicit
  – Explicit approximation: polylog(N)
• Approximating various similarity measures
  – L_p norms, resemblance, ...
The Power of Adversarial Sketching
• Characterizing the class of functions that can be "efficiently"
  computed in the adversarial sketch model (sublinear sketches, polylog updates)
• Possible approach: a public-coins to private-coins transformation
  that "preserves" the update time
33
Computational Assumptions
• Better schemes using computational assumptions?
• Equality testing: incremental collision-resistant hashing [BGG '94]
  – Significantly smaller sketches
  – Existing constructions either have very long public descriptions or rely on random oracles
  – Practical constructions without random oracles?
• Symmetric difference approximation: not known
  – Even with random oracles!
Thank you!
34
Pan-Privacy Model
Data is a stream of items; each item belongs to a user
Data of different users is interleaved arbitrarily
Curator sees the items, updates its internal state, and produces an output at stream end
[Figure: curator's internal state and output; can also consider multiple intrusions]
Pan-Privacy
For every possible behavior of a user in the stream, the joint
distribution of the internal state at any single point in time
and the final output is differentially private
Adjacency: User Level
Universe U of users whose data is in the stream; x ∈ U
• Streams are x-adjacent if they have the same projections onto U\{x}
  Example: axbxcxdxxxex and abcdxe are x-adjacent
  • Both project to abcde
• Notion of "corresponding locations" in x-adjacent streams
• U-adjacent: ∃ x ∈ U for which they are x-adjacent
  – Simply "adjacent," if U is understood
Note: Streams of different lengths can be adjacent
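A minimal sketch (plain Python, not from the slides) of this adjacency check:

```python
# Two streams are x-adjacent if they are identical after deleting all of user x's items.
def x_adjacent(stream1, stream2, x):
    project = lambda s: [item for item in s if item != x]
    return project(stream1) == project(stream2)

# Example from the slide: both streams project to "abcde".
print(x_adjacent(list("axbxcxdxxxex"), list("abcdxe"), "x"))   # True
```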
Example: Stream Density or # Distinct Elements
Universe U of users; estimate how many distinct
users in U appear in the data stream
Application: # distinct users who searched for "flu"
Ideas that don't work:
• Naïve
  – Keep a list of users that appeared (bad privacy and space)
• Streaming
  – Track a random sub-sample of users (bad privacy)
  – Hash each user, track the minimal hash (bad privacy)
Pan-Private Density Estimator
Inspired by randomized response.
Store for each user x ∈ U a single bit b_x
Initially all b_x are drawn from distribution D0: 0 w.p. 1/2, 1 w.p. 1/2
When encountering x, redraw b_x from distribution D1: 0 w.p. 1/2 − ε, 1 w.p. 1/2 + ε
Final output: [(fraction of 1's in table − 1/2)/ε] + noise
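A minimal simulation of this estimator (plain Python; the Laplace noise, its scale, and all concrete parameters are our assumptions):

```python
import random

def density_estimate(stream, universe, eps, noise_scale):
    # D0: every bit starts out 1 with probability 1/2
    bits = {x: random.random() < 0.5 for x in universe}
    for x in stream:
        # D1: whenever user x appears, redraw b_x to be 1 with probability 1/2 + eps
        bits[x] = random.random() < 0.5 + eps
    frac_ones = sum(bits.values()) / len(universe)
    # Laplace noise, generated as a difference of two exponential variables
    noise = random.expovariate(1 / noise_scale) - random.expovariate(1 / noise_scale)
    # Estimates the *fraction* of users in U that appeared in the stream
    return (frac_ones - 0.5) / eps + noise

universe = range(1000)
stream = [random.randrange(200) for _ in range(5000)]   # about 200 distinct users appear
print(density_estimate(stream, universe, eps=0.25, noise_scale=0.01))  # roughly 0.2, up to sampling noise
```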
Pan-Privacy
If user never appeared: entry drawn from D0
If user appeared any # of times: entry drawn from D1
D0 and D1 are 4ε-differentially private