Sketching in Adversarial Environments
Or
Sublinearity and Cryptography
Moni Naor
Joint work with: Ilya Mironov and Gil Segev
Comparing Streams
How to compare data streams without storing them?
SA
SB
Step 1: Compress data on-line into sketches
Step 2: Interact using only the sketches
Goal: Minimize sketches, update time, and communication
Comparing Streams
How to compare data streams that cannot be stored?
Shared randomness
Real-life applications: massive data sets, on-line data,...
Highly efficient solutions assuming shared randomness
Comparing Streams
How to compare data streams that cannot be stored?
Shared randomness
Is shared randomness a reasonable assumption?
Plagiarism
detection
No guarantees when the inputs are set adversarially
Inputs may be adversarially chosen depending on the randomness
The Adversarial Sketch Model
“Adversarial” factors: no secrets, adversarially-chosen inputs
[Diagram: the adversarial sketch model sits at the intersection of communication complexity and massive data sets (sketching, streaming)]
The Adversarial Sketch Model
Goal: Compute f(A,B)
Sketch phase
small sketches,
fast updates
An adversary chooses the inputs of the parties
Provided as on-line sequences of insert and delete operations
No shared secrets:
The parties are not allowed to communicate during the sketch phase
Any public information is known to the adversary in advance
The adversary is computationally all-powerful
Interaction phase
low communication
& computation
Our Results
Equality testing
A, B ⊆ [N] of size at most K
Error probability ε
If we had public randomness…
Sketches of size O(log(1/ε))
Similar update time, communication and computation
Lower Bound: Equality testing in the adversarial sketch model requires sketches of size Ω((K·log(N/K))^{1/2})
Our Results
Equality testing
A, B ⊆ [N] of size at most K
Error probability ε
Lower Bound: Equality testing in the adversarial sketch model requires sketches of size Ω((K·log(N/K))^{1/2})
Upper Bound: Explicit and efficient protocol:
Sketches of size O((K·polylog(N)·log(1/ε))^{1/2})
Update time, communication and computation polylog(N)
Our Results
Symmetric difference approximation
A, B ⊆ [N] of size at most K
Goal: approximate |A Δ B| with error probability ε
Upper Bound:
(1 + ρ)-approximation for any constant ρ
Sketches of size O((K·polylog(N)·log(1/ε))^{1/2})
Update time, communication and computation polylog(N)
Explicit construction: polylog(N)-approximation
Outline
Lower bound
Equality testing
Main tool: Incremental encoding
Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems
Simultaneous Messages Model
[Figure: Alice holds x, Bob holds y; each sends a single message to a referee, who must output f(x,y)]
Simultaneous Messages Model
[Figure: the same picture, with the sketches of the adversarial sketch model playing the role of the messages]
Lower Bound [NS96, BK97]: Equality testing in the private-coin SM model requires communication Ω((K·log(N/K))^{1/2}); the same bound applies to the sketches in the adversarial sketch model
Outline
Lower bound
Equality testing
Main tool: Incremental encoding
Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems
Simultaneous Equality Testing
Encode x → C(x) and y → C(y) with an error-correcting code of length K, viewing each codeword as a K^{1/2} × K^{1/2} matrix
Communication K^{1/2}
First Attempt
[Figure: one party sends row 3 of C(A); the other sends column 2 of C(B) together with the entry C(B)_{3,2}]
Sketches of size K^{1/2}
Problem: update time K^{1/2}
Incrementality vs. Distance
Incrementality:
Given C(S) and x ∈ [N], the encodings of S ∪ {x} and S ∖ {x} are obtained by modifying very few (logarithmically many) entries
High distance:
For every distinct A, B ⊆ [N] of size at most K, d(C(A), C(B)) > 1 - ε (a constant)
Impossible to achieve both properties simultaneously with the Hamming distance
Incremental Encoding
S → (C(S)_1, ..., C(S)_r)
d(C(A), C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i))
where d_H is the normalized Hamming distance; r = 1 recovers the Hamming distance
Hope: Larger r will enable fast updates
r corresponds to the communication complexity of our protocol
Want to keep r as small as possible
Explicit construction with r = log K:
Codeword size K·polylog(N)
Update time polylog(N)
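The combined distance above is straightforward to compute from the r codeword pairs; a minimal sketch in Python (function names are mine, not the paper's):

```python
def hamming(a, b):
    """Normalized Hamming distance between two equal-length codewords."""
    assert len(a) == len(b) > 0
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance(CA, CB):
    """d(C(A), C(B)) = 1 - prod_{i=1..r} (1 - d_H(C(A)_i, C(B)_i)).
    For r = 1 this reduces to the plain normalized Hamming distance."""
    prod = 1.0
    for ca_i, cb_i in zip(CA, CB):
        prod *= 1.0 - hamming(ca_i, cb_i)
    return 1.0 - prod
```

Note that a single pair at distance 1 already forces d = 1, which is exactly what the global encoding will guarantee for one of the r codewords.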
Equality Protocol
[Figure: one party sends random rows (3,1,1) of C(A)_1, C(A)_2, C(A)_3; the other replies with cols (2,3,1) and the values of C(B)_1, C(B)_2, C(B)_3 there]
Error probability: ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)) = 1 - d(C(A), C(B)) < ε
The Encoding
Global encoding
Map each element to several entries of each codeword
Exploit “random-looking” graphs
Local encoding
Resolve collisions separately in each entry
A simple solution when |A Δ B| is guaranteed to be small
The Local Encoding
Suppose that |A Δ B| ≤ ℓ
Missing Number Puzzle
Let S = {1,...,N} ∖ {i}: one number i is missing
π, a random permutation over S, arrives as a one-way stream π(1), ..., π(N-1)
Goal: Determine the missing number i using O(log N) bits
What if there are ℓ missing numbers?
Can it be done using O(ℓ·log N) bits?
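A sketch of the standard solution (my code, with the two-missing-numbers case solved via the first two power sums):

```python
import math

def missing_one(stream, n):
    """Recover the single missing number from a one-way stream of
    {1,...,n} minus one element, using O(log n) bits (a running sum)."""
    s = n * (n + 1) // 2
    for x in stream:
        s -= x
    return s

def missing_two(stream, n):
    """Recover two missing numbers by tracking the sum and the sum of
    squares, then solving for the pair."""
    s = n * (n + 1) // 2                 # sum of 1..n
    q = n * (n + 1) * (2 * n + 1) // 6   # sum of squares of 1..n
    for x in stream:
        s -= x
        q -= x * x
    # a + b = s and a^2 + b^2 = q, hence (a - b)^2 = 2q - s^2
    d = math.isqrt(2 * q - s * s)
    return (s - d) // 2, (s + d) // 2
```

Tracking the first ℓ power sums generalizes this to ℓ missing numbers in O(ℓ·log N) bits.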
The Local Encoding
Suppose that |A Δ B| ≤ ℓ
A simple & well-known solution:
Associate each x ∈ [N] with v(x) such that for any distinct x_1, ..., x_ℓ the vectors v(x_1), ..., v(x_ℓ) are linearly independent
C(S) = Σ_{x∈S} v(x)
If 1 ≤ |A Δ B| ≤ ℓ then C(A) ≠ C(B)
For example v(x) = (1, x, ..., x^{ℓ-1})
Size & update time O(ℓ·log N), independent of the size of the sets
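The construction above, sketched in Python over a prime field (the modulus and function names are my choices, assuming the universe size N is below the prime):

```python
P = (1 << 61) - 1  # a prime larger than the universe size (assumption: N < P)

def v(x, l):
    """v(x) = (1, x, x^2, ..., x^(l-1)) mod P.  For distinct x_1,...,x_l
    these rows form a Vandermonde matrix, hence are linearly independent."""
    return [pow(x, k, P) for k in range(l)]

def encode(S, l):
    """C(S) = sum_{x in S} v(x) mod P.  Inserting or deleting one element
    touches l field entries, i.e. O(l log N) time, independent of |S|."""
    c = [0] * l
    for x in S:
        for k in range(l):
            c[k] = (c[k] + pow(x, k, P)) % P
    return c
```

If 1 ≤ |A Δ B| ≤ ℓ, then C(A) - C(B) is a nonzero ±1-combination of at most ℓ linearly independent vectors, so encode(A) ≠ encode(B).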
The Global Encoding
Each element is mapped into several entries of each codeword
The content of each entry is locally encoded
[Figure: a universe of size N mapped into codewords C_1, C_2, C_3]
The Global Encoding
Each element is mapped into several entries of each codeword
The content of each entry is locally encoded
The local guarantee:
If 1 ≤ |C_i[y] ∩ (A Δ B)| ≤ ℓ then C(A) and C(B) differ on C_i[y]
[Figure: sets A and B in a universe of size N, with ℓ = 1; C(A) and C(B) differ at least on the entries of C_1, such as C_1[2], that are hit exactly once by A Δ B]
The Global Encoding
Identify each codeword with a bipartite graph G = ([N], R, E)
For S ⊆ [N] define Γ(S, ℓ) ⊆ R as the set of all y ∈ R for which 1 ≤ |Γ(y) ∩ S| ≤ ℓ
(K, ε, ℓ)-Bounded-Neighbor Disperser:
For any S ⊆ [N] such that K ≤ |S| ≤ 2K it holds that |Γ(S, ℓ)| > (1 - ε)|R|
[Figure: a set S in the universe of size N and its neighbors in R]
The Global Encoding
Bounded-Neighbor
Disperser
r = log K codewords; each C_i is identified with a (2^i, ε, ℓ)-BND
For i = log_2|A Δ B| we have d_H(C(A)_i, C(B)_i) > 1 - ε
In particular
d(C(A), C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i)) > 1 - ε
[Figure: sets A and B in a universe of size N mapped into codewords C_1, C_2, C_3]
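Putting the global and local encodings together: a toy end-to-end sketch in which a hash-based "random-looking" bipartite graph stands in for the explicit bounded-neighbor dispersers (the degree, codeword sizes, and all names below are illustrative assumptions, not the paper's parameters):

```python
import hashlib

P = (1 << 61) - 1  # prime field for the local encoding
D = 4              # left-degree of each graph (assumption)
ELL = 4            # local-encoding parameter l (assumption)

def neighbors(x, i, m):
    """The D entries of codeword i to which element x is mapped; a hash
    stands in for an explicit disperser's neighbor function."""
    return [int.from_bytes(hashlib.sha256(f"{i}/{j}/{x}".encode()).digest()[:8],
                           "big") % m for j in range(D)]

class Sketch:
    """r = log2(K) codewords; codeword i is sized for symmetric differences
    around 2^i.  Each entry holds a local encoding over GF(P)."""
    def __init__(self, K):
        self.r = max(1, (K - 1).bit_length())
        self.C = [[[0] * ELL for _ in range(8 << i)] for i in range(self.r)]

    def update(self, x, sign=1):
        """Insert (sign=+1) or delete (sign=-1) element x."""
        for i in range(self.r):
            m = len(self.C[i])
            for y in neighbors(x, i, m):
                for k in range(ELL):
                    self.C[i][y][k] = (self.C[i][y][k] + sign * pow(x, k, P)) % P
```

Two streams describing the same set, in any order of inserts and deletes, yield identical sketches, which is what the interaction phase compares.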
Constructing BNDs
(K, ε, ℓ)-Bounded-Neighbor Disperser:
For any S ⊆ [N] such that K ≤ |S| ≤ 2K it holds that |Γ(S, ℓ)| > (1 - ε)|R|
Given N and K, want to optimize M, ℓ, ε and the left-degree D

      Optimal         Extractor-based        Disperser-based
ℓ     1               O(1)                   polylog(N)
D     log(N/K)        2^{(loglog N)^2}       polylog(N)
M     K·log(N/K)      K·2^{(loglog N)^2}     K

[Figure: a universe of size N mapped by a degree-D graph into a codeword of length M]
Outline
Lower bound
Equality testing
Main tool: Incremental encoding
Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems
Symmetric Difference Approximation
1. Sketch input streams into codewords: A → C(A)_1, ..., C(A)_k and B → C(B)_1, ..., C(B)_k
2. Compare s entries from each pair of codewords
   d_i = # of differing entries sampled from the i-th pair
3. Output APX = (1 + ρ)^i for the maximal i s.t. d_i ≥ (1 - ε)s
Guarantee: |A Δ B| ≤ APX ≤ (1 + ρ) · (KD/((1 - ε)M)) · |A Δ B|
The factor KD/((1 - ε)M) is ≈ 1 for the non-explicit construction and polylog(N) for the explicit one
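Steps 2 and 3 above as a small Python sketch (only the sampling and the threshold rule; variable names and the seed are my choices):

```python
import random

def approx_output(CA, CB, s, rho, eps, seed=0):
    """Sample s entries from each codeword pair, count differing entries
    d_i, and output APX = (1+rho)^i for the maximal i with d_i >= (1-eps)*s.
    Returns 0 when no pair passes the threshold (A and B look equal)."""
    rng = random.Random(seed)
    apx = 0
    for i, (ca, cb) in enumerate(zip(CA, CB), start=1):
        idx = [rng.randrange(len(ca)) for _ in range(s)]
        d_i = sum(ca[j] != cb[j] for j in idx)
        if d_i >= (1 - eps) * s:
            apx = (1 + rho) ** i
    return apx
```

The idea is that codeword i is built to saturate (almost all entries differing) once |A Δ B| reaches roughly (1 + ρ)^i, so the largest saturated index locates |A Δ B| up to a (1 + ρ) factor.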
Outline
Lower bound
Equality testing
Main tool: Incremental encoding
Explicit construction using dispersers
Symmetric difference approximation
Summary & open problems
Summary
Formalized a realistic model for computation over massive data sets
“Adversarial” factors: no secrets, adversarially-chosen inputs
[Diagram: the adversarial sketch model sits at the intersection of communication complexity and massive data sets (sketching, streaming)]
Summary
Formalized a realistic model for computation over massive data sets
Incremental encoding
Main technical contribution
Additional applications?
S → (C(S)_1, ..., C(S)_r)
d(C(A), C(B)) = 1 - ∏_{i=1}^{r} (1 - d_H(C(A)_i, C(B)_i))
Determined the complexity of two fundamental tasks
Equality testing
Symmetric difference approximation
Open Problems
Better explicit approximation for symmetric difference
Our (1 + ρ)-approximation is non-explicit
Explicit approximation: polylog(N)
Approximating various similarity measures
L_p norms, resemblance, ...
The Power of Adversarial Sketching
Characterizing the class of functions that can be “efficiently”
computed in the adversarial sketch model
sublinear sketches
polylog updates
Possible approach: public-coins to private-coins transformation
that “preserves” the update time
Computational Assumptions
Better schemes using computational assumptions?
Equality testing: Incremental collision-resistant hashing [BGG ’94]
Significantly smaller sketches
Existing constructions either have very long public descriptions, or rely on random
oracles
Practical constructions without random oracles?
Symmetric difference approximation: Not known
Even with random oracles!
Thank you!
Pan-Privacy Model
Data is stream of items, each item belongs to a user
Data of different users interleaved arbitrarily
Curator sees items, updates internal state, output at stream end
[Figure: the curator's internal state and final output; an intruder may observe the state]
Can also consider multiple intrusions
Pan-Privacy
For every possible behavior of a user in the stream, the joint distribution of the internal state at any single point in time and the final output is differentially private
Adjacency: User Level
Universe U of users whose data is in the stream; x ∈ U
• Streams are x-adjacent if they have the same projections of users onto U ∖ {x}
Example: axbxcxdxxxex and abcdxe are x-adjacent
• Both project to abcde
• Notion of “corresponding locations” in x-adjacent streams
• U-adjacent: ∃ x ∈ U for which they are x-adjacent
– Simply “adjacent,” if U is understood
Note: Streams of different lengths can be adjacent
Example: Stream Density or # Distinct Elements
Universe U of users, estimate how many distinct
users in U appear in data stream
Application: # distinct users who searched for “flu”
Ideas that don’t work:
• Naïve
Keep list of users that appeared (bad privacy and space)
• Streaming
– Track random sub-sample of users (bad privacy)
– Hash each user, track minimal hash (bad privacy)
Pan-Private Density Estimator
Inspired by randomized response.
Store for each user x ∈ U a single bit b_x
Initially all b_x are drawn from distribution D0: 0 w.p. ½, 1 w.p. ½
When encountering x, redraw b_x from distribution D1: 0 w.p. ½ - ε, 1 w.p. ½ + ε
Final output: [(fraction of 1's in table - ½)/ε] + noise
Pan-Privacy
If user never appeared: entry drawn from D0
If user appeared any # of times: entry drawn from D1
D0 and D1 are 4ε-differentially private