Efficient and Private
Distance Approximation
David Woodruff
MIT
Outline
1. Two-Party Communication
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
The Communication Model
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n
What is the distance D(x,y) between x and y?
If Σ = {0,1}, what is the Hamming distance?
If Σ = R, what is the Lp distance for some p ∈ (0, ∞)?
The Lp distance is (Σ_{i=1}^n |x_i - y_i|^p)^{1/p}
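For concreteness, both distances on this slide can be computed directly (a plain sketch, not a communication protocol):

```python
def hamming(x, y):
    """Number of coordinates where x and y differ (Sigma = {0,1})."""
    return sum(xi != yi for xi, yi in zip(x, y))

def lp_distance(x, y, p):
    """(sum_i |x_i - y_i|^p)^(1/p) for real-valued vectors."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

x = [0, 1, 1, 0]
y = [1, 1, 0, 0]
print(hamming(x, y))         # 2
print(lp_distance(x, y, 2))  # sqrt(2) ~ 1.414
```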
Application – Streaming Model
Want to mine a massive data stream: 4, 7, 3, 3, 1, 1, 7, …
How many distinct elements?
What's the most frequent item?
Is the data uniform or skewed?
Elements arranged in adversarial order
Algorithms only allowed one pass
Goal: low-space algorithms
In this talk, most protocols yield streaming algorithms; thus, communication equals space
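One of the primitives above can be sketched as a one-pass, low-space algorithm. This is a simplified bottom-k (KMV) distinct-elements estimator, not a protocol from this talk; the parameter k and the hash function are illustrative:

```python
import hashlib

def h(item):
    """Deterministic hash of item into (0, 1)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2 ** 64

def distinct_estimate(stream, k=64):
    """One pass, O(k) space: keep the k smallest hash values seen."""
    smallest = set()
    for item in stream:
        v = h(item)
        if v in smallest:
            continue
        if len(smallest) < k:
            smallest.add(v)
        else:
            m = max(smallest)
            if v < m:
                smallest.remove(m)
                smallest.add(v)
    if len(smallest) < k:            # saw fewer than k distinct items
        return len(smallest)
    return int((k - 1) / max(smallest))

stream = [i % 1000 for i in range(10_000)]   # 1000 distinct items
print(distinct_estimate(stream))             # roughly 1000
```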
Application – Streaming Model
Two-party communication        Streaming model
Communication lower bounds  →  Space lower bounds (always)
Protocols                   →  Algorithms (often)
Distance approximation captures streaming primitives:
distinct elements (Hamming), frequent items (L2), skew (Lp)
Application – IP Session Data
AT&T collects 100+ GBs of NetFlow data every day
Source     Destination  Bytes  Duration  Protocol
18.6.7.1   19.7.3.2     40K    28        http
10.6.2.3   12.3.4.8     20K    18        ftp
11.1.0.6   11.6.8.2     58K    22        http
12.3.1.5   14.7.0.1     30K    32        http
…          …            …      …         …
Application – IP Session Data
AT&T needs to process a massive stream of network data
Traffic estimation
What fraction of network IP addresses are active?
Distinct elements computation
Traffic analysis
What are the 100 IP addresses with the most traffic?
Frequent items computation
Security/Denial of Service
Are there any IP addresses witnessing a spike in traffic?
Skewness computation
Application – Secure Datamining
For medical research, hospitals wish to mine their joint data
Distance approximation is useful in many mining algorithms,
e.g., classification and clustering
Patient confidentiality laws place strict limits on what information
can be shared; mining must not leak anything sensitive
Issues
Exact vs. Approximate Solution
Efficiency
Communication Complexity
Round Complexity
Security
Neither party learns more than what the solution and
his/her own input imply about the other party's input
Initial Observations
              Deterministic      Randomized
Exact         Ω(n) (folklore)    Ω(n) [KS, R]
Approximate   Ω(n) (folklore)    ?
To cope with the Ω(n) communication bound, we look for
randomized approximation algorithms
Previous Results
Output D' such that for all x, y:
Pr[D(x,y) ≤ D'(x,y) ≤ (1+ε)·D(x,y)] ≥ 2/3
Communication Complexity
                  Upper Bounds                        Lower Bounds
Hamming Distance  1/ε² [FM79, BJKST02…]               1/ε (folklore)
L2                1/ε² [AMS96]                        1/ε (folklore)
Lp, p > 2         n^{1-1/(p-1)} [AMS96, CK04, G05]    n^{1-2/p} [AMS96, BJKS02, CKS03]
Private Communication Complexity
                  Upper Bounds         Lower Bounds
Hamming Distance  n^{1/2} [FIMNSW01]   1/ε
L2                n (SFE)              1/ε
Lp, p > 2         n (SFE)              n^{1-2/p}
Our Results [IW03, W04, IW05, IW06]
                  Communication Complexity               Private Communication Complexity
                  Upper Bounds            Lower Bounds   Upper Bounds              Lower Bounds
Hamming Distance  Θ(1/ε²), 1-round        1/ε            O(1/ε²), O(1)-rounds      1/ε
L2                Θ(1/ε²), 1-round        1/ε            O(1/ε²), O(1)-rounds      1/ε
Lp, p > 2         O(n^{1-2/p}), 1-round   n^{1-2/p}      n (SFE); still open       n^{1-2/p}
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
Private L2 Estimation
We improve the n^{1/2} upper bound to 1/ε² for private L2,
and our protocol uses O(1) rounds
Optimal up to suppressed logarithmic factors
Also holds for the Hamming distance
There was speculation that private approximation is much harder
than non-private approximation; we refute this speculation
Security Definition
What does privacy mean for distance computation?
Minimal requirement: Alice does not learn anything about y
other than what follows from her input x and D(x,y)
What does privacy mean for distance approximation?
Does this work? Alice does not learn anything about y other
than what follows from x and the approximation D'(x,y)
Not sufficient!
Security Definition
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n. Suppose Σ = {0,1}
Set the LSB of D'(x,y) to be y_n, and the remaining bits of
D'(x,y) to agree with those of D(x,y)
D'(x,y) is a ±1 approximation, but Alice learns y_n, which
doesn't follow from x and D(x,y)
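This counterexample can be made concrete (illustration only): the doctored estimate below is a valid ±1 approximation of the Hamming distance, yet its low bit hands Alice Bob's last input bit outright.

```python
def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def leaky_approx(x, y):
    """Copy the high bits of D(x,y) but set the LSB to y_n."""
    d = hamming(x, y)
    return (d & ~1) | y[-1]

x = [1, 0, 1, 1, 0, 1]
for yn in (0, 1):
    y = [0, 0, 1, 0, 1, yn]
    d, dp = hamming(x, y), leaky_approx(x, y)
    assert abs(d - dp) <= 1   # a valid +/-1 approximation...
    assert dp & 1 == yn       # ...that reveals y_n exactly
print("leak demonstrated")
```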
Security Definition
What does privacy mean for distance approximation?
New requirement: Alice and Bob don't learn anything about each
other's input other than what follows from their own input and D(x,y)
Implication: D'(x,y) is determined by D(x,y) and the randomness
How do we model the power of the cheating parties?
Security Models
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n
Semi-honest: parties follow their instructions but try to
learn more than what is prescribed
Malicious: parties deviate from the protocol arbitrarily
- Use a different input
- Force other party to output wrong answer
- Abort before other party learns answer
It is difficult to achieve security in the malicious model…
Reductions – Yao, GMW, NN
A protocol secure in the semi-honest model can be compiled into a
protocol secure in the malicious model, with the efficiency of the
new protocol equal to the efficiency of the old protocol
So it suffices to design protocols secure in the semi-honest model,
where the parties follow the instructions of the protocol
Don't need to worry about "weird" behavior
Just ensure neither party learns anything about the other's
input except what follows from the exact distance
Our Protocol
A first try: randomly sample a few coordinates j, compute (x_j – y_j)²,
and scale to estimate ||x-y||_2^2
Problem: Alice: x = e_1, Bob: y = e_2.
With high probability, all samples return 0, so the estimate is 0
A second try: randomly rotate the vectors with a rotation M over R:
||Mx – My||_2^2 = ||x-y||_2^2
Now the mass is "spread out", so sampling is effective
The parties need to agree on the rotation M.
Can be done with low communication using a PRG
Problem: neither party can learn the samples, since with
knowledge of M, this reveals extra information
Solution: we build a private sub-protocol to output an estimate
from the samples without revealing the samples
The correctness and desired efficiency of the sampling approach
are then easy to verify
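The "second try" can be sketched as follows (illustration only; the real protocol derives the rotation M from a shared PRG seed rather than exchanging it). An orthogonal M preserves ||x-y||_2 while spreading its mass, so sampling a few coordinates of M(x-y) now works even on the bad instance x = e_1, y = e_2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512

# The bad case for naive sampling: x = e_1, y = e_2.
x = np.zeros(n); x[0] = 1.0
y = np.zeros(n); y[1] = 1.0

# A random rotation: Q from the QR decomposition of a Gaussian matrix.
M, _ = np.linalg.qr(rng.standard_normal((n, n)))

z = M @ (x - y)                              # ||z||_2^2 = ||x-y||_2^2 = 2
samples = rng.choice(n, size=64, replace=False)
estimate = n * np.mean(z[samples] ** 2)      # n * E_j[z_j^2] = ||x-y||_2^2
print(estimate)                              # close to 2
```

Naive sampling of x - y itself would return 0 on almost every coordinate; after the rotation, every coordinate carries information about the norm.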
Private Sub-protocol
Problem: Alice learns (My)_j for some j (Bob is similar)
Solution:
1. Use an oblivious masking sampling protocol [FIMNSW]
Alice learns (My)_j ⊕ b for a random mask b; Bob has b
Alice does not learn j
Private Sub-protocol
Alice holds x ∈ Σ^n, computes Mx, and creates a mask a
Bob holds y ∈ Σ^n, computes My, and creates a mask b
They run the oblivious, masking sampling protocol
Alice gets b ⊕ (My)_j for unknown j
Bob gets a ⊕ (Mx)_j for unknown j
Private Sub-protocol
Alice has mask a and gets b ⊕ (My)_j for unknown j
Bob has mask b and gets a ⊕ (Mx)_j for unknown j
A low-communication private protocol computes (M(x-y))_j²,
and since j is random,
E_j[(M(x-y))_j²] = ||Mx-My||_2^2/n = ||x-y||_2^2/n
Private Sub-protocol
A low-communication private protocol computes (M(x-y))_j²,
and since j is random,
E_j[(M(x-y))_j²] = ||Mx-My||_2^2/n = ||x-y||_2^2/n
1. Let T be an upper bound on ||x-y||_2^2
2. The protocol outputs a bit c with
Pr[c = 1] = n·(M(x-y))_j²/T ≈ ||x-y||_2^2/T ≤ 1
3. Since c is a bit, it is determined from its expectation
Thus, the expectation depends only on the length!
Repeat a few times to get tight concentration
If most repetitions return c = 0, adjust T and repeat
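The bit trick can be sketched as follows (hypothetical, self-contained setup): z stands in for the rotated difference M(x-y), T is a public upper bound, and each repetition releases only a single bit c with Pr[c = 1] = n·z_j²/T, whose expectation is ||x-y||_2^2/T.

```python
import random

random.seed(1)
n = 400
z = [random.uniform(-1, 1) for _ in range(n)]   # stand-in for M(x - y)
S = sum(v * v for v in z)                       # the target ||x-y||_2^2
T = n                                           # valid bound: each z_j^2 <= 1

reps = 5000
ones = 0
for _ in range(reps):
    j = random.randrange(n)                     # sample a random coordinate
    if random.random() < n * z[j] ** 2 / T:     # output the single bit c
        ones += 1

estimate = T * ones / reps                      # T * Pr[c = 1] ~ ||x-y||_2^2
print(estimate, S)
```

Since each repetition reveals one bit whose distribution depends only on ||x-y||_2^2 and T, nothing beyond the (approximate) length leaks.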
Wrapup
We give an O(1)-round, O(1/ε²)-communication private protocol for
the L2 distance
Optimal up to suppressed logarithmic factors
Details
The randomness is not truly random – it comes from a pseudorandom
generator against non-uniform machines
Parties have bounded precision
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
Lp Estimation for p > 2
We improve the n^{1-1/(p-1)} communication upper bound to n^{1-2/p},
and our protocol is 1-round
Achieving this privately is still an open problem
Lp Estimation for p > 2
Problem: Rotation doesn't work for p > 2
Example: (1, 0) has L2^2 = 1 and L4^4 = 1; after a rotation to
(1/√2, 1/√2), L2^2 = 1 still, but L4^4 = 1/2
Not clear how to "re-randomize" Lp for p > 2
We need a new approach…
Lp Estimation for p > 2
Alice holds x ∈ {1, …, m}^n; Bob holds y ∈ {1, …, m}^n
We will approximate ||x-y||_p^p to within a constant factor
Strategy
1. Classify coordinates |x_j – y_j| into buckets
0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket
3. Output Σ_i s_i · 2^{ip}
One source of error: the s_i are approximate
Another source: the values are approximate
Overall, still within a constant factor
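The three steps can be sketched as a toy version with exact bucket counts, to isolate the classification step (the real protocol only has approximate s_i):

```python
import math

def lpp_bucket_estimate(x, y, p):
    """Approximate ||x-y||_p^p as sum_i s_i * 2^(i*p), where s_i counts
    the values |x_j - y_j| in [2^i, 2^(i+1)); zeros contribute nothing."""
    s = {}
    for xj, yj in zip(x, y):
        d = abs(xj - yj)
        if d > 0:
            i = int(math.log2(d))     # d lies in [2^i, 2^(i+1))
            s[i] = s.get(i, 0) + 1
    return sum(si * 2 ** (i * p) for i, si in s.items())

x, y, p = [9, 1, 4, 0, 7], [1, 1, 1, 5, 7], 3
exact = sum(abs(a - b) ** p for a, b in zip(x, y))
approx = lpp_bucket_estimate(x, y, p)
print(exact, approx)   # 664 584
```

Each term 2^{ip} underestimates |x_j - y_j|^p by at most a factor 2^p, so the output is within a constant factor for fixed p.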
Estimating Bucket Sizes
Remaining problem: estimate
s_i = # of coordinates with |x_j – y_j| in the range [2^i, 2^{i+1})
Is this easy? No! One can show we need Ω(n) communication
Sometimes [CCF-C] can help: we can estimate s_i when i is large
Our approach: whenever s_i is hard to estimate, we can detect
this, and set it to 0. Otherwise, we estimate it
Problem: Aren't we undercounting?
Answer: No! Hard s_i don't matter!
The CountSketch Protocol
[CCF-C]: "I have a 1-round, B-communication protocol which computes
all j for which (x_j – y_j)² ≥ ||x-y||_2^2/B"
Intuition: we can detect very large coordinates,
where large is with respect to the L2 norm
- s_i large: Lp → L2
- If s_i = O(1), we can compute s_i with O(n^{1-2/p}) communication
Looks very promising!
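The [CCF-C] guarantee can be illustrated with a toy CountSketch in the two-party setting (all parameters hypothetical): Alice and Bob sketch x and y with shared hash functions, and the difference of the sketches estimates each x_j - y_j, flagging coordinates with (x_j - y_j)² ≥ ||x-y||_2^2/B.

```python
import random, statistics

def make_hashes(n, B, reps, seed):
    """reps independent (bucket, sign) hash pairs, shared by both parties."""
    rng = random.Random(seed)
    return [([rng.randrange(B) for _ in range(n)],
             [rng.choice((-1, 1)) for _ in range(n)])
            for _ in range(reps)]

def sketch(v, hashes, B):
    """B counters per repetition: each coordinate added with its sign."""
    tables = []
    for bucket, sign in hashes:
        t = [0.0] * B
        for j, vj in enumerate(v):
            t[bucket[j]] += sign[j] * vj
        tables.append(t)
    return tables

n, B, reps = 200, 16, 9
hashes = make_hashes(n, B, reps, seed=7)

noise = random.Random(3)
x = [0.0] * n
x[17] = 10.0                                   # one heavy coordinate
y = [noise.uniform(-1, 1) for _ in range(n)]   # small noise elsewhere

# The difference of the two sketches is a sketch of x - y (linearity).
sx, sy = sketch(x, hashes, B), sketch(y, hashes, B)
diff = [[a - b for a, b in zip(tx, ty)] for tx, ty in zip(sx, sy)]

def estimate(j):
    """Median over repetitions of the unbiased estimate of (x - y)_j."""
    return statistics.median(sign[j] * diff[r][bucket[j]]
                             for r, (bucket, sign) in enumerate(hashes))

norm2 = sum((a - b) ** 2 for a, b in zip(x, y))
heavy = [j for j in range(n) if estimate(j) ** 2 >= norm2 / B]
print(heavy)   # contains the heavy coordinate 17
```

Only the B·reps counters need to be communicated, not the vectors themselves.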
Random Restriction
We would like to estimate s_i, given that we can efficiently
output all coordinates j for which (x_j – y_j)² ≥ ||x-y||_2^2/B
Ideas? Not so obvious if s_i is large
Randomly restrict to a ≈ 1/s_i fraction of the coordinates j!
Random Restriction
Example (p = 3):
- Θ(n) coordinates of value 1 contribute Θ(n) to ||x-y||_3^3
- n^{1/2} coordinates of value n^{1/4} contribute
  n^{1/2} · (n^{1/4})^3 = n^{5/4} to ||x-y||_3^3
- 1 coordinate of value n^{1/3} contributes 1 · (n^{1/3})^3 = n to ||x-y||_3^3
The middle group dominates, but the CountSketch
protocol cannot detect it.
The reason is that each value in the middle group is small,
but the group itself is large.
Random Restriction
In the example above, we randomly restrict to n^{1/2} coordinates
Then roughly one coordinate of the middle group survives, and it is
now large relative to the restricted L2 norm, so it can be detected
Recap
Algorithm
1. Classify coordinates |x_j – y_j| into buckets
0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket
3. Output Σ_i s_i · 2^{ip}
Subroutine
1. Randomly restrict to n/2, n/4, n/8, …, coordinates
2. For each restriction, use CountSketch to retrieve the
largest elements. Classify them into groups.
3. Scale back up to estimate s_i
Guarantee: either we estimate s_i well, or s_i is tiny
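The subroutine can be simulated on the hard instance from the example slides, with an idealized detector standing in for CountSketch (instance sizes and the budget B are illustrative):

```python
import random

rng = random.Random(5)
n = 1 << 16          # 65536 coordinates
B = 100              # detection budget of the (idealized) CountSketch

# Hard instance (p = 3): ~n coordinates of value 1, plus n^(1/2)
# coordinates of value n^(1/4) -- the dominating middle group.
z = [1.0] * n
target = n ** 0.25
for j in rng.sample(range(n), int(n ** 0.5)):
    z[j] = target

m = n
while m >= 1:
    keep = rng.sample(range(n), m)        # random restriction to m coords
    norm2 = sum(z[j] ** 2 for j in keep)
    # idealized detector: coordinate j is found iff z_j^2 >= norm2 / B
    found = [j for j in keep if z[j] == target and z[j] ** 2 >= norm2 / B]
    if found:
        estimate = len(found) * (n // m)  # scale the count back up
        print(m, estimate)                # estimate ~ s_i = n^(1/2) = 256
        break
    m //= 2
```

On the full vector the middle group is invisible (each value is tiny relative to the L2 norm), but after enough halvings the surviving group members dominate the restricted norm and the scaled-up count recovers s_i.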
Wrapup
We give a 1-round, n^{1-2/p}-communication protocol
Optimal due to lower bounds [AMS, BJKS, CKS]
Yields an optimal n^{1-2/p}-space streaming algorithm (resolves [AMS])
Lots of details
Naive use of [CCF-C] requires more than one round, but we get one round
The randomness needed for the restrictions cannot be truly random
for the streaming algorithm; we use a PRG
My Other Work
Algorithms
Complexity theory
Longest common/increasing subsequence
Computational biology, clustering
Graph spanners, locally decodable codes
Cryptography
Broadcast encryption, torus-based crypto,
PIR, inference control, practical secure
function evaluation
Thank you!
The [CCF-C] protocol
Alice: random linear map h: [n] → {-1, 1}; computes Σ_j x_j h(j)
Bob: computes R = Σ_j (x-y)_j h(j) = Σ_j x_j h(j) – Σ_j y_j h(j)
Then E[h(i)R] = Σ_j (x-y)_j E[h(i)h(j)] = x_i – y_i
Repeat many times to reduce
the variance of the estimator
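A simulation of this estimator (illustrative sizes; each repetition draws a fresh sign map h):

```python
import random

rng = random.Random(2)
n, reps = 50, 2000
x = [rng.randrange(10) for _ in range(n)]
y = [rng.randrange(10) for _ in range(n)]

i = 7
acc = 0.0
for _ in range(reps):
    h = [rng.choice((-1, 1)) for _ in range(n)]      # random sign map
    R = sum(h[j] * (x[j] - y[j]) for j in range(n))  # the combined value
    acc += h[i] * R                                  # E[h(i)R] = x_i - y_i

avg = acc / reps
print(avg, x[i] - y[i])   # the average approaches x_i - y_i
```

The cross terms h(i)h(j) for j ≠ i average to zero, which is exactly the identity E[h(i)R] = x_i - y_i on the slide.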