Lecture 20 - Uncertainty and Privacy

COM S
COM S 453X – Spring 2017
Privacy Preserving Algorithms
and Data Security
Lecture 20: Uncertainty and Data Privacy
Prof. EWD Rozier
Shannon Information Theory and Limits
• Claude Elwood Shannon (1916-2001)
• If I have M messages, how many bits do I need to encode those messages?
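For M equally likely messages and a fixed-length code, the answer is the ceiling of log2(M) bits. A minimal sketch, assuming that setting; the function name `bits_needed` is illustrative and not from the lecture:

```python
def bits_needed(num_messages: int) -> int:
    """Smallest number of bits that gives every message a distinct fixed-length code."""
    if num_messages < 1:
        raise ValueError("need at least one message")
    # (M - 1).bit_length() equals ceil(log2(M)) for M >= 1, with no float rounding.
    return (num_messages - 1).bit_length()

for m in (2, 26, 52, 256):
    print(m, "messages ->", bits_needed(m), "bits")
```

With unequal message probabilities a variable-length code can use fewer bits on average, which is where entropy enters.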
Representing Information
• Deck of cards
• How do I represent a card using bits?
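One possible answer, sketched under an assumed layout (2 bits for the suit, 4 bits for the rank, 6 bits in total): 52 cards need at least 6 bits in a fixed-length code, since 2^5 = 32 < 52 <= 64 = 2^6. The names `encode_card` and `decode_card` and the ordering of suits and ranks are illustrative choices, not taken from the slides.

```python
SUITS = ["clubs", "diamonds", "hearts", "spades"]          # 4 suits -> 2 bits
RANKS = ["A", "2", "3", "4", "5", "6", "7",
         "8", "9", "10", "J", "Q", "K"]                    # 13 ranks -> 4 bits

def encode_card(rank: str, suit: str) -> int:
    """Pack a card into 6 bits: high 2 bits are the suit, low 4 bits the rank."""
    return (SUITS.index(suit) << 4) | RANKS.index(rank)

def decode_card(code: int) -> tuple:
    """Inverse of encode_card."""
    return RANKS[code & 0b1111], SUITS[code >> 4]

code = encode_card("Q", "hearts")
print(format(code, "06b"), decode_card(code))
```

This layout wastes 12 of the 64 codes, which hints at the question of whether a tighter encoding exists.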
Representing Information
• Can we do better?
• Under what assumptions?
Shannon Entropy
• The amount of “disorder” in a system.
• Probability of a given character showing up in a stream of characters.
• Relates to the information content of a stream.
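The link between character probabilities and information content can be made concrete with the standard entropy formula H = sum over characters of p * log2(1/p). A minimal sketch, assuming probabilities are estimated from observed character frequencies; `shannon_entropy` is an illustrative name:

```python
import math
from collections import Counter

def shannon_entropy(stream: str) -> float:
    """Entropy in bits per character, using observed character frequencies."""
    if not stream:
        return 0.0
    counts = Counter(stream)
    total = len(stream)
    # p * log2(1/p), summed over the distinct characters in the stream
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(shannon_entropy("aaaaaaaa"))  # one repeated symbol: nothing to learn
print(shannon_entropy("abababab"))  # two equally likely symbols
print(shannon_entropy("abcdefgh"))  # eight equally likely symbols
```

The more evenly spread the characters, the higher the entropy and the more bits per character any encoding needs on average.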
Bloom Filters
• Approximate set membership problem.
• Trade-off between the space and the false positive probability.
• Generalize the hashing ideas.
Approximate set membership problem
• Suppose we have a set S = {s1, s2, ..., sm} ⊆ universe U.
• Represent S in such a way that we can quickly answer “Is x an element of S?”
• To take as little space as possible, we allow false positives (i.e., x ∉ S, but we answer yes).
• If x ∈ S, we must answer yes.
Bloom filters
A Bloom filter consists of an array A[n] of n bits (space) and k independent random hash functions
h1, …, hk : U → {0, 1, …, n−1}
1. Initially set every bit of the array to 0.
2. ∀ s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k
(an entry can be set to 1 multiple times; only the first time has an effect).
3. To check whether x ∈ S, check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1.
If not, clearly x ∉ S.
If all A[hi(x)] are set to 1, we assume x ∈ S.
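The three steps can be sketched as a small class. This is an illustration rather than the lecture's implementation: salting SHA-256 with the index i is an assumed stand-in for k independent random hash functions, and the names `BloomFilter`, `add`, and `might_contain` are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits           # step 1: array A starts as all zeros

    def _positions(self, item: str):
        # Assumption: SHA-256 salted with i approximates the random hash
        # functions h_1..h_k, each mapping into {0, ..., n-1}.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item: str) -> None:
        for pos in self._positions(item):  # step 2: set A[h_i(s)] = 1
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # step 3: any 0 proves absence; all 1s means "assume present"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(n_bits=64, k=3)
for name in ["alice", "bob", "carol"]:
    bf.add(name)
print(bf.might_contain("alice"))    # True: inserted items are never missed
print(bf.might_contain("mallory"))  # usually False, but a false positive is possible
```

The method is named `might_contain` rather than `contains` to keep the one-sided error visible at the call site.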
[Figure: animated Bloom filter example with elements x1 and x2 inserted and a query y. Captions: Initially, each location is set to 0. Each element of S is hashed k times, and each hash location is set to 1. To check if y is in S, check the k hash locations. If a 0 appears, y is not in S. If only 1s appear, conclude that y is in S. This may yield a false positive.]
The probability of a false positive
• We assume the hash functions are random.
• After all the elements of S are hashed into the Bloom filter, the probability that a specific bit is still 0 is
$$p = \left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}$$
To simplify the analysis, we can assume that a fraction p of the entries are still 0 after all the elements of S are hashed into the Bloom filter.
• In fact, let X be the random variable counting those 0 positions. By a Chernoff bound,
$$\Pr\left(|X - np| \ge \varepsilon n\right) \le 2e\sqrt{n}\, e^{-n\varepsilon^2/(3p)}$$
• This implies that X/n will be very close to p with very high probability.
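The probability of a false positive f is
$$f = (1 - p)^k \approx \left(1 - e^{-km/n}\right)^k$$
• To find the optimal k that minimizes f, note that minimizing f is equivalent to minimizing g = ln(f) = k ln(1 − e^{−km/n}):
$$\frac{dg}{dk} = \ln\left(1 - e^{-km/n}\right) + \frac{km}{n} \cdot \frac{e^{-km/n}}{1 - e^{-km/n}}$$
The derivative is zero when e^{−km/n} = 1/2, so
⇒ k = ln(2) · (n/m)
⇒ f = (1/2)^k ≈ (0.6185)^{n/m}
• The false positive probability falls exponentially in n/m, the number of bits used per item!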
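Combining f = (1/2)^k with k = ln(2) · (n/m) gives n = −m ln(f) / (ln 2)^2, which can be used to size a filter for a target false positive rate. The sketch below follows this slide's notation (n bits, m items); the function names and the example target of 1% are illustrative assumptions.

```python
import math

def size_bloom_filter(m_items: int, target_f: float):
    """Return (n_bits, k) for m items and a target false positive rate."""
    # From f = (1/2)^k and k = ln(2) * n/m:  n = -m * ln(f) / (ln 2)^2
    n_bits = math.ceil(-m_items * math.log(target_f) / (math.log(2) ** 2))
    # k must be an integer, so round the optimum ln(2) * n/m
    k = max(1, round(math.log(2) * n_bits / m_items))
    return n_bits, k

def false_positive_rate(n_bits: int, m_items: int, k: int) -> float:
    """f = (1 - e^{-km/n})^k from the analysis above."""
    return (1 - math.exp(-k * m_items / n_bits)) ** k

n_bits, k = size_bloom_filter(1_000_000, 0.01)
print(n_bits, k, false_positive_rate(n_bits, 1_000_000, k))
```

Because k is rounded to an integer, the achieved rate lands close to the target rather than exactly on it.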
High-Level View
• A Bloom filter is like a hash table that simply uses one bit to keep track of whether an item hashed to the location.
• If k = 1, it is equivalent to a hashing-based fingerprint system.
• If n = cm for a small constant c, such as c = 8, then k = 5 or 6 and the false positive probability is just over 2%.
• It is interesting that when k is optimal, k = ln(2) · (n/m), then p = 1/2. An optimized Bloom filter looks like a random bit-string.
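The claim that an optimized filter is about half zeros can be checked with a small simulation. This is a sketch under the analysis's own assumption that every hash lands on a uniformly random position; `zero_fraction` and the chosen sizes are illustrative.

```python
import math
import random

def zero_fraction(n_bits: int, m_items: int, k: int, seed: int = 0) -> float:
    """Insert m items with k random hashes each; return the fraction of 0 bits."""
    rng = random.Random(seed)
    bits = [0] * n_bits
    for _ in range(m_items * k):
        # Assumption: each hash value is uniform over the n positions.
        bits[rng.randrange(n_bits)] = 1
    return bits.count(0) / n_bits

m_items = 10_000
n_bits = 8 * m_items                       # c = 8 bits per item
k = round(math.log(2) * n_bits / m_items)  # 8 ln 2 = 5.545..., rounded to 6
print("k =", k)
print("simulated zero fraction:", zero_fraction(n_bits, m_items, k))
print("predicted p = e^(-km/n):", math.exp(-k * m_items / n_bits))
```

With the integer k = 6 the predicted p is e^(−0.75), slightly below 1/2; the simulated fraction should sit very close to that prediction, as the Chernoff bound suggests.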
The main point
• Whenever you have a set or list, and space is an issue, a Bloom filter may be a useful alternative.
Start with an m bit array, filled with 0s. (In these slides m is the number of bits and n the number of items, the reverse of the earlier analysis.)
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at Hi(y). All k values must be 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive: all k values are 1, but y is not in S.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
n items, m = cn bits, k hash functions
False positive rate
[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions k (x-axis, 0 to 10) for m/n = 8. The minimum is at the optimal k = 8 ln 2 = 5.545...]
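The curve in this plot can be tabulated from the formula f = (1 − e^{−kn/m})^k with m/n = 8 bits per item. A short sketch; the function name and the dictionary `rates` are illustrative:

```python
import math

def false_positive_rate(bits_per_item: float, k: int) -> float:
    """f = (1 - e^{-k / (m/n)})^k, with m bits and n items."""
    return (1 - math.exp(-k / bits_per_item)) ** k

rates = {k: false_positive_rate(8, k) for k in range(1, 11)}
for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")

best_k = min(rates, key=rates.get)
print("best integer k:", best_k)
```

The real-valued optimum 8 ln 2 falls between 5 and 6, and the two neighbouring integers give nearly the same rate, so the curve is flat around its minimum.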
False Positives -- Theory
• For large enough universes, a data structure to represent n keys using kn bits has false positive probability at least
$$P_{err} \ge \frac{1}{2^k} - \varepsilon, \quad \varepsilon \to 0$$
• Can be matched by perfect hashing.
The main point (revised)
• Whenever you have a set or list, and space is an issue, a Bloom filter may be a useful alternative.
• Just be sure to consider the effects of the false positives!
Bloom Filters and Deletions
• Cache contents change
• Items both inserted and deleted.
• Insertions are easy – add bits to the Bloom filter
• Can Bloom filters handle deletions?
• Use Counting Bloom Filters to track insertions/deletions at hosts; send Bloom filters.
Handling Deletions
• Bloom filters can handle insertions, but not deletions.
[Figure: items xi and xj both hash into B.]
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
• If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj.
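The problem shows up as soon as two items share one position. The sketch below uses hand-picked, hypothetical hash positions (not real hash outputs) so that xi and xj overlap at position 4 of a 16-bit array.

```python
# Hypothetical hash positions for two items that share position 4.
positions = {"xi": [1, 4, 9], "xj": [4, 10, 13]}

B = [0] * 16
for item in positions:
    for a in positions[item]:
        B[a] = 1                     # ordinary Bloom filter insertion

def might_contain(item: str) -> bool:
    return all(B[a] == 1 for a in positions[item])

assert might_contain("xj")           # xj is present before any deletion

# Naive deletion of xi: reset each of its bits to 0.
for a in positions["xi"]:
    B[a] = 0

print(might_contain("xj"))           # False: xj was never deleted but is now missed
```

The naive deletion turns a structure with only false positives into one with false negatives as well.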
How can we fix this?
Counting Bloom Filters
Start with an array of m counters, filled with 0s.
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a].
B: 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0
To delete xj, decrement the corresponding counters.
B: 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0
Can obtain a corresponding Bloom filter by reducing to 0/1.
B: 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
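The counter-based scheme can be sketched as follows. As before, salted SHA-256 is an assumed stand-in for the hash functions Hi, and the class and method names are hypothetical. Counters here are unbounded Python integers, so overflow is not modeled.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m_counters: int, k: int):
        self.m = m_counters
        self.k = k
        self.counts = [0] * m_counters       # start with all counters at 0

    def _positions(self, item: str):
        # Assumption: SHA-256 salted with i approximates H_1..H_k.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.counts[a] += 1              # add 1 to B[a]

    def remove(self, item: str) -> None:
        # Only safe for items that were really added; removing a false
        # positive would corrupt the counters of other items.
        if not self.might_contain(item):
            raise KeyError(item)
        for a in self._positions(item):
            self.counts[a] -= 1              # decrement the same counters

    def might_contain(self, item: str) -> bool:
        return all(self.counts[a] > 0 for a in self._positions(item))

    def to_bloom_bits(self) -> list:
        """Reduce counters to 0/1 to obtain the corresponding Bloom filter."""
        return [1 if c > 0 else 0 for c in self.counts]

cbf = CountingBloomFilter(m_counters=16, k=3)
cbf.add("xi")
cbf.add("xj")
cbf.remove("xi")
print(cbf.might_contain("xj"))   # True: deleting xi no longer "deletes" xj
print(cbf.to_bloom_bits())
```

Only the 0/1 reduction needs to be sent to other hosts; the counters stay local to whoever tracks insertions and deletions.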
Counting Bloom Filters: Overflow
• Must choose counters large enough to avoid overflow.
• Poisson approximation suggests 4 bits/counter.
• Average load per counter using k = (ln 2) m/n hash functions is ln 2.
• Probability a counter has load at least 16:
$$\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78 \times 10^{-17}$$
• Failsafes possible.
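The overflow estimate can be reproduced numerically. This sketch assumes the Poisson approximation with mean ln 2 stated above; `poisson_pmf` is an illustrative helper, and the tail sum is cut off at 60 terms because later terms are negligible.

```python
import math

def poisson_pmf(j: int, lam: float) -> float:
    """P(load = j) for a Poisson-distributed load with mean lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

lam = math.log(2)                    # average load per counter
p16 = poisson_pmf(16, lam)           # the term quoted on the slide
tail = sum(poisson_pmf(j, lam) for j in range(16, 60))  # load at least 16

print(f"P(load = 16)  = {p16:.3e}")
print(f"P(load >= 16) = {tail:.3e}")
```

The j = 16 term dominates the tail, which is why the slide quotes it alone; a 4-bit counter holds values 0 to 15, so this is the per-counter overflow probability.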
Bloom Filters and ORAM?
Bloom Filters and PEKS?