
Privacy Preserving Data Mining –
Secure multiparty computation and
random response techniques
Li Xiong
CS573 Data Privacy and Security
Outline
• Privacy preserving two-party decision tree
mining using SMC protocols (Lindell & Pinkas ’00)
• Primitive SMC protocols
– Secure sum
– Secure union (encryption based)
– Secure max (probabilistic random response based)
– Secure union (probabilistic and randomization based)
• Secure data mining using sub protocols
• Random response for privacy preserving data
mining or data sanitization
Random response protocols
• Multi-round probabilistic protocols
• Randomization probability associated with
each round
• Random response with randomization
probability
Max Protocol – multi-round random
response
• Multiple rounds
• Randomization probability at round r:
– $P_r(r) = P_0 \cdot d^{\,r-1}$
• Local algorithm at round r and node i:
– If $g_{i-1}(r) \ge v_i$: $g_i(r) = g_{i-1}(r)$
– If $g_{i-1}(r) < v_i$:
• w/ prob $P_r(r)$: $g_i(r) = \mathrm{rand}\,[g_{i-1}(r), v_i)$
• w/ prob $1 - P_r(r)$: $g_i(r) = v_i$
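The local algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the starting global value of 0, the fixed ring order, and the default parameters are all assumptions made for the example.

```python
import random

def randomization_probability(p0, d, r):
    """P_r(r) = p0 * d**(r - 1) for round r = 1, 2, ..."""
    return p0 * d ** (r - 1)

def local_step(g_prev, v_i, p_r, rng):
    """Output of one node, given the incoming global value and its own value."""
    if g_prev >= v_i:
        return g_prev                      # nothing to contribute, pass it on
    if rng.random() < p_r:
        # Hide v_i behind a random value drawn from [g_prev, v_i).
        return g_prev + rng.random() * (v_i - g_prev)
    return v_i                             # contribute the true value

def max_protocol(values, p0=1.0, d=0.5, rounds=10, seed=0):
    """Run the multi-round protocol over a ring of nodes (assumed fixed order)."""
    rng = random.Random(seed)
    g = 0.0                                # assumed start value; inputs assumed >= 0
    for r in range(1, rounds + 1):
        p_r = randomization_probability(p0, d, r)
        for v in values:
            g = local_step(g, v, p_r, rng)
    return g

if __name__ == "__main__":
    print(max_protocol([30, 10, 40, 20]))
```

Because the global value never decreases and never exceeds the true maximum, more rounds (with a shrinking randomization probability) push the result toward the exact answer.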
Max Protocol - Illustration
[Figure: a global value, starting at 0, is passed around a ring of database nodes over successive rounds and converges to the maximum value, 40.]
Min/Max Protocol - Correctness
• Precision bound:
$1 - \prod_{j=1}^{r} P_r(j) = 1 - P_0^{\,r} \cdot d^{\,r(r-1)/2}$
– Converges with r
– Smaller P0 and d provide faster convergence
Min/Max Protocol - Cost
• Communication cost
– single round: O(n)
– Minimum # of rounds given
precision guarantee (1-e):
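One way to obtain the minimum number of rounds numerically is to search for the smallest r whose precision bound, 1 - P0^r * d^(r(r-1)/2), reaches 1 - e. The sketch below assumes that bound; the function names and the `max_rounds` cutoff are illustrative choices, not part of the protocol.

```python
def precision_bound(p0, d, r):
    """1 - prod_{j=1..r} P_r(j), with P_r(j) = p0 * d**(j - 1)."""
    return 1.0 - (p0 ** r) * d ** (r * (r - 1) / 2)

def min_rounds(p0, d, eps, max_rounds=10_000):
    """Smallest r with precision_bound(p0, d, r) >= 1 - eps."""
    for r in range(1, max_rounds + 1):
        if precision_bound(p0, d, r) >= 1.0 - eps:
            return r
    raise ValueError("precision not reached within max_rounds")

if __name__ == "__main__":
    print(min_rounds(p0=1.0, d=0.5, eps=0.1))
```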
Min/Max Protocol - Security
• Probability/confidence based metric: P(C|IR,R)
[Figure: confidence scale from 0 to 1 (midpoint 0.5), with the ends labeled Absolute Privacy and Provable Exposure.]
– Different types of exposures based on claim
• Data value: vi=a
• Data ownership: Vi contains a
– Change of beliefs
• P(C|IR,R) – P(C|R)
• P(C|IR, R) / P(C|R)
• Relationship to privacy in anonymization
– Change of beliefs P(C|D*, BR) – P(C|BR)
Min/Max Protocol – Security (Analysis)
• Upper bound for average expected change of beliefs:
$\max_{r} \; \frac{1}{2^{\,r-1}} \left(1 - P_0 \cdot d^{\,r-1}\right)$
• Larger P0 and d provide better privacy
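To see how the bound reacts to P0 and d, it can be evaluated directly. The sketch below reads the bound as the maximum over rounds r of (1 - P0 * d^(r-1)) / 2^(r-1); the function name and the finite search range `max_rounds` are assumptions for illustration.

```python
def belief_change_bound(p0, d, max_rounds=50):
    """max over r = 1..max_rounds of (1 - p0 * d**(r-1)) / 2**(r-1)."""
    return max((1 - p0 * d ** (r - 1)) / 2 ** (r - 1)
               for r in range(1, max_rounds + 1))

if __name__ == "__main__":
    for p0 in (0.5, 1.0):
        print(p0, belief_change_bound(p0, d=0.5))
```

With d = 0.5, raising P0 from 0.5 to 1.0 lowers the bound, which matches the observation that a larger P0 gives better privacy.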
Min/Max Protocol – Security (Experiments)
• Loss of privacy decreases with increasing number of nodes
• Probabilistic protocol achieves better privacy (close to 0)
• When n is large, anonymous protocol is actually okay!
Union
• Commutative encryption based approach
– Number of rounds: 2 rounds
– Each round: encryption and decryption
• Multi-round random-response approach?
[Figure: boolean vectors (bits b1 … bL) held by nodes p1 … pc are combined bit by bit with OR to form the group vector VG.]
• Each database has a boolean vector of the
data items
• Union vector is a logical OR of all vectors
Privacy Preserving Indexing of Documents on the Network, Bawa, 2003
Group Vector Protocol
[Figure: the group vector VG’ is passed among nodes p1 … pc, each holding its own vector v1 … vc.]
Pex = 1/2^r, Pin = 1 - Pex
for (i = 1; i <= L; i++)
    if (Vs[i] == 1 and VG’[i] == 0)
        set VG’[i] = 1 with prob. Pin
    if (Vs[i] == 0 and VG’[i] == 1)
        set VG’[i] = 0 with prob. Pex
(Processing of VG’ at node ps in round r)
r=1, Pex=1/2, Pin=1/2
r=2, Pex=1/4, Pin=3/4
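The per-node processing step can be written as a small Python function. This is a sketch under the rules stated above (Pex = 1/2^r, Pin = 1 - Pex); the function name, the list-of-bits representation, and the use of a seeded `random.Random` are assumptions for the example.

```python
import random

def process_group_vector(vg, vs, r, rng):
    """Return node s's round-r update of the group vector vg, given its own vector vs."""
    p_ex = 1 / 2 ** r          # probability of clearing a bit the node does not hold
    p_in = 1 - p_ex            # probability of setting a bit the node does hold
    out = list(vg)
    for i in range(len(out)):
        if vs[i] == 1 and out[i] == 0:
            if rng.random() < p_in:
                out[i] = 1
        elif vs[i] == 0 and out[i] == 1:
            if rng.random() < p_ex:
                out[i] = 0
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    print(process_group_vector([0, 1, 0, 1], [0, 0, 1, 1], r=1, rng=rng))
```

Positions where the node's bit and the group bit already agree are left untouched; as r grows, Pin approaches 1 and Pex approaches 0, so later rounds behave more and more like a plain OR.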
Random Shares based Secure Union
• Phase 1: random item addition
– Multiple rounds with a permuted ring
– Each node sends a random share of its item set and a random share of a random
item set
• Phase 2: random item removal
– Each node subtracts its random item set
Random Shares based Secure Union Analysis
• Item exposure attack
– An adversary makes a claim C on a particular item a
node i contributes to the final result (C: vi in xi)
• Set exposure attack
– An adversary makes a claim C on the whole set of
items a node i contributes to the final union result X
(C: xi = ai).
• Change of beliefs (posterior probability and prior
probability)
– P(C|IR,X) - P(C|X)
– P(C|IR,X)/P(C|X)
Exposure Risk – Set Exposure
• Disclosure decreases with increasing number of generated
random items and increasing number of participating nodes
• Set exposure risk is at or close to 0 for the probabilistic and crypto
approaches
Exposure Risk – Item Exposure
• Item exposure risk decreases with increasing number of
generated random items and participating nodes
• Item exposure risk for probabilistic approach is quite high
Cost Comparison
• Commutative protocol and anonymous communication protocol are
efficient but sensitive to union size
• Probabilistic protocol is efficient but sensitive to domain size
• Estimated runtime for the general circuit-based protocol
implemented by FairplayMP framework is 15 days, 127 days and 1.4
years for the domain sizes tested
Open issues
• Tradeoff between accuracy, efficiency, and security
• How to quantify security
• How to design adjustable protocols
• Can we generalize the random-response algorithms
and randomization algorithms for operators based on
their properties?
– Operators: sum, union, max, min …
– Properties: commutative, associative, invertible,
randomizable
Specific Secure Tools
• Secure Sum
• Secure Comparison
• Secure Union
• Secure Logarithm
• Secure Poly. Evaluation
Data Mining on Horizontally Partitioned Data
• Association Rule Mining
• Decision Trees
• EM Clustering
• Naïve Bayes Classifier
Specific Secure Tools
• Secure Comparison
• Secure Set Intersection
• Secure Dot Product
• Secure Logarithm
• Secure Poly. Evaluation
Data Mining on Vertically Partitioned Data
• Association Rule Mining
• Decision Trees
• K-means Clustering
• Naïve Bayes Classifier
• Outlier Detection
Summary of SMC Based PPDDM
• Mainly used for distributed data mining.
• Efficient, problem-specific cryptographic solutions have
been developed for many distributed data mining
problems.
• Random response or randomization based
protocols offer a tradeoff between accuracy,
efficiency, and security
• Mainly assume semi-honest parties (i.e., parties
follow the protocols)
Ongoing research
• New models that offer a better trade-off
between efficiency and security
• Game theoretic / incentive issues in PPDM
Outline
• Privacy preserving two-party decision tree
mining using SMC protocols (Lindell & Pinkas ’00)
• Primitive SMC protocols
– Secure sum
– Secure union (encryption based)
– Secure max (probabilistic random response based)
– Secure union (probabilistic and randomization based)
• Secure data mining using sub protocols
• Random response for privacy preserving data
mining or data collection
Data Collection Model
• Step 1: Data Collection (from individuals to the data publisher)
• Step 2: Data Publishing (from the data publisher to the data miner)
• Individual data cannot be shared directly because of privacy
concerns
Randomized Response
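• Question: Do you smoke? (The true answer is “Yes”)
• Biased coin: $P(\text{Head}) = \theta$, $\theta \ne 0.5$
– Head: answer “Yes” (the true answer)
– Tail: answer “No” (the opposite answer)
• Distribution of the reported answers:
$P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta)$
$P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta$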
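Since P(No) = 1 - P(Yes), the first equation can be solved for the true proportion: P(Yes) = (P'(Yes) - (1 - θ)) / (2θ - 1), which is why θ must differ from 0.5. The sketch below simulates the survey and applies that inversion; the function names, the seed, and the sample data are assumptions made for illustration.

```python
import random

def randomize(truth, theta, rng):
    """Report the true answer with probability theta, otherwise its opposite."""
    return truth if rng.random() < theta else (not truth)

def observed_yes_fraction(true_answers, theta, seed=0):
    """Fraction of 'Yes' reports after every respondent randomizes."""
    rng = random.Random(seed)
    reports = [randomize(t, theta, rng) for t in true_answers]
    return sum(reports) / len(reports)

def estimate_true_yes(observed_yes, theta):
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta)."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the reports uninformative")
    return (observed_yes - (1 - theta)) / (2 * theta - 1)

if __name__ == "__main__":
    answers = [True] * 6000 + [False] * 14000   # hypothetical population, P(Yes) = 0.3
    p_obs = observed_yes_fraction(answers, theta=0.8)
    print(p_obs, estimate_true_yes(p_obs, theta=0.8))
```

The estimate recovers only the aggregate proportion; any single reported answer remains deniable because it may have been flipped.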
Randomized Response
• Multiple attributes encoded in bits
• Biased coin: $P(\text{Head}) = \theta$, $\theta \ne 0.5$
– Head: true answer E: 110
– Tail: false answer !E: 001
Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
Generalization for Multi-Valued Categorical
Data
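• True value $s_i$ is reported as $s_i$, $s_{i+1}$, $s_{i+2}$, $s_{i+3}$ with probabilities $q_1$, $q_2$, $q_3$, $q_4$
\[
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
\]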
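The matrix form suggests a direct way to recover the original distribution: compute P' = M P when perturbing, and solve M P = P' when estimating. The following sketch builds the circulant matrix from q1 … q4 and solves the system with plain Gaussian elimination; the function names and the example values of q and P are assumptions for illustration.

```python
def circulant_rr_matrix(q):
    """M[i][j] = P(report s_i | true value s_j) = q[(i - j) mod n]."""
    n = len(q)
    return [[q[(i - j) % n] for j in range(n)] for i in range(n)]

def perturb(m, p):
    """Distribution of reported values: P' = M P."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("RR matrix is singular; P cannot be recovered")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

if __name__ == "__main__":
    m = circulant_rr_matrix([0.7, 0.1, 0.1, 0.1])   # hypothetical q1..q4
    p_reported = perturb(m, [0.4, 0.3, 0.2, 0.1])   # hypothetical true distribution
    print(p_reported, solve(m, p_reported))
```

In practice P' is estimated from a finite sample of reports, so the recovered P is only approximate; the exact round trip here just checks the algebra.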
A Generalization
• RR Matrices [Warner 65], [R.Agrawal 05],
[S. Agrawal 05]
• RR Matrix can be arbitrary
\[
M = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\]
• Can we find optimal RR matrices?
OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
What is an optimal matrix?
• Which of the following is better?
1 0 0


M1  0 1 0

0 0 1


13
1
M 2  3
1

3
1
3
1
3
1
3
1
3
1
3
1
3





• Privacy: M2 is better
• Utility: M1 is better
• So, what is an optimal matrix?
Optimal RR Matrix
• An RR matrix M is optimal if no other RR
matrix is better than M in both privacy and utility
(i.e., no other matrix dominates M).
– Privacy Quantification
– Utility Quantification
• A number of privacy and utility metrics have
been proposed.
– Privacy: how accurately one can estimate individual info.
– Utility: how accurately we can estimate aggregate info.
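The notion of dominance can be made concrete with a small Python sketch. It assumes each matrix has already been scored as a (privacy, utility) pair with higher values being better, and it uses the usual Pareto convention (no worse in both, strictly better in at least one); the scores and names below are hypothetical.

```python
def dominates(a, b):
    """True if score a is at least as good as b in both objectives and better in one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(scored):
    """scored: dict name -> (privacy, utility). Returns names of non-dominated matrices."""
    return sorted(
        name for name, s in scored.items()
        if not any(dominates(t, s) for other, t in scored.items() if other != name)
    )

if __name__ == "__main__":
    # Hypothetical scores: an identity-like matrix (best utility, worst privacy),
    # a uniform matrix (best privacy, worst utility), and two in between.
    scores = {"M1": (0.0, 1.0), "M2": (1.0, 0.0), "M3": (0.5, 0.5), "M4": (0.4, 0.4)}
    print(pareto_front(scores))
```

Neither extreme dominates the other, so both stay in the optimal set alongside any compromise that is not beaten on both objectives at once.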
Optimization Methods
• Approach 1: Weighted sum:
w1 Privacy + w2 Utility
• Approach 2:
– Fix Privacy, find M with the optimal Utility.
– Fix Utility, find M with the optimal Privacy.
– Challenge: Difficult to generate M with a fixed
privacy or utility.
• Proposed Approach: Multi-Objective
Optimization
Optimization algorithm
• Evolutionary Multi-Objective Optimization (EMOO)
• The algorithm
– Start with a set of initial RR matrices
– Repeat the following steps in each iteration
• Mating: selecting two RR matrices in the pool
• Crossover: exchanging several columns between the two RR
matrices
• Mutation: changing some values in an RR matrix
• Meet the privacy bound: filtering the resulting matrices
• Evaluate the fitness value for the new RR matrices.
Note: the fitness value is defined in terms of privacy and utility metrics
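The loop above can be sketched as a toy evolutionary search. This is only an illustration of the mating, crossover, mutation, filtering, and fitness steps: the privacy and utility metrics, the form of the privacy bound, the eviction rule, and all parameter values are placeholder assumptions, not the metrics or the algorithm variant used in OptRR.

```python
import random

def normalize_columns(m):
    """Rescale each column to sum to 1 (column j = report distribution for true value j)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m

def random_matrix(n, rng):
    return normalize_columns([[rng.random() + 1e-9 for _ in range(n)] for _ in range(n)])

def crossover(a, b, rng):
    """Child takes each whole column from parent a or parent b."""
    n = len(a)
    from_a = [rng.random() < 0.5 for _ in range(n)]
    return [[(a if from_a[j] else b)[i][j] for j in range(n)] for i in range(n)]

def mutate(m, rng, rate=0.1):
    """Nudge a few entries upward, then renormalize the columns."""
    bumped = [[x + (0.2 * rng.random() if rng.random() < rate else 0.0) for x in row]
              for row in m]
    return normalize_columns(bumped)

def meets_privacy_bound(m, bound):
    # Toy bound (assumption): no single entry of M may exceed `bound`.
    return max(max(row) for row in m) <= bound

def fitness(m):
    # Toy metrics (assumptions), both "higher is better".
    n = len(m)
    privacy = 1.0 - max(max(row) for row in m)
    utility = sum(m[i][i] for i in range(n)) / n
    return (privacy, utility)

def dominates(s, t):
    return all(x >= y for x, y in zip(s, t)) and s != t

def emoo(n=3, pool_size=20, iterations=200, bound=0.9, seed=0):
    rng = random.Random(seed)
    pool = []
    while len(pool) < pool_size:                      # initial RR matrices
        cand = random_matrix(n, rng)
        if meets_privacy_bound(cand, bound):
            pool.append(cand)
    for _ in range(iterations):
        a, b = rng.sample(pool, 2)                    # mating
        child = mutate(crossover(a, b, rng), rng)     # crossover, then mutation
        if not meets_privacy_bound(child, bound):     # privacy-bound filter
            continue
        pool.append(child)
        scores = [fitness(m) for m in pool]           # fitness evaluation
        dominated = [k for k, s in enumerate(scores)
                     if any(dominates(t, s) for t in scores)]
        # Keep the pool size fixed: evict a dominated matrix if there is one.
        pool.pop(dominated[0] if dominated else rng.randrange(len(pool)))
    scores = [fitness(m) for m in pool]
    return [m for m, s in zip(pool, scores)
            if not any(dominates(t, s) for t in scores)]

if __name__ == "__main__":
    front = emoo()
    print(len(front), sorted(fitness(m) for m in front)[:3])
```

Exchanging whole columns keeps every child a valid RR matrix, since each column is itself a probability distribution; only mutation needs a renormalization step.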
Illustration
Output of Optimization
The optimal set is often plotted in the objective space as a
Pareto front.
[Figure: RR matrices M1–M8 plotted in the objective space, with axes Privacy and Utility and Worse/Better direction markers; results for the first attribute of the Adult data.]
Summary
• Privacy preserving data mining
– Secure multi-party computation protocols
– Random response techniques for computation
and data collection
• Knowledge sensitive data mining