Privacy in Database Publishing

A presentation by
Avinash Vyas, PhD Student
Yannis Katsis, PhD Student

Database Seminar, 02/24/2006
Outline
• Defining Privacy
• Optimization Problem
  – First-Cut Solution (k-anonymity)
  – Second-Cut Solution (l-diversity)
• Decision Problem
  – First-Cut (Query-View Security)
  – Second-Cut (View Safety)
Defining Privacy in DB Publishing

Privacy in this talk IS NOT the traditional security of data
(e.g. hacking, access control, theft of disks): NO FOUL PLAY is involved.
Defining Privacy in DB Publishing

Privacy in this talk IS the logical security of data.
If the attacker uses only legitimate methods:
– Can she infer the data I want to keep private? (Decision Problem)
– How can I keep some data private while publishing useful info? (Optimization Problem)

[Figure: Alice modifies her data and publishes views V1, V2; the attacker combines
them with external knowledge to try to learn the secret.]
Outline (next: the Optimization Problem)
Need for Privacy in DB Publishing

• Alice is an owner of person-specific data
  – e.g. a public health agency, telecom provider, or financial organization
• The person-specific data contains
  – attribute values that can uniquely identify an individual:
    {zip-code, gender, date-of-birth} and/or {name} and/or {SSN}
  – sensitive information about the individuals:
    medical condition, salary, location
• There is great demand for sharing person-specific data
  – medical research, new telecom applications
• Alice wants to publish this person-specific data s.t.
  – the information remains practically useful
  – the identity of the individuals cannot be determined
The Optimization Problem
Motivating Example

Secret: Alice wants to publish hospital data, while the
correspondence between Name and Disease stays private.

   Non-Sensitive Data            Sensitive Data
#  Zip    Age  Nationality   Name     Condition
1  13053  28   Brazilian     Ronaldo  Heart Disease
2  13067  29   US            Bob      Heart Disease
3  13053  37   Indian        Kumar    Cancer
4  13067  36   Japanese      Umeko    Cancer
The Optimization Problem
Motivating Example (continued)

Published Data: Alice publishes the data without the Name column.

   Non-Sensitive Data            Sensitive Data
#  Zip    Age  Nationality   Condition
1  13053  28   Brazilian     Heart Disease
2  13067  29   US            Heart Disease
3  13053  37   Indian        Cancer
4  13067  36   Japanese      Cancer

Attacker's Knowledge: Voter registration list

#  Name   Zip    Age  Nationality
1  John   13067  45   US
2  Paul   13067  22   US
3  Bob    13067  29   US
4  Chris  13067  23   US
Data Leak! Joining the published table with the voter registration list on
(Zip, Age, Nationality) re-identifies Bob: the voter entry (Bob, 13067, 29, US)
matches only tuple #2 of the published data, so the attacker learns that Bob
has Heart Disease.
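A minimal sketch of this linking attack, assuming the two tables are available as Python lists of dictionaries (the table contents come from the slides; the helper name `link` is just illustrative):

```python
# Toy illustration of the linking (re-identification) attack.
published = [  # anonymized hospital data (Name removed)
    {"zip": "13053", "age": 28, "nat": "Brazilian", "condition": "Heart Disease"},
    {"zip": "13067", "age": 29, "nat": "US",        "condition": "Heart Disease"},
    {"zip": "13053", "age": 37, "nat": "Indian",    "condition": "Cancer"},
    {"zip": "13067", "age": 36, "nat": "Japanese",  "condition": "Cancer"},
]
voters = [  # public voter registration list
    {"name": "John",  "zip": "13067", "age": 45, "nat": "US"},
    {"name": "Paul",  "zip": "13067", "age": 22, "nat": "US"},
    {"name": "Bob",   "zip": "13067", "age": 29, "nat": "US"},
    {"name": "Chris", "zip": "13067", "age": 23, "nat": "US"},
]

def link(published, voters):
    """Join the two tables on the quasi-identifier (zip, age, nationality)."""
    for v in voters:
        for p in published:
            if (v["zip"], v["age"], v["nat"]) == (p["zip"], p["age"], p["nat"]):
                yield v["name"], p["condition"]

print(list(link(published, voters)))   # [('Bob', 'Heart Disease')]
```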
The Optimization Problem
Source of the Problem

Even if we do not publish the individuals' names:
• Some combination of the remaining fields (Zip, Age, Nationality) may still uniquely
  identify an individual. Such a combination is called a Quasi-Identifier.
• The attacker can use the quasi-identifier to join with other sources and identify
  the individuals.
Outline (next: First-Cut Solution, k-anonymity)
The Optimization Problem
First-Cut Solution: k-Anonymity

L. Sweeney: Achieving k-Anonymity Privacy Protection Using Generalization and Suppression

Instead of returning the original data:
• Change the data such that for each tuple in the result there are at least
  k-1 other tuples with the same value for the quasi-identifier.

e.g.

Original Table
#  Zip    Age  Nationality  Condition
1  13053  28   Brazilian    Heart Disease
2  13067  29   US           Heart Disease
3  13053  37   Indian       Cancer
4  13067  36   Japanese     Cancer

2-anonymous Table
#  Zip    Age   Nationality  Condition
1  13053  < 40  *            Heart Disease
2  13067  < 40  *            Heart Disease
3  13053  < 40  *            Cancer
4  13067  < 40  *            Cancer

4-anonymous Table
#  Zip    Age   Nationality  Condition
1  130**  < 40  *            Heart Disease
2  130**  < 40  *            Heart Disease
3  130**  < 40  *            Cancer
4  130**  < 40  *            Cancer
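A minimal sketch of the k-anonymity check over a table represented as Python tuples (our own illustration of the definition, not code from the paper); the quasi-identifier columns and the rows follow the 4-anonymous table above:

```python
from collections import Counter

# Generalized table from the example: (Zip, Age, Nationality, Condition)
table = [
    ("130**", "< 40", "*", "Heart Disease"),
    ("130**", "< 40", "*", "Heart Disease"),
    ("130**", "< 40", "*", "Cancer"),
    ("130**", "< 40", "*", "Cancer"),
]
QI = (0, 1, 2)  # indexes of the quasi-identifier columns

def is_k_anonymous(table, qi, k):
    """Every quasi-identifier value must occur in at least k tuples."""
    groups = Counter(tuple(row[i] for i in qi) for row in table)
    return all(count >= k for count in groups.values())

print(is_k_anonymous(table, QI, 4))  # True
print(is_k_anonymous(table, QI, 5))  # False
```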
The Optimization Problem > k-Anonymity
Generalization & Suppression

Different ways of modifying data:
• Randomization
• Data swapping
• …
• Generalization: replace the value with a less specific but semantically consistent value
• Suppression: do not release a value at all

#  Zip    Age   Nationality  Condition
1  13053  < 40  *            Heart Disease
2  13067  < 40  *            Heart Disease
3  13053  < 40  *            Cancer
4  13067  < 40  *            Cancer
The Optimization Problem > k-Anonymity
Generalization Hierarchies

• Generalization Hierarchies: the data owner defines how values can be generalized.

  Zip (max level 3):          13053, 13058, 13063, 13067  →  1305*, 1306*  →  130**  →  *
  Age (max level 3):          28, 29, 36, 37  →  < 30, 3*  →  < 40  →  *
  Nationality (max level 2):  Brazilian, US, Indian, Japanese  →  American, Asian  →  *

• Table Generalization: a table generalization is created by generalizing all values
  in a column to a specific level of its hierarchy.

e.g. a 2-anonymization:
#  Zip    Age   Nationality  Condition
1  130**  < 30  American     Heart Disease
2  130**  < 30  American     Heart Disease
3  130**  3*    Asian        Cancer
4  130**  3*    Asian        Cancer
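A minimal sketch (our own, not from the paper) of applying such hierarchies: each attribute gets a list of mapping functions, one per level, and a table generalization picks one level per column:

```python
# Hypothetical generalization hierarchies for the example table.
# Level 0 is the identity; higher levels are coarser.
ZIP_LEVELS = [
    lambda z: z,                 # 13053
    lambda z: z[:4] + "*",       # 1305*
    lambda z: z[:3] + "**",      # 130**
    lambda z: "*",
]
AGE_LEVELS = [
    lambda a: str(a),
    lambda a: "< 30" if a < 30 else "3*",
    lambda a: "< 40" if a < 40 else ">= 40",
    lambda a: "*",
]
NAT_LEVELS = [
    lambda n: n,
    lambda n: "American" if n in ("Brazilian", "US") else "Asian",
    lambda n: "*",
]

def generalize(table, levels):
    """Generalize every quasi-identifier column to the chosen hierarchy level."""
    (zl, al, nl) = levels
    return [(ZIP_LEVELS[zl](zip_), AGE_LEVELS[al](age), NAT_LEVELS[nl](nat), cond)
            for (zip_, age, nat, cond) in table]

original = [("13053", 28, "Brazilian", "Heart Disease"),
            ("13067", 29, "US",        "Heart Disease"),
            ("13053", 37, "Indian",    "Cancer"),
            ("13067", 36, "Japanese",  "Cancer")]

print(generalize(original, (2, 1, 1)))  # the 2-anonymization shown above
```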
The Optimization Problem > k-Anonymity
k-minimal Generalizations

• There are many k-anonymizations. Which one should we pick?
  The ones that do not generalize the data more than needed.

k-minimal Generalization: a k-anonymization that is not a generalization of
another k-anonymization.

e.g.

2-minimal Generalization A
#  Zip    Age   Nationality  Condition
1  13053  < 40  *            Heart Disease
2  13067  < 40  *            Heart Disease
3  13053  < 40  *            Cancer
4  13067  < 40  *            Cancer

2-minimal Generalization B
#  Zip    Age   Nationality  Condition
1  130**  < 30  American     Heart Disease
2  130**  < 30  American     Heart Disease
3  130**  3*    Asian        Cancer
4  130**  3*    Asian        Cancer

Non-minimal 2-anonymization (a generalization of both tables above)
#  Zip    Age   Nationality  Condition
1  130**  < 40  *            Heart Disease
2  130**  < 40  *            Heart Disease
3  130**  < 40  *            Cancer
4  130**  < 40  *            Cancer
The Optimization Problem > k-Anonymity
k-minimal Distortions

• There are many k-minimal generalizations. Which one should we pick?
  The ones that create the minimum distortion to the data.

k-minimal Distortion: a k-minimal generalization that has the least distortion, where

  D = ( Σ over attributes i of
        (current level of generalization for attribute i
         / max level of generalization for attribute i) )
      / number of attributes

e.g. for the two 2-minimal generalizations of the previous slide:

Generalization A (Zip unchanged, Age → < 40, Nationality → *):
  D = (0/3 + 2/3 + 2/2) / 3 ≈ 0.56

Generalization B (Zip → 130**, Age → < 30 / 3*, Nationality → American / Asian):
  D = (2/3 + 1/3 + 1/2) / 3 = 0.5

so Generalization B is the 2-minimal distortion.
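A small sketch of the distortion metric under the hierarchy depths used above (Zip and Age have 3 generalization levels, Nationality has 2); the names are our own:

```python
# Max generalization level per quasi-identifier attribute (from the hierarchies above).
MAX_LEVEL = {"zip": 3, "age": 3, "nationality": 2}

def distortion(levels):
    """levels: chosen generalization level per attribute, e.g. {'zip': 0, ...}"""
    return sum(levels[a] / MAX_LEVEL[a] for a in MAX_LEVEL) / len(MAX_LEVEL)

print(round(distortion({"zip": 0, "age": 2, "nationality": 2}), 2))  # 0.56 (generalization A)
print(round(distortion({"zip": 2, "age": 1, "nationality": 1}), 2))  # 0.5  (generalization B)
```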
The Optimization Problem > k-Anonymity
Complexity & Algorithms

Search Space:
• Number of generalizations =
  Π over attributes i of (max level of generalization for attribute i + 1)
• If we allow generalization to a different level for each value of an attribute:
  Number of generalizations =
  Π over attributes i of (max level of generalization for attribute i + 1)^#tuples

Finding an optimal k-anonymization is NP-hard!

See the paper for:
• a naïve brute-force algorithm
• heuristics: Datafly, µ-Argus
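For the example hierarchies (max levels 3, 3 and 2, and 4 tuples), the two counting formulas give the following sizes; this is our own back-of-the-envelope check, not a figure from the paper:

```python
max_levels = [3, 3, 2]   # Zip, Age, Nationality
num_tuples = 4

# One generalization level per attribute (whole-column generalization):
per_column = 1
for m in max_levels:
    per_column *= (m + 1)
print(per_column)        # 4 * 4 * 3 = 48 candidate table generalizations

# One level per value of each attribute (cell-level generalization):
per_cell = 1
for m in max_levels:
    per_cell *= (m + 1) ** num_tuples
print(per_cell)          # 4^4 * 4^4 * 3^4 = 5,308,416 candidates
```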
The Optimization Problem > k-Anonymity
k-Anonymity Drawbacks

k-Anonymity alone does not provide privacy if:
• the sensitive attributes lack diversity within a group, or
• the attacker has background knowledge.
The Optimization Problem > k-Anonymity
k-Anonymity Attack Example

Original Data (Quasi-Identifier: ZIP, Age, Nationality; Sensitive: Condition)
#   ZIP    Age  Nationality  Condition
1   13053  28   Russian      Heart Disease
2   13068  29   American     Heart Disease
3   13068  21   Japanese     Viral Infection
4   13053  23   American     Viral Infection
5   14853  50   Indian       Cancer
6   14853  55   Russian      Heart Disease
7   14850  47   American     Viral Infection
8   14850  49   American     Viral Infection
9   13053  31   American     Cancer
10  13053  37   Indian       Cancer
11  13068  36   Japanese     Cancer
12  13068  35   American     Cancer

The attacker knows:
• The quasi-identifiers of two individuals:
  Umeko: Zip 13068, Age 21, Japanese
  Bob:   Zip 13053, Age 31, American
• Other background knowledge:
  Japanese have a low incidence of heart disease.
The Optimization Problem > k-Anonymity
k-Anonymity Attack Example (continued)

4-anonymization (Quasi-Identifier: ZIP, Age, Nationality; Sensitive: Condition)
#   ZIP    Age    Nationality  Condition
1   130**  < 30   *            Heart Disease
2   130**  < 30   *            Heart Disease
3   130**  < 30   *            Viral Infection
4   130**  < 30   *            Viral Infection
5   1485*  >= 40  *            Cancer
6   1485*  >= 40  *            Heart Disease
7   1485*  >= 40  *            Viral Infection
8   1485*  >= 40  *            Viral Infection
9   130**  3*     *            Cancer
10  130**  3*     *            Cancer
11  130**  3*     *            Cancer
12  130**  3*     *            Cancer

Data Leak!
• Umeko (13068, 21, Japanese) falls in group 1-4, which contains only Heart Disease and
  Viral Infection; since Japanese rarely have heart disease, Umeko has a Viral Infection!
• Bob (13053, 31, American) falls in group 9-12, where every tuple has Cancer,
  so Bob has Cancer!
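A toy sketch of these two attacks on the 4-anonymous table; the hand-coded generalization rules and the function name are our own illustration, not part of the papers:

```python
# 4-anonymous table: (zip, age_group, condition)
table = [
    ("130**", "< 30", "Heart Disease"), ("130**", "< 30", "Heart Disease"),
    ("130**", "< 30", "Viral Infection"), ("130**", "< 30", "Viral Infection"),
    ("1485*", ">= 40", "Cancer"), ("1485*", ">= 40", "Heart Disease"),
    ("1485*", ">= 40", "Viral Infection"), ("1485*", ">= 40", "Viral Infection"),
    ("130**", "3*", "Cancer"), ("130**", "3*", "Cancer"),
    ("130**", "3*", "Cancer"), ("130**", "3*", "Cancer"),
]

def candidate_conditions(zip_code, age, excluded=()):
    """Conditions compatible with a known (zip, age), minus background knowledge."""
    zip_gen = zip_code[:3] + "**" if zip_code.startswith("130") else zip_code[:4] + "*"
    age_gen = "< 30" if age < 30 else ("3*" if age < 40 else ">= 40")
    conds = {c for (z, a, c) in table if (z, a) == (zip_gen, age_gen)}
    return conds - set(excluded)

# Umeko: background knowledge rules out Heart Disease -> only Viral Infection remains.
print(candidate_conditions("13068", 21, excluded=["Heart Disease"]))  # {'Viral Infection'}
# Bob: every tuple in his group has Cancer.
print(candidate_conditions("13053", 31))                              # {'Cancer'}
```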
Outline (next: Second-Cut Solution, l-diversity)
The Optimization Problem
Second-Cut Solution: l-Diversity

A. Machanavajjhala et al.: l-Diversity: Privacy Beyond k-Anonymity

Return a k-anonymization with the additional property that:
• for each distinct value of the quasi-identifier there exist at least l different
  values of the sensitive attribute.
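A minimal sketch of the (distinct) l-diversity check, in the same style as the earlier k-anonymity check; this is our illustration of the definition, not code from the paper:

```python
from collections import defaultdict

def is_l_diverse(table, qi, sensitive, l):
    """Each quasi-identifier group must contain at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in table:
        groups[tuple(row[i] for i in qi)].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# Rows: (Zip, Age, Nationality, Condition); QI = columns 0-2, sensitive = column 3.
block = [
    ("1306*", "<= 40", "*", "Heart Disease"),
    ("1306*", "<= 40", "*", "Viral Infection"),
    ("1306*", "<= 40", "*", "Cancer"),
    ("1306*", "<= 40", "*", "Cancer"),
]
print(is_l_diverse(block, (0, 1, 2), 3, 3))  # True: 3 distinct conditions in the group
```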
The Optimization Problem > l-Diversity
l-Diversity Example

3-diversified table (the attack no longer works)
#   ZIP    Age    Nationality  Condition
1   1305*  <= 40  *            Heart Disease
2   1306*  <= 40  *            Heart Disease
3   1306*  <= 40  *            Viral Infection
4   1305*  <= 40  *            Viral Infection
5   1485*  >= 40  *            Cancer
6   1485*  >= 40  *            Heart Disease
7   1485*  >= 40  *            Viral Infection
8   1485*  >= 40  *            Viral Infection
9   1305*  <= 40  *            Cancer
10  1305*  <= 40  *            Cancer
11  1306*  <= 40  *            Cancer
12  1306*  <= 40  *            Cancer

• Umeko (13068, 21, Japanese): her group now contains Heart Disease, Viral Infection
  and Cancer; even with the background knowledge, she may have a Viral Infection or Cancer.
• Bob (13053, 31, American): his group contains Viral Infection, Cancer and Heart Disease,
  so any of the three is possible.
Outline (next: the Decision Problem)
The Decision Problem
Moving from practice to theory…

Gerome Miklau, Dan Suciu: A Formal Analysis of Information Disclosure in Data Exchange

• k-anonymity and l-diversity make it harder for the attacker to figure out private
  associations…
• … but they still give away some knowledge, and they do not give any guarantees on
  the amount of data being disclosed.
• Alice wants to publish some views of her data and wants to know:
  – Do her views disclose some sensitive data?
  – If she adds a new view, will there be an additional data disclosure?
The Decision Problem
Motivating Example

Secret: Alice wants to keep the correlation between Name and Condition secret
  S = (name, condition)

#  Zip    Name     Condition
1  13053  Ronaldo  Heart Disease
2  13067  Bob      Heart Disease
3  13053  Kumar    Viral Infection
4  13067  Umeko    Cancer

Published Views: Alice publishes the views
  V1 = (zip, name)             V2 = (zip, condition)

  Zip    Name                  Zip    Condition
  13053  Ronaldo               13053  Heart Disease
  13067  Bob                   13067  Heart Disease
  13053  Kumar                 13053  Viral Infection
  13067  Umeko                 13067  Cancer
The Decision Problem
Motivating Example (continued)

Attacker's Knowledge:

Before seeing the views (assuming he knows the domain):
  Ronaldo could have any condition: Heart Disease, Viral Infection, or Cancer.

After seeing the views:
  V1 tells him that Ronaldo lives in zip 13053; V2 tells him that zip 13053 contains
  only Heart Disease and Viral Infection. So Ronaldo has Heart Disease or Viral Infection.
  Data Leak!
The Decision Problem > Model for attacker's knowledge
Probability of possible tuples

• Domain of possible values for all attributes: D = {Bob, Mary}
• Set of possible tuples of the binary relation R (e.g. cooksFor):
    (Bob, Bob)    x1 = 1/2
    (Bob, Mary)   x2 = 1/2
    (Mary, Bob)   x3 = 1/2
    (Mary, Mary)  x4 = 1/2
• The attacker assigns a probability to each possible tuple.
The Decision Problem > Model for attacker's knowledge
Probability of possible Databases

• This implies a probability for each of the 16 possible database instances
  (each tuple is independently present or absent), e.g.:

  I = {(Bob, Bob)}:               P = x1 (1 - x2)(1 - x3)(1 - x4) = 1/16
  I = {(Bob, Mary)}:              P = (1 - x1) x2 (1 - x3)(1 - x4) = 1/16
  I = {(Bob, Bob), (Bob, Mary)}:  P = x1 x2 (1 - x3)(1 - x4) = 1/16
  …
The Decision Problem > Model for attacker's knowledge
Probability of possible Secrets

• This implies a probability for each possible secret value.

Probability that the secret S(y) :- R(x, y) equals s = {(Bob)}:
the sum of the probabilities of the instances that return this query result,

  P[S(I) = s] = 3/16

(the three such instances are {(Bob, Bob)}, {(Mary, Bob)} and {(Bob, Bob), (Mary, Bob)}).

Similarly for the probability that a view V equals v: P[V(I) = v].
The Decision Problem > Model for attacker's knowledge
Prior & Posterior Probability

• Prior Probability: the probability before seeing the view instance.
  For the secret S(y) :- R(x, y):  P[S(I) = {(Bob)}] = 3/16

• Posterior Probability: the probability after seeing the view instance.
  For the view V(x) :- R(x, y), suppose V(I) = {(Mary)}. Then

  P[S(I) = {(Bob)} | V(I) = {(Mary)}]
      = P[S(I) = {(Bob)} AND V(I) = {(Mary)}] / P[V(I) = {(Mary)}]
      = (1/16) / (3/16) = 1/3

  (the only instance with both S(I) = {(Bob)} and V(I) = {(Mary)} is {(Mary, Bob)}).
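These numbers are easy to verify by brute-force enumeration over the 16 possible instances; the following sketch (our own, with each tuple independently present with probability 1/2) reproduces the 3/16 and 1/3 above:

```python
from itertools import chain, combinations

DOMAIN = ["Bob", "Mary"]
TUPLES = [(x, y) for x in DOMAIN for y in DOMAIN]

def instances():
    """All 16 possible instances of R; each is equally likely (probability 1/16)."""
    return chain.from_iterable(combinations(TUPLES, n) for n in range(len(TUPLES) + 1))

S = lambda inst: frozenset(y for (_, y) in inst)   # S(y) :- R(x, y)
V = lambda inst: frozenset(x for (x, _) in inst)   # V(x) :- R(x, y)

secret, view = frozenset({"Bob"}), frozenset({"Mary"})
worlds      = list(instances())
prior       = sum(1 for I in worlds if S(I) == secret) / len(worlds)
view_worlds = [I for I in worlds if V(I) == view]
posterior   = sum(1 for I in view_worlds if S(I) == secret) / len(view_worlds)

print(prior)      # 0.1875       (= 3/16)
print(posterior)  # 0.333...     (= 1/3)
```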
The Decision Problem
Query-View Security

• A query S is secure w.r.t. a set of views V if,
  for any possible answer s to S and for any possible answer v to V:

  P[S(I) = s]  =  P[S(I) = s | V(I) = v]
  (prior probability)    (posterior probability)

Intuitively, if some possible answer to S becomes more or less probable after
publishing the views V, then S is not secure w.r.t. V.
The Decision Problem
From Probabilities to Logic

• A possible tuple t is a critical tuple of a query Q if for some possible instance I:
    Q[I] ≠ Q[I - {t}]
  i.e. the query result changes when t is removed.
  Intuitively, the critical tuples are those that matter to the query.

• A query S is secure w.r.t. a set of views V iff:
    crit(S) ∩ crit(V) = ∅
  In particular, the probability distribution does not affect the security of a query.
The Decision Problem
Example of a Non-Secure Query

Previous example revisited:
  Secret  S(y) :- R(x, y)
  View    V(x) :- R(x, y)

Critical tuples for S: crit(S) = all four possible tuples
  e.g. S({(Mary, Mary)}) ≠ S({})
Critical tuples for V: crit(V) = all four possible tuples

crit(S) ∩ crit(V) ≠ ∅, so S is not secure w.r.t. V.
The Decision Problem
Example of a Secure Query

Example 2:
  Secret  S(x) :- R(x, 'Mary')
  View    V(x) :- R(x, 'Bob')

Critical tuples for S: crit(S) = { (Bob, Mary), (Mary, Mary) }
Critical tuples for V: crit(V) = { (Bob, Bob), (Mary, Bob) }

crit(S) ∩ crit(V) = ∅, so S is secure w.r.t. V.
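A brute-force sketch of crit(Q) and of the disjointness test for this Example 2, over the tiny Bob/Mary domain (our own illustration of the definitions, not code from the paper):

```python
from itertools import chain, combinations

DOMAIN = ["Bob", "Mary"]
TUPLES = [(x, y) for x in DOMAIN for y in DOMAIN]

def instances():
    return [set(c) for c in
            chain.from_iterable(combinations(TUPLES, n) for n in range(len(TUPLES) + 1))]

def crit(query):
    """Tuples t such that query(I) != query(I - {t}) for some possible instance I."""
    return {t for t in TUPLES for I in instances()
            if query(I) != query(I - {t})}

# Example 2: S(x) :- R(x, 'Mary'),  V(x) :- R(x, 'Bob')
S = lambda I: frozenset(x for (x, y) in I if y == "Mary")
V = lambda I: frozenset(x for (x, y) in I if y == "Bob")

print(crit(S))                        # {('Bob', 'Mary'), ('Mary', 'Mary')} (order may vary)
print(crit(V))                        # {('Bob', 'Bob'), ('Mary', 'Bob')}   (order may vary)
print(crit(S).isdisjoint(crit(V)))    # True -> S is secure w.r.t. V
```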
The Decision Problem
Example of a Secure Query (continued)

Example 2 revisited using the probabilistic definition of security:
  Secret  S(x) :- R(x, 'Mary')
  View    V(x) :- R(x, 'Bob')

  Prior:      P[S(I) = {(Mary)}] = 4/16 = 1/4
              (4 of the 16 possible instances contain (Mary, Mary) but not (Bob, Mary))
  Posterior:  P[S(I) = {(Mary)} | V(I) = {(Bob)}] = 1/4

The two probabilities are equal, so S is secure w.r.t. V.
The Decision Problem
Properties of Query-View Security

• Reflexivity
  – If S is secure w.r.t. V, then V is secure w.r.t. S
• No obscurity
  – The view definitions, the secret query and the schema are not concealed
• Instance independence
  – If S is secure w.r.t. V, it remains secure even if the underlying database instance changes
• Probability distribution independence
  – Security does not depend on the probability distribution, provided S and V are monotone queries
• Domain independence
  – If S is secure w.r.t. V for a domain D0 such that |D0| <= n(n+1),
    then S is secure w.r.t. V for all domains D with |D| <= n(n+1)
• Complexity of query-view security
  – Π2^p-complete
The Decision Problem
Prior Knowledge

• Prior knowledge
  – knowledge other than the domain D and the probability distribution P
  – e.g. a key or foreign key constraint
• Represented as a Boolean query K over the instance
• Query-view security then requires:
  P[S(I) = s | K(I)]  =  P[S(I) = s | V(I) = v ∧ K(I)]
The Decision Problem
Measuring Disclosure

• Query-view security is very strong
  – it rules out most of the views used in practice as insecure
• Applications are often ready to tolerate some disclosure
• Disclosure examples:
  – Positive disclosure: "Bob" has "Cancer"
  – Negative disclosure: "Umeko" does not have "Heart Disease"
• Measure of positive disclosure:
  Leak(S, V) = sup over s, v of  ( P[s ∈ S(I) | v ∈ V(I)] - P[s ∈ S(I)] ) / P[s ∈ S(I)]
• The disclosure is minute if:
  Leak(S, V) << 1
The Decision Problem
Query-View Security Drawbacks

• Tuples are modeled as mutually independent
  – This is not the case in the presence of constraints (e.g. foreign key constraints)
• Modeling prior or external knowledge
  – A single Boolean predicate does not suffice
• The restriction to conjunctive queries is limiting
• The guarantees are instance-independent
  – There may not be a privacy breach given the current instance
Outline (next: Second-Cut, View Safety)
The Decision Problem
A more general setting

Alin Deutsch, Yannis Papakonstantinou: Privacy in Database Publishing

• Alice has a database D which conforms to schema S.
  – D satisfies a set of constraints Σ.
  – V is a set of views over D.
• The attacker's belief is modeled as a probability distribution.
• Views and queries are defined using unions of conjunctive queries (UCQ).
• Alice wants to publish an additional view N.
  Does view N provide any new information to the attacker about the answer to a
  secret query Q?
The Decision Problem
Motivating Example (without Constraints)

Secret: Alice wants to hide the reviewers of paper P1
  S(r) :- RP(r, 'P1')

Base data:
  RC (Reviewer, Committee): (R1, C1), (R2, C1), (R3, C2), (R4, C3)
  CP (Committee, Paper):    (C1, P1), (C1, P2), (C2, P3), (C3, P4)
  RP (Reviewer, Paper):     (R1, P1), (R2, P2), (R3, P3), (R4, P4)

Published Views:
  V1(r) :- RC(r, c)   →  R1, R2, R3, R4
  V2(c) :- RC(r, c)   →  C1, C2, C3

New Additional Views:
  N1(r, c) :- RC(r, c)   →  the full RC table
  N2(c, p) :- CP(c, p)   →  the full CP table

Without constraints, the new views reveal nothing about the secret.
The Decision Problem
Motivating Example (with Constraint 1)

Published Views:
  V1(r) :- RC(r, c)   →  R1, R2, R3, R4
  V2(c) :- RC(r, c)   →  C1, C2, C3
New Additional Views:
  N1(r, c) :- RC(r, c)   →  (R1, C1), (R2, C1), (R3, C2), (R4, C3)
  N2(c, p) :- CP(c, p)   →  (C1, P1), (C1, P2), (C2, P3), (C3, P4)

Constraint 1:
  Papers assigned to a committee can only be reviewed by committee members:
  ∀r ∀p  RP(r, p) → ∃c  RC(r, c) ∧ CP(c, p)

Possible secrets with the new views: {R1}, {R2}, {R1, R2}
(only the members of committee C1, to which P1 is assigned, can review P1).

Data disclosure depends on the constraints.
The Decision Problem
Motivating Example (with Constraints 1 & 2)

Same published and new views as before.

Constraint 1:
  Papers assigned to a committee can only be reviewed by committee members.
Constraint 2:
  Each paper has exactly 2 reviewers.

Possible secrets with the new views: only {R1, R2}
(the two members of committee C1), so the secret is fully disclosed.

Data disclosure depends on the constraints.
The Decision Problem
Motivating Example (different instance)

Now consider a different instance:
  RC (Reviewer, Committee): (R1, C0), (R2, C0), (R3, C0), (R4, C0)
  CP (Committee, Paper):    (C0, P1), (C0, P2), (C0, P3), (C0, P4)

Published Views:
  V1(r) :- RC(r, c)   →  R1, R2, R3, R4
  V2(c) :- RC(r, c)   →  C0
New Additional Views:
  N1(r, c) :- RC(r, c),  N2(c, p) :- CP(c, p)

Constraint 1:
  Papers assigned to a committee can only be reviewed by committee members.

The new views reveal nothing about the secret,
since any subset of the reviewers in V1 may review paper 'P1'.

Data disclosure depends on the instance.
The Decision Problem
Probabilities Revisited: Plausible Secrets

• In order to allow correlation between tuples, the attacker assigns probabilities to
  the plausible secrets (the outcomes of the query S that are possible given the
  published views).

e.g. in the previous example with Constraint 1 and secret S(r) :- RP(r, 'P1'):

Published Views:   V1(r) :- RC(r, c) = {R1, R2, R3, R4},  V2(c) :- RC(r, c) = {C1, C2, C3}
Plausible Secrets: any subset of V1, e.g. with probabilities
  P1 = P[S = {R1}]     = 3/8
  P2 = P[S = {R2}]     = 1/8
  P3 = P[S = {R3}]     = 2/8
  P4 = P[S = {R1, R2}] = 2/8
  Pi = 0 for all other plausible secrets (i > 4)
The Decision Problem
Possible Worlds

• This induces a probability distribution on the set of possible worlds (the possible
  instances that satisfy the constraints and agree with the published views).

[Figure: for the secret value S = {(R1)}, the slide shows two example possible worlds,
each a full instance of RC, CP and RP that is consistent with V1 and V2 and in which
only R1 reviews P1; their probabilities are denoted PG1 and PG2.]
The Decision Problem
Probability Distribution on Possible Worlds

• The induced probability distribution can be:
  General: the sum of the probabilities of the possible worlds for any secret value s
  equals the probability of S = s
  (e.g. PG1 + PG2 + … = P1 = 3/8 for s = {(R1)}).
The Decision Problem
Probability Distribution on Possible Worlds (continued)

• The induced probability distribution can also be:
  Equiprobable: each of the possible worlds for any secret value s is equally probable,
  i.e. equal to the probability of S = s divided by the number of possible worlds for s
  (e.g. PG1 = PG2 = … for s = {(R1)}).
The Decision Problem
A priori & a posteriori belief

• A priori belief: the belief of the attacker in S = s before seeing the new views

  PG(S = s | V = v) = (sum of probabilities of possible worlds for S = s)
                      / (sum of probabilities of all possible worlds)

• A posteriori belief: the belief of the attacker in S = s after seeing the new views

  PG(S = s | V = v ∧ N = n) = (sum of probabilities of possible worlds for S = s)
                              / (sum of probabilities of all possible worlds)

  Notice that the set of possible worlds will typically change after publishing the
  new views N.
The Decision Problem
Privacy Guarantees

• A set of new view instances n is safe w.r.t. a query S and an initial set of view
  instances v if for any plausible secret s:

  P[S = s | V = v]  =  P[S = s | V = v ∧ N = n]
  (a priori probability)    (a posteriori probability)

• We can also obtain database-instance-independent guarantees by quantifying the
  guarantee over all instances of the proprietary database.
The Decision Problem
Example of View Safety

Paper example revisited…

Published Views:
  V1(r) :- RC(r, c) = {R1, R2, R3, R4},  V2(c) :- RC(r, c) = {C1, C2, C3}
New Additional Views:
  N1(r, c) :- RC(r, c) = {(R1, C1), (R2, C1), (R3, C2), (R4, C3)}
  N2(c, p) :- CP(c, p) = {(C1, P1), (C1, P2), (C2, P3), (C3, P4)}

Plausible secrets before seeing N: any subset of V1, e.g.
  P[{R1}] = 3/8,  P[{R2}] = 1/8,  P[{R3}] = 2/8,  P[{R1, R2}] = 2/8

Plausible secrets after seeing N: only the following 3: {R1}, {R2}, {R1, R2},
with new probabilities P1', P2', P4'.

In particular:
  P[S = {(R3)} | V = v] = 2/8  ≠  P[S = {(R3)} | V = v ∧ N = n] = 0

so publishing N is not safe.
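A minimal sketch of this safety check, hard-coding the example's plausible-secret probabilities; the numbers and the fact that N reveals committee C1 = {R1, R2} come from the slides, while the function names are our own:

```python
# Attacker's belief over plausible secrets after seeing V1, V2 (from the example).
prior = {
    frozenset({"R1"}): 3/8,
    frozenset({"R2"}): 1/8,
    frozenset({"R3"}): 2/8,
    frozenset({"R1", "R2"}): 2/8,
}

def posterior(prior, still_plausible):
    """Condition the belief on the new views by discarding secrets they rule out."""
    kept = {s: p for s, p in prior.items() if still_plausible(s)}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}

# N1, N2 reveal that P1 is assigned to committee C1 = {R1, R2}, so (under Constraint 1)
# only subsets of {R1, R2} remain plausible.
post = posterior(prior, lambda s: s <= frozenset({"R1", "R2"}))

r3 = frozenset({"R3"})
print(prior[r3])          # 0.25 -> a priori, {R3} is a plausible secret
print(post.get(r3, 0.0))  # 0.0  -> a posteriori it is ruled out: N is not safe
```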
The Decision Problem
View Safety for General Distributions

• For general induced distributions:

  P[S = s | V = v]  =  P[S = s | V = v ∧ N = n]    (a priori = a posteriori)

  iff  the set of possible worlds before seeing N
       = the set of possible worlds after seeing N

  But the possible worlds are infinite in number! How do we compute this?

  iff  the set of templates of possible worlds before seeing N
       = the set of templates of possible worlds after seeing N
The Decision Problem
Templates

• Templates are a finite summarization of a set of possible worlds.

e.g. for the schema R(A, B, C), the view V(A, C) :- R(A, B, C) and the view extent

  A   C
  a1  c1
  a2  c2

the possible worlds are summarized by templates that fill the unseen B column with
labeled nulls:

  A   B   C          A   B   C
  a1  x1  c1         a1  x3  c1
  a2  x2  c2         a2  x3  c2
The Decision Problem
View Safety for Equiprobable Distributions

• For equiprobable distributions, the set of possible worlds may change while the
  probabilities do not. e.g.:
  – Before the new views: 200 possible worlds in total,
    100 of them for S = s1 and 100 of them for S = s2.
  – After the new views: 100 possible worlds in total,
    50 of them for S = s1 and 50 of them for S = s2.
  Since every possible world discarded from S = s1 had the same probability (and
  similarly for S = s2), what counts is the ratio
  (# possible worlds for S = s1) / (# possible worlds for S = s2),
  which stayed the same.
The Decision Problem
View Safety for Equiprobable Distributions (continued)

• For equiprobable distributions:

  P[S = s | V = v]  =  P[S = s | V = v ∧ N = n]    (a priori = a posteriori)

  iff  the set of plausible secrets before seeing N
       = the set of plausible secrets after seeing N
  AND  for all plausible secrets s1, s2:
       (# possible worlds for S = s1) / (# possible worlds for S = s2) before seeing N
       = (# possible worlds for S = s1) / (# possible worlds for S = s2) after seeing N

  The possible worlds are infinite in number! How do we compute this? → Templates.
Summary

• Models for information disclosure
  – k-anonymity
  – Probabilistic
• k-Anonymity
  – Tension between usability and anonymity
  – Optimal or minimal modification via
    • Suppression
    • Generalization
• Probabilistic model
  – Very strong guarantees
  – Probability distribution and instance independence
  – Reduced to logical statements over
    • the database (critical tuples)
    • templates
Thanks