
Beyond k-Anonymity:
A Decision Theoretic Framework
for Assessing Privacy Risk
M. Scannapieco, G. Lebanon,
M.R. Fouad and E. Bertino
Introduction
• Release of data
  – Private organizations can benefit from sharing data with others
  – Public organizations see data as a value for society
• Privacy preservation
  – Data disclosure can lead to economic damages, threats to national security, etc.
  – Regulated by law in both the private and public sectors
Two Facets of Data Privacy
• Identity disclosure
  – Uncontrolled data release: identifiers may even be present
  – Anonymous data release: identifiers are suppressed, but there is no control over possible linking with other sources
T1:
PrivateID  SSN  DOB       ZIP    Health_Problem
a               11/20/67  00198  Shortness of breath
b               02/07/81  00159  Headache
c               02/07/81  00156  Obesity
d               08/07/76  00198  Shortness of breath
Linkage of Anonymous Data
T1 is linked with T2 through the QUASI-IDENTIFIER (DOB, ZIP):

T2:
PrivateID  SSN  DOB       ZIP    Employment        Marital Status
1          A    11/20/67  00198  Researcher        Married
5          E    08/07/76  00114  Private Employee  Married
3          C    02/07/81  00156  Public Employee   Widow
Two Facets of Data Privacy (cont.)
• Sensitive information disclosure
  – Once identity disclosure occurs, the loss due to such disclosure depends on how sensitive the related data are
  – Data sensitivity is subjective
    • E.g., age is generally more sensitive for women than for men
Our proposal
• A framework for assessing privacy risk that takes into account both facets of privacy
  – based on statistical decision theory
• Definition and analysis of disclosure policies, modelled by disclosure rules, and of several privacy risk functions
• Estimated risk as an upper bound of the true risk, and related complexity analysis
• Algorithm for finding the disclosure rule that minimizes the privacy risk
Disclosure rules
• A disclosure rule δ is a function that maps a record z to a new record δ(z) in which some attributes may have been suppressed:

    [δ(z)]_j = ⊥     if the j-th attribute is suppressed
               z_j   otherwise
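The suppression rule above can be sketched in code; the names `BOTTOM` and `disclose` are illustrative, assuming the symbol ⊥ is represented by None:

```python
# Sketch of a disclosure rule: given a record (a tuple of attribute values)
# and a set of attribute indices to suppress, return the disclosed record
# with the suppressed attributes replaced by the symbol ⊥ (None here).

BOTTOM = None  # stands for the suppression symbol ⊥

def disclose(record, suppressed):
    """[δ(z)]_j = ⊥ if j is suppressed, z_j otherwise."""
    return tuple(BOTTOM if j in suppressed else z_j
                 for j, z_j in enumerate(record))

# Example: suppress SSN (index 0) and ZIP (index 2)
print(disclose(("123-45-6789", "11/20/67", "00198"), {0, 2}))
# -> (None, '11/20/67', None)
```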
Loss function
• Let θ ∈ Θ be the side information used by the attacker in the identification attempt
• The loss function
    ℓ : X × Θ → [0, ∞]
  measures the loss incurred by disclosing the data δ(z), due to possible identification based on θ
• Empirical distribution p associated with the records x1…xn:
    p(z) = (1/n) Σ_{i=1}^n 1{z = x_i}
Risk Definition
• The risk of the disclosure rule δ in the presence of the side information θ is the average loss of disclosing x1…xn:

    R(δ, θ) = E_{p(z)}[ℓ(δ(z), θ)] = (1/n) Σ_{i=1}^n ℓ(δ(x_i), θ)
Putting the pieces together so far…
• A hypothetical attacker performs an identification attempt on a disclosed record y = δ(x), on the basis of side information θ, which can be a dictionary
• The dictionary is used to link y with some entry present in the dictionary
• Example:
  – y has the form (name, surname, phone#); θ is a phone book
  – if all attributes are revealed, it is likely that y is linked with one entry
  – if phone# is suppressed (or missing), y may or may not be linked to a single entry, depending on the popularity of (name, surname)
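The phone-book example can be sketched as follows, assuming suppressed or missing attributes are represented by None; `matches`, `rho`, and the phone-book data are illustrative names, not from the paper:

```python
# ρ(y, θ): the set of dictionary entries consistent with the disclosed
# record y, where a suppressed attribute (None) matches any value.

def matches(y, entry):
    return all(a is None or a == b for a, b in zip(y, entry))

def rho(y, theta):
    return [entry for entry in theta if matches(y, entry)]

phone_book = [("John", "Smith", "555-0100"),
              ("John", "Smith", "555-0199"),
              ("Mary", "Jones", "555-0123")]

# all attributes revealed: y is linked to exactly one entry
print(len(rho(("John", "Smith", "555-0100"), phone_book)))  # -> 1
# phone# suppressed: (John, Smith) is popular, so two candidates remain
print(len(rho(("John", "Smith", None), phone_book)))        # -> 2
```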
Risk formulation
• Let's decompose the loss function into an identification part and a sensitivity part
• Identification part: formalized by the random variable Z(δ(x_i)), where ρ(δ(x_i), θ) denotes the set of entries in θ that δ(x_i) can be linked to:

    p_{Z(δ(x_i))}(1) = 1 / |ρ(δ(x_i), θ)|   if ρ(δ(x_i), θ) ≠ ∅
                       0                     otherwise
Risk formulation (cont.)
• Sensitivity part:
    Φ : X → [0, ∞]
  where higher values indicate higher sensitivity
• Therefore the loss is:

    ℓ(y, θ) = E_{p_{Z(y)}}[Φ(y) · Z(y)]
            = p_{Z(y)}(1) · Φ(y) + p_{Z(y)}(0) · 0
            = Φ(y) / |ρ(y, θ)|
Risk formulation (cont.)
• Risk:

    R(δ, θ) = E_{p(x)}[ℓ(δ(x), θ)] = (1/n) Σ_{i=1}^n Φ(δ(x_i)) / |ρ(δ(x_i), θ)|
Disclosure Rule vs. Privacy Risk
• Suppose that θtrue is the true attacker's dictionary, which is publicly available, and that θ* is the actual database from which data will be published
• Under the following assumptions:
  – θtrue contains more records than θ* (θ* ⊆ θtrue)
  – the non-⊥ values in θtrue will be more limited than the non-⊥ values in θ*

Theorem: If θ* contains records that correspond to x1, …, xn and θ* ⊆ θtrue, then:

    R(δ, θtrue) ≤ R(δ, θ*)
Disclosure Rule vs. Privacy Risk (cont.)
• The theorem proves that the true risk is bounded above by R(δ, θ*)
• Under the hypothesis that the distribution underlying θ factorizes into a product form:

Theorem: The rule minimizing the risk, δ* = arg min_δ R(δ, θ), can be found in O(nNm) computations
k-Anonymity
• k-Anonymity is SIMPLY a special case of our framework, in which:
  – θtrue = T
  – Φ is a constant
  – ρ is underspecified
• Our framework highlights some questionable hypotheses of k-anonymity!
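For comparison, the standard k-anonymity requirement (every quasi-identifier combination must be shared by at least k rows) can be sketched as follows; the function name and toy data are illustrative, not from the paper:

```python
# Standard k-anonymity check: count how often each quasi-identifier
# combination occurs, and require every count to be at least k.
from collections import Counter

def is_k_anonymous(table, qi_indices, k):
    counts = Counter(tuple(row[j] for j in qi_indices) for row in table)
    return all(c >= k for c in counts.values())

# Toy table with quasi-identifier (DOB, ZIP) at indices 0 and 1
table = [("02/07/81", "0015*", "Headache"),
         ("02/07/81", "0015*", "Obesity")]
print(is_k_anonymous(table, (0, 1), 2))  # -> True
print(is_k_anonymous(table, (0, 1), 3))  # -> False
```

Note that the check never looks at the remaining attributes, which is exactly the constant-Φ (sensitivity-blind) assumption the slide points out.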
Conclusions
• New framework for privacy risk that takes sensitivity into account
• Risk estimation as an upper bound for the true privacy risk
• Efficient algorithm for risk computation
• k-Anonymity generalization