Beyond k-Anonymity:
A Decision Theoretic Framework
for Assessing Privacy Risk
M. Scannapieco, G. Lebanon,
M. R. Fouad and E. Bertino
Introduction
Release of data
– Private organizations can benefit from sharing data with others
– Public organizations see data as a value for society
Privacy preservation
– Data disclosure can lead to economic damages, threats to national security, etc.
– Regulated by law in both the private and public sectors
Two Facets of Data Privacy
Identity disclosure
– Uncontrolled data release: identifiers may even be present in the release
– Anonymous data release: identifiers suppressed, but no control on possible linking with other sources
Example of an anonymous release (SSN values suppressed, marked *):

PrivateID | SSN | DOB      | ZIP   | Health_Problem
a         | *   | 11/20/67 | 00198 | Shortness of breath
b         | *   | 02/07/81 | 00159 | Headache
c         | *   | 02/07/81 | 00156 | Obesity
d         | *   | 08/07/76 | 00198 | Shortness of breath
Linkage of Anonymous Data
The anonymous table T1 (above) can be linked to a public table T2 through the QUASI-IDENTIFIER (DOB, ZIP):

T2:
PrivateID | SSN | DOB      | ZIP   | Employment       | Marital Status
1         | A   | 11/20/67 | 00198 | Researcher       | Married
5         | E   | 08/07/76 | 00114 | Private Employee | Married
3         | C   | 02/07/81 | 00156 | Public Employee  | Widow

E.g., record c of T1 (02/07/81, 00156) matches record 3 of T2, re-identifying the individual behind c.
Two Facets of Data Privacy (cont.)
Sensitive information disclosure
– Once identity disclosure occurs, the loss due to such disclosure depends on how sensitive the related data are
– Data sensitivity is subjective
• E.g.: age is in general more sensitive for women than for men
Our proposal
A framework for assessing privacy risk that takes into account both facets of privacy
– based on statistical decision theory
Definition and analysis of disclosure policies, modelled by disclosure rules, and of several privacy risk functions
Estimated risk as an upper bound of the true risk, and related complexity analysis
Algorithm for finding the disclosure rule that minimizes the privacy risk
Disclosure rules
A disclosure rule δ is a function that maps a record to a new record in which some attributes may have been suppressed:

$$[\delta(z)]_j = \begin{cases} \bot & \text{if the } j\text{-th attribute is suppressed} \\ z_j & \text{otherwise} \end{cases}$$
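A minimal sketch of a disclosure rule in Python, with None standing for the suppression symbol ⊥ (the helper name is illustrative; the paper treats δ abstractly):

```python
from typing import Optional, Tuple

Record = Tuple[Optional[str], ...]  # None plays the role of bottom (suppressed)

def make_disclosure_rule(suppressed):
    """Return a rule delta that suppresses the attribute indices in `suppressed`."""
    def delta(z):
        return tuple(None if j in suppressed else zj for j, zj in enumerate(z))
    return delta

# Example: suppress attribute 2 (phone#) of a (name, surname, phone#) record.
delta = make_disclosure_rule({2})
print(delta(("John", "Doe", "555-0100")))  # -> ('John', 'Doe', None)
```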
Loss function
Let θ be the side information used by the attacker in the identification attempt.
The loss function ℓ, taking values in [0, ∞], measures the loss incurred by disclosing the data δ(z) due to possible identification based on θ.
Empirical distribution p̄ associated with the records x1, …, xn:

$$\bar{p}(z) = \frac{1}{n} \sum_{i=1}^{n} 1\{z = x_i\}$$
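A quick illustration of p̄ (a hypothetical helper, not from the paper):

```python
from collections import Counter

def empirical_p(records):
    """p_bar(z) = (1/n) * #{i : x_i = z}; returns a record -> probability dict."""
    n = len(records)
    return {z: c / n for z, c in Counter(records).items()}

xs = [("a", 1), ("a", 1), ("b", 2)]
print(empirical_p(xs))  # {('a', 1): 0.666..., ('b', 2): 0.333...}
```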
Risk Definition
The risk of the disclosure rule δ in the presence of the side information θ is the average loss of disclosing x1, …, xn:

$$R(\delta, \theta) = E_{\bar{p}}\!\left[\ell(\delta(z), \theta)\right] = \frac{1}{n} \sum_{i=1}^{n} \ell(\delta(x_i), \theta)$$
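In code the risk is just the empirical average of per-record losses; a minimal sketch assuming a user-supplied loss(y, theta):

```python
def risk(delta, records, theta, loss):
    """R(delta, theta) = (1/n) * sum_i loss(delta(x_i), theta)."""
    return sum(loss(delta(x), theta) for x in records) / len(records)
```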
Putting the pieces together so far…
A hypothetical attacker performs an identification attempt on a disclosed record y = δ(x) on the basis of side information θ, which can be a dictionary
The dictionary is used to link y with some entry present in the dictionary
Example:
– y has the form (name, surname, phone#); θ is a phone book
– if all attributes are revealed, y is likely linked with one entry
– if phone# is suppressed (or missing), y may or may not be linked to a single entity, depending on the popularity of (name, surname)
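A sketch of this linking step, with None standing for a suppressed or missing value and the dictionary θ as a list of fully specified entries (the rho name mirrors the ρ of the next slides; all names here are illustrative):

```python
def rho(y, theta):
    """Entries of the dictionary theta consistent with y (None = suppressed)."""
    return [e for e in theta
            if all(yj is None or yj == ej for yj, ej in zip(y, e))]

phone_book = [("John", "Doe", "555-0100"),
              ("John", "Doe", "555-0199"),
              ("Jane", "Roe", "555-0123")]
print(len(rho(("John", "Doe", "555-0100"), phone_book)))  # 1: unique link
print(len(rho(("John", "Doe", None), phone_book)))        # 2: ambiguous link
```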
Risk formulation
Let's decompose the loss function into an identification part and a sensitivity part.
Identification part: formalized by the random variable Z, where ρ(y, θ) denotes the set of entries of θ consistent with y:

$$p_{Z(\delta(x_i))}(1) = \begin{cases} \dfrac{1}{|\rho(\delta(x_i), \theta)|} & \text{if } |\rho(\delta(x_i), \theta)| \neq 0 \\ 0 & \text{otherwise} \end{cases}$$
Risk formulation (cont.)
Sensitivity part: a function Φ taking values in [0, ∞], where higher values indicate higher sensitivity.
Therefore the loss is:

$$\ell(y, \theta) = E_{p_{Z(y)}}\!\left[\Phi(y)\, Z(y)\right] = p_{Z(y)}(1)\,\Phi(y) + p_{Z(y)}(0)\cdot 0 = \frac{\Phi(y)}{|\rho(y, \theta)|}$$
Risk formulation (cont.)
Risk:

$$R(\delta, \theta) = E_{\bar{p}}\!\left[\ell(\delta(z), \theta)\right] = \frac{1}{n} \sum_{i=1}^{n} \frac{\Phi(\delta(x_i))}{|\rho(\delta(x_i), \theta)|}$$
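Putting the closed form into code: a minimal sketch, assuming the hypothetical rho helper (None standing for ⊥) and a user-supplied sensitivity function phi; records with no matching dictionary entry contribute zero loss, per the p_Z(1) = 0 case above:

```python
def rho(y, theta):
    """Dictionary entries consistent with y (None = suppressed, matches anything)."""
    return [e for e in theta
            if all(yj is None or yj == ej for yj, ej in zip(y, e))]

def risk(delta, records, theta, phi):
    """R(delta, theta) = (1/n) * sum_i Phi(delta(x_i)) / |rho(delta(x_i), theta)|."""
    total = 0.0
    for x in records:
        y = delta(x)
        matches = rho(y, theta)
        # p_Z(1) = 1/|rho(y, theta)| when some entry matches, 0 otherwise
        total += phi(y) / len(matches) if matches else 0.0
    return total / len(records)
```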
Disclosure Rule vs. Privacy Risk
Suppose that θtrue is the true attacker's dictionary, which is publicly available, and that θ* is the actual database from which the data will be published.
Under the following assumptions:
– θtrue contains more records than θ* (θ* ≤ θtrue)
– the non-⊥ values in θtrue are more limited than the non-⊥ values in θ*
Theorem: If θ* contains records that correspond to x1, …, xn and θ* ≤ θtrue, then:

$$R(\delta, \theta_{\mathrm{true}}) \leq R(\delta, \theta^*)$$
Disclosure Rule vs. Privacy Risk (cont.)
The theorem proves that the true risk is bounded by R(δ, θ*): the risk estimated on the database at hand upper-bounds the risk with respect to the attacker's true dictionary.
Under the hypothesis that the distribution underlying θ factorizes into a product form:
Theorem: The rule that minimizes the risk, δ* = arg min_δ R(δ, θ), can be found in O(nNm) time.
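The paper's O(nNm) algorithm exploits the product-form assumption; as an illustration of the objective being minimized only, here is a brute-force sketch that enumerates all 2^m attribute-suppression sets (feasible only for very small m), reusing the hypothetical make_disclosure_rule and risk helpers from the sketches above:

```python
from itertools import chain, combinations

def best_rule(records, theta, phi, m):
    """Try every suppression set; illustrative only, exponential in m."""
    best_set, best_risk = None, float("inf")
    subsets = chain.from_iterable(combinations(range(m), k) for k in range(m + 1))
    for S in subsets:
        delta = make_disclosure_rule(set(S))   # from the disclosure-rule sketch
        r = risk(delta, records, theta, phi)   # from the risk sketch
        if r < best_risk:
            best_set, best_risk = set(S), r
    return best_set, best_risk
```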
k-Anonymity
k-anonymity is simply a special case of our framework in which:
– θtrue = T
– Φ is a constant
– δ is underspecified
Our framework thus exposes some questionable hypotheses underlying k-anonymity!
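With θtrue = T and constant Φ, the per-record quantity |ρ(δ(xi), T)| is exactly the equivalence-class size that k-anonymity requires to be at least k. A minimal sketch of that check, again with the hypothetical rho helper and None standing for ⊥:

```python
def rho(y, theta):
    """Entries of theta consistent with y (None = suppressed)."""
    return [e for e in theta
            if all(yj is None or yj == ej for yj, ej in zip(y, e))]

def is_k_anonymous(released, k):
    """k-anonymity: every released record matches >= k records of the table itself."""
    return all(len(rho(y, released)) >= k for y in released)

t = [("02/07/81", "00156", None), ("02/07/81", "00156", None),
     ("11/20/67", "00198", None)]
print(is_k_anonymous(t, 2))  # False: the third record matches only itself
```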
Conclusions
New framework for privacy risk that takes sensitivity into account
Risk estimation as an upper bound for the true privacy risk
Efficient algorithm for risk computation
Generalization of k-anonymity