Privacy-Preserving Data Publishing
Rain Oksvort
Motivation
● Our data gets collected and sold to data miners daily
● The privacy of the data owner needs to be preserved
● Just removing sensitive attributes such as name is not enough
  ○ Other attributes can identify a person as well, for example: age, gender, city, social security number, zip code
● Failing to do so will cause problems
  ○ Example: hospitals in California
  ○ Examples: impersonating the victim, affecting decisions made about them
  ○ Netflix removed data after re-identification of users
● We need to define means and standards for protecting privacy
Definitions
● Explicit identifier
  ○ ID = {First name, Last name}
● Quasi-identifier
  ○ QID = {Age, Gender, City}
● Sensitive attribute
  ○ SA = {Work}
● Non-sensitive attributes
Attack types
● In a record linkage attack, the attacker uses the QID to link the victim to a specific record.
● A successful attribute linkage attack results in the victim's sensitive attribute being disclosed.
● A successful table linkage attack reveals whether or not the victim is present in the data release.
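The record linkage attack above can be sketched in a few lines of Python. This is a minimal illustration, not code from the survey: tables are lists of dicts, and all names and values (the `published` table, the `voter_list`, the `link` helper) are made up for this example.

```python
# A published, "de-identified" table: names stripped, QID kept.
published = [
    {"age": 34, "gender": "F", "city": "Tartu", "disease": "flu"},
    {"age": 51, "gender": "M", "city": "Tallinn", "disease": "cancer"},
]
# External source the attacker controls (e.g., a voter list with names).
voter_list = [
    {"name": "Alice", "age": 34, "gender": "F", "city": "Tartu"},
]
QID = ("age", "gender", "city")

def link(external, table, qid):
    """Return (name, record) pairs where the QID matches exactly one record."""
    hits = []
    for person in external:
        matches = [r for r in table if all(r[a] == person[a] for a in qid)]
        if len(matches) == 1:   # unique match -> re-identification
            hits.append((person["name"], matches[0]))
    return hits

print(link(voter_list, published, QID))
# [('Alice', {'age': 34, 'gender': 'F', 'city': 'Tartu', 'disease': 'flu'})]
```

Even though the name was removed, Alice's disease leaks because her QID combination is unique in the release.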
Privacy preserving methods
Benjamin C. M. Fung, Ke Wang, Rui Chen, and Philip S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4):14:1–14:53, June 2010.
k-Anonymity
● Generalization of quasi-identifier values into indistinguishable groups of size at least k.
  ○ Generalize to a more general value
    ■ car, truck, motorbike → vehicle
  ○ Generalize to an interval
    ■ 33 years old → [30-40) years old
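A minimal sketch of both steps, interval generalization and the k-anonymity check. The helper names (`generalize_age`, `is_k_anonymous`) and the sample rows are ours, assumed only for illustration:

```python
from collections import Counter

def generalize_age(age, width=10):
    """33 -> '[30-40)': map an exact age onto a width-10 interval."""
    lo = (age // width) * width
    return f"[{lo}-{lo + width})"

def is_k_anonymous(rows, qid, k):
    """True if every QID value combination occurs at least k times."""
    counts = Counter(tuple(r[a] for a in qid) for r in rows)
    return all(c >= k for c in counts.values())

raw = [
    {"age": 33, "gender": "F", "city": "Tartu"},
    {"age": 36, "gender": "F", "city": "Tartu"},
    {"age": 45, "gender": "M", "city": "Tallinn"},
    {"age": 41, "gender": "M", "city": "Tallinn"},
]
qid = ("age", "gender", "city")
generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]

print(is_k_anonymous(raw, qid, 2))          # False: exact ages are unique
print(is_k_anonymous(generalized, qid, 2))  # True: two rows per group
```

After generalization the four rows collapse into two indistinguishable groups of size 2, so the table is 2-anonymous.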
(X, Y)-Anonymity
● k-Anonymity assumes that each person is present only once in the table.
  ○ Not always true; for example, one person could have multiple diseases.
● (X, Y)-Anonymity divides the table attributes into two groups, X and Y, and requires that each value combination on X be linked to at least k distinct values on Y.
● The following table illustrates (X, Y)-Anonymity where X is QID = {age, gender, city} and Y is the explicit identifier = {name}.
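The requirement can be checked by counting distinct Y values per X group rather than rows. A small sketch under the same assumptions as before (lists of dicts; the function name `is_xy_anonymous` and the data are ours):

```python
from collections import defaultdict

def is_xy_anonymous(rows, x, y, k):
    """True if every distinct value combination on X is linked to
    at least k distinct value combinations on Y."""
    linked = defaultdict(set)
    for r in rows:
        linked[tuple(r[a] for a in x)].add(tuple(r[a] for a in y))
    return all(len(ys) >= k for ys in linked.values())

# Alice appears twice (two diseases), so counting rows would overstate k;
# counting distinct names gives the right answer.
rows = [
    {"age": 34, "gender": "F", "city": "Tartu", "name": "Alice"},
    {"age": 34, "gender": "F", "city": "Tartu", "name": "Alice"},
    {"age": 34, "gender": "F", "city": "Tartu", "name": "Carol"},
]
print(is_xy_anonymous(rows, ("age", "gender", "city"), ("name",), 2))  # True
```

The group has three rows but only two distinct persons, so it satisfies (X, Y)-anonymity for k = 2 but not for k = 3.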
Multirelational k-Anonymity
● For relational databases.
  ○ PT = person table
  ○ T1, …, Tn = tables containing sensitive attributes
● The idea is to first join all the tables and then apply k-Anonymity.
● Ensure that for each record owner in the join of all tables there exist at least k-1 other owners.
QID = {age, gender}
Sensitive attribute = {job}
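A sketch of the join-then-check idea, assuming a person table and one sensitive table keyed by a person id (`pid`); the tables and the `owners_per_group` helper are illustrative, not from the survey. Note that k is measured over distinct owners in the joined view, not over joined rows:

```python
person = [                      # PT: one row per record owner
    {"pid": 1, "age": 34, "gender": "F"},
    {"pid": 2, "age": 34, "gender": "F"},
]
jobs = [                        # T1: may hold several rows per owner
    {"pid": 1, "job": "tester"},
    {"pid": 2, "job": "designer"},
    {"pid": 2, "job": "programmer"},
]

# Join all tables on the person id.
joined = [{**p, **t} for p in person for t in jobs if p["pid"] == t["pid"]]

def owners_per_group(rows, qid):
    """Smallest number of distinct owners sharing one QID combination."""
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[a] for a in qid), set()).add(r["pid"])
    return min(len(owners) for owners in groups.values())

print(owners_per_group(joined, ("age", "gender")))  # 2 -> join is 2-anonymous
```

The join has three rows but only two owners in its single QID group, so each owner is covered by k-1 = 1 other owner.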
l-Diversity
● Sometimes k-Anonymity may not offer sufficient protection.
  ○ A QID group may contain enough records, but all of them could share the same value for the sensitive attribute.
● l-Diversity solves this problem by requiring that each QID group contain at least l distinct sensitive values.
QID = {age, gender, city}
Sensitive attribute = {job}
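The check is a one-liner over sets of sensitive values per group. In this illustrative sketch (our names and data, with the slide's QID and sensitive attribute), the table is 2-anonymous yet fails 2-diversity because one group holds only programmers:

```python
from collections import defaultdict

def is_l_diverse(rows, qid, sensitive, l):
    """True if every QID group holds at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in rows:
        groups[tuple(r[a] for a in qid)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

rows = [
    {"age": "[30-40)", "gender": "F", "city": "South", "job": "tester"},
    {"age": "[30-40)", "gender": "F", "city": "South", "job": "designer"},
    {"age": "[40-50)", "gender": "M", "city": "North", "job": "programmer"},
    {"age": "[40-50)", "gender": "M", "city": "North", "job": "programmer"},
]
print(is_l_diverse(rows, ("age", "gender", "city"), "job", 2))  # False
```

An attacker who links a victim into the second group learns the job with certainty, which is exactly the leak l-Diversity rules out.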
t-Closeness
● A QID group can meet the l-Diversity requirement while the sensitive-attribute distribution inside the group is still skewed.
  ○ Suppose that in the whole table 70% of sensitive values are programmer, 20% are tester, and 10% are designer, but in some QID group 50% of the records are programmer and 50% are designer.
  ○ This can lead to false inferences.
● t-Closeness is an advancement of l-Diversity: it requires that the distribution of the sensitive attribute inside each QID group be close (within a threshold t) to its distribution in the overall table.
QID = {age, gender, city}
Sensitive attribute = {job}
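The slide's 70/20/10 example can be measured directly. As a simplification, this sketch uses total variation distance between distributions; the original t-Closeness paper uses the Earth Mover's Distance, so treat this as an illustrative stand-in. All names and data are ours:

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}

def worst_group_distance(rows, qid, sensitive):
    """Largest total-variation distance between a QID group's sensitive
    distribution and the table-wide distribution."""
    overall = distribution([r[sensitive] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[a] for a in qid)].append(r[sensitive])
    worst = 0.0
    for values in groups.values():
        g = distribution(values)
        keys = set(overall) | set(g)
        worst = max(worst, 0.5 * sum(abs(g.get(v, 0.0) - overall.get(v, 0.0))
                                     for v in keys))
    return worst

# 70% programmer, 20% tester, 10% designer overall; group g1 is 50/50
# programmer/designer, as in the example above.
rows = ([{"grp": "g1", "job": "programmer"}, {"grp": "g1", "job": "designer"}]
        + [{"grp": "g2", "job": "programmer"}] * 6
        + [{"grp": "g2", "job": "tester"}] * 2)
print(worst_group_distance(rows, ("grp",), "job"))  # 0.4 -> fails t = 0.2
```

Group g1 sits at distance 0.4 from the overall distribution, so the release would violate t-Closeness for any t below 0.4.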
ε-Differential privacy
● ε-Differential privacy is another privacy notion.
● Including a data owner's record in the release must not noticeably increase the threat to that owner's privacy.
● It is achieved by adding random noise (e.g., Laplace noise) to the released sensitive values.
QID = {age, gender, city, job}
Sensitive attribute = {salary}
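A sketch of the Laplace mechanism applied to the slide's salary attribute. The sampler, the sensitivity choice, and the example values are ours, assumed for illustration: noise is drawn from Laplace(0, sensitivity/ε), here for an average query whose sensitivity we bound by max_salary / n.

```python
import math
import random

def laplace(scale):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    u1 = 1.0 - random.random()  # uniform in (0, 1]
    u2 = 1.0 - random.random()
    return scale * (math.log(u1) - math.log(u2))

def dp_average_salary(salaries, epsilon, max_salary):
    """Release the mean salary with Laplace noise calibrated to the
    sensitivity of the mean (max_salary / n) and the privacy budget."""
    sensitivity = max_salary / len(salaries)
    return sum(salaries) / len(salaries) + laplace(sensitivity / epsilon)

salaries = [2800, 3100, 4500, 5200]
print(dp_average_salary(salaries, epsilon=1.0, max_salary=6000))
```

Smaller ε means a larger noise scale and stronger privacy; repeated releases fluctuate around the true mean of 3900, and no single record moves the answer by more than the noise can mask.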
Questions?
Thank you!