QS t-closeness - Data Mining and Security Lab @ McGill

Anonymizing Data with Quasi-Sensitive Attribute Values
Pu Shi1, Li Xiong1, Benjamin C. M. Fung2
1Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
2CIISE, Concordia University, Montreal, QC, Canada
Problem Statement
We study the problem of anonymizing microdata with
quasi-sensitive (QS) attributes, which are not sensitive
by themselves but can be linked to external knowledge
to reveal indirect sensitive information about an individual.
Definitions
The external knowledge table E has each row as a pair $(L_i, S_i)$,
$i = 1, 2, \dots, |E|$, where $L_i$ is a sensitive label and $S_i$ is the
corresponding set of QS values. For a QI group G of d tuples
$tp_1, \dots, tp_d$ with quasi-identifying (QI) vector q, the sensitive
label set of G is $\bigcup_{i=1}^{d} K(tp_i)$, where $K(tp_i)$ denotes the
sensitive labels that can be linked to tuple $tp_i$. The attacker’s
prior belief α(q,L) and posterior belief β(q,L) are the probabilities
that a target tp with QI-vector q is linked to a label L before and
after the data release, respectively.
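As an illustration of the sensitive label set of a group, the following minimal sketch computes K(tp) per tuple and takes the union over the group. The poster does not spell out the linking rule, so the sketch assumes that a label L_i is linked to a tuple when every QS value in S_i appears in that tuple. The function names and the toy symptom data are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set

# External knowledge table E: sensitive label L_i -> its set of QS values S_i.
ExternalTable = Dict[str, FrozenSet[str]]


def labels_for_tuple(qs_values: Set[str], external: ExternalTable) -> Set[str]:
    """K(tp): labels linked to one tuple.

    ASSUMPTION: a label is linked when S_i is a subset of the tuple's QS values.
    """
    return {label for label, s in external.items() if s <= qs_values}


def labels_for_group(group: Iterable[Set[str]], external: ExternalTable) -> Set[str]:
    """Sensitive label set of a QI group: the union of K(tp_i) over its tuples."""
    result: Set[str] = set()
    for qs_values in group:
        result |= labels_for_tuple(qs_values, external)
    return result


# Toy data (hypothetical): symptoms as QS values, diseases as sensitive labels.
E: ExternalTable = {
    "flu": frozenset({"fever", "cough"}),
    "HIV": frozenset({"fever", "weight loss"}),
}
group = [{"fever", "cough"}, {"fever", "weight loss", "rash"}]
group_labels = labels_for_group(group, E)
```

Under a different linking rule (for example, any overlap between S_i and the tuple), only `labels_for_tuple` would need to change.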
(a) Original microdata with quasi-sensitive attribute symptoms
(b) External knowledge that maps symptoms to disease
(c) A generalized table that cannot prevent indirect disclosure of
disease through symptoms
Preliminary Results
With the Mondrian generalization and our suppression
algorithm implemented in C++, we conducted experiments with:
1) a dataset of 3000 tuples augmented from the Adult dataset,
with 8 QI attributes and 9 synthesized QS terms per tuple, and 2)
an external table of 3000 knowledge labels linked to
random QS terms with a Poisson distribution.
Definition (QS (c,l)-diversity). A group G satisfies QS (c,l)-diversity
if and only if $p_1 \le c\,(p_l + p_{l+1} + \dots + p_m)$, where
$m = \left|\bigcup_{i=1}^{d} K(tp_i)\right|$ and $p_1, p_2, \dots, p_m$ are
the values of $\beta(q, L_i)$ in decreasing order.
A table D∗ satisfies QS (c,l)-diversity if every group satisfies QS
(c,l)-diversity.
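The group-level condition can be checked directly once the posterior beliefs β(q, L_i) are known. The sketch below assumes they are supplied as a plain list of numbers. How β is computed is not covered here, the function name is hypothetical, and treating a group with no linked labels as satisfying the condition is an assumption.

```python
from typing import Sequence


def satisfies_qs_cl_diversity(posteriors: Sequence[float], c: float, l: int) -> bool:
    """Check p_1 <= c * (p_l + p_{l+1} + ... + p_m) for one QI group.

    `posteriors` holds beta(q, L_i) for every label in the group's label set.
    ASSUMPTION: an empty label set is treated as trivially satisfying the test.
    """
    if l < 1:
        raise ValueError("l must be at least 1")
    p = sorted(posteriors, reverse=True)  # p[0] is p_1, the largest belief
    if not p:
        return True
    tail = sum(p[l - 1:])  # p_l through p_m; empty (0) when the group has fewer than l labels
    return p[0] <= c * tail


# Toy posterior beliefs (hypothetical values).
beliefs = [0.2, 0.5, 0.3]
loose = satisfies_qs_cl_diversity(beliefs, c=2.0, l=2)   # 0.5 <= 2 * (0.3 + 0.2)
strict = satisfies_qs_cl_diversity(beliefs, c=1.0, l=3)  # 0.5 <= 1 * 0.2 fails
```

A table-level check would apply this function to every QI group and require all of them to pass.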
Definition (QS t-closeness). A group G satisfies QS t-closeness if
and only if the distance between α(q,L) and β(q,L) is no more than a
threshold t. A table D∗ satisfies QS t-closeness if every group
satisfies QS t-closeness.
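A group-level check for QS t-closeness can be sketched in the same way. The definition above does not name the distance measure, so this sketch assumes the largest absolute difference between prior and posterior belief over all labels. That choice, the function name, and the toy values are illustrative assumptions, not the measure used in the original work.

```python
from typing import Dict


def satisfies_qs_t_closeness(
    prior: Dict[str, float], posterior: Dict[str, float], t: float
) -> bool:
    """Check that the distance between alpha(q, L) and beta(q, L) is at most t.

    ASSUMPTION: distance = max over labels of |alpha - beta|; a label missing
    from either mapping is treated as having belief 0.
    """
    labels = set(prior) | set(posterior)
    distance = max(
        (abs(prior.get(label, 0.0) - posterior.get(label, 0.0)) for label in labels),
        default=0.0,  # no labels at all: distance is 0
    )
    return distance <= t


# Toy beliefs (hypothetical values) for one QI group.
alpha = {"flu": 0.10, "HIV": 0.05}
beta = {"flu": 0.40, "HIV": 0.05}
close_enough = satisfies_qs_t_closeness(alpha, beta, t=0.5)
too_far = satisfies_qs_t_closeness(alpha, beta, t=0.2)
```

Swapping in another distance only requires replacing the `distance` expression.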
Figure 1. Anonymizing data with QS attributes
Algorithm
Phase 1 (QI generalization). Given D, an intermediate dataset Dg
is obtained that satisfies k-anonymity.
Phase 2 (QS suppression). Given Dg, a suppression algorithm is
used to remove selected QS values (items) until every QI group
satisfies QS (c,l)-diversity or QS t-closeness.
Figure 2. Disclosure risks with QS attributes
Figure 4. QS suppression for QS (c,l)-diversity, showing that adaptive
QS suppression significantly outperforms the baseline DFS search
Contributions
• Formal notions of QS l-diversity and QS t-closeness
that extend l-diversity and t-closeness to prevent
indirect attribute disclosure due to QS attribute values.
• A two-phase algorithm that combines generalization
and value suppression to achieve QS l-diversity and QS
t-closeness.
• Greedy search heuristics with
dynamic reordering of tailsets, which
contain the candidate values to be
removed in the next step, so that a
first result is returned quickly.
• Dynamic updates whenever a solution
with a lower cost is found, so that the
result improves continuously
within a bounded time period.
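The two search features can be illustrated with a small anytime depth-first search over sets of QS values to suppress. This is a minimal sketch, not the C++ implementation behind the reported results. It assumes the cost is the number of removed values, that the privacy test is passed in as a callback `is_safe`, that candidates are ranked by a caller-supplied `score`, and that removing every value always yields a safe group. All names are hypothetical.

```python
import time
from typing import Callable, FrozenSet, List, Optional, Sequence


def anytime_suppression(
    items: Sequence[str],
    is_safe: Callable[[FrozenSet[str]], bool],
    score: Callable[[str, FrozenSet[str]], float],
    time_budget_s: float = 1.0,
) -> Optional[FrozenSet[str]]:
    """Return the cheapest set of values to remove found within the budget."""
    deadline = time.monotonic() + time_budget_s
    all_items = frozenset(items)
    best: Optional[FrozenSet[str]] = None

    def search(removed: FrozenSet[str], tail: List[str]) -> None:
        nonlocal best
        if best is not None:
            if time.monotonic() > deadline:
                return  # budget spent: keep the best result found so far
            if len(removed) >= len(best):
                return  # cannot be cheaper than the current best
        remaining = all_items - removed
        if is_safe(remaining):
            best = removed  # dynamic update: a lower-cost solution was found
            return
        # Dynamic reordering: try the most promising candidates first.
        ordered = sorted(tail, key=lambda v: score(v, remaining), reverse=True)
        for i, value in enumerate(ordered):
            search(removed | {value}, ordered[i + 1:])

    search(frozenset(), list(items))
    return best


# Toy setup (hypothetical): a group is safe when no label's QS set survives intact.
E = {"flu": {"fever", "cough"}, "HIV": {"fever", "weight loss"}}


def no_label_linked(remaining: FrozenSet[str]) -> bool:
    return not any(s <= remaining for s in E.values())


def linked_label_count(value: str, remaining: FrozenSet[str]) -> float:
    # Rank a value by how many still-linked labels it takes part in.
    return sum(1 for s in E.values() if value in s and s <= remaining)


removed = anytime_suppression(
    ["fever", "cough", "weight loss", "rash"], no_label_linked, linked_label_count
)
```

In this toy case the greedy ordering tries "fever" first because it participates in both labels, and removing it alone makes the group safe, so every other branch is pruned by the cost bound.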
Figure 3. QS suppression search tree and algorithm features
Figure 5. Two-phase algorithm for QS t-closeness, showing the
trade-off between better privacy and smaller removal cost and the
benefit of the two-phase algorithm compared to a generalization-only
approach.