Data Anonymization (1)
Outline
Problem
Concepts
Algorithms based on domain generalization hierarchies
Algorithms on numerical data
The Massachusetts Governor Privacy Breach

[Figure: linking attack (Sweeney, IJUFKS 2002)]
  Medical Data: SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
  Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted
  Quasi-identifier shared by both: ZIP, Birth Date, Sex
The Governor of MA was re-identified by linking the two datasets on the
quasi-identifier, connecting Name to Diagnosis.
87% of the US population can be uniquely identified using ZIP Code,
Birth Date, and Sex.
Definition

Table
  Columns: attributes; rows: records
Quasi-identifier (QI)
  A list of attributes that can potentially be used to identify individuals
K-anonymity
  Each combination of QI values in the table appears at least k times
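A minimal sketch of the k-anonymity check just defined: group records by their quasi-identifier values and verify every group has at least k members. The table and the helper name `is_k_anonymous` are illustrative, not from the slides.

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """Check that every combination of quasi-identifier values
    appears in at least k records."""
    counts = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(c >= k for c in counts.values())

# Toy table: (zip, birth_year, sex, diagnosis); QI = first three columns
table = [
    ("0213*", 1960, "F", "flu"),
    ("0213*", 1960, "F", "cold"),
    ("0214*", 1975, "M", "flu"),
]
print(is_k_anonymous(table, (0, 1, 2), 2))  # False: the 0214* record is unique
```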
Basic techniques

Generalization
  Zip: {02138, 02139} → 0213*
Domain generalization hierarchy (DGH)
  A0 → A1 → … → An
  E.g., {02138, 02139} → 0213* → 021* → 02* → 0**
  This hierarchy is a tree structure
Suppression
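One level of the ZIP-code hierarchy above can be sketched as a simple masking function; the name `generalize_zip` is an illustrative choice, not from the slides.

```python
def generalize_zip(zipcode, level):
    """Climb the domain generalization hierarchy by replacing the
    last `level` digits with '*' (A0 -> A1 -> ... -> An)."""
    if level == 0:
        return zipcode
    return zipcode[:-level] + "*" * level

print(generalize_zip("02138", 1))  # 0213*
print(generalize_zip("02139", 1))  # 0213*  (merges with 02138 at level 1)
print(generalize_zip("02138", 3))  # 02***
```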
Balance

A better privacy guarantee usually means lower data utility.
There are many schemes satisfying a given k-anonymity specification;
we want to minimize the distortion of the table in order to maximize
data utility.
Suppression is required if we cannot find a k-anonymous group for a record.
Criteria

Minimal generalization
  The minimal generalization that satisfies the k-anonymization specification
Minimal table distortion
  The minimal generalization with minimal utility loss
  Use precision to evaluate the loss [Sweeney's papers]
Application-specific utility
Complexity of finding the optimal generalization
  NP-hard (Bayardo, ICDE05)
  So all proposed algorithms are approximation algorithms
Shared features in different solutions

All satisfy the k-anonymity specification;
records that cannot be anonymized are suppressed.
The differences lie in the utility loss / cost function:
  Sweeney's precision metric
  Discernibility & classification metrics
  Information-privacy metric
Algorithms

Assume the domain generalization hierarchy is given.
Goals: efficiency and utility maximization.
Metrics to be optimized

Two cost metrics we want to minimize (Bayardo, ICDE05):
Discernibility
  Each record is penalized by the number of records in its k-anonymous group
Classification
  The dataset has a class-label column; we want to preserve the
  classification model
  Penalty: # of records in minority classes within the group
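The two cost metrics can be sketched as follows. This assumes the common formulations: discernibility charges each record the size of its group (so a group of size g costs g²), and the classification metric charges each record that is not in its group's majority class. Function names are illustrative.

```python
from collections import Counter

def discernibility_penalty(group_sizes):
    """Each record pays the size of its equivalence class: sum of |g|^2."""
    return sum(g * g for g in group_sizes)

def classification_penalty(group_labels):
    """Per group, count the records not in the group's majority class."""
    penalty = 0
    for labels in group_labels:
        counts = Counter(labels)
        penalty += len(labels) - max(counts.values())
    return penalty

print(discernibility_penalty([3, 2]))                      # 9 + 4 = 13
print(classification_penalty([["y", "y", "n"], ["n", "n"]]))  # 1 + 0 = 1
```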
Metrics

A combination of information loss and anonymity gain (Wang, ICDE04):
the information-privacy metric.
Metrics: information loss

The dataset has class labels.
Entropy: for a set S labeled by different classes, entropy measures
the impurity of the labels:
  Info(S) = -Σ_i p_i log p_i
where p_i is the fraction of records with label i.
Information loss of a generalization G
  G merges child values {c1, c2, …, cn} into a parent value p:
  I(G) = Info(S_p) - Σ_i (N_ci / N_p) · Info(S_ci)
  where S_p is the set of records covered by p, S_ci the set covered
  by child ci, and N_p, N_ci their sizes.
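A minimal sketch of the entropy and information-loss computations above, under the stated definitions (base-2 logarithm assumed; function names are illustrative).

```python
import math
from collections import Counter

def info(labels):
    """Shannon entropy of a class-label multiset: -sum p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_loss(child_label_sets):
    """I(G) for a generalization merging child partitions into one parent:
    entropy of the merged set minus the size-weighted child entropies."""
    parent = [label for s in child_label_sets for label in s]
    n = len(parent)
    return info(parent) - sum(len(s) / n * info(s) for s in child_label_sets)

# Merging two pure children destroys 1 bit of class information:
print(info_loss([["y", "y"], ["n", "n"]]))  # 1.0
```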
Anonymity gain

A(VID): # of records sharing the VID (quasi-identifier value)
A^G(VID) >= A(VID): generalization improves or does not change A(VID)
Anonymity gain of G:
  P(G) = x - A(VID), where
  x = A^G(VID) if A^G(VID) <= K
  x = K otherwise
Once k-anonymity is satisfied, further generalization of the VID
gains nothing.
Information-privacy combined metric
  IP = information loss / anonymity gain = I(G) / P(G)
We want to minimize IP.
  If P(G) == 0, use I(G) only.
  Either a small I(G) or a large P(G) will reduce IP.
  If the P(G)s are the same, pick the generalization with minimum I(G).
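The gain cap and the combined metric can be sketched directly from the definitions above (helper names are illustrative):

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = min(A^G(VID), k) - A(VID): gains beyond k-anonymity
    do not count."""
    return min(a_after, k) - a_before

def ip_metric(i_g, p_g):
    """Information-privacy trade-off: prefer the smallest I(G)/P(G);
    fall back to I(G) alone when P(G) is zero."""
    return i_g / p_g if p_g != 0 else i_g

print(anonymity_gain(2, 10, k=5))  # 3: capped at k = 5
print(ip_metric(1.0, 4))           # 0.25
```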
Domain-hierarchy-based algorithms

Sweeney's algorithm
Bayardo's tree-pruning algorithm
Wang's top-down and bottom-up algorithms
They are all dimension-by-dimension methods.
Multidimensional techniques

Categorical data?
  Categories are mapped to numbers (numerized) [Bayardo 95 paper]
  Does the ordering matter? (no research on that)
Numerical data
  K-anonymization becomes an n-dimensional space-partitioning problem
  Many existing techniques can be applied
Single-dimensional vs. multidimensional

The evolution:
  categorical (domain hierarchy) [Sweeney, top-down/bottom-up]
  → numerized categories, single-dimensional [Bayardo05]
  → numerized/numerical, multidimensional [Mondrian, spatial indexing, …]
Method 1: Mondrian

Numerize categorical data
Apply a top-down partitioning process
[Figure: partitioning steps 1, 2.1, 2.2; only "allowable cuts" are taken]
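The top-down process can be sketched as a greedy median split, in the style of the Mondrian algorithm: cut the widest attribute at its median, and recurse only when the cut is allowable, i.e. both halves keep at least k records. This is a simplified sketch, not the paper's full algorithm.

```python
def mondrian(points, k):
    """Greedy top-down multidimensional partitioning (Mondrian-style):
    split on the median of the widest attribute while both halves
    keep >= k points."""
    dims = range(len(points[0]))
    # Pick the dimension with the largest value range.
    d = max(dims, key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2
    left, right = pts[:mid], pts[mid:]
    if len(left) >= k and len(right) >= k:   # allowable cut
        return mondrian(left, k) + mondrian(right, k)
    return [pts]                             # no allowable cut: emit one group

groups = mondrian([(25, 50), (26, 52), (40, 90), (41, 88)], k=2)
print(len(groups))  # 2 groups of 2 points each
```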
Method 2: spatial indexing

Multidimensional spatial techniques:
  kd-tree (similar to the Mondrian algorithm)
  R-tree and its variants
[Figure: R-tree vs. R+-tree, each with an upper layer and a leaf layer]
Compacting bounds

Information is better preserved.
Example:
  uncompacted: age [1-80], salary [10k-100k]
  compacted: age [20-40], salary [10k-50k]
The original Mondrian does not consider compacting bounds;
for the R+-tree, it is done automatically.
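Compacting bounds amounts to shrinking each group's reported ranges to its minimum bounding box, as sketched below (the helper name is illustrative):

```python
def compact_bounds(group):
    """Shrink each generalized range to the group's minimum bounding box,
    instead of reporting the full domain range."""
    dims = len(group[0])
    return [(min(p[i] for p in group), max(p[i] for p in group))
            for i in range(dims)]

# (age, salary) values actually present in one k-anonymous group:
group = [(20, 10_000), (35, 30_000), (40, 50_000)]
print(compact_bounds(group))  # [(20, 40), (10000, 50000)]
```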
Benefits of using the R+-tree

Scalable: originally designed for indexing large disk-based data
Multi-granularity k-anonymity: via the tree layers
Better performance
Better quality
Performance
  [Figure: performance comparison with Mondrian]
Utility metrics
  Discernibility penalty
  KL divergence: the difference between a pair of distributions
  (original vs. anonymized data distribution)
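The KL-divergence utility metric can be sketched as follows, assuming discrete distributions over the same attribute-value bins (the setup here is illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p_i log2(p_i / q_i): how much the anonymized
    distribution q distorts the original distribution p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

original   = [0.5, 0.3, 0.2]   # attribute distribution in the raw table
anonymized = [0.4, 0.4, 0.2]   # distribution after generalization
print(kl_divergence(original, anonymized))  # small positive divergence
```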
Certainty penalty
  T: table; t: record; m: # of attributes;
  t.Ai: generalized range of attribute Ai in record t;
  T.Ai: total range of Ai in T
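Using the notation above, a common formulation of the (normalized) certainty penalty sums, over all records and attributes, the generalized range width |t.Ai| divided by the domain width |T.Ai|. The sketch below assumes that formulation and uses each group's bounding box as its generalized range:

```python
def certainty_penalty(groups, domain_ranges):
    """Sum over records and attributes of |t.Ai| / |T.Ai|, where the
    generalized range of a group is its bounding box per attribute."""
    penalty = 0.0
    for group in groups:
        for i, (lo, hi) in enumerate(domain_ranges):
            width = max(p[i] for p in group) - min(p[i] for p in group)
            penalty += len(group) * width / (hi - lo)
    return penalty

groups  = [[(20, 10_000), (40, 50_000)]]   # one group of two records
domains = [(1, 81), (10_000, 100_000)]     # age and salary domains
print(certainty_penalty(groups, domains))  # 2*(20/80 + 40000/90000) ≈ 1.389
```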
Other issues

Sparse high-dimensional data
  Transactional data as a boolean matrix
  "On the anonymization of sparse high-dimensional data," ICDE08
  Related to the clustering problem for transactional data!
  The paper above uses matrix-based clustering;
  item-based clustering (?)
Other issues

Effect of numerizing categorical data
  The ordering of categories may have some impact on quality
General-purpose utility metrics vs. special task-oriented utility metrics
Attacks on the k-anonymity definition