Data Mining And Privacy Protection
Prepared by: Eng. Hiba Ramadan
Supervised by: Dr. Rakan Razouk
Outline
• Introduction
• Key directions in the field of privacy-preserving data mining
  • Privacy-Preserving Data Publishing
    • the randomization method
    • the k-anonymity model
    • l-diversity
  • Changing the results of data mining applications to preserve privacy
    • association rule hiding
  • Privacy-Preserving Distributed Data Mining
    • vertical partitioning of data
    • horizontal partitioning
Privacy-Preserving Data Mining
• What is data mining?
  The non-trivial extraction of implicit, previously unknown, and potentially useful information from large data sets or databases [W. Frawley, G. Piatetsky-Shapiro and C. Matheus, 1992].
• What is privacy-preserving data mining?
  The study of achieving data mining goals without sacrificing the privacy of the individuals.
Scenario (Information Sharing)
• A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification analysis, without sacrificing the privacy of the individuals in the released data.
[Diagram: the data owner releases the person-specific data to the data recipients.]
Key directions in the field of privacy-preserving data mining
• Privacy-Preserving Data Publishing: these techniques study different transformation methods associated with privacy.
• Changing the results of data mining applications to preserve privacy: in many cases, the results of data mining applications such as association rule or classification rule mining can themselves compromise the privacy of the data.
• Privacy-Preserving Distributed Data Mining: in many cases, the data may be distributed across multiple sites, and the owners of the data at these different sites may wish to compute a common function.
Randomization Approach Overview
[Diagram: each original record, e.g., (30, 25K) or (50, 40K), passes through a randomizer before leaving the user's side, producing distorted records such as (65, 20K) and (25, 60K); from the distorted data, the distributions of Age and Salary are reconstructed and fed to data mining algorithms, which output a model.]
Randomization
The randomization method can be described as follows. Given a set of records x_1, ..., x_N drawn from a distribution X, we add to each record x_i a noise component y_i, drawn from a known probability distribution f_Y(y). The new set of distorted records is x_1 + y_1, ..., x_N + y_N.
In general, it is assumed that the variance of the added noise is large enough that the original record values cannot be easily guessed from the distorted data. Thus, the individual records cannot be recovered, but the distribution of the original records can be.
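A minimal sketch of this distortion step in Python; the data, the Gaussian noise model, and the noise scale are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical original records (e.g., ages); these never leave the users.
x = rng.normal(loc=40, scale=12, size=1000)

# Noise drawn from a known distribution f_Y, with variance large enough
# that individual values cannot be guessed from the distorted data.
y = rng.normal(loc=0, scale=20, size=x.size)

# Only the distorted records z_i = x_i + y_i are released.
z = x + y
```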
Reconstruction Problem
• The original values x_1, x_2, ..., x_n come from an unknown probability distribution X.
• To hide them, we add y_1, y_2, ..., y_n drawn from a known probability distribution Y.
• Given the distorted values x_1 + y_1, x_2 + y_2, ..., x_n + y_n and the probability distribution of Y, estimate the probability distribution of X.
Intuition (Reconstruct a single point)
[Figure: over the Age axis (10 to 90), the original distribution for Age yields a probabilistic estimate of the original value of a distorted point V.]
Reconstructing the Distribution
Combining the estimates of where each point came from, over all the points, gives an estimate of the original distribution.
[Figure: the per-point estimates along the Age axis (10 to 90) sum to an estimate of the distribution.]
Reconstruction
f_X^0 := uniform distribution
j := 0   // iteration number
repeat
    f_X^{j+1}(a) := (1/n) · Σ_{i=1}^{n} [ f_Y((x_i + y_i) − a) · f_X^j(a) ] / [ ∫ f_Y((x_i + y_i) − z) · f_X^j(z) dz ]   (Bayes' rule)
    j := j + 1
until (stopping criterion met)
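A discretized sketch of this iterative procedure, assuming Gaussian noise of known scale; the function name, bin count, and stopping tolerance are illustrative choices, not part of the original slides:

```python
import numpy as np
from scipy.stats import norm

def reconstruct_distribution(z, noise_sigma, bins=50, iters=500, tol=1e-8):
    """Estimate f_X from distorted values z_i = x_i + y_i by iterating
    the Bayes update above over a discretized domain (uniform start)."""
    a = np.linspace(z.min(), z.max(), bins)      # candidate values for X
    f_x = np.full(bins, 1.0 / bins)              # f_X^0: uniform
    for _ in range(iters):
        # f_Y(z_i - a) for every pair (record i, candidate value a)
        like = norm.pdf(z[:, None] - a[None, :], scale=noise_sigma)
        post = like * f_x                        # numerator of Bayes' rule
        post /= post.sum(axis=1, keepdims=True)  # per-record normalization
        f_new = post.mean(axis=0)                # average the posteriors
        if np.abs(f_new - f_x).max() < tol:
            break
        f_x = f_new
    return a, f_x
```

Applied to the distorted values z from the earlier sketch, reconstruct_distribution(z, noise_sigma=20) recovers an approximation of the age distribution without recovering any individual age.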
Seems to work well!
[Plot: number of people (0 to 1200) vs. age (20 to 60) for the original, randomized, and reconstructed data; the reconstructed distribution closely tracks the original.]
Pros & Cons
One key advantage of the randomization method is that it is relatively simple and does not require knowledge of the distribution of the other records in the data.
Pros & Cons
• We only obtain a distribution describing the behavior of X; individual records are not available.
• The distributions are available only along individual dimensions. While the approach can certainly be extended to multi-dimensional distributions, density estimation becomes inherently more challenging as dimensionality increases. Even for modest dimensionalities such as 7 to 10, density estimation becomes increasingly inaccurate and falls prey to the curse of dimensionality.
k-anonymity
The role of attributes in data:
• explicit identifiers are removed
• quasi-identifiers can be used to re-identify individuals
• sensitive attributes (may not exist!) carry sensitive information

  identifier | quasi-identifiers             | sensitive
  Name       | Birthdate | Sex    | Zipcode  | Disease
  Andre      | 21/1/79   | male   | 53715    | Flu
  Beth       | 10/1/81   | female | 55410    | Hepatitis
  Carol      | 1/10/44   | female | 90210    | Bronchitis
  Dan        | 21/2/84   | male   | 02174    | Sprained Ankle
  Ellen      | 19/4/72   | female | 02237    | AIDS
k-anonymity
• Preserve privacy via k-anonymity, proposed by Sweeney and Samarati.
• k-anonymity: intuitively, hide each individual among at least k−1 others.
• How to achieve this?
  • Each QI set of values should appear at least k times in the released data.
  • Sensitive attributes are not considered (we will revisit this...).
  • Use generalization and suppression.
  • Value perturbation is not considered (we should remain truthful to the original values).
• Privacy vs. utility tradeoff: do not anonymize more than necessary.
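A minimal sketch of the "each QI combination appears at least k times" check; the record layout and function name are illustrative:

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """True iff every combination of quasi-identifier values
    appears at least k times in the released table."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(n >= k for n in counts.values())

rows = [
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "pneumonia"},
    {"Age": ">21", "Sex": "M", "Zipcode": "1100*", "Disease": "dyspepsia"},
]
print(is_k_anonymous(rows, ["Age", "Sex", "Zipcode"], k=2))  # True
```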
k-anonymity
Transform each QI value into a less specific form.
Suppose an adversary knows Bob's QI values:

  Name | Age | Sex | Zipcode
  Bob  | 23  | M   | 11000

A generalized table; Bob's record is now indistinguishable among the four records with (Age >21, Sex M, Zipcode 1100*):

  Age | Sex | Zipcode | Disease
  >21 | M   | 1100*   | pneumonia
  >21 | M   | 1100*   | dyspepsia
  >21 | M   | 1100*   | dyspepsia
  >21 | M   | 1100*   | pneumonia
  >61 | F   | 1100*   | flu
  >61 | F   | 1100*   | gastritis
  >61 | F   | 1100*   | flu
  >61 | F   | 1100*   | bronchitis
k-anonymity example
Tools for anonymization:
• generalization: publish more general values, i.e., given a domain hierarchy, roll up
• suppression: remove tuples, i.e., do not publish outliers (often the number of suppressed tuples is bounded)

Original data:

  Birthdate | Sex    | Zipcode
  21/1/79   | male   | 53715
  10/1/79   | female | 55410
  1/10/44   | female | 90210
  21/2/83   | male   | 02274
  19/4/82   | male   | 02237

2-anonymous data:

  Birthdate | Sex    | Zipcode
  */1/79    | person | 5****    (group 1)
  */1/79    | person | 5****    (group 1)
  1/10/44   | female | 90210    (suppressed, not published)
  */*/8*    | male   | 022**    (group 2)
  */*/8*    | male   | 022**    (group 2)
generalization lattice
Assume domain hierarchies exist for all QI attributes:
• zipcode: Z0 = {53715, 53710, 53706, 53703} → Z1 = {5371*, 5370*} → Z2 = {537**}
• sex: S0 = {Male, Female} → S1 = {Person}

Construct the generalization lattice for the entire QI set, from less to more general:
  <S0, Z0> → { <S0, Z1>, <S1, Z0> } → { <S0, Z2>, <S1, Z1> } → <S1, Z2>
with corresponding distance vectors
  [0, 0] → { [0, 1], [1, 0] } → { [0, 2], [1, 1] } → [1, 2]

Objective: find the minimum generalization that satisfies k-anonymity, i.e., maximize utility by finding the minimum distance vector that achieves k-anonymity.
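A sketch of the roll-up operation behind the lattice, using the zipcode and sex hierarchies above; the helper names are illustrative:

```python
def generalize_zip(zipcode, level):
    """Z0 -> Z1 -> Z2: mask the last `level` digits
    (e.g., 53715 -> 5371* -> 537**)."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

def generalize_sex(sex, level):
    """S0 = {Male, Female} -> S1 = {Person}."""
    return sex if level == 0 else "Person"

# Applying the distance vector [1, 1] (node <S1, Z1>) to a record:
print(generalize_sex("Male", 1), generalize_zip("53715", 1))  # Person 5371*
```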
incognito
The Incognito algorithm generates the set of all possible k-anonymous full-domain generalizations of T, with an optional tuple suppression threshold.
The algorithm begins by checking single-attribute subsets of the quasi-identifier, and then iterates, checking k-anonymity with respect to increasingly large subsets.
incognito
(I) Generalization property: if k-anonymity holds at some node, then it also holds at any ancestor node.
    e.g., if <S1, Z0> is k-anonymous, then so are <S1, Z1> and <S1, Z2>.
(II) Subset property: if k-anonymity does not hold for a set of QI attributes, then it does not hold for any of its supersets.
    e.g., if <S0, Z0> is not k-anonymous, then <S0, Z0, B0> and <S0, Z0, B1> cannot be k-anonymous.
(Note: the entire lattice, which includes the three dimensions <S, Z, B>, is too complex to show.)
Incognito considers sets of QI attributes of increasing cardinality and prunes nodes in the lattice using the two properties above; a rough sketch follows.
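A rough sketch of the bottom-up traversal with subset-property pruning; check_k_anonymous is a stand-in for evaluating a candidate generalization, and the structure is simplified relative to the real algorithm, which works on lattice nodes rather than bare attribute subsets:

```python
from itertools import combinations

def incognito_candidates(qi_attrs, check_k_anonymous):
    """Visit QI subsets of increasing cardinality; once a subset fails
    k-anonymity, the subset property rules out all of its supersets."""
    failed, passed = [], []
    for size in range(1, len(qi_attrs) + 1):
        for subset in combinations(qi_attrs, size):
            if any(set(f) <= set(subset) for f in failed):
                continue                      # pruned without checking
            if check_k_anonymous(subset):
                passed.append(subset)
            else:
                failed.append(subset)
    return passed
```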
incognito
[Diagram: candidate sets of increasing cardinality: singletons <Z0>, <B0>, <S0>; then pairs <Z0, B0>, <Z0, B1>, <Z0, S0>, <Z0, S1>, ...; then triples <Z0, B0, S0>, <Z0, B0, S1>, ...]
seen in the domain space
• consider the multi-dimensional domain space
• QI attributes are the dimensions
• tuples are points in this space
• attribute hierarchies partition the dimensions
[Figure: the sex hierarchy (male, female → person) and the zipcode hierarchy (53703, 53705, 53709, 53711, 53714, 53718 → 5370*, 5371* → 537**) partition a two-dimensional space containing points such as (53705, female) and (53711, male).]
seen in the domain space
Incognito example: 2 QI attributes (sex, zipcode), 7 tuples, with the hierarchies shown as bold lines.
[Figure: the original partitioning is not 2-anonymous; after generalization, the data is 2-anonymous, with group 1 (2 tuples), group 2 (3 tuples), and group 3 (2 tuples).]
k-anonymity problems
k-anonymity example:
• homogeneity attack: in the last group, everyone has cancer
• background knowledge attack: in the first group, Japanese individuals have a low chance of heart disease
We need to consider the sensitive values.

Original data:

  id | Zipcode | Age | Nationality | Disease
  1  | 13053   | 28  | Russian     | Heart Disease
  2  | 13068   | 29  | American    | Heart Disease
  3  | 13068   | 21  | Japanese    | Viral Infection
  4  | 13053   | 23  | American    | Viral Infection
  5  | 14853   | 50  | Indian      | Cancer
  6  | 14853   | 55  | Russian     | Heart Disease
  7  | 14850   | 47  | American    | Viral Infection
  8  | 14850   | 49  | American    | Viral Infection
  9  | 13053   | 31  | American    | Cancer
  10 | 13053   | 37  | Indian      | Cancer
  11 | 13068   | 36  | Japanese    | Cancer
  12 | 13068   | 35  | American    | Cancer

4-anonymous data:

  id | Zipcode | Age | Nationality | Disease
  1  | 130**   | <30 | *           | Heart Disease
  2  | 130**   | <30 | *           | Heart Disease
  3  | 130**   | <30 | *           | Viral Infection
  4  | 130**   | <30 | *           | Viral Infection
  5  | 1485*   | ≥40 | *           | Cancer
  6  | 1485*   | ≥40 | *           | Heart Disease
  7  | 1485*   | ≥40 | *           | Viral Infection
  8  | 1485*   | ≥40 | *           | Viral Infection
  9  | 130**   | 3*  | *           | Cancer
  10 | 130**   | 3*  | *           | Cancer
  11 | 130**   | 3*  | *           | Cancer
  12 | 130**   | 3*  | *           | Cancer
l-diversity
• Make sure each group contains well-represented sensitive values:
  • protects from homogeneity attacks
  • protects from background knowledge attacks
• l-diversity (simplified definition): a group is l-diverse if the most frequent sensitive value appears with frequency at most 1/l in the group.

Suppose an adversary knows Bob's QI values:

  Name | Age | Sex | Zipcode
  Bob  | 23  | M   | 11000

A 2-diverse generalized table:

  Age      | Sex | Zipcode        | Disease
  [21, 60] | M   | [10001, 60000] | pneumonia
  [21, 60] | M   | [10001, 60000] | dyspepsia
  [21, 60] | M   | [10001, 60000] | flu
  [21, 60] | M   | [10001, 60000] | pneumonia
  [61, 70] | F   | [10001, 60000] | flu
  [61, 70] | F   | [10001, 60000] | gastritis
  [61, 70] | F   | [10001, 60000] | flu
  [61, 70] | F   | [10001, 60000] | bronchitis
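A minimal sketch of the frequency-based check under the simplified definition above; the record layout and function name are illustrative:

```python
from collections import Counter, defaultdict

def is_l_diverse(rows, qi_columns, sensitive, l):
    """True iff, in every QI group, the most frequent sensitive
    value appears with frequency at most 1/l."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qi_columns)].append(row[sensitive])
    return all(max(Counter(vals).values()) <= len(vals) / l
               for vals in groups.values())
```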
anatomy
• fast l-diversity algorithm
• anatomy is not generalization
• separates sensitive values from tuples
• shuffles sensitive values among groups
• For a given data table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST).

Original data:

  Age | Sex | Zipcode | Disease
  23  | M   | 11000   | pneumonia
  27  | M   | 13000   | dyspepsia
  35  | M   | 59000   | dyspepsia
  59  | M   | 12000   | pneumonia
  61  | F   | 54000   | flu
  65  | F   | 25000   | gastritis
  65  | F   | 25000   | flu
  70  | F   | 30000   | bronchitis

Quasi-Identifier Table (QIT):

  Age | Sex | Zipcode | Group-ID
  23  | M   | 11000   | 1
  27  | M   | 13000   | 1
  35  | M   | 59000   | 1
  59  | M   | 12000   | 1
  61  | F   | 54000   | 2
  65  | F   | 25000   | 2
  65  | F   | 25000   | 2
  70  | F   | 30000   | 2

Sensitive Table (ST):

  Group-ID | Disease
  1        | pneumonia
  1        | dyspepsia
  1        | dyspepsia
  1        | pneumonia
  2        | flu
  2        | gastritis
  2        | flu
  2        | bronchitis
anatomy algorithm
• assign sensitive values to buckets
• create groups by drawing from the l largest buckets
[Figure: nine tuples are placed into buckets according to their sensitive values; group 1, group 2, and group 3 are each formed by drawing one tuple from each of the l largest remaining buckets.]
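A sketch of this group-creation step; the residue-assignment phase of the full algorithm is omitted, and the function name is illustrative:

```python
from collections import defaultdict

def anatomize(rows, sensitive, l):
    """Hash tuples into buckets by sensitive value, then repeatedly
    form a group by drawing one tuple from each of the l largest
    remaining buckets (guaranteeing l distinct values per group)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[sensitive]].append(row)
    groups = []
    while len(buckets) >= l:
        largest = sorted(buckets, key=lambda v: len(buckets[v]),
                         reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])
        for v in largest:                 # drop emptied buckets
            if not buckets[v]:
                del buckets[v]
    return groups  # leftover tuples would need a residue-assignment step
```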
Privacy Preservation
From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l.

Suppose the adversary knows Bob's QI values:

  Name | Age | Sex | Zipcode
  Bob  | 23  | M   | 11000

Quasi-Identifier Table (QIT):

  Age | Sex | Zipcode | Group-ID
  23  | M   | 11000   | 1
  27  | M   | 13000   | 1
  35  | M   | 59000   | 1
  59  | M   | 12000   | 1
  61  | F   | 54000   | 2
  65  | F   | 25000   | 2
  65  | F   | 25000   | 2
  70  | F   | 30000   | 2

Sensitive Table (ST):

  Group-ID | Disease    | Count
  1        | dyspepsia  | 2
  1        | pneumonia  | 2
  2        | bronchitis | 1
  2        | flu        | 2
  2        | gastritis  | 1

For example, Bob falls in group 1, so the adversary can only conclude that he has dyspepsia or pneumonia, each with probability 1/2.
Association Rule Hiding
[Diagram: data mining extracts association rules from the database; the sensitive rules are hidden, and the resulting changed database is what the user receives.]
Recent years have seen tremendous advances in the ability to perform association rule mining effectively. Such rules often encode important target-marketing information about a business.
Association Rule Hiding
There are various algorithms for hiding a group of association rules,
which is characterized as sensitive.
One rule is characterized as sensitive if its disclosure risk is above a
certain privacy threshold.
Sometimes, sensitive rules should not be disclosed to the public since,
among other things, they may be used for inferring sensitive data, or
they may provide business competitors with an advantage.
Association Rule Hiding Techniques
Distortion-based: Modify entries from 1s to 0s
39
Blocking-based Technique the entry is not modified, but is left incomplete. Thus,
unknown entry values are used to prevent discovery of association rules
Data Mining And Privacy Protection
Distortion-based Techniques
[Table: a sample database over items A, B, C, D, and the distorted database produced by the distortion algorithm by flipping selected 1-entries to 0.]
Before distortion, rule A→C has: Support(A→C) = 80%, Confidence(A→C) = 100%.
After distortion, rule A→C has: Support(A→C) = 40%, Confidence(A→C) = 50%.
Association Rule Hiding Strategies
The database used in the following examples, shown both as item sets and as bit-vectors over the items A, B, C (see the sketch below for how support and confidence are computed on it):

  TID | Items | Bits
  T1  | ABC   | 111
  T2  | ABC   | 111
  T3  | ABC   | 111
  T4  | AB    | 110
  T5  | A     | 100
  T6  | AC    | 101
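A small sketch of how support and confidence are computed from the bit-vectors; the helper names are illustrative:

```python
def support(db, items):
    """Fraction of transactions containing every item in `items`."""
    return sum(all(t[i] for i in items) for t in db) / len(db)

def confidence(db, lhs, rhs):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return support(db, lhs + rhs) / support(db, lhs)

# T1..T6 as bit-vectors over items A, B, C
db = [dict(zip("ABC", bits)) for bits in
      [(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0), (1, 0, 1)]]
print(support(db, "ACB"))         # 0.5  -> supp(AC -> B) = 50%
print(confidence(db, "AC", "B"))  # 0.75 -> conf(AC -> B) = 75%
```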
Association Rule Hiding Strategies
If we want to lower the value of the ratio X/Y, we can either decrease the numerator X while keeping the denominator Y fixed, or increase the denominator Y while keeping the numerator X fixed.
Association Rule Hiding Strategies
Support(X→Y) = |{transactions containing X ∪ Y}| / N, where N is the number of transactions in D.
Association Rule Hiding Strategies
Hiding the rule AC→B by decreasing its support: change T1 from 111 (ABC) to 110 (AB).

  TID | Before | After
  T1  | 111    | 110
  T2  | 111    | 111
  T3  | 111    | 111
  T4  | 110    | 110
  T5  | 100    | 100
  T6  | 101    | 101

Rule AC→B: Support = 50%, conf = 75% before; Support = 33%, conf = 66% after.
With min_supp = 35% and min_conf = 70%, the rule is now hidden.
Association Rule Hiding Strategies
Confidence(X→Y) = Support(X ∪ Y) / Support(X). To lower a rule's confidence we can either:
• decrease the support of the rule, making sure we remove items from the right-hand side of the rule, or
• increase the support of the left-hand side alone.
Association Rule Hiding Strategies
Hiding AC→B by removing a right-hand-side item: change T1 from 111 (ABC) to 101 (AC).

  TID | Before | After
  T1  | 111    | 101
  T2  | 111    | 111
  T3  | 111    | 111
  T4  | 110    | 110
  T5  | 100    | 100
  T6  | 101    | 101

Rule AC→B: Support = 50%, conf = 75% before; Support = 33%, conf = 50% after.
With min_supp = 33% and min_conf = 70%, the confidence now falls below the threshold.
Association Rule Hiding Strategies
Hiding AC→B by increasing the support of the left-hand side: change T5 from 100 (A) to 101 (AC), as in the sketch below.

  TID | Before | After
  T1  | 111    | 111
  T2  | 111    | 111
  T3  | 111    | 111
  T4  | 110    | 110
  T5  | 100    | 101
  T6  | 101    | 101

Rule AC→B: Support = 50%, conf = 60%.
With min_supp = 33% and min_conf = 70%, the confidence falls below the threshold.
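This last strategy amounts to a one-entry edit followed by rechecking the rule; a self-contained sketch (helper names illustrative):

```python
def support(db, items):
    """Fraction of transactions containing every item in `items`."""
    return sum(all(t[i] for i in items) for t in db) / len(db)

db = [dict(zip("ABC", bits)) for bits in
      [(1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0), (1, 0, 1)]]

# Raise supp(AC) by adding C to T5 (100 -> 101): conf(AC -> B)
# drops from 3/4 = 75% to 3/5 = 60%, below min_conf = 70%.
db[4]["C"] = 1
print(support(db, "ACB") / support(db, "AC"))  # 0.6
```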
Quality of Data
Sometimes it is dangerous to delete items from the database (e.g., in medical databases), because the resulting false data may have undesirable effects. In such cases we have to hide the rules by adding uncertainty to the database, without distorting it.
Blocking-based Techniques
[Table: the initial database over items A, B, C, D, and the new database produced by the blocking algorithm, in which selected entries are replaced by "?" (unknown) rather than being flipped.]
Motivation
Setting:
• data is distributed at different sites
• these sites may be third parties (e.g., hospitals, government bodies) or individuals
Aim:
• compute the data mining algorithm on the data so that nothing but the output is learned
• that is, carry out a secure computation
Vertical Partitioning of Data
Global database view, split across two owners:

Medical Records:

  TID | Brain Tumor? | Diabetes?
  RPJ | Yes          | Diabetic
  CAC | No Tumor     | No
  PTR | No Tumor     | Diabetic

Cell Phone Data:

  TID | Model | Battery
  RPJ | 5210  | Li/Ion
  CAC | none  | none
  PTR | 3650  | NiCd
Horizontal partitioning
• Two banks hold very similar credit card information:
  • Is the account active?
  • Is the account delinquent?
  • Is the account new?
  • account balance
• No public sharing of the data is allowed.
Privacy-Preserving Distributed Data Mining
The main tools:
• secure multiparty computation
• cryptography
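The slides stop at naming the tools; as one classic illustration (not from the slides) of "nothing but the output is learned", here is a toy secure-sum protocol often used for horizontally partitioned data, assuming semi-honest, non-colluding sites:

```python
import random

def secure_sum(site_values, modulus=10**9):
    """Site 1 masks its value with a random offset and passes the running
    total around the ring; each site adds its own value; site 1 removes
    the mask at the end. No site sees another site's raw input."""
    mask = random.randrange(modulus)
    running = (site_values[0] + mask) % modulus
    for v in site_values[1:]:
        running = (running + v) % modulus
    return (running - mask) % modulus

# e.g., three banks computing a total count without revealing local counts
print(secure_sum([120, 45, 310]))  # 475
```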