Last update: 11 December 2015
Knowledge and the Web /
Privacy and Big Data –
Data Mining {against, for, ?} Privacy
Bettina Berendt
KU Leuven,
Department of Computer Science
http://people.cs.kuleuven.be/~bettina.berendt/teaching
Where are we?
Agenda
The data
The problem (and all the problems we won’t go into detail about today)
The analytics used for demonstrating the argument
The approach: “Privacy-preserving data mining”
Data minimisation: (not only) for data mining
Incentives?
Agenda
→ The data
The problem (and all the problems we won’t go into detail about today)
The analytics used for demonstrating the argument
The approach: “Privacy-preserving data mining”
Data minimisation: (not only) for data mining
Incentives?
The cliché slide about data mining
Data (in some format)
5 ★ Open Data: Formats example
http://5stardata.info/en/
5 ★ Open Data (Berners-Lee, 2006):
From Web data (e.g. commercial data, Facebook) via Open data (e.g. Twitter, much open govt./public data) to Linked Open Data (e.g. dbpedia)
http://www.w3.org/DesignIssues/LinkedData.html, http://5stardata.info/en/
Agenda
The data
→ The problem (and all the problems we won’t go into detail about today)
The analytics used for demonstrating the argument
The approach: “Privacy-preserving data mining”
Data minimisation: (not only) for data mining
Incentives?
What is this about?
Knowledge mined from data, descriptive and predictive (e.g. “predictive analytics”) ...
... that some people would prefer some other people not to have
Targeted advertising / nuisance to the individual
Knowledge mined from data, descriptive and predictive (e.g. “predictive analytics”) ...
... that some people would prefer some other people not to have
Targeted advertising / privacy violation for the individual
(“The Target story”, reconstructed on Amazon)
Profiling of individuals with consequences beyond nuisance
Knowledge mined from data, descriptive and predictive (e.g. “predictive analytics”) ...
... that some people would prefer some other people not to have
cf. (Kosinski et al., 2013) – the Converse/stupidity example was gleaned from interaction with the authors’ Preference Tool demo account
Trade secrets
Knowledge mined from data, descriptive and predictive (e.g. “predictive analytics”) ...
... that some people would prefer some other people not to have
Focus
Is not, or not only, on privacy as
• individual privacy
• a fundamental human right
but – both more generally and more narrowly – about
• confidentiality of data
(Depending on the jurisdiction, this is a plain misnomer, or just confusing – but it is the terminology of the field ...)
An overview of key questions/problems and where they are discussed (1)
• How do data become available, and how can you (and others) use them?
→ Knowledge and the Web course
• How can the availability of data affect individuals’ privacy?
→ Privacy and Big Data course
• How can the availability of data affect other interests in confidentiality?
→ Not covered, left to your/our common-sense understanding
An overview of key questions/problems and where they are discussed (2)
• How can data mining be a threat to individuals’ privacy?
→ Not covered, left to your/our common-sense understanding
• How can mining effects on privacy and confidentiality be mitigated?
→ Today: some technical modifications / decisions
(The question is much bigger, and technology is only one part of the answer. But we can’t possibly cover this within one lecture.)
• How can data mining be a helper for privacy?
→ Martijn van Otterlo in the Privacy and Big Data course
• How can data mining be a threat to other interests in confidentiality?
→ Discussion (based on research we and others did) with those who are interested
More on my view of the last two questions: (Berendt, 2012)
Why should you care? (1)
From the KaW student feedback:
“[...] A lot of focus on research question while for me as a
computer scientist this does not seem relevant.”
Do you care?
http://www.w3.org/DesignIssues/LinkedData.html, http://5stardata.info/en/
Do you care? (From your questions)
• “When every subject has its own URI and data is automatically added and connected. Information about people will eventually end up there as well.
• Of course on the web now this is also the case, but when looking for information about one person there is not a single source which has everything or provides links to where this information comes from.
• I think one of the goals for the semantic web is to be able to link all this information together so it can be easily (and even automatically) retrieved and updated.
• To me this seems a challenge looking at the privacy of those persons that are now reduced to data. Everything available is easily accessible and not scattered around.”
• “How to protect the privacy of individuals. If someone doesn't want some of his/her data to be linked, is there any method to cut the link?”
BTW: also in not-so-open data environments
Should personal data be open data? Should it be linked data? (1)
(from the discussion 2015-12-02, rephrasing from memory)
• << It can be linked, even by a unique URI. But I, the data subject, should have control over who sees what. For example, the doctor should see my medical records, just like in pre-Internet days. >> (Rephrasing BB: i.e., personal data should not be open!)
• << Isn’t this more a security issue? >>
Remarks BB:
• It is definitely about security (when you think of access control to be defended against hackers), but also about privacy (when you think of access control as a way of exercising your right of informational self-determination).
• The latter idea is at the core of European data protection law, so yes, in principle you have these rights, and personal data should not be open.
Should personal data be open data? Should it be linked data? (2)
But this presents some issues; just think of Twitter as an example:
• What if someone else “owns” or “co-owns” these data because it’s their platform (Twitter) or because it’s from a discussion they were involved in too (other users)? (Legally tricky)
• What if you voluntarily “made these data public” (just read the Twitter terms of service)?
• What if this wasn’t so voluntary, but a choice made due to the necessity to speak via this monopoly player on the communications market?
• Is it practical to ask every user for consent if you analyse Twitter data? (Note: some lawyers argue that this would be the only legal way, at least in the EU. Others say you accepted the terms of service.)
• What if some social good comes out of the analysis (lives/children are saved, diseases are cured, social understanding is enhanced, national security is increased, ...)?
Why should you care? (2)
• Whether you have an interest in being an ethically aware computer scientist or not
• And if so, whatever this means specifically to you:
• As a CS professional, you will build systems.
• You will deal with data (be a “data controller”).
• These will be personal data (for ~80% of data scientists, according to a recent survey).
• There is data protection and privacy legislation in pretty much every country.
• You will have to comply.
• Failure to do so costs you money, consumer trust, and maybe your job.
“But I can’t do anything” (1) – as an individual
“But I can’t do anything” (2) – as an IT professional
Well, you are the designer of IT systems, aren’t you?
The upcoming EU data protection regulation mandates Privacy by Design.
General reference for example: CNIL (2015).
Is this covered in a course?
• See (Berendt & Coudert, 2015), now adapted in PaBD.
Agenda
The data
The problem (and all the problems we won’t go into detail about today)
→ The analytics used for demonstrating the argument
The approach: “Privacy-preserving data mining”
Data minimisation: (not only) for data mining
Incentives?
One (the classical) technology behind recommendations and other “predictive analytics”:
frequent itemsets / association rules
Motivation for association-rule learning/mining: store layout
(Amazon, earlier: Wal-Mart, ...)
Where to put: spaghetti, butter?
Data
“Market basket data”: attributes with boolean domains.
In a table, each row is a basket (aka transaction):

Transaction ID | Attributes (basket items)
1 | spaghetti, tomato sauce
2 | spaghetti, bread
3 | spaghetti, tomato sauce, bread
4 | bread, butter
5 | bread, tomato sauce
Solution approach:
The apriori principle and the pruning of the search tree (1)–(4)

[Four animation steps over the itemset lattice of {spaghetti, tomato sauce, bread, butter}:]
Level 1: {spaghetti}, {tomato sauce}, {bread}, {butter}
Level 2: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}
Level 3: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}
Level 4: {spaghetti, tomato sauce, bread, butter}
The apriori principle: if an itemset is infrequent, so are all its supersets – whole branches of this search tree can therefore be pruned.
More formally: Generating large k-itemsets with Apriori

Transaction ID | Attributes (basket items)
1 | spaghetti, tomato sauce
2 | spaghetti, bread
3 | spaghetti, tomato sauce, bread
4 | bread, butter
5 | bread, tomato sauce

Min. support = 40%
Step 1: candidate 1-itemsets
• spaghetti: support = 3 (60%)
• tomato sauce: support = 3 (60%)
• bread: support = 4 (80%)
• butter: support = 1 (20%)
Contd. (transaction table as above)
Step 2: large 1-itemsets
• spaghetti
• tomato sauce
• bread
Candidate 2-itemsets
• {spaghetti, tomato sauce}: support = 2 (40%)
• {spaghetti, bread}: support = 2 (40%)
• {tomato sauce, bread}: support = 2 (40%)
Contd. (transaction table as above)
Step 3: large 2-itemsets
• {spaghetti, tomato sauce}
• {spaghetti, bread}
• {tomato sauce, bread}
Candidate 3-itemsets
• {spaghetti, tomato sauce, bread}: support = 1 (20%)
Step 4: large 3-itemsets
• {}
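To make steps 1–4 concrete, here is a minimal Python sketch of Apriori (function and variable names are mine, not from any particular library); run on the five baskets above, it reproduces exactly the large itemsets just derived.

```python
# Minimal Apriori sketch over the example baskets (illustrative names only).
from itertools import combinations

baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def apriori(baskets, min_support=0.4):
    items = {i for b in baskets for i in b}
    # Steps 1/2: candidate and large 1-itemsets.
    large = {frozenset([i]) for i in items
             if support(frozenset([i]), baskets) >= min_support}
    all_large, k = set(large), 2
    while large:
        # Candidate k-itemsets: unions of large (k-1)-itemsets, kept only if
        # every (k-1)-subset is large (the apriori pruning principle).
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))}
        large = {c for c in candidates if support(c, baskets) >= min_support}
        all_large |= large
        k += 1
    return all_large

for itemset in sorted(apriori(baskets), key=len):
    print(sorted(itemset), support(itemset, baskets))
```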
From frequent itemsets to association rules
Schema: “If subset then large k-itemset”, with support s and confidence c:
• s = (support of large k-itemset) / # tuples
• c = (support of large k-itemset) / (support of subset)
Example: If {spaghetti} then {spaghetti, tomato sauce}
• Support: s = 2 / 5 (40%)
• Confidence: c = 2 / 3 (≈67%)
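Continuing the sketch above, rules are simply read off the frequent itemsets; the helper below (again illustrative, reusing support() and combinations from the earlier block) recovers the example rule with s = 0.4 and c ≈ 0.67.

```python
# Derive "if subset then itemset" rules from the frequent itemsets
# (continues the Apriori sketch above).
def rules(frequent_itemsets, baskets, min_conf=0.6):
    found = []
    for itemset in frequent_itemsets:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                s = support(itemset, baskets)
                c = s / support(lhs, baskets)   # conf = supp(itemset) / supp(subset)
                if c >= min_conf:
                    found.append((sorted(lhs), sorted(itemset), s, c))
    return found

# e.g. (['spaghetti'], ['spaghetti', 'tomato sauce'], 0.4, 0.666...)
```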
Which interestingness measures are interesting for whom?
Agenda
The data
The problem (and all the problems we won’t go into detail about today)
The analytics used for demonstrating the argument
→ The approach: “Privacy-preserving data mining”
Data minimisation: (not only) for data mining
Incentives?
The basic idea of privacy-preserving data mining
• Database inference problem: “The problem that arises when confidential information can be derived from released data by unauthorized users”
• PPDM “develops algorithms for modifying the original data [and/or the processing] in some way, so that the private data and private knowledge remain private even after the mining process”
• The term was coined (in DM) in 2000; the field builds on older research traditions such as statistical disclosure control and secure multiparty computation
• Trade off the utility of the mining results against (this sense of) privacy
• Measures of utility and of privacy (overview: Bertino et al., 2008)
A classification of “privacy-preserving data mining”
• What is to be protected? = What would the attacker want to know?
• The data: the attacker will, given the data table T, not be able to
– link any row in T to a specific individual (identity disclosure)
– obtain an individual’s value of a sensitive attribute (attribute disclosure)
(Anonymization techniques, see PaBD course)
• The inferred data mining result: the attacker will, without T but given the results of DM (e.g. an association rule learned from T), not be able to identify some attributes of a specific individual
• How are the data held and processed?
• centralized
• distributed: every user knows only some rows (or columns) of T
Approach: Modify the data/algorithm/results to avoid undesired patterns
Example – association rule hiding; approaches:
• Distortion-based (sanitization) technique
• Blocking-based technique
High-level view
Database → Data Mining → Association Rules → Hide Sensitive Rules → Changed Database
How to specify the unwanted patterns? Configured (by the user) / automatic: describe the sensitive rules by templates (e.g. those that use or predict sensitive variables).
This slide based on http://dimacs.rutgers.edu/Workshops/Privacy/slides/pontikakis.ppt
Recall: Basic interestingness measures for association rules
A rule X → Y, with X and Y itemsets, is interesting if the measure > a threshold.
Support
• proportion or % of instances in the database (e.g. people) who exhibit the pattern (X and Y)
• Ex.: “If britney then spears, supp=0.35” is interesting for minsupp=0.05
Confidence
• proportion or % of instances with X that also have Y = support(X & Y) / support(X)
• Ex.: “If book1 then book2, supp=0.001, conf=1” is interesting for any minconf
Example
Sample Database:

A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

Rule A→C has:
Support(A→C) = 80%
Confidence(A→C) = 100%
Distortion-based techniques for association-rule hiding
[Diagram: Sample Database (above) → Distortion Algorithm → Distorted Database, in which some supporting 1s have been deleted]
Rule A→C had: Support(A→C) = 80%, Confidence(A→C) = 100%
Rule A→C has now: Support(A→C) = 40%, Confidence(A→C) = 50%
This + the following 9 slides from/based on http://dimacs.rutgers.edu/Workshops/Privacy/slides/pontikakis.ppt
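A hedged sketch of what such a distortion step can look like (my own minimal version, not the specific algorithm from the Pontikakis & Verykios slides): delete 1s in the consequent column of transactions supporting the sensitive rule until its confidence falls below the threshold MCT. On the sample database above it yields exactly the 40% / 50% figures.

```python
# Sketch of a distortion-based hiding step (illustrative, not a specific
# published algorithm): flip supporting 1s to 0s until conf(rule) < MCT.
def hide_rule(table, antecedent, consequent, mct):
    """table: list of dicts mapping item -> 0/1; modified in place."""
    def conf():
        supp_a = sum(all(t[i] for i in antecedent) for t in table)
        supp_ac = sum(all(t[i] for i in antecedent) and t[consequent]
                      for t in table)
        return supp_ac / supp_a if supp_a else 0.0
    for t in table:
        if conf() < mct:
            break                       # rule is hidden, stop distorting
        if all(t[i] for i in antecedent) and t[consequent] == 1:
            t[consequent] = 0           # delete a supporting 1 (distortion)
    return table

table = [dict(zip("ABCD", row)) for row in
         [(1, 1, 1, 0), (1, 0, 1, 1), (0, 0, 0, 1), (1, 1, 1, 0), (1, 0, 1, 1)]]
hide_rule(table, {"A"}, "C", mct=0.6)   # conf(A->C): 100% -> 50%, supp: 80% -> 40%
```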
Side effects on non-sensitive rules

Before hiding process | After hiding process | Side effect
conf(Ri) > MCT | conf(Ri) < MCT | Rule eliminated (undesirable side effect)
conf(Ri) < MCT | conf(Ri) > MCT | Ghost rule (undesirable side effect)
sup(I) > MST | sup(I) < MST | Itemset eliminated (undesirable side effect)
Distortion-based techniques
Challenges/goals:
• To minimize the undesirable side effects that the hiding process causes to non-sensitive rules.
(Note: many measures of utility, which is traded off against privacy, are based on the number or proportion of ghost rules etc.)
• To minimize the number of 1’s that must be deleted in the database.
• Algorithms must be linear in time as the database increases in size.
Quality of data
Sometimes it is dangerous to delete items from the database (e.g. in medical databases), because the false data may create undesirable effects. So we have to hide the rules by adding uncertainty to the database, without distorting it.
Blocking-based techniques
[Diagram: Initial Database → Blocking Algorithm → New Database, in which some values are replaced by “?”]
Support and confidence become intervals (“marginal”).
In the New Database: 60% ≤ conf(A → C) ≤ 100%
Modification of the association-rule definition
A rule A→B’s confidence and support become intervals (“marginal”):
sup(A→B) ∈ [minsup(A→B), maxsup(A→B)]
conf(A→B) ∈ [minconf(A→B), maxconf(A→B)]
where, writing |(A=1)(B=1)| for the number of transactions with A = 1 and B = 1 (and analogously for 0 and ?), and |D| for the number of transactions:
minsup(A→B) = |(A=1)(B=1)| / |D|
maxsup(A→B) = ( |(A=1)(B=1)| + |(A=1)(B=?)| + |(A=?)(B=1)| + |(A=?)(B=?)| ) / |D|
minconf(A→B) = |(A=1)(B=1)| / ( |A=1| + |(A=?)(B=0)| + |(A=?)(B=?)| )
maxconf(A→B) = ( |(A=1)(B=1)| + |(A=1)(B=?)| + |(A=?)(B=1)| + |(A=?)(B=?)| ) / ( |A=1| + |(A=?)(B=1)| + |(A=?)(B=?)| )
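The intervals follow directly from the counts; below is a small sketch of my own (None stands for “?”) that implements the four formulas literally, with no zero-division guard.

```python
# Compute [minsup, maxsup] and [minconf, maxconf] for a rule a -> b over a
# blocked database given as a list of dicts; None represents "?".
def intervals(table, a, b):
    n = len(table)
    cnt = lambda va, vb: sum(t[a] == va and t[b] == vb for t in table)
    a1 = sum(t[a] == 1 for t in table)
    both_max = cnt(1, 1) + cnt(1, None) + cnt(None, 1) + cnt(None, None)
    minsup = cnt(1, 1) / n
    maxsup = both_max / n
    # min conf: only certain 1s count for A&B; unknown As resolve against the rule
    minconf = cnt(1, 1) / (a1 + cnt(None, 0) + cnt(None, None))
    # max conf: every "?" resolves in favour of the rule
    maxconf = both_max / (a1 + cnt(None, 1) + cnt(None, None))
    return minsup, maxsup, minconf, maxconf
```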
Negative Border Rules Set (NBRS) definition
When a rule R has either
• sup(R) > MST AND conf(R) < MCT, or
• sup(R) < MST AND conf(R) > MCT,
then we say that R belongs to the NBRS.
Side-effects definition, modified for blocking-based techniques

Before hiding process | After hiding process | Side effect
conf(Ri) > MCT | minconf(Ri) < MCT | Rule eliminated (undesirable side effect)
conf(Ri) < MCT | maxconf(Ri) > MCT | Ghost rule (desirable side effect)
sup(I) > MST | minsup(I) < MST | Itemset eliminated (undesirable side effect)
sup(I) < MST | maxsup(I) > MST | Ghost itemset (desirable side effect)
Blocking-based techniques
Goals that an algorithm has to achieve:
• To put a relatively small number of ?’s and reduce significantly the confidence of sensitive rules.
• To minimize the undesirable side effects (rules and itemsets lost) by selecting the items in the appropriate transactions to change, and to maximize the desirable side effects.
• To modify the database in a way that an adversary cannot recover the original values of the database.
Approach: Distribute data and processing
Distributed data mining / secure multi-party computation:
the principle explained by secure sum
• Given a number of values x1, ..., xn belonging to n entities,
• compute Σ xi,
• such that each entity ONLY knows its input and the result of the computation (the aggregate sum of the data).
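A minimal sketch of the classic ring-based secure-sum protocol (assuming honest-but-curious, non-colluding parties; the modulus m must exceed any possible sum):

```python
# Ring-based secure sum: the initiator masks the running total with a random
# value, so each party only ever sees a uniformly random partial sum.
import random

def secure_sum(secrets, m=10**9):
    mask = random.randrange(m)          # initiator's random offset
    running = mask
    for x in secrets:                   # the message travels around the ring;
        running = (running + x) % m     # site i adds its private x_i
    return (running - mask) % m         # initiator removes the mask

print(secure_sum([3, 7, 12]))           # 22 - no x_i was revealed en route
```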
Distributed association-rule mining
• Example: distributed AR mining on horizontally partitioned data (one approach: Kantarcioglu & Clifton, 2004)
• In principle, easy:
• If a rule has support > k% globally, it must have support > k% on at least one site.
1. Request that each site send all rules with support > k%.
2. For each rule returned: request that all sites send the count of their transactions that support the rule and the total count of transactions.
3. From this, compute the global support of each rule (a sketch of this check follows below).
• But: if you are the only site where a rule holds, would you want to share that?
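The “in principle easy” computation in step 3 is just a weighted count over the reported figures; a sketch of the coordinator’s check (names are mine, purely illustrative):

```python
# Global support check from per-site counts (step 3 above).
def globally_frequent(site_counts, k=0.4):
    """site_counts: list of (rule_support_count, transaction_count) per site."""
    rule_total = sum(r for r, _ in site_counts)
    txn_total = sum(n for _, n in site_counts)
    return rule_total / txn_total >= k

print(globally_frequent([(8, 20), (1, 30), (9, 25)]))   # 18/75 = 24% -> False
```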
Phase 1: Find out which itemsets are frequent across sites → all-to-all messages with commutative encryption
Site A (frequent itemsets: X, Y): KA(KB(X)), KA(KB(Z)), KA(KC(X)), KA(KC(Y))
Site B (frequent itemsets: X, Z): KB(KA(X)), KB(KA(Y)), KB(KC(X)), KB(KC(Y))
Site C (frequent itemsets: X, Y): KC(KB(X)), KC(KB(Z)), KC(KA(X)), KC(KA(Y))
(The diagram repeats over the next animation steps as the doubly encrypted itemsets are exchanged; in the final step the sites conclude: X and Y are frequent.)
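Commutative encryption can be sketched with modular exponentiation (a Pohlig–Hellman-style cipher; toy parameters and my own illustration, not the exact scheme from the paper): since (m^a)^b ≡ (m^b)^a (mod p), the order of the two sites’ encryptions does not matter, so doubly encrypted itemsets can be compared without revealing who contributed them.

```python
# Commutative encryption via exponentiation mod a prime p:
# E_a(E_b(m)) = m^(a*b) = E_b(E_a(m)) (mod p).
import hashlib, math, random

p = 2**127 - 1                                  # a Mersenne prime (toy size)

def keygen():
    while True:
        k = random.randrange(3, p - 1)
        if math.gcd(k, p - 1) == 1:             # invertible mod p-1 => decryptable
            return k

def encode(itemset):                            # hash an itemset into the group
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % (p - 2) + 2

E = lambda key, m: pow(m, key, p)

ka, kb = keygen(), keygen()
x = encode({"X"})
assert E(ka, E(kb, x)) == E(kb, E(ka, x))       # order of encryption is irrelevant
```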
Phase 2: secure multiparty computation (think of the itemset X = {ABC})
[Graphic from Kantarcioglu & Clifton, 2004]
Some further issues in privacy-preserving data mining
Generalisation to data other than relational tables
• Graph data
• Search queries
• Texts
• Spatial data
• ...
Approaches for all of these exist but are beyond the scope of this course!
A new problem: inferences from patterns
• Atzori et al. (2008): publishing association rules, even those with high support, may lead to anonymity leaks about single individuals
• → Solution approach: release only k-anonymous patterns
• By “sanitization”: adding or deleting transactions from the data
Outlook: privacy-preserving data publishing (PPDP)
• In contrast to the general assumptions of PPDM, arbitrary mining methods may be performed after publishing → need adversary models
• Objective: “access to published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even with the presence of any attacker’s background knowledge obtained from other sources”
• (This needs to be relaxed by assumptions about the background knowledge.)
• A comprehensive survey: Fung et al., ACM Computing Surveys 2010
• With more recent literature: Manta, A. (2013). Literature Survey on Privacy Preserving Mechanisms for Data Publishing. Masters thesis, TU Delft
• (Note for the PaBD people: this survey focusses on anonymization approaches to PPDP and so is closely linked to the material by Claudia Diaz that you have seen.)
Agenda
The data
The problem (and all the problems we won’t go into detail about today)
The analytics used for demonstrating the argument
The approach: “Privacy-preserving data mining”
→ Data minimisation: (not only) for data mining
Incentives?
Two notes on data minimisation
Intuition:
• “Existing data create desires.” (“Vorhandene Daten wecken Begehrlichkeiten”, a traditional adage in German data-protection discourse)
• “There are no innocent data.” *
• If there’s no data, you can’t misuse them.
→ Principle of European data-protection law + other DP frameworks: data minimisation:
“the policy of gathering the least amount of personal information necessary to perform a given function.”
* Anke Domscheit-Berg, documented here: http://blogs.taz.de/tazlab/2014/04/12/uberwachung-durch-nsa-es-gibt-keine-einfache-losung/
Data minimisation and data mining?!
• “the policy of gathering the least amount of personal information necessary to perform a given function.”
→ Often considered a problem:
• If the point of data mining is to explore the data in order to find something new and interesting, there is no given function or purpose!
• So are data mining and data minimisation mutually exclusive?
• We believe they are not:
1. For developing an app (which has a function), minimise the data.
2. When planning an analysis, minimise the data.
3. Then mine the data, ideally in a privacy-friendly way.
Aggregate, anonymize, reduce ... – when re-using “public” data
Record fields: user (e.g. name or other ID), resource (e.g. tweet text), tag (e.g. hashtag). Four options, each followed by analysis / data mining:
1. Get full records → store as received
2. Get full records → anonymize
3. Get full records → filter all but tag
4. Get tag only → store as is
Option 4 is the most minimal and legally safest: no personal data transferred (assuming there is no way to reconstruct personal info from tag content ...)!
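As an illustration of options 2–4, a sketch of a pre-storage minimisation step (the field names are invented, not the real Twitter schema, and the keyed pseudonym is only a weak stand-in for proper anonymisation):

```python
# Keep only the tag field before storing (options 3/4); optionally replace the
# user ID by a keyed one-way pseudonym (a rough stand-in for option 2).
import hashlib, hmac

SECRET = b"keep-this-key-away-from-the-data"    # assumed pseudonymisation key

def minimise(record, pseudonymise_user=False):
    out = {"tags": record.get("tags", [])}      # drop name and free text
    if pseudonymise_user:
        out["user"] = hmac.new(SECRET, record["user"].encode(),
                               hashlib.sha256).hexdigest()[:12]
    return out

raw = {"user": "alice", "text": "off to the doctor ...", "tags": ["#health"]}
print(minimise(raw))                             # {'tags': ['#health']}
```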
Agenda
The data
The problem (and all the problems we won’t go into detail about today)
The analytics used for demonstrating the argument
The approach: “Privacy-preserving data mining”
Data minimisation: (not only) for data mining
→ Incentives?
• Legal requirements?
• Economic pressure?
• Political considerations? (“national sovereignty”)
→ see talks in Security and Privacy in a Post-Snowden World, http://eng.kuleuven.be/evenementen/arenbergsymposium-2014
• Ethical / consumer choice?
→ This is related to ethical consumer choices in other areas, such as Fair Trade. (Argument made by E. Morozov.)
“Few of us have had moral pangs about data-sharing schemes, but that could change.
Before the environment became a global concern, few of us thought twice about taking public transport if we could drive.
Before ethical consumption became a global concern, no one would have paid more for coffee that tasted the same but promised ‘fair trade.’
Consider a cheap T-shirt you see in a store. It might be perfectly legal to buy it, but after decades of hard work by activist groups, a ‘Made in Bangladesh’ label makes us think twice about doing so.
Perhaps we fear that it was made by children or exploited adults.
Or, having thought about it, maybe we actually do want to buy the T-shirt because we hope it might support the work of a child who would otherwise be forced into prostitution.”
Morozov, E. (2013).
But what does make people buy fair-trade products?
An experiment on the effectiveness of “ethical apps”
Effectiveness of “ethical apps”?
Hudson et al. (2013): What makes people buy a fair-trade product?
• An informational film shown before the buying decision? → NO
• Having to make the decision in public? → NO
• Some prior familiarity with the goals and activities of fair-trade campaigns, as well as a broader understanding of national and global political issues that are only peripherally related to fair trade? → YES
Outlook
The data
The problem (and all the problems we won’t go into detail about today)
The analytics used for demonstrating the argument
The approach: “Privacy-preserving data mining”
Data minimisation: (not only) for data mining
Incentives?
→ Data mining and discrimination / fairness
References
The seminal article on PPDM: Agrawal, R. & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00) (pp. 439-450). New York: ACM. http://doi.acm.org/10.1145/342009.335438
Berendt, B. (2012). More than modelling and hiding: Towards a comprehensive view of Web mining and privacy. Data Mining and Knowledge Discovery, 24 (3), 697-737. http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_2012_DAMI.pdf
CNIL (2015). http://www.cnil.fr/english/news-and-events/article/privacy-impact-assessments-the-cnil-publishes-its-pia-manual/
Berendt, B. & Coudert, F. (2015). Privatsphäre und Datenschutz lehren – Ein interdisziplinärer Ansatz. Konzept, Umsetzung, Schlussfolgerungen und Perspektiven. [Teaching privacy and data protection – an interdisciplinary approach. Concept, implementation, conclusions and perspectives.] In Neues Handbuch Hochschullehre [New Handbook of Teaching in Higher Education] (EG 71, 2015, E1.9) (pp. 7-40). Berlin: Raabe Verlag.
Bertino, E., Lin, D., & Jiang, W. (2008). A survey of quantification of privacy-preserving data mining algorithms. In C.C. Aggarwal & P.S. Yu (Eds.), Privacy-preserving data mining: models and algorithms (pp. 181-200). New York: Springer.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110 (15), 5802-5805.
Pontikakis, E. & Verykios, V. (undated). An Experimental Study of Association Rule Hiding Techniques (slide set). http://dimacs.rutgers.edu/Workshops/Privacy/slides/pontikakis.ppt – please see their bibliography for the sources of the distortion-based and blocking-based techniques.
Kantarcioglu, M. & Clifton, C. (2004). Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. IEEE Transactions on Knowledge and Data Engineering, 16 (9), 1026-1037. http://dx.doi.org/10.1109/TKDE.2004.45 (the Phase 2 graphic is from this paper)
Atzori, M., Bonchi, F., Giannotti, F., & Pedreschi, D. (2008). Anonymity preserving pattern discovery. The VLDB Journal, 17 (4), 703-727. http://www.researchgate.net/publication/226264051_Anonymity_preserving_pattern_discovery/file/504635236dfc5f308e.pdf
Fung, B.C.M., Wang, K., Chen, R., & Yu, P.S. (2010). Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42 (4), Article 14, 53 pages. http://doi.acm.org/10.1145/1749603.1749605
Manta, A. (2013). Literature Survey on Privacy Preserving Mechanisms for Data Publishing. Masters thesis, TU Delft.
Morozov, E. (2013). The Real Privacy Problem. MIT Technology Review. http://www.technologyreview.com/featuredstory/520426/the-realprivacy-problem/
Hudson, M., Hudson, I., & Edgerton, J.D. (2013). Political Consumerism in Context: An Experiment on Status and Information in Ethical Consumption Decisions. American Journal of Economics and Sociology, 72 (4), 1009-1037. http://dx.doi.org/10.1111/ajes.12033
Readings on PPDM
A very readable and recent introduction:
Matwin, S. (2013). Privacy-Preserving Data Mining Techniques: Survey and Challenges. In B. Custers et al. (Eds.), Discrimination & Privacy in the Information Society (SAPERE 3, pp. 209-221). Springer. http://link.springer.com/chapter/10.1007%2F978-3-642-30487-3_11
The classification of “privacy-preserving data mining” above is taken from that paper.
A readable but somewhat old overview:
Verykios, V.S., Bertino, E., Fovino, I.N., Provenza, L.P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. SIGMOD Record, 33 (1), 50-57. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.3715
The quotes defining the field (in “The basic idea of privacy-preserving data mining” above) are from that paper.
A thorough overview:
Aggarwal, C.C. & Yu, P.S. (2008). A general survey of privacy-preserving data mining models and algorithms. In C.C. Aggarwal & P.S. Yu (Eds.), Privacy-preserving data mining: models and algorithms (pp. 11-51). New York: Springer. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.352.3032