Data Privacy – The Problem

CMPUT 691:
Differential Privacy:
Privacy Preserving Data-Analysis
http://webdocs.cs.ualberta.ca/~osheffet/CMPUT691F16.html
The Course
• Time: Tue. & Thu. 14:00-15:20
• Place: CSC B-43
• Webpage: eClass (for registered) and webdocs:
http://webdocs.cs.ualberta.ca/~osheffet/CMPUT691F16.html
• Book: Dwork & Roth, The Algorithmic Foundations of Differential
Privacy (link at webpage)
• Instructor:
• Or Sheffet, Athabasca 3-04
• Office hours 15:30-16:00 Tuesdays / coordinate via email
• [email protected]
• Subject must begin with [CMPUT691]
• Not a native English speaker, nor native to Canada
The Course
• Grade:
• 10-15% FFTq (Food-for-Thought Question)
• A question will be given at the end of the class – write an answer by the beginning of
next class.
• 1% per not-completely-trivial answer, more if you give a really good one / engage in a
discussion
• 35-40% based on 3-4 HW assignments
• Pen and paper assignments
• Maybe, but unlikely, draw a few plots
• PLEASE TYPE YOUR SOLUTION
• Late unapproved submission: -25% of assignment per 24 hrs past submission deadline
• Collaborations are encouraged…
• … but solutions ought to be individually written.
• Cite anything & anyone you used
The Course
• Grade:
• 50% Class project
• Groups of 1-2
• Type and scope – for you to decide
• Can be centered on implementation, practical testing or theoretical
• Tentative schedule:
• By week 3: project ideas will be handed out
• By week 5: [5%] submission of title, group members, abstract + references (1-2 pages)
• And a meeting with me, for initial feedback
• By week 8: [10%] mid-project presentations (10-20 mins)
• Papers your project is based on
• Your research question
• Plan of attack
• Feedback opportunity from one another!
• By week 13: [15%] full presentations (20-25 mins)
• Your research questions & “line of attack”
• Results
• Future directions
• Last chance for feedback
• End of semester: [20%] project due.
• Can this become a full-fledged paper?
Questions?
• Tuesday, Sep. 13th :
• Any objections to delaying the class by 1 hour? (Starting at 15:00, ending at 16:20)
Today’s Class
• Overview of the problem of privacy in data analysis
• Atypical: slides, storytelling
• The rest of the course: whiteboard, math
• So if you’re still deciding whether or not to take the class – judge it based on next time.
Data Privacy – The Problem
• Given:
• a dataset with sensitive information
• Health records, census data, financial data, …
• How to:
• Compute and release functions of the dataset
• Answer queries, output summary, learn
• Without compromising individual privacy
• What the #$@& does it even mean????
Data Privacy – The Problem
[Diagram: individuals contribute their data to a server/agency; users – government, researchers, businesses (or a malicious adversary) – send queries and receive answers.]
A Real Problem
Typical examples:
• Census
• Civic archives
• Medical records
• Search information
• Communication logs
• Social networks
• Genetic databases
•…
Benefits:
• New discoveries
• Improved medical care
• National security
[Figure: weighing privacy against discoveries, medical care, and security.]
The Anonymization Dream
[Diagram: database → anonymized database]
• Trusted curator:
• Removes identifying information (name, address, ssn, …).
• Replaces identities with random identifiers.
• The idea is hard-wired into practices, regulations, …, thought.
• Many uses.
• Reality: a series of failures.
• Pronounced in both the academic and the popular literature.
Linkage Attacks [Sweeney 2000]
[Diagram: linking two datasets]
GIC (Group Insurance Commission) “anonymized” medical data – ≈135,000 patients, ≈100 attributes per encounter: ethnicity, visit date, diagnosis, procedure, medication, total charge, patient-specific data.
Voter registration list of Cambridge, MA – “public records”, open for inspection by anyone: name, address, date registered, party affiliation, date last voted.
Attributes shared by both datasets: ZIP, birth date, sex.
Linkage Attacks [Sweeney 2000]
Quasi-identifiers → re-identification
Not a coincidence:
• DOB + 5-digit ZIP → 69% unique
• DOB + 9-digit ZIP → 97% unique
William Weld (governor of Massachusetts at the time):
according to the Cambridge voter list,
• six people had his particular birth date,
• of which three were men,
• and he was the only one in his 5-digit ZIP code!
[Photo slides: Azrieli Towers (thanks to Amos Fiat); Azrieli Towers (thanks to Google).]
AOL Data release (2006)
• AOL released search data
• A sample of ~20M web queries collected from ~650k users
over three months
• Goal: provide real query log data that is based on real
users
• “It could be used for personalization, query reformulation
or other types of search research”
• The data set fields: AnonID, Query, QueryTime, ItemRank, ClickURL
AnonID   Query                                 QueryTime            ItemRank  ClickURL
4417749  best dog for older owner              3/6/2006 11:48:24    1         http://www.canismajor.com
4417749  best dog for older owner              3/6/2006 11:48:24    5         http://dogs.about.com
4417749  landscapers in lilburn ga.            3/6/2006 18:37:26
4417749  effects of nicotine                   3/7/2006 19:17:19    6         http://www.nida.nih.gov
4417749  best retirement in the world          3/9/2006 21:47:26    4         http://www.escapeartist.com
4417749  best retirement place in usa          3/9/2006 21:49:37    10        http://www.clubmarena.com
4417749  best retirement place in usa          3/9/2006 21:49:37    9         http://www.committment.com
4417749  bi polar and heredity                 3/13/2006 20:57:11
4417749  adventure for the older american      3/17/2006 21:35:48
4417749  nicotine effects on the body          3/26/2006 10:31:15   3         http://www.geocities.com
4417749  nicotine effects on the body          3/26/2006 10:31:15   2         http://health.howstuffworks.com
4417749  wrinkling of the skin                 3/26/2006 10:38:23
4417749  mini strokes                          3/26/2006 14:56:56   1         http://www.ninds.nih.gov
4417749  panic disorders                       3/26/2006 14:58:25
4417749  jarrett t. arnold eugene oregon       3/23/2006 21:48:01   2         http://www2.eugeneweekly.com
4417749  jarrett t. arnold eugene oregon       3/23/2006 21:48:01   3         http://www2.eugeneweekly.com
4417749  plastic surgeons in gwinnett county   3/28/2006 15:04:23   1         http://www.wedalert.com
4417749  plastic surgeons in gwinnett county   3/28/2006 15:04:23   4         http://www.implantinfo.com
4417749  plastic surgeons in gwinnett county   3/28/2006 15:31:00
4417749  60 single men                         3/29/2006 20:11:52   6         http://www.adultlovecompass.com
4417749  60 single men                         3/29/2006 20:14:14
4417749  clothes for 60 plus age               4/19/2006 12:44:03
4417749  clothes for age 60                    4/19/2006 12:44:41   10        http://www.news.cornell.edu
4417749  clothes for age 60                    4/19/2006 12:45:41
4417749  lactose intolerant                    4/21/2006 20:53:51   2         http://digestive.niddk.nih.gov
4417749  lactose intolerant                    4/21/2006 20:53:51   10        http://www.netdoctor.co.uk
4417749  dog who urinate on everything         4/28/2006 13:24:07   6         http://www.dogdaysusa.com
4417749  fingers going numb                    5/2/2006 17:35:47
Name: Thelma Arnold
Age: 62
Widow
Residence: Lilburn, GA
Other Re-Identification Examples
[partial and unordered list]
• Netflix Prize [Narayanan, Shmatikov 08].
• Social networks [Backstrom, Dwork, Kleinberg 07, NS 09].
• Computer networks [Coull, Wright, Monrose, Collins, Reiter ’07, Ribeiro, Chen,
Miklau, Townsley 08].
• Genetic data (GWAS) [Homer, Szelinger, Redman, Duggan, Tembe, Muehling,
Pearson, Stephan, Nelson, Craig 08, ...].
• Microtargeted advertising [Korolova 11].
• Recommendation systems [Calandrino, Kilzer, Narayanan, Felten, Shmatikov 11].
• Israeli CBS [Mukatren, N, Salman, Tromer].
• …
k-Anonymity [SS98, S02] … ℓ-diversity … t-closeness …
• Prevent re-identification:
• Make every individual indistinguishable from at least k-1 other individuals
Original data:
ZIP     Age   Sex     Disease
23456   55    Female  Heart
12345   30    Male    Heart
12346   33    Male    Heart
13144   45    Female  Breast Cancer
13155   42    Male    Hepatitis
23456   42    Male    Viral

2-anonymized data:
ZIP     Age   Sex    Disease
23456   **    *      Heart
1234*   3*    Male   Heart
1234*   3*    Male   Heart
131**   4*    *      Breast Cancer
131**   4*    *      Hepatitis
23456   **    *      Viral

“Both guys from zip 1234* that are in their thirties have heart problems.”
“My (male) neighbor from zip 13155 has hepatitis!”
“Bugger! I cannot tell which disease the patients from zip 23456 have.”
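The k-anonymity condition on the generalized table can be checked mechanically. A minimal Python sketch (the function name and the dictionary encoding of the table are mine, not from the course):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every combination of quasi-identifier values that
    appears in the table appears in at least k rows."""
    groups = Counter(tuple(row[a] for a in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())

# The 2-anonymized table from the slide.
anonymized = [
    {"zip": "23456", "age": "**", "sex": "*",    "disease": "Heart"},
    {"zip": "1234*", "age": "3*", "sex": "Male", "disease": "Heart"},
    {"zip": "1234*", "age": "3*", "sex": "Male", "disease": "Heart"},
    {"zip": "131**", "age": "4*", "sex": "*",    "disease": "Breast Cancer"},
    {"zip": "131**", "age": "4*", "sex": "*",    "disease": "Hepatitis"},
    {"zip": "23456", "age": "**", "sex": "*",    "disease": "Viral"},
]

print(is_k_anonymous(anonymized, ["zip", "age", "sex"], 2))  # True
print(is_k_anonymous(anonymized, ["zip", "age", "sex"], 3))  # False
```

Note that the check passes for k=2 even though, as the speech bubbles show, background knowledge (a male neighbor in zip 13155) still pins down a disease – the gap that ℓ-diversity and t-closeness try to patch.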
Auditing
[Diagram: an auditor sits between the users and a statistical database, keeping a log of the queries q1,…,qi. For each new query qi+1 it either returns the answer or replies “query denied” (as the answer would cause privacy loss).]
Example 1: Sum/Max Auditing
di real; sum/max queries; privacy is breached if some di is learned.
q1 = sum(d1,d2,d3) → sum(d1,d2,d3) = 15
q2 = max(d1,d2,d3) → Denied (the answer would cause privacy loss)
“Oh well…”
… After Two Minutes …
di real; sum/max queries; privacy is breached if some di is learned.
q1 = sum(d1,d2,d3) → sum(d1,d2,d3) = 15
q2 = max(d1,d2,d3) → Denied (the answer would cause privacy loss)
“There must be a reason for the denial… q2 is denied iff d1=d2=d3=5. I win!”
Example 2: Interval-Based Auditing
di ∈ [0,100]; sum queries; privacy is breached if some di is confined to an interval of length ≤ 1 (auditing decidable in PTIME)
q1 = sum(d1,d2) → Sorry, denied
q2 = sum(d2,d3) → sum(d2,d3) = 50
Denial ⇒ d1,d2 ∈ [0,1] or d1,d2 ∈ [99,100]; combined with the answer to q2 (which rules out the [99,100] branch), d3 ∈ [49,50]
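The attacker's inference in Example 2 can be replayed in code. An illustrative sketch, assuming the auditor denies exactly when an answer would confine some di to an interval of length at most 1; the helper names are mine:

```python
def addend_interval(s, lo=0.0, hi=100.0):
    """If sum(da, db) = s with da, db in [lo, hi], each addend is
    confined to this interval."""
    return max(lo, s - hi), min(hi, s - lo)

def would_deny(s, breach_len=1.0):
    """Assumed auditor rule: deny when answering would confine an
    addend to an interval of length <= breach_len."""
    a, b = addend_interval(s)
    return b - a <= breach_len

# Which values of sum(d1, d2) trigger a denial?
denied_sums = [s for s in range(201) if would_deny(s)]
print(denied_sums)  # [0, 1, 199, 200] -> d1, d2 in [0,1] or in [99,100]

# q2 = sum(d2, d3) = 50 was answered; d2 in [99,100] would force d3 < 0,
# so d2 in [0,1] and the attacker pins d3 down:
print(50 - 1.0, 50 - 0.0)  # 49.0 50.0 -- the denial itself leaked
```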
Max Auditing
d1 d2 d3 d4 d5 d6 d7 d8 … dn-1 dn (di real)
q1 = max(d1,d2,d3,d4) → M1234
q2 = max(d1,d2,d3) → M123 / denied; if denied: d4 = M1234
q3 = max(d1,d2) → M12 / denied; if denied: d3 = M123
Adversary’s Success
q1 = max(d1,d2,d3,d4) → M1234
q2 = max(d1,d2,d3): denied with probability 1/4; if denied, d4 = M1234
q3 = max(d1,d2): denied with probability 1/3; if denied, d3 = M123
Success probability: 1/4 + (1 - 1/4)·1/3 = 1/2
⇒ Recover (in expectation) 1/8 of the database!
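The 1/2 success probability can be checked by simulation. A sketch under stated assumptions: real values drawn uniformly at random, and an auditor that denies a max query exactly when answering it would reveal an entry; the function name is mine:

```python
import random

def attack_group(d):
    """One round of the attack on four real values.  Returns the
    (index, value) the attacker learns, or None."""
    m1234 = max(d)                 # q1 = max(d1..d4): always answered
    if d[3] == m1234:              # q2 = max(d1,d2,d3) would reveal d4,
        return 3, m1234            # so it is denied -- and the denial
                                   # itself tells the attacker d4 = M1234
    m123 = max(d[:3])              # q2 answered with M123
    if d[2] == m123:               # q3 = max(d1,d2) would reveal d3:
        return 2, m123             # denied, so d3 = M123
    return None                    # q3 answered; nothing learned

rng = random.Random(0)
trials = 100_000
wins = sum(attack_group([rng.random() for _ in range(4)]) is not None
           for _ in range(trials))
print(round(wins / trials, 2))   # close to 0.5
```

Each round spends three queries on a group of four entries and learns one entry with probability 1/2, which is where the "1/8 of the database" figure comes from.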
Boolean Auditing?
d1 d2 d3 d4 d5 d6 d7 d8 … dn-1 dn (di Boolean)
q1 = sum(d1,d2) → 1 / denied
q2 = sum(d2,d3) → 1 / denied
…
qi is denied iff di = di+1 ⇒ learn the database / its complement
Now let di,dj,dk be not all equal, where qi-1, qi, qj-1, qj, qk-1, qk were all denied:
q = sum(di,dj,dk) → 1 / 2 (answered, and the answer singles out the database over its complement)
⇒ Recover the entire database!
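The Boolean attack can be sketched end-to-end. Assumptions for this illustration: the auditor answers a sum query unless the answer would pin every queried bit down (i.e. the sum equals 0 or the number of queried bits), and the database has at least three entries; both function names are mine:

```python
def auditor_sum(db, idx):
    """Answer a sum query over Boolean entries unless the answer would
    reveal them: a sum equal to 0 or to len(idx) pins every entry down."""
    s = sum(db[i] for i in idx)
    return s if 0 < s < len(idx) else None   # None means "denied"

def attack(db):
    """Reconstruct db from the auditor's responses alone."""
    n = len(db)
    # Phase 1: adjacent pair sums.  q_i is denied iff d_i = d_{i+1},
    # so the denial pattern gives the database up to complement.
    same = [auditor_sum(db, (i, i + 1)) is None for i in range(n - 1)]
    cand = [0]
    for eq in same:
        cand.append(cand[-1] if eq else 1 - cand[-1])
    if all(same):
        return None        # constant database: every useful query is denied
    # Phase 2: three positions whose values are not all equal.  Their sum
    # is 1 or 2, so it is answered, and it selects cand or its complement.
    i, j = cand.index(0), cand.index(1)
    k = next(t for t in range(n) if t not in (i, j))
    ans = auditor_sum(db, (i, j, k))
    if sum(cand[t] for t in (i, j, k)) == ans:
        return cand
    return [1 - b for b in cand]

db = [1, 1, 0, 1, 0, 0, 1]
print(attack(db) == db)   # True
```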
Randomization
The Scenario
• Users provide modified values of sensitive attributes
• The data miner develops models about the aggregated data
[Diagram: users send randomized values to a data miner.]
Can we develop accurate models without access to precise individual information?
Preserving Privacy
• Value distortion – return xi + ri
• Uniform noise: ri ∼ U(-a, a)
• Gaussian noise: ri ∼ N(0, stdev)
• The perturbation of an entry is fixed
• So that repeated queries do not reduce the noise
• Privacy quantification: interval of confidence [AS 2000]
• With c% confidence, xi lies in an interval [a1, a2]; the width a2 - a1 defines the amount of privacy at the c% confidence level
• Examples (interval width):
           Uniform       Gaussian
  50%      0.5 × 2a      1.34 stdev
  95%      0.95 × 2a     3.92 stdev
  99.9%    0.999 × 2a    6.8 stdev
• Intuition: the larger the interval is, the better privacy is preserved.
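The widths in the table can be recomputed with Python's `statistics.NormalDist`. A sketch under the assumption that the Gaussian entries are two-sided intervals (small rounding differences from the slide's figures are possible); the function names are mine:

```python
from statistics import NormalDist

def uniform_width(conf, a):
    """Width of the conf-level confidence interval when the noise is
    U(-a, a): a symmetric sub-interval of width conf * 2a captures
    the noise with probability conf."""
    return conf * 2 * a

def gaussian_width(conf, stdev):
    """Width of the two-sided conf-level interval for N(0, stdev)."""
    z = NormalDist().inv_cdf((1 + conf) / 2)
    return 2 * z * stdev

for conf in (0.50, 0.95, 0.999):
    print(conf, uniform_width(conf, 1.0), round(gaussian_width(conf, 1.0), 2))
```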
Knowledge about the underlying
distribution affects privacy
• Let X = age
• We know that age > 0
• Suppose ri ∼ U(-50, 50)
• [AS]: privacy 100 at 100% confidence
• Seeing an outcome -49.038:
• x is reduced to the interval [0,1]
• Taking ‘facts of life’ into account affects privacy
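The interval collapse above can be computed directly. A sketch (the function name is mine): given an observation y = x + r with r ∼ U(-50, 50), the side information x ≥ 0 intersects the naive interval [y-50, y+50].

```python
def feasible_interval(y, a=50.0, lo=0.0, hi=float("inf")):
    """Interval for x given an observation y = x + r with r ~ U(-a, a)
    and the side information x in [lo, hi]."""
    return max(lo, y - a), min(hi, y + a)

# Without side information, x could be anywhere in [y-50, y+50]:
# "privacy 100 at 100% confidence".  But ages are nonnegative, so an
# unlucky outcome collapses the interval:
print(feasible_interval(-49.038))   # roughly (0.0, 0.962)
```

The same helper with lo=15.0, hi=40.0 captures the Bob example that follows: the prior interval, not the noise, dominates what the attacker learns.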
Prior knowledge affects privacy
• Let X = age, ri ∼ U(-50, 50)
• [AS]: privacy 100 at 100% confidence
• Seeing a measurement -10
• Facts of life: Bob’s age is between 0 and 40
• Assume you also know Bob has two children
• Then Bob’s age is between 15 and 40
• A-priori information may be used in attacking individual data
What went wrong?
De-identified data isn’t!
“These definitions of privacy are syntactic, not
semantic”
• These attempts fail because they define privacy as the result of some specific algorithm…
• …and they do not address what preserving privacy actually means.
• Maybe instead we should try to define privacy.
Food For Thought #1:
In light of the examples seen today in class:
• Put forward one (or more) property/ies a “good” definition
of privacy should satisfy.
• Try to define these properties formally.
• Bonus: try to define what it means to “preserve privacy.”