CMPUT 691: Differential Privacy: Privacy-Preserving Data Analysis
http://webdocs.cs.ualberta.ca/~osheffet/CMPUT691F16.html

The Course
• Time: Tue. & Thu. 14:00-15:20
• Place: CSC B-43
• Webpage: eClass (for registered students) and webdocs: http://webdocs.cs.ualberta.ca/~osheffet/CMPUT691F16.html
• Book: Dwork & Roth, The Algorithmic Foundations of Differential Privacy (link on the webpage)
• Instructor:
  • Or Sheffet, Athabasca 3-04
  • Office hours: Tuesdays 15:30-16:00, or coordinate via email
  • [email protected]
    • Subject must begin with [CMPUT691]
  • Not a native English speaker, not native to Canada

The Course
• Grade:
  • 10-15% FFTq (Food-for-Thought Question)
    • A question will be given at the end of class – write an answer by the beginning of the next class.
    • 1% per not-completely-trivial answer, more if you give a really good one / engage in a discussion
  • 35-40% based on 3-4 HW assignments
    • Pen-and-paper assignments
    • Maybe, but unlikely, drawing a few plots
    • PLEASE TYPE YOUR SOLUTIONS
    • Late unapproved submission: -25% of the assignment per 24 hrs past the submission deadline
    • Collaborations are encouraged…
    • … but solutions ought to be written individually.
    • Cite anything & anyone you used

The Course
• Grade:
  • 50% Class project
    • Groups of 1-2
    • Type and scope – for you to decide
    • Can be centered on implementation, practical testing, or theory
  • Tentative schedule:
    • By week 3: project ideas will be given
    • By week 5: [5%] submission of title, group members, abstract + references (1-2 pages)
      • And a meeting with me, for initial feedback
    • By week 8: [10%] mid-project presentations (10-20 mins)
      • Papers your project is based on
      • What your research question is
      • Plan of attack
      • Feedback opportunity from one another!
    • By week 13: [15%] full presentations (20-25 mins)
      • Your research question & "line of attack"
      • Results
      • Future directions
      • Last chance for feedback
    • End of semester: [20%] project due.
      • Can this become a full-fledged paper?

Questions?
• Tuesday, Sep. 13th:
  • Any objections to delaying the class by 1 hour? (Starting at 15:00, ending at 16:20)

Today's Class
• Overview of the problem of privacy in data analysis
  • Atypical: slides, storytelling
• The rest of the course: whiteboard, math
• So if you're still deciding whether or not to take the class – judge it based on next time.

Data Privacy – The Problem
• Given: a dataset with sensitive information
  • Health records, census data, financial data, …
• How to: compute and release functions of the dataset
  • Answer queries, output summaries, learn
• Without compromising individual privacy
  • What the #$@& does that even mean????

Data Privacy – The Problem
[Diagram: individuals contribute their data to a server/agency; users send queries and receive answers. The users may be the government, researchers, or businesses – or a malicious adversary.]

A Real Problem
Typical examples:
• Census
• Civic archives
• Medical records
• Search information
• Communication logs
• Social networks
• Genetic databases
• …
Benefits:
• New discoveries
• Improved medical care
• National security
[Diagram: privacy vs. discoveries, medical care, security]

The Anonymization Dream
[Diagram: Database → trusted curator → Anonymized Database]
• Trusted curator:
  • Removes identifying information (name, address, SSN, …).
  • Replaces identities with random identifiers.
• The idea is hard-wired into practices, regulations, …, thought.
• Many uses.
• Reality: a series of failures.
  • Pronounced in both the academic and the public literature.
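To make the "trusted curator" picture concrete, here is a minimal sketch of this kind of naive de-identification (a Python toy; the record layout, field names and values are made up for illustration). The linkage attacks on the next slides show why it is not enough.

import secrets

# Toy records; the field names and values are invented for this sketch.
records = [
    {"name": "A. Doe", "ssn": "000-00-0001", "zip": "23456",
     "birth_date": "1961-07-31", "sex": "F", "diagnosis": "heart disease"},
    {"name": "B. Roe", "ssn": "000-00-0002", "zip": "13155",
     "birth_date": "1974-02-12", "sex": "M", "diagnosis": "hepatitis"},
]

DIRECT_IDENTIFIERS = {"name", "ssn"}

def naive_anonymize(rows):
    """Drop direct identifiers and replace each identity with a random ID."""
    anonymized = []
    for row in rows:
        cleaned = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        cleaned["id"] = secrets.token_hex(4)   # random identifier
        anonymized.append(cleaned)
    return anonymized

print(naive_anonymize(records))
# Note that zip, birth_date and sex survive the "anonymization" --
# exactly the quasi-identifiers the linkage attacks below exploit.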
Linkage Attacks [Sweeney 2000]
• GIC (Group Insurance Commission) data: "anonymized" patient-specific records (~135,000 patients, ~100 attributes per encounter), including Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge – together with ZIP, Birth date, Sex.
• Voter registration list of Cambridge, MA ("public records", open for inspection by anyone): Name, Address, Date registered, Party affiliation, Date last voted – together with ZIP, Birth date, Sex.
• The two tables share exactly the attributes ZIP, Birth date and Sex.

Linkage Attacks [Sweeney 2000]
• Quasi-identifiers → re-identification.
• Not a coincidence: date of birth + 5-digit ZIP re-identifies 69%; date of birth + 9-digit ZIP re-identifies 97%.
• William Weld (governor of Massachusetts at the time):
  • According to the Cambridge voter list, six people had his particular birth date,
  • of which three were men,
  • and he was the only one of them in his 5-digit ZIP code!

Azrieli Towers [photo – thanks to Amos Fiat]
Azrieli Towers [image – thanks to Google]

AOL Data Release (2006)
• AOL released search data:
  • a sample of ~20M web queries collected from ~650k users over three months.
  • Goal: provide real query-log data that is based on real users.
  • "It could be used for personalization, query reformulation or other types of search research."
• The data set has columns AnonID, Query, QueryTime, ItemRank, ClickURL. An excerpt – every row below carries AnonID 4417749:
  • best dog for older owner – 3/6/2006 11:48:24 (clicked ranks 1, 5)
  • landscapers in lilburn ga. – 3/6/2006 18:37:26
  • effects of nicotine – 3/7/2006 19:17:19 (rank 6)
  • best retirement in the world – 3/9/2006 21:47:26 (rank 4)
  • best retirement place in usa – 3/9/2006 21:49:37 (ranks 10, 9)
  • bi polar and heredity – 3/13/2006 20:57:11
  • adventure for the older american – 3/17/2006 21:35:48
  • nicotine effects on the body – 3/26/2006 10:31:15 (ranks 3, 2)
  • wrinkling of the skin – 3/26/2006 10:38:23
  • mini strokes – 3/26/2006 14:56:56 (rank 1)
  • panic disorders – 3/26/2006 14:58:25
  • jarrett t. arnold eugene oregon – 3/23/2006 21:48:01 (ranks 2, 3)
  • plastic surgeons in gwinnett county – 3/28/2006 15:04:23 (ranks 1, 4) and 3/28/2006 15:31:00
  • 60 single men – 3/29/2006 20:11:52 (rank 6) and 3/29/2006 20:14:14
  • clothes for 60 plus age – 4/19/2006 12:44:03
  • clothes for age 60 – 4/19/2006 12:44:41 (rank 10) and 4/19/2006 12:45:41
  • lactose intolerant – 4/21/2006 20:53:51 (ranks 2, 10)
  • dog who urinate on everything – 4/28/2006 13:24:07 (rank 6)
  • fingers going numb – 5/2/2006 17:35:47
• The clicked URLs (ClickURL column) for these rows include: http://www.canismajor.com, http://dogs.about.com, http://www.nida.nih.gov, http://www.escapeartist.com, http://www.clubmarena.com, http://www.committment.com, http://www.geocities.com, http://health.howstuffworks.com, http://www.ninds.nih.gov, http://www2.eugeneweekly.com, http://www.wedalert.com, http://www.implantinfo.com, http://www.adultlovecompass.com, http://www.news.cornell.edu, http://digestive.niddk.nih.gov, http://www.netdoctor.co.uk, http://www.dogdaysusa.com.
• User 4417749 was re-identified: Name: Thelma Arnold; Age: 62; Widow; Residence: Lilburn, GA.
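Both the GIC/voter-list attack and the AOL episode boil down to the same mechanism: joining a "de-identified" table with publicly available side information on the shared quasi-identifiers. A minimal sketch of that join (all records, names and values below are invented):

medical = [   # "anonymized" records: no names, but quasi-identifiers remain
    {"zip": "12345", "birth_date": "1940-01-01", "sex": "M", "diagnosis": "hepatitis"},
    {"zip": "30047", "birth_date": "1944-05-15", "sex": "F", "diagnosis": "heart disease"},
]
voters = [    # public voter-registration list: names plus the same attributes
    {"name": "J. Doe", "zip": "12345", "birth_date": "1940-01-01", "sex": "M"},
    {"name": "T. Roe", "zip": "30047", "birth_date": "1944-05-15", "sex": "F"},
    {"name": "A. Poe", "zip": "30047", "birth_date": "1950-03-02", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def link(medical_rows, voter_rows):
    """Join the two tables on the quasi-identifiers; a unique match re-identifies a record."""
    names_by_key = {}
    for v in voter_rows:
        key = tuple(v[q] for q in QUASI_IDENTIFIERS)
        names_by_key.setdefault(key, []).append(v["name"])
    hits = []
    for m in medical_rows:
        key = tuple(m[q] for q in QUASI_IDENTIFIERS)
        candidates = names_by_key.get(key, [])
        if len(candidates) == 1:   # exactly one voter fits -> re-identification
            hits.append((candidates[0], m["diagnosis"]))
    return hits

print(link(medical, voters))   # [('J. Doe', 'hepatitis'), ('T. Roe', 'heart disease')]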
Other Re-Identification Examples [a partial and unordered list]
• Netflix Prize [Narayanan, Shmatikov 08].
• Social networks [Backstrom, Dwork, Kleinberg 07; NS 09].
• Computer networks [Coull, Wright, Monrose, Collins, Reiter 07; Ribeiro, Chen, Miklau, Townsley 08].
• Genetic data (GWAS) [Homer, Szelinger, Redman, Duggan, Tembe, Muehling, Pearson, Stephan, Nelson, Craig 08, …].
• Microtargeted advertising [Korolova 11].
• Recommendation systems [Calandrino, Kilzer, Narayanan, Felten, Shmatikov 11].
• Israeli CBS [Mukatren, N, Salman, Tromer].
• …

k-Anonymity [SS98, S02] … l-diversity … t-closeness …
• Prevent re-identification:
  • make every individual indistinguishable from at least k-1 other individuals.
• Example (original table → anonymized table), attributes ZIP, Age, Sex, Disease:
  • 23456, 55, Female, Heart          →  23456, **, *, Heart
  • 12345, 30, Male, Heart            →  1234*, 3*, Male, Heart
  • 12346, 33, Male, Heart            →  1234*, 3*, Male, Heart
  • 13144, 45, Female, Breast cancer  →  131**, 4*, *, Breast cancer
  • 13155, 42, Male, Hepatitis        →  131**, 4*, *, Hepatitis
  • 23456, 42, Male, Viral            →  23456, **, *, Viral
• "Both guys from ZIP 1234* who are in their thirties have heart problems."
• "My (male) neighbor from ZIP 13155 has hepatitis! Bugger!"
• "I cannot tell which disease the patients from ZIP 23456 have."

Auditing
[Diagram: an analyst sends queries q1,…,qi to a statistical database; an auditor, holding the query log, either returns the answer or replies "Query denied (as the answer would cause privacy loss)"; the analyst then issues a new query qi+1.]

Example 1: Sum/Max Auditing
• Setting: the di are real-valued; sum/max queries; privacy is breached if some di is learned.
• q1 = sum(d1,d2,d3) → sum(d1,d2,d3) = 15.
• q2 = max(d1,d2,d3) → Denied (the answer would cause privacy loss).
• "Oh well…"

… After Two Minutes …
• "There must be a reason for the denial… q2 is denied iff d1 = d2 = d3 = 5. I win!"

Example 2: Interval-Based Auditing
• Setting: di ∈ [0,100]; sum queries; privacy is breached if some di is confined to an interval of length at most 1 (auditing here is in PTIME).
• q1 = sum(d1,d2) → "Sorry, denied."
• q2 = sum(d2,d3) → sum(d2,d3) = 50.
• The denial implies d1,d2 ∈ [0,1] or d1,d2 ∈ [99,100];
• combined with sum(d2,d3) = 50, this yields d3 ∈ [49,50].

Max Auditing
• Setting: d1, d2, …, dn real-valued.
• q1 = max(d1,d2,d3,d4) → M1234.
• q2 = max(d1,d2,d3) → M123, or denied; if denied, then d4 = M1234.
• q3 = max(d1,d2) → M12, or denied; if denied, then d3 = M123.

Adversary's Success
• q2 = max(d1,d2,d3) is denied with probability 1/4; if denied, d4 = M1234 is learned.
• q3 = max(d1,d2) is denied with probability 1/3; if denied, d3 = M123 is learned.
• Success probability: 1/4 + (1 - 1/4)·1/3 = 1/2.
• Repeating this over disjoint groups of four entries recovers 1/8 of the database!

Boolean Auditing?
• Setting: d1, …, dn Boolean.
• q1 = sum(d1,d2) → 1 / denied; q2 = sum(d2,d3) → 1 / denied; …
• qi is denied iff di = di+1, so the denial pattern alone reveals the database or its complement.
• Now let di, dj, dk be entries that are not all equal, where qi-1, qi, qj-1, qj, qk-1, qk were all denied, and ask q = sum(di,dj,dk): the answer (1 or 2) determines which of the two complementary assignments is the true one.
• Recover the entire database!

Randomization – The Scenario
• Users provide modified values of their sensitive attributes.
• The data miner develops models about the aggregated data.
• Can we develop accurate models without access to precise individual information?

Preserving Privacy
• Value distortion – return xi + ri:
  • uniform noise: ri ~ U(-a, a);
  • Gaussian noise: ri ~ N(0, stdev).
• The perturbation of an entry is fixed once and for all,
  • so that repeated queries do not reduce the noise.
• Privacy quantification: interval of confidence [AS 2000].
  • With c% confidence, xi lies in an interval [a1, a2]; the width a2 - a1 defines the amount of privacy at the c% confidence level.
• Examples:
    Confidence   Uniform      Gaussian
    50%          0.5 × 2a     1.34 × stdev
    95%          0.95 × 2a    3.92 × stdev
    99.9%        0.999 × 2a   6.8 × stdev
• Intuition: the larger the interval, the better the privacy preserved.
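The [AS 2000] interval-of-confidence quantification above is easy to compute directly. A minimal sketch, assuming the two noise distributions just described (the choice a = stdev = 1 in the example run is arbitrary):

from statistics import NormalDist

def uniform_interval_width(a, confidence):
    """Width of an interval containing x_i with the given confidence,
    when the released value is x_i + r_i and r_i ~ U(-a, a)."""
    return confidence * 2 * a

def gaussian_interval_width(stdev, confidence):
    """Same quantity for r_i ~ N(0, stdev): width of the central interval
    of the noise distribution carrying `confidence` probability mass."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return 2 * z * stdev

for c in (0.50, 0.95, 0.999):
    print(c, uniform_interval_width(1.0, c), gaussian_interval_width(1.0, c))
# With a = stdev = 1 the Gaussian widths come out as roughly 1.35, 3.92 and 6.58,
# close to the 1.34 / 3.92 / 6.8 figures in the table above.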
Knowledge about the underlying distribution affects privacy
• Let X = age; we know that age > 0.
• Suppose ri ~ U(-50, 50).
• [AS]: privacy of 100 at 100% confidence.
• But upon seeing the outcome -49.038:
  • x is confined to (roughly) the interval [0, 1].
• Taking "facts of life" into account affects privacy.

Prior knowledge affects privacy
• Let X = age, ri ~ U(-50, 50).
• [AS]: privacy of 100 at 100% confidence.
• Seeing the measurement -10:
  • Facts of life: Bob's age is between 0 and 40.
  • Assume you also know Bob has two children:
    • Bob's age is between 15 and 40.
• A-priori information may be used in attacking individual data.

What went wrong?
• "De-identified data isn't!"
• These definitions of privacy are syntactic, not semantic:
  • these attempts fail because they define privacy as the result of some specific algorithm…
  • … and they don't talk about the meaning of preserving privacy.
• Maybe, instead, we should try to define privacy.

Food For Thought #1
In light of the examples seen today in class:
• Put forward one (or more) properties a "good" definition of privacy should satisfy.
• Try to define these properties formally.
• Bonus: try to define what it means to "preserve privacy."
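To close the loop on the two age examples above, here is a minimal sketch of how side information shrinks the interval of confidence (the upper age bound of 120 is a made-up "fact of life"):

def posterior_interval(observation, noise_half_width, prior_low, prior_high):
    """Interval that must contain the true value x when the released value is
    x + r with r ~ U(-noise_half_width, noise_half_width) and x is a priori
    known to lie in [prior_low, prior_high]."""
    low = max(observation - noise_half_width, prior_low)
    high = min(observation + noise_half_width, prior_high)
    return low, high

# Age with noise U(-50, 50): the nominal privacy is "100 at 100% confidence".
print(posterior_interval(-49.038, 50, 0, 120))   # (0, 0.962): essentially no privacy left
print(posterior_interval(-10, 50, 15, 40))       # Bob (two children): (15, 40)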