ESTEEM: Quality- and Privacy-Aware Data Integration
Monica Scannapieco, Carola Aiello, Tiziana Catarci, Diego Milano
Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”
Monica Scannapieco, ESTEEM Meeting, 7 May 2007, Ischia

Outline
• Privacy-aware integration
  – Privacy risk assessment
  – Private record linkage
• Quality-aware integration
  – Flexible and fully automatic record linkage (new)
• Summary

Linkage of Anonymous Data

T1
PrivateID  SSN  DOB       ZIP    Health_Problem
a               11/20/67  00198  Shortness of breath
b               02/07/81  00159  Headache
c               02/07/81  00156  Obesity
d               08/07/76  00198  Shortness of breath

T2
PrivateID  SSN  DOB       ZIP    Employment        Marital Status
1          A    11/20/67  00198  Researcher        Married
5          E    08/07/76  00114  Private Employee  Married
3          C    02/07/81  00156  Public Employee   Widow

Quasi-identifier linking T1 and T2: DOB, ZIP

Our Proposal
• A framework for assessing privacy risk that takes into account both facets of privacy
  – based on statistical decision theory
• Definition and analysis of:
  – disclosure policies, modelled by disclosure rules
  – several privacy risk functions
• Estimated risk as an upper bound of the true risk, with the related complexity analysis
• Algorithm for finding the disclosure rule that minimizes the privacy risk

The Formal Framework

Relation:
A    B    C
x11  x12  x13
x21  x22  x23
x31  x32  x33

After applying a disclosure rule δ:
A    B    C
x11
x21
x31       x33

• Risk R(δ, θ) = f(l(δ, θ)), covering both identification and sensitivity
• Loss function l(δ, θ), with θ representing the attacker’s knowledge

K-anonymity

K-anonymity is simply a special case of our framework in which:
1. θtrue = relation T, a stricter assumption on the attacker’s knowledge. We proved that, under some assumptions, the true risk can be bounded by our more general risk.
2. The loss is a constant, which is questionable: it is independent of the type of disclosed attributes (an HIV result carries the same loss as a last doctor visit).
3. The set of disclosure rules is underspecified; we can specify it in several ways.

Our framework highlights questionable hypotheses of k-anonymity!

Private Record Linkage

Let P and Q be two peers owning the relations RP(A1,…,An) and RQ(B1,…,Bn), respectively. The privacy-preserving record matching problem is to perform record matching between RP and RQ such that, at the end of the process:
– P knows only a set PMatch, consisting of the records in RP that match records in RQ. Similarly, Q knows only the set QMatch.
– No information is revealed to P and Q about records that do not match each other.

Published at SIGMOD 07

Key Ideas and Solutions (1)
• We cannot just encrypt data and then compute distances among the encrypted values
  – by definition, encryption functions do not preserve distances
• Let’s work on numbers instead of records!
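The remark that encryption does not preserve distances can be illustrated with a small sketch. It assumes SHA-256 hashing as a stand-in for encryption (the slides do not name a specific scheme) and a plain Levenshtein edit distance as the comparison function; the helper names `edit_distance` and `digest` are hypothetical and chosen only for this illustration.

```python
import hashlib


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def digest(value: str) -> str:
    """Hex SHA-256 digest, used here as a stand-in for an encrypted value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two spellings of the same name that differ by a single typo.
name_p = "Scannapieco"
name_q = "Scanapieco"

plain_distance = edit_distance(name_p, name_q)
hashed_distance = edit_distance(digest(name_p), digest(name_q))

print("distance on plaintext:", plain_distance)
print("distance on digests:  ", hashed_distance)
```

The two plaintext values are one edit apart, while their digests look unrelated, so a threshold on the digest distance says nothing about the similarity of the original records. This is the motivation for matching in an embedding space instead.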
Records are mapped into a vector space, and record matching is performed in that space.

Key Ideas and Solutions (2)
Third-party-based protocol in which:
– The two parties jointly build the embedding space, using a method (SparseMap) with “secure” features
– Each of the two parties embeds its own dataset and sends it to the third party
– The third party W performs the intersection and sends the result back to the parties

Key Ideas and Solutions (3)
Th1: Given the two relations RP(D1,…,Ds) and RQ(D1,…,Dx), the set of matching records RecMatch, and the database size DBSize, the record matching protocol finds the matched records between the two relations with the following assurances:
– RecMatch is not disclosed to W
– RP - RecMatch is not disclosed to Q
– RQ - RecMatch is not disclosed to P
– DBSize is disclosed to W and bounded by P and Q

Schema Matching Features
Th2: Given the schemas RP and RQ, owned by parties P and Q respectively, and the set of matching attributes AttrMatch, the schema matching protocol finds the attributes common to the two schemas with the following assurances:
– AttrMatch is not disclosed to W
– AttrMatch is not disclosed to P and Q
– AttrMatchSize is not disclosed to P and Q

How good are we?
• Time: better than record linkage without privacy preservation
• Effectiveness: comparable with respect to recall and precision
[Figure: Total Execution Time (secs) vs. Dataset Size, for Our Method and for Record Matching in the Original Space]
[Figure: Precision and Recall vs. Threshold, for Recall Original, Precision Original, Recall Embedded, Precision Embedded]

Flexible and Automatic RL
• P2P systems are loosely coupled, dynamic, and open
• Manual phases of record linkage can be problematic:
  – Time-consuming vs. the dynamic and open nature of the system
  – Synchronous interactions vs. loosely coupled systems
• Need for flexible and automatic RL

Background: Record Linkage Techniques
Search Space Reduction:
– Sorted Neighborhood Method
– Blocking
– Hierarchical grouping
– …
Comparison Functions:
– Edit distance
– Smith-Waterman
– Q-grams
– Jaro string comparator
– Soundex code
– TF-IDF
– …
Decision Rules:
– Probabilistic: Fellegi & Sunter
– Empirical
– Knowledge-based

Key Idea
• Record linkage is a complex process and should be decomposed as much as possible into its constituent phases
• For each phase, the most appropriate technique should be chosen depending on application and data requirements
• The goal is to dynamically build ad hoc record linkage workflows
• RELAIS: a toolkit serving this purpose
  – developed at Istat
  – UNIROMA contribution on data profiling (wait a couple of slides)

RELAIS Toolkit
Application Constraints:
• Admissible error rates
• Privacy issues
• Cost
• …
Database Features:
• Size
• Quality
• Domain features
• …
Output of RELAIS: Record Linkage Workflow

RL Workflows
[Diagram: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, built from the phases and techniques below]

Phase                   Techniques
Preprocessing           UpperLowerCase, Normalization, Schema reconciliation
Search Space Reduction  SNM, Blocking
Comparison Function     Jaro, Equality, Edit Distance
Decision Model          Probabilistic, Empirical

Making Some Phases Automatic
• Data profiling for choosing matching keys
• Automatic extraction of:
  – Completeness
  – Consistency
  – Identification power
• Ongoing

Status of RELAIS
• Currently: guided execution of RL workflows, with all phases automatic
• Future:
  – Definition of RELAIS's architecture as a service-oriented, web-accessible architecture. Formal specification of (i) input/output of services and (ii) pre/post conditions, using semantic Web Services technologies
  – Automatic generation of RL workflows by reasoning on service specifications, using either automatic [Berardi et al., VLDB 2005] or semi-automatic [Bouguettaya et al., VLDBJ 2003] service composition techniques

Implementation View
Components: Q-RELAIS, PQ-RELAIS, P-RELAIS
• Data source profiling (quality metadata)
• Quality-based trust evaluation
• Automatic and flexible RL
• Privacy risk assessment
• Private RL
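
The phase decomposition described in the Key Idea and RL Workflows slides (preprocessing, search space reduction, comparison function, decision model) can be sketched as a small composable pipeline. This is an illustrative sketch only and not the RELAIS API: every function and parameter name is hypothetical, blocking on ZIP is an assumed choice of key, the comparison function is a simple per-field equality score, and the decision model is an empirical threshold. The sample records are the DOB and ZIP values from the T1 and T2 tables of the Linkage of Anonymous Data slide.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def normalize(record: Record) -> Record:
    """Preprocessing: trim whitespace and lower-case every field."""
    return {field: value.strip().lower() for field, value in record.items()}


def blocking(left: Iterable[Record], right: Iterable[Record],
             key: Callable[[Record], str]) -> Iterator[Pair]:
    """Search space reduction: pair only records that share a blocking key."""
    buckets: Dict[str, Tuple[List[Record], List[Record]]] = defaultdict(
        lambda: ([], []))
    for rec in left:
        buckets[key(rec)][0].append(rec)
    for rec in right:
        buckets[key(rec)][1].append(rec)
    for left_block, right_block in buckets.values():
        yield from product(left_block, right_block)


def equality_score(a: Record, b: Record, fields: Sequence[str]) -> float:
    """Comparison function: fraction of the chosen fields that agree exactly."""
    return sum(a[f] == b[f] for f in fields) / len(fields)


def empirical_decision(score: float, threshold: float) -> bool:
    """Decision model: declare a match when the score reaches the threshold."""
    return score >= threshold


def run_workflow(left: Iterable[Record], right: Iterable[Record],
                 fields: Sequence[str], block_key: Callable[[Record], str],
                 threshold: float) -> List[Tuple[str, str]]:
    """Chain the four phases and return the ids of the matched pairs."""
    left_n = [normalize(r) for r in left]
    right_n = [normalize(r) for r in right]
    matches = []
    for a, b in blocking(left_n, right_n, block_key):
        if empirical_decision(equality_score(a, b, fields), threshold):
            matches.append((a["id"], b["id"]))
    return matches


t1 = [
    {"id": "a", "dob": "11/20/67", "zip": "00198"},
    {"id": "b", "dob": "02/07/81", "zip": "00159"},
    {"id": "c", "dob": "02/07/81", "zip": "00156"},
    {"id": "d", "dob": "08/07/76", "zip": "00198"},
]
t2 = [
    {"id": "1", "dob": "11/20/67", "zip": "00198"},
    {"id": "5", "dob": "08/07/76", "zip": "00114"},
    {"id": "3", "dob": "02/07/81", "zip": "00156"},
]

matches = run_workflow(t1, t2, fields=("dob", "zip"),
                       block_key=lambda r: r["zip"], threshold=1.0)
print(matches)
```

Because each phase is a separate function, a different application could swap in another blocking key, comparison function, or decision rule without touching the rest of the workflow, which is the kind of flexibility the slides argue for.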