
ESTEEM:
Quality- and Privacy-Aware
Data Integration
Monica Scannapieco, Carola Aiello,
Tiziana Catarci, Diego Milano
Dipartimento di Informatica e Sistemistica
Università di Roma “La Sapienza”
Outline
• Privacy-aware integration
– Privacy risk assessment
– Private record linkage
• Quality-aware integration
– Flexible and fully automatic record linkage
• Summary
(Two items are flagged "New!!!" on the slide.)
Monica Scannapieco, ESTEEM Meeting, 7 May 2007, Ischia
Linkage of Anonymous Data

T1
PrivateID | SSN | DOB      | ZIP   | Health_Problem
a         |     | 11/20/67 | 00198 | Shortness of breath
b         |     | 02/07/81 | 00159 | Headache
c         |     | 02/07/81 | 00156 | Obesity
d         |     | 08/07/76 | 00198 | Shortness of breath

T2
PrivateID | SSN | DOB      | ZIP   | Employment       | Marital Status
1         | A   | 11/20/67 | 00198 | Researcher       | Married
5         | E   | 08/07/76 | 00114 | Private Employee | Married
3         | C   | 02/07/81 | 00156 | Public Employee  | Widow

QUASI-IDENTIFIER: DOB, ZIP
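The linkage pictured in T1 and T2 can be sketched as a join on the quasi-identifier. This is a minimal illustration, assuming that the quasi-identifier is the pair (DOB, ZIP) and that the lowercase letters in T1 are PrivateID values with the SSN suppressed; the function name `link_on_quasi_identifier` is hypothetical.

```python
# T1: released "anonymous" table (SSN assumed suppressed).
T1 = [
    {"PrivateID": "a", "DOB": "11/20/67", "ZIP": "00198", "Health_Problem": "Shortness of breath"},
    {"PrivateID": "b", "DOB": "02/07/81", "ZIP": "00159", "Health_Problem": "Headache"},
    {"PrivateID": "c", "DOB": "02/07/81", "ZIP": "00156", "Health_Problem": "Obesity"},
    {"PrivateID": "d", "DOB": "08/07/76", "ZIP": "00198", "Health_Problem": "Shortness of breath"},
]

# T2: a second table that still carries the SSN.
T2 = [
    {"PrivateID": "1", "SSN": "A", "DOB": "11/20/67", "ZIP": "00198"},
    {"PrivateID": "5", "SSN": "E", "DOB": "08/07/76", "ZIP": "00114"},
    {"PrivateID": "3", "SSN": "C", "DOB": "02/07/81", "ZIP": "00156"},
]

QUASI_IDENTIFIER = ("DOB", "ZIP")  # assumption: these two attributes form the quasi-identifier


def link_on_quasi_identifier(t1, t2, qi=QUASI_IDENTIFIER):
    """Join t1 and t2 on the quasi-identifier and return (PrivateID, SSN, Health_Problem)."""
    index = {}
    for row in t2:
        index.setdefault(tuple(row[a] for a in qi), []).append(row)
    links = []
    for row in t1:
        for match in index.get(tuple(row[a] for a in qi), []):
            links.append((row["PrivateID"], match["SSN"], row["Health_Problem"]))
    return links


links = link_on_quasi_identifier(T1, T2)
for private_id, ssn, problem in links:
    print(private_id, ssn, problem)
```

In this sketch records a and c are linked to SSNs A and C, while b and d stay unlinked because no T2 row shares both their DOB and ZIP.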
Our Proposal
• A framework for assessing privacy risk that takes into account both facets of privacy (identification and sensitivity)
– based on statistical decision theory
• Definition and analysis of:
– disclosure policies modelled by disclosure rules
– several privacy risk functions
• Estimated risk as an upper bound of the true risk, and the related complexity analysis
• Algorithm for finding the disclosure rule that minimizes the privacy risk
The Formal Framework

Original relation:
A   | B   | C
x11 | x12 | x13
x21 | x22 | x23
x31 | x32 | x33

After applying the disclosure rule δ (suppressed cells left blank):
A   | B   | C
x11 |     |
x21 |     |
x31 |     | x33

Risk R(δ, θ) = f(l(δ, θ)), covering:
• identification
• sensitivity
Loss function l(δ, θ), with θ representing the attacker's knowledge
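The picture above can be sketched in a few lines: a disclosure rule modelled as a cell mask, and a risk computed by aggregating a loss. This is only an illustration under strong assumptions: the mask representation, the names `apply_disclosure_rule`, `risk` and `count_disclosed`, and the placeholder loss (number of disclosed cells per record) are all hypothetical, and the placeholder ignores the attacker's knowledge θ that the loss in the framework depends on.

```python
from typing import Callable, List, Optional

Row = List[Optional[str]]
Table = List[Row]


def apply_disclosure_rule(table: Table, mask: List[List[bool]]) -> Table:
    """Keep cell (i, j) only where mask[i][j] is True; suppressed cells become None."""
    return [
        [cell if keep else None for cell, keep in zip(row, mask_row)]
        for row, mask_row in zip(table, mask)
    ]


def count_disclosed(row: Row) -> int:
    """Placeholder loss (an assumption): number of disclosed cells in one record."""
    return sum(cell is not None for cell in row)


def risk(disclosed: Table, loss: Callable[[Row], float], f=sum) -> float:
    """R = f(l): aggregate the per-record losses with f (here a plain sum)."""
    return f(loss(row) for row in disclosed)


table = [
    ["x11", "x12", "x13"],
    ["x21", "x22", "x23"],
    ["x31", "x32", "x33"],
]
# The rule from the slide: disclose column A everywhere, plus x33.
mask = [
    [True, False, False],
    [True, False, False],
    [True, False, True],
]

disclosed = apply_disclosure_rule(table, mask)
total_risk = risk(disclosed, count_disclosed)
print(disclosed, total_risk)
```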
K-anonymity

K-anonymity is SIMPLY a special case of our framework in which:
1. θtrue = relation T: a stricter assumption on the attacker's knowledge. We proved that, under some assumptions, we can bound the true risk by our "more general" risk
2. The loss l is a constant. Questionable: it is independent of the type of disclosed attributes (an HIV result incurs the same loss as the last doctor visit)
3. The set of disclosure rules is underspecified; we can specify it in several ways
Our framework highlights the questionable hypotheses of k-anonymity!!!
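For reference, the standard k-anonymity condition (every combination of quasi-identifier values occurs in at least k records) can be checked as follows. This is a minimal sketch using the ZIP values of table T1; the helper names and the ZIP generalization to a three-digit prefix are illustrative assumptions, not part of the framework.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifier, k):
    """True if every quasi-identifier value combination occurs in at least k rows."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())


def generalize_zip(rows, keep_digits=3):
    """Hypothetical generalization step: keep only a ZIP prefix."""
    return [
        {**row, "ZIP": row["ZIP"][:keep_digits] + "*" * (len(row["ZIP"]) - keep_digits)}
        for row in rows
    ]


rows = [
    {"ZIP": "00198", "Health_Problem": "Shortness of breath"},
    {"ZIP": "00159", "Health_Problem": "Headache"},
    {"ZIP": "00156", "Health_Problem": "Obesity"},
    {"ZIP": "00198", "Health_Problem": "Shortness of breath"},
]

print(is_k_anonymous(rows, ("ZIP",), k=2))                  # False: 00159 and 00156 are unique
print(is_k_anonymous(generalize_zip(rows), ("ZIP",), k=2))  # True: all rows share 001**
```

Note that the check treats every attribute alike, which is the point raised in item 2: the sensitivity of what is disclosed plays no role.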
Private Record Linkage
• Let P and Q be two peers owning the relations RP(A1,…,An) and RQ(B1,…,Bn), respectively. The privacy-preserving record matching problem is to perform record matching between RP and RQ such that, at the end of the process:
– P knows only a set PMatch, consisting of the records in RP that match records in RQ. Similarly, Q knows only the set QMatch.
• In particular, no information is revealed to P and Q about records that do not match each other
• Published at SIGMOD 07
Key Ideas and Solutions (1)
• We cannot just encrypt the data and then compute distances among the encrypted values
– encryption functions, by design, do not preserve distances
• Let's work on numbers instead of records!!!
• Records are mapped into a vector space, and record matching is performed in that space
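The first point can be illustrated with a small experiment. This sketch uses a SHA-256 hash as a stand-in for the "encryption" step (an assumption made for illustration; the slide does not name a specific function): two strings at edit distance 1 produce digests that carry no trace of that closeness.

```python
import hashlib


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


a, b = "Scannapieco", "Scanapieco"  # same surname with one typo

digest_a = hashlib.sha256(a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(b.encode("utf-8")).hexdigest()

# Number of hex positions where the two digests differ.
differing = sum(x != y for x, y in zip(digest_a, digest_b))

print("edit distance on plaintext:", edit_distance(a, b))
print("differing hex digits in digests:", differing, "of", len(digest_a))
```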
Key Ideas and Solutions (2)
• Third-party-based protocol in which:
– The two parties jointly build the embedding space using a method (SparseMap) with "secure" features
– Each of the two parties embeds its own dataset and sends it to the third party
– The third party W performs the intersection and sends the result back to the parties
• Records are mapped into a vector space, and record matching is performed in that space
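The three protocol steps can be sketched as follows. This is not SparseMap itself: it is a simplified Lipschitz-style embedding in which each coordinate is the minimum edit distance to a reference set, and it omits the "secure" features of the real construction. The reference sets, the names `embed` and `third_party_match`, and the threshold are all hypothetical.

```python
import math


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance (dynamic programming, one row at a time)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


# Step 1 (assumption): P and Q agree on reference sets that define the space.
REFERENCE_SETS = [["anna", "marco"], ["rossi"], ["bianchi", "verdi"]]


def embed(value, reference_sets=REFERENCE_SETS):
    """Step 2: each party maps a value to a vector of distances to the reference sets."""
    return tuple(min(edit_distance(value, ref) for ref in refs) for refs in reference_sets)


def third_party_match(vectors_p, vectors_q, threshold):
    """Step 3: W sees only vectors and pairs those within the distance threshold."""
    return [
        (i, j)
        for i, vp in enumerate(vectors_p)
        for j, vq in enumerate(vectors_q)
        if math.dist(vp, vq) <= threshold
    ]


records_p = ["maria rossi", "luca bianchi"]
records_q = ["luca bianchi", "paolo neri"]

matches = third_party_match([embed(r) for r in records_p],
                            [embed(r) for r in records_q],
                            threshold=0.0)
print(matches)
```

A useful property of this kind of embedding is that each coordinate of two embedded values differs by at most their original edit distance, so close records remain close in the vector space.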
Key Ideas and Solutions (3)
• Th1: Given the two relations RP(D1,…,Ds) and RQ(D1,…,Dx), the set of matching records RecMatch, and the database size DBSize, the record matching protocol finds the matched records between the two relations with the following assurances:
– RecMatch is not disclosed to W
– RP − RecMatch is not disclosed to Q
– RQ − RecMatch is not disclosed to P
– DBSize is disclosed to W and bounded by P and Q
Schema Matching Features
• Th2: Given the schemas RP and RQ, owned by parties P and Q respectively, and the set of matching attributes AttrMatch, the schema matching protocol finds the attributes common to the two schemas with the following assurances:
– AttrMatch is not disclosed to W
– AttrMatch is not disclosed to P and Q
– AttrMatchSize is not disclosed to P and Q
How good are we?
• Time: better than record linkage without privacy preservation
• Effectiveness: comparable w.r.t. recall and precision
[Figure: Total Execution Time (secs) vs. Dataset Size, comparing "Our Method" with "Record Matching in the Original Space"]
[Figure: Precision and Recall vs. Threshold, with series Recall Original, Precision Original, Recall Embedded, Precision Embedded]
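For readers unfamiliar with the two effectiveness measures, here is a minimal sketch of how precision and recall are computed for a set of record pairs declared as matches. The pair identifiers below are made up for illustration and are not the experimental data behind the figures.

```python
def precision_recall(found_pairs, true_pairs):
    """Precision: share of found pairs that are correct.
    Recall: share of true pairs that were found."""
    found, truth = set(found_pairs), set(true_pairs)
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical output of a matcher and hypothetical ground truth.
found = [("p1", "q1"), ("p2", "q2"), ("p3", "q3"), ("p4", "q9")]
truth = [("p1", "q1"), ("p2", "q2"), ("p3", "q3"),
         ("p5", "q5"), ("p6", "q6"), ("p7", "q7")]

precision, recall = precision_recall(found, truth)
print(precision, recall)  # 0.75 0.5
```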
Flexible and Automatic RL
• P2P systems are loosely coupled, dynamic, and open
• Manual phases of record linkage can be problematic:
– Time-consuming, vs. dynamic and open systems
– Synchronous interactions, vs. loosely coupled systems
• Need for flexible and automatic RL
Background: Record Linkage Techniques
• Search Space Reduction:
– Sorted Neighborhood Method
– Blocking
– Hierarchical grouping
– …
• Comparison Functions:
– Edit distance
– Smith-Waterman
– Q-grams
– Jaro string comparator
– Soundex code
– TF-IDF
– …
• Decision Rules:
– Probabilistic: Fellegi & Sunter
– Empirical
– Knowledge-based
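To make the three families concrete, here is a minimal sketch that chains one technique from each: blocking for search space reduction, a q-gram (Jaccard) similarity as comparison function, and a fixed threshold as an empirical decision rule. The records, the blocking key, and the 0.5 threshold are illustrative assumptions.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Set of contiguous substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Jaccard similarity between the q-gram sets of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def blocked_pairs(left, right, key):
    """Search space reduction: only compare records that share the blocking key."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[rec[key]].append(rec)
    for rec in left:
        for other in blocks.get(rec[key], []):
            yield rec, other


left = [
    {"id": "r1", "name": "rossi", "zip": "00198"},
    {"id": "r2", "name": "verdi", "zip": "00156"},
]
right = [
    {"id": "s1", "name": "rosi", "zip": "00198"},
    {"id": "s2", "name": "verdi", "zip": "00198"},
    {"id": "s3", "name": "bianchi", "zip": "00156"},
]

THRESHOLD = 0.5  # empirical decision rule (assumed value)

candidates = list(blocked_pairs(left, right, key="zip"))
matches = [(a["id"], b["id"]) for a, b in candidates
           if qgram_similarity(a["name"], b["name"]) >= THRESHOLD]

print(len(candidates), "candidate pairs instead of", len(left) * len(right))
print(matches)  # [('r1', 's1')]
```

The example also shows the cost of blocking: r2 and s2 carry the same name but fall into different blocks, so they are never compared.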
Key Idea
• Record linkage is a complex process and should be decomposed as much as possible into its constituent phases
• For each phase, the most appropriate technique should be chosen depending on application and data requirements
• Goal: dynamically build ad hoc record linkage workflows
• RELAIS: a toolkit serving this purpose
– developed at Istat
– UNIROMA contribution on data profiling (coming in a couple of slides)
RELAIS Toolkit
Inputs to RELAIS:
Application Constraints:
• Admissible error rates
• Privacy issues
• Cost
• …
Database Features:
• Size
• Quality
• Domain features
• …
Output of RELAIS: Record Linkage Workflow
RL Workflows
[Figure: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each built by choosing techniques for the four phases below]
• Preprocessing: UpperLowerCase, Normalization, Schema reconciliation
• Search Space Reduction: SNM, Blocking
• Comparison Function: Jaro, Equality, Edit Distance
• Decision Model: Probabilistic, Empirical
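The idea of assembling a workflow by picking one technique per phase can be sketched as below. This is not the RELAIS API: the class `RLWorkflow`, the phase functions, and the sample records are hypothetical, and the blocking function is hard-wired to a `name` field to keep the example short.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Pairs = Iterable[Tuple[Record, Record]]


@dataclass
class RLWorkflow:
    """One pluggable technique per phase (all names are illustrative)."""
    preprocess: Callable[[str], str]
    reduce_search_space: Callable[[List[Record], List[Record]], Pairs]
    compare: Callable[[str, str], float]
    decide: Callable[[float], bool]

    def run(self, left: List[Record], right: List[Record], field: str):
        def clean(recs):
            return [{**r, field: self.preprocess(r[field])} for r in recs]
        pairs = self.reduce_search_space(clean(left), clean(right))
        return [(a["id"], b["id"]) for a, b in pairs
                if self.decide(self.compare(a[field], b[field]))]


def normalize(s: str) -> str:
    """Preprocessing: collapse whitespace and upper-case."""
    return " ".join(s.split()).upper()


def first_letter_blocking(left, right) -> Pairs:
    """Search space reduction: keep pairs whose names start with the same letter."""
    return ((a, b) for a, b in product(left, right) if a["name"][:1] == b["name"][:1])


def equality(a: str, b: str) -> float:
    """Comparison function: 1.0 on exact match, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def threshold(t: float) -> Callable[[float], bool]:
    """Empirical decision model: declare a match when the score reaches t."""
    return lambda score: score >= t


workflow = RLWorkflow(normalize, first_letter_blocking, equality, threshold(1.0))

left = [{"id": "p1", "name": "  mario   rossi "}]
right = [{"id": "q1", "name": "MARIO ROSSI"}, {"id": "q2", "name": "Anna Verdi"}]

result = workflow.run(left, right, field="name")
print(result)  # [('p1', 'q1')]
```

A second application could reuse the same class with a different choice per phase, for example an edit-distance comparison with a lower threshold.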
Making Some Phases Automatic
• Data profiling for choosing matching keys
• Automatic extraction of:
– Completeness
– Consistency
– Identification power
• Ongoing
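A possible way to profile attributes when choosing matching keys is sketched below. The definitions are assumptions made for illustration: completeness is taken as the share of non-null values, identification power as the share of distinct values among the non-null ones, and attributes are ranked by their product. Consistency is left out because it needs domain rules. The function names and sample rows are hypothetical.

```python
def completeness(rows, attr):
    """Share of rows in which attr is present and not None."""
    return sum(r.get(attr) is not None for r in rows) / len(rows)


def identification_power(rows, attr):
    """Share of distinct values among the non-null values of attr."""
    values = [r[attr] for r in rows if r.get(attr) is not None]
    return len(set(values)) / len(values) if values else 0.0


def rank_matching_keys(rows, attrs):
    """Order candidate keys by completeness * identification power, best first."""
    return sorted(attrs,
                  key=lambda a: completeness(rows, a) * identification_power(rows, a),
                  reverse=True)


rows = [
    {"ssn": "A", "zip": "00198"},
    {"ssn": "E", "zip": "00198"},
    {"ssn": None, "zip": "00156"},
    {"ssn": "C", "zip": None},
]

ranking = rank_matching_keys(rows, ["zip", "ssn"])
print(ranking)  # ['ssn', 'zip']
```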
Status of RELAIS
• Currently: guided execution of RL workflows, with all phases automatic
• Future:
– Definition of RELAIS's architecture as a service-oriented, web-accessible architecture. Formal specification of (i) the input/output of services, and (ii) pre/post conditions, using Semantic Web Services technologies
– Automatic generation of RL workflows by reasoning on service specifications, using either automatic [Berardi et al., VLDB 2005] or semi-automatic [Bouguettaya et al., VLDBJ 2003] service composition techniques
Implementation View
PQ-RELAIS
Q-RELAIS:
• Data source profiling (quality metadata)
• Quality-based trust evaluation
• Automatic and flexible RL
P-RELAIS:
• Privacy risk assessment
• Private RL