*** 1 - Wamdm

Record Linkage with Uniqueness
Constraints and Erroneous Values
Zhang Xiaojian
2010 November 26
WAMDM Group Meeting
Data integration process
[Figure: heterogeneous sources (s) are integrated into cleaned data that serves Application1 and Application2]
• Schema level
– Schema matching [E. Rahm, VLDBJ’01]
– Data exchange [R. Fagin, TODS’05]
• Instance level
– Duplicate detection / record linkage [A.K. Elmagarmid, TKDE’07]
– Entity resolution [Tech Report, Stanford]
• Value level
– Data fusion [X. Dong, VLDB’09], [Felix, WWW’06], [Felix, ACMC’08]
– Contradiction, conflicting data, uncertainty
• Two challenges: duplicate detection (record linkage) and data fusion
• Example of uncertainty: (Name, Address, Age) = (John R Smith, 16 Main Street, 16) vs. (J R Smith, 16 Main St, NULL)
Contents
• Motivation
• Problem definition
• Solution
• Experimental results
• Conclusions
• Getting some problems from the paper
Motivation
[Figure: four sources s1–s4 (Src codes V, D, A, T) feed the cleaned data behind a search box]

Src | Name | Phone | Address | City
V | A-Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
V | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
V | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
V | Aeropostale | 8185458972 | 2187 GLENDALE GALLERIA | GLENDALE
V | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
V | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | NEWTOWN
V | Pizza Palace Of Newtown | 2034266114 | 65 Church hill Rd | NEWTOWN

Src | Name | Phone | Address | City
D | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
D | Aldo Shoes | 8184090612 | 1157 GLENDALE GALLERIA | GLENDALE
D | Newtown Pizza Palace | 2034299114 | Church Hill Rd | Newtown
D | Pizza Palace of Newtown | 2034266114 | 65 Church hill Rd | Newtown

Src | Name | Phone | Address | City
A | A 24 Hour 1 A 1 Locksmith | 8182404644 | 3210 GLENDALE GALLERIA | GLENDALE
A | A Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
A | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
A | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
A | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
A | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
A | Alert Cellular | 8182404779 | 2148 GLENDALE GALLERIA | GLENDALE

Src | Name | Phone | Address | City
T | Newtown Piza Palace | 2034266114 | 65 Church hill Rd | Newtown
T | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
T | American Eagle Outfitters | 8189561893 | 2182 GLENDALE GALLERIA | GLENDALE
T | ANN TAYLOR | 8182460350 | 2178 GLENDALE GALLERIA | GLENDALE
T | Ann Taylor Stores | 8182460350 | 1108 GLENDALE GALLERIA | GLENDALE
Current Solution
• Current two-step solution
– Step 1: Record Linkage
• link records that are likely to refer to the same real-world
entity
– [A.K Elmagarmid, TKDE’07], [W.Winkler, Tech Report’06]
– Step 2: Data Fusion
• merge the linked records and decide the correct values for
each result entity in the presence of conflicts
– [J. Bleiholder et al., ACM Computing Surveys’08]
• Uniqueness constraint
– Many real-world entities have a unique value for an
attribute, e.g., Website (IP), Phone, Facebook account
• Co-existence of conflicts and duplicates makes
the problem hard to solve
Limitations of Current Solution
SOURCE | NAME | PHONE | ADDRESS
s1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way
s1 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s1 | Macrosoft Inc. | xxx-0500 | 2 Sylvan W.
s2 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s2 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s2 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s3 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s3 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s3 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s4 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s4 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s4 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s5 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s5 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s5 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s6 | Microsoft Corp. | xxx-2255 | 1 Microsoft Way
s6 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s7 | MS Corp. | xxx-1255 | 1 Microsoft Way
s7 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s8 | MS Corp. | xxx-1255 | 1 Microsoft Way
s8 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s9 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s10 | MS Corp. | xxx-0500 | 2 Sylvan Way

Ideal solution:
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
• Assume that Phone and Address satisfy uniqueness constraints
• Erroneous values may prevent correct matching
• Current solutions may fall short when uniqueness constraints exist (PHONE: xxx-9400 missing)
Contents
• Motivation
• Problem definition
• Solution
• Experimental results
• Conclusions and Future work
Problem Definition
– Input
• A set of records provided by a set of independent data
sources
• A set of (hard or soft) uniqueness constraints
– Output:
• Real-world entities
• For each (hard or soft) uniqueness attribute of each entity
– True value
Concepts
• Entity and Attribute
– E.g.,
(Macrosoft Inc.)
(XXX-0500)
(2 Sylvan Way, 2 Sylvan W.)
(Microsoft Corp. ,Microsofe Corp., MS Corp.)
(XXX-1255, xxx-9400)
(1 Microsoft Way)
– Value vs. representations (e.g., value New York City → representations New York City, NYC, N.Y.C)
• Constraint
– Uniqueness constraint (hard constraint): $D \leftrightarrow A$
• Business Name, Business Phone, Business Address
– Soft uniqueness constraint (soft constraint): $D \overset{1-p_1}{\underset{1-p_2}{\longleftrightarrow}} A$
• Business Phone (e.g., p1 = 30%, p2 = 10%)
• p1 is the upper bound on the probability of an entity having multiple values
for A, and p2 is the upper bound on the probability of a value of A being shared by
multiple entities.
Special case: key attribute
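
To make the meaning of p1 and p2 concrete, the following minimal sketch measures the two observed frequencies that they are meant to bound, for one attribute A. The function name `soft_constraint_rates` and the entity labels are hypothetical and only serve the illustration; the phone values come from the Microsoft/Macrosoft example.

```python
from collections import defaultdict

def soft_constraint_rates(pairs):
    """pairs: iterable of (entity, value) for a single attribute A.

    Returns (fraction of entities with more than one value,
             fraction of values shared by more than one entity).
    """
    values_of = defaultdict(set)    # entity -> distinct values
    entities_of = defaultdict(set)  # value  -> distinct entities
    for entity, value in pairs:
        values_of[entity].add(value)
        entities_of[value].add(entity)
    multi_valued = sum(1 for v in values_of.values() if len(v) > 1)
    shared = sum(1 for e in entities_of.values() if len(e) > 1)
    return multi_valued / len(values_of), shared / len(entities_of)

# Phones of the two entities in the running example (labels are illustrative).
phones = [
    ("Microsoft", "xxx-1255"),
    ("Microsoft", "xxx-9400"),
    ("Macrosoft", "xxx-0500"),
]
rate_multi, rate_shared = soft_constraint_rates(phones)
```

A soft constraint with bounds p1 and p2 would expect `rate_multi` to stay at or below p1 and `rate_shared` at or below p2 on representative data.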
Contents
• Motivation
• Problem definition
• Solution
• Experimental results
• Conclusions and Future work
K-Partite Graph Encoding
• Each attribute forms one partite set of value representations: names N1–N4, phones P1–P4, addresses A1–A3
– A1 = 1 Microsoft Way, A2 = 2 Sylvan Way, A3 = 2 Sylvan W.
• Edges are labeled with the sources that provide the two values together
– E.g., the record (Microsofe Corp., xxx-1255, 1 Microsoft Way) from s1 yields nodes N1, P1, A1 with edges labeled s(1)
[Figure: encoding of the ideal solution]
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
Pre-processing for the K-partite graph
• Clustering in every partite set (subset)
Clustering with Hard Constraint
• Clustering the whole graph G(S)
[Figure: clusters C1 (N1, N2, N3, P1, A1), C2 and C3 (P2, P3), C4 (N4, P4, A2, A3)]
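
The encoding can be sketched in a few lines. This is a minimal illustration, assuming each record is a (source, name, phone, address) tuple; the function name `encode` and the data layout are assumptions made here, and only four records from s1 and s2 are used to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

ATTRS = ("NAME", "PHONE", "ADDRESS")

def encode(records):
    """Build the partite sets and source-labeled edges from records."""
    nodes = {attr: set() for attr in ATTRS}   # one partite set per attribute
    edges = defaultdict(set)                  # (node_u, node_v) -> sources
    for source, *values in records:
        tagged = list(zip(ATTRS, values))     # e.g. ("NAME", "Macrosoft Inc.")
        for attr, value in tagged:
            nodes[attr].add(value)
        # Connect every pair of values that co-occur in this record.
        for u, v in combinations(tagged, 2):
            edges[(u, v)].add(source)
    return nodes, edges

records = [
    ("s1", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s1", "Macrosoft Inc.", "xxx-0500", "2 Sylvan W."),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
]
nodes, edges = encode(records)
```

With these four records, the edge between the name "Macrosoft Inc." and the phone "xxx-0500" carries the label {s1, s2}, which corresponds to the s(...) labels in the figure.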
Clustering w.r.t. hard constraint
• An ideal clustering should meet two requirements
– High cohesion within each cluster
– Low correlation between different clusters
• Objective function for getting the “best” clustering
– Choose the Davies-Bouldin index [Davies and Bouldin, TPAMI’79]
– The goal is to minimize the Davies-Bouldin index, $\min \Phi(C)$
$$\Phi(C) = \operatorname{Avg}_{i=1}^{n} \left( \max_{j \in [1,n],\, j \neq i} \frac{d(C_i,C_i) + d(C_j,C_j)}{d(C_i,C_j)} \right)$$
• $d(C_i,C_i)$ corresponds to the complement of cohesion
• $d(C_i,C_j)$ corresponds to the complement of correlation
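
As a small sketch of the objective, the function below evaluates the Davies-Bouldin index from a precomputed matrix of cluster distances, where `d[i][i]` is the within-cluster distance and `d[i][j]` the distance between clusters i and j. The function name and the toy 2x2 matrix are assumptions for illustration, not values from the slides.

```python
def davies_bouldin(d):
    """Average, over clusters i, of the worst ratio
    (d[i][i] + d[j][j]) / d[i][j] taken over all j != i."""
    n = len(d)
    if n < 2:
        raise ValueError("the index needs at least two clusters")
    total = 0.0
    for i in range(n):
        total += max((d[i][i] + d[j][j]) / d[i][j] for j in range(n) if j != i)
    return total / n

# Toy distances: two fairly cohesive clusters that are far apart.
d = [
    [0.1, 0.8],
    [0.8, 0.2],
]
phi = davies_bouldin(d)
```

Smaller values are better: tight clusters shrink the numerator, and well-separated clusters grow the denominator.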
Computing cluster distance
• Cluster distance function
$$d(C_i,C_j) = \frac{d_S(C_i,C_j) + d_A(C_i,C_j)}{2}$$
– $d_S(C_i,C_j)$ is the similarity distance, measuring similarity between value
representations of the same attribute.
– $d_A(C_i,C_j)$ is the association distance, measuring association between value
representations of different attributes.
• The key is how to calculate $d_S(C_i,C_j)$ and $d_A(C_i,C_j)$ for
computing the cluster distance
Similarity Distance
• How to get the similarity distance $d_S(C_i,C_j)$

Name similarity
   | N1 | N2 | N3 | N4
N1 | 1.0 | 0.95 | 0.65 | 0.7
N2 | 0.95 | 1.0 | 0.65 | 0.7
N3 | 0.65 | 0.65 | 1.0 | 0.4
N4 | 0.7 | 0.7 | 0.4 | 1.0

Address similarity
   | A1 | A2 | A3
A1 | 1.0 | 0 | 0
A2 | 0 | 1.0 | 0.9
A3 | 0 | 0.9 | 1.0

• Within the same cluster
– $d_S^1(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name)
– $d_S^2(C_1,C_1) = 0$ (phone)
– $d_S^3(C_1,C_1) = 0$ (address)
– $d_S(C_1,C_1) = (0.25+0+0)/3 = 0.083$
• Between different clusters
– $d_S^1(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$ (name)
– $d_S^2(C_1,C_4) = 1 - 0 = 1$ (phone)
– $d_S^3(C_1,C_4) = 1 - 0 = 1$ (address)
– $d_S(C_1,C_4) = (0.4+1+1)/3 = 0.8$
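
The worked numbers above can be reproduced with a short sketch. It assumes, as the example suggests, that the per-attribute distance is one minus the average pairwise similarity (over distinct node pairs inside one cluster, or over cross pairs between two clusters), that a single-node attribute contributes 0, and that unlisted pairs have similarity 0. The helper names are hypothetical.

```python
from itertools import combinations, product

NAME_SIM = {
    frozenset(("N1", "N2")): 0.95, frozenset(("N1", "N3")): 0.65,
    frozenset(("N1", "N4")): 0.7, frozenset(("N2", "N3")): 0.65,
    frozenset(("N2", "N4")): 0.7, frozenset(("N3", "N4")): 0.4,
}
ADDR_SIM = {frozenset(("A2", "A3")): 0.9}
TABLES = {"name": NAME_SIM, "phone": {}, "address": ADDR_SIM}

def sim(table, a, b):
    # Identical representations are fully similar; unlisted pairs count as 0.
    return 1.0 if a == b else table.get(frozenset((a, b)), 0.0)

def attr_distance(table, nodes_i, nodes_j, same_cluster):
    if same_cluster:
        pairs = list(combinations(sorted(nodes_i), 2))
        if not pairs:            # a single representation is perfectly cohesive
            return 0.0
    else:
        pairs = list(product(nodes_i, nodes_j))
    return 1.0 - sum(sim(table, a, b) for a, b in pairs) / len(pairs)

def similarity_distance(ci, cj, same_cluster):
    parts = [attr_distance(t, ci[k], cj[k], same_cluster) for k, t in TABLES.items()]
    return sum(parts) / len(parts)

C1 = {"name": ["N1", "N2", "N3"], "phone": ["P1"], "address": ["A1"]}
C4 = {"name": ["N4"], "phone": ["P4"], "address": ["A2", "A3"]}
d_s_same = similarity_distance(C1, C1, same_cluster=True)
d_s_diff = similarity_distance(C1, C4, same_cluster=False)
```

Under these assumptions `d_s_same` and `d_s_diff` agree with the slide values 0.083 and 0.8 up to rounding.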
Association Distance
• How to get the association distance $d_A(C_i,C_j)$
$$d_A(C_i,C_j) = \operatorname{Avg}_{l,l' \in [1,k],\, l \neq l'} \; d_A^{l,l'}(C_i,C_j)$$
$$d_A^{l,l'}(C_i,C_j) = \begin{cases} 1 - \dfrac{|S^{l,l'}(C_i)|}{|S^{l}(C_i) \cup S^{l'}(C_i)|} & (i = j) \\[2ex] 1 - \max\left\{ \dfrac{|S^{l,l'}(C_i,C_j)|}{|S^{l}(C_i) \cup S^{l'}(C_j)|},\; \dfrac{|S^{l,l'}(C_j,C_i)|}{|S^{l}(C_j) \cup S^{l'}(C_i)|} \right\} & (i \neq j) \end{cases}$$
• Within the same cluster
– $d_A^{1,2}(C_1,C_1) = 1 - 7/9 = 0.22$
– $d_A^{1,3}(C_1,C_1) = 1 - 8/9 = 0.11$
– $d_A^{2,3}(C_1,C_1) = 1 - 7/8 = 0.125$
– $d_A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$
• Between different clusters
– $d_A^{1,2}(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$
– $d_A^{1,3}(C_1,C_4) = 0.9$
– $d_A^{2,3}(C_1,C_4) = 1$
– $d_A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$
[Figure: the k-partite graph with clusters C1 and C4; edges are labeled with their supporting sources, e.g., s(1-5,7,8), s(1-9), s(10)]
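
The association distances can be checked against the ten-source example with the sketch below. It assumes the 23 records as read from the table on the "Limitations of Current Solution" slide, takes $S^{l}(C)$ to be the sources providing any value of attribute l in cluster C, and takes $S^{l,l'}(C_i,C_j)$ to be the sources with a record whose l-value lies in $C_i$ and whose l'-value lies in $C_j$. All function names are hypothetical.

```python
from itertools import combinations

MS, MSE, MSC, MAC = "Microsoft Corp.", "Microsofe Corp.", "MS Corp.", "Macrosoft Inc."
A1, A2, A3 = "1 Microsoft Way", "2 Sylvan Way", "2 Sylvan W."

# (source, name, phone, address)
RECORDS = [
    (1, MSE, "1255", A1), (1, MSE, "9400", A1), (1, MAC, "0500", A3),
    (2, MS, "1255", A1), (2, MSE, "9400", A1), (2, MAC, "0500", A2),
]
for s in (3, 4, 5):
    RECORDS += [(s, MS, "1255", A1), (s, MS, "9400", A1), (s, MAC, "0500", A2)]
RECORDS += [
    (6, MS, "2255", A1), (6, MAC, "0500", A2),
    (7, MSC, "1255", A1), (7, MAC, "0500", A2),
    (8, MSC, "1255", A1), (8, MAC, "0500", A2),
    (9, MAC, "0500", A2),
    (10, MSC, "0500", A2),
]

def providers(cluster, l):
    """Sources providing some value of attribute l that belongs to the cluster."""
    return {r[0] for r in RECORDS if r[l + 1] in cluster[l]}

def co_providers(ci, l, cj, m):
    """Sources with a record pairing an l-value of ci with an m-value of cj."""
    return {r[0] for r in RECORDS if r[l + 1] in ci[l] and r[m + 1] in cj[m]}

def pair_distance(ci, cj, l, m, same):
    def ratio(a, b):
        union = providers(a, l) | providers(b, m)
        return len(co_providers(a, l, b, m)) / len(union) if union else 0.0
    if same:
        return 1.0 - ratio(ci, ci)
    return 1.0 - max(ratio(ci, cj), ratio(cj, ci))

def association_distance(ci, cj, same):
    pairs = list(combinations(range(3), 2))   # (name, phone), (name, addr), (phone, addr)
    return sum(pair_distance(ci, cj, l, m, same) for l, m in pairs) / len(pairs)

# Clusters as (names, phones, addresses).
C1 = ({MS, MSE, MSC}, {"1255"}, {A1})
C4 = ({MAC}, {"0500"}, {A2, A3})
d_a_same = association_distance(C1, C1, same=True)
d_a_diff = association_distance(C1, C4, same=False)
```

Combining these with the similarity distances 0.083 and 0.8 through the cluster distance function gives d(C1,C1) ≈ 0.118 and d(C1,C4) ≈ 0.867.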
Greedy Algorithm: CLUSTER
• Obtaining the optimal clustering is intractable
– [T.F. Gonzalez, 82], [J. Sima et al., 06]
• Algorithm: CLUSTER
– Step 1: Initialization
• Cluster value representations according to their similarity
distance and association distance
– Step 2: Adjustment
• For each node, move it to the cluster that minimizes the Davies-Bouldin (DB) index
– Step 3: Convergence checking
• Stop if step 2 doesn’t change the clustering result. Otherwise,
repeat step 2
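
The adjustment and convergence steps can be sketched as a simple local search. This is an illustration only: the initial clustering is passed in rather than computed, the nodes are plain numbers, and the cluster distance is a stand-in (mean absolute difference between members) rather than the similarity and association distances of the slides. Keeping clusters non-empty is an extra simplifying assumption.

```python
from itertools import combinations, product

def dist(ci, cj):
    """Stand-in cluster distance: mean absolute difference between members."""
    pairs = list(combinations(ci, 2)) if ci is cj else list(product(ci, cj))
    return sum(abs(a - b) for a, b in pairs) / len(pairs) if pairs else 0.0

def db_index(clusters):
    total = 0.0
    for i, ci in enumerate(clusters):
        worst = 0.0
        for j, cj in enumerate(clusters):
            if i == j:
                continue
            between = dist(ci, cj)
            ratio = float("inf") if between == 0 else (dist(ci, ci) + dist(cj, cj)) / between
            worst = max(worst, ratio)
        total += worst
    return total / len(clusters)

def cluster(initial, max_rounds=100):
    clusters = [set(c) for c in initial]
    for _ in range(max_rounds):
        changed = False
        for node in sorted(set().union(*clusters)):
            src = next(i for i, c in enumerate(clusters) if node in c)
            if len(clusters[src]) == 1:
                continue                      # do not empty a cluster
            best, best_phi = src, db_index(clusters)
            for dst in range(len(clusters)):
                if dst == src:
                    continue
                # Tentatively move the node, score the result, then undo.
                clusters[src].remove(node); clusters[dst].add(node)
                phi = db_index(clusters)
                clusters[dst].remove(node); clusters[src].add(node)
                if phi < best_phi:
                    best, best_phi = dst, phi
            if best != src:                   # Step 2: adjustment
                clusters[src].remove(node); clusters[best].add(node)
                changed = True
        if not changed:                       # Step 3: convergence
            break
    return clusters

result = cluster([[1.0, 1.1, 1.2, 5.0], [5.1, 5.2]])
```

In this toy run the misplaced point 5.0 moves to the second cluster in the first round, and the second round changes nothing, so the loop stops.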
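[Figure: CLUSTER on the example graph (names N1–N4, phones P1–P4, addresses A1 = 1 Microsoft Way, A2 = 2 Sylvan Way, A3 = 2 Sylvan W.), ending with clusters C1–C4. Φ values shown for candidate clusterings: 0.94, 0.71, 0.93, 0.92, 1.16, 1.15, 0.89, 0.71, 0.45]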
Matching w.r.t. Soft Constraints
[Figure: graph transform. Each cluster becomes one node (name clusters NC1, NC4; phone clusters PC1–PC4; address clusters AC1, AC4), and each edge is weighted by the number of sources supporting it, e.g., 7 for s(1-5,7,8), 5 for s(1-5), 9 for s(1-9), 8 for s(1-8), 1 for s(6) or s(10)]
• The next step is to find the best matching between the key
attribute and the soft uniqueness attributes
• How to match?
Matching w.r.t. Soft Constraint
• Goals
– Maximize the sum of the weights w(e) of the selected edges
– Minimize the gap Gap(N) for each node
– How to balance the two goals? Use a score function that balances w(e) and
Gap(N)
• Getting the “best” matching
– Maximize the score function
$$\mathrm{Score}(M) = \sum_{(u,v) \in M} \frac{w(u,v)}{\mathrm{Gap}(u) + \mathrm{Gap}(v) + \text{[symbol lost]}}$$
• Greedy algorithm: MATCH
– Getting Gap(N) and w(u,v)
$$\mathrm{Gap}(N) = \mathrm{Max}(w(e)) - \mathrm{Min}(w(e))$$
– E.g., Gap(N1) = 9 − 1 = 8, Gap(P1) = 0
[Figure: N1 connected to P1 with weight 1 (s1), to P2 with weight 7 (s4-s10), and to P3 with weight 9 (s2-s10)]
Continue the example
[Figure: four candidate matchings (Solution 1 to Solution 4) between names N1, N2 and phones P1–P4, each obtained by greedily selecting edges]
• Edge weights and supporting sources
– N1–P1: 1 (s1)
– N1–P2: 7 (s4-s10)
– N1–P3: 9 (s2-s10)
– N1–P4: 10 (s1-s10)
– N2–P2: 3 (s3-s5)
– N2–P4: 8 (s2-s9)
• Gaps for the selected edges
– Gap(N1) = 9, Gap(P1) = 0, w(N1,P1) = 1
– Gap(N1) = 3, Gap(P2) = 4, w(N1,P2) = 7
– Gap(N1) = 1, Gap(P3) = 0, w(N1,P3) = 9
– Gap(N1) = 0, Gap(P4) = 2, w(N1,P4) = 10
– Gap(N2) = 5, Gap(P2) = 4, w(N2,P2) = 3
– Gap(N2) = 0, Gap(P4) = 2, w(N2,P4) = 8
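
The gap values in this example can be reproduced with the sketch below. Two assumptions are made here and are not stated on the slides: Gap of a node is taken as the largest weight among all its edges minus the smallest weight among its selected edges, which matches the listed values, and the additive term in the score's denominator is represented by a hypothetical parameter `smooth` whose name and default value are placeholders.

```python
# Edge weights w(u, v) from the example.
WEIGHTS = {
    ("N1", "P1"): 1, ("N1", "P2"): 7, ("N1", "P3"): 9, ("N1", "P4"): 10,
    ("N2", "P2"): 3, ("N2", "P4"): 8,
}

def gap(node, matching):
    """Largest incident weight minus smallest selected weight (assumed reading)."""
    incident = [w for edge, w in WEIGHTS.items() if node in edge]
    selected = [WEIGHTS[edge] for edge in matching if node in edge]
    return max(incident) - min(selected)

def score(matching, smooth=1.0):
    """Sum of w(u,v) / (Gap(u) + Gap(v) + smooth); `smooth` is a placeholder."""
    return sum(
        WEIGHTS[(u, v)] / (gap(u, matching) + gap(v, matching) + smooth)
        for u, v in matching
    )

both_to_p4 = [("N1", "P4"), ("N2", "P4")]
both_to_p2 = [("N1", "P2"), ("N2", "P2")]
gaps_p4 = (gap("N1", both_to_p4), gap("N2", both_to_p4), gap("P4", both_to_p4))
gaps_p2 = (gap("N1", both_to_p2), gap("N2", both_to_p2), gap("P2", both_to_p2))
score_p4 = score(both_to_p4)
score_p2 = score(both_to_p2)
```

Because the score divides each weight by the gaps of its endpoints, a matching that selects heavy edges with small gaps scores higher than one that selects light edges next to much heavier unselected ones. The actual ranking depends on the value chosen for `smooth`.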
Contents
• Motivation
• Problem definition
• Solution
• Experimental results
• Conclusions and Future work
Experiment Settings
• Dataset I
– Business listings for two zip codes (07035, 07715) from
multiple sources

Zip | #Businesses | #Sources | #Sources/business | #Records | #Names | #Phones | #Addresses | #(Error Phones)
07035 | 662 | 15 | 1-7 | 1629 | 1154 | 839 | 735 | 72
07715 | 149 | 6 | 1-3 | 266 | 243 | 184 | 55 | 12
Experiment Settings
• Implementation
– MATCH + CLUSTER
– LINK: linkage only
– FUSE: data fusion only
– LINKFUSE: first LINK, then FUSE
• Gold standard: obtained by manual checking
• Measures: Precision / Recall / F-measure
– Matching of values of different attributes
$$P = \frac{|G_M \cap R_M|}{|R_M|}, \qquad R = \frac{|G_M \cap R_M|}{|G_M|}, \qquad F = \frac{2PR}{P+R}$$
– Clustering of values of the same attribute
$$P = \frac{|G_A \cap R_A|}{|R_A|}, \qquad R = \frac{|G_A \cap R_A|}{|G_A|}, \qquad F = \frac{2PR}{P+R}$$
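
As a minimal sketch of the three measures, the function below compares a gold-standard set of pairs with a result set. The function name and the small gold and result sets are made up for illustration and are not taken from the experiments.

```python
def precision_recall_f(gold, result):
    """Precision, recall and F-measure of `result` against `gold` (both sets)."""
    gold, result = set(gold), set(result)
    hits = len(gold & result)
    p = hits / len(result) if result else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Name-phone matchings: two of the three returned pairs are in the gold standard.
gold = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Microsoft Corp.", "xxx-9400"),
    ("Macrosoft Inc.", "xxx-0500"),
}
result = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Macrosoft Inc.", "xxx-0500"),
    ("MS Corp.", "xxx-0500"),
}
p, r, f = precision_recall_f(gold, result)
```

The same function applies to the clustering measure when the sets hold pairs of values of the same attribute that are placed in one cluster.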
Accuracy
[Figures: accuracy plots for zip 07035 and zip 07715, each with three panels: Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME)]
Efficiency and Scalability
Conclusions
• In the real world, we need to resolve duplicates
and conflicts at the same time.
• We reduce the problem to a k-partite graph
clustering and matching problem
– Combine linkage and fusion
• Experiments show high efficiency and scalability
Thank You!
