Poster Slides - University of Maryland

Relational Clustering for Entity Resolution Queries
Indrajit Bhattacharya, Louis Licamele and Lise Getoor
University of Maryland, College Park
The Entity Resolution Problem
P1: “A mouse immunity model”, W.Wang, C.Chen, A.Ansari
P2: “A better mouse immunity model”, W.Wang, A.Ansari
P3: “Measuring protein-bound fluxetine”, L.Li, C.Chen, W.Wang
P4: “Autoimmunity in biliary cirrhosis”, W.W.Wang, A.Ansari
[Figure: the author references above mapped to the entities Abdulla Ansari, Chih Chen, WeiWei Wang, Wenyi Wang, Liyuan Li and Chien-Te Chen]
 Discover the domain entities
 Map each reference to an entity
Query-time ER: Motivation

Most publicly available databases do not
have resolved entities
o PubMed, CiteSeer have many unresolved authors

Millions of queries every day require resolved
entities, directly or indirectly
o “I am looking for all papers by Stuart Russell”

How do we address this problem?
1. Leave the burden on the user to do the resolution
2. Ask owners to ‘clean’ their databases
3. Develop techniques for query-time resolution
Entity Resolution Queries
 Disambiguation Query
o Among all papers with ‘W Wang’ as author, find those written by WeiWei Wang
P1: “A mouse immunity model”, W.Wang, C.Chen, A.Ansari
P2: “A better mouse immunity model”, W.Wang, A.Ansari
P4: “Autoimmunity in biliary cirrhosis”, W.W.Wang, A.Ansari
P1: “A mouse immunity model”, W.Wang, C.Chen, A.Ansari
P2: “A better mouse immunity model”, W.Wang, A.Ansari
P3: “Measuring protein-bound fluxetine”, L.Li, C.Chen, W.Wang
 Resolution Query
o Do disambiguation
o Also retrieve papers by WeiWei Wang with a different author name, e.g. ‘W W Wang’
Query-time ER using Relations
1. Simple approach for resolving queries
o Use attributes
o Quick but not accurate
2. Use best techniques available
o Collective resolution using relationships
o How can we localize collective resolution?

Two-phase collective resolution for query
o Extract minimal set of relevant records
o Collective resolution on extracted records
Cut-based Evaluation of
Relational Clustering
• Vertices embedded in attribute space
• Additional (hyper)edges represent relationships
[Figure: two candidate clusterings of the same vertices, with clusters labelled C1 to C4]
Good separation of attributes
Many cluster-cluster relationships
 C1-C3, C1-C4, C2-C4
Worse in terms of attributes
Fewer cluster-cluster relationships
 C1-C3, C2-C4
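A Cut-based Objective Function
$$\sum_i \sum_j \left[ w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \delta(c_i, c_j) \, f(c_i, c_j) \right]$$
o $w_A$: weight for attributes
o $\mathrm{sim}_A(c_i, c_j)$: similarity of attributes
o $w_R$: weight for relations
o $\delta(c_i, c_j)$: 1 iff relational edge exists between $c_i$ and $c_j$
o $f(c_i, c_j)$: compatibility of $c_i$ and $c_j$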
 Greedy clustering algorithm: merge cluster pair with
max reduction in objective function
$$\Delta(c_i, c_j) = w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, |N(c_i) \cap N(c_j)| \, f(c_i, c_j)$$
Similarity of attributes
• Jaro, Levenshtein; TF-IDF
Common cluster neighborhood
• Jaccard works better than intersection
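As a minimal sketch of the two ingredients listed above, the Python below combines a normalized Levenshtein similarity with the Jaccard overlap of two clusters' neighborhoods. The function names, the equal default weights and the normalization by the longer string are assumptions made for illustration, not details from the poster, and any further factor in the formula is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def attr_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(n1: set, n2: set) -> float:
    """Shared neighbors divided by all neighbors of the two clusters."""
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def merge_score(name_i, name_j, nbrs_i, nbrs_j, w_a=0.5, w_r=0.5):
    """Weighted sum of attribute similarity and neighborhood overlap.

    w_a and w_r are hypothetical weights chosen for this sketch.
    """
    return w_a * attr_sim(name_i, name_j) + w_r * jaccard(nbrs_i, nbrs_j)


score = merge_score("W Wang", "W W Wang",
                    {"A Ansari", "C Chen"}, {"A Ansari"})
```

A greedy clusterer would evaluate such a score for every candidate cluster pair and merge the best one, repeating until no pair scores above some threshold.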
Extracting Relevant Records
[Figure: expansion of the query ‘W Wang’]
Query: W Wang
Name expansion (Level 0): P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang
Hyper-edge expansion (Level 1): P1: A Ansari, P1: C Chen, P2: A Ansari, P3: C Chen, P3: L Li, P4: A Ansari
Name expansion (Level 2): P: A Ansari, P: A Ansari, P: C Chen, P: C Chen, P: L Li, P: L Li
Start with query name or record
Alternate between
1. Name expansion: For any relevant record, include other records with that name
2. Hyper-edge expansion: For any relevant record, include other related records
Terminate at some depth k
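The alternation between the two expansion steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: each record is a hypothetical `(paper_id, name)` pair, a paper plays the role of a hyper-edge, and name expansion uses exact string matching, whereas the poster's example also pulls in similar names such as ‘W W Wang’.

```python
from collections import defaultdict


def extract_relevant(records, query_name, depth):
    """Collect records reachable from query_name within `depth` expansions.

    Level 0 is the name expansion of the query itself; odd levels follow
    hyper-edges (co-occurrence in a paper), even levels follow names.
    """
    by_name = defaultdict(set)
    by_paper = defaultdict(set)
    for paper, name in records:
        by_name[name].add((paper, name))
        by_paper[paper].add((paper, name))

    relevant = set(by_name.get(query_name, set()))
    frontier = set(relevant)
    for level in range(1, depth + 1):
        found = set()
        for paper, name in frontier:
            if level % 2 == 1:
                found |= by_paper[paper]   # hyper-edge expansion
            else:
                found |= by_name[name]     # name expansion
        frontier = found - relevant        # only expand new records next
        relevant |= frontier
    return relevant


RECORDS = [
    ("P1", "W Wang"), ("P1", "C Chen"), ("P1", "A Ansari"),
    ("P2", "W Wang"), ("P2", "A Ansari"),
    ("P3", "L Li"), ("P3", "C Chen"), ("P3", "W Wang"),
    ("P4", "W W Wang"), ("P4", "A Ansari"),
]
level1 = extract_relevant(RECORDS, "W Wang", depth=1)
```

With exact matching, ‘W W Wang’ in P4 is only reached at depth 3, through the shared co-author name ‘A Ansari’; a similarity-based name expansion would include it at level 0.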
Adaptive Expansion for a Query
 Too many records with unconstrained expansion
o Adaptively select records based on ‘ambiguity’
o ‘Chen’ is more ambiguous than ‘Ansari’
 Adaptive Name Expansion
o Expand the more ambiguous records
 They need extra evidence
 Adaptive Hyper-edge expansion
o Add fewer ambiguous records
 They lead to imprecision
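One simple way to read the two adaptive rules is as filters on a per-name ambiguity score, sketched below in Python. The hard threshold, the function names and the `ambiguity` dictionary are assumptions for illustration; the poster only says that more ambiguous records are expanded by name and that fewer ambiguous records are added through hyper-edges.

```python
def adaptive_name_expansion(names, ambiguity, threshold):
    """Names worth expanding: ambiguous enough to need extra evidence."""
    return {n for n in names if ambiguity.get(n, 0.0) >= threshold}


def adaptive_hyperedge_expansion(names, ambiguity, threshold):
    """Related names worth adding: unambiguous enough to stay precise.

    A hard cut-off is a simplification of "add fewer ambiguous records".
    """
    return {n for n in names if ambiguity.get(n, 0.0) < threshold}


# Hypothetical scores: 'Chen' is more ambiguous than 'Ansari'.
ambiguity = {"C Chen": 0.8, "A Ansari": 0.1}
to_expand = adaptive_name_expansion({"C Chen", "A Ansari"}, ambiguity, 0.5)
to_add = adaptive_hyperedge_expansion({"C Chen", "A Ansari"}, ambiguity, 0.5)
```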
Unsupervised Estimation of Ambiguity
 Probability of multiple entities sharing an
attribute value
 Estimate ambiguity of one single-valued
attribute (A1=a) using another (A2)
o Count number of different values of A2 observed
for records having A1=a
o e.g. #different first initials for last-name ‘Smith’
 Estimate improves with more independent
attributes
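The counting step can be sketched in a few lines of Python. The record layout (dictionaries with hypothetical `last` and `initial` keys) is an assumption, and the function returns raw counts; turning them into a probability would need a normalization that the poster does not specify.

```python
from collections import defaultdict


def estimate_ambiguity(records, a1, a2):
    """For each value of attribute a1, count the distinct values of a2."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[a1]].add(rec[a2])
    return {value: len(others) for value, others in seen.items()}


# Hypothetical author references: last name plus first initial.
refs = [
    {"last": "Chen", "initial": "C"},
    {"last": "Chen", "initial": "Y"},
    {"last": "Chen", "initial": "C"},
    {"last": "Ansari", "initial": "A"},
]
counts = estimate_ambiguity(refs, "last", "initial")
```

Here ‘Chen’ is seen with two different first initials and ‘Ansari’ with one, so ‘Chen’ would be treated as the more ambiguous last name.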
Evaluation Datasets
 arXiv High Energy Physics
o 29,555 publications, 58,515 refs to 9,200 authors
o Queries: All ambiguous names (75 in total)
 True authors per name: 2 to 11 (avg. is 2.4)
 Elsevier BioBase
o 156,156 publications, 831,991 author refs
o Keywords, topic classifications, language, country
and affiliation of corresponding author, etc
o Queries: 100 most frequent names
 True authors per name: 1 to 100 (avg. is 32)
Growth Rate of Relevant Records
and Query Processing Time
[Figure: number of relevant references (in thousands) vs. expansion depth, for BioBase Sim, BioBase Exact and arXiv Exact]
Number of relevant references grows rapidly with expansion depth
[Figure: query processing time (CPU secs) vs. number of references (in thousands), for BioBase and arXiv]
RC-ER is fast but not good enough for query-time resolution
Query-time ER Results
                 arXiv F1   BioBase F1
A                0.721      0.701
A*               0.778      0.687
A+N              0.956      0.710
A+N*             0.952      0.753
RC-ER Depth 1    0.964      0.813
RC-ER Depth 3    0.970      0.821
Unconstrained expansion
o Collective resolution more accurate
o Accuracy improves beyond depth 1
A: pairwise attribute similarity; A+N: also neighbors’ attributes; *: transitive closure
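The transitive closure marked with * can be illustrated with a small union-find sketch in Python: if records r1 and r2 are matched and r2 and r3 are matched, all three end up in one cluster. The function name and the record identifiers are hypothetical, and this is only one common way to compute the closure, not necessarily how the baselines were implemented.

```python
def transitive_closure(items, matched_pairs):
    """Group items into clusters implied by pairwise match decisions."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


clusters = transitive_closure(["r1", "r2", "r3", "r4"],
                              [("r1", "r2"), ("r2", "r3")])
```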
Adaptive expansion
o Minimal loss in accuracy
o Dramatic reduction in query processing time
                 Unconstr   AX-2    AX-1
relv-set size    44,129     5,510   3,743
time (secs)      607        43      31
accuracy (F1)    0.821      0.820   0.818
AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1
Conclusions




 Query-centric entity resolution
 Cut-based evaluation of relational clustering
 Adaptive selection of relevant references for a query
 Resolution at query-time with minimal loss in accuracy
Future Directions
 Spectral algorithm for relational clustering
 Stronger coupling between extraction and resolution
 Localized resolution for incoming records
References
 "Query-Time Entity Resolution", Indrajit Bhattacharya, Louis
Licamele and Lise Getoor, ACM SIGKDD, 2006
 "A Latent Dirichlet Model for Unsupervised Entity Resolution",
Indrajit Bhattacharya and Lise Getoor, SIAM Data Mining, 2006
 "Entity Resolution in Graphs", Indrajit Bhattacharya and Lise
Getoor, Chapter in Mining Graph Data, Lawrence B. Holder and
Diane J. Cook, Editors, Wiley, 2006 (to appear).
 "Relational Clustering for Multi-type Entity Resolution", Indrajit
Bhattacharya and Lise Getoor, SIGKDD Workshop on Multi
Relational Data Mining (MRDM), 2005