Relational Clustering for Entity Resolution Queries
Indrajit Bhattacharya, Louis Licamele and Lise Getoor
University of Maryland, College Park

The Entity Resolution Problem
References:
  P1: “A mouse immunity model”, W.Wang, C.Chen, A.Ansari
  P2: “A better mouse immunity model”, W.Wang, A.Ansari
  P3: “Measuring protein-bound fluxetine”, L.Li, C.Chen, W.Wang
  P4: “Autoimmunity in biliary cirrhosis”, W.W.Wang, A.Ansari
Entities (figure labels): Abdulla Ansari, Chih Chen, WeiWei Wang, Wenyi Wang, Liyuan Li, Chien-Te Chen
Discover the domain entities
Map each reference to an entity

Query-time ER: Motivation
Most publicly available databases do not have resolved entities
  o PubMed, CiteSeer have many unresolved authors
Millions of queries every day require resolved entities directly or indirectly
  o “I am looking for all papers by Stuart Russell”
How do we address this problem?
  1. Leave the burden on the user to do the resolution
  2. Ask owners to ‘clean’ their databases
  3. Develop techniques for query-time resolution

Entity Resolution Queries
Disambiguation Query
  o Among all papers with ‘W Wang’ as author, find those written by WeiWei Wang
Resolution Query
  o Do disambiguation
  o Also retrieve papers by WeiWei Wang with a different author name, e.g. ‘W W Wang’
Example papers (listed in full above): P1, P2 and P3 list ‘W.Wang’ as an author; P4 lists ‘W.W.Wang’.

Query-time ER using Relations
1. Simple approach for resolving queries
  o Use attributes
  o Quick but not accurate
2. Use best techniques available
  o Collective resolution using relationships
  o How can we localize collective resolution?
Two-phase collective resolution for a query
  o Extract a minimal set of relevant records
  o Collective resolution on the extracted records

Cut-based Evaluation of Relational Clustering
• Vertices embedded in attribute space
• Additional (hyper)edges represent relationships
[Figure: two clusterings, each with clusters C1 to C4]
  o First clustering: good separation of attributes; many cluster-cluster relationships (C1-C3, C1-C4, C2-C4)
  o Second clustering: worse in terms of attributes; fewer cluster-cluster relationships (C1-C3, C2-C4)

A Cut-based Objective Function
The objective sums over pairs of clusters $(c_i, c_j)$, indexed by $i$ and $j$, and is built from these terms:
  o $w_A$: weight for attributes
  o $\mathrm{sim}_A(c_i, c_j)$: similarity of attributes
  o $w_R$: weight for relations
  o relational indicator on $(c_i, c_j)$: 1 iff a relational edge exists between $c_i$ and $c_j$
  o $f(c_i, c_j)$: compatibility of $c_i$ and $c_j$
Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function. The merge score for $(c_i, c_j)$ is built from $w_A\,\mathrm{sim}_A(c_i, c_j)$, $w_R$ applied to the common neighborhood of $N(c_i)$ and $N(c_j)$, and $f(c_i, c_j)$.
Similarity of attributes
• Jaro, Levenshtein; TF-IDF
Common cluster neighborhood
• Jaccard works better than intersection

Extracting Relevant Records
Start with the query name or record
Alternate between
  1. Name expansion: for any relevant record, include other records with that name
  2. Hyper-edge expansion: for any relevant record, include other related records
Terminate at some depth k
[Figure: expansion for query ‘W Wang’]
  o Level 0 (name expansion): P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang
  o Level 1 (hyper-edge expansion): P1: A Ansari, P1: C Chen, P2: A Ansari, P3: C Chen, P3: L Li, P4: A Ansari
  o Level 2 (name expansion): further records P: A Ansari, P: C Chen, P: L Li

Adaptive Expansion for a Query
Too many records with unconstrained expansion
  o Adaptively select records based on ‘ambiguity’
  o ‘Chen’ is more ambiguous than ‘Ansari’
Adaptive name expansion
  o Expand the more ambiguous records; they need extra evidence
Adaptive hyper-edge expansion
  o Add fewer ambiguous records; they lead to imprecision

Unsupervised Estimation of Ambiguity
Ambiguity: the probability of multiple entities sharing an attribute value
Estimate the ambiguity of one single-valued attribute (A1 = a) using another (A2)
  o Count the number of different values of A2 observed for records having A1 = a
  o e.g. the number of different first initials for last name ‘Smith’
The estimate improves with more independent attributes

Evaluation Datasets
arXiv High Energy Physics
  o 29,555 publications, 58,515 references to 9,200 authors
  o Queries: all ambiguous names (75 in total); true authors per name: 2 to 11 (avg. 2.4)
Elsevier BioBase
  o 156,156 publications, 831,991 author references
  o Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
  o Queries: 100 most frequent names; true authors per name: 1 to 100 (avg. 32)

Growth Rate of Relevant Records and Query Processing Time
[Figure: number of references (in thousands) vs. expansion depth (0 to 7); series: BioBase Sim, BioBase Exact, arXiv Exact]
Number of relevant references grows rapidly with expansion depth
[Figure: CPU seconds vs. number of references (in thousands); series: BioBase, arXiv]
RC-ER is fast but not good enough for query-time resolution

Query-time ER Results
Method           arXiv F1   BioBase F1
A                0.721      0.701
A*               0.778      0.687
A+N              0.956      0.710
A+N*             0.952      0.753
RC-ER Depth 1    0.964      0.813
RC-ER Depth 3    0.970      0.821
A: pairwise attribute similarity; A+N: also neighbors’ attributes; *: transitive closure
Unconstrained expansion
  o Collective resolution is more accurate
  o Accuracy improves beyond depth 1
Adaptive expansion
  o Minimal loss in accuracy
  o Dramatic reduction in query processing time
                 Unconstr   AX-2    AX-1
relv-set size    44,129     5,510   3,743
time (secs)      607        43      31
accuracy (F1)    0.821      0.820   0.818
AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1

Conclusions
Query-centric entity resolution
Cut-based evaluation of relational clustering
Adaptive selection of relevant references for a query
Resolution at query time with minimal loss in accuracy

Future Directions
Spectral algorithm for relational clustering
Stronger coupling between extraction and resolution
Localized resolution for incoming records

References
"Query-Time Entity Resolution", Indrajit Bhattacharya, Louis Licamele and Lise Getoor, ACM SIGKDD, 2006
"A Latent Dirichlet Model for Unsupervised Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIAM Data Mining, 2006
"Entity Resolution in Graphs", Indrajit Bhattacharya and Lise Getoor, chapter in Mining Graph Data, Lawrence B. Holder and Diane J. Cook, editors, Wiley, 2006 (to appear)
"Relational Clustering for Multi-type Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIGKDD Workshop on Multi Relational Data Mining (MRDM), 2005
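
Illustrative sketch: extracting relevant records and estimating ambiguity
The alternating expansion and the counting-based ambiguity estimate described in the slides can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation. The function names, the `(first initial, last name)` matching key, and the extra paper P5 are all assumptions introduced here; only P1 to P4 come from the slides. The sketch uses unconstrained expansion and does not attempt the adaptive variant.

```python
# Each reference is (paper_id, author_name). P1-P4 are the papers from the
# slides; P5 is a HYPOTHETICAL extra paper so deeper levels add something.
REFS = [
    ("P1", "W Wang"), ("P1", "C Chen"), ("P1", "A Ansari"),
    ("P2", "W Wang"), ("P2", "A Ansari"),
    ("P3", "L Li"), ("P3", "C Chen"), ("P3", "W Wang"),
    ("P4", "W W Wang"), ("P4", "A Ansari"),
    ("P5", "C Chen"), ("P5", "M Smith"),
]


def name_key(name):
    """Assumed matching key: (first initial, last name)."""
    parts = name.split()
    return parts[0][0], parts[-1]


def name_expansion(relevant, refs, query_key=None):
    """Add every reference whose key matches the query or a relevant reference."""
    keys = {name_key(name) for _, name in relevant}
    if query_key is not None:
        keys.add(query_key)
    return relevant | {r for r in refs if name_key(r[1]) in keys}


def hyperedge_expansion(relevant, refs):
    """Add every reference that shares a paper (hyper-edge) with a relevant one."""
    papers = {paper for paper, _ in relevant}
    return relevant | {r for r in refs if r[0] in papers}


def extract_relevant(query_name, refs, depth):
    """Level 0 is name expansion on the query; odd levels follow hyper-edges,
    even levels expand names again. Stops after `depth` levels."""
    relevant = name_expansion(set(), refs, query_key=name_key(query_name))
    for level in range(1, depth + 1):
        if level % 2 == 1:
            relevant = hyperedge_expansion(relevant, refs)
        else:
            relevant = name_expansion(relevant, refs)
    return relevant


def ambiguity(last_name, refs):
    """Count distinct first initials observed with a last name (A1 = last name,
    A2 = first initial). A higher count suggests a more ambiguous name."""
    return len({name_key(n)[0] for _, n in refs if name_key(n)[1] == last_name})


if __name__ == "__main__":
    for k in range(4):
        print(k, len(extract_relevant("W Wang", REFS, k)))
```

On this toy data the relevant set for query ‘W Wang’ holds 4, 10, 11 and 12 references at depths 0 to 3, which mirrors the growth with depth that the slides report. An adaptive variant could use a score such as `ambiguity` to decide which records to expand, but that selection rule is not specified in enough detail here to sketch.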