Query-Driven Approach
to Entity Resolution
Amir Ilkhechi, Stavros Sintos
Paper is by Hotham Altwaijry, Dmitri V. Kalashnikov, Sharad Mehrotra
Used the video presentation by Kalashnikov
Take away points
´ Query driven entity resolution
´ Significantly reduce the cleaning overhead by resolving records that influence
the query’s answer.
´ Notion of vestigiality.
´ Use vestigiality to reduce computation.
´ Different levels of clustering representation
´ Exact, Distinct, Representative
´ Using blocking with QDA have a large improvement in the number of
resolves.
Overview:
´ Motivation
´ High-level introduction to the problem and solutions
´ Formal Problem Definition
´ Algorithms (detailed explanation of the solutions)
´ Experiments
Motivation:
´ More than 80% of data mining researchers spend >40% of their project time
on cleaning and preparation of data.
´ Analysis on bad data can lead to incorrect results:
´ Fix errors before analysis (common approach)
´ Account for them during the analysis
Introduction
´ ER phases:
´ Blocking
´ Similarity Computation
´ Clustering
R1
R6
R3
R4
R2
R5
R1
R4
R3
R6
R2
R5
Query Driven Approach
´ Traditional ETL Process and ER:
´ Extract data from static data resources
´ Transform (and clean)
´ Load
´ Save as Data Warehouse
´ Queries are on the warehouse
´ Query-driven ER:
´ Query on the raw data
´ Extract data (based on the query)
´ Do the necessary cleaning
Why query-driven
´ Big data
´ Queries on online data
´ Know what to clean only at the query time
´ Small organizations with large data sets and limited computational
resources
´ Suppose that only a small portion of the data needs to be analyzed
Running Example
Dirty relation(R)
Pid
Running Example
Cluster
A
1⊕7
B
2⊕3⊕4
C
5⊕6
Ground Truth
Record with id = 7 not included
Returned Answer Semantics:
´ Answer returned by first cleaning and then querying:
´ Let’s call it Q*
´ Exact Semantics:
´ It matches both in terms of clusters and their representations with Q*
´ E.g., {p1⊕p7,p2⊕p3⊕p4}
´ Distinct Semantics
´ It matches in terms of clusters with Q*, but representations might be different
´ E.g., {p1⊕p7,p2⊕p3}
´ Representative Semantics
´ It might include duplicates but it matches Q* in terms of clusters.
´ E.g., {p1,p7, p2⊕p3}
Example Query
´ Select * From R Where cited >= 45
Dirty relation(R)
Standard solution:
´ Step 1: De-duplicate the dirty relation thoroughly
´ Perhaps by calling the resolve function on all pairs of records
´ Step 2: Compute the query result over the acquired clean relation
´ Issues:
´ Large number of calls to resolve function
´ Resolve itself is generally expensive
Resolve Function
´ Resolve is a pairwise function R(r_i,r_j)
Merge Function
´ It consolidates two duplicate records to produce a new record.
´ Combine (⊕) function is used to combine each attribute
45
ADD semantics: v_i ⊕ v_j = v_i + v_j
25
MAX semantics: v_i ⊕ v_j = max(v_i, v_j)
20
MIN semantics: v_i ⊕ v_j = min(v_i, v_j)
{Alon Halevy, A. Y.Halevy}
UNION semantics: v_i ⊕ v_j = v_i ∪ v_j
{Alon Halevy}
EXEMPLAR semantics: v_i ⊕ v_j = choose v_i or v_j
Exploiting query semantics
´ Query: Select * From R Where cited >= 30
´ Assume ADD semantics
´ We can prune the resolves
Call R(p_2, p_3) = MustMerge
answer 1 = {1}
answer 2 = {1, 2⊕3}
Call R(p_4, p_5) = MustSeperate
Gains compared to other approaches:
Algorithm
# of resolve calls
Query-driven algorithm
2
Standard Transitivity Closure Algorithm
11
Correlation Clustering Algorithm
15
Formal Problem Definition
´ Notations:
´ The following represents a relation in the database
´ The set of attributes are given as
´ The k’th record is represented as
´ The goal of traditional ER:
´ Partition records in R into a set of non-overlapping clusters
´ 2 records from the same cluster correspond to the same entity
´ 2 records from distinct clusters correspond to different entities
´ Queries:
´
Problem: Optimization
´ Given: Query Q
´ Minimize: # of calls to resolve function
´ Subject to:
´ Query satisfaction:
´ Each cluster returned by QDA must satisfy Q.
´ User-defined equivalence:
´ The answer generated by QDA must be (exactly, distinctly, or representatively)
equivalent to that of TC
Vestigiality
Vestigiality
´ Categorize triples (𝑝,⊕, 𝑎& )
´ Deal with multi-predicate selection queries
´ Construction of the labeled graph
´ Use relevant clique, minimal clique to test for vestigiality.
Triple (𝑝,⊕, 𝑎& ) Categorization
´ 𝑝: Query predicate
´ ⊕: Combine function
´ 𝑎& : Attribute
´ Help to significantly reduce the cleaning overhead by resolving only those
edges that may influence the answer of 𝑄.
Triple (𝑝,⊕, 𝑎& ) Categorization
´ (cited>=45, ADD, cited): in-preserving
´ (cited<=45, ADD, cited): not in-preserving
Triple (𝑝,⊕, 𝑎& ) Categorization
´ (cited<=45, ADD, cited): out-preserving
Multi-Predicate Selection Queries
´ More complex selection queries (AND, OR, NOT).
Multi-Predicate Selection Queries
´ 𝑝) : 𝑐𝑖𝑡𝑒𝑑 ≥ 45
´ 𝑝3 : 𝑐𝑖𝑡𝑒𝑑 ≤ 65
´ 𝜏) = (𝑐𝑖𝑡𝑒𝑑 ≥ 45, 𝐴𝐷𝐷, 𝑐𝑖𝑡𝑒𝑑): in-preserving
´ 𝜏3 = (𝑐𝑖𝑡𝑒𝑑 ≤ 65, 𝐴𝐷𝐷, 𝑐𝑖𝑡𝑒𝑑): out-preserving
´ 𝜏) ∩ 𝜏3 : neither
´ Why?
´ 𝑟< → 𝑐𝑖𝑡𝑒𝑑 = 10
´ 𝑟A → 𝑐𝑖𝑡𝑒𝑑 = 40
´ 𝑟B → 𝑐𝑖𝑡𝑒𝑑 = 20
´ 𝑟< ⊕ 𝑟A → 𝑐𝑖𝑡𝑒𝑑 = 50 (IN the answer)
´ 𝑟< ⊕ 𝑟A ⊕ 𝑟B → 𝑐𝑖𝑡𝑒𝑑 = 70 (OUT the answer)
Creating and Labeling the Graph
´ Build and label the Graph.
´ Avoid creating as many nodes and edges as possible.
´ Blocking to reduce the edges in the graph.
´ Remove from consideration nodes and edges that will not influence further
processing of 𝑄.
Create and label the nodes
Create and label the edges
´ Add the edge eFG = 𝑣< , 𝑣A
´ Nodes 𝑣< , 𝑣A belong in the same block.
´ 𝑣< , 𝑣A ∈ 𝑉KLMNO
´ (𝑣< ∈ 𝑉PQR AND 𝑣A ∈ 𝑉KLMNO ) OR (𝑣A ∈ 𝑉PQR AND 𝑣< ∈ 𝑉KLMNO )
Create and label the edges
´ For efficiency:
´ Label no edge -> remove the edge from the graph.
´ Label yes edge (𝑣< , 𝑣A ) -> Merge nodes 𝑣< , 𝑣A .
Create and label the edges
´ For efficiency:
´ Label no edge -> remove the edge from the graph.
´ Label yes edge (𝑣< , 𝑣A ) -> Merge nodes 𝑣< , 𝑣A .
𝐴TQU is the answer result
assuming all vestigial
and unresolved edges
are NO edges.
Vestigial Edges
´ Identify Vestigial Edges?
Vestigiality Testing using Cliques.
´ Clique:
´ Given 𝐺(𝑉, 𝐸) a set 𝑆 ⊆ 𝑉 is a clique if for every 𝑣< , 𝑣A ∈ 𝑆, 𝑣< , 𝑣A ∈ 𝐸.
´ If a group of nodes is not a clique, that group corresponds to at least two
distinct entities.
Relevant Clique
Relevant Clique
´ Resolve 𝑒3Z
´ 𝑝 = (𝑐𝑖𝑡𝑒𝑑 ≥ 45)
´ 𝐴TQU = {𝑝) , 𝑝3 ⨁𝑝Z }
´ All edges incident to 𝑝) , 𝑝3 ⨁𝑝Z are vestigial.
´ Nodes 4, 5, 6 form a clique S.
´ Sum up of the cited attribute of S -> 5 + 10 + 15 = 30 ≯ 45.
´ Merge nodes in S cannot change 𝐴TQU
´ Edges 𝑒ab , 𝑒ac , 𝑒bc are not part of any relevant clique, so vestigial edges!
Minimal Clique
Minimal Clique
´ Resolve 𝑒ab
´ Triple (𝑐𝑖𝑡𝑒𝑑 ≥ 45, 𝐴𝐷𝐷, 𝑐𝑖𝑡𝑒𝑑) is in-preserving
´ 𝑆 = 𝑝3 , 𝑝Z , 𝑝a is a relevant clique (25 + 20 + 15 = 60 ≥ 45)
´ 𝑆K = 𝑝3 , 𝑝Z ⊂ 𝑆 forms a minimal clique(25 + 20 = 45 ≥ 45)
´ Edges 𝑒3a , 𝑒Za are vestigial since both 𝑒3a , 𝑒Za do not belong to any minimal
clique.
IS_VESTIGIAL()
´ NP-hard to test for Vestigiality.
´ Straightforward Reduction from the 𝑘-Clique problem.
´ Worse than 𝑂 𝑛3 calls of the resolve function.
´ Challenge: Design QDA that performs vestigiality testing efficiently.
Query-Driven Solution
Overview
´ The answer of QDA equivalent to first applying TC on the whole dataset
and then querying the cleaned dataset.
´ TC Method
´ Choose a pair of nodes to resolve
´ Apply the resolve function
´ Merge nodes if resolve function returns positive answer
´ QDA similar to TC
´ Uses its own pair-picking strategy.
´ Instead of calling the resolve function, is it a vestigial edge?
QDA Approach
´ Create and label the graph
´ Choose an edge to resolve
´ Not the main focus of the paper
´ Quickly add some cluster-representatives or break many relevant cliques.
´ Pick edges according to a weight 𝑤<A = 𝑣<& ⊕ 𝑣A&
´ Lazy edge removal
´ Implemented many optimizations.
´ Check if an edge exists.
´ After merging two nodes only common edges remain
´ 𝑂( 𝑅 ) in the worst case!
´ Do not remove edges at the time of merge – Check if a node has been merged
with another node. 𝑂(1) time!
´ Vestigiality testing (next)
´ Stopping condition
´ If there exists an edge that is neither resolved nor vestigial.
´ Compute the answer (later)
Vestigiality Testing
´ Given an edge 𝑒<A decide if it is vestigial.
´ Recall, NP-hard
´ Use an inexact but fast check if 𝑒<A is potentially part of any relevant clique.
Edge Miniclique Check Optimization
´ An Optimization
´ If (𝑝,⊕, 𝑎& ) is in-preserving and 𝑒<A can change the current answer to Q
´ 𝑒<A is not vestigial and the algorithm calls the resolve function.
Check for Potential Clique
´ CHECK-POTENTIAL-CLIQUE(): Test if 𝑒<A can potentially be part of any
relevant clique.
´ If yes, call the resolve function, otherwise it marks as vestigial.
CHECK-POTENTIAL-CLIQUE()
´ Quickly check if 𝑒<A is involved in any relevant/minimal clique.
´ Safe approximation function:
´ False only when 𝑒<A is not part of any relevant/minimal clique
´ True when 𝑒<A might be part of some relevant clique.
CHECK-POTENTIAL-CLIQUE()
´ Merge nodes 𝑣< , 𝑣A and check if their merge change Q’s answer.
´ If not, find all common neighbors of 𝑣< , 𝑣A and tries to find the smallest
potential clique which might change Q’s answer.
´ Keep expanding the size of such clique until no common neighbors left.
CHECK-POTENTIAL-CLIQUE()
𝑣)
𝑣<
𝑣)
𝑣Z
𝑣3
𝑣)
𝑣Z
𝑣)
𝑣Z
𝑣3
𝑣3
𝑣<A)3
𝑣<A
𝑣A
𝑣3
𝑣Z
𝑣3
𝑣<A)
𝑣<AZ
𝑣<
𝑣A
𝑣<A
Computing Answer of Given Semantics
´ Compute the final answer 𝐴TQU to query Q based on the answer semantics
the user requested.
Representative answer semantics
´ Add nodes from 𝑉KLMNO which satisfy Q to 𝐴TQU .
´ At this point, 𝐴TQU satisfies representative answer semantics.
Distinct answer semantics
´ Clean the representative answers in the current 𝐴TQU using the original TC
algorithm.
´ Remove duplicates by resolving all pairs of nodes in 𝐴TQU .
´ Additional cost: 𝑂( 𝐴TQU 3 )
Exact answer semantics
´ Compare all nodes in 𝐴TQU with all nodes in 𝐴TQU ∪ 𝑉KLMNO
´ Extra cost: 𝑂( 𝐴TQU ⋅ 𝑅 )
Exact answer semantics
´ Compare all nodes in 𝐴TQU with all nodes in 𝐴TQU ∪ 𝑉KLMNO
´ Extra cost: 𝑂( 𝐴TQU ⋅ 𝑅 )
To produce distinct or
exact answer, vestigial
edges are considered
unresolved.
Answer Correctness
´ Naturally, resolve functions are not always accurate -> no ER technique
can guarantee correctness.
´ They do not assume that resolve is always accurate.
Experimental Evaluation
Experimental Results
´ Dataset:
´ Bibliographical entries from Google Scholar
´ Contains 50 researchers
´ 16396 records, 14.3% of which are duplicates
´ Resolve Function
´ Soft-TF-IDF on titles
´ Jaro-Winkler distance on author names
´ Blocking Techniques:
´ Bucketize records that might be duplicates
´ 1: First two letters of titles as the hash key
´ 2: Last two letters of titles as the hash key
QDA vs. TC (Transitive Closure)
´ Query: Select * From R Where cited >= t
Very similar: means bottleneck!
QDA vs. TC [Answer Semantics]
QDA speed-up for different types of
queries
The effect of resolve function
Combined with blocking…
Effect of Edge Picking
Thank you!
谢谢 !
آپ ﮐﺎ ﺷﮑرﯾہ
Ευχαριστώ!
Teşekkürler!
© Copyright 2026 Paperzz