String Similarity Measures and Joins with Synonyms
Jiaheng Lu † , Chunbin Lin † , Wei Wang ‡ , Chen Li # , Haiyong Wang †
† School of Information and DEKE, MOE, Renmin University of China; Beijing, China
‡ University of New South Wales,Australia, # University of California, Irvine, CA
† {jiahenglu,chunbinlin,why}@ruc.edu.cn, ‡ [email protected], # [email protected]
ABSTRACT
A string similarity measure quantifies the similarity between two
text strings for approximate string matching or comparison. For
example, the strings “Sam” and “Samuel” can be considered to be
similar. Most existing work that computes the similarity of two
strings only considers syntactic similarities, e.g., number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example,
“Bill” is a short form of “William”. Given a collection of predefined synonyms, the purpose of the paper is to explore such existing knowledge to evaluate string similarity measures efficiently and
perform joins, thereby boosting the quality of string matching.
In particular, we first present an expansion-based framework to
measure string similarities efficiently while considering synonyms.
Because using synonyms in similarity measures is, while expressive, computationally expensive (NP-hard), we propose an efficient
algorithm, called selective-expansion, which guarantees the optimality in most real scenarios. We then study a novel index structure called SI-tree for the purpose of string similarity joins with
synonyms, combining signature filtering and length filtering strategies. We develop an estimator to approximate the size of candidates
to enable an online selection of signature filters to improve the efficiency. This estimator provides strong low-error, high-confidence
guarantees while requiring only logarithmic space and time costs,
thus making our method attractive both in theory and in practice.
Finally, the results from an empirical study of the algorithms verify
the effectiveness and efficiency of our approach.
1.
INTRODUCTION
Approximate string matching is used heavily in data integration
([2, 4]), data deduplication ([10]) and data mining ([11]). Most
existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or qgrams. Although this is indeed an indicator of similarity, there are
many important cases where syntactically different strings can represent the same real-world object. For example, “Bill” is a short
form of “William”, “Big Apple” is a synonym of “New York”, and
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’13, June 22–27, 2013, New York, New York, USA.
Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00.
“Database Management Systems” can be abbreviated as “DBMS”.
This equivalence information can help us identify semantically similar strings that may have been missed by syntactical similaritybased approaches.
In this paper, we study two problems related to approximate
string matching with synonyms1 . First, how to define an effective
similarity measure between two strings based on the synonyms?
The traditional similarity functions cannot be easily extended to
handle the synonyms in similarity matching. Second, how to efficiently perform similarity join based on newly defined similarity
functions? That is, given two collections of strings, how to efficiently find similar string pairs from the collections? The two
problems are closely related, because the similarity measures defined in the first problem will naturally provide a join predicate to
the second.
We give several applications about approximate string measures
and joins with synonyms to shed light on the importance of these
two problems. In a gene/protein database, one of the major obstacles that hinder the effective use is term variation [27], including
Roman-Arabic (e.g. “Synapsin 3” and “Synapsin III”), acronym
variation (e.g. “IL-2” and “interleukin-2”), and term abbreviation
(e.g. “Ah receptor” and “Ah dioxin receptor”). A new similarity measure that can handle term variation may help users discover more genes (proteins) of interests. New measures are also
applicable in information extraction systems to aid mapping from
text strings to canonical entries by similarity matching. More applications can be found in areas such as term classification and
term clustering, which can be used for advanced information retrieval/extraction applications. String approximate join is typically
useful in data integration and data cleaning. For example, in two
representative data cleaning domains, namely addresses and academic publications, there often exist rich sources of synonyms,
which can be leveraged to improve the effectiveness of systems
significantly.
For the first problem of similarity measures, there is a wealth of
research on string-similarity functions on the syntactical level, such
as Levenshtein distance [28], Hamming distance [18], Episode distance [12], Cosine metric [25], Jaccard Coefficient [10, 21], and
Dice similarity [6]. Unfortunately, none of them can be trivially
extended to leverage synonyms to improve the quality of similarity measurement. One exception is the JaccT technique proposed
by Arasu et al. [2], which generates all possible strings by applying synonym rewriting to both strings, and returning the maximum
similarity among the transformed pairs of strings, as illustrated by
the following example.
1
For brevity, we use “synonym” to describe any kind of equivalent pairs
which may include synonym, acronym, abbreviation, variation and other
equivalent expressions.
ID Signature
q1 Conf,2012
q2 VLDB,Intl
q3 Conf
q4 UK,Conf
q5 2012
q6 Intl,2012
q7 Conf,Intl
L
8
1
1
5
5
8
1
F
8
5
4
7
6
8
4
root
Conf Intl UK VLDB 2012
q1 q2 q4 q2 q1
q5
q3 q6
q6
q4 q7
q7
1 5
5 7
E1
4
5
6
7
8
q3 q2 q5 q4 q1
q6
q7
root
F2
5 7
F1
1 5
8 8
4
5
E2
E3
6
F3
8 8
7
E4
8
E5
Conf Intl Intl VLDB 2012 Conf UK Conf Intl 2012
q3 q7
q2 q2
q5
q4 q4
q1 q6 q1
q6
q7
Similarity
Algorithm
Complexity
Effectiveness
Comment
(a) Example
data from
Figure 1 Computational
(c) An S-Direcotry
(b) An I-List
(d) An SI-Index
Jacc
O(L)
Low
Does not consider synonym pairs.
If only single token synonym pairs are allowed (implying k = 1), an optimal algorithm
JaccT [2]
O(2n(L+kn))
High
based on maximum bipartite matching can be used with a complexity of O(L2 + Ln2).
Conf Intl UK VLDB 2012
Full Expansion (ours)
O(L+kn)
High
Most efficient and may have good quality.
1 4 2 5 48 17 48 2 5SE
3 48
1 7is2efficient
and has good quality, it is optimal under the conditions of Theorem 2,
2
SE Algorithm (ours)
O(L+kn )
High
and the complexity can be reduced to O(L+nlogn+kn).
4 5 8 4 7 8 5 7 3 8
Figure 2: Comparison of String Similarity Algorithms
with
(Assume both strings have a length of L; The maximum
q4 q2synonyms
q5 q1
q3 q4 q1 q7 q2 q6
q7
number of applicable synonyms is n; k is the maximum
size of the q6
right hand side of a rule.)
ID
q1
q2
...
Name
Intl WH Conf 2012
VLDB
...
(a)Dataset
Table Q
1000
ID
Synonym pairs
USPS
70.0%
WH Wireless Health
r1
CONF
62.4%
Intl
International
r3
Wireless
Health 84.4%
WH
r5
SPROT
...
...
r7
r8
Name
VLDB
PVLDB
(d) An IS-Index
ID
Very Large Scale Integeration
t1
t2 Intl Wireless Health Conference 2012 UK
...
...
(b)
Table T 7000
3000
5000
ID
Synonym pairs
71.2% Conference
72.8% Conf
72.2%
r2
67.6%
71.2%
UK68.9%
United Kingdom
r4
Conf
Conference
r6
86.7%
88.4%
...
...89.8%
(c) Transformation synonym pairs
Figure 1: An example to illustrate similarity matching
synonyms
International Conference on Very Large Databases
Proceedings of the VLDB Endowment
less to perform expanding operation than transforming operation,
as illustrated below.
9000
70.8%
70.4%
87.3%
with
E XAMPLE 2. (Continue the previous example.) We apply all
the applicable synonym pairs to q1 and t2 to obtain the “closure” of
their tokens, and then evaluate the Jaccard similarity between the
two expanded token sets. For instance, the fully expanded set of t2
by applying synonym pairs r2 , r3 , r4 , and r5 is {International
Intl, WH, Conf, 2012, Wireless, Health,Conference, United,
Kingdom, UK}. The Jaccard similarity between two fully expanded
8
, which is also greater than the original similarity
token sets is 11
2
without
synonyms.
8
We note that the full expansion-based similarity illustrated in the
E XAMPLE 1. Figure 1 shows an example of two tables Q and
above example is efficient to compute, and is also highly competT and a set of synonyms. Consider two strings q1 and t2 . Their
itive in terms of effectiveness in modeling the semantic similartoken-based Jaccard similarity is only 28 . Now consider applying
ity between strings. Nonetheless, using all the synonym pairs to
synonym pairs on them. Obviously, synonyms r1 , r3 and r6 can
perform a full expansion may hurt the final similarities. Consider
be applied to q1 , and r2 , r3 , r4 and r5 can be applied to t2 . We
Example 2 again, where the use of rule r4 on t2 is detrimental to
use AR (s) to denote the string generated by applying a set of synthe similarity between q1 and t2 , as none of the expanded tokens
Similarity Algorithm Computational Complexity Effectiveness Comment
onym
the sourceLowstring
s.consider
Forsynonym
example,
A{r1 ,r3 } (q1 ) is
appears in q1 or its expanded token set. Therefore, we propose a
Jaccpairs R to O(L)
Does not
pairs.
If only single token synonym pairs are allowed (implying k = 1), an optimal algorithm
O(2 (L+kn)) Wireless
High
“International
Conf
2012”.
based onHealth
maximum bipartite matching
can be used
with a complexity of O(L + Ln ).
selective expansion measure, which selects only “good” synonym
Full Expansion (ours)
O(L+kn)
High
Most efficient and may have good quality.
SEstrings
is efficient and has
good quality, it is optimal
the conditions of Theorem
[2] considersO(L+n
all+kn)the possible
resulting
fromunderapplying
a 3, pairs to expand both strings (Section 3), and accordingly the simiSE Algorithm (ours)
High
and the complexity can be reduced to O(L + n log n).
subset of applicable synonyms on them, and then return the maxilarity is defined as the maximum similarity between two approprimum Jaccard similarity. In
theVLDB
above example,
it Conference
needs to on
consider
ately expanded token sets of two strings. Unfortunately, selective
International
Very
Large
Databases
r
7
23 · 24 = 128 possible string
maximumofJaccard
is
PVLDBThe Proceedings
the VLDBvalue
Endowment
r8 pairs.
expansion is proved to be NP-hard by a reduction from the 3SAT
between A{r1 ,r3 ,r6 } (q1 ) and A{r3 } (t2 ), which is 56 .
problem. To provide an efficient solution, we propose a greedy algorithm, called SE algorithm, which is optimal if the rhs tokens of
The main limitation in the above approach is that its string simthe useful synonyms satisfy the distinct property (the precise definiilarity definition has to generate all possible strings and compare
tion will be described in Section 3.2). We empirically show that the
their similarity for each pair in the cross product. In general, given
condition is likely to be true in practice — the percentage of cases
two strings s and t, assuming there are n1 (resp. n2 ) synonym pairs
that such optimality hold in our experiments with three real-world
applied on s (resp. t), then the above approach needs to compute
data sets ranges between 70.0% and 89.8%. Figure 2 summarizes
the similarities for 2n1 +n2 pairs. Such a brute-force approach takes
the results of the two new similarity algorithms, contrasted with
time exponential in the number of applicable synonym pairs. A
results of existing methods [2].
special case of the problem for single-token synonyms is identified
Given the newly proposed similarity measures, it is imperative
in [2] where the problem can be reduced to a maximum bipartite
to study an efficient similarity join algorithm. In this paper, we first
graph matching-based problem. Unfortunately, in practice, many
show a signature-based join algorithm, called SN-Join, by followsynonyms contain multiple tokens, and even in the special case, it
ing the filter-and-verification framework (Section 4). It generates
still has a worst-case complexity that is quadratic in both the numa set of tokens as signatures for each string using synonyms, finds
ber of string tokens and the number of applicable synonyms. Since
candidate pairs that overlaps in the signatures, and finally verifies
such similarity measure has to be called for every candidate pairs
against the threshold using the above proposed similarity measure.
when employed in similarity joins, it introduces substantial overIn addition, we further propose the SI-Join, which speed up the perhead.
formance by using a native index, SI-tree, to reduce the candidate
In this paper we propose a novel expansion-based framework to
size with integrated signature and length filtering. Both our join almeasure the similarity of strings. Intuitively, given a string s, we
gorithms can work with signatures generated by prefix filtering or
first obtain its token set S, and then grow the token set by applying
locality sensitive hashing (LSH).
the applicable synonyms of s. When a synonym pair lhs → rhs
The methods for selecting signatures of strings may significantly
is applied, we merge all the tokens in rhs to the token set, i.e.,
affect the performance of algorithms [21, 24]. Therefore, we proS = S ∪ rhs. The similarity between two strings are the similarity
pose, for the first time (to our best knowledge), an online algobetween their expanded token sets. We have two key ideas here:
rithm to dynamically estimate the filtering power of a signature fil(i) We can avoid the exponential computational complexity by exter. In particular, we generate multiple signature filters using difpanding strings using all the applicable synonym pairs; (ii) it costs
ferent strategies offline. Then given two tables to join online, we
JaccT [1]
n
2
2
2
quickly estimate the performance of different filters and select the
best one to perform the actual filtering. Our technical contribution here is to propose a novel, hash-based synopsis data structure,
termed “Two Dimensional Hash Sketch (2DHS)”, by extending the
Flajolet-Martin sketch [14] on two join tables. We present novel estimation algorithms to provide high-confidence estimates to compare the filtering power of two filters. Our method carries both
theoretical and practical significances. In theory, we can prove that,
with any given high probability 1-δ, 2DHS returns correct estimates
on upper and lower bounds, and its space and time complexities are
logarithmic in the cardinality of input size. In practice, the experiments show that our estimation technique accurately predicts the
quality of different filters in all three data sets and its running time
is negligible compared to that of the actual filtering phase (accounting for only 1% of the total join time).
Finally, we perform a comprehensive set of experiments on three
data sets to demonstrate the superior efficiency and effectiveness of
our algorithms (Section 5). The results show that our algorithms
are up to two orders of magnitude more efficient than the existing
methods while offering comparable or better effectiveness.
2.
RELATED WORK
String similarity measures: There is a rich set of string similarity measures including character n-gram similarity [18], Levenshtein Distance [30, 28], Jaro-Winkler measure [29], Jaccard similarity [10], tf-idf based cosine similarity [25], and Hidden Markov
Model-based measure [23]. A comparison of many string similarity functions is performed in [12]. These metrics can only measure
syntactic similarity (or distance) between two strings, but cannot
capture semantic similarities such as synonyms, acronyms, and abbreviations.
Machine learning-based methods [7, 27] have been proposed
to improve the quality of traditional syntactic similarity-based approaches. In particular, learnable string similarity [7] presents trainable measures to improve the effectiveness of duplicate detection,
and [27] uses logistic regression to learn a string similarity measure from an existing gene/protein name dictionary. Although these
methods are effective in capturing semantic similarities between
strings, they are limited to special domains and its accuracy depends critically on the quality of the training data.
String similarity join: The state-of-the-art approaches [6, 10,
26, 22, 28] for performing string similarity joins mainly follow the
filtering-and-verification framework. The basic idea is to first employ an efficient filter to prune string pairs that cannot be similar,
and then verify the survived string pairs by computing their real
similarity. In the filtering step, there are two methods to generate signatures for each string: (i) Prefix filtering scheme ([6, 10,
28, 30]) first transforms each string to a set of tokens. Then the
tokens of each string are sorted based on a global ordering, and
a prefix set for each string is selected based on a given similarity
threshold to be signatures. The signatures can be used to guarantee that if two strings are similar, then their signatures have enough
overlap; (ii) LSH-based scheme ([9, 16, 31]) generates the signatures based on the idea of locality-sensitive hashing (LSH) [8], that
is each signature is a concatenation of k minhashes of the set s
under a minwise independent permutation, and l such signatures
are generated. k and l are two user-defined parameters. The main
limitation of the LSH approach is that the algorithms are approximate and could miss some correct results. In this paper, we will
extend these two signature schemes to perform efficient similarity
join with synonyms.
Synonyms generation approaches: Synonym pairs can be obtained in many ways, such as users’ existing dictionaries [27], explicit inputs, and synonyms mining algorithms [3]. In this paper,
we assume that all synonym pairs are predefined, and our focus is
how to effectively use the semantic information to provide meaningful similarity measures and efficient join algorithms.
Estimation of set size: Our work is also related to the estimation of distinct pairs in set-union. This is because, in the filtering
phase, we need to estimate the number of candidates (string pairs)
to dynamically select a signature scheme. There are many solutions
proposed for the estimation of the cardinality of distinct elements.
For example, in their influential paper, Flajolet and Martin [14] proposed a randomized estimator for distinct-element counting that relies on a hash-based synopsis; to this date, the Flajolet-Martin (FM)
technique remains one of the most effective approaches for this estimation problem. FM sketch and its variation (e.g., [1, 15]) focus
on one dimensional data. In our problem, we need to estimate the
distinct number of pairs, where each dimension is based on an independent FM sketch from one join collection. It turns out that
this problem is more complex and the existing analysis cannot be
extended easily.
Finally, the estimation of the result size of a string similarity
join has been studied recently, e.g., in [19, 20], by using random
sampling and LSH scheme. Note that these techniques cannot be
directly applied here, since given a specific filter, our goal is to
estimate the cardinality of candidates, which is often much greater
than that of final results.
3.
STRING SIMILARITY MEASURES
As described in Section 1, an expansion-based framework can efficiently measure the similarity of strings with synonyms to avoid
the exponential computational complexity. In this section, we describe two expansion-based measures in detail. In particular, Section 3.1 describes an efficient full-expansion measure, and Section 3.2 shows a selective expansion to achieve better effectiveness.
Finally, Section 3.3 analyzes the optimality of the methods.
3.1
Full Expansion
We first introduce notations. Given two strings s1 and s2 , let S1
and S2 denote the sets generated from s1 and s2 respectively by tokenization (e.g., splitting strings based on delimiters such as white
spaces). Let r denote a rule, which has the form: lhs(r) → rhs(r),
where lhs(r) and rhs(r) are the left and right hand sides of r, respectively. Slightly abusing the notations, lhs(r) and rhs(r) can
also refer to the set generated from the respective strings. Let R denote a collection of synonym rules, i.e., R = {r: lhs(r) → rhs(r)}.
Given a string s, we say a rule r ∈ R is an applicable rule of s if
lhs(r) is a substring of s. We use rhs(r) to expand the set S,
0
that is, S = S ∪ rhs(r). Let sim(s1 , s2 , R) denote the similarity
between s1 and s2 based on R. Let R(s1 ) and R(s2 ) denote the
applicable rule set of s1 and s2 , respectively. Under the expansion0
0
0
0
based framework, sim(s1 , s2 , R)=f (S1 , S2 ), where S1 and S2 are
expanded from S1 and S2 using some synonym rules in R(s1 ) and
R(s2 ), and f is a set-similarity function such as Jaccard, Cosine,
etc.
Based on how applicable synonyms are selected to expand S,
there are different kinds of schemes. A simple one is the full expansion, which expands S using all their applicable synonyms. That
S
0
is, the expanded set S = S
rhs(r).
r∈R(s)
The full expansion scheme is efficient to compute. Let L be the
maximum length of strings s1 and s2 . The time cost of doing sub-
string matching to identify the applicable synonym pairs is O(L)
using a suffix-trie-based method [13]. Then the complexity of the
set expansion is O(L + kn), where n is the maximum number of
applicable synonym pairs for s1 and s2 , and k is the maximum
size of the right-hand side of a rule. As we will demonstrate in our
experimental section, the full expansion provides a highly competitive performance with other optimized scheme to capture semantics
of strings using synonyms.
3.2
Selective Expansion
We observe that, as mentioned in Section 1, the full expansion
cannot guarantee that each applied rule is useful for similarity measure between two strings. Recall Example 2 and Figure 1. The use
of rule r4 on t2 is detrimental to the similarity between q1 and t2 ,
as “United Kingdom” does not appear in q1 . To address this
issue, we propose an optimized scheme, called selective expansion,
whose goal is to maximize the similarity value of two expanded sets
by a judicious selection of applicable synonyms to apply. Given a
collection of synonym rules R and two strings s1 and s2 , the se0
0
lective expansion scheme maximizes f (S1 , S2 ). Suppose, without
loss of generality, we use Jaccard Coefficient to measure the simi0
0
larity between S1 and S2 , which can be extended to other functions.
The Selective-Jaccard similarity of s1 and s2 based on R (denoted
by SJ(s1 , s2 , R)) is defined as follows:
SJ(s1 , s2 , R) =
0
max
0
0
{Jaccard(S1 , S2 )}.
R(s1 )⊆R,R(s2 )⊆R
0
where S1 and S2 are the sets expanded from s1 and s2 using R(s1 )
and R(s2 ), respectively.
Unfortunately, we can prove that it is NP-hard to compute the
Selective-Jaccard similarity with respect to the number of applicable synonym rules, by a reduction from the well-known 3SAT
problem [17]. Due to space limitation, we leave the detailed proof
in the full version. Since Selective-Jaccard is NP-hard, we design
an efficient greedy algorithm that guarantees the optimality under
certain conditions, which has been shown to be met in most cases
in practice. Before presenting the algorithm, we will first define the
notion rule gain, which can be used to estimate the fitness of a rule
to strings.
is zero (lines 2 ∼ 3). line 4 fully expands s2 and line 5 computes the
useful elements G. Finally, the rule gain RG(r) = |G|
is returned
|U |
(line 6).
We develop a “selective expand” algorithm based on rule gains,
which is formally described in Algorithm 2. Before doing any expansion, we first compute a candidate set of rules (line 1), which
only consists of applicable rules with relatively higher rule-gain
values. In particular, in Procedure findCandidateRuleSets, we first
let Ci contain all applicable rules whose rule gain is greater than
zero (line 4), and then we iteratively remove the useless rules whose
rule gain is too small from Ci (in lines 5∼11). In lines 6∼7, we
use the elements in Ci to expand Si to acquire Si0 . Then, in lines
θ
is removed from
9∼11, any rule whose rule gain is less than 1+θ
Ci , where θ is defined as Jaccard similarity of the current expanded
sets S10 and S20 (line 8), as its similarity is lower than the “average”
similarity and its usage would be detrimental to the final similarity.
Subsequently, in line 2, we iteratively expand s1 and s2 using the
most gain-effective rule from the candidate sets. Note that after a
new rule is used to expand the set, the rule gains of the remaining
rules need to be recomputed in line 16, since the new elements U
(in Algorithm 1) of the remaining rules will change.
Performance Analysis: The time complexity of SE algorithm is
O(L + kn2 ), where L is the maximum length of s1 and s2 , n is
the total number of applicable synonyms for s1 and s2 , and k is the
maximum size of the right-hand side of a rule. More precisely, we
use the suffix-trie to find applicable rule sets, and its complexity
is O(L). In Procedure findCandidateRuleSet, the complexity to
perform full expansion (lines 6 ∼ 7) is O(kn), and the iteration
time is no more than n, thus the complexity is O(L + kn2 ).
Algorithm 2: SE Algorithm
1
2
3
4
5
6
Algorithm 1: Rule Gain
1
2
3
Input
: r, s1 , s2 , R={ri : lhs(ri ) → rhs(ri )}
Output : the rule gain of r
U ← rhs(r) − S1 ; // r is applicable to s1 , and S1 is the token set of s1
if (U = φ) then
return 0;
//Let R
0
4 S2
← S2
0
S
7
0
ri ∈R
rhs(ri ) //full-expanding S2
6
return
|G|
;
|U |
2
10
11
12
0
0
θ ← Jaccard(S1 , S2 )
for (r ∈ C1 ∪ C2 ) do
θ
if (RG(r) < 1+θ
) then
Ci ← Ci − r; //r ∈ Ci , i=1 or 2
9
until (no rule can be removed in line 19);
return C1 , C2
Procedure expand(s1 , s2 , C1 , C2 ,)
0
0
//Let S1 and S2 denote the expanded sets of S1 and S2
0
G ← (rhs(r) ∩ S2 ) − S1 ;
Procedure findCandidateRuleSet(s1 , s2 , R)
Ci ← {r: r ∈ R ∧ lhs(r) is a substring of si (i=1 or 2) ∧ RG(r) > 0};
repeat
S
0
S1 ← S1 r∈C rhs(r);
1
S
0
S2 ← S2 r∈C rhs(r);
8
denote all applicable rules of S2
5
Input
: s1 , s2 , R={r: lhs(r) → rhs(r)}.
Output : Sim(s1 , s2 , R).
C1 , C2 ← findCandidateRuleSet(s1 , s2 , R);
θ ← expand(s1 , s2 , C1 , C2 );
return max(Jaccard(s1 , s2 ), θ);
0
13
S1 ← S1 ;
14
S2 ← S2 ;
while (C1 ∪ C2 6= φ) do
Find the current most gain-effective rule r ∈ Ci (i = 1 or 2);
if (RG(r) > 0) then
0
0
Si ← Si ∪ rhs(r);
0
15
Given two strings s1 and s2 , a collection of synonyms R, and a
rule r∈R, we assume that r is an applicable rule for s1 . We denote
the rule gain of r by RG(r, s1 , s2 , R). When s1 , s2 , and R are clear
from the context, we shall simply speak of RG(r). Informally, the
rule gain is defined by the number of useful elements introduced by
r divided by the number of new elements in r. In particular, we say
that a token t ∈ rhs(r) in r is useful if t belongs to the full expansion set of s2 and t ∈
/ s1 . The reason we use the full expansion set
of s2 rather than the original set is that s2 can be expanded using
new rules, but all possible new tokens must be contained in the full
expansion set. We now go through Algorithm 1. In line 1, we find
the new elements U introduced by r. If U = ∅, then the rule gain
16
17
18
Ci ← Ci - r;
19
0
20
0
return Jaccard(S1 , S2 );
3.3
Theoretical Analysis
In this subsection, we carve out one case and show that SE returns the optimal value, and analyze the reduced time complexity
in this optimal case.
s2
Sig(s1,R)={t1,t3}
…
L1 u1 p v1
s2
Sig(s2,R)={t1,t2,tn}
L2
(a) The structure of I-List
un
vn
p
…
...
t11 p t1m p
tn1 p tnm p
(b) The structure of S-Directory
Figure 3: The structure of I-List and S-Directory.
t1
…...
ti
…...
tj
…...
h(i)
2
5
9
11
14
15
tn
(b
String Ids
Binary strings LSB Bit-vector
0010
1
0100
0101
0
1100
1001
0
1100
1011
0
1100
1110
1
1100
1111
3
1101
Mapped In
s1
s2
2
5
vt(g1)
…
…
0
0
0
0
1
1
1
0
1
1
0
1
1
1
1
0
0
0
0
1
1
1
0
0
0
0
1
1
1
0
1
1
0
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
0
1
1
1
vt(g2)
…
…
vt(g): a synopsis of
signature g from table T
s3
8
n
S token-hash-index
9
Sig(Si ). Note that this generation of signatures s4s5
Sig(s, R)= 2i=1
11
s6
14
ID1
can be performed
offline. ID1
…...
…...
We observe
that SN-join
needs to check the
signature overlap
(a) 2-Dimensional hash sketch synopsis.
IDi
IDi
for each string
pair (s, t) …...
∈ S × T to find the candidates,
which
…...
1; v g
s (i)=1
and v g
s (j)=1
0; otherwise
can be time-consuming.
InIDjthis section, we propose
an
optimized
IDj
…...
…...
native indexes
and the corresponding
join algorithm (Section 4.1).
IDn
IDn
We then study the impact of signature filters on the performance of
(b) Example of 2-Dimensional hash sketches.
(a) 2-Dimensional hash sketch synopsis.
algorithms
depth of
byDouble-Indexes
proposing a novel
hash-based synopsis
data
(a) Thein
structure
structure to effectively select signatures (Section 4.2). Finally we
extend our discussion on LSH scheme (Section 4.3).
…
vs(g1) 1 1 1
(1) 2DHS of g1
vs(g): a synopsis of
signature g from table S
vs(g2)1 1 0
(2) 2DHS of g2
(3) 2DHS of g1
sig1 sig2
S
ICDE
0; hs[i]=0
and ht[j]=0
1; otherwise
…
…
表
对上 1
应的 0
的
0
0
1
1
1
0
I-List
FM
n
…
签
n-1
名的 …
所倒 j
对排 …
应索 1
的引 0
…
中
的
0 1 …i …m-1 m
一
hs:S的倒排索引中的一
个
个签名对应的sketch
…
Confer Interna
on
ence tional
…
SI-Join Algorithm
ACM Conf
…
4.1
ht: T
…
To optimize the above SN-Join method, we show a native index
sig1 sig2
by (i)
checking; and (ii) reducing the
q1 avoiding
q1
q1 q4
q1 signature-overlap
q2
表
对上
q2
应的 1
1
numberq3of candidates
for
the
final
verification,
as
follows.
的
q3
1
1
1
1
1
1
1
1
1
sig1 1
(b) sig1对应
I-List
FM
1
0
Avoiding signature-overlapping check. We design a signature in-1 0
(a) 基于签名的倒
排索引S和T
verted list, called I-List (see Figure 3(a)). Each token ti is associated with an inverted list including all string IDs whose signatures
1; v g
s (i)=1
anddirectly
v gt (j)=1
contain the token ti . Thus, given a signature ti , we can
0; otherwise
obtain all relevant string IDs without any signature-overlapping
check.
vs
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
v gs 1 1
1
1
…
1
…
…
vtg1
…
…
…
vg
s : a synopse of
(1) 2DHS of g1
signature g from table S
0
0
1
1
0
1
1
1
0
1
1
1
0
g2 1
1
0
g
vt 2
…
…
vgt : a synopse of
signature g from table T
1
T
sketch
(2) 2DHS of g2
…
…
…
…
…
vg
s : a synopse of
signature g from table S
vt(g1)
…
…
vgt : a synopse of
signature g from table T
(a) 2-Dimensional hash sketch synopsis.
(b) Example of 2-Dimensional
Reducing the number of candidates. We extend
the idea of length
filter [6, 26] to reduce the number of candidates. Intuitively,
given
1; v sg(i)=1
and v tg(j)=1
two strings s1 and s2 and a threshold value θ, if the size0; otherwise
of the
full expansion set of s1 is much less than the size of |s2 |, then
without checking the signatures of s1 and s2 , one can claim that
2-Dimensional
hash sketch synopsis.
(b) Example of 2DHS.
sim(s1 , s2 , R) must be smaller than θ. Based (a)on
the observation,
we propose a data structure, called S-Directory, to organize the
strings by their lengths.
The S-Directory (see Figure 3(b)) is a tree structure with three
levels, and nodes in different levels could be categorized into different kinds.
• root-entry. A node in level L0 is a root entry, which has multiple
pointers to the nodes in level L1 .
• fence-entry. A node in level L1 is called fence entry, which contains three fields hu, v, P i, where u and v are integers, and P is a
set of of pointers to the nodes in L2 . In particular, u denotes the
number of tokens of a string and v is the maximal number of tokens
in the full expanded sets of strings whose length is u.
• leaf-entry. A node in level L2 is called leaf entry, which contains
two fields ht, P i, where t is an integer to denote the number of the
tokens in the full expanded set of a string whose length is u, and P
is a pointer to an inverted list. As leaf entries are organized in the
ascending order of t, vi = tim (See Figure 3(b)).
We argue that S-Directory can easily fit in the main memory,
as the size of S-Directory is only O((vmax − umin )2 ), which is
quadratic to the difference between the maximal size of expanded
sets and the shortest length of records. For example, in our implementation, to perform a self-join for a table with one million strings
on USPS dataset, the size of S-Directory is only about 2.15KB.
To maximize the pruning power, we combine I-List and S-Directory
together. That is, for each leaf-entry=ht, P i, P points to a set of
…
STRING SIMILARITY JOINS
Given two collections of strings S and T , a collection of synonyms R, and a similarity threshold θ, a string similarity join finds
all string pairs (s, t) ∈ S × T , such that sim(s, t, R) ≥ θ, where
sim is one of the similarity functions in Section 3.
One join algorithm is a signature-based nested-loop approach,
called SN-Join, following the filtering-and-verification framework.
That is, in the filtering step, SN-Join generates candidate string
pairs by identifying all pairs (s, t) such that the signatures of s and
t overlap. In the verification step, SN-Join checks if sim(s, t, R)
≥ θ for each candidate and outputs those that satisfy the similarity
predicate.
To generate the signatures, one way is to extend the prefix-filter
scheme. Given a string s, the signatures of s are a subset of tokens
in s containing the d(1 − θ)|s|e smallest elements of s according
to a predefined global ordering. Therefore, we generate signatures
for a string s as follows. Let R ⊆ R be the set of n application
synonym pairs for s. For each subset Ri of R, we generate an
expanded set Si . For each Si , we obtain one signature set (denoted by sig(Si )). Note that the number of subset Si is 2n . The
exponential number is due to the fact that we try to enumerate all
possibilities of expanded sets for s with n synonym pairs. Finally,
we generate a signature set for s by including all signatures of Si :
s1
…
4.
s2
L0 root
tn
…
Note that the condition in Theorem 1 does not require all the rhs
tokens to be distinct. It only requires those from useful synonyms
of s1 and s2 to be distinct, which are typically a very small subset. In addition, under the condition, the time complexity can be
reduced to O(L + n log n + nk). The reason is that, the complexity to perform full expansion (lines 6∼7) is reduced to O(1).
Since the tokens are distinct, and that of expanding (lines 15∼19)
is O(n log n), since the rule gain of each rule does not change,
which can be implemented by a maximal heap. Finally, the Jaccard
computation in line 20 needs O(L + kn) time in the worst case.
Therefore, the complexity of SE algorithm is O(L+n log n+nk).
s1
…
…
i
P
|S ∩S + G∗ |
i
to Algorithm 2, G
> θ. Thus, θ > |S11 ∪S22 +P Z ∗i | = θ∗ , which
Zi
i
contradicts to the optimal property and such r does not exist. Thus,
∗
R⊆R .
Finally, we show that all the candidate rules in C1 ∪ C2 will be
used in Procedure expand (line 2). If all the rhs tokens are distinct,
the rule gain value will not change. The rule gain value of each rule
is greater than zero (line 17), so all the rules will be used by SE in
the expanding phase, which concludes the proof.
t3
ID-hash-index
∗
θ
θ
≥ θ, that is, G
> 1+θ
, i.e., RG(r∗ ) > 1+θ
, which means that r∗
U∗
will not be removed in Procedure findCandidateRuleSet (line 11)
and r∗ ∈ R. Thus, R∗ ⊆ R.
We then prove R ⊆ R∗ by contradiction.
Assume that there
P
|S ∩S + G∗ +G |
exists a rule r ∈ R but r 6∈ R∗ , θ = |S11 ∪S22 +P Zi∗ +Zii| . According
t2
ID-hash-index
P ROOF. Let R∗ denote the rule set used by the optimal algorithm. Let R denote the set C1 ∪ C2 returned in line 1 of Algorithm 2. Then we will prove that R∗ = R as follows. Given an applicable rule ri in R, let Gi denote the useful elements, Ui denote
the new elements, and let Zi = Ui − Gi . Let θ∗ and θ denote the
full expansion similarity using R∗ and R respectively. Note that all
∗
the rules in both R and
are
P R are useful. When all the rhs tokens
∗
|S1 ∩S2 + P Gi |
> θ∗
distinct, θ = |S1 ∪S2 + Zi | . For a rule r∗ ∈ R∗ , we have G
Z∗
t1
inverted
index
T HEOREM 1. Given two strings s1 and s2 and their applicable
rule sets R(s1 ) and R(s2 ), we call a rule r ∈ R(si ) (i = 1 or 2)
useful if RG(r) > 0. We can guarantee that SE is optimal if the rhs
tokens of all useful rules are distinct.
Signature
inverted
index
Signature
inverted
index
Signature
inverted
list
Signature
inverted
list
Signature
inverted
list
Signature
inverted
list
Signature
inverted
list
(b) SH-Index
(a) HS-Index
0
0
0
0
0
1
1
1
1
0
0
0
0
0
0
1
1
1
1
1
1
1
1
0
1
1 1
vs(g1)
0
0
ID
q1
Strings
Signatures
2013 ACM Intl Conf on
ACM, International,
Management of Data USA
Conference, on
q2 Very Large Data Bases Conf Conf, Conference
q3
VLDB Conf
Conf, Conference
q4
ICDE 2013
ICDE
L F
9
5
2
2
11
7
7
2
(a) Strings and signatures
ACM Conf
Confer Interna
on
ence tional
root
root
ICDE
2 7
9 11
E1 2
5 7
F1 2 7
E2 7
ICDE
q1
ID Synonym pairs ID
Synonym pairs
r1 Intl International r3 VLDB Very Large Data Bases
r2 Conf Conference r4 Very Large Data Bases VLDB
(b) Synonym pairs
q2
q3
q1
q2
q3
q1
(c) An I-List
q1
q4
2
q4
7
q3
11
7
q4
Conf
Confer
ence
q3
q3
F2 5 7
F3 9 11
E4 11
E3 7
Confer
Conf
ence
ACM
Confer Interna
on
ence tional
q1
q2
(d) An S-Direcotry
q2
q2
q1
q1 q1
q1
(e) An SI-Index
Figure 4: An example to illustrate the indexes. The column L and F denote the number of tokens in the string and in the full
expanded set, respectively.
q3
Intl ICDE
ICDE, International 2
Conf Intl UK VLDB 2012
7
Realstring
Time Systems
17 48 Estimation-based
25 17 23 4 8
signatures (from the
whoseSymp
size of the full expansion set1is4 2 5 48 4.2
signature selection
t) associated with multiple I-Lists. Thus, we have a combined in4 5 8 4 7 8 5 7 3 8
on, 39th, Intel, 2013,
Bases,leaf
Data, Large,
Very,VLDB
dex, called SI-tree,
i.e.,Conf,
each
entry
in S-Directory is associ4.2.1 Motivation
ated with an I-List. Based on the SI-tree, we design a join algoq4 q2 algorithm
q5 q1
q3 q4 q1 q7 q2
Theq6 SI-join
generates signatures for each string acrithm (called SI-Join), which follows the filtering-and-verification
q6
q7
cording
to
a
global
order of tokens. In theory, if N denotes the
framework and the verification phase is the same as that of SN-Join.
(d)
An IS-Index
number
of distinct tokens, there are N ! ways to permute the tokens
SI-Join (see Algorithm 3) has three steps in the filtering phase.
Global order International Conference
on 39th signatures.
Intel 2013 Conf Bases
Data Large
VLDBtables for approximate join,
to
select
Given
twoVery
string
In the first step (lines 2 ∼ 3), consider two fence-entries
F
and
s
Frequence
6
8
8 8 12 12 16 18 18 18 18
18
Mapped
IDs
1
2
3
4
5
6
7
8
9
10
11
12algorithm to select signatures
it
is
impractical
to
design
an
efficient
Ft . If the maximal size of expanded sets of strings below Fs (or
(b)for
Global
by IDF
value This is because even one scan of data is
eachorder
string
online.
Ft ) is still smaller than θ times the length of original strings below
time consuming and it is meaningless to spend comparative time as
Ft (or Fs ) (i.e. u), then all pairs of strings below Fs and Ft can
the real join to find good signatures. In this section, we propose a
be pruned safely. Similarly, in the second step (lines 4 ∼ 5), given
Name
ID
novel strategy to solve this problem. Our main idea is to use multitwo leafID
entries Es and
Et , where all expanded
sets have theName
same
ple ways to generate signatures offline, and then given two tables to
Large Scale
WH Conf 2012
t1 s (or Et Very
size, if the
of expanded
sets below E
) is smaller
than Integeration
q1 size Intl
weUK
quickly estimate the quality of each signature filter and seθ times the
length
of
original
strings
below
E
(or
E
),
then
all
t
s
2012
VLDB
t2 Intl Wireless Health Conferencejoin,
q2
lect the best one to perform the actual filtering. The main challenge
string pairs
... safely. In the third step
...
... below Es and
... Et can be pruned
(lines 6 ∼ 10),(a)
forTable
each signature
g that is an overlapping(b)
signature
Q
Table T in this idea is two-fold. First, we need to select multiple promising signature filters offline. Second, the estimator should quickly
pointed ID
by Es .P and
E
.P
,
Function
getIList
returns
the
I-List
t
Synonym pairs
ID
Synonym
pairs
determine the best filter online according to their filtering power.
whose key
is
g.
Then
SI-Join
generates
string
pair
candidates
for
WH Wireless Health
Conference
Conf
r1
r2
To address the first challenge, we use the itf -based method ([2,
every twor3I-Lists.
Intl
International
UK
United Kingdom
r4
21])
to generate filters. In particular, let tf(w) denote the occurWireless
Health
WH
Conf
Conference
r5
r6
... 3: SI-Join Algorithm
...
...
... rences of token w appearing in the join tables, i.e., the frequency
Algorithm
of w. Then the inverse term frequency of the token w, itf(w), is
Input
: SI-Trees SIS for S and SIT for T , threshold value θ, a collection
defined as 1/tf(w). The global order is arranged in a descending
of synonym pairs R.
Output : C: Candidate string pairs (s, t) ∈ SIS × SIT
order based on itf values. Intuitively, tokens with a high itf value
1 C←−∅;
(c) Transformation synonym pairs
are rare tokens in the collection and have a high probability not to
2 for (∀ Fs in SIS , ∀ Ft in SIT ) do
be chosen as signatures. Since tokens come from two sources, ta//Fence-entry:F =<u, v, P >
3
if (min(Fs .v, Ft .v) ≥ θ × max(Fs .u, Ft .u)) then
bles and rules, there are multiple ways to compare the frequency
4
for (∀ Es ∈Fs .P ,∀ Et ∈Ft .P ) do
of tokens as follows. (1) ITF1: sort the tokens from data tables
//Leaf-entry:E=<t, P >
and synonyms separately by decreasing itf, and then arrange tokens
5
if (min(Es .t, Et .t) ≥ θ × max(Fs .u, Ft .u)) then
6
for (∀ g is an overlapping signature pointed by Es .P and
from tables first, followed by those from synonyms; (2) ITF2: sort
Et .P ) do
separately again, but synonyms first, tables second; (3) ITF3: sort
7
Ls =Es .P →getIList(g);
Similarity Algorithm Computational Complexity Effectiveness Comment
all the tokens together by a decreasing itf order; (4) ITF4: gener8
Lt =Et .P →getIList(g);
Jacc
O(L)
Low
Does not consider synonym pairs.
9
C={(s, t)|s ∈ Ls , t ∈ LtIf };
only single token synonym pairs are allowed (implying
1), an
optimal algorithm of expanded sets for each string, then sort the
atek =all
possibilities
JaccT [1]
O(2 (L+kn))
High
based on maximum bipartite matching can be used with a complexity of O(L + Ln ).
10
C=C∪ C;
tokens by the decreasing itf order in all the expanded sets.
Full Expansion (ours)
O(L+kn)
High
Most efficient and may have good quality.
SE is efficient and has good quality, it is optimal under the conditions of Theorem 3,
In order to give an empirical study of the effect of different signaSE Algorithm (ours)
O(L+n +kn)
High
and the complexity can be reduced to O(L + n log n).
tures on filtering power, we perform experiments using only signa11 return C;
ture filters with ITF1∼ITF4 against two datasets: CONF and USPS
VLDB
International Conference
on Very
Large
Databases
r7
data.
On the
CONF
dataset, we perform a join between two tables
PVLDB
Proceedings of the VLDB Endowment
r8
S and T using a synonym table R. Table S contains 1000 abbreviations of conference names, while table T has 1000 full expressions
E XAMPLE 3. We use this example to illustrate the SI-Join alof conference names. On the USPS dataset, we perform a self-join
gorithm. Consider
a self-join
the table
Fig. 4(a) with θ = 0.8
Dataset
1000
3000 on
5000
7000 in 9000
USPS in Fig.
70.0%4(b).
71.2%
72.2% 70.8%
on a table containing one million records, each of which contains a
and synonyms
The72.8%
corresponding
I-Lists, S-Directory,
CONF
62.4% 67.6% 68.9% 71.2% 70.4%
person name, a street name, a city name, etc., using 284 synonyms.
and SI-index
are shown
in Figs.
4(c),89.8%
(d) and
(e). SI-Join prunes
SPROT
84.4% 86.7%
88.4%
87.3%
The detailed description of the datasets can be found in Section 5.
all the string pairs below F1 and F3 , since min(7, 11) < 0.8 ×
Figure 5 reports the experimental results. We observe that (i)
max(2, 9) (lines 2∼3). Similarly, all pairs below (E1 , E3 ) can
four signature filters achieve quite different filtering powers. The
be pruned (lines 4∼5). Then SI-Join uses the signatures “Con|C|
ference” and “Conf” to get I-Lists below (E2 , E3 ) (lines 7∼8),
, where |C| denotes the
filtering power is measured by ψ = |N
|
since they are overlapping signatures pointed by E2 and E3 (line
number of pruned pairs and |N | is the total string pairs. For in6). Then a candidate (q2 , q3 ) is generated and added to C (lines
stance, on the CONF dataset, when the join threshold is 0.7, the
9∼10). The final results are (q2 , q3 ) and {(qi , qi )|i ∈ [1, 4]}.
filtering power of ITF1 is 89.2%, while that of ITF2 is only 53.1%;
n
2
2
2
s5
s6
ITF1
ITF3
ITF2
ITF4
Filtering Power(%)
100
90
80
70
60
50
40
0.5
0.6
0.7
0.8
0.9
(a)CONF dataset
ITF1
ITF3
100
90
80
70
60
50
40
30
0.5
0.6
0.7
ITF2
ITF4
0.8
ITF1 String idsITF2Hash integers
ITF3
s1 ITF4
2
100
90
80
70
60
50
40
0.9
11
14
s2
s3
s4
1011
1110
0
0
1110
1110
Binary strings
First 1-bit
Bit-vector
0010
0101
1000
1011
2
1
0
0
0010
0110
1110
1110
5
8
11
Figure 6: An example of bit-vector generated by FM method.
(The first 1-bit is the left-most 1-bit.)
0.5
(b)USPS dataset
Figure 5: Filtering powers of different signature filters.
0.6
0.7
0.8
0.9
(c)SPROT dataset
a higher probability to be set to 1. Figure 6 illustrates the FM
method. Assume there are six strings. The length of the domain
L=4. Finally, the synopsis ν is “1110”.
and (ii) no signature filer can always perform the best in each dataset.
Our two dimensional hash sketch (2DHS): In the following, we
For example, ITF1 performs the best in the USPS dataset, but it perdescribe our ideas to extend FM sketch to provide better bounds to
forms the worst in the CONF dataset. The main reason is that ITF1
IDF1IDF2-estimate the number
IDF3of candidate pairs. The main challenge is that
gives a higher priority for the tokens in the data table to become
the existing FM-based methods [14, 15] are only applicable to one
signatures, but there are many overlaps in these signatures in the
dimensional data, but we need to estimate the number of distinct
CONF dataset, resulting in a large number of candidates.
pairs based on two independent FM sketches. In addition, the extra
These results suggest that it is unrealistic to find a “one-size-fitscomplication occurs, since we require the bounds to be correct with
all” solution. Therefore, in the following, given two tables for join,
arbitrarily high probability. We now describe four steps to address
we propose an estimator to effectively estimate the filtering power
the challenges.
of different filters to dynamically select the best filter.
Step 1 (Estimating the distinct number of one table): At first,
we estimate the distinct number of string IDs for all signature to4.2.2 Estimation-based selection algorithms
kens from one table, that is, µS and µT , where µS = ∪ `(S, g) and
Let `(S, g)= {q g , ..., q g } denote a list of string IDs (i.e., I-lists in
1
g∈G
n
Figs. 3 and 4) associated with the signature token g from the table
S. Then an ID pair (qig , qjg )∈ `(S, g) × `(T, g) is an element of the
cross-product of the two lists. Let G denote the set of all common
signatures between S and T . Therefore, given a signature filter λ,
the union of all ID pairs from all signatures in G constitutes the set
of candidates after signature filtering, that is, ∪ `(S, g) × `(T, g).
g∈G
We can use the following lemma to estimate the lower bound and
the upper bound of the cardinality of candidates.
L EMMA 1. Given a signature filter λ, the upper and lower bounds
of the cardinality of the candidate pairs are:
U B(λ) =
and
X
(k`(S, g)k × k`(T, g)k)
g∈G
LB(λ) = max{k`(S, g)k × k`(T, g)k, g ∈ G}
where k`(·, g)k denote the number of IDs in a list.
µT = ∪ `(T, g). Ganguly et al. [15] provide an (, δ)-estimate alg∈G
gorithm for µS (or µT ) by using the extension of the FM technique,
which estimates the quantity of µS (or µT ) within a small relative
error and a high confidence 1-δ. Space precludes detailed discussion about their method in this paper. However, readers may find it
in Section 3.3 of [15]. Note that after the first step, we indeed obtain a new upper bound: µS × µT . But it is still not tight enough.
We proceed to the following steps to get a better value.
Step 2 (Constructing a two-dimensional hash sketch): Let νSg
and νTg be an FM sketch of signature g in I-Lists of table S and
table T respectively. We construct a 2-dimensional hash sketch
synopsis (2DHS) of signature g, which is a 2-dimensional bitvector. The bucket (i, j) in a 2DHS is 1 if and only if νSg (i) = 1 and
νTg (j) = 1. Further, let G be the set of common signatures of S and
T . We construct a new 2DHS 0 , which is the union of all 2DHS’s
of each signature, namely ∪ 2DHS g , such that 2DHS 0 (i, j)=1
g∈G
In the above lemma, the upper bound is defined as the total number of string pairs and the lower bound is the maximal number of
string pairs associated with one signature. A signature λ guarantees
to return fewer candidates than signature λ0 if U B(λ) < LB(λ0 ).
Obviously, the tighter the bound is, the better the results are, since
a tighter bound implies more accurate estimates on the ability of
a filter. Therefore, in the following, we introduce tighter upper
and lower bounds by extending the Flajolet-Martin (FM) [14] technique.
The Flajolet-Martin estimator: The Flajolet-Martin (FM) technique [14] for estimating the number of distinct elements (i.e., setunion cardinality) relies on a family of hash functions H that map
data values uniformly and independently over the collection of binary strings in the input data domain L. In particular, FM works
as follows: (i) Use a hash function h∈H to map string ids to integers sufficiently uniformly in range [0, 2L−1 ]; (ii) Initialize a
bit-vector ν (also called synopses) of length L to all zeros and convert the integers into binary strings of length L; and (iii) Compute the position of the first 1-bit in the binary string (denoted
as p(h(i))), and set the bit located at position p(h(i)) to be 1.
Note that for any i∈[0, 2L−1 ], p(h(i)) ∈ {0, . . . , logM -1} and
1
, which means the bits in lower position have
P r[p(h(i))=`]= 2`+1
if ∃ 2DHS g (i, j)=1.
E XAMPLE 4. Figure 7 illustrates Step 2. Assume there are two
common signatures g1 and g2 of S and T , νSg1 ={1,1,1},νTg1 ={1,1,0,1}
and νSg2 ={1,1,0},νTg2 ={1,1,1,0}. Figure 7(b1) is the 2DHS of g1
and Figure 7(b2) is the 2DHS of g2 . Then the union 2DHS is
shown in Figure 7(b3).
Step 3 (Computing witness probabilities): Once the union 2DHS
is built, we select a bucket (i, j) to compute a witness probability as follows. We examine a collection of ω independent 2DHS
(ν1g ,ν2g ,. . . ,νωg ) (each copy using independently-chosen hash functions from H). Algorithm 4 shows the steps to compute witness
probabilities. In particular, let Rs =blog2 µS c and Rt =blog2 µT c.
We investigate a column of buckets (Rs , y) (line 2 ∼ 8), and a row
of buckets (x, Rt ) (line 9 ∼ 15), and select two smallest indexes y
and x at which only a constant fraction 0.3ω of the sketch buckets
turns out to be 1. As our analysis will show, the observed fraction
pˆs and pˆt (line 16) and their positions can be used to provide a
robust estimate for bounds in the next step.
Step 4 (Computing tighter upper and lower bounds): The final
step is to compute the upper bound TUB and the lower bound TLB
-List
FM
1
1
1
1
0
0
…
…
vtg1
…
…
…
v gs : a synopse of
1
0
1
1
0
1
1
1
1
0
0
0
0
1
1
1
0
0
0
0
1
1
1
1
1
1
1
1
v gs 1 1
1
1
1
0
1
1
0
1
1
1
0
1
1
1
1
1
1
1
0
1
1
1
1
vgs2 1 1
0
g
vt 2
…
…
vgt : a synopse of
signature g from table T
1; v gs (i)=1
and v gt (j)=1
0; otherwise
1
S×T
(d) 把各个signature的
sketch并起来后的sketch
(a) 基于签名的倒
排索引S和T
…
1
1
(1) 2DHS of g1
signature g from table S
(a) 2-Dimensional hash sketch synopsis.
(2) 2DHS of g2
(3) 2DHS of g1 g2
(b) Example of 2-Dimensional hash sketches.
vgt : a synopse of
signature g from table T
Figure 7: The structure of 2-dimensional hash sketch (2DHS)
1; v gs (i)=1
0
0
0
0
1
1
1
0
6
7
8
9
10
11
12
13
14
15
16
vt(g1)
1
…
5
…
4
…
3
…
1
0; otherwise
: ω independent 2-dimensional hash
sketches
0 0 0 0 0
(2DHS1 ,2DHS2 ,...,2DHSω ), Rs and Rt .
1 1 1 1 0
Output : pˆs and pˆt and two witness positions (Rs , y) and (x, Rt )
…
1 1 1 1 0
f =0.3ω;
v gs : a0synopse
1 1 1 0
for (y from
to Rtof) do
vs(g1)
signature g from table S
count=0; hash sketch synopsis.
(a) 2-Dimensional
(b) Example of 2DHS.
for (i from 1 to ω) do
if (2DHSi [Rs ][y]==1) then
count = count+1;
Input
2
…
…
0
and v gt two
(j)=1 witness probabilities
Algorithm …4: Computing
L EMMA 2. If µS ≥ µ0S and µT ≥ µ0T , then M IN (N, µS , µT , x, y)
≥ M IN (N, µ0S , µ0T , x, y) and M AX(N, µS , µT , x, y) ≥
M AX(N, µ0S , µ0T , x, y).
if (count≤f ) then
pˆs =count/ω and goto Line 9;
for (x from 0 to Rs ) do
count=0;
for (i from 1 to ω) do
if (2DHSi [x][Rt ]==1) then
count = count+1;
L EMMA 3. The lower bound of the MIN probability can be
computed as:
p1 = [1 − (1 −
if (count≤f ) then
pˆt =count/ω and goto Line 16;
of a signature filter λ. Assume, without loss of generality, that
µS ≤ µT .
TUB(λ) = ϕ1 · min(u1 , u2 ), where
ln(1−ps )
pˆs
u1 = µS · ln(1−
1 ) , and ps = 1−(1− 1
u 2 = µT ·
ln(1−pt )
,
ln(1− 21x )
TLB(λ) = ϕ2 · µS ·
∆p = [1 − (1 −
and pt =
ln(1−p0s )
,
ln(1− 21y )
µS
1
2Rs +y
)
)µS
2Rs
pˆt
1 )µT
1−(1− R
2 t
where p0s =
]−
1
2y
× [1 −
1 µS
) ] × [1 − (1 −
2x
1 µT
) ] × [1 − (1 −
2y
N
1 µS
)
2y
N
1 µT
)
2x
p2 = [1 − (1 −
M IN (N, µS , µT , x, y) ≥ max(p1 , p2 ).
return pˆs , pˆt and their witness positions (Rs , y), (x, Rt )
2y
consider the following two cases: One is (1,1), (2,1), . . . , (n,1) and
the other is (1,n+1), (2,n+2),. . . ,(n,2n). Both cases have n distinct
pairs, but their probabilities are different. The first case is 21y ×
1
)n ]. There[1 − (1 − 21x )n ], but the second case is [1 − (1 − 2x+y
fore, we need to differentiate the probabilities for these two extremes. Let M IN (N, µS , µT , x, y) and M AX(N, µS , µT , x, y)
denote the minimum and maximum probabilities respectively that
the bucket (x, y) is 1 with N distinct pairs, where µS and µT denote the number of distinct IDs in each dimension respectively. The
computation of the accurate values of the MIN and MAX is highly
complicated. But for the purpose of this paper to differentiate the
ability of filters, we derive a lower (resp. upper) bound for the MIN
(resp. MAX) probability.
The following lemma implies that the MIN and MAX probabilities are monotonically increasing on µS and µT . This monotonicity can ensure Lemmas 3 and 4 to compute the lower and upper
bounds.
;
pˆs −∆p
1 )µS
1−(1− R
2 s
1
(1 − 2Rs )µS ]
where constants ϕ1 =1.43, ϕ2 =0.8, are derived in Theorem 2.
E XAMPLE 5. This example illustrates the computation of TUB
and TLB. Assume µS =100 and µT =200. Thus Rs =blog100
2 c=6,
Rt =blog200
2 c=7. Assume, in Algorithm 4, that the returned values are pˆs =0.2, pˆt =0.19, x=5, and y=6. According to the formulas in Step 4, we have ps =0.2522, pt =0.24, u1 =1845, u2 =1729,
∆p=0.0117, p0s =0.2375, and the upper bound T U B(λ) = 1.43*1729
= 2472 and T LB(λ) = 1722×0.8=1278.
Therefore, one filter is selected as the best one if its TUB is less
than all the TLBs of other filters, which means that it returns the
smallest number of candidates. In practice, if there are several filters whose bounds have overlaps, then it means that their abilities
are similar with a high probability and we can arbitrarily choose
one of them to break the tie.
Theoretical analysis: We first demonstrate that, with an arbitrarily
high probability, the returned lower and upper bounds are correct,
and then we analyze its space and time complexities. Due to space
constraint, some proofs are omitted here.
It is important to note that given n distinct ID pairs and a position
(x, y) in a 2DHS matrix, the probability that the bucket in (x, y) is
1 depends on the distribution of the input data. To understand it,
];
];
P ROOF. (Sketch) According to Lemma 2, the MIN probability
is monotonically increasing with µS and µT . Therefore, the final
MIN probability is greater than p1 and p2 , since µS ≥ µNT and
µT ≥ µNS . That concludes the proof.
L EMMA 4. Assume that µS ≤ µT . Then the upper bound of
the MAX probability can be computed as:
N
p1 = [1 − (1 − 21x )µS ] × [1 − (1 − 21y ) µS ];
1
p2 = [1 − (1 − 2x+y
)µS ] − 21y × [1 − (1 − 21x )µS ];
M AX(N, µS , µT , x, y) ≤ p1 + p2 .
P ROOF. (Sketch) According to Lemma 2, with a fixed µS , the
greater the value µT is, the greater the MAX probability is. In the
worst case, µT = N . Therefore, the MAX probability pmax is no
greater than:
N
pmax ≤ 1 − [(1 − 21x ) + 21x · (1 − 21y ) µS ]µS
We then prove that pmax −p1 ≤ p2 . We can show that pmax −p1
is a monotonically decreasing function when N ≥ µS . Therefore,
when N = µS (in this case, N = µS = µT ), we have the minimal
value as follows.
pmax − p1 ≤ 1 − [(1 − 21x ) + 21x · (1 − 21y )]µS - [1 − (1 −
1 µS
) ] × [1 − (1 − 21y )] = p2 .
2x
T HEOREM 2. Given a signature filter λ, with any high probability 1-δ, our algorithm guarantees to correctly return the upper
bound and the lower bound to estimate the number of candidate
pairs with λ.
P ROOF. (Sketch) There are three steps in this proof. Firstly, we
show that the estimated probabilities pˆs and pˆt returned by Algorithm 4 are low-error and high-confidence, and then we show our
formulas TLB and TUB are correct. Finally, we show how to adjust
formulas based on the estimated pˆs , pˆt and µS , µT .
First, in Algorithm 4, by Chernoff bound, the estimated p̂=count/ω
satisfies |p̂ − p| ≤ p with a probability at least 1-δ as long as
L EMMA 5. Let f(x) = ln(1 −
x
1 )µS (1+ε) ).
1−(1− R
2
If y − x ≤
s
0.15x for some x ≤ 0.3 and ε ≤0.1, then f (y) − f (x)≤ 0.43f (x).
P ROOF. Since 1−(1− 2R1 s )µS (1+ε) ≈ 1−e−(1+ε) and ε ≤ 0.1,
1
then 1−e−(1+ε)
≥ 1.5. By Taylor series, there is a value w ∈ (x, y)
such that ln(1 − 1.5y) = ln(1 − 1.5x)-1.5(y − x)/(1-1.5w). Thus,
1.5(y−x)
0.225x
we have f (y) − f (x) ≤ 1−1.5max(x,y)
≤ 1−1.725x
≤ 0.225x
≈
0.52
0.43x ≤ −0.43ln(1 − 1.5x) ≤ 0.43f (x).
L EMMA 6. Let f(x) = ln(1 −
x−∆p
1 )µS (1+ε) ).
1−(1− R
2
If x − y ≤
s
0.15x for some x ≤ 0.3 and ε ≤0.1, then f (x) − f (y)≤ 0.2f (x).
P ROOF. (Sketch) First, we can prove
∆p
1 )µS (1+ε)
1−(1− R
2
<0.71.
s
The details are omitted here. Then by Taylor series again, there is a
value w ∈ (x, y) such that ln(1.71-1.5y) = ln(1.71-1.5x)-(1.5(x −
1.5(x−y)
y))/(1.71-1.5w). Thus, we have |f (y)−f (x)| ≤ 1.71−1.5max(x,y)
≤
0.225x
0.225x
≤ 1.19 ≈ 0.189x ≤ 0.2ln(1.71 − 1.5x), since by
1.71−1.725x
Maclaurin series, ln(1.71−1.5x) ≈ln1.71−(1.5/1.71)x ≈0.540.88x.
T HEOREM 3. Given a high probability 1-δ, our estimator needs
a total space of Θ(T log M log(1/δ)) bits, where T is the number
of overlapping signatures and M is the maximal length of inverted
list associated with a signature, and its computing complexity is
Θ(T log2 M + logM log(1/δ)).
P ROOF. The size of one FM sketch is log M and there are T
signatures. So the space for storing all FM sketches is 2T log M .
We need to generate a collection of ω=Θ(log 1δ ) sketches. Therefore, the total space requirement is Θ(T log M log(1/δ)). Further, to compute the time complexity, in step 1, the computing cost
is 2ωlogM . The cost of Step 2 is 2T log 2 M and that of Step 3
and 4 are ωlogM and O(1) respectively. Therefore the total cost is
bounded by Θ(T log2 M + logM log(1/δ)).
4.3
Extension to LSH filters
Although we focus on prefix filters in the above discussion, our
indexes and estimators can also be extended to other filters, such as
LSH scheme. In particular, each signature s in LSH is a concatenation of a fixed number k of minhashes of the set and there are
l such signatures (using different minhashes). Then, in the I-lists,
each inverted list is associated with a pair (s, i), where i means that
Dataset
# of Strings
String Len
(avg/max)
# of Synonym
Pairs
# of Applicable Synonym
Pairs Per String (avg/max)
USPS
CONF
SPROT
1,000,000
10,000
1,000,000
6.75 / 15
5.84 / 14
10.32 / 20
300
1000
10,000
2.19 / 5
1.43 /4
17.96 /68
Figure 8: Characteristics of Datasets.
# of Matches
80
70
1
Jacc
P
R
JaccT
F
P
R
Full
F
P
R
F
P
2
60 signature s comes from the i-th minhash, 1 ≤ i ≤ l. Therethe
3
50
0.70 0.04
0.04 0.04
0.52our
0.98SN-join
0.63 0.51 0.97 0.64 0.4
fore,
by replacing the prefix45 signature
g with
(s, i),
40
0.75 0.04 0.04 0.04 0.64 0.96 0.72 0.57 0.97 0.69 0.5
6-10
(Algorithm
3) and 2DHS estimator
can 0.04
be directly
LSH
30
0.80
0.04 0.04 applied
0.74 0.88on
0.78
0.72 0.93 0.78 0.7
>10
20
0.85
0.04 0.04
0.04
0.83 0.87 0.84both
0.86 0.87 0.86 0.8
scheme.
As
described
in
the
next
section,
we
implemented
10
0.90 0.04 0.04 0.04 0.80 0.82 0.81 0.84 0.84 0.84 0.8
LSH
and prefix filters in our algorithms
to 0.04
make
comprehensive
0
0.95 0.04
0.04a 0.80
0.81 0.80 0.78 0.78 0.78 0.8
Jacc
JaccT
Full
Select ive
comparison.
# of Matches
ωp≥ 2 log(1/δ)
. Let =0.15, Then ωp≥88 log(1/δ). Since pˆs and
2
pˆt returned in Algorithm 4 are the smallest indexes where only 0.3ω
of sketches turn to be 1, their values are no less than 0.3/2=0.15.
Therefore, p ≥ 0.15, and we can generate enough number of
sketches such that ω ≥ 587 log(1/δ) = Θ(log 1δ ). Strictly speaking, the witness probabilities pˆs and pˆt are estimated values, and
we need to use the Chernoff bound again to show that they are close
to 0.15 with a high confidence. The details are omitted here.
Second, according to Lemmas 3 and 4, it is not hard to mathematically compute the lower bound and upper bound to estimate
the number of distinct pairs number N .
Finally, the following two lemmas demonstrate that we only need
to multiply a constant in the formulas to use the low-error estimated pˆs , pˆt , µS and µT for the accuracy guarantee. In particular,
Lemma 5 is used to derive ϕ1 in T U B and Lemma 6 for ϕ2 in
T LB, respectively. (A similar proof technique can be found in [5,
15], even though our estimation is quite different from theirs.)
(a) Rank Distribution
5.
(b) Result Evaluation (P: Precision, R: Recall, F: F-m
EXPERIMENTAL RESULTS
In this section, we report an extensive experimental evaluation of
algorithms on three real-world datasets. First, we show the effectiveness and efficiency of the expansion-based framework to measure the similarity of strings with synonyms. Second, we evaluate
our SN-join, SI-Join and the state-of-the-art approaches to perform
similarity join with prefixes and LSH signatures. Finally, we analyze the performance of 2DHS estimators.
All the algorithms were implemented in Java 1.6.0 and run on a
Windows XP with dual-core Intel Xeon CPU 4.0GHz, 2GB RAM,
and a 320GB hard disk.
5.1
Datasets and Synonyms
Data Sets. We use three datasets: US addresses (USPS), conference titles (CONF), and gene/protein data (SPROT). These datasets
differ from each other in terms of rule-number, rule-complexity,
data-size and string-length. Our goal in choosing these diverse
sources is to understand the usefulness of algorithms in different
real world environments.
USPS: We downloaded common person names, street names, city
names, states, and zip codes from the United States Postal Service website (http://www.usps.com). We then generated one million records, each of which contains a person name, a street name,
a city name, a state, and a zip code. USPS also publishes extensive
information about the format of US addresses, from which we obtained 284 synonym pairs. The synonym pairs covers a wide range
of alternate representations of common strings, e.g., street → st.
CONF: We collected 10,000 conference names from more than ten
domains, including Medicine and Computer Science. We obtained
1000 synonym pairs between the full names of conferences and
their abbreviations by manually examining conference websites or
homepages of scientists.
SPROT: We obtained one million gene/protein records from the
Expasy website (http://www.expasy.ch/sprot). Each record contains an identifier (ID) and its name. In this dataset, each ID has
5 ∼ 22 synonyms. We generated 10,000 synonym rules describing
gene/protein equivalent expressions.
Figure 8 gives the characteristics of the three datasets.
5.2
5.2.1
Experimental results
Quality and Efficiency of Similarity Measures
The first experiment is to demonstrate the effectiveness and efficiency of the various similarity measures. We compared our two
measures: full expansion (Full), and selective expansion (SE) with
Jaccard without synonyms and JaccT using synonyms from [2].
For each of the three datasets, we performed the experiments
by conducting a similarity join between a query table TQ and a
ID
Name
Intl WH Conf 2012
q1
VLDB
q2
CONF dataset
...Jacc
JaccT ... Full
CONF dataset
ID
Name
Very Large Scale Integeration
t1
t2 Intl Wireless Health Conference 2012 UK
USPS dataset
SPROT dataset
... Jacc
... SE
JaccT
Full
Jacc
JaccT
Full
SE
SE
USPS dataset
SPROT dataset
P Jacc
R F P JaccT
R F P Full
R F P SE
R F P Jacc
R F P JaccT
R F P Full
R F P R
R F P JaccT
R F P R
SE F P Jacc
FullF P RSE F
0.50 0.04
P 0.44
R 0.07
F 0.45
P 0.83
R 0.60
F 0.43
P 0.91
R 0.58
F 0.67
P 0.94
R 0.78
F 0.01
P 0.74
R 0.02
F 0.08
P 0.95
R 0.15
F 0.06
P 0.95
R 0.11
F 0.05
P 1.00
R 0.10
F 0.01
P 0.54
R 0.01
F 0.07
P 0.81
R 0.12F 0.05P 0.72
R 0.09F 0.05P 0.94R 0.10F
(a) Table Q
(b) Table T
0.14 0.44 0.21 0.47 0.83
0.60 0.50 0.85 0.63pairs
0.69 0.90 0.78 0.01 0.42 0.02
0.08 0.82 0.15 0.12
0.93 0.21 0.10 0.95pairs
0.19 0.01 0.32 0.02 0.07 0.72 0.13 0.08 0.68 0.15 0.10 0.90 0.18
ID
Synonym
ID
Synonym
0.04 0.40
0.44 0.42
0.07 0.52
0.45 0.80
0.83 0.63
0.60 0.51
0.43 0.82
0.91 0.63
0.58 0.72
0.67 0.85
0.94 0.78
0.78 0.06
0.010.32
0.540.08
0.010.19
0.070.71
0.810.30
0.120.18
0.050.65
0.720.28
0.090.33
0.050.87
0.940.48
0.10
0.01 0.42
0.74 0.11
0.02 0.22
0.080.82
0.950.35
0.150.26
0.060.93
0.950.41
0.110.36
0.050.93
1.000.52
0.100.05
0.44
0.14 0.35
0.44 0.45
0.21 0.76
0.47 0.74
0.83 0.75
0.60 0.55
0.50 0.70
0.85 0.62
0.63 0.87
0.69 0.84
0.90 0.85
0.78 0.21
0.010.32
0.320.21
0.020.33
0.070.62
0.720.43
0.130.52
0.080.64
0.680.58
0.150.69
0.100.83
0.900.75
0.18
0.01 0.42
0.42 0.28
0.02 0.39
0.080.73
0.820.51
0.150.61
0.120.80
0.930.70
0.210.79
0.100.91
0.950.85
0.190.16
0.63
0.44
0.40
0.42
0.52
0.80
0.63
0.51
0.82
0.63
0.72
0.85
0.78
0.05
0.32
0.08
0.19
0.71
0.30
0.18
0.65
0.28
0.33
0.87
0.48
0.06
0.42
0.11
0.22
0.82
0.35
0.26
0.93
0.41
0.36
0.93
0.52
0.71
0.42
0.53
0.61
0.73
0.66
0.84
0.75
0.79
0.96
0.89
0.92
0.82
0.21
0.33
0.80
0.71
0.75
0.68
0.60
0.64
0.89
0.80
0.84
0.54
0.32
0.40
0.48
0.57
0.52
0.69
0.61
0.65
0.83
0.82
0.82
WH Wireless Health
Conference
Conf
r1
r2
0.63 0.35 0.45 0.76 0.74 0.75 0.55 0.70 0.62 0.87 0.84 0.85 0.21 0.42 0.28 0.39 0.73 0.51 0.61 0.80 0.70 0.79 0.91 0.85 0.16 0.32 0.21 0.33 0.62 0.43 0.52 0.64 0.58 0.69 0.83 0.75
0.61 0.73
0.66 0.84 0.75
0.79 0.96Kingdom
0.89 0.92 0.54 0.32 0.40 0.48 0.57 0.52 0.69 0.61 0.65 0.83 0.82 0.82
0.82
0.80 0.71 0.75
0.68 0.60 0.64 0.89 0.80 0.84 0.71 0.42 0.53
International
UK
United
r3 0.21 0.33Intl
r4measures
Figure 9: Quality of similarity
(P: precision,
R: recall, F: F-measure).
CONF
dataset WH
USPS Conference
dataset
SPROT dataset
Wireless
Health
Conf
r5
r6
of Washington
1705 NE
s1=“P93214”
0.14 s =“Proceedings
of the VLDB
Endowment 2012: 38th
Jacc...
Jacc 0.00 s1=“University
Jacc 0.00 SPROT
...
...
...
CONF
dataset
USPS
dataset
dataset protein 9”
Pacific
St
Seattle,
WA
98195”
s2=“14339_SOLL,14-3-3
International Conference on Very Large Databases, Turkey”
0.60
0.50
0.70
0.60
0.80
0.70
0.90
0.80
0.90
1
JaccT
Jacc
Full
JaccT
SE
Full
SE
0.57
0.14
0.93
0.57
0.93
0.93
0.93
of Washington 1705 NE Jacc
JaccT
0.70 s1=“University
JaccT0.00
0.75 s1=“P93214”
s2=“UW”
2012ofTurkey”
s21=“PVLDB
=“Proceedings
the VLDB Endowment 2012: 38th
Jacc 0.00
r1:P93214->14339_SOLL protein 9”
St Seattle, ofWA
98195”
r1:UW->University
Washington
International Conference on Very Large Databases, Turkey”
Full 0.70
0.92 Pacific
Full 0.75
0.83 s2=“14339_SOLL,14-3-3
r2:P93214->14-3-3 protein 9
rs12:PVLDB->International
Conference
on
Very
Large
Databases
JaccT
JaccT
s
2
=“UW”
r2:UW->1705
NE
Pacific
St
Seattle,
WA
98195
=“PVLDB 2012 Turkey”
r3:14339_SOLL->UPI0000124DEC
r2:PVLDB->Proceedings of the VLDB Endowment
SE 1.00 r1:UW->University
r3:UW->University of
of Washington
Waterloo
SE 1.00 r1:P93214->14339_SOLL
Full 0.83 r2:P93214->14-3-3 protein 9
r1:PVLDB->International Conference on Very Large Databases Full 0.92 r2:UW->1705 NE Pacific St Seattle, WA 98195
(d) Exampleofstring
records
and their applicable synonyms
with their similarities
on three datasets
the VLDB
Endowment
r2:PVLDB->Proceedings
SE 1.00 along
r3:UW->University
of Waterloo measured by four
SE measures
1.00 r3:14339_SOLL->UPI0000124DEC
(c)
Transformation
pairsof similarity measures.
Figure
10: Three examples tosynonym
illustrate the quality
(d) Example string records and their applicable synonyms along with their similarities measured by four measures on three datasets
# of Matches
# of Matches
of Matches
# of #Matches
target table TT as follows: (1) TQ consists of 100 manually selected
Efficiency of Measures. Having verified the effectiveness of dif6)
USPShas 200 recordsCONF
SPROT ferent measures, Filtering
Power(%)
Candidate
Size(x10
full names,JaccT
and (2) TT Ours
100 of themJaccT
are the
in the sequel
we assess their
efficiency.
Figure
JaccTwhere Ours
Ours
JaccT
SN-Join
SI-Join
JaccT
SN-Join
SI-Join
USPS
CONF
Filtering
Candidate
Size(x10
correct abbreviations
of the corresponding
records
in TQAvg/max
(i.e. SPROT
the Avg/max
12(a) shows the time
cost ofPower(%)
the four measures
on SPROT
dataset6)
Avg/max
Avg/max
Avg/max
Avg/max
8
JaccT
Ours
JaccT
Ours
JaccT
Ours
ground truth),
and theAvg/max
other 100 “dirty”
records
are selected
such Avg/max
(10K records) by JaccT
runningSN-Join
10 times
of string
similarity
measureSI-Join
JaccT
SN-Join
SI-Join
Avg/max
Avg/max
Avg/max
Avg/max
0.95 1.415/5
1.137/3
1.263/3
1.138/3
2.363/11
1.432/7
0.95 86.1
88.8
99.0
695.4
561.3
49.8
that each of them is a similar record (in terms of Jaccard coefficient)
ment based on the nested-loop self-join. The x-axis denotes the
0.90
1.527/6
1.447/6
1.346/4
1.223/4
2.654/11
2.322/9
0.90
85.7
86.8
98.9
714.3
661.3
56.9
1.415/5 1.137/3
1.138/3 2.363/11 1.432/7
86.1rules88.8
561.3and49.8
to0.95
the corresponding
records in T1.263/3
number of0.95
synonym
applied 99.0
on one 695.4
single string,
the
Q . This is to ensure that there is
0.85
2.078/8
2.201/7
1.739/4
1.634/5
3.532/12
2.948/11
0.85
86.0
98.9
777.6
701.9
57.4
0.90
1.527/6
1.447/6
1.223/4
2.654/11
2.322/9
0.90total84.5
85.7
98.9
714.3
661.3
56.9
only
one correct
matching
record 1.346/4
in TT for each
record in
TQ .
y-axis is the
running86.8
time. We
find that
although
the full0.80
3.780/9
2.349/7
1.974/5
1.885/5
3.796/12
3.508/13
0.80
82.5
84.7
98.7
764.2
63.0
0.85 of
2.078/8
2.201/7
1.739/4
1.634/5
3.532/12
2.948/11
86.0
98.9set 874.0
777.6rules,
701.9
57.4
expansion0.85
needs 84.5
to expand
the string
using
its perforQuality
Measures.
In Figure2.303/6
9, we report
the quality
of the 4.105/13
0.75
3.913/9
2.806/8
2.208/6
4.556/14
0.75
82.3
83.7
98.7
885.8
817.6
0.80
3.780/9
2.349/7
1.974/5
1.885/5
3.796/12
3.508/13
0.80
82.5
84.7
98.7
874.0
764.2
63.0
mance is comparable to that of Jaccard, which does not use66.7
any
measures
by
testing
the
Precision
(short
for
“P”),
Recall
(“R”),
and
0.70
4.456/10
2.741/6
2.682/6
5.070/14
4.969/16
0.70
81.9
82.4
98.4
904.0
878.7
78.5
2×P ×R3.568/10
0.75
3.913/9
2.806/8
2.303/6
2.208/6
4.556/14
4.105/13
0.75
82.3
83.7
98.7
885.8
817.6
66.7
rules
at
all.
This
is
because,
the
full-expansion
adds
all
applied
F-measure= P +R (“F”) on three datasets. We observe that:
rules directly
and
computing
cost98.4
increases
slowly with
the num(c)
Number
Signatures
Per String
(d) its
Filtering
power
on the904.0
USPS
dataset
4.456/10
3.568/10
2.741/6
2.682/6
5.070/14
0.70
81.9
82.4
878.7
78.5
• 0.70
The similarity
measures
usingofsynonyms
(including
JaccT,
Full 4.969/16
ber
of
rules.
In
contrast,
JaccT
needs
to
enumerate
all
the
possible
and SE) obtain higher
scores
than
Jaccard
which
does
not
consider
(c) Number of Signatures Per String
(d) Filtering power on the USPS dataset
intermediate strings and its performance deteriorates rapidly with
Similarity Algorithm
Complexity
Effectiveness
Comment
80
synonym
pairs. TheComputational
reason is that
without using
synonyms,
Jaccard
Jacc
JaccT
Full
Seletive
Jacc
O(L)
Low
Does
not
consider
synonym
pairs.
the increase of rules. Finally, SE is in the “middle” of the two ex1
has no chance
70 to improve the similarity.
If
only
single
token
synonym
pairs
are
allowed
(implying
k
=
1),
an
optimal
algorithm
80
P
R
F
P
R
F
P
R
F
P
R
F
2
Jacc
Full strings toSeletive
JaccT [1]
tremes,
which avoidsJaccT
enumerating all the possible
achieve
O(2n(L+kn))
• SE significantly
outperforms
JaccT in eachHigh
dataset.
For exam60
matching can be used with a complexity of O(L2 + Ln2).
70
31based on maximum bipartite
better
selects
R performance
F
P thanRJaccT,Fand which
P
R
Fonly appropriate
P
R
F
Full Expansion
and may P
have
good
quality.
ple,
on SPROT
dataset, theO(L+kn)
F-measures of SEHigh
and JaccT
are 0.82
50 (ours)
42Most efficient
60
0.70
0.04
0.04
0.04
0.52
0.98
0.63 0.51
0.97
1.00of0.63
rules
to achieve
better
effectiveness
full-expansion
in terms
SE
is efficient
and has
good
quality,
it is optimal
under
the conditions
of than
Theorem
3, 0.64 0.49
3
2
and
0.52
respectively.
The
main
reason
is
that:
an
abbreviation
5
40 (ours)
SE Algorithm
O(L+n +kn)
High
50
the complexity
can beF-measures
O(L(illustrated
+ 0.64
n log n).0.96
0.75
0.04
0.72
0.57
in Figure
4and
may have30
various full expressions and the join records6-10
may
contain
0.70 0.04
0.04reduced
0.04to0.04
0.04
0.52 0.98
0.639).
0.510.97
0.970.69
0.640.53
0.491.00
1.000.66
0.63
5
0.80
0.04
0.04
0.04
0.74
0.88
0.78
0.72
0.93
0.78
0.73
0.98
0.80
40
>10
the combination
of
multiple
expressions.
Therefore
SE
can
apply
0.75 0.045.2.2
0.04 0.04
0.64
0.96
0.72
0.57
0.97
0.69
0.53
1.00
0.66
20
6-10
Efficiency0.87
and Scalability
of Join Algorithms
30
multiple rules,
while JaccT can apply only one. Note >10
that such0.85
sit- 0.04
0.80
0.04 0.04
0.04 0.04
0.04 0.83
0.74 0.88 0.84
0.780.86
0.720.87
0.930.86
0.780.80
0.730.90
0.980.87
0.80
10
0.90
0.04
0.04
0.04
0.80
0.82
0.84
0.84
0.87
20 rare in the real world data, as rone
The
second
experiment
is0.81
to test0.84
the efficiency
and0.86
scalability
of0.86
VLDB
International
Conference
on
Very
Large
Databases
7
uation is not
fragment
of
a
string
0.85
0.04
0.04
0.04
0.83
0.87
0.84
0.86
0.87
0.86
0.80
0.90
0.87
0
0.95
0.04
0.04
0.04
0.80
0.81
0.80
0.78
0.83
compared
our algorithms,
SN-Join,
10
likely involves
multiple
synonym
rules.Full
We
one example
PVLDB
of algorithms.
the
VLDB
Endowment
r8illustrate
0.90Proceedings
0.04various
0.04 join
0.04
0.80
0.82We
0.81
0.840.78
0.840.78
0.840.82
0.86
0.870.82
0.86
Jacc
JaccT
Select
ive
and SI-Join with the algorithm in [2] (also termed JaccT). Our imon each of 0three datasets in Figure 10 to compare the performance
0.95
0.04
0.04
0.04
0.80
0.81
0.80
0.78
0.78
0.78
0.82
0.83
0.82
(a)
Rank
Distribution
(b) Result
Evaluation
(P: includes
Precision,
R: optimizations
Recall, F: F-measure)
plementation
of JaccT
all the
proposed in
Full
Select ive
of four similarity Jacc
measures. JaccT
[2].
For
our
algorithms,
since
they
can
work
with
the
two
similar(a) Rank Distribution
(b) Result Evaluation (P: Precision, R: Recall, F: F-measure)
ity measures (i.e., full expansion, selective expansion), we denote
Dataset
1000
3000
5000
7000
9000
them as SN(F), SN(S), and SI(F) and SI(S) respectively. We impleUSPS
70.0% 71.2% 72.8% 72.2% 70.8%
mented all algorithms using both prefix and LSH filters. Therefore,
CONF
62.4% 67.6% 68.9% 71.2% 70.4%
with respect to the algorithms using LSH scheme, we append “-L”
SPROT
84.4% 86.7% 88.4% 89.8% 87.3%
to the name, e.g., JaccT-L denotes the JaccT algorithm using LSH
Figure 11: Empirical Probabilities of Optimal Cases for SE
scheme. Note that, we use the false negative ratio δ ≤ 5%, i.e, the
accuracy is 1 − δ ≥ 95%. The parameters k and l should satisfy:
Optimality Scenarios of SE measure As described in Theorem 1,
δ ≥ (1 − θk )l [31]. For example, if threshold θ = 0.8, k = 3, then
SE is optimal if the rhs tokens of useful rules are distinct. The purl should be at least 5 to guarantee at least 95% accuracy.
pose of this experiment is to verify how likely this optimal conMetrics.
We took the following measures: (i) the size of sigdition is satisfied in practice. We performed experiments on three
natures,
(ii)
the filtering ratio of the algorithms, which is typically
datasets using different data sizes. The results are shown in Figure
defined as the number of pruned string pairs divided by the total
11. The average percentages of optimal cases are 71.39%, 68.11%,
number of string pairs; and (iii) the running time (including filterand 87.32% in USPS, CONF and SPROT data, respectively. Thereing time and verification time).
fore, the condition in Theorem 1 are likely to be met by the majority
of candidate string pairs, which means that the values returned by
Number of Signatures. We first performed experiments to report
the number of signatures of a query string. It has a major impact
the SE algorithm are optimal in most cases. Further, the percentage
on the query time, as both JaccT and our join algorithms need to
on SPROT data is larger than those on USPS and CONF. This is
frequently check the overlaps of string signatures. The results are
because many of the synonym pairs on SPROT are single-token-toreported in Figure 13. For both prefix and LSH schemes, the numsingle-token synonyms, which are more likely to meet the distinct
ber of signatures of our expansion-based algorithms is smaller than
condition. In contrast, many of the synonyms of the USPS and
that of JaccT. The reason is that based on a transformation frameCONF are multi-token synonyms.
200 4.21 66.5
63.5
72.4 4.54 73.9
100 4.23 89.4
87.9
83.4 3.89 83.1
0 3.73 86.8
81.1
82.6 10
3.29 2082.530
4.32
4.77
4.56
4.22
3.88
40
3.69
76.3
79.1
99.2
87.8
92.3
50
94.1
(a) # of synonyms (SPROT)
3.61 600
78.8
3.55 83.1
3.52 300
89.4
2.78 84.7
2.43 092.310
2.19 90.4
4.22
4.38
4.13
3.69
3.77
20
3.59
80.1
85.6
89.9
89.5
93.7
30
92.9
4.19 84.5
4.28 90.7
3.85 97.1
4.04 95.5
3.27
99.8
40 50
3.73 96.2
(b) # of synonyms (SPROT)
JaccT-L
3.71
3.33
3.42
2.94
3.06
2.73
0.95
0.90
0.85
0.80
0.75
0.70
USPS (k=4)
SPROT(k=6)
4000
3000 JaccT
Ours
JaccT
Ours
JaccT
Ours
Ours
l JaccT 3000
l
Avg/max Avg/max Avg/max Avg/max
Avg/max Avg/max
Avg/max Avg/max
2000
1.415/5 1.137/3 8.363/17 7.432/16 2 3.384/8 2000
3.254/8 3 14.87/46 12.16/44
1000 8.654/18 8.322/16 3 4.576/19 1000
1.527/6 1.447/6
4.499/18 4 19.84/58 19.44/57
2.078/8 2.201/70 9.532/18 8.953/17 5 8.952/21 8.911/20
7 34.72/72 34.58/71
0
(K)
400 9.512/19
600 800 61000
200 10
40049.62/103
600 800
1000(K)
3.780/9 2.349/7 200
10.79/19
11.14/29
11.09/27
48.19/100
(b) SPROT
dataset (k=11,l=34)
(a) USPS
dataset
(k=9,l=21)
3.913/9 2.806/8
11.55/19
10.10/21
8 15.52/36 14.15/35
16 69.36/192
68.73/185
4.456/10 3.568/10 13.07/20 12.97/23 11 20.09/51 19.21/51 24 75.04/297 74.92/294
USPS
SPROT
(a)
Number
of Signatures
JaccT
SN-Join
SI-Join Per String
SN-Join
SI-Join
k/LNumber
Figure 13:
of signatures
(PrefixJaccT
filtering
scheme
and
FR(%) FN(%) FR(%) FN(%) FR(%) FN(%) FR(%) FN(%)FR(%) FN(%) FR(%) FN(%)
LSH scheme).
USPS
(100K
self-join)
SPROT(1000K
self-join)
k=2,l=2 63.5 4.21 66.5 4.32 76.3 3.61 78.8 4.22 80.1 4.19 84.5 3.71
72.4
k=3,l=3
Filtering
Ratio(%)
4.54
73.9 Size(x10
4.77 679.1
Candidate
)
4.23 89.4 4.56 99.2
3.384/8
4.576/19
8.952/21
11.14/29
15.52/36
20.09/51
SN(F)-L
4000
86.1 88.810 99.0
85.73000
86.8
98.9
84.5 86.0
98.9
82.5 84.75 98.7
2000
82.3 83.70 98.7
81.91000
82.4
298.4
3
14.87/46
19.84/58
7
34.72/72
10
49.62/103
0
4
6
8 10 (K)
16 2 69.36/192
24(a) CONF
75.04/297
dataset (k=3,l=3)
54
SPROT(1000K self-join)
SI(F)-L Candidate
SN(S)-L
SI(S)-L
Filtering Ratio(%)
Size(x109)
6000
JaccT SN-Join SI-Join JaccT SN-Join SI-Join JaccT SN-Join SI-Join
5000
695.4
714.3
777.6
874.0
885.8
4904.06
561.3
661.3
701.9
764.2
817.6
878.7
8
49.8
56.9
57.4
63.0
66.7
78.5
10
91.8 93.5
90.74000
91.7
88.5 90.0
86.63000
89.3
85.22000
87.5
78.1 80.3
99.5
98.6
95.5
92.1
90.3
89.7
41.3
46.5
57.5
67.7
74.0
109.5
32.7
41.9
50.2
53.4
62.4
98.8
2.6
7.1
22.1
39.5
48.9
51.3
k/L
k=2,l=3
k=4,l=6
k=6,l=10
k=8,l=17
k=10,l=27
1000
0
(b)
SPROT
dataset (k=11,l=34)
(a) USPS dataset
(k=9,l=21)
(b) Filtering ratio on the USPS
dataset
0
(a) CONF dataset (k=3,l=3
600 800ratio
1000 (K)
200 400
600 800 1000(K)
Figure200
14:400
Filtering
with varying
thresholds.
6000
Locality 5000
sensitive hashing (LSH) scheme
Running time (Sec)
Running time (Sec)
Prefix4000
filtering scheme
USPS
SPROT
5000
15 SI-Join
JaccT SN-Join
0.95
0.90
0.85
0.80
0.75
0.70
filter ratio
(FR)
andPerformance
False Negative with
(FN) with
varying synonyms.
parameters k and l. (accuracy=0.95,
Figure
12:
varying
SI(F)-L
SN(S)-L
JaccT-L
SI(S)-L
threshold=0.9)
SN(F)-L
k/L
5000
2
3
5
6
8
11
USPS (100K self-join)
20
JaccT-L
SN(F)-L
Filtering
Ratio(%)
Candidate
Size(x106)
SI(S)
1200
USPS
SPROT
JaccT
SN-Join
SI-Join
JaccT
SN-Join
SI-Join
900
300
FR(%) FN(%) FR(%) FN(%) FR(%) FN(%) FR(%) FN(%)FR(%) FN(%) FR(%) FN(%)
400
0.95
0.90
0.85
0.80
0.75
0.70
Running time (Sec)
SI(F)
7.432/16
21
8.322/16
30
8.953/17
48
9.512/19
59
10.10/21
50 76
12.97/23
(b) # of synonyms
(SPROT)
(a) Number
of Signatures Per String
OT)
Running time (Sec)
SN(S)
SE
1.137/3 8.363/17
83.2
124
1.447/61138.654/18
95.7
158
2.201/71629.532/18
99.1
207
2.349/721310.79/19
99.5
335
2.806/837411.55/19
20 30
40463
3.568/10
99.8
49413.07/20
Running
Running t
RunningRunning
time (Seti
Running ti
4000
100
3000
3000
0
2
4
6
8 10
200 400 600 800 1000
2000
2000 (b) USPS dataset (K)
(a) CONF dataset (K)
1000
1000
0
0
200 400 600 800 1000 (K) Jacc 200 JaccT
400 600 800 1000(K)
JaccT
Jacc
(b) SPROT datasetSN(F)
(k=11,l=34)
(a) USPS
dataset (k=9,l=21)
Full
600
0.95 1.415/5
k=2,l=3 0.90
79.6 1.527/6
81.3
300
k=4,l=6 0.85
86.1
88.6
2.078/8
k=6,l=10 0.80
88.6 3.780/9
89.9
k=8,l=17 0.75
91.7 03.913/9
93.0
10
50
0.70
k=10,l=27
92.34.456/10
95.7
5000
200
Running time (Sec)
k=2,l=2
k=3,l=3
k=4,l=3
k=5,l=4
k=6,l=4
k=7,l=5
15
10
5
0
Running time (Sec)
k/L
4000
Running time (Sec)
Running time (Se
5000
9)
3.55
83.1
4.38 85.6
4.28 90.7
3.33
Filtering
Ratio(%)
Candidate
Size(x10
3.52 89.4 4.13 89.9 3.85 97.1 3.42
Filtering Power(%)
Filtering Time (Sec)
USPS
SPROT
SN-Join
SI-Join
JaccT
SN-Join
SI-Join
FR(%) FN(%) FR(%) FN(%) FR(%) FN(%) FR(%) FN(%)FR(%) FN(%) FR(%) FN(%)
k=2,l=3 79.6 81.3 83.2 113
124
21
CONF
k=4,l=6k=2,l=2
86.1 88.6
63.5 95.7
4.21 162
66.5 158
4.32 30
76.3 3.61 78.8 4.22 80.1 4.19 84.5 3.71
k=6,l=10k=3,l=3
88.6 89.9
213
207
JaccT
72.4 99.1
4.54
73.9Ours
4.77 48
79.1 3.55 83.1 4.38 85.6 4.28 90.7 3.33
k=8,l=17 91.7 93.0 99.5
374 Avg/max
335
Avg/max
87.9 99.8
4.23 494
89.4 463
4.56 59
99.2 3.52 89.4 4.13 89.9 3.85 97.1 3.42
k=4,l=3
k=10,l=27
92.3 95.7
76
JaccT SN-Join SI-Join
JaccT
k/L
JaccT SN-Join SI-Join
3.89
4.22 87.8 2.78 84.7 3.69 89.5 4.04 95.5 2.94
1.263/383.11.138/3
3.73
3.88 92.3 2.43 92.3 3.77 93.7 3.27 99.8 3.06
1.346/486.81.223/4
3.29
3.69 94.1 2.19 90.4 3.59 92.9 3.73 96.2 2.73
1.739/482.51.634/5
1.974/5 1.885/5
2.303/6 2.208/6
Figureratio
15: Filter
(FR)
and(FN)
false
negative
(FN) with
2.741/6
2.682/6
filter
(FR)
andratio
False
Negative
with
varying parameters
k andvaryl. (accuracy=0.95,
k=5,l=4
k=6,l=4
k=7,l=5
83.4
81.1
82.6
threshold=0.9) threshold=0.9).
ing parameters k and l. (accuracy=95%,
Prefix filtering scheme
USPS
SPROT
JaccT
Ours
JaccT
Ours
Locality sensitive hashing (LSH) scheme
USPS (k=9)
SPROT(k=11)
Para
Signatures
Para
Signatures
Avg/max
Avg/max Avg/max
Avg/maxas LSHl is anAvg/max
l
Avg/max
different
application
scenarios,
approximate
solution,
0.95
1.415/5 must
1.137/3return
2.363/11
1.432/7
0.95
4
78.84/180
4
1662.32/4576
k=6,l=4 81.1 3.73 86.8 3.88 92.3 2.43 92.3 3.77 93.7 3.27 99.8 3.06
but
prefix
all
correct
answers,
making
somewhat
un0.90 1.527/6 1.447/6 2.654/11 2.322/9 0.90 7
137.97/315
8
3324.64/9152
0.95 86.1JaccT
88.8 is 99.0
695.4
561.33.69
93.5
99.5 92.9
41.33.73
32.7
work,
more
to49.8
include
into
signatures
82.6
3.29likely
82.5
94.1 91.8
2.19new
90.4tokens
3.59
96.2 2.6
2.73
k=7,l=5
0.85
2.078/8
2.201/7
3.532/12 2.948/11
0.85 them.
12 236.52/540
7064.86/19448
0.90 85.7 86.8
98.9 714.3 661.3 56.9
90.7 91.7 98.6 46.5
41.9
7.1
fair for
a direct
comparison
between
We observe17that
LSH is
than
ours.
In ratio
addition,
we
observe
the90.0
size
of signatures
in22.1
0.85 84.5
86.0
98.9 (FR)
777.6
701.9
57.4 that
88.5
95.5
57.5
50.2
0.80 3.780/9 2.349/7 3.796/12 3.508/13 0.80 21 413.91/945
34 14129.72/38896
filter
and
False
Negative
(FN)
with
varying
parameters
k and
l.the
(accuracy=0.95,
faster3.913/9
than prefix
the4.105/13
threshold
This
0.80 82.5 84.7
98.7 874.0 764.2 63.0
86.6
89.3 92.1 67.7
53.4
39.5
0.75
2.806/8 when
4.556/14
0.75 is
39 approximately
768.7/1755
70≥ 0.8.
29090.6/80080
threshold=0.9)
LSH
scheme
greater
scheme,
in0.75 82.3
83.7 is98.7
885.8 than
817.6 that
66.7 in prefix
85.2 87.5
90.3 and
74.0 the
62.4gap48.9
0.70
4.456/10
3.568/10
5.070/14
4.969/16
0.70
73
1438.83/3285
151
62752.58/172744
is
mainly
because
more
than
one
minhash
signatures
is
combined
0.70 81.9 when
82.4 the
98.4threshold
904.0 878.7decreases.
78.5
78.1 As80.3
89.7 see
109.5 98.8
51.3
creases
wesensitive
will
this
Prefix filtering scheme
Locality
hashingshortly,
(LSH) scheme
Number
of Signatures
Per String
(i.e, k ≥ 2) in (a)
these
settings,
which makes
LSH method quite seUSPS (k=4)
SPROT(k=6)
USPS
SPROT
CONF
results in substantial
overhead
of
filtering
time
for
the
LSH
scheme.
Ours
JaccT
Ours
JaccT
Ours
JaccT
Ours
JaccT when
Ours the threshold is less than 0.8, we tried all
lective.
However,
l JaccT
l
USPS
(100KAvg/max
self-join)
SPROT
Avg/max Avg/max Avg/max Avg/max
Avg/max Avg/max
Avg/max Avg/max
Avg/max
Filtering
ratio on the
the filtering
USPS dataset
Filtering Power. We(b)
then
investigated
power of differ6)
reasonable
combinations
k and
l Filtering
valuesPower(%)
and stillCandidate
cannotSize(x10
find 9)a
Filtering Power(%)
Size(x10
0.95 1.415/5 1.137/3 8.363/17 7.432/16 2 3.384/8 3.254/8 3 14.87/46 12.16/44
1.263/3 Candidate
1.138/3 of
ent algorithms
in Figures
14–158.322/16
using3 prefix
LSH4 schemes,
re0.90 1.527/6
1.447/6 8.654/18
4.576/19and
4.499/18
19.84/58 19.44/57
1.346/4 JaccT
1.223/4
JaccT to
SN-Join
SN-Join
JaccT
SN-Jointhe
SI-Join
JaccT SN-Join
SI-Join
setting
beatSI-Join
prefix.
This
isSI-Join
because
when
threshold
is small,
0.85 2.078/8 2.201/7 9.532/18 8.953/17 5 8.952/21 8.911/20 7 34.72/72 34.58/71
1.739/4
1.634/5
spectively.
The experiments
were
performed
USPS
and SPROT
0.80 3.780/9
2.349/7 10.79/19
9.512/19
6 11.14/29 on
11.09/27
10 49.62/103
48.19/100
1.974/5
1.885/5
0.95
86.1 88.8
99.0 695.4
561.3
49.8 power,
91.8 we
93.5 need
99.5 to41.3
32.7
2.6
in
order
to
improve
the
filtering
select
a
large
3.913/9 2.806/8 11.55/19 10.10/21 8 15.52/36 14.15/35 16 69.36/192 68.73/185
2.208/6661.3 56.9
0.90 85.7 86.8 2.303/6
98.9 714.3
90.7 91.7 98.6 46.5
41.9
7.1
datasets0.75
for4.456/10
self-join.
For13.07/20
prefix-based
SN-Join
is slightly
0.70
3.568/10
12.97/23 11algorithms,
20.09/51 19.21/51
24 75.04/297
74.92/294
2.741/6
2.682/6 a large l to maintain the same confidence
k, which
in turn
0.85
84.5 86.0
98.9requires
777.6 701.9
57.4
88.5 90.0 95.5 57.5
50.2
22.1
better than JaccT, while
SI-Joinof has
a substantial
0.80
82.5
84.7
98.7
874.0
764.2in substantial
63.0
86.6 overhead
89.3 92.1 in67.7
53.4 time.
39.5
(a) Number
Signatures
Per Stringlead over JaccT
level
(≥
95%);
this
results
filtering
0.75 82.3 83.7
98.7 885.8 817.6 66.7
85.2 87.5 90.3 74.0
62.4
48.9
and SN-Join. These results are mainly because of the additional
0.70
81.9 82.4LSH
98.4trades
904.0 the
878.7
78.5
78.1 for
80.3 higher
89.7 filtering
109.5 98.8power.
51.3
Therefore,
filtering
time
USPS (100K self-join)
SPROT(1000K self-join)
filtering power
inRatio(%)
SI-JoinCandidate
bought
by ) length
filtering
signature
Filtering
Size(x10
Filtering
Ratio(%) and
Candidate
Size(x10 )
Then the overall running time of LSH is greater than that of preJaccT SN-Join
SI-Join JaccT SN-Join
SI-Join JaccT Figure
SN-Join SI-Join
JaccT SN-Join SI-Join
filtering. Next,
for LSH-based
algorithms,
15 demonstrates
fix in such case. This
also the reason why the best running time
(b) Filtering ratio on the USPS dataset
0.95 86.1
88.8 That
99.0 is,
695.4SI-Join
561.3 49.8
91.8winner
93.5 99.5
41.3
32.7 by
2.6
the similar
trend.
is
the
in
all
settings
achieved for threshold < 0.8 is with small k and l values to reduce
0.90 85.7 86.8
98.9 714.3 661.3 56.9
90.7 91.7 98.6 46.5
41.9
7.1
86.0
98.9 k
777.6
57.4
88.5 90.0
95.5prefix
57.5 scheme,
50.2
22.1
varying0.85the84.5parameters
and701.9
l. Compared
to the
the overall running time in Figure 18.
0.80 82.5 84.7
98.7 874.0 764.2 63.0
86.6 89.3 92.1 67.7
53.4
39.5
0.75 a82.3
83.7
98.7 885.8filtering
817.6 66.7ratio
85.2when
87.5 parameters
90.3 74.0
62.4
48.9
LSH has
slightly
higher
are set
0.70 81.9 82.4
98.4 904.0 878.7 78.5
78.1 80.3 89.7 109.5 98.8
51.3
optimally (e.g. 99.2% v.s. 98.9% in USPS data, with threshold
5.2.3 Estimation-based signature selection
0.9). On the other hand, note that LSH may filter away correct anThe last set of experiments (Figure 19) studies the quality of
(b) Filtering ratio
on the
swers, resulting in false negatives,
as can
beUSPS
seendataset
from the false
2DHS synopsis on prefix and LSH schemes. For prefix filter, we
negative percentages shown in Figure 15.
use the four frequency-based signature filters, i.e., ITF1∼ITF4, which
are described in Section 4.2. For LSH scheme, we vary the paramRunning Time and Scalability. Figure 16 and Figure 17 show
eters k and l to generate six filters. The off-line processing time for
the running time of five join algorithms based on prefix and LSH
the generation of all signatures of one filter is around 200s ∼ 300s,
schemes respectively (the join threshold is 0.8). The x-axis reprewhich is a one-time cost.
sents the join data size. As shown, the running times of both JaccT
We first computed the upper bound (UB) and the lower bound
and SN-Join (i.e., SN(F), SN(S)) have an exponential growth, whereas
(LB) of each filter by Lemma 1. One filter is returned as the best
SI-Join (i.e., SI(F), SI(S)) scales better (i.e., linear) than SN-Join
one if its UB is less than all the LBs of other filters, which means
and JaccT. The reason is that SI-Join methods have more efficient
that it returns the smallest number of candidates. For example, in
filtering strategies and generate smaller size of candidates. In addiFigure 19(a), ITF2 is the best one in CONF dataset. If there is
tion, in order to study the scalability of algorithms with the increase
no filter which can beat all other filters using only UB and LB,
of the number of rules, we plot Figure 12(b), where SI(F) (i.e. SIthen we further compute the tighter upper bound (TUB) and lower
Join with the full-expansion) is a clear winner in algorithms using
bound (TLB) using 2DHS synopsis. Consequently, in Figure 19(b),
rules: it is insensitive to the number of rules and thus able to outperITF1 is the best filters in USPS dataset , as its TUB is less than all
form other methods when one string involves more than 10 rules.
others’ TLBs. Note that, our estimate correctly predicts the ability
Prefix vs. LSH We then sought to analyze system performances
of four filters, which can be empirically verified in Figure 5 for
with different signature schemes, namely, prefix vs. LSH. Figprefix scheme. In addition, the estimate cost accounts for only 1%
ure 18 plots the running times of SI(S)-L (LSH) and SI(S) (Preof the join time. For example, in USPS dataset, the time cost of
fix) with varying join thresholds. Results on SI(F) and SN-join
our estimator is only 0.81s while the filtering time is 83.7s and the
are similar. Note that, for SI(S)-L, the choice of k and l has subtotal running time is 113.4s.
stantial impact on the query time, hence we tuned and used the
Summary Finally, we summarize our findings.
parameters of k and l for each threshold to achieve the best running
(1) Considering four similarity measures. we observe that Jactime. Before we describe the results, it should be noted that the
card is inadequate due to its complete neglect of synonym rules.
two schemes, LSH and prefix, have different goals and probably
k=4,l=3
87.9
JaccT SN-Join
SI-Join
JaccT4.04
SN-Join
83.4 JaccT
3.89 SN-Join
83.1 4.22
87.8 JaccT
2.78 SN-Join
84.7 SI-Join
3.69 89.5
95.5 SI-Join
2.94
k=5,l=4 SI-Join
6
9
F
Ja
7
8
8
9
9
k
0.95 4
0.90 4
0.85 2
0.80
k l2
0.75 1
0.950.70
4
21
200
0.90 4
0.85 2
0.80 (a)
2
0.75 1
0.70 1
200 400 600 800 1000
(b) USPS dataset (K)
OT)
T)
400
300
200
100
300
SI(F)-L
JaccT-L
5000
SI(S)-L
4000
SN(S)-L
SN(F)-L
k/L
98
SI(S)-L
SI(S)
100
99
97
0.95 0.9 0.85 0.8 0.75 0.7
(b) USPS dataset
20
0
SI(F)-L
SI(S)-L
UB
UB
400
400
200
100
100
200
200
LB
LB
100
80
60
40
20
80
60
50
UB
TUB
20
0.75 0.7
L1:k=2,l=2 L4
L2:k=3,l=3 L5
40
L4:k=5,l=4L3:k=4,l=3 L6
L1:k=2,l=2
L2:k=3,l=3 L5:k=6,l=4
L3:k=4,l=3 30
L6:k=7,l=5
40
UB
30
40
0.95 0.9
(b) USPS dataset
98
TUB
20
10
TLB
LB
0
20 UB
TUB
10
TLB
LB
0
TLB
LB
200 400 600 800 1000 (K) 0 200 400 600 800 10000(K)
0 SPROT dataset (k=11,l=34)
0
0
0
(b)
(a)
USPS
dataset
(k=9,l=21)
ITF1
ITF2
ITF3
L1
L2
L4
ITF4
L3
L5L1L6 L2 L3 L4
ITF1
ITF2
ITF3
ITF4
ITF1 ITF2 ITF3 ITF4
200 400 600 800 1000 (K)
ITF1 ITF2 ITF3 ITF4
(b) USPS dataset (k=4,l=3)
(b) USPS dataset(Prefix) (c) USPS dataset (LSH)
(a) CONF dataset(Prefix)
SI(F)-L
Filtering6000
Power(%)
Running time (Sec)
Running time (Sec)
SN(S)-L
6000
5000
100
99
settings40
600
300
200
0
L
(b) USPS dataset
(c) USPS dataset
(a) CONF dataset
Filtering Time (Sec)
k/L JaccT
Figure
19: Estimation results of 2DHS synopsis for prefix and
SN-Join SI-Join JaccT SN-Join SI-Join
USPS
SPROT
LSH schemes
(threshold=0.90).
JaccT
SN-Join
SI-Join
JaccT
SN-Join
SI-Join
k/L
Figure 17: Performance on LSH signature scheme.
L
SN(S)-LI
600
800
1000
500
800
600
400
candidates
##ofofcandidates
(K) (K)
Running time (Sec)
Running time (Sec)
50
Running time (Sec)
1200
20
900
15
600
10
300
5
0
0 10 20 30 40 50
4
6 (SPROT)
8 10 (K)
(b) # of2 synonyms
(a) CONF dataset (k=3,l=3)
60
SI(S)-L
SI(S)
Figure 18: Prefix scheme0.95vs.0.9LSH
thresh0.85 0.8scheme
0.75 0.7 (with varying
97
0.95 0.9 0.85 0.8
(a) k,l settings
(b) USPS dataset
(b) USPS dataset
old1000
values).
100
50
(b) USPS dataset (K)
Figure 16: Running time in CONF and USPS datasets.
JaccT
SN(F)
SN(S)
Full
JaccT-LI
SN(F)-LI
SI(S)
SI(F)
SI(F)-L 400 SI(S)-L
SN(S)-L
JaccT-L
SN(F)-L
(S)
3
3
k,l33
3
2
3
3
3 120
3 100
3 80
120
100
80
60
40
SI(S)-L
SI(S)
20
0
# of candidates (10^10)
100
0
l
Running time(Sec)
SI(S)
500
400
300
# of candidates (10^10)
2
4
6
8 10
200 400 600 800 (a)
1000
CONF dataset (K)
Running time (Sec)
Running time (Sec)
K)
SI(F)
SN(S)
# of candidates (10^10)
100
0
SN(F)
Running time(Sec)
200
JaccT
SI(S)
# of candidates (10^10)
Running time (Sec)
Running time (Sec)
10
500
400
300
30
25
20
15
10
5
0
Running time (Sec)
SI(F)
SN(S)
Filtering Power(%)
SI(S)-L
Filtering Time (Sec)
5000SI-Join JaccT SN-Join SI-Join
JaccT SN-Join
Running time (Sec)
79.6 81.3 83.2
113
124
21
FP(%) ER(%) FP(%) ER(%)USPS
FP(%) ER(%) FP(%) ER(%) FP(%) ER(%) FP(%) ER(%)
SPROT
JaccT is not efficient because it enumerates
all transformed stringsk=2,l=3
k=4,l=6 86.1 88.6 95.7 162
158
30
4000
k=3,l=5
71.1 213
4.83
75.5
76.3 3.61 79.8
4.22 80.2 JaccT
4.19 84.6 3.71
3000
JaccT
SI-Join
SN-Join
SI-Join
k/L
k=6,l=10 88.6 89.9
99.1
207 4.82SN-Join
48
k=2,l=3
79.6
81.3
83.2
113
124
21
k=5,l=899.5
77.5 374
4.74 76.9
4.77 59
79.1 3.55 84.2 4.38 85.6 4.28 92.7 3.33
and entails large query overhead.
Full-expansion
extremely
ef-k=8,l=17 91.7 93.0
3000 95.7 162 is158
335
FP(%)
ER(%)
FP(%)
ER(%)
FP(%)
ER(%)
FP(%)
ER(%)
FP(%)
k=4,l=6
86.1
88.6
30
k=7,l=13 81.7 4.23 84.3 4.56 87.3 3.52 86.4 4.13 89.9 3.85 95.1 3.42 ER(%) FP(%) ER(%)
4000
2000
k=10,l=27 92.3
99.8
494
463
[7] 95.7
M.
Bilenko
J. Mooney.
learnable
k=6,l=10
88.6as89.9
213
207
48
k=9,l=21
84.1 and
3.89 R.
89.4
4.21 76
99.3 Adaptive
2.78 91.7 duplicate
3.69 90.0 detection
4.04 98.5 using
2.94
ficiently,
but
its
F-measure
is
not
good
as
Selective-expansion,
2000 99.1
k=3,l=578.2 71.1
75.594.34.82
4.19 84.6 3.71
k=11,l=34
3.75 4.83
86.1 3.88
2.43 76.3
92.3 3.61
3.77 93.779.8
3.27 4.22
99.8 80.2
3.06
3000
k=8,l=17 91.7 93.0 99.5 374
335
59
string
similarity
In 4.77
KDD,
pages
39–48,
2003.
1000
k=13,l=53
75.3 77.5
3.22 measures.
82.5
2.19 79.1
90.4 3.55
3.59
92.984.2
3.73 4.38
96.2 85.6
2.73
k=5,l=8
4.74 3.69
76.994.1
4.28 92.7 3.33
1000
k=10,l=27
92.3 95.7
99.8 494
463
76
a good balance
between
effectiveness
and efficiency.
2000 which makes
k=7,l=13
81.7
4.23
84.3
4.56
87.3
3.52 kand
86.4
4.13
89.9
3.85 95.1 3.42
0
[8]
A.
Z.
Broder,
S.
C.
Glassman,
M.
S.
Manasse,
G.
Zweig.
Syntactic
0
Filtering
power
(FP)
and
error
ratio
(ER)
with
varying
parameters
and
l.
(accuracy=0.95,
threshold=0.8)
(K)
(K) is
200
400
600
800
1000
200
400
600
800
1000
k=9,l=21
84.1
3.89
89.4
4.21
99.3
2.78
91.7
3.69
90.0
4.04 98.5 2.94
(2)
With
respect
to
three
similarity
join
algorithms,
SI-join
1000
clustering
the
web.
Computer
29(8-13):1157–1166,
(b) SPROT dataset (k=11,l=34)
(a) USPS dataset (k=9,l=21)
USPS
(100K self-join)
k=11,l=34 of
78.2
3.75
86.1 3.88Networks,
94.3 2.43
92.3SPROT
3.77 93.7 1997.
3.27 99.8 3.06
0 the winner in all settings
Filtering
Power(%)
Candidate
Size(x10 94.1
) andFiltering
Power(%)
Candidate
Size(x10
)
over JaccT and SN-join, by using SI-tree
75.3
82.5V. 3.69
2.19
90.4 Robust
3.59
92.9
3.73
96.2
2.73
[9] S.k=13,l=53
Chaudhuri,
K. 3.22
Ganjam,
Ganti,
R.
Motwani.
and
efficient
fuzzy
0 1000 (K)
200 400 600 800 1000(K)
JaccT
SN-Join SI-Join
JaccT SN-Join In
SI-Join
JaccT SN-Join
SI-Join JaccTpages
SN-Join313–324,
SI-Join
USPS
SPROT
(b) SPROT
dataset
(k=11,l=34)
21)
match
for
online
data
cleaning.
SIGMOD
Conference,
2003.
to efficiently
perform
length
filtering
and
signature
filtering.
We
Filtering
power
(FP)
and
error
ratio
(ER)
with
varying
parameters
k
and
l.
(accuracy=0.95,
threshold=0
JaccT
SN-Join
SI-Join
JaccT
SN-Join
SI-Join
k/L
0.95 86.1 88.8
99.0 695.4 561.3 49.8
91.8 93.5 99.5 41.3
32.7
2.6
[10] S. Chaudhuri,
and661.3
R. Kaushik.
A primitive
for similarity
0.90 85.7 86.8V. Ganti,
98.9 714.3
56.9
90.7
91.7 98.6 operator
46.5
41.9
7.1
ER(%) FP(%)
ER(%)
ER(%) FP(%)
FP(%)
ER(%)sets
FP(%)over
ER(%) the
achieve
aFP(%)
speedup
of up
toFP(%)
50∼300x
onER(%)
three
data
SPROT
(100K
SPROT
0.85in84.5
86.0
98.9USPS
777.6
701.9self-join)
57.4 5,
88.5
90.0 95.5 57.5
50.2
22.1
joins
data
cleaning.
In
ICDE,
page
2006.
SI-Join
JaccT state-of-the-art
SN-Join 71.1 SI-Join
0.80 82.5 Filtering
84.7
98.7
874.0 764.2Candidate
63.0
86.6
89.3) 92.1Filtering
67.7 Power(%)
53.4
39.5
4.83
75.5 4.82 76.3 3.61 79.8 4.22 80.2 4.19 84.6 3.71
k=3,l=5
Power(%)
Size(x10
Candidate Size(x10 )
approach.
0.75 82.3 83.7
885.8 817.6Extending
66.7
85.2autocompletion
87.5 90.3 74.0 to 62.4
48.9 errors. In
[11] S. Chaudhuri
and 98.7
R. Kaushik.
tolerate
77.5FR(%)
4.74FN(%)
76.9 4.77 79.1 3.55 84.2 4.38 85.6 4.28 92.7 3.33
k=5,l=8FN(%)
R(%) FN(%) FR(%) FN(%)FR(%)
0.70 81.9JaccT
82.4SN-Join
98.4 904.0
878.7
78.5
78.1 SI-Join
80.3 89.7
109.5
98.8SI-Join
51.3 JaccT SN-Join SI-Join
SI-Join
JaccT
SN-Join
JaccT
SN-Join
81.7
4.23
84.3
4.56
87.3
3.52
86.4
4.13
89.9
3.85
95.1
3.42
k=7,l=13
(3)
Finally,84.52DHS
synopsis offers very good results, and is exSIGMOD Conference, pages 707–718, 2009.
76.3 3.61 78.8 4.22 k=9,l=21
80.1 4.19
3.71
84.1 3.89 89.4 4.21 99.3 2.78 91.7 3.69 90.0 4.04 98.5 2.94
0.95
86.1 P.88.8
99.0Filtering
695.4
561.3
49.8
93.5 99.5of string
41.3
32.7
2.6
(b)
ratio
the
USPS 91.8
dataset
79.1 3.55 83.1 4.38
85.6 4.28
3.33
[12] W. W.
Cohen,
D. Ravikumar,
and
S. on
E. Fienberg.
A comparison
78.2 90.7
3.75to
86.1
3.88 94.3
92.3 power
3.77 93.7of 3.27
99.8 3.06
tremely
efficient
estimate
the 2.43
filtering
different
filters,
k=11,l=34
0.90 85.7 86.8
98.9 714.3 661.3 56.9
90.7 91.7 98.6 46.5
41.9
7.1
99.2 3.52 89.4 4.13 k=13,l=53
89.9 3.85
75.3 97.1
3.22 3.42
82.5 3.69 94.1 2.19 90.4 3.59 92.9 3.73 96.2 2.73
distance
for name-matching
tasks. In
IIWeb,
pages
73–78,
0.85 metrics
84.5 86.0
98.9 777.6 701.9
57.4
88.5
90.0
95.52003.
57.5
50.2
22.1
( <89.5
1 second
, accounting
for only 1% of the total running time),
87.8 2.78 84.7 3.69
4.04 95.5
2.94
0.80 82.5
84.7 suffix
98.7tree
874.0
764.2 63.0
86.6 alphabets.
89.3 92.1
67.7
53.4
39.5
Filtering
and error
Optimal
construction
with large
In FOCS,
92.3 2.43 92.3 3.77
93.7power
3.27(FP)
99.8
3.06 ratio (ER) with varying parameters k and l. (accuracy=0.95, threshold=0.8)[13] M. Farach.
which
its application in practice.
0.75
82.3 83.7
85.2 87.5 90.3 74.0
62.4
48.9
94.1 2.19 90.4 3.59
92.9 strongly
3.73 96.2motivates
2.73
pages
137–143,
1997. 98.7 885.8 817.6 66.7
6
9
6
0.70 81.9
6
n SI-Join
49.8
56.9
57.4
63.0
66.7
78.5
9
9
JaccT SN-Join SI-Join JaccT SN-Join SI-Join
91.8
90.7
88.5
86.6
85.2
78.1
0.80 82.5 84.7
98.7 874.0 764.2 63.0
86.6 89.3 92.1 67.7
53.4
39.5
98.7 2.6
885.8 817.6 66.7
85.2 87.5 90.3 74.0
62.4
48.9
93.5 0.75
99.582.341.383.7 32.7
98.4 7.1
904.0 878.7 78.5
78.1 80.3 89.7 109.5 98.8
51.3
91.7 0.70
98.681.946.582.4 41.9
90.0 95.5 57.5
50.2
22.1
89.3 This
92.1 research
67.7
53.4is partially
39.5
supported by 973 Program of China (Project
87.5 90.3 74.0
62.4
48.9
2012CB316205),
NSF
China (No.
National High-tech
80.3No.89.7
109.5 98.8
51.3
(b) Filtering
ratio60903056),
on the USPS863
dataset
7.
ACKNOWLEDGEMENTS
Research Plan of China (No. 2012AA011001), Beijing Natural Science
Foundation (No. 4112030), and New Century Excellent Talents in University. Wei
Wang is supported by ARC Discovery Projects DP130103401 and
ing ratio on the USPS
dataset
DP130103405.
8.[1] N.REFERENCES
Alon, Y. Matias, and M. Szegedy. The space complexity of approximating
the frequency moments. In STOC, pages 20–29, 1996.
[2] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for
record matching. In ICDE, pages 40–49, 2008.
[3] A. Arasu, S. Chaudhuri, and R. Kaushik. Learning string transformations from
examples. PVLDB, 2(1):514–525, 2009.
[4] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In
VLDB, pages 918–929, 2006.
[5] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan.
Counting distinct elements in a data stream. In RANDOM, pages 1–10, 2002.
[6] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In
WWW, pages 131–140, 2007.
78.5
78.1
80.3
89.7
109.5
98.8
Running time (Sec)
Running time (Sec)
6
98.4 904.0 878.7
Running time (Sec)
In this
paper,
we JaccT
studiedOurs
the problem
of
strings similarity
meaPara
Signatures Para
Signatures
JaccT
Ours
l
Avg/max
l We proposed
Avg/max
sures and joins with semantic rules (i.e.
synonyms).
CONF 4
0.95expansion-based
1.415/5 1.137/3 2.363/11
1.432/7 to0.95
4
78.84/180
1662.32/4576
two
methods
measure
the
of strings
Signatures
Signatures
urs
JaccTsimilarity
Ours
0.90 1.527/6
1.447/6
2.654/11 2.322/9 0.90 7
137.97/315
8
3324.64/9152
l
l
g/max
Avg/max
Avg/max
Avg/max
Avg/max
0.85a 2.078/8
2.201/7
3.532/12
2.948/11
0.85 SI-tree
12 236.52/540
17 the
7064.86/19448
and
novel
index
structure
called
to
perform
similar0.80 3.780/9
2.349/7
3.796/12 3.508/13 0.80 21 413.91/945
34 14129.72/38896
32/16 0.95 2
3.384/8
3
14.87/46
1.263/3 1.138/3
0.75
3.913/9
4.556/14 4.105/13
39 768.7/1755
70
29090.6/80080
ity
join.
We2.806/8
also
proposed
an 0.75
estimator
to select
signatures
on22/16 0.90 3
4.576/19
4
19.84/58
1.346/4
1.223/4
0.70 4.456/10 3.568/10 5.070/14 4.969/16 0.70 73 1438.83/3285 151 62752.58/172744
53/17 0.85 5
8.952/21
7
34.72/72
1.739/4 in
1.634/5
line to optimize
the efficiency of signature filters
join algorithms.
12/19 0.80 6
11.14/29
10
49.62/103
1.974/5
1.885/5
(a) Number of Signatures Per String
10/21 0.75 8
15.52/36
16
69.36/192
2.303/6 2.208/6
The estimator
provides provably high-confidence
and low-error es97/23 0.70 11
20.09/51
24
75.04/297
2.741/6 2.682/6
USPS
(100K the
self-join)
timates to compare
filtering power of filtersSPROT
with the logarithSignatures Per String Filtering Power(%) Candidate Size(x10 ) Filtering Power(%) Candidate Size(x10 )
mic space
and time complexity. The extensive experiments demonJaccT SN-Join SI-Join JaccT SN-Join SI-Join JaccT SN-Join SI-Join JaccT SN-Join SI-Join
SPROT
strated
the
advantages
of our proposed
approach over the
state-of0.95 86.1 88.8
99.0 695.4 561.3 49.8
91.8 93.5 99.5 41.3
32.7
2.6
ze(x10 )
Filtering Power(%)
Candidate
)
0.90 85.7methods
86.8 Size(x10
98.9
714.3
661.3
56.9
90.7 91.7 98.6 46.5
41.9
7.1
the-art
in three
data sets.
0.85 84.5 86.0
98.9 777.6 701.9 57.4
88.5 90.0 95.5 57.5
50.2
22.1
Locality sensitiveAvg/max
hashing (LSH)
scheme
Avg/max
Avg/max Avg/max
USPS (k=4)
SPROT(k=6)
82.4
51.3
[14] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base
applications. J. Comput. Syst. Sci.,
31(2):182–209,
1985.
(b) Filtering
ratio on
the USPS
dataset SN(S)-L
SI(F)-L
SI(F)
JaccT-L
SN(F)-L
JaccT
SN(F)CONF SN(S)
SI(S)
[15] S. Ganguly,
M. N.
Garofalakis, and R. Rastogi. Tracking
set-expression
JaccT
Ours
20
400
30
500
Avg/max
Avg/max
cardinalities
over
continuous update streams. VLDB J., 13(4):354–369, 2004.
25
400
15
300
1.263/3 1.138/3
X. Yu, N. Koudas, and D. Srivastava. Hashed samples:
20[16] M. Hadjieleftheriou,
300
1.346/4
1.223/4for set similarity selection queries.
10
200
15
selectivity
estimators
PVLDB,
1(1):201–212,
200
1.739/4 1.634/5
10
100
2008. 1.974/5 1.885/5
5
100
5
2.303/6
and S.2.208/6
Tamaki.
Improved upper bounds for
0 3-sat. In SODA, 2004.
0
0
0[17] K. Iwama
2
4
6
8 10 (K)
200 400 600
2
4
6 2.741/6
8 10 2.682/6 200 400 600 800 1000
[18]
G. Kondrak.
N-gram similarity
distance.
pages 115–126, 2005. (b) USPS dataset
(a) CONF
(a)
CONF
datasetSN(F)
(K)
(b)SN(S)
USPSand
dataset
(K)
SI(F)In SPIRE,
JaccT-L
SN(F)-L
JaccT
SI(S) dataset (k=3,l=3)
[19] H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set20
similarity
30
500
join size. PVLDB, 2(1):658–669, 2009.
25
400 Similarity join size estimation using
15 locality
[20] H. Lee, R. T. Ng, and K. Shim.
20
sensitive hashing. PVLDB,300
4(6):338–349, 2011.
15 [21] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms 10
for
200
10
approximate string searches. In ICDE, pages 257–266, 2008.
5
100
5 [22] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for
0
0
0
2
4
6
8 10 (
2 similarity
4
6 joins.
8 In10VLDB, 2012.200 400 600 800 1000
(a) CONF dataset (k=3,l=3)
[23]
D. R. H.
Miller,(K)
T. Leek, and R.(b)
M. USPS
Schwartz.
A hidden
(a) CONF
dataset
dataset
(K) markov model
information retrieval system. In SIGIR, pages 214–221, 1999.
[24] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity
query processing with the asymmetric signature scheme. In SIGMOD
Conference, pages 1033–1044, 2011.
[25] G. Salton and C. Buckley. Term-weighting approaches in automatic text
retrieval. Inf. Process. Manage., 24(5):513–523, 1988.
[26] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In
SIGMOD Conference, pages 743–754, 2004.
[27] Y. Tsuruoka, J. McNaught, J. Tsujii, and S. Ananiadou. Learning string
similarity measures for gene/protein name dictionary look-up using logistic
regression. Bioinformatics, 23(20):2768–2774, 2007.
[28] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive
framework for similarity join and search. In SIGMOD Conference, pages
85–96, 2012.
[29] W. E. Winkler. The state of record linkage and current research problems.
Technical report, Statistical Research Division, U.S. Census Bureau, 1999.
[30] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity
joins with edit distance constraints. PVLDB, 1(1):933–944, 2008.
[31] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for
near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011.
Running time (Sec)
Locality sensitive hashing (LSH) scheme
USPS (k=9)
SPROT(k=11)
Running time (Sec)
CONCLUSIONS
Running time (Sec)
6.
Running time (Sec)
ative (FN) with varying parameters k and l. (accuracy=0.95,
Prefix filtering scheme
threshold=0.9)
USPS
SPROT
9
© Copyright 2026 Paperzz