String Similarity Measures and Joins with Synonyms

Jiaheng Lu†   Chunbin Lin†   Wei Wang‡   Chen Li#   Haiyong Wang†

† School of Information and DEKE, MOE, Renmin University of China, China
{jiahenglu,chunbinlin,why}@ruc.edu.cn
‡ School of Computer Science and Engineering, University of New South Wales, Australia
[email protected]
# School of Information and Computer Sciences, University of California, Irvine, USA
[email protected]
ABSTRACT
A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., the number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of this paper is to explore such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching.

In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures is, while expressive, computationally expensive (NP-hard), we propose an efficient algorithm, called selective expansion, which guarantees optimality in many real scenarios. We then study a novel indexing structure, called the SI-tree, which combines both signature and length filtering strategies, for efficient string similarity joins with synonyms. We develop an estimator that approximates the size of the candidate set to enable an online selection of signature filters, further improving efficiency. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, making our method attractive both in theory and in practice. Finally, the results of an empirical study of the algorithms verify the effectiveness and efficiency of our approach.
1. INTRODUCTION
Approximate string matching is used heavily in areas such as
data integration [2, 4], data deduplication [9], and data mining [10].
Most existing work that computes the similarity of two strings only
considers syntactic similarities, e.g., number of common words
or q-grams. Although syntactic similarities are indeed indicative,
there are many important cases where syntactically different strings
can represent the same real-world object. For example, “Bill” is
a short form of “William”, “Big Apple” is a synonym of “New
York”, and “Database Management Systems” can be abbreviated
as “DBMS”. Such equivalence information can help us identify semantically similar strings that may have been missed by syntactical
similarity-based approaches.
In this paper, we study two problems related to approximate string matching with synonyms¹. First, how can we define an effective similarity measure between two strings based on synonyms? Traditional similarity functions cannot be easily extended to handle synonyms in similarity matching. Second, how can we efficiently perform a similarity join based on the newly defined similarity functions? That is, given two collections of strings, how can we efficiently find similar string pairs from the collections? The two problems are closely related, because the similarity measures defined in the first problem naturally provide a join predicate for the second.
We give several applications of approximate string measures and joins with synonyms to shed light on the importance of these two problems. In gene/protein databases, one of the major obstacles that hinder effective use is term variation [27], including Roman-Arabic variation (e.g., "Synapsin 3" and "Synapsin III"), acronym variation (e.g., "IL-2" and "interleukin-2"), and term abbreviation (e.g., "Ah receptor" and "Ah dioxin receptor"). A new similarity measure that can handle term variation may help users discover more genes/proteins of interest. New measures are also applicable in information extraction systems to aid the mapping from text strings to canonical entries by similarity matching. More applications can be found in areas such as term classification and term clustering, which can be used for advanced information retrieval/extraction applications. Approximate string joins are typically useful in data integration and data cleaning. For example, in two representative data cleaning domains, namely addresses and academic publications, there often exist rich sources of synonyms, which can be leveraged to improve the effectiveness of systems significantly.
For the first problem of similarity measures, there is a wealth
of research on string-similarity functions on the syntactical level,
such as Levenshtein distance [29], Hamming distance [17], Episode
distance [11], Cosine metric [25], Jaccard Coefficient [9, 20], and
Dice similarity [6]. Unfortunately, none of them can be easily extended to leverage synonyms to improve the quality of similarity
measurement. One exception is the JaccT technique proposed by Arasu et al. [2], which generates all possible strings by applying synonym rewriting to both strings, and returns the maximum similarity among the transformed pairs of strings, as illustrated by the following example.

¹For brevity, we use "synonym" to describe any kind of equivalent pair, which may include synonyms, acronyms, abbreviations, variations, and other equivalent expressions.

Figure 1: An example to illustrate similarity matching with synonyms. (a) Table Q: q1 = "Intl WH Conf 2012", q2 = "VLDB", ... (b) Table T: t1 = "Very Large Scale Integration", t2 = "Intl Wireless Health Conference 2012 UK", ... (c) Transformation synonym pairs: r1: WH → Wireless Health; r2: Conference → Conf; r3: Intl → International; r4: UK → United Kingdom; r5: Wireless Health → WH; r6: Conf → Conference; r7: VLDB → International Conference on Very Large Databases; r8: PVLDB → Proceedings of the VLDB Endowment.

EXAMPLE 1. Figure 1 shows an example of two tables Q and T and a set of synonyms. Consider two strings q1 and t2. Their token-based Jaccard similarity is only 2/8. Now consider applying synonym pairs to them. Obviously, synonyms r1, r3 and r6 can be applied to q1, and r2, r3, r4 and r5 can be applied to t2. We use A_R(s) to denote the string generated by applying a set of synonym pairs R to the source string s. For example, A_{r1,r3}(q1) is "International Wireless Health Conf 2012". [2] considers all the possible strings resulting from applying a subset of the applicable synonyms to them, and then returns the maximum Jaccard similarity. In the above example, it needs to consider 2^3 · 2^4 = 128 possible string pairs. The maximum Jaccard value is between A_{r1,r3,r6}(q1) and A_{r3}(t2), which is 5/6.

The main limitation of the above approach is that its string similarity definition has to generate all possible strings and compare the similarity of each pair in the cross product. In general, given two strings s and t, assuming there are n1 (resp. n2) synonym pairs applicable to s (resp. t), the above approach needs to compute the similarities of 2^{n1+n2} pairs. Such a brute-force approach takes time exponential in the number of applicable synonym pairs. A special case of the problem for single-token synonyms is identified in [2], where the problem can be reduced to a maximum bipartite graph matching problem. Unfortunately, in practice, many synonyms contain multiple tokens, and even in the special case, the method still has a worst-case complexity that is quadratic in both the number of string tokens and the number of applicable synonyms. Since such a similarity measure has to be called for every candidate pair when employed in similarity joins, it introduces substantial overhead.

In this paper we propose a novel expansion-based framework to measure the similarity of strings. Intuitively, given a string s, we first obtain its token set S, and then grow the token set by applying the applicable synonyms of s. When a synonym pair lhs → rhs is applied, we merge all the tokens in rhs into the token set, i.e., S = S ∪ rhs. The similarity between two strings is then the similarity between their expanded token sets. We have two key ideas here: (i) we can avoid the exponential computational complexity by expanding strings using all the applicable synonym pairs; and (ii) it costs less to perform an expanding operation than a transforming operation, as illustrated below.

EXAMPLE 2. (Continuing the previous example.) We apply all the applicable synonym pairs to q1 and t2 to expand their token sets, and then evaluate the Jaccard similarity between the two expanded token sets. For instance, the fully expanded set of t2 by applying synonym pairs r2, r3, r4, and r5 is {International, Intl, WH, Conf, 2012, Wireless, Health, Conference, United, Kingdom, UK}. The Jaccard similarity between the two fully expanded token sets is 8/11, which is also greater than the original similarity 2/8 without synonyms.

We note that the full expansion-based similarity illustrated in the above example is efficient to compute, and is also highly competitive in terms of effectiveness in modeling the semantic similarity between strings. Nonetheless, using all the synonym pairs to perform a full expansion may hurt the final similarities. Consider Example 2 again, where the use of rule r4 on t2 is detrimental to the similarity between q1 and t2, as none of the expanded tokens appears in q1 or its expanded token set. Therefore, we propose a selective expansion measure, which selects only "good" synonym pairs to expand both strings (Section 3); accordingly, the similarity is defined as the maximum similarity between two appropriately expanded token sets of the two strings. Unfortunately, selective expansion is proved to be NP-hard by a reduction from the 3SAT problem. To provide an efficient solution, we propose a greedy algorithm, called the SE algorithm, which is optimal if the rhs tokens of the useful synonym pairs satisfy the distinct property (the precise definition will be given in Section 3.2). We empirically show that this condition is likely to hold in practice: the percentage of cases in which such optimality holds in our experiments with three real-world data sets ranges between 70.0% and 89.8%. Figure 2 summarizes the results of the two new similarity algorithms, contrasted with the results of existing methods [2].

Given the newly proposed similarity measures, it is imperative to study an efficient similarity join algorithm. In this paper, we first show a signature-based join algorithm, called SN-Join, following the filter-and-verification framework (Section 4). It generates a set of tokens as signatures for each string using synonyms, finds candidate pairs whose signatures overlap, and finally verifies each candidate against the threshold using the proposed similarity measure. In addition, we further propose SI-Join, which speeds up the performance by using a native index, the SI-tree, to reduce the candidate size with integrated signature and length filtering. Our join algorithms are also general enough to work with signatures generated by either prefix filtering or locality sensitive hashing (LSH).

The methods for selecting the signatures of strings may significantly affect the performance of the algorithms [20, 24]. Therefore, we propose, for the first time (to the best of our knowledge), an online algorithm to dynamically estimate the filtering power of a signature filter. In particular, we generate multiple signature filters using different strategies offline. Then, given two tables to join online, we quickly estimate the performance of the different filters and select the best one to perform the actual filtering. Our technical contribution here is a novel, hash-based synopsis data structure, termed the "Two-Dimensional Hash Sketch (2DHS)", which extends the Flajolet-Martin sketch [13] to two join tables. We present novel estimation algorithms that provide high-confidence estimates to compare the filtering power of two filters. Our method carries both theoretical and practical significance. In theory, we can prove that, with any given high probability 1−δ, the 2DHS returns correct estimates of the upper and lower bounds, and its space and time complexities are logarithmic in the cardinality of the input. In practice, the experiments show that our estimation technique accurately predicts the quality of the different filters on all three data sets, and its running time is negligible compared to that of the actual filtering phase (accounting for only 1% of the total join time).

Finally, we perform a comprehensive set of experiments on three data sets to demonstrate the superior efficiency and effectiveness of our algorithms (Section 5). The results show that our algorithms are up to two orders of magnitude more efficient than the existing methods while offering comparable or better effectiveness.

Figure 2: Comparison of string similarity algorithms with synonyms. (Assume both strings have a length of L; the maximum number of applicable synonym pairs is n; k is the maximum size of the right-hand side of a rule.)

  Jacc                  | O(L)          | Low  | Does not consider synonym pairs.
  JaccT [2]             | O(2^n (L+kn)) | High | If only single-token synonym pairs are allowed (implying k = 1), an optimal algorithm based on maximum bipartite matching can be used, with a complexity of O(L^2 + Ln^2).
  Full Expansion (ours) | O(L+kn)       | High | Most efficient, and may have good quality.
  SE Algorithm (ours)   | O(L+kn^2)     | High | Efficient with good quality; optimal under the condition of Theorem 1, in which case the complexity can be reduced to O(L + n log n + kn).
2. RELATED WORK

String similarity measures: There is a rich set of string similarity measures, including character n-gram similarity [17], Levenshtein distance [15], the Jaro-Winkler measure [31], Jaccard similarity [9], tf-idf based cosine similarity [25], and a Hidden Markov Model-based measure [22]. A comparison of many string similarity functions is performed in [11]. These metrics can only measure the syntactic similarity (or distance) between two strings; they cannot capture semantic similarities such as synonyms, acronyms, and abbreviations.

Machine learning-based methods [7, 27] have been proposed to improve the quality of traditional syntactic similarity-based approaches. In particular, learnable string similarity [7] presents trainable measures to improve the effectiveness of duplicate detection, and [27] uses logistic regression to learn a string similarity measure from an existing gene/protein name dictionary.

String similarity join: The majority of the state-of-the-art approaches [33, 24, 21, 29] for performing string similarity joins follow the filtering-and-verification framework. The basic idea is to first employ an efficient filter to prune string pairs that cannot be similar, and then verify the remaining string pairs by computing their real similarity. Prefix filtering [9] is a popular filtering scheme: each string is transformed into a set of tokens, the tokens of each string are sorted based on a global ordering, and a prefix set of each string is selected, based on a given similarity threshold, to serve as its signatures. Other efficient approaches are based on enumeration [4, 30, 21] or tries [28, 32].

LSH-based schemes are widely used to process queries approximately, meaning that certain correct answers may be missed. For Jaccard similarity, one can generate l super-signatures, each of which is a concatenation of k minhashes [8] of the set s under a minwise independent permutation; only string pairs sharing at least one common super-signature are deemed candidates. k and l are two parameters that can be tuned to trade off efficiency against accuracy [33].
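To make the scheme concrete, here is a small Python sketch of our own; the seeded-MD5 "permutation" is an illustrative simplification, not the minwise independent construction used in [8, 33].

import hashlib

def minhash(tokens, seed):
    # smallest hash over the set under the permutation induced by `seed`
    return min(hashlib.md5(f"{seed}:{t}".encode()).hexdigest()
               for t in tokens)

def super_signatures(tokens, k=2, l=3):
    # l super-signatures, each the concatenation of k minhashes;
    # two sets are candidates iff they share at least one
    return {"|".join(minhash(tokens, i * k + j) for j in range(k))
            for i in range(l)}

s = set("Very Large Data Bases Conf".split())
t = set("VLDB Conf".split())
print(bool(super_signatures(s) & super_signatures(t)))  # candidate pair?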
Synonym generation approaches: Synonym pairs can be obtained in many ways, such as users' existing dictionaries [27], explicit inputs, and synonym mining algorithms [3, 23]. In this paper, we assume that all synonym pairs are predefined.

Estimation of set size: Our work is also related to the estimation of the number of distinct pairs in a set union. Many solutions have been proposed for estimating the cardinality of distinct elements. For example, in their influential paper, Flajolet and Martin [13] proposed a randomized estimator for distinct-element counting that relies on a hash-based synopsis; to this date, the Flajolet-Martin (FM) technique remains one of the most effective approaches to this estimation problem. The FM sketch and its variations (e.g., [1, 14]) focus on one-dimensional data, and we extend it to the two-dimensional case in this work. Finally, the estimation of the result size of a string similarity join has been studied recently, e.g., in [18, 19], using random sampling and LSH schemes. Note that these techniques cannot be directly applied here, since given a specific filter, our goal is to estimate the cardinality of the candidate set, which is often much greater than that of the final results.
3. STRING SIMILARITY MEASURES
As described in Section 1, an expansion-based framework can efficiently measure the similarity of strings with synonyms to avoid
the exponential computational complexity. In this section, we describe two expansion-based measures in detail. In particular, Section 3.1 describes an efficient full-expansion measure, and Section 3.2 shows a selective expansion to achieve better effectiveness.
Finally, Section 3.3 analyzes the optimality of the methods.
3.1 Full Expansion

We first introduce notation. Given two strings s1 and s2, let S1 and S2 denote the sets generated from s1 and s2, respectively, by tokenization (e.g., splitting strings on delimiters such as white spaces). Let r denote a rule, which has the form lhs(r) → rhs(r), where lhs(r) and rhs(r) are the left- and right-hand sides of r, respectively. Slightly abusing the notation, lhs(r) and rhs(r) can also refer to the token sets generated from the respective strings. Let R denote a collection of synonym rules, i.e., R = {r: lhs(r) → rhs(r)}. Given a string s, we say a rule r ∈ R is an applicable rule of s if lhs(r) is a substring of s. We use rhs(r) to expand the set S, that is, S′ = S ∪ rhs(r). Let sim(s1, s2, R) denote the similarity between s1 and s2 based on R, and let R(s1) and R(s2) denote the applicable rule sets of s1 and s2, respectively. Under the expansion-based framework, sim(s1, s2, R) = f(S1′, S2′), where S1′ and S2′ are expanded from S1 and S2 using some synonym rules in R(s1) and R(s2), and f is a set-similarity function such as Jaccard, Cosine, etc.

Based on how the applicable synonyms are selected to expand S, there are different kinds of schemes. A simple one is full expansion, which expands S using all of its applicable synonyms. That is, the expanded set is S′ = S ∪ ⋃_{r∈R(s)} rhs(r).

The full expansion scheme is efficient to compute. Let L be the maximum length of strings s1 and s2. The time cost of the substring matching that identifies the applicable synonym pairs is O(L) using a suffix-trie-based method [12]. The complexity of the set expansion is then O(L + kn), where n is the maximum number of applicable synonym pairs for s1 and s2, and k is the maximum size of the right-hand side of a rule. As we will demonstrate in the experimental section, full expansion is highly competitive with the optimized scheme below in capturing the semantics of strings using synonyms.
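As a concrete illustration, the following Python sketch (ours, not the paper's implementation; rule matching is done naively on token subsequences rather than with the suffix-trie method of [12]) computes the full-expansion Jaccard similarity and reproduces the 8/11 of Example 2.

def tokenize(s):
    return s.split()

def applicable_rules(tokens, rules):
    # a rule lhs -> rhs is applicable if lhs occurs as a
    # contiguous token subsequence of the string
    joined = " " + " ".join(tokens) + " "
    return [(lhs, rhs) for lhs, rhs in rules
            if " " + lhs + " " in joined]

def full_expand(s, rules):
    # expanded set: original tokens plus the rhs tokens of
    # every applicable rule
    tokens = tokenize(s)
    expanded = set(tokens)
    for _, rhs in applicable_rules(tokens, rules):
        expanded |= set(tokenize(rhs))
    return expanded

def full_expansion_jaccard(s1, s2, rules):
    e1, e2 = full_expand(s1, rules), full_expand(s2, rules)
    return len(e1 & e2) / len(e1 | e2)

rules = [("WH", "Wireless Health"), ("Conference", "Conf"),
         ("Intl", "International"), ("UK", "United Kingdom"),
         ("Wireless Health", "WH"), ("Conf", "Conference")]
q1 = "Intl WH Conf 2012"
t2 = "Intl Wireless Health Conference 2012 UK"
print(full_expansion_jaccard(q1, t2, rules))  # 8/11 = 0.727..., as in Example 2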
3.2 Selective Expansion

As mentioned in Section 1, full expansion cannot guarantee that each applied rule is useful for the similarity measure between two strings. Recall Example 2 and Figure 1: the use of rule r4 on t2 is detrimental to the similarity between q1 and t2, as "United Kingdom" does not appear in q1. To address this issue, we propose an optimized scheme, called selective expansion, whose goal is to maximize the similarity value of the two expanded sets by a judicious selection of the applicable synonyms to apply. Given a collection of synonym rules R and two strings s1 and s2, the selective expansion scheme maximizes f(S1′, S2′). Suppose, without loss of generality, that we use the Jaccard coefficient to measure the similarity between S1′ and S2′; the scheme can be extended to other functions. The Selective-Jaccard similarity of s1 and s2 based on R (denoted SJ(s1, s2, R)) is defined as follows:

SJ(s1, s2, R) = max_{R(s1)⊆R, R(s2)⊆R} { Jaccard(S1′, S2′) },

where S1′ and S2′ are the sets expanded from s1 and s2 using R(s1) and R(s2), respectively.

Unfortunately, we can prove that it is NP-hard to compute the Selective-Jaccard similarity with respect to the number of applicable synonym rules, by a reduction from the well-known 3SAT problem [16]. Due to space limitations, we leave the detailed proof to the full version. Since Selective-Jaccard is NP-hard, we design an efficient greedy algorithm that guarantees optimality under certain conditions, which have been shown to be met in most cases in practice. Before presenting the algorithm, we first define the notion of rule gain, which can be used to estimate the fitness of a rule to strings.
Given two strings s1 and s2, a collection of synonyms R, and a rule r ∈ R, we assume that r is an applicable rule of s1. We denote the rule gain of r by RG(r, s1, s2, R); when s1, s2, and R are clear from the context, we simply write RG(r). Informally, the rule gain is the number of useful elements introduced by r divided by the number of new elements in r. In particular, we say that a token t ∈ rhs(r) is useful if t belongs to the full expansion set of s2 and t ∉ S1. The reason we use the full expansion set of s2 rather than its original set is that s2 can be expanded using new rules, but all possible new tokens must be contained in the full expansion set. We now go through Algorithm 1. In line 1, we find the new elements U introduced by r. If U = ∅, then the rule gain is zero (lines 2∼3). Line 4 fully expands s2, and line 5 computes the useful elements G. Finally, the rule gain RG(r) = |G|/|U| is returned (line 6).

Algorithm 1: Rule Gain
Input:  r, s1, s2, R = {ri: lhs(ri) → rhs(ri)}
Output: the rule gain of r
1  U ← rhs(r) − S1;  // r is applicable to s1, and S1 is the token set of s1
2  if (U = ∅) then
3      return 0;
   // let R′ denote all applicable rules of s2
4  S2′ ← S2 ∪ ⋃_{ri∈R′} rhs(ri);  // fully expanding S2
5  G ← (rhs(r) ∩ S2′) − S1;
6  return |G|/|U|;

We develop a "selective expand" algorithm based on rule gains, which is formally described in Algorithm 2. Before doing any expansion, we first compute a candidate set of rules (line 1), which consists only of applicable rules with relatively high rule-gain values. In particular, in Procedure findCandidateRuleSet, we first let Ci contain all applicable rules whose rule gain is greater than zero (line 4), and then we iteratively remove the useless rules whose rule gain is too small from Ci (lines 5∼11). In lines 6∼7, we use the elements in Ci to expand Si to acquire Si′. Then, in lines 9∼11, any rule whose rule gain is less than θ/(1+θ) is removed from Ci, where θ is defined as the Jaccard similarity of the current expanded sets S1′ and S2′ (line 8), since the gain of such a rule is lower than the "average" similarity and its usage would be detrimental to the final similarity. Subsequently, in Procedure expand (called in line 2), we iteratively expand s1 and s2 using the most gain-effective rule from the candidate sets. Note that after a new rule is used to expand a set, the rule gains of the remaining rules need to be recomputed (line 16), since the new elements U (in Algorithm 1) of the remaining rules may change.

Algorithm 2: SE Algorithm
Input:  s1, s2, R = {r: lhs(r) → rhs(r)}
Output: Sim(s1, s2, R)
1  C1, C2 ← findCandidateRuleSet(s1, s2, R);
2  θ ← expand(s1, s2, C1, C2);
3  return max(Jaccard(S1, S2), θ);

Procedure findCandidateRuleSet(s1, s2, R)
4  Ci ← {r: r ∈ R ∧ lhs(r) is a substring of si (i = 1 or 2) ∧ RG(r) > 0};
5  repeat
6      S1′ ← S1 ∪ ⋃_{r∈C1} rhs(r);
7      S2′ ← S2 ∪ ⋃_{r∈C2} rhs(r);
8      θ ← Jaccard(S1′, S2′);
9      for (r ∈ C1 ∪ C2) do
10         if (RG(r) < θ/(1+θ)) then
11             Ci ← Ci − r;  // r ∈ Ci, i = 1 or 2
12 until (no rule can be removed in line 11);
   return C1, C2;

Procedure expand(s1, s2, C1, C2)
   // let S1′ and S2′ denote the expanded sets of S1 and S2
13 S1′ ← S1;
14 S2′ ← S2;
15 while (C1 ∪ C2 ≠ ∅) do
16     find the current most gain-effective rule r ∈ Ci (i = 1 or 2);
17     if (RG(r) > 0) then
18         Si′ ← Si′ ∪ rhs(r);
19     Ci ← Ci − r;
20 return Jaccard(S1′, S2′);

Performance Analysis: The time complexity of the SE algorithm is O(L + kn²), where L is the maximum length of s1 and s2, n is the total number of applicable synonym pairs for s1 and s2, and k is the maximum size of the right-hand side of a rule. More precisely, we use a suffix trie to find the applicable rule sets, whose complexity is O(L). In Procedure findCandidateRuleSet, the complexity of performing a full expansion (lines 6∼7) is O(kn), and the number of iterations is at most n; thus, the overall complexity is O(L + kn²).
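The following Python sketch (ours) mirrors Algorithms 1 and 2 in a simplified form: rules are given directly as rhs token sets, rule matching is assumed to be done already, and gains are recomputed eagerly rather than with a heap. On the data of Example 1 it prunes r4 (gain 0) and returns 8/9, improving on the 8/11 of full expansion.

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def rule_gain(rhs, S, full_other):
    # Algorithm 1: |useful new tokens| / |new tokens|
    U = rhs - S                      # new elements
    if not U:
        return 0.0
    G = (rhs & full_other) - S       # useful elements
    return len(G) / len(U)

def se_similarity(S1, S2, R1, R2):
    # greedy SE; R1/R2 hold the applicable rules of s1/s2,
    # each rule given as the frozenset of its rhs tokens
    full1, full2 = S1.union(*R1), S2.union(*R2)

    def gain(rule, side, E1, E2):
        return rule_gain(set(rule), E1 if side == 1 else E2,
                         full2 if side == 1 else full1)

    # findCandidateRuleSet: keep positive-gain rules, then drop
    # rules whose gain falls below theta / (1 + theta)
    C = {(r, 1) for r in R1 if gain(r, 1, S1, S2) > 0} | \
        {(r, 2) for r in R2 if gain(r, 2, S1, S2) > 0}
    while True:
        E1 = S1.union(*[r for r, i in C if i == 1])
        E2 = S2.union(*[r for r, i in C if i == 2])
        theta = jaccard(E1, E2)
        drop = {(r, i) for r, i in C
                if gain(r, i, S1, S2) < theta / (1 + theta)}
        if not drop:
            break
        C -= drop

    # expand: greedily apply the most gain-effective rule
    E1, E2 = set(S1), set(S2)
    while C:
        r, i = max(C, key=lambda ri: gain(ri[0], ri[1], E1, E2))
        if gain(r, i, E1, E2) > 0:
            (E1 if i == 1 else E2).update(r)
        C.remove((r, i))
    return max(jaccard(S1, S2), jaccard(E1, E2))

S1 = set("Intl WH Conf 2012".split())
S2 = set("Intl Wireless Health Conference 2012 UK".split())
R1 = [frozenset({"Wireless", "Health"}), frozenset({"International"}),
      frozenset({"Conference"})]
R2 = [frozenset({"Conf"}), frozenset({"International"}),
      frozenset({"United", "Kingdom"}), frozenset({"WH"})]
print(se_similarity(S1, S2, R1, R2))  # 8/9 = 0.888..., vs. 8/11 for full expansion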
3.3 Theoretical Analysis

In this subsection, we carve out one case in which SE returns the optimal value, and analyze the reduced time complexity in this optimal case.

THEOREM 1. Given two strings s1 and s2 and their applicable rule sets R(s1) and R(s2), we call a rule r ∈ R(si) (i = 1 or 2) useful if RG(r) > 0. We can guarantee that SE is optimal if the rhs tokens of all useful rules are distinct.

PROOF. Let R* denote the rule set used by the optimal algorithm, and let R denote the set C1 ∪ C2 returned in line 1 of Algorithm 2. We prove that R* = R as follows. Given an applicable rule ri in R, let Gi denote its useful elements, Ui its new elements, and Zi = Ui − Gi. Let θ* and θ denote the expansion similarity using R* and R, respectively. Note that all the rules in both R* and R are useful. When all the rhs tokens are distinct, θ* = (|S1 ∩ S2| + Σi |G*i|) / (|S1 ∪ S2| + Σi |Z*i|). For a rule r* ∈ R*, we have |G*|/|Z*| > θ* ≥ θ, that is, |G*|/|U*| > θ/(1+θ), i.e., RG(r*) > θ/(1+θ), which means that r* will not be removed in Procedure findCandidateRuleSet (line 11), and r* ∈ R. Thus, R* ⊆ R.

We then prove R ⊆ R* by contradiction. Assume that there exists a rule r ∈ R but r ∉ R*; then θ = (|S1 ∩ S2| + Σi |G*i| + |Gr|) / (|S1 ∪ S2| + Σi |Z*i| + |Zr|). According to Algorithm 2, |Gr|/|Zr| > θ. Thus, θ > (|S1 ∩ S2| + Σi |G*i|) / (|S1 ∪ S2| + Σi |Z*i|) = θ*, which contradicts the optimality of θ*, so such an r does not exist. Thus, R ⊆ R*.

Finally, we show that all the candidate rules in C1 ∪ C2 will be used in Procedure expand (called in line 2). If all the rhs tokens are distinct, the rule gain values do not change during the expansion. The rule gain of each rule is greater than zero (line 17), so all the rules will be used by SE in the expanding phase, which concludes the proof.

Note that the condition in Theorem 1 does not require all rhs tokens to be distinct; it only requires those from the useful synonyms of s1 and s2 to be distinct, and these are typically a very small subset. In addition, under this condition, the time complexity can be reduced to O(L + n log n + nk). The reason is that the complexity of performing the full expansion (lines 6∼7) is reduced to O(1), since the tokens are distinct, and the complexity of the expanding phase (lines 15∼19) is O(n log n), since the rule gain of each rule does not change and the rules can therefore be maintained in a max-heap. Finally, the Jaccard computation in line 20 needs O(L + kn) time in the worst case. Therefore, the complexity of the SE algorithm is O(L + n log n + nk) in this case.

Percentage of cases in which the optimality condition holds, for data sizes 1000 to 9000:

  Dataset  1000   3000   5000   7000   9000
  USPS     70.0%  71.2%  72.8%  72.2%  70.8%
  CONF     62.4%  67.6%  68.9%  71.2%  70.4%
  SPROT    84.4%  86.7%  88.4%  89.8%  87.3%
4. STRING SIMILARITY JOINS

Given two collections of strings S and T, a collection of synonyms R, and a similarity threshold θ, a string similarity join finds all string pairs (s, t) ∈ S × T such that sim(s, t, R) ≥ θ, where sim is one of the similarity functions in Section 3.

One join algorithm is a signature-based nested-loop approach, called SN-Join, which follows the filtering-and-verification framework. In the filtering step, SN-Join generates candidate string pairs by identifying all pairs (s, t) such that the signatures of s and t overlap. In the verification step, SN-Join checks whether sim(s, t, R) ≥ θ for each candidate and outputs those that satisfy the similarity predicate.

To generate the signatures, one way is to extend the prefix-filter scheme. Given a string s, the signatures of s are a subset of the tokens of s containing the ⌈(1 − θ)|s|⌉ smallest elements of s according to a predefined global ordering. We therefore generate signatures for a string s as follows. Let R′ ⊆ R be the set of n applicable synonym pairs of s. For each subset Ri of R′, we generate an expanded set Si, and for each Si we obtain one signature set (denoted Sig(Si)). Note that the number of subsets Si is 2^n; the exponential number is due to the fact that we enumerate all possible expanded sets of s with n synonym pairs. Finally, we generate the signature set of s by including all signatures of the Si: Sig(s, R) = ⋃_{i=1}^{2^n} Sig(Si). Note that this generation of signatures can be performed offline.

We observe that SN-Join needs to check the signature overlap of each string pair (s, t) ∈ S × T to find the candidates, which can be time-consuming. In this section, we propose an optimized native index and the corresponding join algorithm (Section 4.1). We then study the impact of signature filters on the performance of the algorithms in depth by proposing a novel hash-based synopsis data structure to effectively select signatures (Section 4.2). Finally, we extend our discussion to the LSH scheme (Section 4.3).
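A Python sketch of this offline signature generation (ours; the tokenization, the rule representation, and the toy global order are illustrative assumptions):

import math
from itertools import combinations

def prefix_signatures(token_set, theta, rank):
    # prefix filter: the ceil((1 - theta)|s|) smallest tokens of the
    # set under the global ordering `rank` (token -> position;
    # assumes every token appears in `rank`)
    m = math.ceil((1 - theta) * len(token_set))
    return set(sorted(token_set, key=rank.get)[:m])

def sig(tokens, applicable_rhs, theta, rank):
    # Sig(s, R): union of the prefix signatures of the expanded sets
    # of all 2^n subsets of the n applicable rules (done offline)
    signatures = set()
    for k in range(len(applicable_rhs) + 1):
        for subset in combinations(applicable_rhs, k):
            expanded = set(tokens).union(*subset)
            signatures |= prefix_signatures(expanded, theta, rank)
    return signatures

# toy usage: "VLDB Conf" with rules Conf -> Conference and
# VLDB -> Very Large Data Bases, under an assumed global order
tokens = ["VLDB", "Conf"]
rhs = [{"Conference"}, {"Very", "Large", "Data", "Bases"}]
rank = {t: i for i, t in enumerate(
    ["VLDB", "ICDE", "Conference", "Conf", "Bases", "Data", "Large", "Very"])}
print(sig(tokens, rhs, theta=0.8, rank=rank))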
4.1 SI-Join Algorithm

To optimize the above SN-Join method, we design a native index that (i) avoids the signature-overlap checking and (ii) reduces the number of candidates for the final verification, as follows.

Avoiding the signature-overlap check. We design a signature inverted list, called the I-List (see Figure 3(a)). Each token ti is associated with an inverted list containing all string IDs whose signatures contain ti. Thus, given a signature ti, we can directly obtain all relevant string IDs without any signature-overlap check.

Reducing the number of candidates. We extend the idea of the length filter [6, 26] to reduce the number of candidates. Intuitively, given two strings s1 and s2 and a threshold value θ, if the size of the full expansion set of s1 is much less than the size of s2, then, without checking the signatures of s1 and s2, one can claim that sim(s1, s2, R) must be smaller than θ. Based on this observation, we propose a data structure, called the S-Directory, to organize the strings by their lengths.

(Figure 3: The structure of the I-List and the S-Directory. (a) An I-List maps each signature token ti to an inverted list of string IDs. (b) An S-Directory is a three-level tree: a root entry over fence entries ⟨ui, vi, P⟩ over leaf entries ⟨tij, P⟩.)

The S-Directory (see Figure 3(b)) is a tree structure with three levels, whose nodes fall into different kinds by level:
• root-entry. A node in level L0 is a root entry, which has multiple pointers to the nodes in level L1.
• fence-entry. A node in level L1 is called a fence entry, which contains three fields ⟨u, v, P⟩, where u and v are integers and P is a set of pointers to the nodes in L2. In particular, u denotes the number of tokens of a string, and v is the maximal number of tokens in the full expanded sets of the strings whose length is u.
• leaf-entry. A node in level L2 is called a leaf entry, which contains two fields ⟨t, P⟩, where t is an integer denoting the number of tokens in the full expanded set of a string whose length is u, and P is a pointer to an inverted list. The leaf entries are organized in ascending order of t, so vi equals the largest leaf value tim below the fence entry (see Figure 3(b)).

We argue that the S-Directory can easily fit in main memory, as its size is only O((vmax − umin)²), i.e., quadratic in the difference between the maximal size of the expanded sets and the shortest length of the records. For example, in our implementation, a self-join on a table with one million strings from the USPS dataset requires an S-Directory of only about 2.15KB.

To maximize the pruning power, we combine the I-List and the S-Directory. That is, for each leaf entry ⟨t, P⟩, P points to a set of signatures (from the strings whose full expansion sets have size t), each associated with an I-List. We thus obtain a combined index, called the SI-tree: each leaf entry of the S-Directory is associated with I-Lists. Based on the SI-tree, we design a join algorithm (called SI-Join), which follows the filtering-and-verification framework; its verification phase is the same as that of SN-Join.

SI-Join (see Algorithm 3) has three steps in the filtering phase. In the first step (lines 2∼3), consider two fence entries Fs and Ft. If the maximal size of the expanded sets of the strings below Fs (or Ft) is smaller than θ times the length u of the original strings below Ft (or Fs), then all pairs of strings below Fs and Ft can be pruned safely. Similarly, in the second step (lines 4∼5), given two leaf entries Es and Et, where all expanded sets have the same size, if the size of the expanded sets below Es (or Et) is smaller than θ times the length of the original strings below Et (or Es), then all string pairs below Es and Et can be pruned safely. In the third step (lines 6∼10), for each signature g that is an overlapping signature pointed to by Es.P and Et.P, the function getIList returns the I-List whose key is g; SI-Join then generates string-pair candidates from every two such I-Lists.

Algorithm 3: SI-Join Algorithm
Input:  SI-trees SIS for S and SIT for T, threshold value θ, a collection of synonym pairs R
Output: C: candidate string pairs (s, t) ∈ SIS × SIT
1  C ← ∅;
2  for (∀ Fs in SIS, ∀ Ft in SIT) do            // fence entry: F = ⟨u, v, P⟩
3      if (min(Fs.v, Ft.v) ≥ θ × max(Fs.u, Ft.u)) then
4          for (∀ Es ∈ Fs.P, ∀ Et ∈ Ft.P) do    // leaf entry: E = ⟨t, P⟩
5              if (min(Es.t, Et.t) ≥ θ × max(Fs.u, Ft.u)) then
6                  for (∀ overlapping signatures g pointed to by Es.P and Et.P) do
7                      Ls = Es.P→getIList(g);
8                      Lt = Et.P→getIList(g);
9                      C′ = {(s, t) | s ∈ Ls, t ∈ Lt};
10                     C = C ∪ C′;
11 return C;

Figure 4: An example to illustrate the indexes. The columns L and F denote the number of tokens in the string and in the full expanded set, respectively. (a) Strings and signatures: q1 = "2013 ACM Intl Conf on Management of Data USA" (signatures ACM, International, Conference, on; L = 9, F = 11); q2 = "Very Large Data Bases Conf" (signatures Conf, Conference; L = 5, F = 7); q3 = "VLDB Conf" (signatures Conf, Conference; L = 2, F = 7); q4 = "ICDE 2013" (signature ICDE; L = 2, F = 2). (b) Synonym pairs: r1: Intl → International; r2: Conf → Conference; r3: VLDB → Very Large Data Bases; r4: Very Large Data Bases → VLDB. (c) An I-List. (d) An S-Directory with fence entries F1 = ⟨2, 7, ·⟩, F2 = ⟨5, 7, ·⟩, F3 = ⟨9, 11, ·⟩ and leaf entries E1 = ⟨2, ·⟩, E2 = ⟨7, ·⟩, E3 = ⟨7, ·⟩, E4 = ⟨11, ·⟩. (e) An SI-Index.

EXAMPLE 3. We use this example to illustrate the SI-Join algorithm. Consider a self-join on the table in Fig. 4(a) with θ = 0.8 and the synonyms in Fig. 4(b). The corresponding I-Lists, S-Directory, and SI-Index are shown in Figs. 4(c), (d) and (e). SI-Join prunes all the string pairs below F1 and F3, since min(7, 11) < 0.8 × max(2, 9) (lines 2∼3). Similarly, all pairs below (E1, E3) can be pruned (lines 4∼5). Then SI-Join uses the signatures "Conference" and "Conf" to get the I-Lists below (E2, E3) (lines 7∼8), since they are overlapping signatures pointed to by E2 and E3 (line 6). A candidate (q2, q3) is then generated and added to C (lines 9∼10). The final results are (q2, q3) and {(qi, qi) | i ∈ [1, 4]}.
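A minimal Python sketch (ours; fence entries reduced to plain tuples) of the two length-filter tests of lines 2∼5 of Algorithm 3, replayed on the numbers of Example 3:

def fence_prunable(Fs, Ft, theta):
    # F = (u, v): u = original token count, v = max expanded size below F
    return min(Fs[1], Ft[1]) < theta * max(Fs[0], Ft[0])

def leaf_prunable(Es_t, Et_t, u_max, theta):
    # E.t = exact expanded-set size below the leaf entry
    return min(Es_t, Et_t) < theta * u_max

# Example 3: F1 = (2, 7), F3 = (9, 11), theta = 0.8
print(fence_prunable((2, 7), (9, 11), 0.8))  # True: min(7, 11) < 0.8 * max(2, 9)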
4.2 Estimation-based signature selection

4.2.1 Motivation

The SI-Join algorithm generates signatures for each string according to a global order of tokens. In theory, if N denotes the number of distinct tokens, there are N! ways to permute the tokens to select signatures. Given two string tables to join approximately, it is impractical to design an efficient algorithm that selects signatures for each string online: even one scan of the data is time-consuming, and it is meaningless to spend time comparable to the real join just to find good signatures. In this section, we propose a novel strategy to solve this problem. Our main idea is to use multiple ways to generate signatures offline, and then, given two tables to join, to quickly estimate the quality of each signature filter and select the best one to perform the actual filtering. The main challenge in this idea is two-fold. First, we need to select multiple promising signature filters offline. Second, the estimator should quickly determine the best filter online according to the filters' filtering power.

To address the first challenge, we use the itf-based method ([2, 20]) to generate filters. In particular, let tf(w) denote the number of occurrences of token w in the join tables, i.e., the frequency of w. The inverse term frequency of w, itf(w), is then defined as 1/tf(w). The global order is arranged in descending order of itf values. Intuitively, tokens with a high itf value are rare in the collection and have a high probability of being chosen as signatures. Since tokens come from two sources, tables and rules, there are multiple ways to compare token frequencies: (1) ITF1: sort the tokens from the data tables and from the synonyms separately by decreasing itf, and arrange the tokens from the tables first, followed by those from the synonyms; (2) ITF2: sort separately again, but synonyms first, tables second; (3) ITF3: sort all the tokens together in decreasing itf order; (4) ITF4: generate all possible expanded sets of each string, then sort the tokens of all the expanded sets in decreasing itf order.
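A small Python sketch (ours) of the itf computation and the ITF3 ordering; ITF1, ITF2, and ITF4 differ only in how the two token sources are ordered and concatenated:

from collections import Counter

def itf_order(table_tokens, rule_tokens):
    # ITF3: one global order over all tokens, rarest first
    # (itf(w) = 1/tf(w), i.e., ascending tf == descending itf)
    tf = Counter(table_tokens) + Counter(rule_tokens)
    return {w: i for i, (w, _) in enumerate(
        sorted(tf.items(), key=lambda kv: (kv[1], kv[0])))}

# the rank dict feeds the prefix-signature generation of Section 4
order = itf_order(["Conf", "Conf", "VLDB", "ICDE"], ["Conference", "Conf"])
print(order)  # {'Conference': 0, 'ICDE': 1, 'VLDB': 2, 'Conf': 3}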
To study the effect of different signatures on the filtering power empirically, we perform experiments using only the signature filters ITF1∼ITF4 on two datasets: CONF and USPS. On the CONF dataset, we perform a join between two tables S and T using a synonym table R; table S contains 1000 abbreviations of conference names, while table T has 1000 full expressions of conference names. On the USPS dataset, we perform a self-join on a table containing one million records, each of which contains a person name, a street name, a city name, etc., using 284 synonyms. A detailed description of the datasets can be found in Section 5.

Figure 5 reports the experimental results. We observe that (i) the four signature filters achieve quite different filtering powers, where the filtering power is measured by ψ = |C|/|N|, with |C| the number of pruned pairs and |N| the total number of string pairs. For instance, on the CONF dataset, when the join threshold is 0.7, the filtering power of ITF1 is 89.2%, while that of ITF2 is only 53.1%. We also observe that (ii) no signature filter always performs best on every dataset. For example, ITF1 performs best on the USPS dataset but worst on the CONF dataset. The main reason is that ITF1 gives higher priority to the tokens in the data table to become signatures, but in the CONF dataset these signatures overlap heavily, resulting in a large number of candidates.

(Figure 5: Filtering powers of different signature filters; filtering power (%) versus thresholds 0.5∼0.9 on (a) the CONF dataset, (b) the USPS dataset, and (c) the SPROT dataset.)

These results suggest that it is unrealistic to find a "one-size-fits-all" solution. Therefore, in the following, given two tables to join, we propose an estimator to effectively estimate the filtering power of the different filters and dynamically select the best one.
4.2.2 Estimation-based selection algorithms

Let ℓ(S, g) = {q1^g, . . . , qn^g} denote the list of string IDs (i.e., the I-Lists in Figs. 3 and 4) associated with the signature token g in table S. An ID pair (qi^g, qj^g) ∈ ℓ(S, g) × ℓ(T, g) is then an element of the cross-product of the two lists. Let G denote the set of all common signatures of S and T. Given a signature filter λ, the union of all ID pairs over all signatures in G constitutes the set of candidates after signature filtering, that is, ∪_{g∈G} ℓ(S, g) × ℓ(T, g). We can use the following lemma to estimate lower and upper bounds on the cardinality of the candidate set.
LEMMA 1. Given a signature filter λ, upper and lower bounds on the cardinality of the candidate pairs are

UB(λ) = Σ_{g∈G} (‖ℓ(S, g)‖ × ‖ℓ(T, g)‖)
and
LB(λ) = max{‖ℓ(S, g)‖ × ‖ℓ(T, g)‖ : g ∈ G},

where ‖ℓ(·, g)‖ denotes the number of IDs in a list.
In the above lemma, the upper bound is the total number of string pairs, and the lower bound is the maximal number of string pairs associated with a single signature. A signature filter λ is guaranteed to return fewer candidates than a filter λ′ if UB(λ) < LB(λ′). Obviously, the tighter the bounds, the better, since tighter bounds imply more accurate estimates of the ability of a filter. Therefore, in the following, we introduce tighter upper and lower bounds by extending the Flajolet-Martin (FM) [13] technique.

The Flajolet-Martin estimator: The Flajolet-Martin (FM) technique [13] for estimating the number of distinct elements (i.e., set-union cardinality) relies on a family of hash functions H that map data values uniformly and independently over the binary strings of the input data domain. In particular, FM works as follows: (i) use a hash function h ∈ H to map string IDs sufficiently uniformly to integers in the range [0, 2^{L−1}]; (ii) initialize a bit-vector ν (also called a synopsis) of length L to all zeros and convert the integers into binary strings of length L; and (iii) compute the position of the first 1-bit of each binary string (denoted p(h(i))), and set the bit located at position p(h(i)) to 1. Note that for any i ∈ [0, 2^{L−1}], p(h(i)) ∈ {0, . . . , log M − 1} and Pr[p(h(i)) = ℓ] = 1/2^{ℓ+1}, which means that the bits in lower positions have a higher probability of being set to 1. Figure 6 illustrates the FM method: assuming six strings and a domain length L = 4, the final synopsis ν is "1110".

(Figure 6: An example of the bit-vector generated by the FM method. The first 1-bit is the left-most 1-bit.)

Our two-dimensional hash sketch (2DHS): In the following, we describe our ideas for extending the FM sketch to provide better bounds for estimating the number of candidate pairs. The main challenge is that the existing FM-based methods [13, 14] are only applicable to one-dimensional data, whereas we need to estimate the number of distinct pairs based on two independent FM sketches. An extra complication is that we require the bounds to be correct with arbitrarily high probability. We now describe four steps that address these challenges.

Step 1 (Estimating the distinct number of one table): First, we estimate the number of distinct string IDs over all signature tokens of one table, that is, µS and µT, where µS = |∪_{g∈G} ℓ(S, g)| and µT = |∪_{g∈G} ℓ(T, g)|. Ganguly et al. [14] provide an (ε, δ)-estimation algorithm for µS (or µT) using an extension of the FM technique, which estimates µS (or µT) within a small relative error and with high confidence 1−δ. Space precludes a detailed discussion of their method in this paper; readers may find it in Section 3.3 of [14]. Note that after this first step, we already obtain a new upper bound, µS × µT, but it is still not tight enough; we proceed with the following steps to get a better value.

Step 2 (Constructing a two-dimensional hash sketch): Let νS^g and νT^g be FM sketches of signature g over the I-Lists of tables S and T, respectively. We construct a two-dimensional hash sketch synopsis (2DHS) of signature g, which is a two-dimensional bit-vector: the bucket (i, j) of the 2DHS is 1 if and only if νS^g(i) = 1 and νT^g(j) = 1. Further, let G be the set of common signatures of S and T. We construct a new 2DHS′, which is the union ∪_{g∈G} 2DHS_g of the 2DHSs of all signatures, such that 2DHS′(i, j) = 1 if ∃g: 2DHS_g(i, j) = 1.

EXAMPLE 4. Figure 7 illustrates Step 2. Assume there are two common signatures g1 and g2 of S and T, with νS^{g1} = {1,1,1}, νT^{g1} = {1,1,0,1} and νS^{g2} = {1,1,0}, νT^{g2} = {1,1,1,0}. Figure 7(b1) is the 2DHS of g1 and Figure 7(b2) is the 2DHS of g2; the union 2DHS is shown in Figure 7(b3).

(Figure 7: The structure of the two-dimensional hash sketch (2DHS). (a) The 2-dimensional hash sketch synopsis; (b) an example of 2-dimensional hash sketches.)
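A compact Python sketch (ours) of the FM bit-vector update and the 2DHS construction of Step 2; it reproduces the union sketch of Example 4:

def fm_insert(synopsis, hashed_value, L):
    # set the bit at the position of the first 1-bit of the
    # L-bit binary representation of the hashed value
    bits = format(hashed_value, f"0{L}b")
    synopsis[bits.index("1")] = 1        # assumes hashed_value > 0

def two_dhs(vs, vt):
    # 2DHS of one signature: bucket (i, j) = vs[i] AND vt[j]
    return [[a & b for b in vt] for a in vs]

def union_2dhs(sketches):
    # bucket-wise OR over the 2DHSs of all common signatures
    out = [row[:] for row in sketches[0]]
    for m in sketches[1:]:
        for i, row in enumerate(m):
            for j, bit in enumerate(row):
                out[i][j] |= bit
    return out

g1 = two_dhs([1, 1, 1], [1, 1, 0, 1])    # Example 4, signature g1
g2 = two_dhs([1, 1, 0], [1, 1, 1, 0])    # Example 4, signature g2
print(union_2dhs([g1, g2]))
# [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 1]]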
Step 3 (Computing witness probabilities): Once the union 2DHS is built, we select buckets at which to compute witness probabilities, as follows. We examine a collection of ω independent 2DHSs (2DHS1, 2DHS2, . . . , 2DHSω), each copy using independently chosen hash functions from H. Algorithm 4 shows the steps for computing the witness probabilities. In particular, let Rs = ⌊log2 µS⌋ and Rt = ⌊log2 µT⌋. We investigate a column of buckets (Rs, y) (lines 2∼8) and a row of buckets (x, Rt) (lines 9∼15), and select the two smallest indexes y and x at which only a constant fraction 0.3ω of the sketch buckets turns out to be 1. As our analysis will show, the observed fractions p̂s and p̂t (line 16) and their positions can be used to provide robust estimates for the bounds in the next step.

Algorithm 4: Computing two witness probabilities
Input:  ω independent 2-dimensional hash sketches (2DHS1, 2DHS2, ..., 2DHSω), Rs and Rt
Output: p̂s and p̂t and two witness positions (Rs, y) and (x, Rt)
1  f = 0.3ω;
2  for (y from 0 to Rt) do
3      count = 0;
4      for (i from 1 to ω) do
5          if (2DHSi[Rs][y] == 1) then
6              count = count + 1;
7      if (count ≤ f) then
8          p̂s = count/ω and goto line 9;
9  for (x from 0 to Rs) do
10     count = 0;
11     for (i from 1 to ω) do
12         if (2DHSi[x][Rt] == 1) then
13             count = count + 1;
14     if (count ≤ f) then
15         p̂t = count/ω and goto line 16;
16 return p̂s, p̂t and their witness positions (Rs, y), (x, Rt)

Step 4 (Computing tighter upper and lower bounds): The final step is to compute the upper bound TUB and the lower bound TLB of a signature filter λ. Assume, without loss of generality, that µS ≤ µT.

TUB(λ) = ϕ1 · min(u1, u2), where
u1 = µS · ln(1 − ps) / ln(1 − 1/2^y),  with  ps = p̂s / (1 − (1 − 1/2^{Rs})^{µS}),  and
u2 = µT · ln(1 − pt) / ln(1 − 1/2^x),  with  pt = p̂t / (1 − (1 − 1/2^{Rt})^{µT});

TLB(λ) = ϕ2 · µS · ln(1 − p′s) / ln(1 − 1/2^y), where
p′s = (p̂s − ∆p) / (1 − (1 − 1/2^{Rs})^{µS}),  and
∆p = [1 − (1 − 1/2^{Rs+y})^{µS}] − (1/2^y) · [1 − (1 − 1/2^{Rs})^{µS}];

the constants ϕ1 = 1.43 and ϕ2 = 0.8 are derived in Theorem 2.

EXAMPLE 5. This example illustrates the computation of TUB and TLB. Assume µS = 100 and µT = 200; thus Rs = ⌊log2 100⌋ = 6 and Rt = ⌊log2 200⌋ = 7. Assume that, in Algorithm 4, the returned values are p̂s = 0.2, p̂t = 0.19, x = 5, and y = 6. According to the formulas in Step 4, we have ps = 0.2522, pt = 0.24, u1 = 1845, u2 = 1729, ∆p = 0.0117, and p′s = 0.2375, so the upper bound is TUB(λ) = 1.43 × 1729 ≈ 2472 and the lower bound is TLB(λ) = 0.8 × 1722 ≈ 1378.
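The numbers of Example 5 can be checked directly by evaluating the Step 4 formulas; a short Python script (ours):

from math import log

mu_s, mu_t = 100, 200                    # estimated distinct IDs (Step 1)
Rs, Rt = 6, 7                            # floor(log2(mu))
ps_hat, pt_hat, x, y = 0.2, 0.19, 5, 6   # from Algorithm 4

ps = ps_hat / (1 - (1 - 1 / 2**Rs) ** mu_s)
pt = pt_hat / (1 - (1 - 1 / 2**Rt) ** mu_t)
u1 = mu_s * log(1 - ps) / log(1 - 1 / 2**y)
u2 = mu_t * log(1 - pt) / log(1 - 1 / 2**x)
dp = (1 - (1 - 1 / 2**(Rs + y)) ** mu_s) \
     - (1 / 2**y) * (1 - (1 - 1 / 2**Rs) ** mu_s)
ps_prime = (ps_hat - dp) / (1 - (1 - 1 / 2**Rs) ** mu_s)
tub = 1.43 * min(u1, u2)
tlb = 0.8 * mu_s * log(1 - ps_prime) / log(1 - 1 / 2**y)
print(round(u1), round(u2), round(tub), round(tlb))
# 1845 1729 2472 1377, matching Example 5 up to rounding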
Therefore, one filter is selected as the best one if its TUB is less
than all the TLBs of other filters, which means that it returns the
smallest number of candidates. In practice, if there are several filters whose bounds have overlaps, then it means that their abilities
are similar with a high probability and we can arbitrarily choose
one of them to break the tie.
Theoretical analysis: We first demonstrate that, with an arbitrarily
high probability, the returned lower and upper bounds are correct,
and then we analyze its space and time complexities. Due to space
constraint, some proofs are omitted here.
It is important to note that given n distinct ID pairs and a position
(x, y) in a 2DHS matrix, the probability that the bucket in (x, y) is
1 depends on the distribution of the input data. To understand it,
consider the following two cases: One is (1,1), (2,1), . . . , (n,1) and
the other is (1,n+1), (2,n+2),. . . ,(n,2n). Both cases have n distinct
pairs, but their probabilities are different. The first case is 21y ×
1
[1 − (1 − 21x )n ], but the second case is [1 − (1 − 2x+y
)n ]. Therefore, we need to differentiate the probabilities for these two extremes. Let M IN (N, µS , µT , x, y) and M AX(N, µS , µT , x, y)
denote the minimum and maximum probabilities respectively that
the bucket (x, y) is 1 with N distinct pairs, where µS and µT denote the number of distinct IDs in each dimension respectively. The
computation of the accurate values of the MIN and MAX is highly
complicated. But for the purpose of this paper to differentiate the
L EMMA 4. Assume that µS ≤ µT . Then the upper bound of
the MAX probability can be computed as:
N
p1 = [1 − (1 − 21x )µS ] × [1 − (1 − 21y ) µS ];
1
p2 = [1 − (1 − 2x+y
)µS ] − 21y × [1 − (1 − 21x )µS ];
M AX(N, µS , µT , x, y) ≤ p1 + p2 .
PROOF. (Sketch) According to Lemma 2, with a fixed µS, the greater the value of µT, the greater the MAX probability. In the worst case, µT = N. Therefore, the MAX probability pmax is no greater than:

    pmax ≤ 1 − [(1 − 1/2^x) + (1/2^x) · (1 − 1/2^y)^(N/µS)]^µS.

We then prove that pmax − p1 ≤ p2. We can show that pmax − p1 is a monotonically decreasing function when N ≥ µS. Therefore, when N = µS (in this case, N = µS = µT), it attains its maximum value, which gives:

    pmax − p1 ≤ 1 − [(1 − 1/2^x) + (1/2^x) · (1 − 1/2^y)]^µS − [1 − (1 − 1/2^x)^µS] × [1 − (1 − 1/2^y)] = p2.
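For a quick numeric sanity check of these bounds, here is a small Python sketch (ours, under the formulas as reconstructed above; the inputs are illustrative):

    def min_lower_bound(mu_s, x, y):
        # Lemma 3: MIN(N, mu_s, mu_t, x, y) >= (1/2^y) * [1 - (1 - 1/2^x)^mu_s]
        return (1 - (1 - 2.0 ** -x) ** mu_s) / 2.0 ** y

    def max_upper_bound(n, mu_s, x, y):
        # Lemma 4 (mu_s <= mu_t): MAX(N, mu_s, mu_t, x, y) <= p1 + p2
        p1 = (1 - (1 - 2.0 ** -x) ** mu_s) * (1 - (1 - 2.0 ** -y) ** (n / mu_s))
        p2 = (1 - (1 - 2.0 ** -(x + y)) ** mu_s) \
             - (1 - (1 - 2.0 ** -x) ** mu_s) / 2.0 ** y
        return p1 + p2

    print(min_lower_bound(100, 5, 6))       # lower bound on the MIN probability
    print(max_upper_bound(400, 100, 5, 6))  # upper bound on the MAX probability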
THEOREM 2. Given a signature filter λ, for any high probability 1 − δ, our algorithm guarantees to correctly return the upper bound and the lower bound used to estimate the number of candidate pairs with λ.
PROOF. (Sketch) There are three steps in this proof. First, we show that the estimated probabilities p̂s and p̂t returned by Algorithm 4 are low-error and high-confidence; second, we show that our formulas TLB and TUB are correct; finally, we show how to adjust the formulas based on the estimated p̂s, p̂t and µS, µT.

First, in Algorithm 4, by the Chernoff bound, the estimate p̂ = count/ω satisfies |p̂ − p| ≤ εp with probability at least 1 − δ as long as ωp ≥ (2/ε²) log(1/δ). Let ε = 0.15; then ωp ≥ 88 log(1/δ). Since p̂s and p̂t are taken in Algorithm 4 at the smallest indexes where only 0.3ω of the sketches turn out to be 1, their values are no less than 0.3/2 = 0.15. Therefore p ≥ 0.15, and we can generate enough sketches such that ω ≥ 587 log(1/δ) = Θ(log(1/δ)). Strictly speaking, the witness probabilities p̂s and p̂t are themselves estimated values, and we need to use the Chernoff bound again to show that they are close to 0.15 with high confidence. The details are omitted here.
Second, according to Lemmas 3 and 4, it is not hard to compute the lower and upper bounds used to estimate the number of distinct pairs N.

Finally, the following two lemmas demonstrate that we only need to multiply a constant into the formulas to retain the accuracy guarantee when using the low-error estimates p̂s, p̂t, µS and µT. In particular, Lemma 5 is used to derive ϕ1 in TUB and Lemma 6 to derive ϕ2 in TLB. (A similar proof technique can be found in [5, 14], even though our estimation is quite different from theirs.)
LEMMA 5. Let f(x) = ln(1 − x/(1 − (1 − 1/2^Rs)^(µS(1+ε)))). If y − x ≤ 0.15x for some x ≤ 0.3 and ε ≤ 0.1, then f(y) − f(x) ≤ 0.43 f(x).
PROOF. Since 1 − (1 − 1/2^Rs)^(µS(1+ε)) ≈ 1 − e^(−(1+ε)) and ε ≤ 0.1, we have 1/(1 − e^(−(1+ε))) ≥ 1.5. By the Taylor series, there is a value w ∈ (x, y) such that ln(1 − 1.5y) = ln(1 − 1.5x) − 1.5(y − x)/(1 − 1.5w). Thus we have

    f(y) − f(x) ≤ 1.5(y − x)/(1 − 1.5 max(x, y)) ≤ 0.225x/(1 − 1.725x) ≤ 0.225x/0.52 ≈ 0.43x ≤ −0.43 ln(1 − 1.5x) ≤ 0.43 f(x).
LEMMA 6. Let f(x) = ln(1 − (x − ∆p)/(1 − (1 − 1/2^Rs)^(µS(1+ε)))). If x − y ≤ 0.15x for some x ≤ 0.3 and ε ≤ 0.1, then f(x) − f(y) ≤ 0.2 f(x).
PROOF. (Sketch) First, we can prove that ∆p/(1 − (1 − 1/2^Rs)^(µS(1+ε))) < 0.71; the details are omitted here. Then, by the Taylor series again, there is a value w ∈ (x, y) such that ln(1.71 − 1.5y) = ln(1.71 − 1.5x) − 1.5(x − y)/(1.71 − 1.5w). Thus we have

    |f(y) − f(x)| ≤ 1.5(x − y)/(1.71 − 1.5 max(x, y)) ≤ 0.225x/(1.71 − 1.725x) ≤ 0.225x/1.19 ≈ 0.189x ≤ 0.2 ln(1.71 − 1.5x),

since by the Maclaurin series, ln(1.71 − 1.5x) ≈ ln 1.71 − (1.5/1.71)x ≈ 0.54 − 0.88x.
THEOREM 3. Given a high probability 1 − δ, our estimator needs a total space of Θ(T log M log(1/δ)) bits, where T is the number of overlapping signatures and M is the maximal length of an inverted list associated with a signature, and its computing complexity is Θ(T log² M + log M log(1/δ)).
PROOF. The size of one FM sketch is log M and there are T signatures, so the space for storing all FM sketches is 2T log M. We need to generate a collection of ω = Θ(log(1/δ)) sketches; therefore, the total space requirement is Θ(T log M log(1/δ)). For the time complexity, the computing cost of Step 1 is 2ω log M, the cost of Step 2 is 2T log² M, and those of Steps 3 and 4 are ω log M and O(1), respectively. Therefore the total cost is bounded by Θ(T log² M + log M log(1/δ)).
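As an illustration of this analysis (ours, not the paper's code; the hash function and constants are only indicative), the following Python sketch builds ω = Θ(log(1/δ)) FM sketches over a set of IDs and reads off a witness probability as in Algorithm 4:

    import math, random

    def fm_bitmap(ids, seed, bits=32):
        # One FM sketch: mark the position of the lowest set bit of each hash.
        bitmap = [0] * bits
        for v in ids:
            h = hash((seed, v)) & ((1 << bits) - 1)
            if h:
                bitmap[(h & -h).bit_length() - 1] = 1
        return bitmap

    delta = 0.05
    omega = math.ceil(587 * math.log(1 / delta))   # Theta(log(1/delta)) copies
    ids = random.sample(range(10 ** 9), 2000)
    sketches = [fm_bitmap(ids, seed) for seed in range(omega)]
    x = 10                                          # probe one bucket position
    p_hat = sum(s[x] for s in sketches) / omega     # estimated witness probability
    print(omega, p_hat)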
4.3 Extension to LSH filters
Although we focus on prefix filters in the above discussion, our indexes and estimators can also be extended to other filters, such as the LSH scheme. In particular, each signature s in LSH is the concatenation of a fixed number k of minhashes of the set, and there are l such signatures (each using different minhashes). Then, in the I-lists, each inverted list is associated with a pair (s, i), where i means that the signature s comes from the i-th minhash, 1 ≤ i ≤ l. Therefore, by replacing the prefix signature g with (s, i), our SN-join (Algorithm 3) and the 2DHS estimator can be directly applied to the LSH scheme. As described in the next section, we implemented both LSH and prefix filters in our algorithms to make a comprehensive comparison.
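A minimal Python sketch of this signature layout (ours; the seeded hashing merely simulates independent minhash functions) generates, for one token set, the l signatures keyed as (s, i):

    import hashlib

    def minhash(tokens, seed):
        # Simulate one minhash function with seeded hashing.
        return min(hashlib.md5(f"{seed}:{t}".encode()).hexdigest()
                   for t in tokens)

    def lsh_signatures(tokens, k, l):
        sigs = []
        for i in range(l):
            # The i-th signature concatenates k minhashes of the set.
            s = "|".join(minhash(tokens, i * k + j) for j in range(k))
            sigs.append((s, i))  # (signature, index) keys the inverted list
        return sigs

    print(lsh_signatures({"very", "large", "data", "bases"}, k=3, l=4))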
5. EXPERIMENTAL RESULTS
In this section, we report an extensive experimental evaluation of the algorithms on three real-world datasets. First, we show the effectiveness and efficiency of the expansion-based framework to measure the similarity of strings with synonyms. Second, we evaluate our SN-Join, SI-Join and the state-of-the-art approaches to perform similarity joins with prefix and LSH signatures. Finally, we analyze the performance of the 2DHS estimators.

All the algorithms were implemented in Java 1.6.0 and run on Windows XP with a dual-core Intel Xeon 4.0GHz CPU, 2GB RAM, and a 320GB hard disk.

    Dataset | # of Strings | String Len (avg/max) | # of Synonym Pairs | # of Applicable Synonym Pairs Per String (avg/max)
    USPS    | 1,000,000    | 6.75 / 15            | 300                | 2.19 / 5
    CONF    | 10,000       | 5.84 / 14            | 1000               | 1.43 / 4
    SPROT   | 1,000,000    | 10.32 / 20           | 10,000             | 17.96 / 68

Figure 8: Characteristics of Datasets.
5.1 Datasets and Synonyms
Data Sets. We use three datasets: US addresses (USPS), conference titles (CONF), and gene/protein data (SPROT). These datasets differ from each other in terms of rule number, rule complexity, data size and string length. Our goal in choosing such diverse sources is to understand the usefulness of the algorithms in different real-world environments.
USPS: We downloaded common person names, street names, city names, states, and zip codes from the United States Postal Service website (http://www.usps.com). We then generated one million records, each of which contains a person name, a street name, a city name, a state, and a zip code. USPS also publishes extensive information about the format of US addresses, from which we obtained 284 synonym pairs. The synonym pairs cover a wide range of alternate representations of common strings, e.g., street → st.
CONF: We collected 10,000 conference names from more than ten
domains, including Medicine and Computer Science. We obtained
1000 synonym pairs between the full names of conferences and
their abbreviations by manually examining conference websites or
homepages of scientists.
SPROT: We obtained one million gene/protein records from the
Expasy website (http://www.expasy.ch/sprot). Each record contains an identifier (ID) and its name. In this dataset, each ID has
5 ∼ 22 synonyms. We generated 10,000 synonym rules describing
gene/protein equivalent expressions.
Figure 8 gives the characteristics of the three datasets.
5.2 Experimental results

5.2.1 Quality and Efficiency of Similarity Measures
The first experiment demonstrates the effectiveness and efficiency of the various similarity measures. We compared our two measures, full expansion (Full) and selective expansion (SE), with plain Jaccard (which uses no synonyms) and with JaccT [2] (which uses synonyms).
For each of the three datasets, we performed the experiments
by conducting a similarity join between a query table TQ and a
target table TT as follows: (1) TQ consists of 100 manually selected
full names, and (2) TT has 200 records where 100 of them are the
correct abbreviations of the corresponding records in TQ (i.e. the
ground truth), and the other 100 “dirty” records are selected such
that each of them is a similar record (in terms of Jaccard coefficient)
to the corresponding records in TQ . This is to ensure that there is
only one correct matching record in TT for each record in TQ .
Quality of Measures. In Figure 9, we report the quality of the measures by testing the Precision (short for "P"), Recall ("R"), and F-measure = 2 × P × R / (P + R) ("F") on the three datasets. We observe that:
• The similarity measures using synonyms (including JaccT, Full and SE) obtain higher scores than Jaccard, which does not consider synonym pairs. The reason is that, without synonyms, Jaccard has no chance to improve the similarity.

• SE significantly outperforms JaccT on each dataset. For example, on the SPROT dataset, the F-measures of SE and JaccT are 0.82 and 0.52, respectively. The main reason is that an abbreviation may have various full expressions and the join records may contain a combination of multiple expressions; therefore SE can apply multiple rules, while JaccT can apply only one. Note that such a situation is not rare in real-world data, as one fragment of a string is likely to involve multiple synonym rules. We illustrate one example on each of the three datasets in Figure 10 to compare the performance of the four similarity measures.

Figure 9: Quality of similarity measures (P: precision, R: recall, F: F-measure). [Panels: (a) rank distribution; (b) result evaluation, a table of P, R and F for Jacc, JaccT, Full and SE at varying thresholds on CONF, USPS and SPROT; (c) transformation synonym pairs, e.g., Intl → International, Wireless Health → WH, UK → United Kingdom, Conf → Conference.]
Figure 10: Three examples to illustrate the quality of similarity measures. [Example string records and their applicable synonyms, with similarities under the four measures. CONF: s1 = "Proceedings of the VLDB Endowment 2012: 38th International Conference on Very Large Databases, Turkey" vs. s2 = "PVLDB 2012 Turkey"; rules r1: PVLDB → International Conference on Very Large Databases, r2: PVLDB → Proceedings of the VLDB Endowment. USPS: s1 = "University of Washington 1705 NE Pacific St Seattle, WA 98195" vs. s2 = "UW"; rules r1: UW → University of Washington, r2: UW → 1705 NE Pacific St Seattle, WA 98195, r3: UW → University of Waterloo. SPROT: s1 = "P93214" vs. s2 = "14339_SOLL,14-3-3 protein 9"; rules r1: P93214 → 14339_SOLL, r2: P93214 → 14-3-3 protein 9, r3: 14339_SOLL → UPI0000124DEC. On all three examples, Jacc scores 0.00 while SE scores 1.00.]
[Table: Comparison of the similarity measures.]

    Similarity Algorithm  | Computational Complexity | Effectiveness | Comment
    Jacc                  | O(L)                     | Low           | Does not consider synonym pairs.
    JaccT [2]             | O(2^n (L + kn))          | High          | If only single-token synonym pairs are allowed (implying k = 1), an optimal algorithm based on maximum bipartite matching can be used, with a complexity of O(L² + Ln²).
    Full Expansion (ours) | O(L + kn)                | High          | Most efficient and may have good quality.
    SE Algorithm (ours)   | O(L + n² + kn)           | High          | Efficient with good quality; optimal under the conditions of Theorem 1, in which case the complexity can be reduced to O(L + n log n).

Optimality Scenarios of the SE measure. As described in Theorem 1, SE is optimal if the rhs tokens of the useful rules are distinct. The purpose of this experiment is to verify how likely this optimality condition is satisfied in practice. We performed experiments on the three datasets using different data sizes; the results are shown in Figure 11. The average percentages of optimal cases are 71.39%, 68.11%, and 87.32% on the USPS, CONF and SPROT data, respectively. Therefore, the condition of Theorem 1 is likely to be met by the majority of candidate string pairs, which means that the values returned by the SE algorithm are optimal in most cases. Further, the percentage on the SPROT data is larger than those on USPS and CONF. This is because many of the synonym pairs in SPROT are single-token-to-single-token synonyms, which are more likely to meet the distinctness condition. In contrast, many of the synonyms in USPS and CONF are multi-token synonyms.

    Dataset | 1000  | 3000  | 5000  | 7000  | 9000
    USPS    | 70.0% | 71.2% | 72.8% | 72.2% | 70.8%
    CONF    | 62.4% | 67.6% | 68.9% | 71.2% | 70.4%
    SPROT   | 84.4% | 86.7% | 88.4% | 89.8% | 87.3%

Figure 11: Empirical Probabilities of Optimal Cases for SE.
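The optimality test itself is simple to state in code; a minimal Python sketch (ours, assuming the Theorem 1 condition as stated above) checks whether any token appears on the right-hand side of two different applicable rules:

    def se_is_optimal(applicable_rules):
        """applicable_rules: list of (lhs, rhs_tokens) pairs for one string."""
        seen = set()
        for _, rhs_tokens in applicable_rules:
            for tok in rhs_tokens:
                if tok in seen:
                    return False  # a shared rhs token breaks the guarantee
                seen.add(tok)
        return True

    rules = [("UW", {"university", "of", "washington"}),
             ("Intl", {"international"})]
    print(se_is_optimal(rules))  # True: all rhs tokens are distinct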
Efficiency of Measures. Having verified the effectiveness of the different measures, we next assess their efficiency. Figure 12(a) shows the time cost of the four measures on the SPROT dataset (10K records), obtained by running 10^8 string similarity measurements based on a nested-loop self-join. The x-axis denotes the number of synonym rules applied on one single string, and the y-axis is the total running time. We find that although the full-expansion needs to expand the string set using rules, its performance is comparable to that of Jaccard, which does not use any rules at all. This is because the full-expansion adds all applied rules directly, so its computing cost increases slowly with the number of rules. In contrast, JaccT needs to enumerate all the possible intermediate strings, and its performance deteriorates rapidly with the increase of rules. Finally, SE sits in the "middle" of the two extremes: it avoids enumerating all the possible strings, achieving better performance than JaccT, and it selects only the appropriate rules, achieving better effectiveness than full-expansion in terms of F-measures (illustrated in Figure 9).

Figure 12: Performance with varying synonyms.

5.2.2 Efficiency and Scalability of Join Algorithms

The second experiment tests the efficiency and scalability of the various join algorithms. We compared our algorithms, SN-Join and SI-Join, with the algorithm in [2] (also termed JaccT); our implementation of JaccT includes all the optimizations proposed in [2]. Since our algorithms can work with the two similarity measures (i.e., full expansion and selective expansion), we denote them as SN(F), SN(S), SI(F) and SI(S), respectively. We implemented all algorithms using both prefix and LSH filters; for the algorithms using the LSH scheme, we append "-L" to the name, e.g., JaccT-L denotes the JaccT algorithm using the LSH scheme. Note that we use the false negative ratio δ ≤ 5%, i.e., the accuracy is 1 − δ ≥ 95%. The parameters k and l should satisfy δ ≥ (1 − θ^k)^l [33]. For example, if the threshold θ = 0.8 and k = 3, then l should be at least 5 to guarantee at least 95% accuracy, as the sketch below illustrates.
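As a back-of-the-envelope check of this parameter rule (a sketch under the constraint (1 − θ^k)^l ≤ δ as stated above), the smallest admissible l can be computed as:

    import math

    def min_l(theta, k, delta=0.05):
        # Smallest l with (1 - theta**k)**l <= delta, i.e. a false-negative
        # ratio of at most delta at similarity threshold theta.
        return math.ceil(math.log(delta) / math.log(1 - theta ** k))

    print(min_l(0.8, 3))  # -> 5 signatures for >= 95% accuracy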
FN(%)
FR(%) FN(%)
FR(%) FN(%)
FR(%)
FN(%)FR(%)
FN(%)
FR(%) FN(%)
63.5
72.4
87.9
83.4
81.1
82.6
4.21
4.54
4.23
3.89
3.73
3.29
66.5
73.9
89.4
83.1
86.8
82.5
4.32
4.77
4.56
4.22
3.88
3.69
76.3
79.1
99.2
87.8
92.3
94.1
3.61
3.55
3.52
2.78
2.43
2.19
78.8
83.1
89.4
84.7
92.3
90.4
4.22
4.38
4.13
3.69
3.77
3.59
80.1
85.6
89.9
89.5
93.7
92.9
4.19
4.28
3.85
4.04
3.27
3.73
84.5
90.7
97.1
95.5
99.8
96.2
3.71
3.33
3.42
2.94
3.06
2.73
filter ratio (FR) and False Negative (FN) with varying parameters k and l. (accuracy=0.95,
threshold=0.9)
Prefix filtering scheme
USPS
SPROT
Locality sensitive hashing (LSH) scheme
USPS (k=4)
SPROT(k=6)
Ours
JaccT
Ours
JaccT
Ours
JaccT
Ours
l JaccT
l
Avg/max Avg/max Avg/max Avg/max
Avg/max Avg/max
Avg/max Avg/max
0.95
0.90
0.85
0.80
0.75
0.70
1.415/5 1.137/3
1.527/6 1.447/6
2.078/8 2.201/7
3.780/9 2.349/7
3.913/9 2.806/8
4.456/10 3.568/10
8.363/17
8.654/18
9.532/18
10.79/19
11.55/19
13.07/20
7.432/16
8.322/16
8.953/17
9.512/19
10.10/21
12.97/23
2
3
5
6
8
11
3.384/8
4.576/19
8.952/21
11.14/29
15.52/36
20.09/51
3.254/8
4.499/18
8.911/20
11.09/27
14.15/35
19.21/51
3
4
7
10
16
24
14.87/46 12.16/44
19.84/58 19.44/57
34.72/72 34.58/71
49.62/103 48.19/100
69.36/192 68.73/185
75.04/297 74.92/294
filter ratio
(FR)
and False
with varying
k and
l. (accuracy=0.95,
Prefix
filtering
scheme Negative (FN)
Locality
sensitive parameters
hashing (LSH)
scheme
USPS (k=4)
SPROT(k=6)
threshold=0.9)
USPS
SPROT
98.9
99.0
98.7
98.9
98.7
98.9
98.4
98.7
98.7
98.4
777.6
695.4
874.0
714.3
885.8
777.6
904.0
874.0
885.8
904.0
701.9
561.3
764.2
661.3
817.6
701.9
878.7
764.2
817.6
878.7
57.4
49.8
63.0
56.9
66.7
57.4
78.5
63.0
66.7
78.5
1.974/5
88.5
91.8
86.6
90.7
85.2
88.5
78.1
86.6
85.2
78.1
90.0
93.5
89.3
91.7
87.5
90.0
80.3
89.3
87.5
80.3
50.2
32.7
53.4
41.9
62.4
50.2
98.8
53.4
62.4
98.8
22.1
2.6
39.5
7.1
48.9
22.1
51.3
39.5
48.9
51.3
0.95
0.90
0.85
0.80
0.75
0.70
Running Time and Scalability. Figures 16 and 17 show the running time of the five join algorithms based on the prefix and LSH schemes, respectively (the join threshold is 0.8). The x-axis represents the join data size. As shown, the running times of both JaccT and SN-Join (i.e., SN(F), SN(S)) grow exponentially, whereas SI-Join (i.e., SI(F), SI(S)) scales better (i.e., linearly) than SN-Join and JaccT. The reason is that the SI-Join methods have more efficient filtering strategies and generate smaller candidate sets. In addition, to study the scalability of the algorithms as the number of rules increases, we plot Figure 12(b), where SI(F) (i.e., SI-Join with the full-expansion) is a clear winner among the algorithms using rules: it is insensitive to the number of rules and is thus able to outperform the other methods when one string involves more than 10 rules.
Figure 16: Running time in CONF and USPS datasets.

Figure 17: Performance on LSH signature scheme.

Prefix vs. LSH. We then sought to analyze system performance with the different signature schemes, namely prefix vs. LSH. Figure 18 plots the running times of SI(S)-L (LSH) and SI(S) (prefix) with varying join thresholds; the results on SI(F) and SN-Join are similar. Note that, for SI(S)-L, the choice of k and l has a substantial impact on the query time, hence we tuned the parameters k and l for each threshold to achieve the best running time. Before we describe the results, it should be noted that the two schemes, LSH and prefix, have different goals and probably different application scenarios: LSH is an approximate solution, whereas prefix must return all correct answers, which makes a direct comparison between them somewhat unfair. We observe that LSH is faster than prefix when the threshold is approximately ≥ 0.8. This is mainly because more than one minhash signature is combined (i.e., k ≥ 2) in these settings, which makes the LSH method quite selective. However, when the threshold is less than 0.8, we tried all reasonable combinations of k and l values and still could not find a setting that beats prefix. This is because when the threshold is small, in order to improve the filtering power we need to select a large k, which in turn requires a large l to maintain the same confidence level (≥ 95%); this results in substantial overhead in filtering time. LSH therefore trades filtering time for higher filtering power, and its overall running time becomes greater than that of prefix in such cases. This is also the reason why the best running time for thresholds < 0.8 is achieved with small k and l values in Figure 18.

Figure 18: Prefix scheme vs. LSH scheme (with varying threshold values).
5.2.3 Estimation-based signature selection

The last set of experiments (Figure 19) studies the quality of the 2DHS synopsis on the prefix and LSH schemes. For the prefix filter, we use the four frequency-based signature filters, i.e., ITF1∼ITF4, described in Section 4.2. For the LSH scheme, we vary the parameters k and l to generate six filters. The off-line processing time for generating all signatures of one filter is around 200s ∼ 300s, which is a one-time cost.

We first computed the upper bound (UB) and the lower bound (LB) of each filter by Lemma 1. One filter is returned as the best one if its UB is less than all the LBs of the other filters, which means that it returns the smallest number of candidates. For example, in Figure 19(a), ITF2 is the best one on the CONF dataset. If no filter can beat all the others using only UB and LB, then we further compute the tighter upper bound (TUB) and lower bound (TLB) using the 2DHS synopsis. Consequently, in Figure 19(b), ITF1 is the best filter on the USPS dataset, as its TUB is less than all the others' TLBs. Note that our estimate correctly predicts the ability of the filters, which can be empirically verified in Figure 5 for the four filters of the prefix scheme. In addition, the estimation cost accounts for only 1% of the join time; for example, on the USPS dataset, the time cost of our estimator is only 0.81s, while the filtering time is 83.7s and the total running time is 113.4s.

[Table: filtering power (FP) and error ratio (ER) with varying parameters k and l (accuracy = 0.95, threshold = 0.8).]

Figure 19: Estimation results of 2DHS synopsis for prefix and LSH schemes (threshold = 0.90).
Summary. Finally, we summarize our findings. (1) Considering the four similarity measures, we observe that Jaccard is inadequate due to its complete neglect of synonym rules. JaccT is not efficient because it enumerates all transformed strings and entails large query overhead. Full-expansion is extremely efficient, but its F-measure is not as good as that of Selective-expansion, which strikes a good balance between effectiveness and efficiency. (2) With respect to the three similarity join algorithms, SI-Join is the winner in all settings over JaccT and SN-Join, by using the SI-tree to efficiently perform length filtering and signature filtering. We achieve a speedup of up to 50∼300x on the three data sets over the state-of-the-art approach. (3) Finally, the 2DHS synopsis offers very good results and is extremely efficient in estimating the filtering power of different filters (< 1 second, accounting for only 1% of the total running time), which strongly motivates its application in practice.
6. CONCLUSIONS

In this paper, we studied the problem of string similarity measures and joins with semantic rules (i.e., synonyms). We proposed two expansion-based methods to measure the similarity of strings, and a novel index structure called the SI-tree to perform the similarity join. We also proposed an estimator to select signatures online to optimize the efficiency of signature filters in the join algorithms. The estimator provides provably high-confidence and low-error estimates to compare the filtering power of filters, with logarithmic space and time complexity. The extensive experiments demonstrated the advantages of our proposed approach over the state-of-the-art methods on three data sets.

7. ACKNOWLEDGEMENTS

This research is partially supported by the 973 Program of China (Project No. 2012CB316205), NSF China (No. 60903056), the 863 National High-tech Research Plan of China (No. 2012AA011001), the Beijing Natural Science Foundation (No. 4112030), and New Century Excellent Talents in University. Wei Wang is supported by ARC Discovery Projects DP130103401 and DP130103405.

8. REFERENCES

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20–29, 1996.
[2] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.
[3] A. Arasu, S. Chaudhuri, and R. Kaushik. Learning string transformations from examples. PVLDB, 2(1):514–525, 2009.
[4] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918–929, 2006.
[5] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, pages 1–10, 2002.
[6] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131–140, 2007.
[7] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.
[8] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157–1166, 1997.
[9] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[10] S. Chaudhuri and R. Kaushik. Extending autocompletion to tolerate errors. In SIGMOD Conference, pages 707–718, 2009.
[11] W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb, pages 73–78, 2003.
[12] M. Farach. Optimal suffix tree construction with large alphabets. In FOCS, pages 137–143, 1997.
[13] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182–209, 1985.
[14] S. Ganguly, M. N. Garofalakis, and R. Rastogi. Tracking set-expression cardinalities over continuous update streams. VLDB J., 13(4):354–369, 2004.
[15] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
[16] K. Iwama and S. Tamaki. Improved upper bounds for 3-SAT. In SODA, 2004.
[17] G. Kondrak. N-gram similarity and distance. In SPIRE, pages 115–126, 2005.
[18] H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658–669, 2009.
[19] H. Lee, R. T. Ng, and K. Shim. Similarity join size estimation using locality sensitive hashing. PVLDB, 4(6):338–349, 2011.
[20] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008.
[21] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. In VLDB, 2012.
[22] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In SIGIR, pages 214–221, 1999.
[23] S. Patro and W. Wang. Learning top-k transformation rules. In DEXA, pages 172–186, 2011.
[24] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033–1044, 2011.
[25] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523, 1988.
[26] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743–754, 2004.
[27] Y. Tsuruoka, J. McNaught, J. Tsujii, and S. Ananiadou. Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics, 23(20):2768–2774, 2007.
[28] J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219–1230, 2010.
[29] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In SIGMOD Conference, pages 85–96, 2012.
[30] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, pages 759–770, 2009.
[31] W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau, 1999.
[32] C. Xiao, J. Qin, W. Wang, Y. Ishikawa, K. Tsuda, and K. Sadakane. Efficient error-tolerant query autocompletion. PVLDB, 2013.
[33] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011.