
APPENDIX
B. THEOREM 2 AND PROOF
PROOF OF THEOREM 2. For the $i$-th RM step, the time complexity is $O(pq\,x_i)$, where $x_i$ is the number of linked attribute pairs in the $i$-th RM step. For the $j$-th SM step, the time complexity is $O(mn\,y_j)$, where $y_j$ is the number of linked record pairs in the $j$-th SM step. Thus the total time complexity is

$$\sum_{i=1}^{k_1} O(pq\,x_i) + \sum_{j=1}^{k_2} O(mn\,y_j),$$

where $k_1$ and $k_2$ are the numbers of RM and SM steps, respectively. However, the time complexity of the RM steps is dominated by the last RM step, and the same holds for the SM steps: the earlier steps of RM or SM can be regarded as part of the later steps, because records or attributes that have already been compared in an earlier step need not be compared again. We can therefore focus on the time complexity of the final RM and SM steps. In addition, some comparisons overlap between the SM and RM steps; the time complexity of the overlapping comparisons between the final SM and RM steps is $O(x_{\max}\,y_{\max})$. The total complexity is thus $O(pq\,x_{\max} + mn\,y_{\max} - x_{\max}\,y_{\max})$, where $x_{\max}$ and $y_{\max}$ are the maximal numbers of linked attribute pairs and record pairs at the end. Since $x_{\max} \le \min(m, n)$ and $y_{\max} \le \min(p, q)$, the upper bound for our interaction algorithm is $O(\min(m, n)\,pq + \min(p, q)\,mn - \min(m, n)\cdot\min(p, q))$. Since $p$ and $q$ are often much smaller than $m$ or $n$, this simplifies to $O(\min(p, q)\,mn)$. Thus, Theorem 2 is proved.
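As a rough numerical illustration of why the $\min(p, q)\,mn$ term dominates the bound in Theorem 2, the sketch below evaluates the three terms for one set of sizes. The function name and the sample values are assumptions chosen purely for illustration, and the results are operation counts, not measured running times.

```python
def interaction_cost_bound(m: int, n: int, p: int, q: int) -> dict:
    """Evaluate the terms of the Theorem 2 upper bound (operation counts only)."""
    x_max = min(m, n)  # bound on x_max as stated in the proof
    y_max = min(p, q)  # bound on y_max as stated in the proof
    rm_term = p * q * x_max      # final RM step: pq * x_max
    sm_term = m * n * y_max      # final SM step: mn * y_max
    overlap = x_max * y_max      # comparisons counted in both steps
    return {
        "rm": rm_term,
        "sm": sm_term,
        "overlap": overlap,
        "total": rm_term + sm_term - overlap,
    }


# Illustrative sizes (assumed): p and q much smaller than m and n.
bound = interaction_cost_bound(m=10_000, n=8_000, p=12, q=10)
sm_share = bound["sm"] / bound["total"]
print(bound, round(sm_share, 4))
```

With these assumed sizes the SM term accounts for almost all of the total, which matches the simplification to $O(\min(p, q)\,mn)$.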
A. CONVERGENT PROOF
PROOF OF THEOREM 1. Denote the bi-graph by $G = G_1 \cup G_2$ with $G_1 = \{V_1, E_1\}$ and $G_2 = \{V_2, E_2\}$, where $V_1$ is the set of record-pair nodes and $V_2$ is the set of attribute-pair nodes; $E_1$ is the set of edges on $G_1$ and $E_2$ is the set of edges on $G_2$. Let $A$ and $B$ denote the adjacency matrices of $(V_1, E_1)$ and $(V_2, E_2)$, respectively. For ease of presentation, we redefine the values in $A$ and $B$. If there is an edge pointing from a node in $V_1$ to a node in $V_2$, the corresponding value in $A$ is the contribution value given by Eq. 6. Likewise, if there is an edge pointing from a node in $V_2$ to a node in $V_1$, the corresponding value in $B$ is the contribution value. Denote the matching likelihoods of the record pairs and the attribute pairs at the $i$-th iteration by the vectors $\vec{r}_i$ and $\vec{a}_i$, respectively. We now define four functions on matrices:
1) $G_1(M)$ applies the function $g_1(x) = \frac{1}{1+e^{-\ \cdot x}}$ to each value $x$ in the matrix $M$;
2) $G_2(M)$ applies the function $g_2(x) = \frac{1}{1+e^{-x}}$ to each value $x$ in the matrix $M$;
3) $H_1(M)$ applies the function $h_1(x) = -\ln(1 - c \cdot x)$ to each value $x$ in the matrix $M$, where $c$ is a constant (referring to the IdC score in Eq. 5);
4) $H_2(M)$ applies the function $h_2(x) = -\ln(1 - x)$ to each value $x$ in the matrix $M$.
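The four element-wise functions can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the coefficient inside $g_1$ is passed as a hypothetical parameter `coef`, the default value of `c` is an arbitrary placeholder for the IdC-based constant, and matrices are represented as nested lists.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]


def g1(x: float, coef: float = 1.0) -> float:
    """Sigmoid with a scaling coefficient; `coef` is a hypothetical stand-in."""
    return 1.0 / (1.0 + math.exp(-coef * x))


def g2(x: float) -> float:
    """Plain sigmoid 1 / (1 + e^(-x)); its values lie strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def h1(x: float, c: float = 0.5) -> float:
    """-ln(1 - c*x); only defined when c*x < 1. The default c is a placeholder."""
    return -math.log(1.0 - c * x)


def h2(x: float) -> float:
    """-ln(1 - x); only defined when x < 1."""
    return -math.log(1.0 - x)


def apply_elementwise(f: Callable[[float], float], M: Matrix) -> Matrix:
    """Apply a scalar function to every entry of M (the role of G1, G2, H1, H2)."""
    return [[f(x) for x in row] for row in M]


M = [[0.0, 1.0], [-1.0, 2.0]]
G2_M = apply_elementwise(g2, M)        # G2(M)
H2_G2_M = apply_elementwise(h2, G2_M)  # H2(G2(M)), well defined since g2(x) < 1
print(G2_M[0][0], H2_G2_M[0][0])
```

Because a sigmoid output always lies in $(0, 1)$, composing $h_2$ after $g_1$ or $g_2$ never leaves the domain of the logarithm, which is what makes the compositions used in the proof well defined.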
Also, we denote by $X^{(k)}(M)$ the result of applying the function $X$ to $M$ $k$ times, where $X \in \{G_1, G_2, H_1, H_2\}$. Thus we obtain the following two equations:

$$\vec{a}_i^{T} = G_1\big(A^{T} \cdot H_1(\vec{r}_i^{T})\big) \tag{17}$$

$$\vec{r}_i^{T} = G_2\big(B^{T} \cdot H_2(\vec{a}_{i-1}^{T})\big) \tag{18}$$

Substituting Eq. 18 into Eq. 17 and Eq. 17 into Eq. 18, we have:

$$\vec{a}_i^{T} = G_1\big(A^{T} \cdot H_1(G_2(B^{T} \cdot H_2(\vec{a}_{i-1}^{T})))\big) \tag{19}$$

$$\vec{r}_i^{T} = G_2\big(B^{T} \cdot H_2(G_1(A^{T} \cdot H_1(\vec{r}_{i-1}^{T})))\big) \tag{20}$$

To obtain more explicit equations, we need to make some conversions. Note that $h_1(g_2(x)) - x$ and $h_2(g_1(x)) - x$ are monotone decreasing functions (this property follows from calculating their derivatives). Similar to the convergence proof of series, we have

$$\lim_{k \to +\infty} (H_1 G_2)^{(k)}(M) - M = \xi_1 \tag{21}$$

$$\lim_{k \to +\infty} (H_2 G_1)^{(k)}(M) - M = \xi_2 \tag{22}$$

where $\xi_1$ and $\xi_2$ are constant vectors. Thus, we obtain the vectors $\vec{a}_i^{T}$ and $\vec{r}_i^{T}$ as follows:

$$\lim_{k \to +\infty} \vec{a}_i^{T} = \xi'_1 + G_1\big((A^{T} B^{T})^{k-1} \cdot \vec{a}_1^{T}\big) \tag{23}$$

$$\lim_{k \to +\infty} \vec{r}_i^{T} = \xi'_2 + G_2\big((B^{T} A^{T})^{k-1} \cdot \vec{r}_1^{T}\big) \tag{24}$$

where $\xi'_1$ and $\xi'_2$ are constant vectors. Since the edges are all bidirectional, excluding the nodes at the SM 0 step, we have $A = B^{T}$. Thus $A^{T} B^{T}$ and $B^{T} A^{T}$ are both symmetric matrices, and hence both are diagonalizable. According to the well-studied properties of symmetric matrices [21], $\vec{a}_i^{T}$ and $\vec{r}_i^{T}$ both converge to constant vectors. The convergence of our iterations is quite rapid; more work on how to speed up the iterative process can also be found in [21].

C. PROOF OF LEMMA 1
PROOF OF LEMMA 1. We prove this lemma by contradiction. Suppose $s_1$ and $s_2$ do not share any $(\ell, q)$-seq. Since $\ell \ge q \cdot \lceil (\max(|s_1|, |s_2|) - q + 1) \cdot \omega \rceil$, we obtain that $s_1$ and $s_2$ share at most $\lceil (\max(|s_1|, |s_2|) - q + 1) \cdot \omega \rceil - 1$ $q$-grams. Then we have

$$F(s_1, s_2) = \frac{|Gms(s_1, q) \cap Gms(s_2, q)| + \varepsilon}{(\max(|s_1|, |s_2|) - q + 1) + \varepsilon} \le \frac{\lceil (\max(|s_1|, |s_2|) - q + 1) \cdot \omega \rceil - 1 + \varepsilon}{(\max(|s_1|, |s_2|) - q + 1) + \varepsilon} < \omega$$

This contradicts the condition $F(s_1, s_2) \ge \omega$, so $s_1$ and $s_2$ must share at least one $(\ell, q)$-seq, where $\ell \ge q \cdot \lceil (\max(|s_1|, |s_2|) - q + 1) \cdot \omega \rceil$. Thus, Lemma 1 is proved.

D. PROOF OF LEMMA 2
PROOF OF LEMMA 2. To begin with, we consider how a set of $q$-grams is generated from a string $s$ (see Definition 1). Since a string $s$ can generate at most $|s| - q + 1$ grams, the minimum length of a string made up of $n$ $q$-grams is $n + q - 1$. Without loss of generality, we suppose $|s_1| \le |s_2|$ in the following proof, i.e., $\max(|s_1|, |s_2|) = |s_2|$.

(1) If two attribute values $s_1$ and $s_2$ are assigned to one block, then according to Lemma 1 they share at least $\lceil (\max(|s_1|, |s_2|) - q + 1) \cdot \omega \rceil$ grams. Then the minimum length of $s_1$ is $\lceil (|s_2| - q + 1) \cdot \omega \rceil + q - 1$. So transforming $s_1$ into $s_2$ needs at most $|s_2| - [\lceil (|s_2| - q + 1) \cdot \omega \rceil + q - 1]$ operations. According to the definition of edit similarity, the minimum similarity between $s_1$ and $s_2$ is

$$\frac{\lceil (|s_2| - q + 1) \cdot \omega \rceil + q - 1}{|s_2|}, \quad \text{i.e.,} \quad \frac{\lceil (\max(|s_1|, |s_2|) - q + 1) \cdot \omega \rceil + q - 1}{\max(|s_1|, |s_2|)}.$$

(2) If $s_1$ and $s_2$ are not in any block, they share at most $\lceil (|s_2| - q + 1) \cdot \omega \rceil - 1$ $q$-grams. We remove $k$ characters from the string $s_2$ and denote the new string by $s'$. Assume the $k$ characters do not contain the first or the last character of $s_2$ (we denote this condition as (*) and discuss below the scenario in which it does not hold). Then $s_2$ is divided into at most $k + 1$ strings, denoted by $\{s'_1, s'_2, \ldots, s'_{k+1}\}$. Let $l_i$ be the length of the string $s'_i$, i.e., the number of characters in $s'_i$.

Situation 1: The length of every string $s'_i$ is not smaller than $q$. Then all $k + 1$ strings can generate $\sum_{i=1}^{k+1} (l_i - q + 1)$ $q$-grams in total. Since $k + \sum_{i=1}^{k+1} l_i = |s_2|$, we obtain the following inequality:

$$|s_2| - k + (k + 1) \cdot (1 - q) \le \lceil (|s_2| - q + 1) \cdot \omega \rceil - 1 \tag{25}$$
Then, we obtain the range of $k$ as the following inequality:

$$k \ge \frac{2 - q + |s_2| - \lceil (|s_2| - q + 1) \cdot \omega \rceil}{q} \tag{26}$$

Situation 2: The lengths of some (say $z$) strings are smaller than $q$. Denote the strings as $\{s'_1, \ldots, s'_z, \ldots, s'_{k+1}\}$, where the first $z$ strings cannot generate any $q$-grams. Similar to the analysis of Situation 1, all $k + 1$ strings can generate $\sum_{i=z+1}^{k+1} (l_i - q + 1)$ $q$-grams in total. Thus we obtain the following inequality:

$$|s_2| - k + (k + 1 - z) \cdot (1 - q) - \sum_{i=1}^{z} l_i \le \lceil (|s_2| - q + 1) \cdot \omega \rceil - 1$$

Since $\sum_{i=1}^{z} l_i \le (q - 1) \cdot z$, we again obtain Inequality 25 in this situation. That is, both situations lead to Inequality 26.

Notice that $k$ is the number of characters removed from $s_2$, so the similarity between $s'$ and $s_2$ is $1 - \frac{k}{|s_2|}$.

If the condition (*) does not hold, $s_2$ is divided into fewer than $k + 1$ strings. This scenario can be treated as a special case of Situation 2: we supplement some empty strings and then apply the method of Situation 2.

In conclusion, if $s_1$ and $s_2$ are not in any block, their edit similarity is not larger than

$$1 - \frac{2 - q + \max(|s_1|, |s_2|) - \lceil (\max(|s_1|, |s_2|) - q + 1) \cdot \omega \rceil}{q \cdot \max(|s_1|, |s_2|)}$$

Thus, Lemma 2 is proved.

E. PROOF OF THEOREM 3
PROOF OF THEOREM 3. Suppose there is a record pair which is not covered by any block under all the indices and whose matching likelihood is not smaller than $\tau_{pR}$. Then its matching likelihood is no larger than $\tau_{pR}$ according to the analysis in Sec. 5.2.1. This contradicts the initial assumption, so all possible matched pairs can be covered.

We also need to prove that the number of blocks is minimum. It suffices to prove that the number of blocks is minimum under each linked attribute pair. Although a higher $w$ under an attribute pair leads to fewer blocks, $w_i$ must satisfy Eq. 12. In order to cover all the possible matched pairs, the $w_i$ settings are boundary values. So the total number of blocks is minimum when $w_1$ satisfies Eq. 15 and $w_i$ ($1 \le i \le m$) satisfies Eq. 12. Thus, Theorem 3 is proved.

F. TIME COMPLEXITY ANALYSIS OF THE GREEDY ALGORITHM
Since the number of blocks often depends on the structure of the data sets, and the bounding method is applied in our algorithm, we need to make some assumptions in the following analysis.

Suppose there are $k$ linked attribute pairs under which the indices are finally built. The number of records in $T_1$ is $m$ and the number of records in $T_2$ is $n$. If two records are considered as a candidate record pair, they should be in the same block under at least $k'$ different indices (the value of $k'$ depends on the value of $\tau_{pR}$, and $k' \le k$).

A string $s$ produces

$$W(s) = \sum_{i=l}^{|s|-q+1} \binom{|s| - q + 1}{i}$$

$(\ell, q)$-seqs. That is to say, $s$ can be in $W(s)$ blocks under an index. Denote by $w_i$ the average number of blocks that each record falls in under the index $I_i$, by $B_i$ the total number of blocks under the index $I_i$, and by $h_i$ the number of different strings under the attribute pair $A_i$.

Then we have $h_i < B_i \le h_i \cdot w_i$. This inequality corresponds to two extreme situations: 1) the $(\ell, q)$-seqs generated by all the $h_i$ strings overlap as much as possible (note that $B_i$ cannot be equal to $h_i$ unless each string generates only one $(\ell, q)$-seq); 2) the $(\ell, q)$-seqs generated by any two strings have no overlaps, which gives the upper bound $B_i \le h_i \cdot w_i$. For ease of presentation, we suppose the frequencies of values under an attribute follow a uniform distribution, i.e., each block contains the same number of records. Denote the number of record comparisons under the index $I_i$ by $C_i$. If we suppose all the records in $T_1$ and $T_2$ can be covered by the blocks in $I_i$ (note that this is the worst case, which needs the maximum number of record comparisons), we can estimate $C_i$ by the following inequality:

$$\frac{mn\,w_i}{h_i} < C_i \le \frac{mn\,w_i^2}{h_i} \tag{27}$$

We then discuss the number of record comparisons under $k$ indices.

Suppose that two records are taken as a candidate record pair if they are in the same block under at least $k - k'$ different indices, denoted by $\{I_{k'+1}, I_{k'+2}, \ldots, I_k\}$. Then at most $k'$ indices, denoted by $\{I_1, I_2, \ldots, I_{k'}\}$, are unconstrained. We take $I_1$ as the base index and analyse the influence of the remaining $k'$ indices. The first point is that all the records have been covered by the blocks under $I_{k'+1}$, so more indices will not bring in more records. The second point is that these indices are independent, so in the worst situation we may have $C_{k'+1} + 2mn \cdot \sum_{j=1}^{k-k'} \frac{1}{h_j}$. That is, the number of record comparisons increases by at most $\frac{m}{h_j}$ or $\frac{n}{h_j}$ for each record in $I_{k-k'}$.

Next, we discuss the index set $\{I_{k'+1}, I_{k'+2}, \ldots, I_k\}$. Note that all the candidate record pairs should be in the same block under the $k - k'$ indices. It is quite complex to represent the number of record comparisons by the above symbols, so we suppose that $m'$ records in $T_1$ and $n'$ records in $T_2$ have been retained under the indices in $\{I_{k'+1}, I_{k'+2}, \ldots, I_k\}$. Thus, the total number of record comparisons, denoted by $T$, satisfies

$$\frac{m'n'\,w_i}{h_i} < T \le \frac{m'n'\,w_i^2}{h_i} + 2mn \cdot \sum_{j=1}^{k-k'} \frac{1}{h_j} \tag{28}$$

Based on the above "what-if" scenarios, we find that the total number of record comparisons often depends on the values of $m'$ and $n'$. Notice that the above analysis is based almost entirely on worst-case assumptions, and $m'$ and $n'$ are often very small in practice. Thus the value of $T$ is often quite small in real applications.
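To make the bounds on the number of record comparisons in the greedy-algorithm analysis (Inequalities 27 and 28) concrete, the sketch below evaluates both for a set of assumed values. The function names, the parameter names (`m_ret` and `n_ret` for the retained counts $m'$ and $n'$, `h_extra` for the $h_j$ values in the sum), and all sample numbers are hypothetical and only serve as an illustration.

```python
from typing import Sequence, Tuple


def single_index_bounds(m: float, n: float, w: float, h: float) -> Tuple[float, float]:
    """Bounds of Inequality 27: m*n*w/h < C_i <= m*n*w^2/h."""
    lower = m * n * w / h
    upper = m * n * w ** 2 / h
    return lower, upper


def total_bounds(
    m_ret: float,
    n_ret: float,
    w: float,
    h: float,
    m: float,
    n: float,
    h_extra: Sequence[float],
) -> Tuple[float, float]:
    """Bounds of Inequality 28 on the total number of record comparisons T."""
    lower, upper = single_index_bounds(m_ret, n_ret, w, h)
    # Worst-case extra term: 2*m*n * sum_j (1 / h_j)
    extra = 2 * m * n * sum(1.0 / hj for hj in h_extra)
    return lower, upper + extra


# Assumed sample values, for illustration only.
c_low, c_high = single_index_bounds(m=1000, n=800, w=3, h=200)
t_low, t_high = total_bounds(
    m_ret=50, n_ret=40, w=3, h=200, m=1000, n=800, h_extra=[200, 400]
)
print(c_low, c_high, t_low, t_high)
```

In this example the term with $m'$ and $n'$ is small compared with the worst-case extra term, which illustrates how strongly the estimate depends on the worst-case assumptions of the analysis.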