Rules for Approximate String Matching
R.C.T. Lee
Rule 1
Consider two substrings A1 and A2 as shown below:
[Figure: A1 is divided into P1 followed by S1; A2 is divided into P2 followed by S2.]
If ed(A1, A2) ≦k and S1=S2, then ed(P1, P2) ≦k.
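Rule 1 can be checked on small inputs with a short script. The sketch below assumes that ed is the unit-cost Levenshtein distance and that A1 = P1 + S and A2 = P2 + S share the suffix S; the function names `edit_distance` and `check_rule_1` are illustrative and do not come from the cited papers.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (assumed meaning of ed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def check_rule_1(p1: str, p2: str, s: str, k: int) -> bool:
    """Return True when Rule 1 holds for A1 = p1 + s and A2 = p2 + s."""
    if edit_distance(p1 + s, p2 + s) <= k:
        return edit_distance(p1, p2) <= k
    return True  # the premise fails, so the rule makes no claim


if __name__ == "__main__":
    print(check_rule_1("abc", "abd", "xyz", 1))
```

In practice the rule lets a matcher strip a common suffix before computing the distance of the remaining prefixes.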
• Rule 1: [AKLLLR2000], [H2005],
[HHLS2006], [JB2000], [LV89], [NB99],
[NB2000], [S80], [TU93], and [WM92].
Rule 2
[Figure: string A aligned with string B; B has length m.]
Let m be the length of B.
If ed(A, B) ≦k, then the length of A must be
between m-k and m+k.
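Rule 2 gives a constant-time filter that can run before any distance computation. The following is a minimal sketch; the function name `passes_length_filter` is an illustrative assumption.

```python
def passes_length_filter(a: str, b: str, k: int) -> bool:
    """Rule 2: ed(a, b) <= k is only possible if |len(a) - len(b)| <= k."""
    m = len(b)
    return m - k <= len(a) <= m + k


if __name__ == "__main__":
    candidates = ["cat", "cart", "carpet", "c"]
    # Keep only the strings whose length allows ed(candidate, "cat") <= 1.
    survivors = [w for w in candidates if passes_length_filter(w, "cat", 1)]
    print(survivors)
```

A string that fails the filter can be discarded immediately; a string that passes still has to be verified.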
• Rule 2: [FN2004], [NB99], [NB2000] and
[TU93].
Rule 3
[Figure: S1 containing the substring S1’, compared against P.]
If S1 contains S1’ completely and the distance
between S1’ and any substring of P is larger
than k, then ed(S1, P)>k.
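The premise of Rule 3 needs the smallest distance between S1’ and any substring of P. One common way to obtain it is a dynamic program whose first row is all zeros, so that a match may start anywhere in P. The sketch below assumes unit-cost edit distance; the names `min_substring_distance` and `rule_3_rejects` are illustrative.

```python
def min_substring_distance(x: str, p: str) -> int:
    """Smallest ed(x, S) over all substrings S of p, including the empty one."""
    # Row 0 is all zeros: an occurrence may start at any position of p.
    prev = [0] * (len(p) + 1)
    for i, cx in enumerate(x, start=1):
        curr = [i]  # x[:i] against the empty substring costs i deletions
        for j, cp in enumerate(p, start=1):
            cost = 0 if cx == cp else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cp
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    # An occurrence may also end at any position of p.
    return min(prev)


def rule_3_rejects(s1_part: str, p: str, k: int) -> bool:
    """True when every substring of p is farther than k from s1_part,
    so any string containing s1_part has distance to p larger than k."""
    return min_substring_distance(s1_part, p) > k


if __name__ == "__main__":
    print(min_substring_distance("abc", "xxaxcxx"))
```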
• Rule 3: [ALP2004].
Rule 4
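[Figure: before the shift, S1 lies in T and S2 lies in P to the left of S1; after the shift, P is moved so that S2 is aligned with S1.]
For any substring S1 in T, if there exists a substring S2 in
P to the left of S1 such that ed(S1, S2) ≦k, and S2 is the
rightmost such substring, then move P to align S1 and S2.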
• Rule 4: [ALP2004].
Based upon Rule 3 and Rule 2, we have Rule 5
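[Figure: P aligned with T; a window of size m-k in T contains a substring S1.]
If the window size is (m-k) and there exists a substring
S1 in the window such that the distance between S1 and
any substring of P is larger than k, then we can safely
move P as follows:
[Figure: P shifted along T relative to the window of size m-k containing S1.]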
If Rule 5 is not satisfied, it means the following:
For every substring S1 in the window of T, there exists a
substring S2 in P such that ed(S1, S2) ≦k.
Rule 5-1
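[Figure: P aligned with T; a window of size m-k in T contains a substring S1.]
If Rule 5 is not satisfied, we can only move
P by 1 step as follows:
[Figure: P shifted by one position along T.]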
• Rule 5: [HN2005].
Rule 6
For two strings A and B of equal length,
Hamming Distance(A, B) ≧ Edit Distance(A, B).
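A small example makes the gap in Rule 6 visible. The sketch below assumes unit-cost Levenshtein distance for the edit distance and restricts the Hamming distance to equal-length strings; both function names are illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    a, b = "abcde", "bcdea"
    # Every position differs, but one deletion plus one insertion suffices.
    print(hamming_distance(a, b), edit_distance(a, b))  # prints "5 2"
```

Because substitutions alone already transform A into B, the Hamming distance is always an upper bound; a shifted string such as the one above shows that the bound can be far from tight.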
• Rule 6: [AKLLLR2000], [FN2004] and
[TU93].
Rule 7
For strings A and B, if there are k+1 characters in A
which do not appear in B, then ed(A, B)>k.
Rule 7-1
Let A and B be two strings. Let there be k+1 characters
a1, a2, …, ak+1 in A, where ai is aligned with bi in B. If
no ai appears in B[i-k, i+k], then ed(A, B)>k.
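Rules 7 and 7-1 can both be written as counting filters. The sketch below counts occurrences in A (so a repeated missing character counts each time) and uses 0-based positions with an inclusive range B[i-k, i+k]; these conventions and the function names are assumptions made for illustration.

```python
def rule_7_rejects(a: str, b: str, k: int) -> bool:
    """Rule 7: at least k+1 positions of a hold a character absent from b."""
    chars_of_b = set(b)
    missing = sum(1 for ch in a if ch not in chars_of_b)
    return missing >= k + 1


def rule_7_1_rejects(a: str, b: str, k: int) -> bool:
    """Rule 7-1: at least k+1 positions i of a whose character does not
    occur in b[i-k .. i+k] (inclusive, clipped to the bounds of b)."""
    bad = 0
    for i, ch in enumerate(a):
        window = b[max(0, i - k): i + k + 1]
        if ch not in window:
            bad += 1
    return bad >= k + 1


if __name__ == "__main__":
    print(rule_7_rejects("axyz", "abcd", 2))
    print(rule_7_1_rejects("abcabc", "cabcab", 0))
```

Rule 7-1 is the stronger filter, since a character may occur in B but too far away from position i to be matched within k errors.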
• Rule 7: [TU93].
Rule 8
Let there be two strings A and B. Let B be divided
into j pieces B1, B2, …, Bj. If ed(A, B) ≦k, there is at
least one piece Bi and a substring Ai in A such that
ed(Ai, Bi) ≦ ⌊k/j⌋.
Rule 8-1
Let A and B be two strings. Let B be divided into j
pieces B1, B2, …, Bj. If for every Bi and every
substring S of A, ed(S, Bi) > ⌊k/j⌋, then ed(A, B)>k.
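Rule 8-1 can be turned into a filter by splitting B and searching each piece in A with a reduced error threshold. The sketch below takes the threshold to be ⌊k/j⌋ and splits B into pieces whose lengths differ by at most one; the splitting scheme and all function names are assumptions made for illustration.

```python
def min_substring_distance(x: str, text: str) -> int:
    """Smallest unit-cost edit distance between x and any substring of text."""
    prev = [0] * (len(text) + 1)  # a match may start anywhere in text
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, ct in enumerate(text, start=1):
            cost = 0 if cx == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return min(prev)  # a match may end anywhere in text


def split_into_pieces(b: str, j: int) -> list[str]:
    """Divide b into j consecutive pieces whose lengths differ by at most 1."""
    base, extra = divmod(len(b), j)
    pieces, start = [], 0
    for idx in range(j):
        size = base + (1 if idx < extra else 0)
        pieces.append(b[start:start + size])
        start += size
    return pieces


def rule_8_1_rejects(a: str, b: str, k: int, j: int) -> bool:
    """True when no piece of b occurs in a with at most floor(k / j) errors."""
    threshold = k // j
    return all(min_substring_distance(piece, a) > threshold
               for piece in split_into_pieces(b, j))


if __name__ == "__main__":
    print(rule_8_1_rejects("xxxxxxxx", "abcdef", 1, 2))
```

With j = k+1 the threshold becomes 0, so each piece can be located by exact matching, which is usually much cheaper than approximate search.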
Rule 8-2
Let A and B be two strings.
Let the lengths of A and B be m+k and m respectively.
Let B be divided into j pieces B1, B2, …, Bj.
Let AP be any prefix of A.
If for every Bi and every substring S of A,
ed(S, Bi) > ⌊k/j⌋, then ed(AP, B)>k.
• Rule 8: [NB99] and [NB2000].
Rule 9
Let A and B be two strings with lengths m+k and m
respectively.
Let A’ be the prefix of A with length m-k.
Let there be j characters a1, a2, …, aj in A’.
Let the number of times that ai appears in A’ and B be
N(A’, ai) and N(B, ai) respectively.
Let Ci=N(A’, ai)-N(B, ai). Let AP be any prefix of A.
If $\sum_{C_i > 0} C_i > k$, then ed(AP, B)>k.
Rule 9-1
Let A and B be two strings with lengths m+k and m
respectively.
Let there be j characters a1, a2, …, aj in A.
Let the number of times that ai appears in A and B be
N(A, ai) and N(B, ai) respectively.
Let Ci=N(B, ai)-N(A, ai). Let AP be any prefix of A.
If $\sum_{C_i > 0} C_i > k$, then ed(AP, B)>k.
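Rules 9 and 9-1 both compare character counts and sum only the positive differences. The sketch below reads the condition as "the sum of the positive Ci exceeds k" and relies on `collections.Counter` subtraction, which discards zero and negative counts; the function names are illustrative.

```python
from collections import Counter


def rule_9_rejects(a: str, b: str, k: int) -> bool:
    """Rule 9: the prefix A' of a (length m-k) holds more than k characters
    in excess of what b can supply."""
    m = len(b)
    a_prefix = a[:max(0, m - k)]
    # Counter subtraction keeps only the positive differences Ci.
    surplus = Counter(a_prefix) - Counter(b)
    return sum(surplus.values()) > k


def rule_9_1_rejects(a: str, b: str, k: int) -> bool:
    """Rule 9-1: b holds more than k characters in excess of what the
    whole of a can supply."""
    deficit = Counter(b) - Counter(a)
    return sum(deficit.values()) > k


if __name__ == "__main__":
    print(rule_9_rejects("xxyzab", "abcde", 1))
    print(rule_9_1_rejects("xxxxxx", "abcde", 1))
```

Both checks run in time linear in the string lengths, so they are cheap enough to apply at every text window before a full verification.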
Rule 10
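[Figure: P matches a substring P’ of T at position i; the region of T from i-k to i+m+k has length m+2k.]
Let P and T be two strings with lengths m and n
respectively.
If P matches with a substring P’ of T at position i,
any substring S of T[i-k, i+m+k] may satisfy
ed(S, P) ≦k, so this region must be verified.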
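Rule 10 restricts verification to a region of length about m+2k around a candidate position. A minimal sketch follows, assuming 0-based positions, a half-open slice for the region, and unit-cost edit distance; the function names are illustrative and the way the candidate position i is found is left outside the sketch.

```python
def min_substring_distance(x: str, text: str) -> int:
    """Smallest unit-cost edit distance between x and any substring of text."""
    prev = [0] * (len(text) + 1)  # a match may start anywhere in text
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, ct in enumerate(text, start=1):
            cost = 0 if cx == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return min(prev)  # a match may end anywhere in text


def verify_candidate(t: str, p: str, i: int, k: int) -> bool:
    """Check whether some substring of the region around position i of t
    is within k errors of p."""
    m = len(p)
    region = t[max(0, i - k): i + m + k]  # clipped to the bounds of t
    return min_substring_distance(p, region) <= k


if __name__ == "__main__":
    print(verify_candidate("aaaaHELXOaaaa", "HELLO", 4, 1))
```

Only the region is passed to the dynamic program, so the cost of one verification depends on m and k and not on the length n of T.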
• Rule 10: [NB99].
Rule 11
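Let P and Q be two strings.
Let P be divided into n pieces P1, P2, …, Pn.
For each i, let Qi be a substring of Q for which ed(Pi, Qi)
is the smallest.
[Figure: P divided into P1, P2, …, Pn, with the corresponding substrings Q1, Q2, …, Qn marked in Q.]
If $\sum_{i=1}^{n} ed(P_i, Q_i) > k$, then $ed(P, Q) > k$.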
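Rule 11 yields a lower bound on ed(P, Q): the sum, over the pieces of P, of each piece's smallest distance to any substring of Q. The sketch below reads the condition as "the sum exceeds k" and takes the pieces as given; the function names are illustrative.

```python
def min_substring_distance(x: str, text: str) -> int:
    """Smallest unit-cost edit distance between x and any substring of text."""
    prev = [0] * (len(text) + 1)  # a match may start anywhere in text
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, ct in enumerate(text, start=1):
            cost = 0 if cx == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return min(prev)  # a match may end anywhere in text


def rule_11_lower_bound(pieces: list[str], q: str) -> int:
    """Sum of the best achievable distance of each piece within q."""
    return sum(min_substring_distance(piece, q) for piece in pieces)


def rule_11_rejects(pieces: list[str], q: str, k: int) -> bool:
    """True when the lower bound already exceeds k."""
    return rule_11_lower_bound(pieces, q) > k


if __name__ == "__main__":
    print(rule_11_lower_bound(["abc", "def"], "abzdxf"))  # prints 2
```

The bound can be accumulated piece by piece, so the scan may stop as soon as the running sum exceeds k.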
Application of Rule 11
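[Figure: a window W of T containing substrings t1, t2, …, tn, with P divided into pieces P1, P2, …, Pn.]
For each i, ti is a substring of W for which ed(ti, Pi) is the smallest.
If for some n, $\sum_{i=1}^{n} ed(t_i, P_i) > k$, then $ed(W, P) > k$.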
• [AKLLLR2000] Text Indexing and Dictionary
Matching with One Error, Amir, A., Keselman, D.,
Landau, G. M., Lewenstein, M., Lewenstein, N. and
Rodeh, M., Journal of Algorithms, Vol. 37, 2000,
pp. 309-325.
• [ALP2004] Faster Algorithms for String Matching
with k Mismatches, Amir, A., Lewenstein, M. and
Porat, E., Journal of Algorithms, Vol. 50, 2004,
pp. 257-275.
• [FN2004] Average-Optimal Multiple Approximate
String Matching, Kimmo Fredriksson and Gonzalo
Navarro, ACM Journal of Experimental Algorithmics,
Vol. 9, Article No. 1.4, 2004, pp. 1-47.
• [GG86] Improved String Matching with k
Mismatches, Galil, Z. and Giancarlo, R., SIGACT
News, Vol. 17, No. 4, 1986, pp. 52-54.
• [H2005] Bit-parallel approximate string matching
algorithms with transposition, Heikki Hyyrö, Journal
of Discrete Algorithms, Vol. 3, 2005, pp. 215-229.
• [HHLS2006] Approximate String Matching Using
Compressed Suffix Arrays, Trinh N. D. Huynh, W. K.
Hon, T. W. Lam and W. K. Sung, Theoretical
Computer Science, Vol. 352, 2006, pp. 240-249.
• [HN2005] Bit-parallel Witnesses and their
Applications to Approximate String Matching,
Heikki Hyyrö and Gonzalo Navarro,
Algorithmica, Vol. 4, No. 3, 2005, pp. 203-231.
• [JB2000] Approximate string matching using
factor automata, Jan Holub, Borivoj Melichar,
Theoretical Computer Science 249, 2000, pp.
305-311.
• [LV86] String Matching with k Mismatches by
Using Kangaroo Method, Landau, G.M., and
Vishkin, U., Theoret. Comput Sci 43, 1986, pp.
239-249.
• [LV89] Fast Parallel and Serial Approximate
String Matching, G. Landau and U. Vishkin,
Journal of Algorithms, Vol. 10, 1989, pp. 157-169.
• [NB99] Very fast and simple approximate
string matching, G. Navarro and R. Baeza-Yates,
Information Processing Letters, Vol. 72,
1999, pp. 65-70.
• [NB2000] A Hybrid Indexing Method for
Approximate String Matching, Gonzalo
Navarro and Ricardo Baeza-Yates, Vol. 1, No. 1,
2000, pp. 205-239.
• [S80] String Matching with Errors, Sellers, P.
H., Journal of Algorithms, Vol. 20, No. 1, 1980,
pp. 359-373.
• [TU93] Approximate Boyer-Moore String
Matching, J. Tarhio and E. Ukkonen, SIAM
Journal on Computing, Vol. 22, No. 2, 1993,
pp. 243-260.
• [WM92] Fast Text Searching: Allowing Errors,
Sun Wu and Udi Manber, Communications of
the ACM, Vol. 35, 1992, pp. 83-91.