Faster algorithm for computing the edit distance

Faster algorithm for computing the edit distance
between SLP-compressed strings
Paweł Gawrychowski
Max-Planck-Institut für Informatik
October 24, 2012
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
1 / 18
We consider the question of computing the edit distance between two
strings.
Edit distance
Given two strings a and b, what is the minimum number of
insert/delete/replace operations required to transform a into b?
Observation
Computing the edit distance between a and b is equivalent to finding
the length of longest common subsequence of a and b.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
2 / 18
We consider the question of computing the edit distance between two
strings.
Edit distance
Given two strings a and b, what is the minimum number of
insert/delete/replace operations required to transform a into b?
Observation
Computing the edit distance between a and b is equivalent to finding
the length of longest common subsequence of a and b.
a = ATGCCGAC
b = CAGACTAGA
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
2 / 18
We consider the question of computing the edit distance between two
strings.
Edit distance
Given two strings a and b, what is the minimum number of
insert/delete/replace operations required to transform a into b?
Observation
Computing the edit distance between a and b is equivalent to finding
the length of longest common subsequence of a and b.
a = AT GCCGAC
b = CAGACTAGA
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
2 / 18
The question is: how quickly can be compute this number?
Use dynamic programming!
For each i and j compute the longest common subsequence of a[1..i]
and b[1..j].
Which results in a very simple quadratic time algorithm. More
precisely, the running time is O(|a||b|).
Is there anything faster?
Can be improved by either using bit parallelism, or assuming that
some parameter (edit distance, number of matches, . . . ) is small.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
3 / 18
A natural direction is to consider the case when the input strings a and
b are highly compressible. The question is: what method of
compression should we assume?
Straight-line programs, or grammar compression
A context-free grammar in Chomsky normal form with exactly one
production for each nonterminal, hence generating exactly one string.
Fibonacci words
F0 = 1
F1 = 0
F0 = 1
F2 = 01
F1 = 0
F3 = 010
Fn+2 = Fn+1 Fn for all n ≥ 0 F4 = 01001
F5 = 01001010
F6 = 0100101001001
F7 = 010010100100101001010
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
4 / 18
A natural direction is to consider the case when the input strings a and
b are highly compressible. The question is: what method of
compression should we assume?
Straight-line programs, or grammar compression
A context-free grammar in Chomsky normal form with exactly one
production for each nonterminal, hence generating exactly one string.
Fibonacci words
F0 = 1
F1 = 0
F0 = 1
F2 = 01
F1 = 0
F3 = 010
Fn+2 = Fn+1 Fn for all n ≥ 0 F4 = 01001
F5 = 01001010
F6 = 0100101001001
F7 = 010010100100101001010
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
4 / 18
A natural direction is to consider the case when the input strings a and
b are highly compressible. The question is: what method of
compression should we assume?
Straight-line programs, or grammar compression
A context-free grammar in Chomsky normal form with exactly one
production for each nonterminal, hence generating exactly one string.
Fibonacci words
F0 = 1
F1 = 0
F0 = 1
F2 = 01
F1 = 0
F3 = 010
Fn+2 = Fn+1 Fn for all n ≥ 0 F4 = 01001
F5 = 01001010
F6 = 0100101001001
F7 = 010010100100101001010
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
4 / 18
Why this compression method?
1
as powerful as all Lempel-Ziv schemes (i.e., LZ77 can be
converted into SLP with small size increase),
2
easy to work with!
Problem
We are given two strings a and b described by two SLPs.
n is the total number of rules in both SLPs
N is the total length of both words
How quickly can we compute LCS(a, b)?
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
5 / 18
Why this compression method?
1
as powerful as all Lempel-Ziv schemes (i.e., LZ77 can be
converted into SLP with small size increase),
2
easy to work with!
Problem
We are given two strings a and b described by two SLPs.
n is the total number of rules in both SLPs
N is the total length of both words
How quickly can we compute LCS(a, b)?
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
5 / 18
Whenever we are working with compressed data, the goal is to
achieve running time depending just on the size of the compressed
representation. Unfortunately, in our case this is not possible.
Lifshits and Lohrey MFCS 2006
Checking if a is a subsequence of b is N P-hard.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
6 / 18
Lifshits CPM 2007
...but maybe O(nN) running time would be possible?
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
7 / 18
Lifshits CPM 2007
...but maybe O(nN) running time would be possible?
Tiskin Journal of Mathematical Sciences 158(5), 759–769 (2009)
O(nN 1.5 ) is possible!
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
7 / 18
Lifshits CPM 2007
...but maybe O(nN) running time would be possible?
Tiskin Journal of Mathematical Sciences 158(5), 759–769 (2009)
O(nN 1.5 ) is possible!
Hermelin, Landau, Landau and Weimann STACS 2009
O(n1.4 N 1.2 ) is possible!
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
7 / 18
Lifshits CPM 2007
...but maybe O(nN) running time would be possible?
Tiskin SODA 2010
Well, actually O(nN log N) is possible!
Hermelin, Landau, Landau and Weimann STACS 2009
O(n1.4 N 1.2 ) is possible!
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
7 / 18
Lifshits CPM 2007
...but maybe O(nN) running time would be possible?
Tiskin SODA 2010
Well, actually O(nN log N) is possible!
Hermelin, Landau, Landau and Weimann STACS 2009 (full)
...finally, O(nN log Nn ) is possible.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
7 / 18
Lifshits CPM 2007
...but maybe O(nN) running time would be possible?
Tiskin SODA 2010
Well, actually O(nN log N) is possible!
Hermelin, Landau, Landau and Weimann STACS 2009 (full)
...finally, O(nN log Nn ) is possible.
This paper
O(nN
q
log Nn ) time algorithm.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
7 / 18
The aforementioned papers actually solve a more general weighted
version of the problem in the same complexity, for any rational scoring
function.
The same is true for the improvement, but today we will focus just on
the unweighted case.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
8 / 18
High-level idea
Same as in the previous solutions: look at the alignment dag...
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
9 / 18
High-level idea
Same as in the previous solutions: look at the alignment dag...
A
B
C
B
A
A
B
A
C
B
B
A
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
9 / 18
... and notice that if the strings are highly repetitive, many fragments of
the dag look somehow similar, and we can avoid duplicating work.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
10 / 18
... and notice that if the strings are highly repetitive, many fragments of
the dag look somehow similar, and we can avoid duplicating work.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
10 / 18
... and notice that if the strings are highly repetitive, many fragments of
the dag look somehow similar, and we can avoid duplicating work.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
10 / 18
... and notice that if the strings are highly repetitive, many fragments of
the dag look somehow similar, and we can avoid duplicating work.
X1
X2
X3
X4
X5
X6
x
X10
X20
X30
X40
X50
X60
x
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
10 / 18
To make life simple, we want to partition the whole dag into blocks
corresponding to pairs of nonterminals.
Theorem
Given a SLP and a parameter x, we can (quickly) construct a new SLP
of roughly the same size and with the following two properties:
1
each nonterminal describes a string of length at most x,
2
the original string can be represented as a concatenation of
roughly Nx new nonterminals.
x will be chosen later.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
11 / 18
To make life simple, we want to partition the whole dag into blocks
corresponding to pairs of nonterminals.
Theorem
Given a SLP and a parameter x, we can (quickly) construct a new SLP
of roughly the same size and with the following two properties:
1
each nonterminal describes a string of length at most x,
2
the original string can be represented as a concatenation of
roughly Nx new nonterminals.
x will be chosen later.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
11 / 18
To make life simple, we want to partition the whole dag into blocks
corresponding to pairs of nonterminals.
Theorem
Given a SLP and a parameter x, we can (quickly) construct a new SLP
of roughly the same size and with the following two properties:
1
each nonterminal describes a string of length at most x,
2
the original string can be represented as a concatenation of
roughly Nx new nonterminals.
x will be chosen later.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
11 / 18
We want to compute only the values on the boundaries between
blocks. Consider the situation around a single block.
We need to take a closer look at how the values on the right and
bottom boundary (outputs) depend on the values on the left and top
boundary (inputs).
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
12 / 18
We want to compute only the values on the boundaries between
blocks. Consider the situation around a single block.
I0
I1
I2
I3
I4
I5
I6
O12
I−1
O11
I−2
O10
I−3
O9
I−4
O8
I−5
I−6
O0
O7
O1
O2
O3
O4
O5
O6
We need to take a closer look at how the values on the right and
bottom boundary (outputs) depend on the values on the left and top
boundary (inputs).
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
12 / 18
We want to compute only the values on the boundaries between
blocks. Consider the situation around a single block.
I0
I1
I2
I3
I4
I5
I6
O12
I−1
O11
I−2
O10
I−3
O9
I−4
O8
I−5
I−6
O0
O7
O1
O2
O3
O4
O5
O6
We need to take a closer look at how the values on the right and
bottom boundary (outputs) depend on the values on the left and top
boundary (inputs).
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
12 / 18
Let Hα,β (i, j) be the highest scoring path from the i-th input to the j-th
output of the block corresponding to words α and β. Then the matrix
Hα,β has a very simple structure:
Hα,β (i, j) = j − i − P Σ (i, j)
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
13 / 18
Let Hα,β (i, j) be the highest scoring path from the i-th input to the j-th
output of the block corresponding to words α and β. Then the matrix
Hα,β has a very simple structure:
Hα,β (i, j) = j − i − P Σ (i, j)
1
1
1
1
1
1
1
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
13 / 18
Let Hα,β (i, j) be the highest scoring path from the i-th input to the j-th
output of the block corresponding to words α and β. Then the matrix
Hα,β has a very simple structure:
Hα,β (i, j) = j − i − P Σ (i, j)
1
1
1
1
(i, j)
1
1
1
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
13 / 18
Hence each Hα,β can be represented in a succinct form by simply
storing the nonzeroes of the permutation matrix. The question is how
to compute this representation efficiently?
Theorem (Tiskin SODA 2010)
Given the representation of Hα0 ,β and Hα00 ,β , we can compute the
representation of Hα0 α00 ,β in O(x log x) time, where |α0 |, |α00 |, |β| ≤ x.
Note that Hα0 α00 ,β is simply the max-product of Hα0 ,β and Hα00 ,β .
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
14 / 18
Hence each Hα,β can be represented in a succinct form by simply
storing the nonzeroes of the permutation matrix. The question is how
to compute this representation efficiently?
Theorem (Tiskin SODA 2010)
Given the representation of Hα0 ,β and Hα00 ,β , we can compute the
representation of Hα0 α00 ,β in O(x log x) time, where |α0 |, |α00 |, |β| ≤ x.
Note that Hα0 α00 ,β is simply the max-product of Hα0 ,β and Hα00 ,β .
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
14 / 18
Theorem
Given the representation of Hα,β and the vector u storing the input
values, we can compute the vector v storing the output values in O(x)
time, where |α|, |β| ≤ x.
Note that v is simply the max-product of Hα,β and u.
This assumes the word RAM model with w ∈ Ω(log n).
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
15 / 18
Theorem
Given the representation of Hα,β and the vector u storing the input
values, we can compute the vector v storing the output values in O(x)
time, where |α|, |β| ≤ x.
Note that v is simply the max-product of Hα,β and u.
This assumes the word RAM model with w ∈ Ω(log n).
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
15 / 18
Fast matrix-vector multiplication
We reduce the question to a very simple data structure problem, which
is to maintain an array of length x under two operations:
1
decreasing all values in some prefix of the array by one,
2
computing the maximum in the whole array.
This can be reduced further to an even simpler problem, which is to
maintain a partition of [1, x] in to a number of disjoint segments under
the following two operations:
1
locating the segment a given element belongs to,
2
merging two adjacent segments.
This is known as the interval-union-find problem, and a (not very
complicated) amortized constant time solution is known.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
16 / 18
Fast matrix-vector multiplication
We reduce the question to a very simple data structure problem, which
is to maintain an array of length x under two operations:
1
decreasing all values in some prefix of the array by one,
2
computing the maximum in the whole array.
This can be reduced further to an even simpler problem, which is to
maintain a partition of [1, x] in to a number of disjoint segments under
the following two operations:
1
locating the segment a given element belongs to,
2
merging two adjacent segments.
This is known as the interval-union-find problem, and a (not very
complicated) amortized constant time solution is known.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
16 / 18
Fast matrix-vector multiplication
We reduce the question to a very simple data structure problem, which
is to maintain an array of length x under two operations:
1
decreasing all values in some prefix of the array by one,
2
computing the maximum in the whole array.
This can be reduced further to an even simpler problem, which is to
maintain a partition of [1, x] in to a number of disjoint segments under
the following two operations:
1
locating the segment a given element belongs to,
2
merging two adjacent segments.
This is known as the interval-union-find problem, and a (not very
complicated) amortized constant time solution is known.
Paweł Gawrychowski
Faster edit distance between SLPs
October 24, 2012
16 / 18
Now we only have to combine those two theorems and compute the
total running time, which turns out to be proportional to:
n2 x log x +
choosing x = √ f
log f
, where f =
N
n
N2
x
gives us log x ≤ log f and hence we
get the following bound on the running time:
p
p
N2 p
n f log f +
log f = nN log f = nN
f
2
Paweł Gawrychowski
Faster edit distance between SLPs
r
log
N
n
October 24, 2012
17 / 18
Now we only have to combine those two theorems and compute the
total running time, which turns out to be proportional to:
n2 x log x +
choosing x = √ f
log f
, where f =
N
n
N2
x
gives us log x ≤ log f and hence we
get the following bound on the running time:
p
p
N2 p
n f log f +
log f = nN log f = nN
f
2
Paweł Gawrychowski
Faster edit distance between SLPs
r
log
N
n
October 24, 2012
17 / 18
Now we only have to combine those two theorems and compute the
total running time, which turns out to be proportional to:
n2 x log x +
choosing x = √ f
log f
, where f =
N
n
N2
x
gives us log x ≤ log f and hence we
get the following bound on the running time:
p
p
N2 p
n f log f +
log f = nN log f = nN
f
2
Paweł Gawrychowski
Faster edit distance between SLPs
r
log
N
n
October 24, 2012
17 / 18
1
2
Can we avoid bit manipulation?
q
Can we get rid of the annoying log Nn factor?
Questions?
Paweł Gawrychowski
,
Faster edit distance between SLPs
October 24, 2012
18 / 18