The All-Pairs Suffix-Prefix Matching Problem

William H. A. Tustumi∗ (IC), Guilherme P. Telles (PQ), Felipe A. Louza (PG)
Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil
{tustumi,gpt,louza}@ic.unicamp.br
∗Supported by: CNPq - PIBIC (Grant No 118372/2014-9).
Observation 01: When $r \geq 1$, all SPMs are in $B_1, B_2, \ldots, B_r$.
1. Introduction
Given a collection of $m$ strings $S^1, S^2, \ldots, S^m$, the All-Pairs Suffix-Prefix Matching Problem (APSP) is the problem of finding, for every pair $S^i$ and $S^j$, the longest suffix of $S^i$ that is a prefix of $S^j$ [3].
The APSP has an important application in the context of DNA
sequencing, where the collection of strings represents fragments coming from the sequencing process, and the reconstruction of the original biological sequence is based on overlaps between these fragments [1].
In this work, we present another optimal algorithm which, in our experiments, outperformed the OG algorithm [5] in both time and memory.
• $S^t_i$ is the suffix $S^t[i, |S^t|]$ of $S^t$.
• For a pair of strings $(S^1, S^2)$, the longest suffix-prefix match (SPM) of $(S^1, S^2)$ is the longest suffix of $S^1$ that (exactly) matches a prefix of $S^2$.
[Figure label: $\mathrm{lcp}(S^1[i, |S^1|-1],\ S^2[1, |S^1|-1])$]
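As a point of reference, the definition of the longest SPM can be checked directly by brute force. The sketch below is not the algorithm proposed in this work; `longest_spm` and `all_pairs_spm` are hypothetical helper names, and skipping the pairs $(S^i, S^i)$ is an assumption of the sketch.

```python
from typing import Dict, List, Tuple


def longest_spm(s1: str, s2: str) -> int:
    """Length of the longest suffix of s1 that exactly matches a prefix of s2."""
    # Try candidate lengths from longest to shortest; the first hit is the answer.
    for length in range(min(len(s1), len(s2)), 0, -1):
        if s1[len(s1) - length:] == s2[:length]:
            return length
    return 0  # no non-empty suffix of s1 is a prefix of s2


def all_pairs_spm(strings: List[str]) -> Dict[Tuple[int, int], int]:
    """Brute-force APSP: longest SPM length for every ordered pair (i, j), i != j."""
    return {
        (i, j): longest_spm(si, sj)
        for i, si in enumerate(strings)
        for j, sj in enumerate(strings)
        if i != j  # assumption of this sketch: a string is not paired with itself
    }
```

For example, `longest_spm("ACGT", "GTAC")` returns 2 (the overlap `GT`). Because every pair is re-examined from scratch, this version is useful only as a correctness reference.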
At a glance, our algorithm partitions the enhanced suffix array (ESA) into $m$ blocks, one for each string $S^j \in S$. It then finds all suffixes that are prefixes of $S^j$ by scanning its block, and reuses the solutions obtained so far (in the previous blocks) to compose the complete solution.
We use two lists for each $S^j \in S$: $L_{local}[j]$ and $L_{global}[j]$.
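One way to picture the block partition is to sort all suffixes of all strings and cut the sorted order at the positions where complete strings occur. The sketch below does this naively by sorting Python strings rather than building an ESA with SDSL. It assumes that each block ends at the position of a complete string; the `$` terminator (taken to be smaller than every input character), 0-based positions, the tie-breaking rule, and the names `sorted_suffixes` and `blocks` are likewise assumptions made for illustration.

```python
from typing import List, Tuple

# (suffix text with terminator, string index j, start offset i)
Suffix = Tuple[str, int, int]


def sorted_suffixes(strings: List[str]) -> List[Suffix]:
    """All suffixes of all strings (each terminated by '$') in lexicographic order."""
    entries = [
        ((s + "$")[i:], j, i)
        for j, s in enumerate(strings)
        for i in range(len(s) + 1)
    ]
    # Tie-break (assumption): among equal texts, a complete string (i == 0) goes last.
    entries.sort(key=lambda e: (e[0], e[2] == 0, e[1]))
    return entries


def blocks(esa: List[Suffix]) -> List[Tuple[int, int, int]]:
    """Half-open ranges [start, end) of sorted positions, one per string, with its index."""
    result, start = [], 0
    for pos, (_, j, i) in enumerate(esa):
        if i == 0:  # a complete string closes its block
            result.append((start, pos + 1, j))
            start = pos + 1
    return result
```

For `["ACGT", "GTAC", "TACG"]` the three blocks end at the sorted positions of `ACGT$`, `GTAC$` and `TACG$`.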
Our algorithm is about 2.6 times faster than the OG algorithm [5] (on average).
[Figure: total memory and peak memory (in GB) used by each algorithm, using a threshold $t = 10$.]
[Figure: Suffix-prefix match (SPM).]
• $L_{local}[j]$: SPMs found in $B_r$
• $L_{global}[j]$: SPMs in $B_1, \ldots, B_{r-1}$
Observation 03: Let $\ell = \mathrm{lcp}(P_{r-1}, P_r)$. Every suffix of length $l$ that matches $P_{r-1}$ also matches $P_r$ if $l \leq \ell$.
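The interplay between the lists and Observation 03 can be sketched as follows. This is a simplified illustration, not the implementation evaluated in this work: it sorts suffixes naively instead of using an ESA, compares strings directly instead of reading LCP values, merges $L_{local}[j]$ and $L_{global}[j]$ into a single list per string, and assumes distinct input strings over characters greater than `$`. The names `lcp` and `apsp_by_blocks` are hypothetical.

```python
from typing import Dict, List, Tuple


def lcp(u: str, v: str) -> int:
    """Length of the longest common prefix of u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def apsp_by_blocks(strings: List[str]) -> Dict[Tuple[int, int], int]:
    """Longest SPM length for every ordered pair (i, j), i != j, block by block."""
    m = len(strings)
    entries = [
        ((s + "$")[i:], j, i)
        for j, s in enumerate(strings)
        for i in range(len(s) + 1)
    ]
    # Among equal texts, complete strings (i == 0) are placed last (assumption).
    entries.sort(key=lambda e: (e[0], e[2] == 0, e[1]))

    # matches[j]: lengths of suffixes of strings[j] that are prefixes of the
    # current P_r, in increasing order, so the last element is the longest.
    matches: List[List[int]] = [[] for _ in range(m)]
    block: List[Tuple[str, int]] = []  # suffixes seen since the previous P
    prev = None
    result: Dict[Tuple[int, int], int] = {}

    for text, j, i in entries:
        if i != 0:
            block.append((text[:-1], j))  # drop the terminator
            continue
        p = text[:-1]  # P_r, the complete string strings[j]
        if prev is not None:
            # Reuse step (Observation 03): keep only earlier matches whose
            # length is at most lcp(P_{r-1}, P_r).
            ell = lcp(prev, p)
            for lst in matches:
                while lst and lst[-1] > ell:
                    lst.pop()
        # Local step: suffixes inside the block that are prefixes of P_r.
        for suffix, sj in block:
            if suffix and p.startswith(suffix):
                matches[sj].append(len(suffix))
        for sj in range(m):
            if sj != j:
                result[(sj, j)] = matches[sj][-1] if matches[sj] else 0
        matches[j].append(len(p))  # P_r itself may be a prefix of later strings
        block, prev = [], p
    return result
```

The key `(i, j)` of the returned dictionary holds the length of the longest suffix of `strings[i]` that is a prefix of `strings[j]`; for `["AB", "ABC", "BAB"]`, the entry `(0, 1)` is obtained in the reuse step alone, since the block of `ABC` contains no other suffix.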
2. Background
Observation 02: If two suffixes of $S^j$ match $P_r$, the longest SPM is the suffix closest to $P_r$ in the ESA.
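Observation 02 can be seen on a small example: among the suffixes of one string that are prefixes of $P_r$, lexicographic order coincides with length order once each suffix carries a terminator smaller than every other character. The strings below and the helper name `matching_suffixes` are made up for illustration.

```python
from typing import List


def matching_suffixes(s: str, p: str) -> List[str]:
    """Proper non-empty suffixes of s that are prefixes of p, sorted as suffix + '$'."""
    hits = [s[i:] for i in range(1, len(s)) if p.startswith(s[i:])]
    # '$' sorts before letters, so "A$" < "ABA$" < "ABAC$".
    return sorted(hits, key=lambda x: x + "$")


hits = matching_suffixes("CABA", "ABAC")
# The last hit is the one closest to P_r = "ABAC" in sorted order, and it is the longest.
assert hits == ["A", "ABA"]
assert max(hits, key=len) == hits[-1]
```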
The suffix array SA [4] of a string $S$ of length $n$ is an array of integers in the range 1 to $n$ that provides the lexicographic order of all suffixes of $S$, such that
$S_{SA[1]} < S_{SA[2]} < \cdots < S_{SA[n]}$
The LCP array is an array of integers that stores the length of the longest common prefix (lcp) of each pair of consecutive suffixes in SA, such that $LCP[1] = 0$ and $LCP[i] = \mathrm{lcp}(S_{SA[i]}, S_{SA[i-1]})$ for $1 < i \leq n$, where $\mathrm{lcp}(u, v)$ denotes the length of the longest common prefix of strings $u$ and $v$.
The suffix array together with the LCP array is referred to here as the ESA.

3. Proposed Algorithm
Let $P = \{P_1, P_2, \ldots, P_k\}$ be a sorted set of prefixes in the ESA, such that $P_r < P_{r+1}$.
Let $B_r = [b_{r-1}, b_r)$ be the block of $P_r$ in the ESA, from position $b_{r-1}$ ($b_0 = 0$) to position $b_r - 1$ (the position of $P_r$), such that $0 \leq b_r < n$.
Worst-case time complexity:
Each of the $n$ suffixes is inserted into (and removed from) the list $T[j]$ at most once ⇒ $O(N)$, where $N = n$ is the input size.
There are $k$ strings in $S$ and, for each string, the reuse procedure is executed $k$ times ⇒ $O(k^2)$, with $k = m$.
Therefore, the algorithm runs in $O(N + m^2)$ time, which is optimal because the input size is $N$ and the output size is $m^2$.

4. Experiments
Our algorithm was implemented in C++ using the SDSL library [2], version 2, to construct the ESA. We used the malloc_count library to measure memory usage, and we compared the performance of our algorithm with the OG algorithm [5].
We used real DNA sequences from the EST database of C. elegans. The number of strings used in the experiments varies from 10,000 to 300,000 ESTs.
The OG solution [5] was updated to run with sdsl-lite version 2.
[Figure: number of insert operations (used by our algorithm) and number of push operations (used by OG).]
Our algorithm uses about 15% less memory than OG (on average).
5. Concluding remarks
Performance tests with real data showed that the new algorithm is about 2.6 times faster than OG (on average) while using about 15% less memory.
Our algorithm can be easily parallelized to improve its performance, since each string can be processed independently and all local solutions can then be merged at once.
Another important improvement would be to modify the algorithm to work in a semi-external memory fashion, reducing peak memory, since the ESA can be buffered from disk as needed.
References
[1] Monya Baker. De novo genome assembly: what every
biologist should know. Nature Methods, 9(4):333–337,
March 2012.
[2] Simon Gog, Timo Beller, Alistair Moffat, and Matthias
Petri. From theory to practice: Plug and play with succinct
data structures. In Proc. SEA, pages 326–337, 2014.
[3] Dan Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology.
Cambridge University Press, New York, NY, USA, 1997.
[4] Udi Manber and Eugene W. Myers. Suffix arrays: A new
method for on-line string searches. SIAM J. Comput.,
22(5):935–948, 1993.
[5] Enno Ohlebusch and Simon Gog. Efficient algorithms
for the all-pairs suffix-prefix problem and the all-pairs
substring-prefix problem. Information Processing Letters,
110(3):123–128, January 2010.

[Figure: time (in seconds) spent by each algorithm, using a threshold $t = 10$.]

∗Acknowledgement: The authors would like to thank Simon Gog.
