01 Sergeeva Olga Data Structures for pattern matching

Data structures for Pattern Matching
Suffix trees and suffix arrays are a basic data structure in pattern
matching
Reported by: Olga Sergeeva, Saint Petersburg.
On the materials:
E.Ukkonen. On-Line Construction of Suffix Trees, 1993
Udi Manber and Gene Myers. Suffix Arrays: a New Method For On-Line String
Searches, 1992
Juha Karkkainen and Peter Sanders. Simple Linear Work Suffix Array Construction,
2003.
1
Suffix is a ‘concluding’ substring of the
string.
If the suffixes are are organized
‘well’, the resulting construction
can be very informative and can
provide a good base for developing
fast algorithms of working with
strings.
2
Suffix tree:
Suffix tree of a string is a tree with a
root and marked edges, in which any
concatenation of the marks along each
path from the root to a leaf forms a
suffix and every suffix appears once.
Also, the marks on the edges,
having a common root, begin
with different symbols of our
alphabet.
3
Searching in a suffix tree:
Having a suffix tree for a string S, we can easily find out if
a string M is its substring. This will take us O(log |M|)
operations.
4
Algorithms:
• 1973, Weiner. “Linear Pattern Matching Algorithms”.
• 1976, McCreight. “A Space-Economial Suffix Tree
Construction Algorithm”.
• 1993, Ukkonen. “On-Line Construction of Suffix Trees”.
5
Ukkonen’s algorithm: implicit trees.
In general, for the suffix tree to exist,
we need to add a ‘terminal’ symbol to
the string (let us denote it ‘$’).
Otherwise, we’ll have no means to
identify the ends of the suffixes,
which is desirable many algorithms.
But in the course of work we will
build trees for strings without terminal
symbol. Such trees are called implicit
trees.
6
Ukkonen’s algorithm: idea, general
description.
Let’s divide the process of
constructing the tree into
phases.
Starting with a tree for the
first symbol, we’ll build the
tree inductively, in the end of
each phase having an implicit
tree for a prefix of the string.
7
Ukkonen’s algorithm: possible cases of
extension.
When extending one suffix, we can encounter three situations:
when we are just to extend a mark on an edge; when we are to
add an edge (and maybe to split an existing edge into two);
when nothing is needed to be done.
8
Ukkonen’s algorithm: towards improvement.
Essential: how we search the ends of the suffixes in
.
We can find the end of suffix a in
, walking from the
root each time.
In this case, we will build
from
in
, so the
final tree will appear after
operations, compared to
in the naïve algorithm!
We’ll reduce this to
, using some observations and
techniques.
9
Ukkonen’s algorithm. Heuristics: suffix links.
Suffix link – a pointer from an inner vertex with the path mark xa to
a vertex with mark a, if it exists in a tree.
Every inner vertex has a suffix link. Moreover,
If a vertex v with path mark xa is added to the
tree in the extension j of the phase i+1, then s(v)
either already exists in the tree or will be created
in the next extension, in this phase.
So, any just created vertex has a suffix link to the
ending point of the next extension. Consequently,
in the end of each phase the tree has all its suffix
links.
10
Ukkonen’s algorithm. Heuristics: suffix links.
Using suffix links can substantially reduce amount of search.
If we maintain a pointer to the end of the
longest suffix S[1..i], we will have the ending
point of the previously extended suffix in the
beginning of each extension .
We will not move from the root every time
we search for the end of the next suffix.
Instead, we will get from the available ending
point up to the first inner vertex v (if it’s not
an inner vertex itself), walk along its suffix
link s(v) and search only in the subtree of
s(v).
Notice that he moment we are moving along the suffix link, the depth of v in
tree, exceeds the depth of s(v) by less then 2.
11
Ukkonen’s algorithm. Heuristics: suffix links.
Using suffix links can substantially reduce amount of search.
If we maintain a pointer to the end of the
longest suffix S[1..i], we will have the ending
point of the previously extended suffix in the
beginning of each extension .
We will not move from the root every time
we search for the end of the next suffix.
Instead, we will get from the available ending
point up to the first inner vertex v (if it’s not
an inner vertex itself), walk along its suffix
link s(v) and search only in the subtree of
s(v).
Notice that he moment we are moving along the suffix link, the depth of v in
tree, exceeds the depth of s(v) by less then 2.
12
Ukkonen’s algorithm. Heuristics: suffix links.
Using suffix links can substantially reduce amount of search.
If we maintain a pointer to the end of the
longest suffix S[1..i], we will have the ending
point of the previously extended suffix in the
beginning of each extension .
We will not move from the root every time
we search for the end of the next suffix.
Instead, we will get from the available ending
point up to the first inner vertex v (if it’s not
an inner vertex itself), walk along its suffix
link s(v) and search only in the subtree of
s(v).
Notice that he moment we are moving along the suffix link, the depth of v in
tree, exceeds the depth of s(v) by less then 2.
13
Ukkonen’s algorithm: jumping over edges.
Still, there are unnecessary steps. We walk
along edges in the sub tree of s(v), in
series comparing symbols to the searched
mark, as if we were checking if a way,
corresponding to this mark, exists in the
tree.
But it exists, and all we need to do is to
find its ending point. That’s why we can
restrict ourselves to choosing the right
edge in a vertex (comparing the first
symbols), end jump to the next vertex
along this edge, or find the sought point on
the edge, if its mark is long enough.
14
Ukkonen’s algorithm: jumping over edges.
Still, there are unnecessary steps. We walk
along edges in the sub tree of s(v), in
series comparing symbols to the searched
mark, as if we were checking if a way,
corresponding to this mark, exists in the
tree.
But it exists, and all we need to do is to
find its ending point. That’s why we can
restrict ourselves to choosing the right
edge in a vertex (comparing the first
symbols), end jump to the next vertex
along this edge, or find the sought point on
the edge, if its mark is long enough.
15
Ukkonen’s algorithm: jumping over edges.
Still, there are unnecessary steps. We walk
along edges in the sub tree of s(v), in
series comparing symbols to the searched
mark, as if we were checking if a way,
corresponding to this mark, exists in the
tree.
But it exists, and all we need to do is to
find its ending point. That’s why we can
restrict ourselves to choosing the right
edge in a vertex (comparing the first
symbols), end jump to the next vertex
along this edge, or find the sought point on
the edge, if its mark is long enough.
16
Ukkonen’s algorithm: current results.
Theorem. In the improved algorithm, every phase takes
time.
Cons. The current version of Ukkonen’s algorithm terminates in
17
.
Ukkonen’s algorithm: the last observations.
•To keep the labels on the marks is insufficient, because their overall length doesn’t have to
be linear. We can replace the labels with indices, indicating the beginning and the end of the
substring in S.
18
Ukkonen’s algorithm: the last observations.
•To keep the labels on the marks is insufficient, because their overall length doesn’t have to
be linear. We can replace the labels with indices, indicating the beginning and the end of the
substring in S.
•The first time in the phase we find out, that in the current extension nothing is to be done,
we can complete with the phase. So a phase is a consequency of extensions, which use the
first (prolonging a mark) and the second (branching off) rules.
19
Ukkonen’s algorithm: the last observations.
•To keep the labels on the marks is insufficient, because their overall length doesn’t have to
be linear. We can replace the labels with indices, indicating the beginning and the end of the
substring in S.
•The first time in the phase we find out, that in the current extension nothing is to be done,
we can complete with the phase. So a phase is a consequency of extensions, which use the
first (prolonging a mark) and the second (branching off) rules.
•A leaf cannot become an inner vertex.
•During the phase, we add the same symbol to edge marks.
20
Ukkonen’s algorithm: the last observations.
•To keep the labels on the marks is insufficient, because their overall length doesn’t have to
be linear. We can replace the labels with indices, indicating the beginning and the end of the
substring in S.
•The first time in the phase we find out, that in the current extension nothing is to be done,
we can complete with the phase. So a phase is a consequency of extensions, which use the
first (prolonging a mark) and the second (branching off) rules.
•A leaf cannot become an inner vertex.
•During the phase, we add the same symbol to edge marks.
•We will ‘split’ or ‘do nothing’ only with the suffixes, not
processed in the previous phase (those, to which we applied
‘do nothing’) – in the other cases we prolong a leaf.
21
Ukkonen’s algorithm: time estimation.
Theorem. Ukkonen’s algorithm terminates in O(n) time.
22
Ukkonen’s algorithm: difficulties of
implementation.
Problems of working with suffix trees:
• Dependency on the length of the alphabet
• No ‘locality’ – bad for paging
• Number of ‘children’ ranges for different vertices – no general ways of
representation:
• Arrays for vertices near to the root.
• Linked lists for the leaves.
• Balanced trees and hashing for the middle vertices.
As a result, the structure becomes even more complicated.
23
Suffix array:
An array, containing the suffixes in lexicographic order.
The idea belongs to Udi Manber and Gene Myers (1993,
“Suffix Arrays: a New Method For On-Line String
Searches”).
They proposed an algorithm of direct constructing the array in
O(n*log n) time. This algorithm not only built the array, but
on the way gathered some additional information. Manber
and Myers also presented an algorithm of search, using this
information, for a pattern P in O(|P| + log m) time.
24
Suffix array construction: idea.
As with the trees, we build the array inductively, greatly using it’s structure.
Initially, we have an array with unordered suffixes. Beginning with sorting the
suffixes by the first symbol (which is linear in the string’s length), every phase we
twice the number the suffixes are sorted on.
After the phase H, the suffixes are
organized into buckets, holding
suffixes with the same H first symbols.
If A(i) is the suffix in the first bucket, A(i-H) should be first in its 2H-bucket. We
can move it to the beginning of its 2H-bucket, and mark this fact. For every
bucket, we need to know the number of suffixes in this bucket that have already
beer moved and placed in 2H-order. The algorithm basically scans the suffixes as
they appear in the H-order and for each A(i) it moves A(i-H) (if it exists) to the
next available place in its bucket.
25
‘Skew’ algorithm: structure
• Construct the suffix array of the suffixes, starting at positions
i = 1, 2 (mod 3). This is done by reduction to the suffix array
construction of a string of two thirds the length, which is
solved recursively.
• Construct the suffix array of the remaining suffixes using the
result of the first step.
• Merge the suffix arrays into one.
26
The skew algorithm: example
27
The skew algorithm: example
28
The skew algorithm: example
29
The skew algorithm: example
30
The skew algorithm: example
31
Literature
E.Ukkonen. On-Line Construction of Suffix Trees, 1993
Udi Manber and Gene Myers. Suffix Arrays: a New Method
For On-Line String Searches, 1992
Juha Karkkainen and Peter Sanders. Simple Linear Work Suffix
Array Construction, 2003.
Martin Farach. Optimal Suffix Tree Construction with Large
Alphabets, 1997
Dan Gusfield. Algorithms on Strings, Trees, and Sequences.
Computer science and computational biology
32