Solutions

58093 String Processing Algorithms
Renewal/separate Exam, 3 February 2012. Solutions.
Examiner: Juha Kärkkäinen
1. [4+4+4 points] Each of the following pairs of concepts are somehow connected. Describe
the main connecting factors or commonalities as well as the main separating factors or
differences.
(a) Aho–Corasick algorithm and suffix tree.
Solution. Both can solve the multiple exact string matching problem in linear
time for a constant alphabet, but the solutions are very different: AC algorithm
preprocess the patterns, while suffix tree solution preprocesses the text. AC algorithm
is particularly designed for solving multiple exact string matching, but suffix tree can
be used for solving many other problems, too.
Both are based on the trie structure. AC automaton is a trie for the set of pattern,
while the suffix tree is the compact trie for the set of suffixes of the text. Both tries
can be augmented with special links defined similarly, failure links in AC automaton
and suffix links in suffix tree. Suffix tree can be used as an AC automaton for the
set of suffixes.
(b) LSD radix sort and MSD radix sort.
Solution.
Both are string sorting algorithms for integer alphabet. Both use counting sort as a
subroutine. Both process string one position at a time. LSD radix sort proceeds from
the end to the beginning, while MSD radix sort proceeds in the opposite direction.
The time complexity of LSD radix sort is proportional to the total length of the
strings, while the time complexity of MSD radix sort is proportional to the total
length of the shortest distinguishing prefixes. LSD handles large alphabets better
than MSD.
(c) Prefix doubling and induced sorting.
Solution. Both are algorithms for suffix array construction. Prefix doubling has
time complexity O(n log n) while induced sorting is a linear time algorithm over an
integer alphabet. Both are based on computing lexicographic names for substrings.
Prefix doubling doubles the lengths of substring at each step. Induced sorting at
least halves the number of substrings/suffixes at each level of recursion.
A few lines for each part is sufficient.
2. [3+2+3 points]
(a) Explain what are ordered alphabet and integer alphabet.
Solution. An ordered alphabet is a finite ordered set. An integer alphabet is the
set of integers in a range [0..σ) for some σ > 0. An algorithm works on an ordered
alphabet if it requires only symbol comparisons to perform its task. If the algorithm
performs operations that rely on symbols being encoded by integers, it is said to
require an integer alphabet. Examples of such operations are using a symbol to
index a table and computing a hash value. Algorithms for integer alphabet are more
restricted in their applicability.
(b) Give an example of an exact string matching algorithm that works equally well with
both kinds of alphabets.
Solution. Algorithms for ordered alphabet can be used on integer alphabet since
integers can be compared. The (Knuth–)Morris–Pratt algorithm uses only symbol
comparison and thus works with both kinds of alphabets.
(c) Give an example of an exact string matching algorithm that works with one type of
alphabet but not the other. Explain why the algorithm requires a specific type of
alphabet.
Solution. The Horspool algorithm uses a text symbol as an address to a table,
where the corresponding shift length is stored. This operation requires the symbol
to be an integer in a relatively small range. In principle, the lookup could be implemented using a binary search, for example, but this would significantly slow down
the algorithm.
3. [10 points] Use Ukkonen’s cut-off algorithm to find all approximate occurrences of the
pattern P = levee in the text T = elevated water level with edit distance k = 1.
Solution. The following table show the computation.
l
e
v
e
e
0
1
2
3
4
5
e
0
1
1
l
0
0
1
2
e
0
1
0
1
v
0
1
1
0
1
a
0
1
2
1
1
2
t
0
1
2
2
2
2
e
0
1
1
d
0
1
2
2
0
1
2
w
0
1
2
a
0
1
2
t
0
1
2
e
0
1
1
r
0
1
2
2
0
1
2
l
0
0
1
e
0
1
0
1
v
0
1
1
0
1
e
0
1
1
1
0
1
l
0
0
1
2
1
1
The algorithm finds two occurrences ending at the last two positions.
4. [10 points] Let T be a string and let R be a multiset of symbols. A factor S of T is an
occurrence of R if S consists of exactly the symbols of R. For example, if T = abahgcabah
and R = {a, a, b, c}, the only occurrence of R in T is the factor S = {caba}. Describe an
algorithm for finding all occurrences of R in T . The time complexity should be O(|T |+|R|)
on an alphabet of constant size.
Solution. Let n = |T | and m = |R|. The idea is to slide a window of size m over T and
keep account of how many times each symbol of the alphabet occurs within the window.
In each step, one symbol exits the window and one symbol enters the window, so two
counters are updated. When the symbol counts in the window match the symbol counts
for R, an occurrence of R has been found. With a constant size alphabet, comparing the
counts takes constant time.
5. [10 points] Let R = {S1 , S2 , . . . , Sk } be a set of strings, where no string is a factor of
another string. The shortest distinguishing factor of Si is the shortest string that occurs
in S but not in any other string in R. Describe an algorithm for finding the shortest
distinguishing factor for all strings in R. The time complexity should be linear on a
constant size alphabet.
Solution. Construct the suffix tree of the string T = S1 £S2 £ . . . £Sk £$. Label each
leaf v as an i-leaf if v represents a suffix starting within Si £. Each internal node u,
whose subtree contains only i-leaves is labelled an i-node. This can be done by depth first
traversal of the tree. All of this can be done in O(||R||) time.
Now Di is a distinguishing factor of Si if and only if there is an i-node or an i-leaf v
such that Di = T [start(v)...start(v) + |Di |) and depth(parent(v)) < |Di | ≤ depth(v).
Furthermore, if parent(v) is not an i-node and |Di | = depth(parent(v)) + 1, then Di is
the shortest distinguishing factor of Si starting at start(v). The shortest distinguishing
factor of Si can be found by an iteration over all the i-leafs and i-nodes. All shortest
distinguishing factors can be computed by a single traversal of the tree in O(||R||) time.

Download Report

Solutions

Paperzz.com

Your Paperzz