USING THE
TRAVELING SALESMAN PROBLEM
IN
BIOINFORMATIC ALGORITHMS
Master’s Thesis in Computer Science
Finn Rosenbech Jensen
2006 0636
Supervisor:
Christian Nørgaard Storm Pedersen
Department of Computer Science
Aarhus University
December 2010
Layout and typography in this thesis are made using LaTeX
with the memoir document class.
Any good idea or nice layout feature can be directly attributed to the book:
Introduktion til LaTeX, by Lars Madsen,
http://www.imf.au.dk/system/latex/bog/
The TSP tour images are made using the techniques described at
http://www.cgl.uwaterloo.ca/~csk/projects/tsp/
an output-modified version of the ccvt software
http://code.google.com/p/ccvt/
and the TSP heuristic solver LKH
http://www.akira.ruc.dk/~keld/research/LKH/
All self-constructed figures are drawn using LaTeXDraw,
http://latexdraw.sourceforge.net/
and all graphs and diagrams are constructed using matplotlib,
http://matplotlib.sourceforge.net/
For brevity is very good, where we are,
or are not understood
Samuel Butler
Abstract

Theoretical computer science has given birth to a very intriguing and fascinating class of problems known as Nondeterministic Polynomial-time (NP) complete problems. Among their captivating properties is their computational complexity despite often deceptively simple formulations. An example of this is the ubiquitous Traveling Salesman Problem (TSP). This problem has enjoyed uncontested popularity in computational mathematics, leading to a plethora of techniques and programs for solving it, both exactly and heuristically.

Another remarkable property of NP complete problems is their frequent occurrence in real-life problems, making them more than just theoretical constructions. A practical example of this is the field of bioinformatics, where many of the challenges faced by its researchers have turned out to be NP complete problems.

Yet another tantalising characteristic of the NP complete problems is their reduction property, making every problem equally difficult (or easy) to solve. In other words: to solve them all, we only have to solve one of them.
This thesis aims at utilising the above properties with the purpose of examining the effect of trying to solve bioinformatic problems using the reduction property and a state-of-the-art implementation for solving the TSP. The practical bioinformatic problem is the Shortest Superstring Problem (SSP). To assess the quality of the obtained solutions, they are compared to solutions from four approximation algorithms. To convey a full understanding of the algorithms and their approximation factors, the thesis additionally includes a self-contained survey of approximation algorithms for the SSP.

The thesis further examines the bioinformatic problems concerning Multiple Sequence Alignment (MSA) and hereby presents the definition of a TSP based scoring function. A near-optimal MSA construction algorithm that uses this scoring, and additionally a divide-and-conquer algorithm for refining MSAs, are implemented and experimentally tested.

Based on truly convincing results, the main conclusion of the thesis is that it is definitely a promising idea to apply efficient TSP solver implementations to solve NP complete problems within bioinformatic applications. The results obtained for the implemented MSA algorithms are far more modest, although the MSA construction algorithm and the scoring function should not be dismissed without further study.
Acknowledgements

First I would like to thank my supervisor Christian Nørgaard Storm Pedersen for being utmost inspiring, positive and patient all along the long and winding road, and for giving me the idea for this thesis in the first place. I can honestly recommend choosing a supervisor who has personal experience with small children.

Second I would like to thank Gaston Gonnet and especially Keld Helsgaun for their kind and helpful responses to my email queries.

A warm thanks goes to the staff at the Department of Computer Science at Aarhus University, in particular to Torsten Nielsen, Michael Glad and Kai Birger Nielsen, for being very prompt whenever I needed help and for letting me abuse quite a few 'camels' for my test runs.

Furthermore I would like to thank all the people at the office floor: Anders Hauge Nissen, Martin Højgaard Have, Sean Geggie, Mikkel Vester, Marie Mehlsen, Mette Helm, Lene Mejlby and Jesper Jakobsen — you all made a long indoor summer and fall quite enjoyable.

Also many thanks to my office mates Morten Slot Kristensen and Anders Andersen for being very patient with a grumpy ol' man, and to Anders for helping me out whenever my lack of technical talent took its toll.

A hearty thanks to my "old" pair-programming partner Dennis Decker Jensen for his never-failing willingness to help me out every time I have gotten myself into another fine (coding) mess.

A great many thanks to all my thesis reviewers: Anders Andersen, Jesper Jakobsen, Rasmus Lauritsen and Sarah Zakarias, who bravely accepted the harsh challenge of making my attempts at written English both readable and understandable.

Last but certainly not least, a deep and wholehearted thanks to my wife Birgit and our three children for enduring me and my absentmindedness for so long.

Finn Rosenbech Jensen
Contents

Abstract
Acknowledgements
Contents

1 Introduction
    1.1 Road Map
    1.2 Thesis Terminology and Notation

Part I  The Road Goes Ever On and On

2 The Traveling Salesman Problem
    2.1 History
    2.2 Seminal Results in TSP Computation
    2.3 Notes

3 Traveling Salesman Problem Solvers
    3.1 Characterisation of TSP Solvers
    3.2 Selection of TSP Solver
    3.3 Notes

Part II  The Shortest Superstring Problem

4 The Shortest Superstring Problem
    4.1 The Problem
    4.2 Applications of SSP

5 Preliminaries
    5.1 Strings
    5.2 Graphs and Cycle Covers
    5.3 Approximation Strategies
    5.4 Notes

6 Algorithms
    6.1 The Cycle Algorithm
    6.2 The RecurseCycle Algorithm
    6.3 The Overlap Rotation Lemma
    6.4 The Greedy Algorithm
    6.5 Reductions of Hamiltonian Path Problem Instances
    6.6 The Best Factor Algorithm
    6.7 The TSP Based Algorithm
    6.8 Summary
    6.9 Notes

7 Experiments
    7.1 Test Data
    7.2 SSP Solver Tests
    7.3 Timing Tests
    7.4 Sequencing Tests

Part III  The Circular Sum Score and Multiple Sequence Alignment

8 Sequence Alignments
    8.1 Motivation
    8.2 Pairwise Sequence Alignment
    8.3 Multiple Sequence Alignment
    8.4 Notes

9 Circular Sum Score
    9.1 Sum-of-Pairs Score
    9.2 MSAs and Evolutionary Trees
    9.3 Circular Orders, Tours and CS Score
    9.4 Notes

10 Algorithms
    10.1 Path Sum Scores
    10.2 The MSA Construction Algorithm
    10.3 The MSA Enhance Algorithm
    10.4 Notes

11 Experiments
    11.1 Reference MSA Applications
    11.2 Benchmark Database
    11.3 Experiments Set-up
    11.4 MSA Construction Tests
    11.5 MSA Enhancing Tests
    11.6 Notes

Part IV  The End of The Tour

12 Conclusion
    12.1 Future Work

List of Algorithms
List of Figures
List of Tables

Part V  Appendices

A Acronyms
B LKH output
C GenBank Nucleotide Sequences
D SSP Test Data

Bibliography
Nothing is more dangerous than an idea,
if it's the only one you have
Pragmatic Programmers

Chapter 1
Introduction
One of the most intriguing notions in theoretical computer science is the
existence of the class of problems known as NP complete problems. The
notion characterises a class of problems for which it is “very unlikely” that
an “efficient” general solution exists.
The argumentation for the "unlikeliness" is partly that forty years of intense study have not given birth to such a solution,¹ partly that the implications, should such a solution exist, are very far-reaching. As an example, it would be possible to construct an efficient algorithm that could find the shortest mathematical proof of any theorem, thereby making the creative part of theorem proving superfluous.

The notion of "efficient" general solution is defined in terms of polynomial² running-time dependency on the size of the problem instance. Thus NP complete problems can be seen as problems for which we in most cases have neither the time nor the computing power to find the optimal solution. This essentially means that searching for exact solutions is infeasible and instead approximation algorithms and heuristics must be used. From a mathematical point of view this might seem unsatisfying, given that the problems have trivial mathematical solutions. However, from a practical point of view this makes the problems very challenging and interesting, giving the opportunity and leeway to try out any conceivable idea for solving them.

¹ On August 7th 2010, Vinay Deolalikar, a recognised researcher at Hewlett Packard, published a claimed proof that P ≠ NP, implying that such solutions do not exist. This was the 7th out of 8 published proof attempts in 2010 (http://www.win.tue.nl/~gwoegi/P-versus-NP.htm), and it spurred an intense effort to understand and pursue its new ideas; but in less than a week some deep and probably irremediable flaws were uncovered (http://rjlipton.wordpress.com/2010/08/12/fatal-flaws-in-deolalikars-proof).
² as opposed to exponential or factorial
Another fascinating hallmark of NP complete problems is their "reduction" property. This means that each of the problems³ can be reformulated as an instance of every other NP complete problem. The effect of this is that if an efficient solution for one of these problems were found, it could be used to find efficient solutions to all of the NP complete problems.

One of the most famous examples of an NP complete problem is the TSP. It is at the same time one of the most intensely investigated subjects in computational mathematics. The problem seems to cast a certain spell of fascination upon the people getting involved with it, and it has inspired studies by mathematicians, chemists, biologists, physicists, neuroscientists and computer scientists. The popularity of the TSP has had the practical effect that a lot of applications for solving it, both exactly and heuristically, have been devised.

A practical field where a lot of NP complete problems turn up is bioinformatics or computational biology. NP complete problems occur in connection with bioinformatic challenges like Deoxyribonucleic Acid (DNA) sequencing, sequence alignments, gene identification and evolutionary tree building. Since it is in most cases infeasible to search for optimal solutions for these problems, the field of bioinformatics is very vivid in terms of devising approximation algorithms and heuristics.

This thesis can be seen in this light, as its main purpose is to examine the effect of trying to solve bioinformatics problems using the reduction property of NP hard problems and a state-of-the-art implementation for solving the TSP. This idea does run the risk of being too simplistic and general, since in effect the idea is to solve different problems in the same way. This is opposed to constructing or using specialised methods and algorithms designed to solve each specific problem, thereby taking advantage of the individual character of the problem. On the other hand, the idea could prove applicable given that more than fifty years of vast exploration and work has been put into implementing efficient TSP solvers.

The second thesis part is dedicated to the above purpose. It contains the results of heuristically solving the NP complete, bioinformatics related problem known as the SSP by reducing it to an instance of the TSP. To assess the quality of the solutions, they are compared to solutions from four approximation algorithms, one of which achieves the best known approximation factor for the SSP.

³ actually every problem belonging to the larger class of problems known as NP
To convey a full understanding of the approximation algorithms and their approximation factors, this part also contains a chapter that can be characterised as a newly devised, self-contained survey of approximation algorithms for the SSP.

As a further application of the idea, the third part of the thesis explores the use of the TSP in connection with bioinformatics problems concerning sequence alignment construction and evaluation. The part focuses on MSA and the definition of a scoring function which takes the evolutionary history of proteins into account. A near-optimal MSA construction algorithm based on this scoring function, and additionally a divide-and-conquer algorithm for refining MSAs, are implemented and experimentally tested. These algorithms originate in work done at the Swiss Federal Institute of Technology around 2000, and this thesis part covers improvements on both algorithms.
The goals and contributions of this thesis can thus be summarised as:

1) an examination of the feasibility of applying efficient TSP solver implementations to solve NP complete problems within bioinformatic applications,

2) a self-contained survey of approximation algorithms for the SSP, including descriptions of implementations and testing of solution quality, and

3) a description and an experimental testing of the improvements on the TSP based algorithms for MSA construction and MSA refinement.
1.1 Road Map
The thesis is divided into four parts:
Traveling Salesman Problem
The first part is a short (minimal?) tour through some history, definitions
and theoretical results for the TSP. It also characterises TSP solvers and
argues for the choice of solver to be used in the algorithm implementations.
This part should provide the reader with an understanding of the TSP
and the current state-of-the-art solver implementations.
Shortest Superstring Problem
The second part has its emphasis on the SSP and is by far the most theoretical and difficult part of the thesis. It starts off with a chapter introducing and defining the problem along with its applications. Next follows a chapter containing preliminaries for the theoretical treatment of the implemented algorithms.

These algorithms are then the theme of the succeeding chapter, which is a full treatment of the five implemented algorithms, their approximation factors and implementations. This mix of theory and practical implementation description is intentional, in the hope that it will make the reading more varied.

The part concludes with a chapter describing the experiments conducted on the algorithms and the results thereof.

The part should provide the reader with an understanding of both the theoretical foundation for SSP approximation algorithms as well as the technical aspects of their implementation and the achieved practical results.
Multiple Sequence Alignment
The third part focuses on MSA and the evolutionary based scoring function. Both an MSA construction algorithm and an MSA improving algorithm, together with their implementations, are presented. The part finishes with a chapter describing the tests conducted on the algorithms and the results thereof.

This part of the thesis should provide the reader with an understanding of sequence alignment and some of the aspects concerning constructing and estimating MSAs.
Conclusion
The fourth part contains the thesis conclusion and ideas for future work.
1.2 Thesis Terminology and Notation
This section contains a short description of the mathematical terminology
and notation used in the thesis.
1.2.1 Graphs.
We denote directed graphs as $G(N, A)$ and use the names "nodes" and "arcs" for their components. Undirected graphs will be denoted $G(V, E)$ and their components described using the names "vertices" and "edges". Arcs and edges are denoted $a_i$ and $e_i$, or by the nodes or vertices they have as endpoints: $(n_i, n_j)$ and $(v_i, v_j)$. In an undirected graph the edges $(v_i, v_j)$ and $(v_j, v_i)$ are identical.
Remark 1
For simplicity the definitions in the rest of this section will all be phrased in terms of undirected graphs. Each definition containing the words 'edges' and 'vertices' can be substituted by an analogous definition for directed graphs phrased using 'nodes' and 'arcs'.
A graph $G' = (V', E')$ is a subgraph of $G(V, E)$ iff $V' \subseteq V$ and $E' \subseteq E$. If $V' \subseteq V$ and $E' = \{\, e = (v_i, v_j) \mid e \in E,\ v_i, v_j \in V' \,\}$ then $(V', E')$ is called the subgraph induced by $V'$.

By a walk in a graph $G(V, E)$ we understand a sequence of vertices $v_1, v_2, \ldots, v_k$, where the walk starts in $v_1$ and continuously follows the edges $(v_i, v_{i+1})$ between the vertices $v_i, v_{i+1}$ in the sequence. A walk is called a path if each of the vertices in the sequence is distinct.⁴ A walk where $v_1 = v_k$ is called a circuit and a path with $v_1 = v_k$ is called a tour or a cycle.⁵ A path or a tour visiting all vertices in the graph exactly once is called Hamiltonian. For a tour visiting $k$ vertices the short-cut notation $v_1, v_2, \ldots, v_k$ will also be used instead of the more correct $v_1, v_2, \ldots, v_k, v_1$. A tour $T'$ is called a subtour of $T$ if all vertices in $T'$ are contained in $T$.

For an undirected graph $G(V, E)$ the "degree" of a vertex is the number of edges incident to the vertex. For a directed graph $G(N, A)$ each node has an "in-degree" and an "out-degree" signifying the number of arcs going to and from the node respectively. The notation $d_{in}^n$ and $d_{out}^n$ denotes the in-degree and out-degree of node $n$.
Example 1
In Figure 1.1 (a) the sequence $(n_1, n_5, n_4, n_2)$ is a path and in (b) the sequence $(n_1, n_5, n_4, n_2, n_1)$ is a tour. In subfigure (c) the sequence $(n_1, n_5, n_4, n_2, n_3)$ is a Hamiltonian path and in subfigure (d) the sequence $(n_1, n_5, n_4, n_2, n_3, n_1)$ is a Hamiltonian tour. The degrees $d_{in}^1$, $d_{out}^2$, $d_{out}^3$ and $d_{in}^4$ equal two; all other degrees equal one.
A complete graph $G(V, E)$ is a graph where $E$ includes all possible edges in the graph. In other words $E$ consists of all pairs of vertices $(v_i, v_j)$, $i \neq j$, in $V$. Finally we define the cut-set of a set $S \subseteq V$ as $\delta(S) = \{\, e \mid e = (v_i, v_j) \in E,\ v_i \in S,\ v_j \notin S \,\}$. The cut-set of a vertex set consists of the edges having one endpoint in the set and the other endpoint outside the set. In other words, the edges that have to be "cut" in order to remove $S$ from the graph.

Example 2
Figure 1.2 (a) shows a complete (undirected) graph with five vertices. In subfigure (b) the cut set $\delta(v_1)$ of the vertex $v_1$ is illustrated.
⁴ we use path in the implicit meaning "simple" path
⁵ again we use tour (cycle) in the implicit meaning simple tour (cycle)
Figure 1.1: paths and tours in a graph (subfigures (a)-(d) on the nodes 1-5)

Figure 1.2: complete graph and cut set ((a) complete graph, (b) cut set of $v_1$)
A bi-partite graph $G(V, E) = G(V', V'', E)$ is a graph where the vertex set $V$ can be divided into two disjoint vertex sets $V', V''$ with the property that $E = \{\,(v', v'') \mid v' \in V',\ v'' \in V''\,\}$. In other words all edges are going from $V'$ to $V''$. For simplicity the notation $G(V, V', E)$ is also used for bi-partite graphs.

A weighted graph $G(V, E, W)$ is a graph that associates a weight (or cost) with each of its edges. The weights are typically real numbers, that is $W : E \to \mathbb{R}$. We denote the weight of an edge $w(e)$ and the weight of a graph $G$ by $W(G) = \sum_{e \in E} w(e)$.

Finally a multigraph is a graph which can have more than one copy of its edges.
1.2.2 Trees.
A connected graph without cycles, $T(V, E)$, is called a "tree". In a tree each vertex can have zero or more "child vertices" called its children and at most one "parent vertex" called its parent. A vertex having no children is called a "leaf". A tree can be "rooted" or "unrooted", implying whether or not there is a single vertex having no parent, denoted the "root" of the tree. Non-leaf vertices are also called "interior" vertices.

A subtree $T_v(V', E')$ of a tree $T(V, E)$ is the induced subgraph where $V' = \{v\} \cup \{v_i \mid v_i \text{ is a descendant of } v\}$. In other words, a tree having $v$ as root and including everything "below" $v$ in $T$.

A special kind of tree is the "binary tree", where the number of children for each vertex is at most two. In a binary tree with $n$ leaves there will be $n - 1$ interior vertices.
Example 3
Figure 1.3 (a) shows an unrooted tree with two interior vertices and five leaves. Subfigure (b) shows a rooted tree consisting of a root, two interior vertices and five leaves, and finally subfigure (c) shows a binary tree consisting of a root, three interior vertices and five leaves.
Figure 1.3: Different tree types ((a) unrooted, (b) rooted, (c) binary)
1.2.3 Strings.
When naming strings and chars, letters from the "middle" of the alphabet ($s, t, u, \ldots$) will be used for strings, and letters from the beginning of the alphabet ($a, b, \ldots$) will be used for chars. The empty string will be denoted $\epsilon$.

For any string $s$ let $|s|$ denote the length of $s$. For a collection of strings $S = (s_1, s_2, \ldots, s_n)$ we denote the accumulated length of $S$ as $\|S\| = \sum_{i=1}^{n} |s_i|$. String indexing '[ ]' will be one-based and string slicing '[ : ]' will be start inclusive and end non-inclusive. The notation $s[: i]$ is shorthand for $s[1 : i]$ and $s[i :]$ is shorthand for $s[i : |s| + 1]$. String concatenation will be implicit, i.e. the concatenation of strings $s, t$ will be denoted $st$.
Example 4
Let $s = abcde$, $t = cba$. Then $|\epsilon| = 0$, $|s| = 5$, $s[1] = a$, $s[2] = b$, $s[5] = e$ and $s[2:5] = bcd$, $s[1:3]\,t[2:4] = s[:3]\,t[2:] = abba$, $t^3 = ttt = cbacbacba$, $t^\infty = ttt\ldots$

1.2.4 Arrays.
Since strings can be seen as char arrays, we will treat array operations analogously to strings, i.e. array indexing '[ ]' will be one-based and array slicing '[ : ]' will be start inclusive and end non-inclusive. Furthermore, for an array Arr, an element elem, and an index idx we use the following array functions:

• LIST() constructs an array,
• SIZE(Arr) or |Arr| denotes the number of elements in Arr,
• APPEND(Arr, elem) appends elem at the end of Arr,
• REMOVE(Arr, idx) removes the element at index idx from Arr,
• SUM(Arr) calculates the sum of the elements in Arr (requires numeric elements in Arr).
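To make the one-based, end-non-inclusive conventions above concrete, here is a minimal Python sketch (illustration only, not part of the thesis; the helper names get1 and slice1 are invented), mapping the thesis notation onto Python's zero-based slicing:

from typing import Optional

def get1(s, i):
    """One-based indexing: the thesis' s[i]."""
    return s[i - 1]

def slice1(s, start=1, end: Optional[int] = None):
    """One-based, end-non-inclusive slicing: the thesis' s[start : end].
    s[: i] means s[1 : i] and s[i :] means s[i : |s| + 1]."""
    if end is None:
        end = len(s) + 1
    return s[start - 1 : end - 1]

s, t = "abcde", "cba"
assert get1(s, 1) == "a" and get1(s, 5) == "e"
assert slice1(s, 2, 5) == "bcd"                  # s[2:5] = bcd
assert slice1(s, 1, 3) + slice1(t, 2) == "abba"  # s[:3] t[2:] = abba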
Part I
The Road Goes Ever On and On

[Part opener illustration: DNA climb tour with 8190 cities]
The Road goes ever on and on
Down from the door where it began.
Now far ahead the Road has gone,
And I must follow, if I can,
Pursuing it with eager feet,
Until it joins some larger way
Where many paths and errands meet.
And whither then? I cannot say.
Bilbo Baggins (J.R.R. Tolkien)

Chapter 2
The Traveling Salesman Problem
This chapter introduces the Traveling Salesman Problem and gives a
short overview of its history along with some major computational results. It furthermore contains the Linear Program (LP) formulation of the
TSP. As this chapter is meant as an introduction, treatment and definition
of some technical terms will be postponed to later chapters.
2.1 History
The TSP is one of the most (if not the most) widely studied problems in computational mathematics. One of the reasons for this might very well be the ease of formulating and understanding the problem:

Definition 1 (The Traveling Salesman Problem)
Given a collection of cities along with the cost of travel between each pair of them, determine the cheapest route visiting each city exactly once, starting and ending in the same city.
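Definition 1 is easily turned into a brute-force program; the following Python sketch (illustrative only, with an arbitrary toy cost matrix) enumerates all routes, which also makes its factorial infeasibility plain:

from itertools import permutations

def brute_force_tsp(cost):
    """Exhaustively find the cheapest tour visiting each city exactly once,
    starting and ending in city 0. cost[i][j] is the cost from i to j.
    Fixing the start city removes rotationally equivalent tours."""
    n = len(cost)
    best_tour, best_cost = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        c = sum(cost[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if c < best_cost:
            best_tour, best_cost = tour, c
    return best_tour, best_cost

# Toy symmetric 4-city instance (costs chosen arbitrarily)
cost = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(brute_force_tsp(cost))  # -> ((0, 1, 3, 2), 18)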
Given this simple formulation it might be expected that the problem could have an equally simple solution. This turns out not to be the case. Despite being easily understood, no general efficient solution to the TSP has been found. However, the problem has inspired multiple studies by mathematicians, computer scientists, chemists, physicists, psychologists and a host of non-professional researchers.
An interesting historical reference is the German handbook from 1832 by B. F. Voigt with the title: "Der Handlungsreisende - wie er sein soll und was er zu thun hat um Aufträge zu erhalten und eines glücklichen Erfolgs in seinen Geschäften gewiss zu sein - Von einem alten Commis-Voyageur".¹ The handbook contains routes through Germany and Switzerland and goes as far as to claim:

    We believe we can ensure as much that it will not be possible to plan the tours through Germany in consideration of the distances and the traveling back and forth, which deserves the traveller's special attention, with more economy. The main thing to remember is always to visit as many localities as possible without having to touch them twice.²

Figure 2.1: The Commis-Voyageur tour for 47 German cities (from [Applegate et al., 2007]).

One of the tours, shown in Figure 2.1, goes through 47 German cities; it is actually of very good quality and might even be optimal given the travel conditions of that time.

The first explicit mention of the TSP as a mathematical optimisation problem is probably due to the Austrian mathematician Karl Menger, who at a mathematical colloquium in Vienna in 1930 used the formulation:

    Wir bezeichnen als Botenproblem (weil diese Frage in der Praxis von jedem Postboten, übrigens auch von vielen Reisenden zu lösen ist) die Aufgabe, für endlich viele Punkte, deren paarweise Abstände bekannt sind, den kürzesten die Punkte verbindenden Weg zu finden.³

¹ The traveling salesman - how he has to be and what he has to do to acquire orders and be assured of a fortunate success in his business - by an old Commis-Voyageur
² translated from the German original by Linda Cook
³ We denote as messenger-problem (because this question has to be solved by any postman and by the way many salesmen) the challenge of finding the shortest route between a finite number of points of which the pairwise distance is known.
Shortly after this, the TSP became popular among mathematicians at Princeton University. There does not exist any authoritative source for the origin of the problem's name, but according to Merrill Flood and A. W. Tucker it was introduced under its present-day name in 1934 as part of a seminar given by Hassler Whitney at Princeton University.⁴

An additional characteristic of the TSP is that it is relatively easy to construct good solutions, but far from easy to find a provably optimal solution. This has made the problem a popular "guinea pig" for all kinds of optimisation techniques during the last half of the 20th century (see Table 2.1 on the following page).
An excellent example of this is the seminal paper [Dantzig et al., 1954], where the inventor of the Simplex Algorithm together with two colleagues introduced the integer LP formulation of the TSP (see Definition 2) as well as a cutting plane method⁵ in order to solve a 49 cities problem instance to provable optimality.

⁴ although Whitney, when queried some twenty years later, did not recall the problem [Dantzig, Fulkerson, and Johnson, 1954]
⁵ a way of coming from the optimal fractional variable solution to the optimal integer variable solution
Definition 2 (Integer LP-formulation for TSP)
Let $G = (V, E, W)$ be a weighted graph with $V = \{v_1, v_2, \ldots, v_n\}$; then the TSP in $G$ can be formulated as

Minimise $\sum_{e \in E} w(e) \cdot x_e$ subject to

$\sum_{e \in \delta(\{v_i\})} x_e = 2, \quad \forall\, v_i \in V$   (degree constraints)   (2.1)

$\sum_{e \in \delta(S)} x_e \geq 2, \quad S \subset V,\ S \neq \emptyset$   (subtour elimination constraints)   (2.2)

$x_e \in \{0, 1\}, \quad \forall\, e \in E$   (integer constraints)   (2.3)
Remark 2
In Definition 2 the degree constraints (2.1) ensure that all vertices have degree exactly two, i.e. a tour enters and leaves each city exactly once. The subtour elimination constraints (2.2) ensure that the solution will consist of one large cycle and not two or more sub-cycles. Finally the integer constraints (2.3) ensure that we use each edge in a "binary" fashion, i.e. the edge is either part of the tour or it is not part of the tour.
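As an illustration of how Definition 2 can be used in practice, the sketch below (not the thesis' or Concorde's implementation) states the objective and degree constraints with the open-source PuLP modelling library, and adds violated subtour elimination constraints lazily whenever the current solution contains a sub-cycle; a crude version of the cutting-plane idea:

import itertools
import pulp

def solve_tsp_ilp(n, w):
    """Definition 2 with lazily added subtour elimination constraints.
    w[(i, j)] is the weight of the undirected edge (i, j), i < j."""
    edges = list(itertools.combinations(range(n), 2))
    prob = pulp.LpProblem("TSP", pulp.LpMinimize)
    x = {e: pulp.LpVariable("x_%d_%d" % e, cat="Binary") for e in edges}
    prob += pulp.lpSum(w[e] * x[e] for e in edges)        # objective
    for v in range(n):                                    # (2.1)
        prob += pulp.lpSum(x[e] for e in edges if v in e) == 2
    while True:
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        chosen = [e for e in edges if x[e].value() > 0.5]
        S = _component(chosen, 0, n)
        if len(S) == n:               # a single Hamiltonian cycle: done
            return chosen
        # add the violated subtour elimination constraint (2.2) for S
        prob += pulp.lpSum(x[e] for e in edges
                           if (e[0] in S) != (e[1] in S)) >= 2

def _component(chosen, start, n):
    """Vertices reachable from `start` along the chosen edges."""
    adj = {v: set() for v in range(n)}
    for i, j in chosen:
        adj[i].add(j); adj[j].add(i)
    seen, stack = {start}, [start]
    while stack:
        for u in adj[stack.pop()] - seen:
            seen.add(u); stack.append(u)
    return seen

Since the integer constraints (2.3) are enforced from the start, each iteration here solves an integer program; Dantzig, Fulkerson and Johnson instead worked with the LP relaxation and added the cutting planes by hand.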
Table 2.1: Some TSP history milestones (mainly from [Okano, 2009])

Year | Milestone | Contributor(s)
1954 | 49 cities instance solved to optimality using an LP and manually added cutting planes | Dantzig, Fulkerson and Johnson
1962 | 33 cities TSP contest | Procter & Gamble
1970 | Lagrangian relaxation with error about 1 % below optimal | M. Held and R. M. Karp
1972 | Proof of NP-completeness for TSP | R. M. Karp
1973 | k-opt heuristic 1 to 2 % above optimal | Lin and Kernighan
1976 | Factor 3/2-approximation algorithm for MTSP | Nicos Christofides
1983 | Simulated Annealing based heuristic | Kirkpatrick, Gelatt and Vecchi
1985 | Recurrent neural network-based heuristic | Hopfield and Tank
1990 | TSP heuristics based on k-d trees | Bentley
1991 | TSPLIB published | Reinelt
1992 | 3038 cities instance solved to optimality using cutting plane generation (Concorde) | David Applegate, Robert Bixby, Vašek Chvátal and William Cook
1996 | PTAS for Euclidean TSP with running time $n^{O(1/\epsilon)}$ | Sanjeev Arora
1998 | Improved k-opt heuristic LKH, within 1 % above optimal | Keld Helsgaun
2004 | 24 978 cities instance solved by LKH and proven optimal by Concorde | Applegate, Bixby, Chvátal, Cook and Helsgaun
2006 | 85 900 cities instance solved and proven optimal using LKH and Concorde | Applegate, Bixby, Chvátal, Cook, Espinoza, Goycoolea and Helsgaun
In 1962, the TSP became publicly known to a great extent in the USA due to a contest by Procter & Gamble based on a problem instance of 33 cities. The $10 000 prize for the shortest solution was at that time enough to purchase a new house in many parts of the country.

Figure 2.2: The 33 city contest from 1962.
In 1970, Held and Karp developed a one-tree⁶ relaxation which provides a lower bound within 1 % of the optimal. It achieves this by relaxing the degree constraints in Definition 2 using a Minimum Spanning Tree (MST) and Lagrangian multipliers.

In 1972, Karp proved the NP-completeness of the Hamiltonian Cycle Problem (HCP), from which the NP-completeness of the TSP follows almost directly [Karp, 1972].

⁶ A tree containing exactly one cycle
In 1973, Lin and Kernighan proposed a variable-depth edge exchanging heuristic for refining an initial tour. The method, now known as the "Lin-Kernighan" algorithm, performs variable k-opt moves that allow intermediate tours to be longer than the original tour. A k-opt move can be seen as the removal of k edges from a TSP tour followed by the patching of the resulting paths into a tour using k other edges (see Figure 2.3). The algorithm produces tours about 1 to 2 % above the optimal length and has won the reputation of being difficult to implement correctly. In [Helsgaun, 1998, Section 2.3.2, page 7] a survey paper published in 1989 is mentioned, in which the authors claim that no other implementation of the algorithm at that time had shown as good efficiency as was obtained by Lin and Kernighan in [Lin and Kernighan, 1973].
Figure 2.3: Different k-opt moves ((a) 2-opt move, (b) 3-opt move)
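The 2-opt move in Figure 2.3 (a) amounts to deleting two edges and reversing the tour segment between them. A minimal local-search sketch (illustration only; assumes a symmetric distance function d):

def two_opt(tour, d):
    """Repeatedly apply improving 2-opt moves: remove the edges
    (tour[i-1], tour[i]) and (tour[j], tour[j+1]) and reconnect the
    two resulting paths by reversing tour[i..j]. Stops in a local
    optimum. d(a, b) is a symmetric distance function."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, n - 1):
            for j in range(i + 1, n):
                a, b = tour[i - 1], tour[i]
                c, e = tour[j], tour[(j + 1) % n]
                if d(a, c) + d(b, e) < d(a, b) + d(c, e):
                    tour[i:j + 1] = reversed(tour[i:j + 1])
                    improved = True
    return tour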
In 1976, Christofides published a tour construction method that achieves a 3/2-approximation⁷ by using an MST and "Perfect Matching". Apart from the euclidean TSP this is still the tightest approximation ratio known [Christofides, 1976].
Other examples of using the TSP as a "guinea pig" are found in the article [S. Kirkpatrick and Vecchi, 1983], which introduces the random, local search technique known as "Simulated Annealing", and in the article [Hopfield and Tank, 1985], which is one of the first publications discussing "Neural Network" algorithms. Both articles use the TSP as a working example.

In 1990, Bentley developed a new highly efficient variant of the k-d tree⁸ data structure, which is used for proximity checking, while he was working on heuristics for the TSP [Bentley, 1992].
⁷ i.e. guaranteeing a solution no worse than 3/2 times the optimal solution
⁸ a binary search tree structure extended in k dimensions
In 1991, Reinelt composed and published TSPLIB,⁹ a library containing many of the test problems studied over the last 50 years [Reinelt, 1991].
In 1992, David Applegate, Robert Bixby, Vašek Chvátal and William Cook solved a 3038-city TSP instance to optimality using the exact TSP solver program Concorde, whose development they started in 1990. Concorde has ever since been involved in all provably optimal tour records.
In 1996, the first Polynomial Time Approximation Scheme (PTAS) for the euclidean TSP was devised by Arora. The PTAS finds tours of length at most $(1 + \epsilon)$ times the optimal and has a running time of $n^{O(1/\epsilon)}$. Since it had previously been proven that neither the general nor the Metric Traveling Salesman Problem (MTSP) has a PTAS, this result was received with surprise.
In 1998, Keld Helsgaun released a highly efficient and improved extension of the Lin-Kernighan heuristic algorithm, called Lin-Kernighan-Helsgaun (LKH). Among other characteristics it uses one-tree approximations for determining candidate edge-lists¹⁰ and 5-opt moves. LKH has later been extended, and it has participated with Concorde in solving the largest instances of the TSP to this day. Furthermore LKH has been holding the record for the 1 904 711 city World TSP Tour¹¹ since 2003. It has subsequently improved the tour three times (most recently in May 2010).
2.2 Seminal Results in TSP Computation
The ability to solve TSP instances to optimality has profited both from the discovery of new computational techniques and from advances in computer technology. Table 2.2 on page 19 shows some of the milestone results for solving TSP instances.
As mentioned in Table 2.1, the TSP tour for 24 978 cities in Sweden, depicted in Figure 2.4 on the following page, was found in a cooperation between LKH and Concorde. LKH found the optimal tour, and a Concorde calculation using a cluster of 96 machines and a total running time of 84.8 CPU years proved it optimal. The initial cutting-plane procedure on the first LP began in March 2003 and the final branch-and-cut run finished in January 2004. A final check of the solution was completed in May 2004.
The largest TSP instance that has been optimally solved to this date consists of a tour through 85 900 cities in a Very-large-scale Integration (VLSI) application. The problem arose at Bell Laboratories in the late 1980s. This achievement was also a cooperation between Concorde and LKH. Again LKH found a good tour, which was then used as an upper bound input for Concorde.

⁹ http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/
¹⁰ a list containing the preferred routes between two cities
¹¹ http://www.tsp.gatech.edu/world/
Figure 2.4: The optimal TSP tour in Sweden (from http://www.tsp.gatech.edu/sweden/tours/swtour_small.htm).
Table 2.2: Milestones for solving TSP (from [Applegate et al., 2007])

Year | Contributor(s) | Size | TSPLIB name
1954 | G. Dantzig, R. Fulkerson, S. Johnson | 49 cities | dantzig42
1971 | M. Held, R. M. Karp | 57 cities |
1971 | M. Held, R. M. Karp | 64 cities | random points
1975 | P. M. Camerini, L. Fratta, F. Maffioli | 67 cities | random points
1975 | P. Miliotis | 80 cities | random points
1977 | M. Grötschel | 120 cities | gr120
1980 | H. Crowder, M. W. Padberg | 318 cities | lin318
1987 | M. W. Padberg, G. Rinaldi | 532 cities | att532
1987 | M. Grötschel, O. Holland | 666 cities | gr666
1987 | M. W. Padberg, G. Rinaldi | 1002 cities | pr1002
1987 | M. W. Padberg, G. Rinaldi | 2392 cities | pr2392
1992 | Concorde | 3038 cities | pcb3038
1993 | Concorde | 4461 cities | fnl4461
1994 | Concorde | 7397 cities | pla7397
1998 | Concorde | 13509 cities | usa13509
2001 | Concorde | 15112 cities | d15112
2004 | Concorde | 24978 cities | sw24978
2004 | Concorde with Domino-Parity | 33810 cities | pla33810
2006 | Concorde with Domino-Parity | 85900 cities | pla85900
The determination of the optimal tour, which turned out to be one found by LKH, required a total running time of approximately 136 CPU years. To certify the optimality, a proof-checking procedure, which ran for over 500 hours, was devised – a project which led to an article by itself [Applegate, Bixby, Chvátal, Cook, Espinoza, Goycoolea, and Helsgaun, 2009].
2.3 Notes
This chapter is based on [Applegate et al., 2007] and [Okano, 2009]. All
other references are explicitly mentioned.
The Traveling Salesman always visits Knuth first
Knuth homage web page

Chapter 3
Traveling Salesman Problem Solvers
This chapter describes the different types of TSP solvers. It further introduces the TSP solver used in the later algorithm implementations, along with the argumentation for choosing it. To avoid getting too involved in technical details, all descriptions are kept short.
3.1 Characterisation of TSP Solvers

TSP solvers come in two main classes:

Exact Solvers There are two groups of exact solvers. One group solves relaxations of the TSP LP formulation (see Definition 2 on page 13) and uses methods like Cutting Plane, Interior Point, Branch-and-Bound and Branch-and-Cut. Another, smaller group uses Dynamic Programming. For both groups the main characteristic is a guarantee of finding optimal solutions at the expense of running time and space requirements.

Non-exact Solvers These solvers offer potentially non-optimal but typically faster solutions; in a way the opposite trade-off of the exact solvers. Non-exact solvers can be subdivided into:

Approximation Algorithms These algorithms come with a worst case approximation factor for the found solution. The two traditional methods for solving the TSP are a pure MST based algorithm, which achieves a factor 2 approximation, and a combined MST and Minimum Matching Problem (MMP) based algorithm due to Christofides, which achieves a factor 3/2 approximation. Both methods are restricted to the MTSP as they depend on the triangle inequality. The PTAS for Euclidean TSP is mainly a theoretical result due to its prohibitive running time.

Heuristic Algorithms These algorithms only promise a feasible solution. They range from simple tour-construction methods like Nearest Neighbour, Clarke-Wright and Multiple Fragment¹ (a minimal sketch of Nearest Neighbour follows below) to more complicated tour improving algorithms like Tabu Search and Lin-Kernighan. Finally there is a group of fascinating algorithms which unfortunately tend to combine approximate solutions and large running times. Here we find methods like Simulated Annealing, Genetic Algorithms, Ant Colony Algorithms and machine learning algorithms like Neural Networks.

¹ often called Greedy
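The Nearest Neighbour construction mentioned above can be sketched in a few lines of Python (illustration only; d is an assumed distance function):

def nearest_neighbour_tour(n, d, start=0):
    """Greedy tour construction: from the current city, always travel
    to the closest unvisited city; the tour implicitly returns from
    the last city back to the start."""
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        current = tour[-1]
        nearest = min(unvisited, key=lambda city: d(current, city))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour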
3.2 Selection of TSP Solver

The main problem in choosing a TSP solver for the later algorithm implementations was to decide whether an exact solver or a heuristic solver would be appropriate, since the "state of the art" programs for each class are easily found. These implementations are the Concorde exact solver and the LKH heuristic solver respectively. This evaluation is in accordance with the findings reported in [Laporte, 2010].
Concorde This is an ANSI C implementation consisting of approximately 130 000 lines of code. Concorde is an LP based solver that uses several modern cutting plane techniques, thus approaching the optimal value from below. Like all exact solvers, its running time and space requirements suffer from the "cost of being exact". The source code is available for academic use.²

LKH This is a considerably optimised modification of the Lin-Kernighan algorithm, implemented in ANSI C. To quote [Applegate et al., 2007, Chapter 17.2]:

    Since its introduction, the LKH variant of the local-search algorithm of Lin and Kernighan has set the standard for high-end TSP heuristics.

LKH starts by using a Held and Karp one-tree ascent method to construct a candidate edge list. A problem instance is then solved using a number of independent runs. Each run consists of a series of trials, where in each trial a tour is determined by the modified Lin-Kernighan algorithm. The trials of a run are not independent, since edges belonging to the best tour of the current run are used to prune the search. In the standard setting, LKH refines using 5-opt moves in ten independent runs, each containing a number of trials equalling the number of cities. A typical output from LKH can be seen in Appendix B. The code is distributed for research use.³

² http://www.tsp.gatech.edu/concorde.html
³ http://www.akira.ruc.dk/~keld/research/LKH/
Figure 3.1: An LKH tour of 11 455 Danish cities.
In order to gain experience with the programs the following experiment was conducted: on a single dedicated machine⁴ both programs were challenged with the dk11455.tsp problem. As the name indicates, the problem is a tour through 11 455 cities in Denmark (see Figure 3.1). The optimal solution is a tour of length almost 21 000 km. Like most of the other large solved tours, it was found by LKH and proven optimal using Concorde. With both programs in standard configurations, the results shown in Table 3.1 on the following page and Figure 3.2 on page 25 were obtained.

The experiment with Concorde was stopped after 46 days. As can be seen in Figure 3.2, it looks as if Concorde would steadily approach the optimal value.

⁴ Pentium 4, 3066 MHz, 512 MB RAM
Table 3.1: LKH and Concorde results for dk11455.tsp

TSP instance dk11455.tsp, OPT: 20 996 131

LKH: started day 1; produced 10 tours. Best tour: 20 996 166
Concorde: started day 1 with the best LKH tour (20 996 166) as upper bound. Initial lower bound: 20 991 500

Concorde progress:
Day | Active nodes | Temporary files | Lower bound
  6 |          557 |          1.2 GB |  20 992 921
  8 |          987 |          2.1 GB |  20 993 016
 15 |         1183 |          2.5 GB |  20 993 069
 22 |         1478 |          3.0 GB |  20 993 116
 29 |         1803 |          3.7 GB |  20 993 152
 36 |         2027 |          4.1 GB |  20 993 171
 43 |         2192 |          4.4 GB |  20 993 182
 46 | (experiment stopped)
A crude extrapolation⁵ predicts that the optimal value would have been found in about 16 months. Although the TSP instance in this experiment was larger than the expected size of the instances in the thesis experiments, the characteristics of the two programs were clearly illustrated.

In the end LKH was chosen as the standard TSP solver for the experiments in this thesis on account of the following:

• As mentioned in [Laporte, 2010], it consistently produces optimal solutions on small and medium size instances, in less than a second for n = 100 and in less than a minute for n = 1000.⁶ This covers the expected size of instances in the thesis experiments very well.

• The need for provable optimality is of minor importance when the TSP solution is used to implement heuristic algorithms.
3.3 Notes

Concorde is described in detail by its authors in [Applegate et al., 2007], and LKH is described in detail in the technical reports [Helsgaun, 1998, 2006] and in the papers [Helsgaun, 2000, 2009].

⁵ not scientifically sound but very tempting
⁶ on a 300 MHz G3 Power Macintosh
Figure 3.2: Results on dk11455.tsp, plotting LKH tour values, the optimal tour length and the Concorde lower bound against days 0-46. Note the addition of 20 990 000 on the y-axis.
Part II
The Shortest Superstring Problem

[Part opener illustration: Person and bases tour with 16384 cities]

The Shortest Common Superstring Problem, an elegant but flawed abstraction
Richard M. Karp

Chapter 4
The Shortest Superstring Problem
This chapter introduces and defines the Shortest Superstring Problem. It further contains the motivation for including the problem in this thesis.

4.1 The Problem
The SSP or Shortest Common Superstring Problem is defined as follows:
Definition 3
Given a collection of strings $S = (s_1, s_2, \ldots, s_n)$ and a finite alphabet $\Sigma$, with $s_j \in \Sigma^*$, $j \in 1 \ldots n$, determine the shortest string containing each string in $S$ as a (consecutive) substring. In other words, find the shortest superstring for the collection $S$.

Remark 3
Without loss of generality (WLOG) it can be assumed that $S$ is substring free, i.e. no string in $S$ is a substring of any other string in $S$. This assumption, which is made for the rest of this thesis part, does not influence the shortest superstring.
The SSP is, like the TSP, NP hard [Gusfield, 1997, chapter 16.17] and also MAX-SNP hard [Blum, Jiang, Li, Tromp, and Yannakakis, 1994, Section 7], which means that no PTAS exists unless P = NP. It is therefore highly improbable that a general, efficient algorithm for solving the SSP exists; thus the best we can hope for is approximation algorithms and heuristics.
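For very small collections, however, Definition 3 can be solved by brute force. The Python sketch below (illustrative only; naive quadratic overlap computation; assumes a substring free collection as in Remark 3) merges every ordering of the strings with maximal overlaps and keeps the shortest result:

from itertools import permutations

def overlap_len(s, t):
    """Length of the longest proper suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[-k:] == t[:k]:
            return k
    return 0

def shortest_superstring(strings):
    """Exponential brute force over all orderings; tiny inputs only."""
    best = "".join(strings)
    for order in permutations(strings):
        sup = order[0]
        for t in order[1:]:
            sup += t[overlap_len(sup, t):]   # append only the new part
        if len(sup) < len(best):
            best = sup
    return best

print(shortest_superstring(["ate", "half", "lethal", "alpha", "alfalfa"]))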
4.2 Applications of SSP
Articles concerning the SSP typically mention applications in bioinformatics and data compression [Blum et al., 1994; Breslauer, Jiang, and Jiang, 1996; Ilie and Popescu, 2006; Kaplan, Lewenstein, Shafrir, and Sviridenko, 2005; Kaplan and Shafrir, 2005; Romero, Brizuela, and Tchernykh, 2004; Tarhio and Ukkonen, 1988].

The first application refers to the DNA-sequencing method known as Shotgun Sequencing; the second refers to the idea of compressing a set of strings into their shortest superstring combined with an index pair for each string marking the beginning and ending index respectively. As we will argue, the citation of these applications seems to be based more on "transitive" citing than on real applications.
4.2.1 Shotgun Sequencing
When having to sequence a genome, it is not technically possible to sequence strands longer than 100 to 1000 basepairs. For this reason longer strands must be divided into smaller pieces and subsequently re-assembled. In Shotgun Sequencing one first makes many copies (clones) of the DNA of interest. These copies are then randomly cut into smaller pieces by physical, enzymatic or chemical means.

The difficulty in assembling the original sequence stems from the fact that, though the smaller pieces constitute overlapping sequences of the original DNA, the succession of the overlapping sequences is lost in the cutting process.

This is where the SSP comes into play, as a theoretical model for re-assembling the overlapping pieces into a superstring which is assumed to be a good approximation of the original DNA sequence. This neat, simple idea has set off many studies of efficient approximation algorithms for the SSP, but as a model for sequence assembly it does have some major problems, especially concerning read errors in the fragments and identical sub-sequences (repeats) in the original DNA sequence.
Example 5
At the top of Figure 4.1 a sequence with two repeats (marked R) is cloned four times and the clones cut into smaller pieces. After assembly (at the bottom of the figure) only one occurrence of the repeat region will be present, together with a smaller and originally non-existent region (marked R'). This is due to all the "inner" fragments of the repeats being assembled as belonging to a single occurrence of the repeat.
A rather devastating critique of using the SSP directly to solve sequencing problems is given by Richard M. Karp in [Karp, 2003]:

    The shortest superstring problem is only superficially related to the sequencing assembly problem. Its difficulty stems from pathological examples that are unlikely to occur in practice. It does not take noisy reads into account, and admits solutions with an unreasonably large number of mutually overlapping reads.
Figure 4.1: sequence assembly problem with repeats
Although Shotgun Sequencing was the most advanced technique for sequencing genomes from about 1995–2005¹ and was used by the privately funded group solving the mapping of the human genome [Venter, Adams, Myers, Li, Mural, Sutton, Smith, Yandell, Evans, Holt, et al., 2001], newer technologies known as next-generation sequencing have surfaced. These techniques are far superior to Shotgun Sequencing, both in the amount of data produced as well as with respect to the time needed for producing the data, though they do tend to have lower accuracy [Metzker, 2009].

¹ http://en.wikipedia.org/wiki/Shotgun_sequencing
4.2.2 Data compression
The idea of using the SSP as a way of compressing a set of strings originates in the article [Gallant, Maier, and Storer, 1980], which itself refers back to two technical reports by J. Storer (one of the authors) from 1977.

The practical application mentioned is the need for string compression by a compiler. It has not been possible to track down references describing the application of this technique in compilers. Furthermore, string compression is hardly of any significant concern for modern day compilers, as the time versus space trade-off clearly seems to favour speed. An even more serious objection, mentioned in [Gusfield, 1997, page 445, exercise 25, 26], is that it seems like a rather exotic idea to try to solve a compression problem by using an NP complete problem — especially as there are simpler, better and more efficient algorithms available, e.g. based on Cycle Covers (CCs).
4.2.3 Why is the SSP then interesting anyway?
Having defied the traditional practical uses of the SSP, and thereby almost removed the foundation for working with the problem, we now face the challenge of arguing for its inclusion in this thesis. In prioritised order the reasons are:

1) the SSP is in itself an interesting problem,

2) historically the SSP has been one of the first sequencing models "bridging" between computer science and biology,

3) the SSP has been subject to a considerable amount of research leading to a number of approximation algorithms. To quote from [Gusfield, 1997, Chapter 16.1]: "This pure, exact string problem is motivated by the practical problem of shotgun sequence assembly and deserves attention if only for the elegance of the results that have been obtained.",

4) the SSP is closely related to the TSP, so it would seem unnatural not to include it at all, and finally

5) the SSP as a crude model is usable in bioinformatics. Besides the Shotgun Sequencing method, it is related to the class of DNA-sequencing methods known as "Sequencing by Hybridisation", and it has also been proposed as a model for genome compression of viruses [Ilie and Popescu, 2006].

With these reasons in mind, the rest of this thesis part will hopefully seem well-founded to the reader.
Chapter 5
Preliminaries
This chapter will introduce notation, definitions, some basic facts about strings, and approximation strategies, all of which will be needed in the later treatment of approximation factors and algorithm descriptions. For further notation usage see Section 1.2 on page 4.
5.1 Strings
Given two strings $s, t$ where $s = xy$ and $t = yz$ with $x, z \neq \epsilon$: if $y$ is the longest such string, it is called the (maximal) overlap of $s$ and $t$ and is denoted $ov(s, t)$. The string $x$ is called the (minimal) prefix of $s$ and $t$ and is denoted $pr(s, t)$. Similarly $z$ is sometimes called the suffix of $s$ and $t$ and is denoted $suf(s, t)$. The length of the prefix, $|pr(s, t)|$, is called the distance from $s$ to $t$ and is denoted $dist(s, t)$.
Example 6
Let $s = half$, $t = alfalfa$. Then
$ov(s, t) = alf$, $ov(t, s) = \epsilon$, $ov(t, t) = alfa$,
$|ov(s, t)| = 3$, $|ov(t, s)| = 0$, $|ov(t, t)| = 4$,
$pr(s, t) = h$, $pr(t, s) = alfalfa$, $pr(t, t) = alf$,
$dist(s, t) = 1$, $dist(t, s) = 7$, $dist(t, t) = 3$.
Note that $|s| = dist(s, t) + |ov(s, t)|$.
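These definitions translate directly into code. A naive quadratic Python sketch (names mirror the notation above; illustration only, not the thesis' implementation):

def ov(s, t):
    """Maximal overlap: longest y with s = xy, t = yz, x and z nonempty."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s[len(s) - k:] == t[:k]:
            return t[:k]
    return ""

def pr(s, t):
    """Minimal prefix: the part of s preceding ov(s, t)."""
    return s[:len(s) - len(ov(s, t))]

def dist(s, t):
    """Distance from s to t: |pr(s, t)|."""
    return len(pr(s, t))

s, t = "half", "alfalfa"
assert ov(s, t) == "alf" and ov(t, s) == "" and ov(t, t) == "alfa"
assert pr(s, t) == "h" and dist(t, s) == 7 and dist(t, t) == 3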
Definition 4 (Merge of strings)
Let $S = (s_1, s_2, \ldots, s_n)$ be an ordered collection of strings. Then
$\langle s_1, s_2, \ldots, s_{n-1}, s_n \rangle = pr(s_1, s_2)\, pr(s_2, s_3) \cdots pr(s_{n-1}, s_n)\, s_n$
is called the merge of the strings $(s_1, s_2, \ldots, s_n)$. If $S$ is substring free, the order of the strings in the merge is well-defined, as no two strings will begin or end at the same index.
Example 7
For two strings $s, t$ we have
$\langle s, t \rangle = pr(s, t)\, t = pr(s, t)\, ov(s, t)\, suf(s, t) = s\, suf(s, t)$,
$|\langle s, t \rangle| = dist(s, t) + |t| = |s| - |ov(s, t)| + |t| = |s| + |t| - |ov(s, t)|$.
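Definition 4 then becomes a short fold over the collection; a sketch reusing the pr helper from the previous code block:

def merge(strings):
    """<s1, ..., sn> = pr(s1,s2) pr(s2,s3) ... pr(s_{n-1},s_n) s_n."""
    prefixes = [pr(strings[i], strings[i + 1])
                for i in range(len(strings) - 1)]
    return "".join(prefixes) + strings[-1]

# <half, alfalfa> = pr(half, alfalfa) alfalfa = "h" + "alfalfa" = "halfalfa",
# of length |s| + |t| - |ov(s, t)| = 4 + 7 - 3 = 8, as in Example 7.
assert merge(["half", "alfalfa"]) == "halfalfa"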
Lemma 1
For an ordered collection of strings $S = (s_1, s_2, \ldots, s_n)$, the merge of the strings is the shortest superstring for the strings in that order.

To get warmed up, and for the sake of completeness, we include the proof:

PROOF ([Gusfield, 1997]) The fact that the merge is a superstring is easiest seen by induction on the string indices, beginning from the end of the merge.

Base case: $s_n$ is a substring of the merge, as it is explicitly included.

Induction case: Assume $s_{i+1}$ is a substring of the merge. Now $s_i = pr(s_i, s_{i+1})\, ov(s_i, s_{i+1})$. As $pr(s_i, s_{i+1})$ is explicitly included in the merge, it remains to show that $ov(s_i, s_{i+1})$ follows directly after. Since $S$ is substring free, we have that $ov(s_i, s_{i+1})$ is a prefix of $s_{i+1}$, but $s_{i+1}$ is per the induction hypothesis a substring of the merge starting at $pr(s_i, s_{i+1})$.

The minimality follows from the definition of $pr(\cdot, \cdot)$.
Definition 5 (Cyclic String)
For any string $s$ we denote by $\phi(s)$ the cyclic string resulting from "wrapping around" the string so that $s[|s|]$ precedes $s[1]$. The length of $\phi(s)$ equals $|s|$.

Any string $t$ which is a substring of $s^\infty$ is said to map to the cyclic string $\phi(s)$, and correspondingly $\phi(s)$ is said to be mapping all such strings.
Example 8
Let $s = abac$. Then all of the following strings map to $\phi(s)$:
$t = ba$, $u = bacaba$, $v = acabacabaca$.
Definition 6 (Factor and Period)
A string $x$ is a factor of another string $s$ if $s = x^i y$ with $y$ a (possibly empty) prefix of $x$ and $i \in \mathbb{N}$. The length of the factor, $|x|$, is called a period of $s$.

The shortest factor of a string $s$ and its period are denoted $factor(s)$ and $period(s)$ respectively. In case $s$ is semi-infinite,¹ we call $s$ periodic if $s = xs$ for $x \neq \epsilon$. The shortest such $x$ is called the factor of $s$ and its length the period of $s$. Two strings $s, t$ (finite or not) are called equivalent if their factors are cyclic shifts of each other, i.e. $\exists\, x, y : factor(s) = xy$ and $factor(t) = yx$.
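For finite strings, the shortest factor and its period can be computed in linear time from the longest proper border (a prefix that is also a suffix) of $s$, using the standard prefix-function from string matching; the algorithm below is textbook material, not taken from this thesis, and uses the fact that $period(s) = |s|$ minus the length of the longest proper border:

def period(s):
    """Smallest period of s: |s| - (longest proper border of s)."""
    border = [0] * len(s)      # border[i]: longest border of s[: i+1]
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]
        if s[i] == s[k]:
            k += 1
        border[i] = k
    return len(s) - border[-1]

def factor(s):
    """Shortest factor: s = x^i y with x = factor(s), y a prefix of x."""
    return s[:period(s)]

assert period("alfalfa") == 3 and factor("alfalfa") == "alf"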
The following is a classical string algorithm result.

Lemma 2 (The Periodicity Lemma)
Let $s$ be a string with periods $p, q$. If $|s| \geq p + q$ then the greatest common divisor of $p$ and $q$, $\gcd(p, q)$, will also be a period for $s$.

PROOF ([Gusfield, 1997]) Assume WLOG that $q < p$. Consider the number $d = p - q$. We will show that $s[i] = s[i + d]$ for all $i \in 1 \ldots |s| - d$.

Case $q < i$: This means $s[i - q]$ exists. We now have:
$s[i] = s[i - q]$ (as $q$ is a period for $s$) $= s[i - q + p] = s[i + d]$ (as $p$ is a period for $s$).

Case $i \leq q$: This means $s[i + p]$ exists, as $|s| \geq p + q$. We now have:
$s[i] = s[i + p]$ (as $p$ is a period for $s$) $= s[i + p - q] = s[i + d]$ (as $q$ is a period for $s$).

As $|s| \geq (p - q) + q$, we have shown that, applying Euclid's Algorithm on $p, q$, we will at each recursive step have numbers that are periods for $s$. As Euclid's Algorithm ends with $\gcd(p, q)$, the claim follows.

¹ an infinite string with one fixed end: $s = ababab\ldots$
5.2 Graphs and Cycle Covers
To model the overlaps of a collection of strings S = (s1 , s2 , . . . , sn ) the
following two complete graphs are very useful:
Definition 7 (Prefix (Overlap) Graph)
Let G(N, A, W) be a weighted, directed graph with
N = S,  A = {aij} ∀ i, j ∈ 1 . . . n,
W : A → ℕ with w(aij) = dist(si, sj) (prefix graph) or w(aij) = |ov(si, sj)| (overlap graph).
2
We associate each arc in the prefix (overlap) graph with the prefix (overlap) of the two strings corresponding to the start- and end-nodes for the
arc.
Example 9
The prefix (distance) and overlap graphs for the string collection: (ate,
half, lethal, alpha, alfalfa) is illustrated in Figure 5.1. Only arcs having
values different from the length of the string (in the prefix graph) and
zero (in the overlap graph) are shown.
2
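A small sketch of Definition 7, reusing the ov helper from the earlier sketch, builds both weight matrices for this example collection:

S = ["ate", "half", "lethal", "alpha", "alfalfa"]
n = len(S)

# w(a_ij) = |ov(s_i, s_j)| in the overlap graph, dist(s_i, s_j) in the prefix graph
W_ov = [[len(ov(S[i], S[j])) if i != j else 0 for j in range(n)] for i in range(n)]
W_pr = [[len(S[i]) - W_ov[i][j] for j in range(n)] for i in range(n)]

assert W_ov[1][4] == 3   # ov('half', 'alfalfa') = 'alf'
assert W_pr[1][4] == 1   # dist('half', 'alfalfa') = |'h'|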
[Figure 5.1: (a) prefix graph and (b) overlap graph for the strings ’ate’, ’half’, ’lethal’, ’alpha’ and ’alfalfa’ (from [Blum et al., 1994])]
Note that a cycle in the prefix graph for S consisting of the nodes
n1 , n2 , . . . , nk will be of the form
C = φ(t) = φ(pr(s1 , s2 ) pr(s2 , s3 ) · · · pr(sk−1 , sk ) pr(sk , s1 ))
The notation C = s1 , s2 , . . . , sk will be used to describe such a cycle in
terms of the strings from S that C is mapping.
With tsi we will denote the string equivalent to t achieved by “breaking” the cycle at the point where string si starts. Defining sn+1 = s1 we
then have:
tsi = pr(si , si+1 ) pr(si+1 , si+2 ) · · · pr(si−2 , si−1 ) pr(si−1 , si )
The next two lemmas prove a rather intuitive property and a more
interesting property concerning cycles in the prefix graph:
Lemma 3
For any cycle C = φ(t) = s1, s2, . . . , sn in the prefix graph for S we have that every si, i ∈ 1 . . . n, and ⟨s1, s2, . . . , sn⟩ map to φ(t)
2
PROOF As the string collection is a substring free set, we have that si is a prefix of pr(si, si+1) si+1, which again is a prefix of
pr(si, si+1) pr(si+1, si+2) si+2
and so on (defining sn+1 = s1). By letting k = ⌊|si|/|t|⌋ we get that si is a prefix of t_{si}^{k+1}, which itself maps to φ(t).
The claim for ⟨s1, s2, . . . , sn⟩ can be shown by an analogous argument.
In a way the “reverse” property also holds:
Lemma 4
Let S be a collection of strings and let S′ ⊆ S have the property that all strings in S′ are substrings of t∞. Then there exists a cycle C = φ(t) in the prefix graph for S such that all strings in S′ map to C
2
PROOF All strings in S′ will appear in t∞ within an interval of |t| characters. This induces a circular ordering of the strings in S′. A cycle in the prefix graph for S defined by this ordering will fulfil the criteria.
The optimal SSP solution for S = (s1, s2, . . . , sn) will be of the form ⟨sπ1, sπ2, . . . , sπn⟩ for some permutation π : [1, n] → [1, n]. This means:
OPT(S) = ⟨sπ1, sπ2, · · · , sπn⟩
= pr(sπ1, sπ2) pr(sπ2, sπ3) · · · pr(sπn−1, sπn) sπn
= pr(sπ1, sπ2) pr(sπ2, sπ3) · · · pr(sπn−1, sπn) pr(sπn, sπ1) ov(sπn, sπ1)
Note that the above concatenation of prefixes is describing a cycle in the
prefix graph for S. For the length of the optimal solution we then have:
|OP T (S)| = |pr(sπ1 , sπ2 )| + |pr(sπ2 , sπ3 )| + · · · + |pr(sπn , sπ1 )| + |ov(sπn , sπ1 )|
(5.1)
Remark 4
Expression 5.1 conveys that the length of the optimal solution is the sum
of a TSP tour in the prefix graph for S plus the overlap of the “end” and
“start” strings of the tour.
2
37
5. P RELIMINARIES
If we let |T SP ∗ (Gpr )| denote the length of a shortest TSP tour in the prefix
graph for S we get the following lower bound for |OP T (S)|:
|T SP ∗ (Gpr )| < |T SP ∗ (Gpr )| + |ov(sπn , sπ1 )| ≤ |OP T (S)|
(5.2)
If we let wpr(aij), wov(aij) denote the weights for the arc aij in the prefix- and overlap graph respectively, we have:
wpr (aij ) + wov (aij ) = |si | ⇔ wpr (aij ) = |si | − wov (aij ), ∀ i, j ∈ 1 . . . n (5.3)
We now turn to the important concept of a Cycle Cover:
Definition 8 (Cycle Cover of String Collection S)
Let CC(S) = {C1 , C2 , . . . , Ci }
= {φ(t1 ), φ(t2 ), . . . , φ(ti )}
where ∀ sj ∈ S, ∃ φ(tk ) ∈ CC(S) : sj maps to φ(tk )
then CC(S) is called a Cycle Cover of S.
2
The size of a Cycle Cover CC(S) is denoted ||CC(S)|| and equals ∑_{k=1}^{i} |φ(tk)|.
The minimal length cycle cover of a collection of strings S is denoted
CC ∗ (S). We get the following properties for a minimal CC.
Lemma 5
For every si ∈ S = (s1 , s2 , . . . , sn ), si maps to a unique Ci ∈ CC ∗ (S).
For every distinct pair of cycles Ci = φ(ti ), Cj = φ(tj ) ∈ CC ∗ (S) we have
ti and tj are inequivalent.
2
PROOF As the collection is substring free, every pair of strings si, sj will have pr(si, sj) ≠ ε. If si mapped onto more than one cycle in CC∗(S), it would contribute to the length of CC∗(S) with two non-empty prefixes, which contradicts the minimal length of CC∗(S). The second claim now follows immediately, as the strings mapping to two equivalent cycles would be identical.
We can now show a useful property of the prefix and overlap graphs:
Theorem 1 (Prefix-Overlap Graph Duality)
Let S = (s1 , s2 , . . . , sn ) be a string collection, then the following relations
hold between TSP-tours and CCs in the prefix and overlap graphs for S:
• A shortest TSP tour in the prefix graph will be a maximum TSP tour
in the overlap graph (and vice versa).
• A minimal CC in the prefix graph will be a maximum CC in the overlap graph (and vice versa).
2
38
5.3. Approximation Strategies
PROOF Let TSP∗(Gpr) = nπ1, nπ2, . . . , nπn be a shortest TSP tour in the prefix graph for S. Using Expression 5.3 on the preceding page we then have:
|TSP∗(Gpr)| = |pr(sπ1, sπ2) pr(sπ2, sπ3) · · · pr(sπn−1, sπn) pr(sπn, sπ1)|
= |pr(sπ1, sπ2)| + |pr(sπ2, sπ3)| + · · · + |pr(sπn, sπ1)|
= |sπ1| − |ov(sπ1, sπ2)| + |sπ2| − |ov(sπ2, sπ3)| + · · · + |sπn| − |ov(sπn, sπ1)|
= ||S|| − (|ov(sπ1, sπ2)| + |ov(sπ2, sπ3)| + · · · + |ov(sπn, sπ1)|)
= ||S|| − |ov(sπ1, sπ2) ov(sπ2, sπ3) · · · ov(sπn−1, sπn) ov(sπn, sπ1)|   (5.4)
Now, as |TSP∗(Gpr)| is minimal and ||S|| is a constant, the string length on the right-hand side of Expression 5.4 will have to be maximal. But this is exactly the length of a TSP tour in the overlap graph for S. The other relation can be proved analogously.
We can now formulate the following lower bound relation:
||CC ∗ (S)|| ≤ |T SP ∗ (Gpr )| < |OP T (S)|
(5.5)
by noting that a TSP tour in the prefix graph for S is a special case of a CC
for S. Equivalently, any CC of S can be viewed as a solution to the LP-formulation of the TSP (see Definition 2 on page 13) with the “subtour elimination constraint” removed.
5.3 Approximation Strategies
From the results in the previous section it can be seen that a possible
approximation strategy for SSP would be to approximate the length of a
TSP tour in the prefix graph for the string collection. Unfortunately the
best known approximation factors for the minimum Asymmetric Traveling Salesman Problem (ATSP) are of the order of log(n) [Kaplan et al.,
2005], which makes this strategy less interesting.
Another approach is to consider the length of the optimal solution expressed in the overlap graph for the string collection S:
|OPT(S)| = |⟨sπ1, sπ2, · · · , sπn⟩|
= |pr(sπ1, sπ2)| + |pr(sπ2, sπ3)| + · · · + |pr(sπn−1, sπn)| + |sπn|
= (|sπ1| − |ov(sπ1, sπ2)|) + (|sπ2| − |ov(sπ2, sπ3)|) + · · · + (|sπn−1| − |ov(sπn−1, sπn)|) + |sπn|
= ∑_{i=1}^{n} |sπi| − ∑_{i=1}^{n−1} |ov(sπi, sπi+1)|
= ||S|| − ∑_{i=1}^{n−1} |ov(sπi, sπi+1)|   (5.6)
The right-hand sum in Expression 5.6 is called the total overlap or the compression for OPT(S) and will be denoted OV∗(S).
Remark 5
Expression 5.6 conveys that the length of an optimal SSP solution equals
the length of S minus the length of a maximum Hamiltonian Path in the
overlap graph for S.
2
The maximum Hamiltonian Path Problem (HPP) can actually be approximated within a constant factor, but this does not directly lead to a usable approximation strategy for SSP. The reason is that the optimal solution may have a large total overlap (of size O(||S||)), which means that sacrificing even a small fraction of the total overlap can lead to bad solutions.
Example 10
Let S = (abc, bcd, cde, def, efg, fgh, ghi, hij); then
||S|| = 8 · 3 = 24,  OV∗(S) = ∑_{i=1}^{7} |ov(si, si+1)| = 7 · 2 = 14
and |OPT(S)| = |abcdefghij| = 24 − 14 = 10.
An approximation factor for the total overlap of even 2/3 could lead to an SSP solution that is 50 % larger than optimal.
2
The main problem with this approximation strategy is that we are approximating the total overlap and not the length of the SSP solution. For this strategy to be applicable it is therefore necessary to ensure that the strings in the collection do not have large mutual overlaps. This leads to the following approximation algorithm template:
Definition 9 (CC Approximation Algorithm Template)
Given a collection of strings, S, transform S into another set of strings, R, with the following properties:
1) Strings in R do not overlap too much.
2) All strings in S are substrings of a string in R, i.e. a superstring for R induces a superstring for S.
3) |OPT(R)| is not much larger than |OPT(S)|.
Remark 6
The above template achieves the following:
• from 1) it follows that the reduction to HPP is applicable, and
• from 2) and 3) it follows that an SSP solution for R is a good SSP
solution for S.
These claims will be formalised in Section 6.1.1.
2
The standard approach to fill in the template is:
• Find a minimum Cycle Cover CC∗(S) = {C1, C2, . . . , Cj} in the prefix graph for S.
• For all Ci = si1, si2, . . . , sik ∈ CC∗(S) choose a representative r(Ci) for the cycle, containing as substrings all strings in S that map to the cycle. For example let r(Ci) = ⟨si1, si2, . . . , sik⟩.
• Let R = {r(Ci)}, i ∈ 1 . . . j.
The rationale behind this approach is that strings that overlap a lot
will end up in the same cycle in the CC, so substituting these strings by a
single representative should not increase the length of the optimal solution too much, but it will reduce the total overlap for the instance. This
approach is the basis for the implemented algorithms called Cycle (Section 6.1 on page 43) and RecurseCycle (Section 6.2 on page 53) respectively.
5.4 Notes
This chapter is based on the excellent but unfinished note by Marcin
Mucha [Mucha, 2007]. All other references are explicitly mentioned.
Chapter 6
Algorithms

Copy from one, it’s plagiarism; copy from two, it’s research.
John Milton
In order to run practical experiments for solving the SSP, five algorithms were implemented. The first two are examples of CC approximation algorithms, the third is a well-known greedy algorithm, the fourth is an approximation algorithm achieving the best known approximation factor of 2.5, and the last algorithm is TSP based.
This chapter contains descriptions of these algorithms and their actual implementations. It additionally comprises theoretical sections
containing proofs for the approximation factors. It is the intention that
the chapter should be self-contained with respect to these proofs. The
strategy has been to include proofs for lemmas and theorems originating in directly used references and omit proofs for results that are only
referenced in these.
The survey character of this chapter has had the unfortunate effect of
leading to sections having an abundance of lemmas, theorems and proofs.
In order to compensate for this and ease the reading, most of the proofs
have been elaborated on in terms of smaller steps and/or supplementary
lemmas, remarks and examples.
6.1 The Cycle Algorithm
This algorithm is a variant of the algorithm TGREEDY in [Blum et al., 1994], called 3All in [Romero et al., 2004]. It consists of the following six steps, where steps 2) to 3) correspond to the template from Definition 9 on page 41:
Given a list of strings S
1) Construct prefix graph for S.
2) Find a minimum Cycle Cover CC ∗ (S) = {C1 , C2 , . . . , Cj } in the prefix
graph for S.
3) For all Ci = si1 , si2 , . . . , sik in the CC choose a representative r(Ci ) =
hsi1 , si2 , . . . , sik i by breaking the cycle at an index so that |r(Ci )| is
smallest possible. Define the set R = (r(Ci )), i ∈ 1 . . . j.
4) Repeat step 1) for R.
5) Repeat step 2) for R creating a new set of representatives T , with
the restriction that the CC for R should be non-trivial, i.e. no cycle
in the cover includes only a single string from R.
6) Return a random merge of T.
Though the algorithm does the same thing twice, the restriction in step 5) ensures that the sets R and T are different. Pseudo-code for the algorithm is shown in Algorithm II.1 on the facing page.
6.1.1 Approximation factor
In order to show the approximation factor for the Cycle algorithm, we
will refrain from the direct proofs in [Blum et al., 1994; Gusfield, 1997]
preferring the more general proof-method by Mucha. The reason for this
choice is that it will lead to a refinable theorem and a “reusable” proof
(Theorem 2 on page 47).
The approximation factor for the Cycle algorithm follows from a formalisation of the claims in Remark 6 on page 41 concerning the constructed representatives. Beginning with the last claim and denoting the
original collection of strings S and the constructed set of representatives
R we have:
Lemma 6
|OP T (R)| ≤ |OP T (S)| + |CC ∗ (S)| ≤ 2 · |OP T (S)|
2
PROOF ([Mucha, 2007]) We have R = (r(C1), r(C2), . . . , r(Cj)) where each r(Ci) is of the form ⟨si1, si2, . . . , sik⟩. For each r(Ci) let
r̂(Ci) = ⟨si1, si2, . . . , sik, si1⟩ and let
R̂ = (r̂(C1), r̂(C2), . . . , r̂(Cj)).
Since each r(Ci) is a substring of r̂(Ci) we have |OPT(R)| ≤ |OPT(R̂)|.
Each r̂(Ci) begins and ends with the same string. Let s(r̂(Ci)) denote this string and let SR̂ = (s(r̂(C1)), s(r̂(C2)), . . . , s(r̂(Cj))). Since SR̂ ⊆ S we have |OPT(SR̂)| ≤ |OPT(S)|. Noting that OPT(SR̂) can be
Algorithm II.1 Cycle
Require: Nonempty substring free list of strings, sl
    procedure CYCLE(sl)
        avoidSimpleCycles ← false
3:      reps1 ← MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
        if size(reps1) = 1 then
            return first(reps1)
6:      end if
        avoidSimpleCycles ← true
        reps2 ← MAKEREPRESENTATIVES(reps1, avoidSimpleCycles)
9:      return MERGESTRINGS(reps2)
    end procedure

12: procedure MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
        CC ← MAKEMINIMUMCYCLECOVER(sl, avoidSimpleCycles)
        reps ← empty list
15:     for all C = sc0, sc1, . . . , sck ∈ CC do
            i ← BESTBREAKINDEX(C)
            rep ← ⟨s(i mod k), s((i+1) mod k), . . . , s((i−1) mod k)⟩
18:         reps ← reps ∪ rep
        end for
        return reps
21: end procedure

Termination: The algorithm only iterates over finite non-increasing data structures.
Correctness: Assuming all the called sub-procedures are correct (see Section 6.1.2 onwards), all strings in the parameter list sl will be substrings of a representative, so the algorithm yields a feasible solution of the SSP.
transformed into a solution for R̂ by replacing each s(r̂(Ci )) by its corresponding r̂(Ci ) and that this transformation will increase the length by
|CC ∗ (S)| we get
|OP T (R)| ≤ |OP T (R̂)|
≤ |OP T (SR̂ )| + |CC ∗ (S)|
≤ |OP T (S)| + |CC ∗ (S)|.
In order to show that the strings in R do not overlap too much, we will need the following lemma:
Lemma 7
Let C = φ(t) = s1, s2, . . . , sn ∈ CC∗(S) and let r(C) be the representative for C; then
period(r(C)) = |t|.
2
PROOF ([Mucha, 2007]) As r(C) maps to C = φ(t), |t| will be a period for r(C). Assume there exists a factor f for r(C) with |f| < |t|. Then we have that r(C) maps to φ(f). Being substrings of r(C), so do all the strings s1, s2, . . . , sn. Lemma 4 on page 37 now gives the existence of a cycle shorter than C “including” all the strings that map to C, which contradicts that C belongs to a minimum cycle cover for S.
We can now show
Lemma 8
Let C1 = φ(t1), C2 = φ(t2) ∈ CC∗(S) and let r(C1), r(C2) be their representatives; we then have
|ov(r(C1), r(C2))| < |t1| + |t2|.
2
PROOF ([Mucha, 2007]) The cycles C1, C2 ∈ CC∗(S) are both minimal, and from Lemma 7 it follows that period(r(C1)) = |t1| and period(r(C2)) = |t2|. Let u = ov(r(C1), r(C2)). Now if |u| ≥ |t1| + |t2| we would have that |t1|, |t2| both are periods for u. Then the Periodicity Lemma (Lemma 2 on page 35) in connection with Lemma 4 on page 37 would lead to a contradiction of the minimality of C1 and C2.
Using Expression 5.5 on page 39 and Lemma 8 we get:
Corollary 1
Any SSP-solution for the set R will have
OV(R) ≤ 2 · OPT(S)
2
We can now turn to the final theorem needed to prove the approximation factor for the cycle algorithm:
Theorem 2
Given an α-approximation for the MAX-HPP, it is possible to get a (4 − 2α)
approximation for the SSP.
2
PROOF ([Mucha, 2007]) First construct the set of representatives R and then use the HPP approximation to get an SSP solution R̄ for R, achieving a total overlap of α · OV∗(R). We then have:
|R̄| = ||R|| − OV(R̄)
= ||R|| − α · OV∗(R)
= ||R|| − OV∗(R) + (1 − α) OV∗(R)
= OPT(R) + (1 − α) OV∗(R)
≤ 2 · OPT(S) + (1 − α) OV∗(R)   (from Lemma 6)
≤ 2 · OPT(S) + (1 − α) 2 · OPT(S)   (from Corollary 1)
= (4 − 2α) OPT(S).
Theorem 3
The Cycle algorithm is a factor 3 approximation algorithm.
2
PROOF Inspecting the steps of the Cycle algorithm (Section 6.1 on page 43): the algorithm breaks the cycles optimally in Step 3), continues to find a minimal non-trivial Cycle Cover CC∗(R) for R in Step 4) and Step 5), and finally returns a random merge of the set of representatives T in step 6).
According to Theorem 1 on page 38, CC∗(R) is also a maximal CC in the overlap graph for R. Translated to the overlap graph for R, what the algorithm does in steps 4) to 6) is equivalent to constructing a solution to the HPP in the overlap graph for R in the following way:
• Split each cycle in CC∗(R) by throwing away the arc with smallest weight (minimal overlap).
• Arbitrarily mend the ensuing paths together using random arcs between the end nodes.
This gives a 1/2-approximation for the HPP, as the cycles each consist of at least two arcs, and throwing away the ones with minimal weight will not lose more than half of |CC∗(R)|.
By Theorem 2 this then gives a factor 3 approximation for the Cycle algorithm.
47
6. A LGORITHMS
Remark 7
It also follows that if the Cycle algorithm in Step 3) would randomly break
the cycles and return a random concatenation of this set of representatives, no guarantee would be possible for the approximation factor, which
corresponds to α = 0 (a 0-approximation of the max HPP) giving a factor 4
approximation by Theorem 2.
2
6.1.2 Implementation
The implementation of the Cycle algorithm is basically a Python implementation of Algorithm II.1 on page 45. The Cycle algorithm depends on two subroutines: one calculating an all-pairs suffix-prefix matrix for the strings in S and one finding a minimal CC for S.
All pairs suffix-prefix calculation
The first attempted implementation of this simple problem was a Python implementation of the suffix-tree based algorithm described in [Gusfield, 1997, Section 7.10]. Unfortunately this task proved too complicated and thereby more error-prone than expected. In order to sanity check the suffix-tree algorithm, a simple naive double-loop implementation was made. The results were in a sense both disillusioning and enlightening: first, the suffix-tree implementation kept growing more and more complex while still containing bugs; second, the naive test implementation had a running-time around five times faster on the data where the algorithms produced identical results.
As a second attempt, the Knuth-Morris-Pratt (KMP) based algorithm described in [Tarhio and Ukkonen, 1988] was implemented. This algorithm consistently produced results identical to the naive test implementation, so it was probably free from bugs, but it was a factor of three slower than the naive test algorithm.
The moral of the story is that even though the theoretical running-times for the different implementations are as shown in Table 6.1, the measured running-times clearly favoured the naive implementation. Part of the explanation for this result could be that the Python method string.startswith(), used in the interior of a double for-loop, is probably implemented as a highly optimised C or assembler method.
In order to confirm this hypothesis a number of tests were run with four different implementations:
• a fully naive algorithm, which uses char-by-char comparison instead of string.startswith(),
• the KMP-based algorithm,
• the naive algorithm using string.startswith() implemented in Python, and finally
• a Java implementation of the naive algorithm using the Java method string.startswith().

Table 6.1: All-pairs suffix-prefix running-times

Algorithm          Running-Time
Suffix-tree based  O(n|t| · (1 + n/|t|)) = O(||S|| + n²)
KMP based          O(n|t| · (1 + n))     = O(||S|| + n||S||)
Naive              O(n|t| · (n|t|))      = O(||S||²)

where n is the number of strings and |t| the (identical) length of each string, i.e. n|t| = ||S||
The results of these test runs are shown in Figure 6.1.
[Figure 6.1: Running-time measures of all-pairs overlap calculations: running-time in seconds versus cumulated length of strings for the fully naive, KMP, Python naive and Java naive implementations]
In the end a Python wrapper calling the Java implementation of the naive algorithm was used by all SSP-algorithms, as it turned out to be the fastest implementation.
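A reconstruction (not the thesis code) of the naive all-pairs computation, whose inner test is the C-implemented str.startswith, could look as follows:

def all_pairs_overlap(strings):
    """M[i][j] = |ov(s_i, s_j)|, computed naively in O(||S||^2)."""
    n = len(strings)
    M = [[0] * n for _ in range(n)]
    for i, s in enumerate(strings):
        for j, t in enumerate(strings):
            if i == j:
                continue
            for k in range(min(len(s), len(t)), 0, -1):   # longest overlap first
                if t.startswith(s[len(s) - k:]):
                    M[i][j] = k
                    break
    return M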
Finding minimal CCs
The “traditional” way of finding a minimal CC is by reducing the problem to an instance of the Assignment Problem (AP) or Perfect Matching Problem. The reduction consists of solving the AP in a complete bi-partite graph, whose node-set is N = {S, S} and whose arc weights equal the weights of either the prefix or the overlap graph, depending on whether a minimal or maximal assignment is wanted.
In the general case the AP can be solved by the Hungarian Method¹, which has a running-time of O(n³), where n is the number of strings. It turns out that, due to the weights being overlap values between the strings, a maximal assignment can be found by a greedy algorithm with a running-time of O(log(n) · n²).
¹ http://en.wikipedia.org/wiki/Hungarian_algorithm
Fast maximal assignment algorithm
The fact that a greedy approach works depends on the following property:
Definition 10 (Inverse Monge Condition)
Let B = (N1, N2, A, W) be a complete, weighted bipartite graph. Let u, u′ ∈ N1 and v, v′ ∈ N2. Assume WLOG
w(auv) ≥ max (w(auv′), w(au′v), w(au′v′))
If the inequality
w(auv) + w(au′v′) ≥ w(auv′) + w(au′v)   (6.1)
holds for any such pairs u, u′ and v, v′ in B, then B is said to satisfy the Inverse Monge Condition.
2
Given a collection of strings S = (s1, s2, . . . , sn), let BS be the complete, weighted bipartite graph given by BS = (S, S, A, W) and w(aij) = |ov(si, sj)|, ∀ i, j ∈ 1 . . . n. We then have
Lemma 9
BS satisfies the Inverse Monge Condition.
2
PROOF ([Blum et al., 1994]) Let su, su′, sv, sv′ ∈ S. Denote ov(si, sj) and |ov(si, sj)| by ovij and wij respectively. Assume wuv ≥ max (wuv′, wu′v, wu′v′). If wuv ≥ wuv′ + wu′v, then Expression 6.1 is obviously fulfilled. So assume wuv < wuv′ + wu′v, i.e. wu′v′ > 0, and let ovuv = a1 a2 . . . a(wuv). Then ovuv′ = a(wuv−wuv′+1) . . . a(wuv) and ovu′v = a1 . . . a(wu′v). However,
ovu′v′ = a(wuv−wuv′+1) . . . a(wu′v)
⇒ wu′v′ = wu′v − (wuv − wuv′ + 1) + 1 = wu′v + wuv′ − wuv,
so wuv + wu′v′ = wuv′ + wu′v and Expression 6.1 is fulfilled with equality.
For a graphical interpretation see Figure 6.2
[Figure 6.2: Illustration of Lemma 9: wuv + wu′v′ ≥ wu′v + wuv′]
Now consider the greedy algorithm MAKEMAXIMALASSIGNMENT for finding a maximum weight assignment (Algorithm II.2 on the next page). Assuming the marking of rows and columns and the checking for availability can be done in constant time, the running-time for Algorithm II.2 will be O(n² + log(n²) · n² + n) = O(log(n) · n²).
The returned maximum assignment can easily be translated to a maximum CC for S in the overlap graph by following arcs “from left to right” in the assignment. According to Theorem 1 on page 38 this is also a minimal CC in the prefix graph for S.
The SSP-algorithms needing a minimal CC all use a Python implementation of Algorithm II.2. The Cycle algorithm is dominated by this procedure and thus has a running-time of O(log(n) · n²).
Algorithm II.2 MakeMaximalAssignment
Require: An n × n overlap matrix M for string collection S
Ensure: A maximum weight assignment for the bi-partite graph BS
    procedure MAKEMAXIMALASSIGNMENT(M)
        overlaps ← empty list of size n
3:      for row = 1 to n do
            for col = 1 to n do
                overlaps ← overlaps ∪ (M[row][col], (row, col))
6:          end for
        end for
        SORT(overlaps)                     ▷ descending by overlap value
9:      assignment ← empty set of size n
        mark all rows and columns available
        for all (w,(r,c)) ∈ overlaps do    ▷ iterating from start to end
12:         if r and c both available then
                assignment ← assignment ∪ (r,c)
                mark row r and column c unavailable
15:         end if
        end for
        return assignment
18: end procedure

Termination: The algorithm only iterates over finite non-increasing data structures.
Correctness: ([Blum et al., 1994]) Let A be the assignment returned by the algorithm and let Amax be a maximum assignment for BS with the property that it has the maximum number of arcs in common with A. Assume A ≠ Amax and let aij be the arc of maximal weight in the symmetric difference of A and Amax (with ties “sorted” the same way as in the algorithm). There are now two possibilities:
aij ∈ Amax \ A : This means that A must contain either an arc aij′ or ai′j of “larger” weight than aij, as the algorithm did not choose aij. Neither of these arcs can be in Amax as it is an assignment, leading to a contradiction of the choice of aij.
aij ∈ A \ Amax : In this case there are two arcs aij′, ai′j ∈ Amax of “smaller” weight than aij. But according to Lemma 9 we could now replace the arcs aij′ and ai′j in Amax by the arcs aij and ai′j′, getting an assignment A′max with no less accumulated weight but having more arcs in common with A than Amax. This is a contradiction of the choice of Amax.
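A Python sketch of Algorithm II.2 (my reconstruction; the restriction to non-trivial cycle covers controlled by avoidSimpleCycles is omitted for brevity):

def make_maximal_assignment(M):
    """Greedy maximum-weight assignment for the overlap matrix M."""
    n = len(M)
    overlaps = sorted(((M[r][c], r, c) for r in range(n) for c in range(n)),
                      reverse=True)                 # descending by overlap value
    assignment = []
    row_free, col_free = [True] * n, [True] * n
    for _, r, c in overlaps:
        if row_free[r] and col_free[c]:
            assignment.append((r, c))
            row_free[r] = col_free[c] = False
    return assignment

Following the chosen arcs (r, c) “from left to right” then yields the corresponding cycle cover, as described above.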
6.2 The RecurseCycle Algorithm
This is a slight variation of the Cycle algorithm. The variation consists of
repeating steps 4) and 5) of the Cycle algorithm until only one representative string remains:
1) Execute Cycle algorithm step 1) to 3)
2) Repeat Cycle algorithm step 4) and 5) until a representative set of
size one is achieved.
3) Return the single element of the final representative set.
The pseudo-code for the algorithm is shown in Algorithm II.3 on the
next page.
6.2.1 Approximation factor
Theorem 4
The RecurseCycle algorithm is a factor 3 approximation algorithm.
2
PROOF As the algorithm in effect improves on the Cycle algorithm by returning a “good” merge of representatives as opposed to a completely random one, its approximation factor cannot be worse than the factor 3 for the Cycle algorithm.
6.2.2 Implementation
The algorithm is basically a small extension of the Cycle algorithm and is likewise implemented in Python. As a practical matter the recursive implementation had to be substituted by an iterative version due to the missing tail-recursion primitive in the Python interpreter; the original recursive implementation ran into problems with the maximum stack height.² The running-time for the RecurseCycle algorithm is dominated by the MAKEMAXIMALASSIGNMENT() procedure (see Algorithm II.2) and is thus of size O(log(n) · n²).
² This problem could alternatively have been solved by either increasing the platform-dependent maximum stack height or by using stack-less Python.
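A sketch of that iterative substitution (with MAKEREPRESENTATIVES passed in as a function, since it is defined in Algorithm II.1; the recursive pseudo-code follows in Algorithm II.3 below):

def recurse_cycle(sl, make_representatives):
    """Iterative version of Algorithm II.3: a while loop replaces the tail call."""
    avoid_simple_cycles = False
    while len(sl) > 1:
        sl = make_representatives(sl, avoid_simple_cycles)
        avoid_simple_cycles = True
    return sl[0]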
Algorithm II.3 RecurseCycle
Require: Nonempty substring free list of strings, sl
    procedure RECURSECYCLE(sl, avoidSimpleCycles)
        if size(sl) = 1 then
3:          return first(sl)
        end if
        reps ← MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
6:      return RECURSECYCLE(reps, true)
    end procedure

9:  begin main
        return RECURSECYCLE(sl, false)
    end main

Termination: In the base case (line 3) the algorithm returns when the string list has size 1. In the inductive case (lines 5 to 6) the restriction to non-trivial cycle covers in MAKEREPRESENTATIVES(sl, avoidSimpleCycles) (line 5) ensures that size(reps) ≤ size(sl) − 1. As the size of the original string list is finite and nonzero, this guarantees termination.

Correctness: For each invocation of RECURSECYCLE(sl, avoidSimpleCycles) the following invariant holds: a (random) concatenation of the strings in the string list is a superstring for the strings in the list. Before the first invocation the invariant holds trivially. If RECURSECYCLE(sl, avoidSimpleCycles) returns from the base case, the string list contains a single string, which is a superstring for itself. If it returns from the inductive case, at least two strings have been replaced by their merge. As a merge is a superstring for the strings involved, the invariant will hold.

6.3 The Overlap Rotation Lemma

We will now briefly divert from the algorithm descriptions in order to introduce a couple of very useful results proved by Breslauer et al. These results are at the basis of the proofs of the best known approximation factors for both the Greedy algorithm (Section 6.4 on page 57) and the Best Factor algorithm (Section 6.6 on page 64).
Lemma 10 (Overlap Rotation Lemma)
Let α be a periodic semi-infinite string. There exists an integer k, such that for any (finite) string s that is inequivalent to α,
ov(s, α[k, ∞]) < period(s) + 1/2 · period(α)
Furthermore such a k can be found in time O(period(α))
2
For the somewhat “technical” proof see [Breslauer et al., 1996, Section 3].
Remark 8
The Overlap Rotation Lemma shows that given any cyclic string φ(t) and any string s inequivalent to t, it is possible to restrict the overlap between s and t by rotating t sufficiently.
The Overlap Rotation Lemma actually consists of two results, but we only need the first of these. As will be shown later (Lemma 12), the implemented algorithms will not actually have to determine the integer k. 2
The following is a rephrasing of another result in [Breslauer et al.,
1996, Section 5, Lemma 5.1] due to Mucha.
Lemma 11
Let S = (s1, s2, . . . , sn) be a collection of strings with minimal Cycle Cover CC∗(S). It is possible to choose representatives R̄ for the cycles in CC∗(S) such that
OPT(R̄) ≤ 2 · OPT(S)
and for any two distinct cycles Ci, Cj ∈ CC∗(S)
ov(r̄(Ci), r̄(Cj)) < |Ci| + 1/2 · |Cj|.   (6.2)
2
PROOF ([Mucha, 2007]) Let Ci = si1, si2, . . . , sik ∈ CC∗(S) and let
α = pr(si1, si2) pr(si2, si3) · · · pr(sik−1, sik) pr(sik, si1).
Consider the semi-infinite string α∞ and let k be the integer given by the Overlap Rotation Lemma 10. Assume WLOG that k is in the pr(si1, si2) part of α. (If not we can renumber the indices and split α accordingly.) In the approach until now we could choose the representative
r(Ci) = ⟨si2, si3, . . . , sik, si1⟩.
As shown in the proof of Lemma 6 on page 44 we could actually choose the longer representative
r̂(Ci) = ⟨si1, si2, si3, . . . , sik, si1⟩,
and still have OPT(R̂) ≤ 2 · OPT(S). Now it follows that if we instead choose the representative
r̄(Ci) = (pr(si1, si2)[k : ]) ⟨si2, si3, . . . , sik, si1⟩   (6.3)
we have that r(Ci) is a suffix of r̄(Ci), which itself is a suffix of r̂(Ci). This implies that OPT(R̄) ≤ 2 · OPT(S).
Equation 6.2 on the previous page follows directly from Lemma 5 on page 38 and the Overlap Rotation Lemma on the previous page.
These results immediately lead to Corollary 2 and an improvement of Theorem 2 on page 47.
Corollary 2
An optimal SSP-solution for the set R̄ will have
OV(R̄) ≤ 3/2 · OPT(S)
2
Theorem 5
Given an α-approximation for the MAX-HPP problem, it is possible to get a (3.5 − 1.5α) approximation for the SSP.
2
PROOF Analogous to the proof of Theorem 2 on page 47.
Corollary 3
The Cycle and RecurseCycle algorithms are factor (3.5 − 1.5 · 0.5) = 2.75 approximation algorithms.
2
As noted in [Mucha, 2007] there seems to be something non-intuitive
or even contradictory about the approach used above. A better approximation for the total overlap of representatives was the road to the improvement in Theorem 5 and therefore it seemed sensible to minimise
the overlaps of representatives; but actually our overall objective is to
find a shortest superstring and therefore it surely is preferable to make
representatives overlap as much as possible. Fortunately Mucha provides
the following lemma.
Lemma 12
Using any set of representatives R with ||R|| ≤ ||R̄|| yields a (3.5 − 1.5α) approximation for the SSP, where α is the approximation factor for the directed MAX-HPP algorithm used.
2
PROOF ([Mucha, 2007]) Let OV(R) and OV(R̄) be the total overlaps for the optimal solutions OPT(R) and OPT(R̄) respectively, and let |SSPR| be the length of the approximated solution for R. We then have
|SSPR| ≤ ||R|| − αOV(R)
= ||R|| − α(||R|| − OPT(R))   (Expression 5.6)
= (1 − α) · ||R|| + αOPT(R)
≤ (1 − α) · ||R|| + 2αOPT(S)   (Lemma 11)
≤ (1 − α) · 3.5 OPT(S) + 2αOPT(S)   (Corollary 2 and Lemma 11)
= (3.5 − 1.5α) · OPT(S)
Remark 9
Lemma 12 ensures that the original representatives R defined in step 3) on page 44 can be used to obtain the approximation factor in Theorem 5, if the cycles are broken at the string yielding the shortest |r(C)| (to ensure ||R|| ≤ ||R̄||). The rotations used by Breslauer et al. are necessary to bound the approximation factor but not in the algorithm itself.
2
6.4 The Greedy Algorithm
The greedy algorithm is probably the first idea that comes to mind for solving the SSP. If S = (s1, s2, . . . , sn) is a collection of strings, keep on merging the two strings that have the largest overlap and put the resulting merge back into the collection, until only one string is left. The pseudo-code for this simple and in practice surprisingly good greedy algorithm is shown in Algorithm II.4 on the following page.
6.4.1 Approximation factor
In [Tarhio and Ukkonen, 1988] it is shown that this algorithm achieves at least a factor 1/2 of the maximal total overlap possible. Furthermore, it is conjectured that the algorithm is a 2-approximation algorithm. The following tight example from [Blum et al., 1994] shows that the approximation factor cannot be better than 2.
Example 11
Let S = {c(ab)^k, (ba)^k, (ab)^k c},
then |Greedy(S)| = |c(ab)^k c(ba)^k| = 4k + 2
and |OPT(S)| = |c(ab)^{k+1} c| = 2k + 4
2
57
6. A LGORITHMS
Algorithm II.4 Greedy
Require: Nonempty substring free list of strings, sl
    procedure GREEDY(sl)
        if size(sl) = 1 then
3:          return first(sl)
        end if
        Find si, sj ∈ sl : i ≠ j and ov(si, sj) is maximal.
6:      sl ← sl \ {si, sj}
        sl ← sl ∪ ⟨si, sj⟩
        return GREEDY(sl)
9:  end procedure

Termination: In the base case (line 3) the algorithm returns when the string list has size 1. In the inductive case (lines 6 to 8) two strings are removed and one merged string is added to the string list, reducing its size by one. As the size of the original string list is finite and nonzero, this guarantees termination.

Correctness: For each invocation of GREEDY(sl) the following invariant holds: a (random) concatenation of the strings in the string list is a superstring for the strings in the list. Before the first invocation the invariant holds trivially. If GREEDY(sl) returns from the base case, the string list contains a single string, which is a superstring for itself. If GREEDY(sl) returns from the inductive case, two strings have been replaced by their merge. As a merge is a superstring for the strings involved, the invariant will hold.
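A runnable sketch of Algorithm II.4 (assuming the ov and pr helpers from the sketch in Section 5.1; not the thesis implementation):

def greedy_ssp(sl):
    sl = list(sl)
    while len(sl) > 1:
        # line 5: find the pair (i, j), i != j, with maximal |ov(si, sj)|
        i, j = max(((a, b) for a in range(len(sl)) for b in range(len(sl)) if a != b),
                   key=lambda p: len(ov(sl[p[0]], sl[p[1]])))
        merged = pr(sl[i], sl[j]) + sl[j]                  # the merge <si, sj>
        sl = [s for idx, s in enumerate(sl) if idx not in (i, j)] + [merged]
    return sl[0]

# The tight instance of Example 11 with k = 3:
S = ["c" + "ab" * 3, "ba" * 3, "ab" * 3 + "c"]
assert len(greedy_ssp(S)) == 4 * 3 + 2                     # |OPT(S)| = 2*3 + 4 = 10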
Even though this conjecture has been subject to a lot of study, it has
not (yet) been possible to prove a better factor than 3.5. This result due to
Kaplan and Shafrir is in essence an application of Corollary 2 on page 56
instead of Corollary 1 on page 46 in the original factor 4 proof by Blum
et al. Before we can show this, we will need quite some terminology and
results from [Blum et al., 1994].
Consider an implementation of the Greedy algorithm that satisfies the condition in line 5 of Algorithm II.4 as follows: it runs down a collection of the arcs from the overlap graph sorted by decreasing weight (analogous to Algorithm II.2), and for each arc aij it decides whether or not to merge its node strings si, sj.
Let the string returned by the Greedy algorithm be denoted tgreedy and assume a renumbering of the strings in S so that
tgreedy = ⟨s1, s2, . . . , sn⟩.
Let e, f be arcs in the overlap graph for S. We say that e dominates f if it comes before f in the sorted collection of the arcs. This means that w(e) ≥ w(f) and, if e = aij, then either f = aij′ or f = ai′j. Greedy will reject merging the end strings of an arc f if either
1) f is dominated by an already chosen arc e, or
2) f is not dominated but the merge of its end strings would form a cycle.
If the merge is rejected because of 2) then f is called a bad back arc.³ Let f = aji be a bad back arc; then f corresponds to the merge ⟨si, si+1, . . . , sj⟩, which also corresponds to a path in the overlap graph. As this merge was already formed when Greedy considered f, it follows that
ov(sk, sk+1) = w(ak,k+1) ≥ w(f) = w(aj,i) = ov(sj, si), ∀ k ∈ [i, j − 1].
Additionally, when f is considered, the arcs ai−1,i and aj,j+1 cannot have been considered yet. The arc f is said to span the closed interval If = [i, j].
In [Blum et al., 1994, Section 5] the lemma below is proved
Lemma 13
Let e and f be two bad back arcs. Then the closed intervals Ie , If spanned
by the arcs are either disjoint or one contains the other.
2
Minimal (with respect to containment) intervals of bad back arcs are called culprits. It follows from the definition that culprit intervals must be disjoint. Each culprit [i, j], like its bad back arc, corresponds to the merge ⟨si, si+1, . . . , sj⟩. Let
Sm = {sk ∈ S | k ∈ [i, j], where [i, j] is a culprit} ⊆ S.
Each culprit [i, j] also defines a cycle Cij = si, si+1, . . . , sj in the overlap graph for S. Let CCc(Sm) be the CC over Sm defined by the culprits. Now consider the result of applying Algorithm II.2 on page 52 on the overlap graph induced by Sm. It turns out that this will exactly produce the Cycle Cover CCc(Sm), which implies that CCc(Sm) is a minimal CC and we get
|CCc(Sm)| = |CC∗(Sm)| ≤ |OPT(Sm)| ≤ |OPT(S)|.   (6.4)
³ in [Blum et al., 1994] these are called bad back edges
Finally, denote by Oc the sum of the weights of all culprit bad back arcs in the overlap graph:
Oc = ∑_{[i,j] = culprit} |ov(sj, si)|.
Blum et al. utilise the culprits to partition the arcs of the path corresponding to tgreedy into four sets. Through clever analysis they end up showing the following result [Blum et al., 1994, Section 5].
Lemma 14
|tgreedy | ≤ 2 · OP T (S) + Oc − |CCc (Sm )|.
2
Having by now put quite some cannons into position, we can start aiming at the final proof. The main hurdle is to derive a useful approximation for Oc. Let A = {r(Cir) | [i, r] = culprit} be the representatives for Sm guaranteed to exist by Lemma 11 on page 55. We then have
Lemma 15
For each culprit [i, r], |ov(sr , si )| ≤ |r(Cir )| − |Cir |.
2
PROOF ([Blum et al., 1994]) Let j ∈ [i, r] be the index such that the merge ⟨sj+1, . . . , sr, si, . . . , sj⟩ is a suffix of r(Cir) (e.g. j = si2 in Equation 6.3 on page 56). We now get
|r(Cir)| ≥ |⟨sj+1, . . . , sr, si, . . . , sj⟩|   (the merge is a suffix)
= |pr(sj+1, sj+2) · · · pr(sj−1, sj) sj|
= |pr(sj+1, sj+2) · · · pr(sj−1, sj) pr(sj, sj+1)| + |ov(sj, sj+1)|
≥ |Cir| + |ov(sr, si)|
Corollary 4
Oc ≤ ||A|| − |CCc(Sm)|
2
PROOF ([Blum et al., 1994])
Oc = ∑_{[i,j] = culprit} |ov(sj, si)| ≤ ∑_{[i,j] = culprit} (|r(Cij)| − |Cij|) ≤ ||A|| − |CCc(Sm)|.
We are now able to approximate Oc:
Lemma 16
Oc ≤ OPT(S) + 3/2 · |CCc(Sm)|.
2
PROOF ([Kaplan and Shafrir, 2005])
||A|| = OPT(A) + OV(A)
≤ OPT(A) + 3/2 · |CCc(Sm)|   (Corollary 2)
≤ OPT(Sm) + |CCc(Sm)| + 3/2 · |CCc(Sm)|   (Lemma 6)
= OPT(Sm) + 5/2 · |CCc(Sm)|.
Now using Corollary 4 on the preceding page we have
Oc ≤ ||A|| − |CCc(Sm)|
≤ OPT(Sm) + 3/2 · |CCc(Sm)|
≤ OPT(S) + 3/2 · |CCc(Sm)|.
So putting it all together we get:
Theorem 6
Greedy is a factor 3.5 approximation algorithm.
2
PROOF ([Kaplan and Shafrir, 2005]) Using Lemma 16 on the facing page and Lemma 14 on the preceding page we get
|tgreedy| ≤ 2 · OPT(S) + OPT(S) + 3/2 · |CCc(Sm)| − |CCc(Sm)|
= 3 · OPT(S) + 1/2 · |CCc(Sm)|
≤ 3.5 · OPT(S),
where the last step uses |CCc(Sm)| ≤ |OPT(S)| from Equation 6.4.
6.4.2 Implementation
The actual implementation of the Greedy algorithm is a Python implementation of the pseudo-code in Algorithm II.4 on page 58, with the checks in line 5 implemented as a generator method. Since this method has to sort the n² pairwise overlap values, the running-time of the Greedy algorithm is dominated by this sorting, which in this implementation⁴ is of size O(log(n²) · n²) = O(log(n) · n²).
6.5 Reductions of HPP Instances
Since the last two algorithms to be presented in this chapter are both based on the overlap formulation (Expression 5.6 on page 40) for the optimal solution of the SSP, we will now treat the necessary reductions from the original HPP to equivalent instances of ATSP and TSP respectively.
⁴ a more efficient implementation could use Radix Sort, since the size of the overlap values is restricted
6.5.1 Reduction from HPP to ATSP
Given a collection of strings S = (s1, s2, . . . , sn), let GS(N, A, W) be the overlap graph for S and let H(GS) be the instance of the maximum HPP in the overlap graph for S. Define an instance of the maximum ATSP, A(G′S), by
G′S = G(N′, A′, W′) with
N′ = N ∪ {n0},  A′ = A ∪ {ai0, a0i} for i in 1 . . . n,
W′ : A′ → ℕ with w′(aij) = w(aij) for i, j ≠ 0 and w′(aij) = 0 otherwise.
Lemma 17
The reduction from HPP to ATSP is correct.
2
PROOF This graph transformation can clearly be done in polynomial time. If POPT = (nP1, nP2, . . . , nPn) is the Hamiltonian path corresponding to the optimal solution for H(GS), then the TSP-tour TP = (n0, nP1, nP2, . . . , nPn) will be an optimal solution for A(G′S) with the same length as POPT. Conversely, if TOPT = (n0, nT1, nT2, . . . , nTn) is the TSP-tour for the optimal solution for A(G′S), then the path PT = (nT1, nT2, . . . , nTn) will be an optimal solution for H(GS), because removing the visit of node n0 does not reduce the length of the solution.
Both cases can be proved using that the existence of a better TSP tour (Hamiltonian path) leads to a contradiction of the assumed optimality of POPT and TOPT respectively.
Remark 10
Adding the extra node n0 is necessary in order to make the reduction work. Intuitively it could seem as if an optimal solution to the HPP would be equivalent to determining an optimal solution for the TSP and then removing the heaviest edge. The graph depicted in Figure 6.3 on the facing page illustrates that this intuition is quite wrong. An optimal solution to the TSP (depicted with solid lines) has length 1002. Removing the heaviest edge from this tour results in a solution of the HPP with length 502. However, the optimal HPP solution has length 3. 2
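On the overlap-matrix level the reduction is tiny; the following is a minimal sketch mirroring the definition above (my own helper, not the thesis code):

def hpp_to_atsp(W):
    """Append the extra node n0: all arcs to and from n0 get weight 0."""
    n = len(W)
    return [row + [0] for row in W] + [[0] * (n + 1)]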
6.5.2 Reduction from ATSP to TSP
Let A(Gatsp ) be an instance of the maximum ATSP in a graph Gatsp =
(Natsp , Aatsp , Watsp ). Transform Gatsp into an undirected graph Gtsp =
(Vtsp , Etsp , Wtsp ) by the following steps:
[Figure 6.3: Shortest Hamiltonian path is not necessarily included in the optimal TSP tour]
1) Split each node n ∈ Natsp into a pair of nodes nin, nout, where all incoming arcs for n become incoming to nin and all outgoing arcs for n become outgoing from nout. Let Vtsp = Natspin ∪ Natspout.
2) In the graph resulting from 1), add an opposite arc with the same weight for every arc, making the resulting graph symmetric. Convert all corresponding arc pairs into an edge of the same weight, thereby converting the graph into an undirected graph. Define wtsp(eij) = watsp(aij) if aij ∈ Aatsp and wtsp(eij) = watsp(aji) if aji ∈ Aatsp.
3) Define a value K ≫ max(Watsp), where max(Watsp) is the maximum weight in Gatsp.
4) For all pairs vi, vj ∈ Vtsp such that vi = nkin ∈ Natspin and vj = nkout ∈ Natspout, add an edge eij between the pair with wtsp(eij) = K.
5) Make the graph complete by adding the missing edges eij with wtsp(eij) = −K.
6) Multiply all edge weights by −1.
A graphical illustration of the transformation is shown in Figure 6.4 on the next page.
Steps 1) and 2) ensure that the original arcs with their original weights are present in the undirected graph. Steps 3) to 5) guarantee that any maximal TSP tour in the “node splitted” graph will always travel directly between any corresponding in and out vertex-pair. Finally, step 6) ensures that a maximal TSP tour is turned into an equivalent minimal TSP tour.
Lemma 18
The reduction from ATSP to TSP is correct.
2
[Figure 6.4: Transforming an ATSP instance to a TSP instance, panels (a)-(f)]
PROOF The whole graph transformation can clearly be done in polynomial time, i.e. an instance A(Gatsp) in the graph Gatsp can be turned into an instance T(Gtsp) in the graph Gtsp in polynomial time.
If TAOP T = (n1 , n2 , . . . , nn ) is an optimal maximal ATSP tour in Gatsp
with length Latsp , then the TSP-tour
Ttsp = (vn1in , vn1out , vn2in , vn2out , . . . , vnnin , vnnout )
will be an optimal solution for the TSP in Gtsp with length − (Latsp + n · K).
Conversely if
TTOP T = (vn1in , vn1out , vn2in , vn2out , . . . , vnnin , vnnout )
is an optimal TSP tour in Gtsp with length Ltsp then Tatsp = (n1 , n2 , . . . , nn )
is an optimal maximal ATSP tour in Gatsp with length − (Ltsp + n · K).
As for the first reduction, both cases can be proved using that the existence of a better TSP tour (ATSP tour) leads to a contradiction of the assumed optimality of TAOPT and TTOPT respectively.
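The following sketch (my reconstruction, not the thesis code) performs steps 1) to 6) on a weight matrix, with K chosen as one more than the maximum weight:

def atsp_to_tsp(W):
    """Vertices 0..n-1 are the 'in' copies, n..2n-1 the 'out' copies."""
    n = len(W)
    K = 1 + max(max(row) for row in W)            # step 3): K > max(W_atsp)
    T = [[K] * (2 * n) for _ in range(2 * n)]     # steps 5) + 6): -K becomes +K
    for i in range(n):
        T[i][n + i] = T[n + i][i] = -K            # steps 4) + 6): K becomes -K
        for j in range(n):
            if i != j:                            # steps 1), 2) + 6): w becomes -w
                T[n + i][j] = T[j][n + i] = -W[i][j]
    return T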
6.6 The Best Factor Algorithm
The basis for this algorithm is the overlap formulation (Expression 5.6 on
page 40) for the optimal solution of the SSP. After making the reductions
described in Section 6.5.1 on page 62 the algorithm finds an approximation for the resulting ATSP instance. To this end the techniques described
in [Kaplan et al., 2005] are used.
6.6.1 Overview
The algorithm starts off by finding a solution to the following relaxation
of an LP-problem:
Definition 11
Let S = (s1, s2, . . . , sn) be a string collection and let G(N, A, W) be its overlap graph. The following is a relaxation of the LP-formulation for Maximal CC with Two-Cycle Constraint:

Maximise ∑_{aij ∈ A} w(aij) · xij subject to
∑_i xij = 1, ∀ j   (in-degree constraints)
∑_j xij = 1, ∀ i   (out-degree constraints)
xij + xji ≤ 1, ∀ i ≠ j   (2-cycle constraints)
xij ≥ 0, ∀ i ≠ j   (non-negativity constraints)
2
The above LP-formulation can be seen as a variation of the LP-formulation
of the TSP (see Definition 2 on page 13) with the subtour elimination constraints being modified to only consider possible subsets of size 2.
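As a concrete illustration, the LP of Definition 11 can be written down with, e.g., the PuLP modelling library (an assumed choice for illustration; the thesis only says that a mixed-integer-capable LP solver was used):

import pulp

def two_cycle_lp(W):
    """Relaxed maximal-CC LP with 2-cycle constraints; W is the overlap matrix."""
    n = len(W)
    prob = pulp.LpProblem("MaxCC_TwoCycle", pulp.LpMaximize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", lowBound=0)      # non-negativity
         for i in range(n) for j in range(n) if i != j}
    prob += pulp.lpSum(W[i][j] * x[i, j] for (i, j) in x)       # objective
    for j in range(n):                                          # in-degree
        prob += pulp.lpSum(x[i, j] for i in range(n) if i != j) == 1
    for i in range(n):                                          # out-degree
        prob += pulp.lpSum(x[i, j] for j in range(n) if j != i) == 1
    for i in range(n):                                          # 2-cycle
        for j in range(i + 1, n):
            prob += x[i, j] + x[j, i] <= 1
    prob.solve()
    return {k: v.value() for k, v in x.items()}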
Using the found solution to the LP formulation in Definition 11, the algorithm first “integrates” the possibly non-integral solution. This is done by scaling up (multiplying) with the least common denominator D of all variables. This integral solution defines a multigraph having a total weight of at least D times the value of the original solution. Furthermore, the multigraph does not contain more than D/2 copies of any 2-cycle. By using a representation of this multigraph as a pair of CCs and by careful rounding and recursing, Kaplan et al. end up with two CCs CC1, CC2 with the property
||CC1|| + ||CC2|| ≥ 2 · OPTatsp.   (6.5)
This pair of CCs is then further decomposed by a 3-coloring into three collections of paths. It now follows that the heaviest collection will have weight at least 2/3 · OPTatsp, and the returned approximation solution is achieved by randomly patching this path collection into a tour.
Remark 11
As mentioned in [Mucha, 2007], the result achieved by Kaplan et al. is quite impressive. The theoretical bound for an integral solution to the LP seems to be 2/3 (see Section 6.6.7 on page 72), and Kaplan et al. manage to achieve this bound using some nice techniques. For this reason the main part of the algorithm will be covered in detail in the following.
2
6.6.2 Preliminaries and notation
The algorithm consists of the following steps
1) Transform the HPP into an instance of the ATSP and construct the
overlap matrix for this instance.
2) Find a solution to the corresponding LP in Definition 11.
3) Scale up the possibly non-integral solution by multiplying with the
least common denominator D of all variables.
4) Recursively decompose the resulting multigraph until a pair of CCs
fulfilling Equation 6.5 is achieved.
5) Decompose this CC pair into three path collections.
6) Patch the heaviest path collection into a tour.
Step 1) follows the description in Section 6.5 on page 61 and Step 2)
is done by use of an LP solver program capable of solving mixed integer
linear programs.
Steps 3) and 4) will be the topic of the rest of this section. A
detailed description of Step 5) is given in [Kaplan et al., 2005, Section 5].
Finally Step 6) is done by arbitrarily merging the disjoint paths together
until a tour is constructed.
Definitions and notation
Let G(N, A, W) be the overlap graph from the LP. Let {x∗ij} for i, j ∈ 1 . . . |N| and i ≠ j be an optimal (fractional) solution from Step 2), with OPTf the maximal value found for the objective function of the LP, and denote the optimal value for the corresponding ATSP by OPTatsp. Let D be the least common denominator from Step 3). Define k · (i, j) to be the multiset containing k copies of the arc (i, j) and define the weighted multigraph D · G = (N, Ā, W) where Ā = {(Dx∗ij) · (i, j) | (i, j) ∈ A} and all copies of the same arc have identical weight. Let Wmax denote the maximum weight of an arc in G. Denote the number of copies (the multiplicity) of an arc a by mG(a), and denote a 2-cycle between the nodes i, j by Cij.
Remark 12
The multiplier D might be exponential in the size of N, as it can be the product of |A| = (n − 1)² factors. However, the algorithm avoids using more than polynomial space, since the number of bits necessary to represent D is polynomial. Furthermore, multigraphs are represented implicitly, by maintaining the multiplicities of all arcs, and never explicitly.
2
Definition 12 (Half-Bound d-regular Multigraph)
A (directed) multigraph G(N, A) is said to be d-regular if the in- and out-degrees of all nodes equal d. A d-regular multigraph is said to be half-bound if it contains at most d/2 copies of any 2-cycle.
2
6.6.3 Finding the lowest common denominator
Determining the multiplier D in Step 3) is done using a Greatest Common
Divisor (GCD) implementation as a subroutine. The pseudo-code for the
algorithm is shown in Algorithm II.5.
Algorithm II.5 leastCommonDenominator
Require: Nonempty list of fractional numbers, fl = [p1/q1, p2/q2, . . . , p|fl|/q|fl|] with GCD(pi, qi) = 1 and pi ≤ qi ∀ i ∈ 1 . . . |fl|
Ensure: The smallest number lcd ∈ ℕ : lcd · p/q ∈ ℕ ∀ p/q ∈ fl.
    procedure LEASTCOMMONDENOMINATOR(fl)
3:      lcd ← 1
        for all p/q in fl do
            lcd ← lcd · q / GCD(lcd, q)
6:      end for
        return lcd
    end procedure

Termination: The algorithm only iterates over finite non-increasing data structures.
Correctness: Using induction over the size of fl, we get
Base Case: If |fl| = 1 we have lcd1 = q1 / GCD(lcd0, q1) = q1 / 1 = q1, where lcd0 = 1 follows from line 3.
Induction Case: Assume lcdi−1 is correct and |fl| = i. We have
lcdi = lcdi−1 · qi / GCD(lcdi−1, qi).
No matter whether qi divides lcdi−1 or not, it follows from the definition of GCD that qi / GCD(lcdi−1, qi) will be the smallest possible number to multiply with lcdi−1 such that qi divides the product.
67
6. A LGORITHMS
The running-time for LEASTCOMMONDENOMINATOR() depends on the running-time of the used sub-procedure GCD().⁵ Using an implementation of Euclid's Algorithm we have rtGCD = O(log(max(lcd, q))) for the running-time of GCD() (see [Cormen, Leiserson, Rivest, and Stein, 2001, chapter 31.2]) and for LEASTCOMMONDENOMINATOR(): rtLCD = O(|fl| · rtGCD).
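In Python the algorithm is a direct loop over math.gcd (a sketch; the thesis implementation is not shown):

from fractions import Fraction
from math import gcd

def least_common_denominator(fl):
    """Smallest lcd with lcd * f integral for every fraction f in fl."""
    lcd = 1
    for f in fl:                                  # Fraction keeps GCD(p, q) = 1
        lcd = lcd * f.denominator // gcd(lcd, f.denominator)
    return lcd

assert least_common_denominator([Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)]) == 12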
6.6.4 Finding Cycle Covers when D is a power of two
To determine the CCs in Step 4) on page 66 we follow the presentation in
[Kaplan et al., 2005, Section 3] and first show the procedure for the case
where the multiplier D is a power of two.
Let G0 = D · G, where G = G(N, A, W) is the graph from the LP in Step 2). G0 is then D-regular, half-bound and has W(G0) ≥ D · OPTatsp. Define a D-regular bi-partite undirected multigraph B(V, V′, E) as follows: for each node ni ∈ N define two vertices vi ∈ V, v′i ∈ V′, and for each arc aij = (ni, nj) ∈ A define the edge e = (vi, v′j) ∈ E.
Next divide B into two D/2-regular bi-partite multigraphs B1, B2 as follows: for each edge e ∈ B with mB(e) ≥ 2 add ⌊mB(e)/2⌋ copies of e to both B1 and B2 and remove 2 · ⌊mB(e)/2⌋ copies of e from B. Distribute the “left-over” edges among B1 and B2 as follows: for each connected component in the remaining subgraph of B determine an Eulerian Cycle,⁶ and alternately assign the edges in these cycles to B1 and B2.
Now, let G′ and G″ be the sub-graphs of G0 corresponding to B1 and B2 respectively. Note that both G′ and G″ are D/2-regular and half-bound. The latter is given since G0 was half-bound and D-regular, meaning that for every 2-cycle Cij in G0 at least one of the arcs (ni, nj) and (nj, ni) has multiplicity ≤ D/2. By letting G1 be the heaviest graph of G′ and G″, it is achieved that W(G1) ≥ D/2 · OPTatsp, because W(G′) + W(G″) = W(G0) ≥ D · OPTatsp.
Applying the above procedure log(D) − 1 times, the obtained graph Glog(D)−1 will be a 2-regular half-bound multigraph having W(Glog(D)−1) ≥ 2 · OPTatsp. From this graph it is fairly simple⁷ to obtain a pair of CCs satisfying Equation 6.5 on page 65.
The running-time for the procedure will be O(log(D) · n²), following from log(D) iterations each costing O(n²).
⁵ actually it is also dependent on a procedure being able to extract the denominator of a fractional number from its decimal representation. However, the dominant procedure will be the GCD.
⁶ a cycle containing all edges in a graph
⁷ e.g. by using arc coloring in a bi-partite graph
6.6.5 Finding Cycle Covers when D is not a power of two
When D is not a power of two, the challenge is to use a rounding procedure on the solution of the LP in order to derive a 2^y-regular multigraph. When this is achieved, the procedure in Section 6.6.4 can be applied.
For reasons that will become apparent later, assume |N| ≥ 5. Determine y ∈ ℕ such that 2^{y−1} < 12 |N|² Wmax ≤ 2^y and define D = 2^y − 2|N|. For each value x∗ij let xij denote the largest integer multiple of 1/D less than x∗ij. Define the multigraph D · G = (N, Ā, W) where Ā = {(D xij) · (i, j) | (i, j) ∈ A}. Since D xij ∈ ℕ this graph is well-defined.
Lemma 19
For all nodes n in D · G the following inequalities hold:
D − (|N| − 1) ≤ d_n^in, d_n^out ≤ D.
2
PROOF ([Kaplan et al., 2005]) For node nj it follows from the rounding of the x∗ij values that xij ≥ x∗ij − 1/D, and from the in-degree constraints in Definition 11 on page 65 we get that ∑_i x∗ij = 1. From the definition of D · G it then follows that
d_nj^in = D ∑_i xij   (6.6)
≥ D ∑_i (x∗ij − 1/D)
≥ D (1 − (|N| − 1)/D)
= D − (|N| − 1)
Using xij ≤ x∗ij, the upper bound for the in-degree follows from Equation (6.6). The bounds on the out-degree can be proved analogously.
In the following we will say that a multigraph Ĝ has the sum-of-arcs property if
mĜ(aij) + mĜ(aji) ≤ D, ∀ ni, nj, i ≠ j, i, j ∈ 1 . . . |N|.
From the 2-cycle constraints it follows that D · G has the sum-of-arcs property, that is, it has at most D/2 copies of any 2-cycle Cij.
Next the multigraph D · G has to be completed into a 2^y-regular graph by the addition of arcs. Thereby it has to be ensured that the resulting graph will be half-bound. The process of arc-adding is divided into an arc-addition stage and an arc-augmenting stage, and ensures the half-bound property by satisfying the condition that mG(aij) + mG(aji) ≤ D for every pair of nodes ni, nj.
69
6. A LGORITHMS
Arc-addition stage: If ∃ ni, nj, i ≠ j : d_ni^out < D and d_nj^in < D and
m_G(aij) + m_G(aji) < D, add an arc aij to the graph D · G.

Denoting the graph at the end of the arc-addition stage by G', the following
lemma holds:

Lemma 20
With the exception of maximally two nodes ni, nj every node n in G' has
d_n^in = d_n^out = D. If such a pair of nodes does exist, then m_G'(aij) + m_G'(aji) = D.    □
PROOF ([KAPLAN ET AL., 2005]) Assume the existence of three distinct
nodes ni, nj, nk such that for each of these either the in-degree or the out-degree is less than D. Since Σ_n d_n^in = Σ_n d_n^out there will be at least one
node, say ni, with d_ni^out < D and at least one node, say nj, with d_nj^in < D.
Assume further d_nk^in < D (the case d_nk^in = D ⇒ d_nk^out < D is symmetric). Since the arc-addition stage finished without adding further copies
of aij and aik it follows that m_G'(aij) + m_G'(aji) = m_G'(aik) + m_G'(aki) = D.
This again implies that d_ni^in + d_ni^out = 2D. However, since the arc-addition
stage ensures that neither the in- nor the out-degree of any node will be
larger than D, it follows that d_ni^out = D, leading to a contradiction.
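A direct reading of the arc-addition stage gives a simple greedy loop; the following Python sketch (my illustration with hypothetical argument names, not the thesis implementation) adds arcs until no pair qualifies.

def arc_addition(mult, deg_in, deg_out, nodes, D):
    """Greedy arc-addition stage (sketch): while some pair (i, j) has
    spare out-degree at i, spare in-degree at j and room left in the
    sum-of-arcs bound, add one more copy of the arc a_ij.
    `mult` maps ordered pairs to multiplicities (missing pairs count 0)."""
    changed = True
    while changed:
        changed = False
        for i in nodes:
            for j in nodes:
                if i == j or deg_out[i] >= D or deg_in[j] >= D:
                    continue
                if mult.get((i, j), 0) + mult.get((j, i), 0) < D:
                    mult[(i, j)] = mult.get((i, j), 0) + 1
                    deg_out[i] += 1
                    deg_in[j] += 1
                    changed = True
    return mult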
By Lemma 20 at most two nodes ni, nj exist for which the in-degree
and out-degree do not equal D. For these nodes it follows from Lemma
19 that D − (|N| − 1) ≤ d_ni^in, d_ni^out, d_nj^in, d_nj^out. The first phase of the arc-augmenting stage starts by choosing an arbitrary node nk ≠ ni, nj and
adds copies of the arcs aki, aik, akj, ajk until d_ni^in = d_ni^out = d_nj^in = d_nj^out = D.
At this point any node nm, m ≠ k has d_nm^in = d_nm^out = D. Denote the
corresponding graph G''.
Since maximally 2(|N| − 1) incoming and maximally 2(|N| − 1) outgoing arcs have been added to and from nk, and because D = 2^y − 2|N|,
it still holds that d_nk^in, d_nk^out < 2^y (Lemma 19). Furthermore, after the arc-addition stage we had m_G'(aki), m_G'(aik) ≤ D. The first phase of the arc-augmenting stage added maximally (|N| − 1) copies of both aki and aik, so
it now holds that m_G''(aki), m_G''(aik) ≤ 2^y. Since a similar argument holds
for akj and ajk it is still possible to augment G'' into a 2^y-regular graph G0
having the sum-of-arcs property. Since nk is the only node in G'' whose
in-degree or out-degree is larger than D it follows that d_nk^in = d_nk^out = L
and D ≤ L ≤ 2^y.
The second phase of the arc-augmenting stage adds L − D arbitrary
cycles passing through all nodes except nk, ensuring that all nodes have
d_n^in = d_n^out = L. Then 2^y − L arbitrary Hamiltonian cycles are added with
the restriction that none of these Hamiltonian cycles uses any of the arcs
aki, aik, akj and ajk. Note that there have to be at least five vertices in order
for these Hamiltonian cycles to exist. Calling the resulting graph
G0 it follows by construction that:

Lemma 21
G0 is 2^y-regular and half-bound.    □
Next the procedure of Section 6.6.4 on page 68 can be applied to G0.
After y − 1 iterations the procedure will end with a 2-regular half-bound
graph G_{y−1}. It remains to prove the following lemma:

Lemma 22
The graph G_{y−1} will have W(G_{y−1}) ≥ 2 · OPT_atsp.    □
PROOF ([KAPLAN ET AL., 2005]) Using the following facts:

i) x_ij ≥ x*_ij − 1/D,
ii) D = 2^y − 2|N|,
iii) Σ_{aij ∈ A} 1 = |N|(|N| − 1) ≤ |N|², and
iv) OPT_f ≤ |N| · W_max,

we have for the weight of G0 that

W(G0) ≥ Σ_{aij ∈ A} D · w(aij) · x_ij
      ≥ D · Σ_{aij ∈ A} (w(aij) · x*_ij − w(aij)/D)        (from i))
      ≥ D · OPT_f − Σ_{aij ∈ A} w(aij)
      ≥ D · OPT_f − Σ_{aij ∈ A} W_max
      ≥ 2^y · OPT_f − 2|N| · OPT_f − |N|² · W_max          (from ii) and iii))
      ≥ 2^y · OPT_f − 3|N|² · W_max.                        (from iv))

At each of the y − 1 iterations the heaviest graph is retained; thus, using
2^y ≥ 12 |N|² W_max ⇔ 2^{y−1} ≥ 6 |N|² W_max and OPT_f ≥ OPT_atsp, the
following holds for the weight of G_{y−1}:

W(G_{y−1}) ≥ W(G0) / 2^{y−1}
           ≥ (2^y · OPT_f − 3 |N|² W_max) / 2^{y−1}
           ≥ 2 · OPT_f − 3 |N|² W_max / (6 |N|² W_max)
           ≥ 2 · OPT_atsp − 1/2
           ≥ 2 · OPT_atsp    since W(G_{y−1}) and OPT_atsp are integers.
6.6.6 Approximation factor

Theorem 7
The Best Factor algorithm is a 2.5 approximation algorithm.    □

PROOF The algorithm achieves a factor 2/3 for the MAX-ATSP. Through
the reductions in Section 6.5.1 on page 62 this gives a factor 2/3 for the
MAX-HPP. From Theorem 5 on page 56 we now have that this gives a
3.5 − 1.5 · (2/3) = 2.5 approximation for the SSP.
6.6.7 Implementation

The implementation is written in Python and utilises two external LP solver programs8 to find the solution in step 2) on page 66.
A quite remarkable observation is that the solutions to the LP problems turned out to be integral in all but one case out of 355 280. This
simplified the implementation extraordinarily, since the steps 3) to 5) on
page 66 could be substituted with the steps:

iii) No rounding is needed because of the integral solution.

iv) No pair of CCs is needed, as the returned CC is optimal.

v) No decomposition is needed; just remove the lightest arc from each
cycle in the cover to achieve a heaviest path collection.

Removing the lightest arc in step v) will reduce the accumulated weight by at most one third (as all cycles consist of at least three
arcs), thus the approximation factor is ensured.
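Step v) is straightforward to realise. The following Python sketch (an illustration under stated assumptions: cycles given as node lists, arc weights in a dictionary keyed by node pairs) turns a cycle cover into a collection of heavy paths.

def cycles_to_paths(cycles, w):
    """Remove the lightest arc of every cycle in the cover (step v)),
    turning the cycle cover into a collection of heavy paths."""
    paths = []
    for cyc in cycles:                        # cyc = [n1, n2, ..., nk]
        arcs = list(zip(cyc, cyc[1:] + cyc[:1]))
        drop = min(range(len(arcs)), key=lambda i: w[arcs[i]])
        # the path starts just after the removed arc and keeps all others
        paths.append(cyc[drop + 1:] + cyc[:drop + 1])
    return paths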
8 QSopt: http://www2.isye.gatech.edu/~wcook/qsopt/ and glpsol: http://www.gnu.org/software/glpk/
Remark 13
It has not been possible to explain this overwhelming majority of integral solutions to the LP. However, using constructed examples of distance
matrices it has been shown that the feature depends neither on the
distance matrices being asymmetric nor on their satisfying the Inverse-Monge-Condition.    □
6.7 The TSP Based Algorithm

Like the Best Factor algorithm (Section 6.6 on page 64) the basis for this algorithm
is the overlap formulation (Expression 5.6 on page 40). After making the
reductions described in Section 6.5.1 and Section 6.5.2 on page 62 we can
use LKH to find a solution and transform this back into a solution for the
original HPP.
6.7.1 Implementation

The implementation consists of a Python program that transforms an original asymmetric overlap matrix into a corresponding symmetric distance matrix and then calls the LKH program to find a solution
for the transformed TSP instance. This solution is then converted into a
solution for the original SSP using the methods in Section 6.5 and Expression 5.6 on page 40.
The running-time for the implementation is dominated by the calculation of the matrices and the running-time of LKH on the transformed instance. Letting S = (s1, s2, . . . , sn) be the collection of strings, the running-time for the calculation of the original ATSP overlap matrix can (theoretically) be reduced to O(||S|| + n²) [Gusfield, 1997]. Transforming this
matrix into a symmetric matrix for the transformed instance has a
running-time of O((2n)²). The running-time for LKH is empirically determined to be approximately O(n^2.2) [Helsgaun, 1998].
In practice the calculation of the all-pairs prefix-suffix overlap matrix
was implemented as mentioned in Section 6.1.2 on page 48.
6.8 Summary

To recapitulate, Table 6.2 shows the characteristics of the implemented
algorithms.
Table 6.2: SSP algorithm characteristics

Algorithm       Approximation factor   Running-time
Cycle           2.75                   O(log(n) · n²)
RecurseCycle    2.75                   O(log(n) · n²)
Greedy          3.5                    O(log(n) · n²)
Best Factor     2.5                    O(log(D) · n²)
TSP             –                      O(n^2.2)

In the running-times D is the multiplier from Step 3) on page 66
and n is the number of strings in the string collection S = (s1, s2, . . . , sn).
The running-times for the Cycle algorithm and the RecurseCycle algorithm are dominated by Algorithm II.2 on page 52. The running-time for
Greedy is dominated by the sorting of n² pairwise overlaps. The running-time for the Best Factor algorithm might be dominated by the LP solver
program. Finally, the running-time for the TSP algorithm is dominated by
running LKH.
6.9 Notes

The approximation factor theory in Section 6.1 and Section 6.2 is based
on [Blum et al., 1994; Mucha, 2007]. The theory in Section 6.3 is based on
[Breslauer et al., 1996] and the theory in Section 6.4 is based on [Blum
et al., 1994; Kaplan and Shafrir, 2005]. Finally Section 6.6 is based completely on [Kaplan et al., 2005]. All other references are explicitly mentioned.
CHAPTER 7
Experiments

    A little experience often upsets a lot of theory.
This chapter contains descriptions of the experiments performed on the
implemented SSP algorithms together with their results.
The main and almost sole purpose of the SSP experiments was to evaluate, for each of the different algorithms, its properties concerning SSP solution quality. This especially meant that no particular emphasis was put on running-time efficiency in the implementations of the algorithms. In other words: a fast and simple implementation was preferred
to a complex but more running-time efficient implementation.
Since it turned out to be fairly easy to collect test results concerning
properties like running-time and solution quality as a sequencing method,
tests for these were also conducted. It should be kept in mind, though, that
the latter tests were made more out of curiosity than of planned intent,
and as such care should be taken not to draw too definite conclusions
from them.
An initial section describing the test data generation is followed by
sections covering the tests conducted for SSP solution quality, running-time and sequencing quality respectively. Each of these experiment sections finishes with the conclusions drawn from the test results.

7.1 Test Data
The test data was constructed according to one of the procedures described in [Romero et al., 2004]. The procedure resembles the Shotgun
Sequencing method (see 4.2.1 on page 30), thus we will need the following
definition:
Definition 13 (Coverage)
As a measure of the relation between the number of clones and the number of overlapping fragments, the coverage is defined as

cov = n · ℓ / |G|                                        (7.1)

where n is the number of fragments, ℓ is the average length of the fragments and |G| is the length of the original DNA sequence (genome).    □
As mentioned in [Gusfield, 1997, chapter 16.14] a five- to tenfold coverage generates the best results for Shotgun Sequencing.
The steps for producing test data are (a code sketch follows the list):
1) From each of a pool of 64 fasta-format files1 a nucleotide sequence
of length N is obtained.
2) A collection of strings S of equal length ` is constructed by shifting
character by character along the obtained nucleotide sequence.
3) The collection S is modified by randomly removing a number of
strings r until a wanted coverage value is obtained.
4) Ensure that S is substring free (see Remark 3 on page 29).
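A minimal Python sketch of steps 2)–4), assuming the nucleotide sequence is already in hand (the function name and the simple deduplication used for substring freeness are assumptions of this illustration):

import random

def make_fragments(seq, length, coverage, rng=random):
    """Slide a window of the given length over seq (step 2), then
    randomly remove fragments until cov = n·l/|G| reaches the wanted
    value (step 3)."""
    frags = [seq[i:i + length] for i in range(len(seq) - length + 1)]
    target = int(coverage * len(seq) / length)   # n from Equation (7.1)
    while len(frags) > target:
        frags.pop(rng.randrange(len(frags)))
    # step 4): equal-length strings can only be substrings of each other
    # if identical, so deduplication suffices here
    return list(dict.fromkeys(frags))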
In the experiments coverage values of 6, 8, 10, 20, 30 and 40 were
used. String lengths varied from 6 to 100, and collection sizes over
100, 200, . . . , 1000, with a few of size 2000 and 3000. For exact
details and the calculation of the necessary nucleotide sequence length N see
Appendix D on page 161.
In [Romero et al., 2004] two further methods to generate test data
were used. These methods consist of a random generation of the nucleotide sequence in step 1) and a random generation of all strings in the
collection S, respectively.
The choice of the above method was due to the results obtained by
Romero et al. These results showed that the largest deviations from a
lower bound for the SSP solutions were obtained using test data generated from real nucleotide sequences. This might not be all that surprising, as uniformly random generated nucleotide sequences tend to have
small and sparse overlaps and thus are easy to compress optimally. The 64
nucleotide sequences were obtained from GenBank2 and comprise the
sequences which it was possible to retrieve using the accession numbers
given in [Romero et al., 2004].
1 See appendix C on page 157.
2 http://www.ncbi.nlm.nih.gov/
Table 7.1: SSP experiments overview

                 coverage     string length      string count
SSP solution     Figure 7.1   figures 7.2, 7.3   figures 7.4, 7.5
Running-time     –            –                  figures 7.7 – 7.9
Sequencing       –            Figure 7.10a       Figure 7.10b

Figures showing the quality dependency of the row values upon the column variables.
Table 7.1 contains an overview of the following figures and which relationship combination each of them illustrates. All tests were run on two
dedicated, identical machines.3
Remark 14
The following abbreviations are used in the figure legend for all figures:

• ’RCycle’ for RecurseCycle algorithm,
• ’Best’ for Best Factor algorithm, and
• ’LB’ for lower bound.

Please note the fine scaling on the y-axes in the figures 7.1–7.5.    □

7.2 SSP Solver Tests
Even though the SSP-solutions originating from the TSP algorithm were
expected to be optimal, at the very least for string counts up to 1000, it was
decided to calculate a lower bound for the optimal SSP solution for each
test.
The lower bound was calculated by Concorde as a guaranteed (with
respect to numerical precision) feasible solution to the dual of an LP relaxation of the TSP. The results for all SSP algorithms
are thus depicted as their percentage deviation above the lower bound, i.e.
the lower bound value is used as the 100 % mark.
7.2.1 Variable coverage

From Equation 7.1 it follows that the coverage and the length of the original sequence are inversely proportional. For the test data this means
that, keeping string (fragment) count and length constant, increasing coverage is equivalent to a shorter original sequence length. In other words,
the problem of solving the SSP is expected to become easier for increasing coverage, as the superstring to determine becomes smaller while the
fragment count and overlaps are constant.

3 Intel Xeon, dual core, 3.0 GHz, 1.0 GB RAM
[Figure 7.1: SSP solution dependency on variable coverage. Two panels plotting percentage in terms of lower bound against coverage for the Cycle, RCycle, Best, Greedy, Tsp and LB values: (a) SSP results for count 200, length 50; (b) SSP results for count 200, length 100.]
The tests were conducted with coverage values of 6, 8, 10, 20, 30 and
40. Only the results for string count 200 and string lengths 50 and 100
are shown, as they are representative of the results using other string
counts and lengths. As can be seen in Figure 7.1 the expected increase in
SSP-solution quality with larger coverage values is not obvious. Actually
all algorithm SSP-values (except for the TSP algorithm) seem to fluctuate
somewhat. The fluctuations can partly be explained as a visual effect due
to the very fine y-axis scale.
7.2.2 Variable length

From Equation 7.1 it can be seen that the string (fragment) length and the
length of the original sequence are directly proportional. For the test data
this means that, keeping string count and coverage constant, increasing
length is equivalent to a larger original sequence length.
The problem of solving the SSP is nevertheless expected to become easier, as
the possible overlaps increase with the string lengths.
[Figure 7.2: SSP solution dependency on length (SSP results for coverage 8, count 100; percentage in terms of lower bound against string length for Cycle, RCycle, Best, Greedy, Tsp and LB)]
For these tests string lengths of 10, 20, 25, 40, 50 and 100 were used.
In Figure 7.2 and Figure 7.3 some of the results for a coverage of 8 are depicted; the results for other coverage and string count values are similar.
[Figure 7.3: solution dependency on length (SSP results for coverage 8, count 500), compare with Figure 7.10a]
In contrast to the variable coverage results, the variable length results
are very consistent. All algorithms produce values of increasing
quality for longer lengths. Only a string length of 10 results in values
up to 1 % above the lower bound. As always this is not the case for the TSP
algorithm, which constantly lies more or less on the lower bound line.
7.2.3 Variable count

From Equation 7.1 it follows that the string count and the length of the
original sequence are directly proportional. For the test data this means
that, keeping string length and coverage constant, increasing count is
equivalent to a larger original sequence length.
The problem of solving the SSP is expected to stay undisturbed, as the
possible overlaps are non-increasing and the extra amount of strings is
compensated for by the increasing superstring length.
For these tests string counts of 100, 200, 300, 400, 500, 600 and 700
were used. In Figure 7.4 and Figure 7.5 the results for a coverage of
8 are depicted for string lengths of 25 and 40, the results for other coverage
and string length values being similar.
[Figure 7.4: solution dependency on count (SSP results for coverage 8, length 25), compare with Figure 7.10b]

The results show a slight tendency towards reduced solution quality
with growing string count. The fluctuations in the figures are mainly due
to the changes in y-value scale between Figure 7.4 and Figure 7.5, and
probably also a too conservative lower bound value. If the results from
Figure 7.5 are correlated using the TSP algorithm values as lower bound,
the lines look smoother (see Figure 7.6).
7.2.4 SSP solution conclusions
Some recurrent patterns that are present in all of the SSP solution figures
are:

• the TSP algorithm values are almost constantly lying on the lower
bound line,
• the Greedy algorithm produces the second best values, followed
by the two Cycle algorithms and the Best Factor algorithm, which
produce solutions of similar quality,
• the Cycle algorithm and the RecurseCycle algorithm produce, for all
practical purposes, identical values, and finally
• all algorithms produce results typically less than 1 % above
the lower bound, and always less than 2 %.

[Figure 7.5: solution dependency on count (SSP results for coverage 8, length 40)]

To further assess the results, Table 7.2 lists the number
of times each algorithm achieved the smallest (best) value in all test runs.

Table 7.2: minimum scores for SSP algorithms in 355 280 tests

Algorithm       minimum scores   percentage
Cycle           301 637          84.90 %
RecurseCycle    301 664          84.91 %
Best Factor     301 251          84.79 %
Greedy          306 519          86.28 %
TSP             355 280          100.00 %
The conclusions that can be drawn from these observations are:

1) The TSP algorithm is far superior to the other algorithms in view of
solution quality.

2) The Greedy algorithm produces better results than the CC based algorithms even though its approximation factor is higher.

3) The Best Factor algorithm, the Cycle algorithm and the RecurseCycle algorithm produce solutions of alike quality.

4) The improvement of the RecurseCycle algorithm with respect to the
Cycle algorithm is negligible.

5) All approximation algorithms produce far better solutions than
their worst case approximation factors might indicate.

[Figure 7.6: Figure 7.5 correlated using TSP algorithm values as 100 %]
7.3 Timing Tests

For test data with coverage 6 the running-time for all algorithms plotted
against string count is shown using normal scale in Figure 7.7.
[Figure 7.7: SSP algorithms running-time using normal scale (running-time in seconds against string count for coverage 6)]

To examine the running-time complexities, the Best Factor algorithm
and the TSP algorithm running-times were plotted using double-log scale
(see Figure 7.8a). A linear regression was made for both algorithms4 and
the result is plotted in Figure 7.8b. The values obtained by the linear
regressions are shown in Table 7.3.

4 Using the stats package from scipy.
The Greedy algorithm, the Cycle algorithm and the RecurseCycle algorithm running-times are shown in Figure 7.9a using double-log scale
with the running-time values divided by log2(string count). A linear regression was made for all algorithms and the result is plotted in Figure
7.9b. The values for the linear regressions are shown in Table 7.3.
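The regression itself is a one-liner with scipy; the following sketch of the analysis (argument names are assumptions of this illustration) estimates the polynomial exponent as the slope of log2(running-time) against log2(string count), optionally dividing out the log factor first as done for Greedy, Cycle and RecurseCycle.

import numpy as np
from scipy import stats

def fit_exponent(counts, times_ms, log_factor=False):
    """Return slope, r, r-squared, p and stderr, the quantities
    tabulated in Table 7.3; the slope estimates the exponent of a
    polynomial running-time."""
    x = np.log2(np.asarray(counts, dtype=float))
    y = np.asarray(times_ms, dtype=float)
    if log_factor:
        y = y / x                      # divide by log2(string count)
    y = np.log2(y)
    slope, intercept, r, p, stderr = stats.linregress(x, y)
    return slope, r, r**2, p, stderr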
The following observations can be made:

• the TSP algorithm and the Best Factor algorithm are clearly the slowest,
• the Greedy algorithm, the Cycle algorithm and the RecurseCycle
algorithm are all comparable with respect to running-time, and
• the TSP algorithm and the Best Factor algorithm have a polynomial running-time dependency, while the Greedy algorithm, the Cycle
algorithm and the RecurseCycle algorithm have a log times polynomial running-time dependency.

Comparing these observations with Table 6.2 on
page 74 it can be concluded that all algorithms, apart from the Best
Factor algorithm, exhibit the expected running-times. The running-time measured for the Best Factor algorithm is probably due to its external LP solver.
[Figure 7.8: Best Factor and TSP algorithm running-times. Two panels in double-log scale, log2(running-time in milliseconds) against log2(string count) for coverage 6: (a) algorithm running-times; (b) running-time linear regressions.]
[Figure 7.9: Cycle, RecurseCycle and Greedy algorithm running-times. Two panels in double-log scale, log2(running-time/log2(string count)) against log2(string count) for coverage 6: (a) algorithm running-times; (b) running-time linear regressions.]
Table 7.3: Running time linear regression values

Algorithm       slope   r-value   r-squared   p-value   stderr
Cycle           1.89    1.00      1.00        0.00      0.022
RecurseCycle    1.89    1.00      1.00        0.00      0.021
Best Factor     2.16    1.00      1.00        0.00      0.026
Greedy          1.67    1.00      1.00        0.00      0.027
TSP             2.87    0.99      0.98        0.00      0.060

The r-value is the correlation coefficient, the r-squared is the coefficient of
determination, and the two-sided p-value is for a hypothesis test whose null
hypothesis is that the slope is zero, i.e. no correlation at all.
7.4 Sequencing Tests

For the test data with coverage 8 the similarity to the original string was
tested using the needle application from the EMBOSS suite [Rice, Longden, Bleasby, et al., 2000].5 This is an implementation of the Needleman-Wunsch Optimal Pairwise Alignment (OPA) algorithm, but in this case it
was only used in order to get a percentage value for the similarity between
the two strings.
Remark 15
The alignment tests were conducted before the Best Factor algorithm was
implemented; therefore no values are shown for this algorithm in the figures.    □
The results for variable string length and variable string count were
collected. Results for string count 600 and for string length 25 are shown,
as the other string count/length results were similar. The results are
depicted in Figure 7.10. Figure 7.10a should be compared to Figure 7.3 on page 80 and Figure
7.10b should be compared with Figure 7.4 on page 81.
Another measure of the similarity between the calculated superstring
and the original string is illustrated in Table 7.4.
From the figures it can be seen that there is a correlation between the
SSP solution quality and the string similarity, indicating that the SSP is a
plausible model for the Shotgun Sequencing.

5 http://emboss.sourceforge.net/
[Figure 7.10: SSP algorithm alignment similarity results. Percentage identity to the original string for RCycle, Tsp, Greedy and Cycle: (a) alignment similarity for variable string length (align results for string count 500), compare with Figure 7.3; (b) alignment similarity for variable string count (align results for string length 25), compare with Figure 7.4.]
Table 7.4: superstring lengths compared to original string length

Algorithm       shorter   equal   longer
Cycle           78 %      11 %    11 %
RecurseCycle    78 %      11 %    11 %
Best Factor     79 %      11 %    10 %
Greedy          79 %      12 %    9 %
TSP             86 %      14 %    0 %
From Table 7.4 it is evident, though, that all algorithms compress the fragments too much. Note that
due to the way of constructing the test data (removal of randomly chosen
fragments in Step 3) on page 76) it might not be possible to reconstruct a
superstring with the same length as the original sequence.
Part III
The Circular Sum Score and Multiple Sequence Alignment

[Part divider image: DNA strands tour with 16384 cities]

CHAPTER 8
Sequence Alignments

    One or two homologous sequences whisper . . . A full multiple alignment shouts out loud.
    Arthur Lesk
This chapter introduces and defines biological sequence alignments. It
further gives a description of the problems associated especially with MSAs,
along with a characterisation of the commonly used construction algorithms.
8.1 Motivation
Many algorithms in bioinformatics are concerned with comparing sequences.
This is mainly due to the following:
Definition 14 (The First Fact of Biological Sequence Analysis)
In biomolecular sequences (DNA, RNA or amino acid sequences) high sequence similarity usually implies significant functional or structural similarity.    □
This fact can be popularly rephrased as:
If it looks like a duck, it will probably walk and talk like a duck.
Evolution reuses and modifies successful structures, and “duplication
with modification” is the central paradigm for protein (amino acid sequence) evolution. A large majority of the human genes are very similar, if not directly identical, to the genes of, for example, the mouse.1 What
makes a mouse differ from a human may be determined more by protein
regulation than by differences in the protein sequences.
1 This explains why some molecular biologists affectionately refer to mice as “those furry little people”.
Though the first fact is extremely powerful, it does not signify a symmetric property – which is captured in:
Definition 15 (The Second Fact of Biological Sequence Analysis)
Evolutionary and functionally related molecular sequences can differ significantly throughout much of the sequence and yet preserve the same
three-dimensional structure(s), or the same two-dimensional substructure(s), or the same active sites, or the same or related dispersed residues
(DNA or amino acid).    □
The second fact could then be rephrased as
Even if it walks and talks like a duck, it does not have to
look like a duck.
8.2 Pairwise Sequence Alignment
The similarity of two sequences is often measured using an alignment:
Definition 16 (Global pairwise alignment)
A (global) alignment of two strings s1 and s2 is obtained by first inserting
chosen spaces (or dashes) either into or at the ends of s1 and s2 , and then
placing the resulting strings one above the other so that every character
or space in either string is opposite a unique character or space in the
other string.
□
The term “global” emphasises that the full length of both strings (as
opposed to only a substring) participates in the alignment. A pairwise
alignment (using dashes) of two strings s1, s2 ∈ Σ* results in two strings
a1, a2 ∈ (Σ')* where Σ' = Σ ∪ {'−'} and '−' ∉ Σ, with the property that if we
remove the inserted dashes we will obtain the original strings. A single
occurrence or a contiguous sequence of the space character ('−' above) is
denoted a “gap”. Biologically gaps are interpreted as either insertion or
deletion events (indels) in the sequences.

Remark 16
In the rest of this thesis '−' will be used as the gap character and the
notation A = [[a1, a2]] will be used to denote an alignment. We will use
the notation A[r][c] to describe the c’th character of the r’th row in the
alignment. Furthermore, we denote the sequences a1, a2 the induced sequences for s1, s2 in the alignment A.    □
To find out whether two sequences share a common ancestor the sequences are usually aligned. The problem is to determine the alignment
that maximises the probability that the sequences are related. The evaluation of an alignment is done using a scoring function.
Definition 17 (Scoring Function for an Alignment)
Let A be the set of possible alignments for two strings s1, s2 ∈ Σ* over the
finite alphabet Σ. A function f : A → ℝ is called a scoring function for the
alignment.    □

The value of the scoring function is typically defined as the sum of a
column-wise scoring of the alignment characters using a scoring matrix
M of size |Σ'| × |Σ'|.
Example 12
Given two DNA-strings AGCTCA and ATTC and an alignment of these:

A = [[ A G C T C A
       A T − T C − ]]

The score S of the alignment, using the score matrix

        A    C    G    T    −
A       5    1    3    1   −1
C       1    5    1    3   −1
G       3    1    5    1   −1
T       1    3    1    5   −1
−      −1   −1   −1   −1    0

is given by scoring the columns of A as

S = 5 + 1 − 1 + 5 + 5 − 1 = 14.    □
In Example 12 the scores for gaps are independent of the length of the
current gap. An often used scoring strategy for gaps is the so-called
“affine gap cost”, which scores a gap g of length |g| according to the formula cost(g) = o + |g| · e, where o is the cost of starting a gap (opening
cost) and e the cost of extending it.2 Further refinements can be setting the cost of terminal gaps to zero or letting the gap cost depend on
the distance to neighbouring gaps.
Determining the OPA for two strings is done using dynamic programming3 and has running-time and space complexity of O(|s1| · |s2|). As
mentioned in [Gusfield, 1997, chapter 12.1] the use of a recursive technique due to Hirschberg can reduce the space complexity to O(min(|s1|, |s2|))
while only increasing the running-time by a factor of two.
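For concreteness, the following is a minimal Python sketch of the dynamic program with a simple linear (not affine) gap cost; the score argument is assumed to be a function of two characters, e.g. a lookup in the matrix of Example 12.

def needleman_wunsch(s1, s2, score, gap=-1):
    """Global pairwise alignment by dynamic programming; returns the
    optimal score and the two induced sequences."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                           dp[i - 1][j] + gap,     # gap in s2
                           dp[i][j - 1] + gap)     # gap in s1
    a1, a2, i, j = [], [], n, m                    # traceback
    while i or j:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i and dp[i][j] == dp[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append('-'); i -= 1
        else:
            a1.append('-'); a2.append(s2[j - 1]); j -= 1
    return dp[n][m], ''.join(reversed(a1)), ''.join(reversed(a2))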
2 We deliberately refrain from participating in the controversy of whether the first gap should have the cost ’o’ or ’o + e’.
3 In bioinformatics the algorithms are named “Needleman-Wunsch” and “Smith-Waterman” for global and local alignment respectively [Gusfield, 1997, chapter 11].
8.3 Multiple Sequence Alignment
If comparison of two Ribonucleic Acid (RNA), DNA, or amino acid sequences
(proteins) can lead to biological insights, then this is even more so in the
case of comparing multiple sequences. Being able to determine “families”
of proteins has very important applications in at least three ways:
History of Life on Earth Sequence families are often used as the basis
for building evolutionary or phylogenetic trees. The sequences are
put as leaves in the tree (denoted the “taxa”) and the inner vertices of the tree
then reflect unknown common ancestors.
Identifying conserved regions When comparing multiple viral or bacterial protein strains the most conserved regions are typically also
the critical functional regions for the proteins. Since viruses mutate at
a very rapid rate, identifying these regions is of paramount importance when designing drugs. A drug attacking a conserved region
will be effective for far longer than a drug attacking a more unstable
region.

Structural and functional prediction A protein folds into so called
“secondary structures”. These structures are divided into three categories: α-helices, β-sheets and random coils or loops (see Figure 8.1).
Indel events seldom occur in the first two categories, meaning that a
possible way of identifying coils and loops is as regions with gaps
in the alignment.
[Figure 8.1: Secondary structure of Avian Sarcoma Virus, a retrovirus with many similarities to HIV (from http://mcl1.ncifcrf.gov/integrase/asv_intro.html)]
A straight-forward extension of the pairwise alignment definition leads
to:
Definition 18 (Global Multiple Alignment)
A global multiple alignment of k > 2 strings, S = (s1, s2, . . . , sk), is a natural generalisation of an alignment for two strings. Chosen spaces are
inserted into (or at either end of) each of the k strings so that the resulting strings have the same length, defined to be ℓ. Then the strings are
arrayed in k rows of ℓ columns each, so that each character and space of
each string is in a unique column.    □
A more mathematical formulation of the above definition would be
Definition 19 (Multiple Sequence Alignment)
Let S = (s1, s2, . . . , sk) be a collection of sequences with si ∈ Σ* ∀ i in 1 . . . k
and Σ a finite alphabet. An MSA consists of a collection of sequences A =
[[a1, a2, . . . , ak]] with ai ∈ (Σ')^ℓ ∀ i in 1 . . . k, where Σ' = Σ ∪ {'−'} and '−' ∉ Σ.    □
Remark 17
We call k the size of the MSA and ℓ the length of the MSA. In connection
with MSAs the literature often denotes the number of strings k and their
(original) length n. We will use this notation, even though n in connection
with the SSP was used to denote the number of strings. Hopefully this will
not lead to confusion.    □
Constructing an MSA immediately leads to three distinct problems:
1) The sequences to be aligned have to be determined.
2) A scoring function for the MSA has to be defined.
3) The optimal MSA with respect to the scoring function has to be determined.
Re 1) Algorithms for constructing an MSA usually assume that the
sequences are “homologous”.4 This does not mean that the algorithms will
refrain from constructing an MSA, should this not be the case, only that
the resulting MSA will at best be useless and in the worst case directly
misleading.
Re 2) Defining a meaningful scoring function for an MSA is very difficult. This has to do with the notion of correctness of an MSA. What
does it mean that an MSA is biologically correct and is it even possible to
recognise a correct MSA if we see one? The challenge is to define a mathematical function capturing all biological knowledge about the structure,
function and evolutionary history of the sequences. This challenge is still
to be met; as of today there is no consensus as to how to score an MSA
[Elias, 2006; Gusfield, 1997; Notredame, 2002, 2007].

4 The sequences share a common ancestor.
Re 3) As if the problem of defining a scoring function was not enough:
apart from instances with very short or very few sequences, it is infeasible to calculate an optimal MSA. The general MSA problem has been
proven NP hard for the most used scoring functions [Elias, 2006; Wang
and Jiang, 1994].
Algorithms for constructing MSAs can be divided into three groups:
Exact Algorithms As a rule of thumb the optimal MSA cannot be practically calculated for more than three sequences of realistic lengths.
There are algorithms that extend this limit by extending a dynamic
programming approach into higher dimensions. They depend on
identifying the portion of the hyperspace that will contribute to the
solution, thereby minimising the space constraints. However even
the best implementations seem to reach a limit around ten sequences
[Notredame, 2002]. This group contains all implementations of the
Needleman-Wunsch algorithm. One of the best known extensions
to higher dimensions is the application msa [Lipman, Altschul, and
Kececioglu, 1989].5
Progressive Alignment Algorithms These algorithms follow the simple idea of constructing the final MSA by adding one sequence (or a
subalignment) at a time to a continuously growing alignment. They
depend on a proximity measure in order to choose the next sequence
(or subalignment) to add. Additionally, the order of succession has
very high influence on the terminal result, in particular the insertion of gaps. This feature is captured in the common formulation:
“once a gap, always a gap”. An illustration of this problem is seen
in Example 13 on the facing page. The best known program using
a progressive alignment algorithm is Clustalw [Thompson, Higgins, and Gibson, 1994]. It is also one of the first published MSA
construction applications. Despite having been surpassed concerning both running-time and quality [Blackshields, Wallace, Larkin,
and Higgins, 2006], Clustalw is still one of the most popular MSA
construction applications.
Iterative Alignment Algorithms The essence of these algorithms is the
idea that it is possible to arrive at an optimal MSA by refining a
given sub-optimal MSA. The ways to refine the original MSA range
from simply choosing random sequences (or subalignments) to realign, to a variety of techniques including Hidden Markov Model
(HMM), “Simulated Annealing, Genetic Algorithms, Divide-and-Conquer”
and “Consistency Based” approaches. The latter algorithms try to
determine the MSA most consistent with libraries containing structural, functional, local and global alignment similarity information.
Known applications of this type are Probcons [Do, Mahabhashyam,
Brudno, and Batzoglou, 2005] and TCoffee [Notredame, Higgins,
and Heringa, 2000].

Many MSA applications are a mix of the last two groups.

5 Although contrary to popular belief msa does not guarantee the mathematically optimal solution [Notredame, 2002].
Example 13
The four sequences THE LAST FAT CAT, THE FAST CAT, THE VERY
FAST CAT and THE FAT CAT are progressively aligned in this order to
construct an MSA. The first alignment is

THE LAST FAT CAT
THE FAST CAT −−−

Adding THE VERY FAST CAT gives

THE LAST FA−T CAT
THE FAST CA−T −−−
THE VERY FAST CAT

and adding THE FAT CAT gives the final MSA

THE LAST FA−T CAT
THE FAST CA−T −−−
THE VERY FAST CAT
THE −−−− FA−T CAT

If the algorithm in the first alignment chooses to mismatch L, F and F, C
instead of aligning CAT with CAT, the resulting end gap will stay in the
final MSA. This can happen if the cost of an end gap is below the cost
of an interior gap, which is the case for several MSA applications. In the
final MSA a far better score would be achieved by realigning the second
sequence.    □
8.4 Notes

This chapter is mainly based on [Notredame, 2002] and [Gusfield, 1997];
in particular most definitions are taken from the latter.
CHAPTER 9
Circular Sum Score

    Oh he love, he love, he love / He does love his numbers / And they run, they run, they run him / In a great big circle / In a circle of infinity / But he must, he must, he must / Put a number to it
    Kate Bush
This chapter introduces the Circular Sum (CS) score along with the considerations leading to its definition and a description of its main characteristics. The CS score is at the basis of the MSA algorithms described in a
later chapter.
9.1 Sum-of-Pairs Score
As mentioned in Chapter 8 on page 93 one of the problems when dealing
with MSA is how to evaluate them. One of the first introduced scoring
schemes [Bacon and Anderson, 1986; Murata, Richardson, and Sussman,
1985] can be seen as a natural extension of the scoring function for pairwise alignments.
Definition 20 (Sum of Pairs Score)
Let A = [[a1, a2, . . . , ak]] be an MSA of the sequences s1, s2, . . . , sk. Let
S(ai, aj) be the score for the induced alignment of the strings si, sj
in A. We then define the Sum-of-Pairs (SOP) score of A as

SOP(A) = Σ_{i=1}^{k} Σ_{j>i} S(ai, aj).    □
Example 14
Using the scoring scheme of +1 for character matches and 0 for everything
else,1 the SOP score of the final MSA from Example 13 on page 99

a1: THE LAST FA−T CAT
a2: THE FAST CA−T −−−
a3: THE VERY FAST CAT
a4: THE −−−− FA−T CAT

is calculated from the induced pairwise scores

S(a1, a2) = 8, S(a1, a3) = 9, S(a1, a4) = 9,
S(a2, a3) = 5, S(a2, a4) = 5,
S(a3, a4) = 9,

giving the result 26 + 10 + 9 = 45.    □

1 This scoring scheme is also known as “unit metric” or “edit distance”.
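As a small illustration (my sketch, not part of the thesis code), the SOP score of Definition 20 is a direct loop over row pairs; with the unit metric it reproduces the 45 above.

from itertools import combinations

def unit_metric(a, b):
    """+1 per column with matching characters; gaps and the word
    separators do not count as matches."""
    return sum(x == y and x not in '- ' for x, y in zip(a, b))

def sop_score(msa, pair_score=unit_metric):
    """Sum-of-Pairs score: sum the induced pairwise scores over all
    unordered pairs of rows (Definition 20)."""
    return sum(pair_score(a, b) for a, b in combinations(msa, 2))

msa = ["THE LAST FA-T CAT",
       "THE FAST CA-T ---",
       "THE VERY FAST CAT",
       "THE ---- FA-T CAT"]
assert sop_score(msa) == 45   # the value from Example 14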
The SOP score is easy to calculate, but like most scoring functions it lacks
a clear biological reasoning. In [Notredame, 2002] it is even stated that
the SOP score is wrong from an evolutionary point of view. The following
section will elaborate on this claim.
9.2 MSAs and Evolutionary Trees
Substitution matrices used in alignments are often constructed using a
predefined evolutionary model in which the score of a substitution, a gap
or a conservation is determined according to the biological likelihood of
this event happening. This means that unlikely mutations will have negative (or very small) scores, while mutations occurring more often than
would be expected by chance will have positive scores.2 The most commonly used scoring matrices are the Point Accepted Mutation (PAM) and
Block Substitution Matrix (BLOSUM) series.3 Typically the scoring of gaps
is done using the “affine” gap cost (see Section 8.2 on page 94).
We can connect an MSA to an evolutionary tree in the following way:
Definition 21 (Evolutionary Tree)
Let S = (s1, s2, . . . , sk) be a collection of biomolecular, homologous sequences with k ≥ 2. An Evolutionary or Phylogenetic Tree for S is a
rooted, binary tree where the leafs are labelled with the k sequences
(extant species) and each interior vertex represents a common (usually
unknown) ancestor of its children (extinct species).    □

2 If a minimisation problem is preferred for the MSA the same values can be used after either multiplying each value by −1 or subtracting each value from the largest value.
3 http://en.wikipedia.org/wiki/Substitution_matrix
Consider an evolutionary tree of five sequences S = (s1, s2, s3, s4, s5)
(see Figure 9.1), and assume this is the correct evolutionary tree for the sequences. If we look upon the construction of an MSA for the sequences as
a minimisation problem, we can interpret the score of the pairwise alignment [[s1, s2]] as a measure of the probability of evolutionary events (mutations and indels) on the edges (s1, v1) and (v1, s2). Likewise the score of
the alignment [[s3, s4]] evaluates the likelihood of evolutionary events on
the edges (s3, v2), (v2, v3), (v3, s4).
In other words: the shorter the distance between two sequences in the
tree, the less likely it is that evolutionary events took place, i.e. the more
identical and closely related the sequences will be.
[Figure 9.1: Evolutionary tree with five sequences (species): root r, inner vertices v1, v2, v3 and leafs s1, . . . , s5]
With this interpretation in mind, consider an MSA of the sequences
in the same evolutionary tree using the SOP score. If a “tick” is added to an edge each time it is included in the calculation of a pairwise
alignment, we get a result as shown in Figure 9.2. The effect is that
some edges are considered more often than others. However, there is
no biological justification to suggest that some evolutionary events are
more important than others. Another way to formulate the evolutionary
problem with the SOP score is that, in a sense, the SOP score considers each
sequence to be an ancestor of every other sequence.
9.3 Circular Orders, Tours and CS Score
The considerations in the last section thus led to the idea of defining
a scoring function that will consider each edge in an evolutionary tree
equally. Or quoting from [Gonnet et al., 2000]:
[Figure 9.2: SOP score “ticks” added for each edge: every leaf edge is counted 4 times and every inner edge 6 times (from [Gonnet et al., 2000])]
How can a tree be traversed, going from leaf to leaf, such that
all edges (each representing its own episode of evolutionary divergence) are counted the same number of times?
Definition 22 (Circular Order and Circular Tour)
Let T be an evolutionary tree for the sequences S = (s1, s2, . . . , sk). A
permutation of the sequences which, when interpreted as a traversal of
T, leads to a tour visiting each leaf once and each edge exactly twice, is
called a Circular Order (CO). A tour in a tree following a CO is denoted a
Circular Tour (CT).4    □
Consider our example evolutionary tree for five sequences (Figure 9.3 on the
next page). If the tree is traversed in the order s1, s2, s3, s4, s5 (see subfigure 9.3a) the result will be a CT. Another CO would be s1, s2, s5, s4, s3,
leading to the CT depicted in subfigure 9.3b. The last tour can also be
imagined as being “equal to” the first tour in an isomorphic tree resulting from rotating the subtree Tv2 around v2. If we consider cycle orders
which are cyclic shifts of each other as giving the same CT, then the number of distinct CTs is given by the number of rotations around all inner
vertices apart from the root. For a tree with n leafs the number of distinct
CTs is thus 2^{n−2}.
As an evolutionary tree has at least two leafs, we now have:

Lemma 23
The CT is the shortest possible tour through an evolutionary tree that visits
each leaf once.    □

4 Strictly speaking this is only a tour as far as the distinct leaf nodes are concerned, as each inner node is visited twice.
[Figure 9.3: Circular tours through the example tree: (a) circular order 1-2-3-4-5; (b) circular order 1-2-5-4-3]
PROOF A tour visiting all leafs and returning to the “start” leaf will have
to traverse each edge at least twice, as every edge is in both ends
incident to subtrees containing at least one leaf.
A CT per definition visits each edge exactly twice. Using induction over
the number of leafs, we can prove the existence of a CT in any tree.

Base Case: In a tree with two leafs the only tour is a CT (see Figure 9.4).

Induction Case: Assume that every evolutionary tree with n − 1 leafs
has a CT. Consider a tree Tn with n leafs. If we remove a subtree of
Tn consisting of a leaf sn and its non-root parent pn together with
the edge between them (such a subtree will exist for n ≥ 3), the
resulting tree Tn−1 will have n − 1 leafs. Per induction hypothesis
there exists a CT, ct_{n−1}, in Tn−1. Now consider a tour tn in Tn which is
identical to ct_{n−1}. The tour will be passing pn twice, as pn “divides”
an edge in Tn−1. If we extend the tour with a visit of sn at one of
these passings, the resulting tour is a CT for Tn (see Figure 9.5 on
the next page).

[Figure 9.4: Base case: a root r with two leafs s1, s2]
[Figure 9.5: Induction case, steps (a)–(d)]
We can now finally close in on the new scoring function for MSAs by
considering the following question: Given the homologous sequences S =
(s1, s2, . . . , sk), how can we find a CT through an evolutionary tree for S
without actually knowing or having to construct the tree?
If we interpret the score of the OPA [[si, sj]] as a distance measure for
the strings (leafs) in the evolutionary tree and use Lemma 23, the problem of finding
a circular order and a CT reduces to solving a TSP instance
with the k(k − 1)/2 pairwise alignment scores as distance matrix and the k sequences as cities.
The determination of a CO and CT can be done very efficiently5 compared to the calculation of the k(k − 1)/2 pairwise alignment scores that often
is a prerequisite for the construction of an MSA.
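The following Python sketch (purely illustrative; a real implementation would call a TSP solver such as LKH instead of enumerating permutations) makes the reduction concrete: the CO is read off the heaviest closed tour through the matrix of pairwise OPA scores.

from itertools import permutations

def circular_order(score):
    """Brute-force CO: the heaviest closed tour through the k-by-k OPA
    score matrix, with city 0 fixed to remove cyclic duplicates."""
    k = len(score)
    def tour_weight(p):
        cycle = (0,) + p + (0,)
        return sum(score[a][b] for a, b in zip(cycle, cycle[1:]))
    best = max(permutations(range(1, k)), key=tour_weight)
    return (0,) + best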
Definition 23 (Circular Sum Score)
Let A = [[a1, a2, . . . , ak]] be an MSA of the sequences s1, s2, . . . , sk. Let
S(ai, aj) be the score for the induced alignment of the strings si, sj
in A. We then define the Circular Sum score of A as

CS(A) = 1/2 · Σ_{i=1}^{k} S(a_{c_i}, a_{c_{i+1}})

where c_{k+1} = c_1 and C = (c1, c2, . . . , ck) is a CO.    □

5 By now we also know which tools to use.
9.3. Circular Orders, Tours and CS Score
The CS score is similar to the SOP score, except that not all pairwise
alignments are added, only the pairwise alignments in CO are added and
divided by two. The CS score can be regarded as a SOP score in CO.
Example 15
Using the edit distance scoring for the four sequences from Example 14
on page 101, we get the following OPA distance matrix.

s1: THE LAST FAT CAT
s2: THE FAST CAT
s3: THE VERY FAST CAT
s4: THE FAT CAT

      s1  s2  s3  s4
s1    13   9   9   9
s2     9  10  10   9
s3     9  10  14   9
s4     9   9   9   9

A CO will be C = (s1, s2, s3, s4). The CS score of the alignment from Example 14 is taken along this order, with induced scores S(a1, a2) = 8,
S(a2, a3) = 5, S(a3, a4) = 9 and S(a4, a1) = 9, giving the result

1/2 · (8 + 5 + 9 + 9) = 15.5.    □
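Again a short sketch (mine, not the thesis implementation) makes the definition concrete; reusing unit_metric and the msa list from the sketch in Section 9.1, it reproduces the 15.5 of Example 15.

def cs_score(msa, order, pair_score=unit_metric):
    """Circular Sum score (Definition 23): half the sum of the induced
    pairwise scores taken along the circular order."""
    k = len(order)
    return 0.5 * sum(pair_score(msa[order[i]], msa[order[(i + 1) % k]])
                     for i in range(k))

assert cs_score(msa, [0, 1, 2, 3]) == 15.5   # Example 15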
9.3.1 Some CS score properties
The CS score of an alignment of a set of sequences has a clear upper bound:
Definition 24 (Maximal CS Score)
Let S = (s1, s2, . . . , sk) be a collection of sequences and let OPA(si, sj) denote
the score of the OPA of the sequences si, sj. The upper bound for the CS
score of S is defined as

CS_max(S) = 1/2 · Σ_{i=1}^{k} OPA(s_{c_i}, s_{c_{i+1}})

where c_{k+1} = c_1 and C = (c1, c2, . . . , ck) is a CO.    □
Example 16
Using the matrix and the CO from Example 15 we get for the sequences S

CS_max(S) = 1/2 · (OPA(s1, s2) + OPA(s2, s3) + OPA(s3, s4) + OPA(s4, s1))
          = 1/2 · (9 + 10 + 9 + 9) = 18.5.    □
Likewise the optimal score of an alignment of a set of sequences can
be defined:

Definition 25 (Optimal CS Score)
Let A be the set of all possible alignments of the sequences S = (s1, s2, . . . , sk).
The optimal CS score of S is defined as

CS_opt(S) = max_{A ∈ A} CS(A).    □
Example 17
An optimal alignment of the string collection S from Example 15 and the
resulting CS score is

a1: THE LAST FA−T CAT
a2: THE −−−− FAST CAT
a3: THE VERY FAST CAT
a4: THE −−−− FA−T CAT

with induced scores S(a1, a2) = 9, S(a2, a3) = 10, S(a3, a4) = 9 and
S(a4, a1) = 9, giving the result CS_opt(S) = 1/2 · (9 + 10 + 9 + 9) = 18.5.    □
The relation between the scores is given by the following lemma:

Lemma 24
Let A be the set of all possible alignments of the sequences S = (s1, s2, . . . , sk).
We then have

CS_max(S) ≥ CS_opt(S).    □

PROOF This follows immediately, as each score in the sum for CS_max(S)
will be larger than or equal to the equivalent induced pairwise alignment
score in CS_opt(S).
Although the scores in the previous two examples are equal, it is not in
general the case that CS_opt(S) equals CS_max(S). If, on the other hand, we do
have an alignment A of the sequences S = (s1, s2, . . . , sk) with CS(A) =
CS_max(S), then we know that the MSA is optimal.
9.4 Notes

This chapter is mainly based on [Gonnet et al., 2000] and [Roth-Korostensky, 2000].
The origin of the SOP score is somewhat unclear; Gusfield gives [Carrillo and Lipman, 1988] as the first introduction, but [Altschul and Lipman, 1989] gives the two references mentioned.
The definition and treatment of evolutionary trees is intentionally
kept simple in order to avoid getting too involved in the plethora of related topics like bifurcating contra multifurcating trees, parsimony, the molecular
clock hypothesis, ultrametric distances and evolutionary tree construction.

CHAPTER 10
Algorithms
This chapter presents the two implemented MSA algorithms. The first
is an MSA construction algorithm called tspMsa and the second an MSA
improving algorithm called enhanceMsa. Both algorithms are based directly on the CS score, and the actual implementations both consist of improvements of the original algorithms due to Roth-Korostensky.
10.1 Path Sum Scores

For notational convenience we will start this chapter with a couple of
definitions.
Definition 26 (Path Sum Score)
Let A = [[a1, a2, . . . , ak]] be an MSA of the sequences s1, s2, . . . , sk. Let
S(ai, aj) be the score for the induced alignment of the strings si, sj
in A. We then define the Path Sum score of A as

PS(A) = 1/2 · Σ_{i=1}^{k−1} S(a_{c_i}, a_{c_{i+1}}) with (c1, c2, . . . , ck) a CO.    □

Remark 18
The PS(A) is the CS(A) where the score value for the induced strings
ak, a1 is left out of the summation.    □
Analogously we have
Definition 27 (Maximal Path Sum (MPS) Score)
Let S = (s1, s2, . . . , sk) be a collection of sequences and let OPA(si, sj) denote
the score of the OPA of the sequences si, sj. The MPS score is defined as

MPS(S) = 1/2 · Σ_{i=1}^{k−1} OPA(s_{c_i}, s_{c_{i+1}}) with (c1, c2, . . . , ck) a CO.    □

Remark 19
The MPS(S) is the CS_max(S) where the OPA score value for the sequences
sk, s1 is left out of the summation. If an MSA A has the
property that PS(A) = MPS(S), we say the alignment has MPS score.    □
10.2 The MSA Construction Algorithm

The tspMsa algorithm is a progressive alignment algorithm that uses a
CT to determine the alignment order. The pseudo-code for the algorithm
is shown in Algorithm III.1 on the next page.
Given a collection of sequences S = (s1, s2, . . . , sk), tspMsa() executes
the following four steps.

1) For all possible sequence pairs construct an optimal pairwise alignment and use the scores of these optimal alignments to construct a
distance matrix for the TSP solver.

2) Use the distance matrix to determine a CO and a CT.

3) Find the sequence pair si, si+1 having the worst OPA score in the CT
and rearrange the cyclic order using cyclic shifts until the CT starts
with sequence si+1. Rename this order to s1, s2, . . . , sk.

4) Construct an MSA A progressively, beginning with the sequences s1, s2,
ensuring that A will have MPS score. This step is the algorithm
called CIRCUS in [Roth-Korostensky, 2000, chapter 6]. It progressively constructs an MSA of a set of strings S = (s1, s2, . . . , sk) using
a CO (see Algorithm III.2 on page 115 and Example 18 on page 114).
Algorithm III.1 tspMsa
Require: List of sequences sl containing S = (s1, s2, . . . , sk)
Ensure: An MSA A = [[a1, a2, . . . , ak]]. In case of maximisation and non-negative induced alignment scores CS(A) ≥ (k−1)/k · CS_max(S)

procedure tspMsa(sl)
    distances ← matrix of size |sl| × |sl|                      ▷ Step 1)
    for i = 1 to |sl| − 1 do
        for j = i + 1 to |sl| do
            distances[i][j] ← OPA(si, sj)
            distances[j][i] ← OPA(si, sj)
        end for
    end for
    cycleOrder ← findCycleOrder(distances)                      ▷ Step 2)
    worstValueIndex ← 1                                         ▷ Step 3)
    worstValue ← distances[cycleOrder[1]][cycleOrder[2]]
    for i = 2 to |cycleOrder| − 1 do
        currentValue ← distances[cycleOrder[i]][cycleOrder[i + 1]]
        if currentValue worse than worstValue then
            worstValueIndex ← i
            worstValue ← currentValue
        end if
    end for
    for i = 1 to worstValueIndex do
        append(cycleOrder, cycleOrder[1])
        remove(cycleOrder, 1)
    end for
    msa ← list(sl[cycleOrder[1]])                               ▷ Step 4)
    for i = 2 to |cycleOrder| do
        msa ← extendAlignment(msa, sl[cycleOrder[i]])
    end for
    return msa
end procedure
Termination: The algorithm only iterates over finite, non-increasing data
structures.

Correctness: Assume maximisation, S(ai, aj) ≥ 0 ∀ i, j ∈ 1 . . . k, and
that procedure extendAlignment() returns an MSA having MPS
score. The rotation of the cycle order in Step 3) ensures that
OPA(sk, s1) is minimal among the k OPA scores along the CO, so that
1/2 · OPA(sk, s1) ≤ 1/k · CS_max(S). Since the induced scores are non-negative, CS(A) ≥ PS(A), and this gives

CS(A) ≥ PS(A) = MPS(S)
      = CS_max(S) − 1/2 · OPA(sk, s1)
      ≥ CS_max(S) − 1/k · CS_max(S)
      = (k − 1)/k · CS_max(S).
The running-time for the algorithm results from a sequential execution of the algorithm steps. For a collection of sequences S = (s1, s2, . . . , sk)
with |si| = n ∀ i ∈ 1 . . . k this gives a running-time complexity of O(k²n² +
k^2.2 + k + kn²) = O(k²n²). Here the complexity of Step 2) (k^2.2) assumes
use of the LKH TSP solver, and the complexity of Step 4) (kn²) assumes that the OPAs
from Step 1) have to be recalculated.
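The top level of the algorithm translates almost line for line into Python; the sketch below is an illustration only. The helpers opa, find_cycle_order and extend_alignment are assumptions: an OPA routine such as the Needleman-Wunsch sketch of Chapter 8, a TSP solver front-end (LKH in the thesis), and Algorithm III.2 respectively.

from itertools import combinations

def tsp_msa(seqs, opa, find_cycle_order, extend_alignment):
    """tspMsa sketch: OPA score matrix (Step 1), circular order (Step 2),
    rotation so the worst pair wraps around (Step 3), progressive
    extension (Step 4)."""
    k = len(seqs)
    d = [[0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        d[i][j] = d[j][i] = opa(seqs[i], seqs[j])[0]
    order = list(find_cycle_order(d))
    # rotate so the minimal-score (worst, under maximisation) adjacent
    # pair becomes the pair (s_k, s_1) left out of the path sum
    worst = min(range(k), key=lambda i: d[order[i]][order[(i + 1) % k]])
    order = order[worst + 1:] + order[:worst + 1]
    msa = [seqs[order[0]]]
    for idx in order[1:]:
        _, t_i, t_i1 = opa(msa[-1].replace('-', ''), seqs[idx])
        msa = extend_alignment(msa, t_i, t_i1)
    return msa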
Example 18
Let S = (s1, s2, . . . , sk) be a collection of sequences ordered with respect
to the CO in Step 3) on page 112. Let S(aj, aj+1) denote the score of the
induced alignments of the sequences sj, sj+1. Let OPA(sj, sj+1) be the
score of the OPA of sj, sj+1 and let tj, tj+1 denote the strings in this OPA.
Assume that the first i sequences have been aligned in an MSA A' =
[[a1, a2, . . . , ai]] so that S(aj, aj+1) = OPA(sj, sj+1) ∀ j ∈ 1 . . . i − 1 (see
Figure 10.1).
[Figure: the alignment A′ with rows a1, . . . , ai, about to be extended with the sequence si+1.]
Figure 10.1: Before executing extendMsa(). (from [Roth-Korostensky, 2000])
Now execute the following steps.
i) Construct the OPA for the sequences si, si+1, giving the sequences ti, ti+1 (see Figure 10.2 (a) on page 116).
Algorithm III.2 extendMsa
Require: An MSA A = [[a1, a2, . . . , ai]] having MPS score, and a sequence si+1 to incorporate in the MSA
Ensure: An MSA A = [[a1, a2, . . . , ai+1]] having MPS score.

 1: procedure extendAlignment(msa, sequence)
 2:     ti, ti+1 ← calculateOPA(si, si+1)
 3:     for all gaps g in ai do
 4:         if g not present in ti then
 5:             insertGap(ti, g)
 6:             insertGap(ti+1, g)
 7:         end if
 8:     end for
 9:     for all gaps g in ti do
10:         if g not present in ai then
11:             for all aj in msa do
12:                 insertGap(aj, g)
13:             end for
14:         end if
15:     end for
16:     append(msa, ti+1)
17:     return msa
18: end procedure
Termination: The algorithm only iterates over finite non-increasing data structures.
Correctness: That the resulting MSA has MPS score can be seen by inspection of the gap insertion steps in the algorithm.
In lines 3 through 8, only gaps (or parts of gaps) that are not already in ti are inserted into both ti and ti+1. This means that the score S(ti, ti+1) will not be affected, since opposing gaps do not contribute to the score. In other words: ti, ti+1 will still be an OPA of si, si+1, only with superfluous gaps (see Figure 10.2 (b)).
Analogously, in lines 9 through 15, only gaps (or parts of gaps) that are not already in ai are inserted into every aj. This does not affect the score of the MSA, which will still have MPS score (see Figure 10.2 (c)).
Since all gaps (or parts of gaps) not present in ai and ti are respectively inserted into the other sequence, ai and ti are now identical. Extending the MSA with ti+1 at this point will thus result in an MSA with MPS score (see Figure 10.2 (d)).
ii) Insert all gaps which are present in ai and not already present in ti into both ti and ti+1 (see Figure 10.2 (b)).
iii) Insert all gaps present in ti which are not present in ai into all sequences in the MSA (see Figure 10.2 (c)).
iv) At this point the sequences ai and ti will be identical. Now append ti+1 to the MSA as ai+1, creating the extended alignment A″ (see Figure 10.2 (d)).
[Figure: panels (a)–(d) of Example 18: the OPA ti, ti+1 of si, si+1 (a); gaps from ai inserted into ti, ti+1 (b); gaps from ti inserted into all rows of A′ (c); and the extended alignment A″ with the new row ai+1 (d).]
Figure 10.2: Example 18 steps i) – iv). (from [Roth-Korostensky, 2000])
10.2.1 Result improvements
In order to improve the MSA constructed by the tspMsa algorithm, Korostensky and Gonnet mention a couple of possible post-optimisations. These post-optimisations were not implemented for this thesis, but since they are based on some interesting considerations, we will give a short sketch of the ideas.
The MSA returned by the tspMsa algorithm has the property that the CS score is optimal for all sequence pairs except s1, sk. The optimisations aim at improving the score of the induced non-optimal [[a1, ak]] alignment. Changing this alignment will, however, affect the induced alignments of other pairs. Since these alignments are all optimal, the challenge is to improve the score of [[a1, ak]] by more than the change will decrease the scores of all other induced alignments.
It can be argued that inserting a gap into either of the strings a1, ak is very unlikely to improve the total CS score. The argument is that the gap-opening cost is typically large compared to the values from the scoring matrix. Thus inserting a gap g into, e.g., a1 will decrease the total CS score by at least two times the gap-opening cost (see Figure 10.3). Furthermore, a gap g′ of the same length will have to be inserted into all other induced strings ai, i ≠ 1 (at another position) to ensure that all induced strings have equal length. Again this causes an additional score decrease of at least two times the gap-opening cost.
[Figure: a circular tour a1, a2, a3, . . . , ak−1, ak with the inserted gaps g and g′ marked.]
Figure 10.3: inserting gaps g and g′ affects the CS score at the four locations marked with dark clouds (from [Roth-Korostensky, 2000])
Altogether this makes it very improbable that the CS score of the alignment can be improved by inserting gaps into a1 or ak. If the alignment score of [[a1, ak]] is less than four times the gap-opening cost away from the OPA score, it will be impossible to improve the CS score this way.
Refraining from gap insertions, a more promising score-improving strategy is to shift gap blocks around in a1, ak. The possibilities are either to use a heuristic or to use an exhaustive search algorithm to shift gap blocks around. In [Roth-Korostensky, 2000] a heuristic that aligns equal-sized opposing gaps in a1, ak is implemented.
For this thesis a direct optimisation of the tspMsa algorithm was implemented. The idea is that Step 3) on page 112 of the algorithm only makes sense in order to achieve the guarantee on the final CS score; it is not certain that the index found is the optimal index at which to break the cycle. To determine this optimal index, say i, it is necessary to find the sequences where the difference between OPA(si−1, si) and the alignment score of [[ai−1, ai]], when breaking the cycle at index i, is minimal.
10.2.2 Implementation
The tspMsa algorithm steps on page 112 were implemented in Python. To determine the pairwise OPAs, the implementation uses external calls to the stretcher application from the EMBOSS suite, a C++ implementation of the Hirschberg algorithm. The determination of cycle orders and tours was done using external calls to LKH.
The actual implementation replaces Step 3) of the original algorithm by trying out all k index possibilities for breaking the cycle. This did increase the running-time, but the effect was partly compensated for by caching the calculated OPA values from Step 1) and reusing these when progressively building the MSA.
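A minimal sketch of this all-rotations variant might look as follows; build_msa and cs_score stand in for the progressive construction (Algorithm III.2) and the CS scoring function, and are not the actual thesis code:

    def best_rotation_msa(cycle_order, build_msa, cs_score):
        """Try all k cyclic shifts of the CO, build an MSA for each
        (reusing cached OPAs in practice) and keep the best CS score."""
        best_msa, best_score = None, float("-inf")
        for shift in range(len(cycle_order)):
            order = cycle_order[shift:] + cycle_order[:shift]
            msa = build_msa(order)
            score = cs_score(msa)
            if score > best_score:
                best_msa, best_score = msa, score
        return best_msa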
10.3 The MSA Enhance Algorithm
The enhanceMsa algorithm can be characterised as a kind of iterative alignment algorithm, since it tries to improve a given MSA. The algorithm, introduced in [Roth-Korostensky, 2000, chapter 8], is an application of the divide-and-conquer schema.
The following definition will ease describing enhanceMsa:
Definition 28 (MSA Block)
Let A = [[a1, a2, . . . , ak]] be an MSA of length ℓ. A block of A is defined by two indices s, e in 1 . . . ℓ + 1 with s ≤ e and is denoted A[[s : e]]. It consists of the sequence slices ai[s : e] ∀ i ∈ 1 . . . k. The notation A[[s :]] is shorthand for A[[s : ℓ + 1]], and A[[: e]] is shorthand for A[[1 : e]].
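In Python terms, with an MSA represented as a list of equal-length strings, a block slice might be sketched as follows (an illustrative helper, not from the thesis code):

    def msa_block(msa, s, e):
        """A[[s:e]] from Definition 28: a 1-based, half-open column slice
        applied to every sequence of the alignment."""
        return [row[s - 1 : e - 1] for row in msa]

    msa = ["THE-CAT", "THE-RAT"]
    print(msa_block(msa, 1, 4))   # -> ['THE', 'THE'], the block A[[1:4]]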
Given a collection of sequences, enhanceMsa can be described as follows.
1) Calculate an MSA A using a fast MSA construction algorithm.
2) If the length of A is less than a given minimum length, return A.
3) Determine the block B of A having the best score and the (possibly empty) blocks L and R on either side.
4) Realign the best-scoring block to get B′ and keep the better scoring of B and B′, denoting this block B_final.
5) Return the MSA L_final B_final R_final, where L_final and R_final are the results of repeating from Step 1) on L and R respectively.
[Figure: an MSA A divided into the blocks L, B and R, with B realigned to B_final.]
Figure 10.4: The MSA enhance algorithm. (from [Roth-Korostensky, 2000])
In the case of enhancing MSAs of proteins, the biological reasoning behind the algorithm is as follows: the algorithm looks for the best place to divide an MSA into a good block and two further blocks by determining the block having the best CS score. This block will typically be a conserved block of the proteins. Proteins exhibit conserved as well as highly divergent regions. The conserved regions may be active sites of the protein,¹ or stable secondary structures like α-helices and β-sheets (see Figure 8.1).
¹ locations responsible for the chemical reactions of the proteins
Since conserved regions have fewer gaps than divergent regions, most MSA algorithms produce more reliable alignments within these regions.
Algorithm III.3 enhanceMsa
Require: Sequence (or alignment) list sl of size k, a minimum block length mbl
Ensure: An MSA A = [[a1, a2, . . . , ak]].

 1: procedure enhanceMsa(sl, mbl)
 2:     A ← constructMsa(sl)
 3:     s, e ← findBestBlock(A)
 4:     Aleft ← A[[1 : s]], Abest ← A[[s : e]], Aright ← A[[e :]]
 5:     if CS(Abest) < CS(constructMsa(Abest)) then
 6:         Abest ← constructMsa(Abest)
 7:     end if
 8:     if length(Aleft) > mbl then
 9:         if CS(Aleft) < CS(constructMsa(Aleft)) then
10:             Aleft ← enhanceMsa(Aleft, mbl)
11:         end if
12:     end if
13:     if length(Aright) > mbl then
14:         if CS(Aright) < CS(constructMsa(Aright)) then
15:             Aright ← enhanceMsa(Aright, mbl)
16:         end if
17:     end if
18:     return join(Aleft, Abest, Aright)
19: end procedure
Termination: Assume findBestBlock() returns a best block of length ℓ_best ≥ 1 and that constructMsa() does not return alignments with columns consisting of gaps only. The length of the longest possible alignment returned from constructMsa() in line 2 will be ||S||. Since ℓ_best ≥ 1, the recursive calls in line 10 and line 15 will be on alignments of strictly smaller length. This ensures termination of the recursion as soon as the alignments have length mbl or below.
Correctness: Assuming constructMsa() returns a proper alignment, the final MSA will be a proper MSA of S, as it consists of a join of alignments of disjoint blocks of A.
The algorithm is depicted in Figure 10.4 and pseudo-code for it is shown in Algorithm III.3. The running-time for enhanceMsa is determined by the number of recursive calls and the time for each invocation of enhanceMsa(). The first depends on the lengths of the MSAs produced by constructMsa()
and the best blocks found by findBestBlock(). If the length of the latter is constant, say ℓ_best, the number of recursive calls will in the worst case be of order O(||S||/ℓ_best).
The running-time for each invocation is determined by the running-times of constructMsa(), findBestBlock() and the time for calculating the CS score. Denoting the first two rt(constructMsa) and rt(findBestBlock) respectively, the running-time for enhanceMsa will have a worst case order of

    O( (||S||/ℓ_best) · (rt(constructMsa) + rt(findBestBlock) + kn) ).
10.3.1 Finding the best block within an MSA
It remains to specify how to construct MSAs and how to find the best block in an MSA, that is, how to implement the constructMsa() and findBestBlock() functions used in Algorithm III.3. The first will be postponed (see Section Implementation on page 124); this section will elaborate on finding the best-scoring block within an MSA.
In [Roth-Korostensky, 2000] this is done by deciding upon a fixed MSA block length and then using a sliding-window method to determine the best-scoring block, i.e. calculating the score of the fixed block size for all possible positions along the MSA. For this thesis it was decided to use an implementation of the algorithm described in [Bentley, 2000, chapter 8]. The algorithm, called maxSum(), finds the maximum positive sum of a contiguous sub-array of an array containing real numbers (see Example 19). The reasons for choosing this algorithm were that
i) it will find the optimal-scoring block of an MSA with no constraints on the length of this block, and
ii) it will do so with a running-time linear in the length of the MSA.
Pseudo-code for findBestBlock() and the sub-procedure maxSum() is shown in Algorithm III.4 on page 123.
Example 19
Let

    index:   1    2   3   4    5   6   7    8    9   10
    Arr  =  31  −41  59  26  −53  58  97  −93  −23   84

be an array of integers. The maxSum() algorithm will return the indices 3, 8, marking the slice Arr[3 : 8] with sum 187.
In order to use maxSum() on an MSA, the CS score for each column is calculated and inserted into an array Arr. The average value of Arr is then subtracted from each value in Arr. This ensures that there will be both positive and negative values in Arr, so that the result of maxSum() will be different from Arr itself (see Example 20).
Example 20
Using the MSA and edit distance scoring scheme from Example 13, Algorithm III.4 gives the following result.

         T H E L A S T F A − T C A T
         T H E F A S T C A − T − − −
    A =  T H E V E R Y F A S T C A T
         T H E − − − − F A − T C A T

    Arr = 4 4 4 0 2 2 2 3 4 0 4 3 3 3

    average value = (12 + 6 + 11 + 9)/14 = 38/14 ≈ 2.7 ⇒

    Arr = [1.3, 1.3, 1.3, −2.7, −0.7, −0.7, −0.7, 0.3, 1.3, −2.7, 1.3, 0.3, 0.3, 0.3]

    findBestBlock(A) = 1, 4 ⇒ A[[1 : 4]] is the maximal block
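A runnable Python sketch of Algorithm III.4, reproducing the results of Examples 19 and 20 (a simplified rendering, not the thesis implementation):

    def max_sum(numbers):
        """Kadane-style scan (cf. [Bentley, 2000, chapter 8]) returning the
        half-open 1-based slice [s, e) with maximal positive sum."""
        best = [0.0, 0, 0]        # sum, start index, end index
        current = [0.0, 0, 0]
        for i, x in enumerate(numbers, start=1):
            if current[0] + x < 0:
                current = [0.0, 0, 0]      # reset if the sum turns negative
            else:
                current[0] += x
                if current[2] == 0:        # first value after a reset
                    current[1] = i
                current[2] = i + 1         # advance the end index
            if best[0] < current[0]:
                best = current[:]          # a copy, not a reference
        return best[1], best[2]

    def find_best_block(column_scores):
        avg = sum(column_scores) / len(column_scores)
        return max_sum([s - avg for s in column_scores])

    print(max_sum([31, -41, 59, 26, -53, 58, 97, -93, -23, 84]))
    # -> (3, 8), as in Example 19
    print(find_best_block([4, 4, 4, 0, 2, 2, 2, 3, 4, 0, 4, 3, 3, 3]))
    # -> (1, 4), as in Example 20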
Algorithm III.4 findBestBlock
Require: List of sequences sl containing a non-empty alignment A = [[a1, a2, . . . , ak]]
Ensure: Indices s, e such that CS(A[[s : e]]) is maximal for all blocks of A.

 1: procedure findBestBlock(sl)
 2:     columnScores ← getColumnScores(sl)
 3:     avg ← sum(columnScores) / |columnScores|
 4:     for i = 1 to |columnScores| do
 5:         columnScores[i] ← columnScores[i] − avg
 6:     end for
 7:     return maxSum(columnScores)
 8: end procedure
 9:
10: procedure maxSum(numbers)
11:     best ← list(0, 0, 0)                    ▷ sum, start index, end index
12:     current ← list(0, 0, 0)
13:     for i = 1 to |numbers| do
14:         if current[1] + numbers[i] < 0 then
15:             current ← list(0, 0, 0)         ▷ reset if sum negative
16:         else
17:             current[1] ← current[1] + numbers[i]
18:             if current was reset then       ▷ e.g. is current[3] = 0?
19:                 current[2] ← i
20:                 current[3] ← i
21:             end if
22:             current[3] ← current[3] + 1     ▷ increase end index
23:         end if
24:         if best[1] < current[1] then
25:             best ← current[:]               ▷ a copy, not a reference
26:         end if
27:     end for
28:     return (best[2], best[3])
29: end procedure
Termination: The algorithm only iterates over finite non-increasing data structures.
Correctness: The algorithm uses the observation that, summed from its start index, the best slice cannot have negative part-sums. If it had, it would contain a sub-slice with a better score. This observation is captured in the use of the current variable, since it holds the sum from the possible start index of a best slice in numbers. The best variable keeps track of the best slice seen so far.
For every round of the for-loop in lines 13 to 27 the following invariant holds: the values of best and current are correct.
Before the loop the invariant obviously holds.
Assume the invariant holds for round i − 1. For round i we have two possibilities for current:
• its sum plus the value of numbers[i] is negative, in which case current should be reset. This happens in line 15.
• its sum plus the value of numbers[i] is positive or zero, in which case the sum should be increased by the value of numbers[i] and the end index should be increased by one. This happens in line 17 and line 22 respectively. Should numbers[i] be the first positive value that current "meets" after being reset, the start index and end index have to be set. This special case is treated in lines 19 and 20.
For round i we get for best:
• if current has obtained a larger sum, best should be updated. This happens in line 25.
All of the above gives that the invariant will hold after round i.
10.3.2 Implementation
The enhanceMsa algorithm was implemented using Python and external calls to the MSA construction applications Muscle and Mafft in order to implement the constructMsa() procedure.
10.4 Notes
This chapter is mainly based on [Korostensky and Gonnet, 1999], [Roth-Korostensky, 2000, chapters 6 and 8] and [Bentley, 2000, chapter 8].
There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.
Mark Twain

Chapter 11
Experiments
This chapter contains descriptions of the experiments performed on the implemented MSA algorithms, tspMsa and enhanceMsa, together with their results.
In order to evaluate the alignments constructed by tspMsa, the alignments were compared with MSAs constructed by two other applications. All alignments were scored using both the CS score and the traditional SOP score. Further assessment was done by using the supplied scoring methods from a protein reference database.
In the experiments with enhanceMsa, the scores of the MSA before and after refinement were compared. The alignments were scored using the CS score and the supplied scoring methods from the protein reference database.
11.1 Reference MSA Applications
As mentioned in [Blackshields et al., 2006], the amount of published MSA construction methods and/or packages is quite overwhelming. According to Blackshields et al., more than 50 applications were described from 1996–2006, and in 2005 alone there were at least 20 new publications.
The criteria for choosing reference MSA applications can be phrased as trying to select "the fastest of the best". A small running-time is important for the enhanceMsa algorithm, and high-quality alignment methods are preferable as benchmarks for tspMsa. The applications chosen were Muscle and Mafft, as they both fulfil the above criteria [Blackshields et al., 2006; Notredame, 2007; Wallace, Blackshields, and Higgins, 2005].
Muscle is a progressive alignment program that constructs a very crude guide tree used for constructing the first MSA. The guide tree is then refined, and a second MSA is constructed using the improved guide tree. For increased speed, only sequences affected by the subtree improvements are realigned.
Mafft is both a progressive and an iterative MSA program. It uses a fast Fourier transform¹ to generate a guide tree for its progressive MSA. As with Muscle, the guide tree is then refined, both by optimising a weighted SOP score and by using different pairwise alignment similarity information.
11.2 Benchmark Database
In order to further estimate the biological quality of the alignments, a protein benchmark database was used.
When the first MSA construction programs were developed, it was difficult to benchmark them. First, lack of processing power made the performance of the programs on large sequences inferior. Second, there was not much actual information concerning correct sequence families. The first generation of programs was thus mainly tested using a small number of "golden standards". As these standards were both limited in number and publicly available, this had the unfortunate effect that new programs could easily be tuned against the standards, achieving artificially high scores.
In order to rectify this effect, a number of benchmark databases consisting of large and continually growing data-sets have been published. For this thesis BAliBASE (Benchmark Alignments dataBASE)² has been chosen as the reference database for the MSA experiments.
BAliBASE, which was one of the first benchmark databases, consists of nine protein reference data-sets chosen to represent real MSA problems faced by biologists. The alignments are based upon protein 3D structure super-positioning and are "manually" refined in order to ensure higher alignment quality than a purely automated method. This property can be seen as one of the main quality characteristics of BAliBASE, but it can also be considered a source of subjectivity due to the "expert" refinement [Blackshields et al., 2006]. If nothing else, it does make BAliBASE slower to maintain and expand compared to more automated benchmark databases.³
In this thesis only the reference sets 1–5 and 9 were used (see Table 11.1), the mundane reason being that the reference sets 6–8 were lacking files containing the sequences in fasta format.

¹ http://en.wikipedia.org/wiki/Fast_Fourier_transform
² http://www-bio3d-igbmc.u-strasbg.fr/balibase/ version 3.0
³ like PREFAB, HOMSTRAD, SABmark and OXBench
Table 11.1: The used BAliBASE reference sets

RV1   Equidistant sequences, divided by length and variability
      RV11 – very divergent sequences (< 20 % residue identity)
      RV12 – medium to divergent sequences (20–40 % residue identity)
RV20  Families with one or more highly divergent sequences
RV30  Divergent subfamilies
RV40  Sequences with large N/C terminal extensions
RV50  Sequences with large internal insertions
RV9   Short/Linear Motifs, experimentally verified
      RV911 – very divergent sequences (< 20 % residue identity)
      RV912 – medium to divergent sequences (20–40 % residue identity)
      RV913 – similar to medium divergent sequences (40–80 % residue identity)
      RV921 – true positive motifs only
      RV922 – true positive motifs and errors
      RV931 – true positive motifs only
      RV932 – true positive motifs and false positive motifs
      RV941 – true positive motifs only
      RV942 – true positive motifs and true negative motifs
The used reference sets contain 610 alignments with altogether 17 427 sequences.
11.2.1 BAliBASE scoring
In order to assess the performance of MSA programs, BAliBASE includes scripts to calculate two scores. Let A = [[a1, a2, . . . , ak]] with |ai| = ℓ ∀ i ∈ 1 . . . k be a test alignment, and let Ar = [[a1, a2, . . . , akr]] of length ℓr be the corresponding reference alignment.

Sum-of-pairs This score, which is a variant of the SOP, increases with the number of sequences correctly aligned with respect to the reference alignment. It is supposed to indicate the extent to which some, if not all, sequences in an alignment are correctly aligned. The score Sc for the c'th column is defined as

    Sc = Σ_{i=1}^{k} Σ_{j=1, j≠i}^{k} p_cij,
    where p_cij = 1 if A[i][c] and A[j][c] are aligned in Ar, and 0 otherwise.

Denoting the sum-of-pairs score for A by SP, we have

    SP = (Σ_{i=1}^{ℓ} Si) / (Σ_{j=1}^{ℓr} Srj)   with   Srj = kr · (kr − 1)/2

Column score This is a column-wise binary score indicating whether all sequences are aligned correctly with respect to the reference alignment. Letting TC denote the column score for A, we have

    TC = (Σ_{i=1}^{ℓ} Ci) / ℓ,
    where Ci = 1 if the i'th columns of A and Ar are identical, and 0 otherwise.
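As an illustration of the column score, here is a minimal Python sketch. It assumes the test and reference alignments already have a common column correspondence; the real BAliBASE scripts handle the column mapping themselves:

    def column_score(test, ref):
        """TC: the fraction of columns of the test alignment that are
        identical to the corresponding column of the reference alignment."""
        length = len(test[0])
        identical = sum(
            1 for c in range(length)
            if [row[c] for row in test] == [row[c] for row in ref]
        )
        return identical / length

    # Two toy alignments of three sequences, agreeing in 3 of 4 columns:
    print(column_score(["AC-T", "AC-T", "ACGT"],
                       ["AC-T", "AC-T", "ACTT"]))   # -> 0.75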
11.3 Experiments Set-up
For both the MSA construction and the MSA enhancing experiments the following set-up was used:
• Muscle and Mafft running with default parameters,
• the BLOSUM62 scoring matrix,
• a gap-opening cost of −12 and a gap-extension cost of −2, and
• all gaps scored equally (end/internal gaps).
11.4 MSA Construction Tests
For each of the BAliBASE reference sets, the tspMsa alignment was scored by each of the scoring schemes: CS score, SOP score, BAliBASE Sum-of-Pairs score and BAliBASE column score. These scores were then compared to the same scores for the Muscle and Mafft alignments.
In order to illustrate the results with block diagrams, the number of times an algorithm achieved the best score was counted for each alignment file in each BAliBASE reference set directory. If more than one algorithm achieved the best score, every one of them was counted as "being best". The results are shown in Figures 11.1 to 11.8.
11.4.1 CS score results
The results when scoring the alignments by CS score (Figures 11.1 and 11.2) do, not surprisingly, favour the tspMsa algorithm. It achieves the highest number of best scores in every reference set. The mutual directory score between Muscle and Mafft is 2 – 8 with 5 draws.
[Bar chart: "Alignments scored using Cycle Sum Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV11–RV50.]
Figure 11.1: MSA construction on RV11–RV50
[Bar chart: "Alignments scored using Cycle Sum Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV911–RV942.]
Figure 11.2: MSA construction on RV911–RV942
[Bar chart: "Alignments scored using Sum-of-Pairs Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV11–RV50.]
Figure 11.3: MSA construction on RV11–RV50
[Bar chart: "Alignments scored using Sum-of-Pairs Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV911–RV942.]
Figure 11.4: MSA construction on RV911–RV942
11.4.2 SOP score results
The results when scoring the alignments by SOP score (Figures 11.3 and 11.4) show a clear superiority of Mafft, which achieves the highest number of best scores in every reference set. The mutual directory score between tspMsa and Muscle is 1 – 13 with 1 draw. This last result is also less surprising, as Muscle optimises using the SOP score.
[Bar chart: "Alignments scored using BAliBASE Sum-of-Pairs Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV11–RV50.]
Figure 11.5: MSA construction on RV11–RV50
11.4.3 BAliBASE SOP score results
Scoring the alignments by the BAliBASE SOP score (Figures 11.5 and 11.6) shows some interesting results. First, Mafft is clearly superior, being the best algorithm in 14 out of 15 BAliBASE reference set directories. Second, the mutual directory score between tspMsa and Muscle is 5 – 7 with 3 draws. Third, the tspMsa algorithm achieves the best result for the BAliBASE reference set RV11, which consists of very divergent sequences with less than 20 % residue identity.⁴
⁴ also known as the "twilight zone"
[Bar chart: "Alignments scored using BAliBASE Sum-of-Pairs Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV911–RV942.]
Figure 11.6: MSA construction on RV911–RV942
[Bar chart: "Alignments scored using BAliBASE Column Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV11–RV50.]
Figure 11.7: MSA construction on RV11–RV50
[Bar chart: "Alignments scored using BAliBASE Column Score"; number of best scores for tspMsa, Muscle and Mafft per BAliBASE directory RV911–RV942.]
Figure 11.8: MSA construction on RV911–RV942
11.4.4 BAliBASE column score results
Finally, scoring by the BAliBASE column score (Figures 11.7 and 11.8) again shows the superiority of Mafft, "winning" in every reference set directory. Here the mutual directory score between tspMsa and Muscle is 6 – 4 with 5 draws. Interestingly enough, using the BAliBASE column score Mafft scores better than tspMsa in reference set RV11.
11.4.5 MSA construction conclusions
The following conclusions can be drawn from the above results:
1) The Mafft algorithm is clearly superior to both the tspMsa and the Muscle algorithm.
2) The tspMsa and Muscle algorithms each clearly score better than the other when scored by their own optimising score.
3) Using the BAliBASE SOP score, Muscle achieves better results than tspMsa. A Wilcoxon rank-sum test⁵ gives a p-value of 0.085 for the null hypothesis that the results are comparable, and a z-statistic value of −1.721 in favour of Muscle.
⁵ calculated using the stats package from scipy
4) Using the BAliBASE column score, tspMsa and Muscle achieve comparable results with a slight advantage for Muscle. A Wilcoxon rank-sum test gives a p-value of 0.885 for the null hypothesis that the results are comparable, and a z-statistic value of −0.145 in favour of Muscle.
5) Using the BAliBASE scores, the tspMsa algorithm scores relatively well in reference sets from the twilight zone (reference sets RV11 and RV911).
11.5 MSA Enhancing Tests
[Bar chart: "MSA enhance results using Muscle"; percentage improvement in Cycle Sum Score per BAliBASE directory RV911–RV942, for the variable ("best") block and constant block sizes 8, 14, 20 and 26.]
Figure 11.9: CS score improvement on RV911–RV942
The enhanceMsa algorithm was tested using both Muscle and Mafft as the fast external alignment construction application.
To compare the effect of different block sizes, the algorithm was run using the findBestBlock() function (see Section 10.3.1 on page 121) and also with constant block sizes of 8, 14, 20 and 26.
For each of the BAliBASE reference sets, the enhanceMsa algorithm was scored by each of the scoring schemes: CS score, BAliBASE Sum-of-Pairs score and BAliBASE column score. The original first alignment
[Bar chart: "MSA enhance results using Muscle"; percentage improvement in BAliBASE Sum-of-Pairs Score per BAliBASE directory RV911–RV942, for the variable block and constant block sizes 8, 14, 20 and 26.]
Figure 11.10: BAliBASE SOP score improvement on RV911–RV942
was scored at the start of the algorithm, and the final alignment was scored when the algorithm finished.
For each file in a reference set directory, the improvement in score between the first and the final alignment was calculated as a percentage of the original score. These differences were then summed (by score type) and averaged by dividing by the number of files in the reference set directory.
The results of the test runs for the reference sets RV911–RV942, using Muscle as the external MSA construction program, are shown for all three score types in Figure 11.9 – Figure 11.11. The results when using Mafft, and for the data sets RV11 – RV50, were similar in the sad sense of being just as unclear.
Using a Wilcoxon rank-sum test to compare the variable block algorithm with the constant block algorithm for each block size gives the values in Table 11.2.
Neither the figures nor the table conveys any clear message.
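The rank-sum comparison can be reproduced with scipy; the per-directory improvement lists below are made-up placeholders, not the thesis measurements:

    from scipy.stats import ranksums

    # Percentage improvements per reference-set directory (hypothetical data):
    variable_block = [1.2, 0.0, 3.4, -0.5, 2.1, 0.8]
    constant_block_8 = [0.9, 0.1, 2.9, 0.0, 2.5, 0.7]

    z, p = ranksums(variable_block, constant_block_8)
    print(f"z = {z:.4f}, two-sided p = {p:.4f}")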
[Bar chart: "MSA enhance results using Muscle"; percentage improvement in BAliBASE Column Score per BAliBASE directory RV911–RV942, for the variable block and constant block sizes 8, 14, 20 and 26.]
Figure 11.11: BAliBASE column score improvement on RV911–RV942
11.5.1 MSA enhancing conclusions
Although the test results are rather unclear, a couple of conclusions can be drawn:
• Comparing the variable block length results to the constant block length results, the variable block results are not better than any of the constant block results. On the positive side, they are not clearly worse either.
• The enhanceMsa algorithm uses the CS score to determine whether it should keep refining the MSA. Using this scoring does in most cases also result in improvements in the BAliBASE scores.
11.6 Notes
Section 11.2 is based on [Blackshields et al., 2006] and [Perrodou, Chica, Poch, Gibson, and Thompson, 2008; Thompson, Koehl, Ripp, and Poch, 2005; Thompson, Plewniak, and Poch, 1999a,b].
Muscle is described in detail in [Edgar, 2004] and Mafft in [Katoh, Kuma, Toh, and Miyata, 2005; Katoh, Misawa, Kuma, and Miyata, 2002].
Table 11.2: comparing variable block size to constant block sizes

                          CS score           SOP score          column score
constant block size 8
  RV11–50             −0.1601, 0.8728    −0.1601, 0.8728    −0.1601, 0.8728
  RV911–942           −0.2208, 0.8253     0.6181, 0.5365     0.3091, 0.7573
constant block size 14
  RV11–50              0.1601, 0.8728    −0.1601, 0.8728    −0.3203, 0.7488
  RV911–942           −0.3974, 0.6911    −0.4857, 0.6272    −0.6623, 0.5078
constant block size 20
  RV11–50             −0.1601, 0.8728    −0.4804, 0.6310    −0.6405, 0.5218
  RV911–942           −0.3091, 0.7573    −0.0442, 0.9648    −0.2208, 0.8253
constant block size 26
  RV11–50             −0.1601, 0.8728     0.0000, 1.0000    −0.3203, 0.7488
  RV911–942           −0.1325, 0.8946     0.4415, 0.6588     0.2208, 0.8253

Each table entry consists of the z-statistic and the two-sided p-value for the null hypothesis that the two sets of measurements are drawn from the same distribution. A negative z-statistic indicates worse variable block values; a positive z-statistic indicates better variable block values.
Part IV
The end of the tour

DNA tour with 16384 cities
The Road goes ever on and on
Out from the door where it began.
Now far ahead the Road has gone,
Let others follow it who can!
Let them a journey new begin,
But I at last with weary feet
Will turn towards the lighted inn,
My evening-rest and sleep to meet.
Bilbo Baggins (J.R.R. Tolkien)

Chapter 12
Conclusion
To recapitulate, the goals set for this thesis were:
1) an examination of the feasibility of applying efficient TSP solver implementations to solve NP complete problems within bioinformatic applications,
2) a self-contained survey of approximation algorithms for the SSP, including descriptions of implementations and testing of solution quality, and
3) a description and an experimental testing of the improvements of the TSP based algorithms for MSA construction and MSA refinement.
Based on the work done and the results obtained in this thesis, the following conclusions can be drawn:
Re 1) It can definitely be concluded that it is a promising idea to apply efficient TSP solver implementations to solve NP complete problems within bioinformatic applications. The results obtained by the TSP based SSP algorithm were of a quality that could hardly be more convincing. Compared to all other algorithms, it achieved the best solutions in every test (see Table 7.2 on page 82). This affirms the findings in [Laporte, 2010] and [Applegate et al., 2007, Chapter 15.5, 15.6] that LKH is capable of solving TSP instances of up to at least 1000 cities to optimality in a very short time.
In the conducted experiments, the TSP solver based algorithm used less than two minutes to solve a 1000 string SSP problem (see Figure 7.7 on page 84). This was achieved with an implementation that in no way has
been optimised with regard to running-time. In other words: the time spent to get results of this quality seems very well spent.
Re 2) This thesis has, in Chapter 6, presented a survey of approximation algorithms for the SSP. Furthermore, Chapter 7 has presented test results on their solution quality. One important insight resulting from the latter chapter is that the Greedy algorithm is the best choice if optimal solution quality is of less importance. This very simple algorithm obtains the best solutions among the fast algorithms (see Table 7.2 and Figure 7.9 on page 86).
Another relevant point is that approximation factors only reflect the performance on the most pathological instances. In none of the 355 280 test cases used for this thesis did any of the algorithms produce a solution that was more than 2 % above the lower bound.
Re 3) The results of the implemented MSA construction algorithm could not compete with the results of Muscle and especially Mafft when the BAliBASE reference sets and scores were used. This might not come as a big surprise; these programs have been elaborated on for quite some time, and surely more than was possible to put into algorithm implementation within the scope of this thesis. Nevertheless, the algorithm did not produce evidently bad results, and it seems to produce quite good results on sequences from the "twilight zone" (see Figures 11.5 – 11.8 on page 133).
On the other hand, the results obtained by the implemented MSA refinement algorithm can hardly be regarded as positive. The test results were very unclear, and the most interesting conclusion was that an optimisation based on the CS score also resulted in improvements in the BAliBASE scores. This does, however, give reason for optimism concerning the applicability of the CS score in bioinformatics.
12.1 Future Work
The results from this thesis point to the following interesting possibilities for future work:
Attacking other NP hard problems: The very positive results obtained by using an efficient TSP solver to solve another NP hard problem open up a large number of future applications. Actually, every problem involving any of the other known NP hard problems is a possible candidate.
Examination of integral solution majority: It came as an unexpected result that the solutions to the LP used by the Best Factor approximation algorithm almost solely consisted of integral solutions.
There might be a simple reason for this, but there could also be some interesting insights behind this observation.
Further assessment of the CS score: Optimising the CS score led to improvements of the BAliBASE scores. To evaluate its biological applicability, a study examining the effect of optimising the CS score using other MSA benchmark databases would be interesting. But first of all, it would be relevant to include evaluations from biologists concerning the MSAs constructed by optimising the CS score.
A competitive version of the tspMsa algorithm: The tspMsa algorithm did not excel in comparison to Mafft and Muscle, but its results are not unpromising either. For example, there is still room for post-optimisations using gap block shifts.
Both the CS score and the tspMsa algorithm originate in work done at the Swiss Federal Institute of Technology around 2000, and when queried about the reasons for the lack of publications related to these ideas in the past ten years, Professor Gaston Gonnet kindly commented:¹
I think the circular tour idea will always have a fundamental role in evaluating MSAs. Regarding MSA algorithms, there are several new remarkable algorithms which are probably very difficult to beat. In particular MAFFT and Prank. This makes the task of beating them more difficult and more challenging.
Several years ago we improved our MSA algorithm based on CT but we did not put enough effort to make it competitive with current algorithms.
I personally think it is an excellent topic to pursue.
My pen is at the bottom of a page,
Which, being finished, here the story ends;
'Tis to be wished it had been sooner done,
But stories somehow lengthen when begun.
Byron

¹ personal communication
List of Algorithms

II.1  Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . .  45
II.2  MakeMaximalAssignment . . . . . . . . . . . . . . . . . . .  52
II.3  RecurseCycle . . . . . . . . . . . . . . . . . . . . . . . .  54
II.4  Greedy . . . . . . . . . . . . . . . . . . . . . . . . . . .  58
II.5  leastCommonDenominator . . . . . . . . . . . . . . . . . . .  67
III.1 tspMsa . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
III.2 extendMsa . . . . . . . . . . . . . . . . . . . . . . . . . 115
III.3 enhanceMsa . . . . . . . . . . . . . . . . . . . . . . . . . 120
III.4 findBestBlock . . . . . . . . . . . . . . . . . . . . . . . 123
List of Figures

1.1  paths and tours in a graph . . . 6
1.2  complete graph and cut set . . . 6
1.3  Different tree types . . . 7
2.1  The Commis-Voyageur tour for 47 German cities (from [Applegate et al., 2007]) . . . 12
2.2  The 33 city contest from 1962 . . . 15
2.3  Different k-opt moves . . . 16
2.4  The optimal TSP tour in Sweden (from http://www.tsp.gatech.edu/sweden/tours/swtour_small.htm) . . . 18
3.1  A LKH tour of 11 455 danish cities . . . 23
3.2  Results on dk11455.tsp. Note the addition of 20 990 000 on the y-axis . . . 25
4.1  sequence assembly problem with repeats . . . 31
5.1  Graphs for the strings: 'ate', 'half', 'lethal', 'alpha' and 'alfalfa' (from [Blum et al., 1994]) . . . 36
6.1  Running-time measures of all-pairs overlap calculations . . . 49
6.2  Illustration of Lemma 9: wuv + wu′v′ ≥ wu′v + wuv′ . . . 51
6.3  Shortest Hamiltonian path is not necessarily included in the optimal TSP tour . . . 63
6.4  Transforming an ATSP instance to a TSP instance . . . 64
7.1  SSP solution dependency on variable coverage . . . 78
7.2  SSP solution dependency on length . . . 79
7.3  solution dependency on length, compare with Figure 7.10a . . . 80
7.4  solution dependency on count, compare with Figure 7.10b . . . 81
7.5  solution dependency on count . . . 82
7.6  Figure 7.5 correlated using TSP algorithm values as 100 % . . . 83
7.7  SSP algorithms running-time using normal scale . . . 84
7.8  Best Factor and TSP algorithm running-times . . . 85
7.9  Cycle, RecurseCycle and Greedy algorithm running-times . . . 86
7.10 SSP algorithm alignment similarity results . . . 88
8.1  Secondary structure of Avian Sarcoma Virus. A retrovirus with many similarities to HIV (from http://mcl1.ncifcrf.gov/integrase/asv_intro.html) . . . 96
9.1  Evolutionary tree with five sequences (species) . . . 103
9.2  SOP score "ticks" added for each edge (from [Gonnet et al., 2000]) . . . 104
9.3  Circular tours . . . 105
9.4  Base case . . . 105
9.5  Induction case . . . 106
10.1 Before executing extendMsa() (from [Roth-Korostensky, 2000]) . . . 114
10.2 Example 18 steps i) – iv) (from [Roth-Korostensky, 2000]) . . . 116
10.3 inserting gaps g and g′ affects the CS score at the four locations marked with dark clouds (from [Roth-Korostensky, 2000]) . . . 117
10.4 The MSA enhance algorithm (from [Roth-Korostensky, 2000]) . . . 119
11.1 MSA construction on RV11–RV50 . . . 129
11.2 MSA construction on RV911–RV942 . . . 129
11.3 MSA construction on RV11–RV50 . . . 130
11.4 MSA construction on RV911–RV942 . . . 130
11.5 MSA construction on RV11–RV50 . . . 131
11.6 MSA construction on RV911–RV942 . . . 132
11.7 MSA construction on RV11–RV50 . . . 132
11.8 MSA construction on RV911–RV942 . . . 133
11.9 CS score improvement on RV911–RV942 . . . 134
11.10 BAliBASE SOP score improvement on RV911–RV942 . . . 135
11.11 BAliBASE column score improvement on RV911–RV942 . . . 136
List of Tables

2.1  Some TSP history milestones (mainly from [Okano, 2009]) . . . 14
2.2  Milestones for solving TSP (from [Applegate et al., 2007]) . . . 19
3.1  LKH and Concorde results for dk11455.tsp . . . 24
6.1  All-pairs suffix-prefix running-times . . . 49
6.2  SSP algorithm characteristics . . . 74
7.1  SSP experiments overview . . . 77
7.2  minimum scores for SSP algorithms in 355 280 tests . . . 82
7.3  Running time linear regression values . . . 87
7.4  superstring lengths compared to original string length . . . 89
11.1 The used BAliBASE reference sets . . . 127
11.2 comparing variable block size to constant block sizes . . . 137
C.1  SSP-experiments nucleotide sequences . . . 157, 158, 159
D.1  coverage 6 test data . . . 163
D.2  coverage 8 test data . . . 163, 164
D.3  coverage 10 test data . . . 164, 165
D.4  coverage 20 test data . . . 165
D.5  coverage 30 test data . . . 165
D.6  coverage 40 test data . . . 165
Part V
Appendices
Evolutionary tree tour with 8178 cities
Appendix A
Acronyms
AP      Assignment Problem
ATSP    Asymmetric Traveling Salesman Problem
BLOSUM  Block Substitution Matrix
CPU     Central Processing Unit
CC      Cycle Cover
CO      Circular Order
CS      Circular Sum
CT      Circular Tour
DNA     Deoxyribonucleic Acid
GB      Giga Byte
GCD     Greatest Common Divisor
HCP     Hamiltonian Cycle Problem
HMM     Hidden Markov Model
HPP     Hamiltonian Path Problem
KMP     Knuth-Morris-Pratt
LKH     Lin-Kernighan-Helsgaun
LP      Linear Program
MB      Mega Byte
MHz     Mega Hertz
MMP     Minimum Matching Problem
MPS     Maximal Path Sum
MSA     Multiple Sequence Alignment
MST     Minimum Spanning Tree
MTSP    Metric Traveling Salesman Problem
NP      Nondeterministic Polynomial-time
OPA     Optimal Pairwise Alignment
PAM     Point Accepted Mutation
PTAS    Polynomial Time Approximation Scheme
RAM     Random-access Memory
RNA     Ribonucleic Acid
SCSP    Shortest Common Superstring Problem
SOP     Sum-of-Pairs
SSP     Shortest Superstring Problem
TSP     Traveling Salesman Problem
VLSI    Very-large-scale Integration
WLOG    Without loss of generality
Appendix B
LKH Output
The following is a sample output from a run of LKH on the TSPLIB instance pr2392, using the shell command: ingolf% ./LKH pr2392.par

File pr2392.par:
PROBLEM_FILE = pr2392.tsp
OPTIMUM = 378032
MOVE_TYPE = 5
PATCHING_C = 3
PATCHING_A = 2
RUNS = 10

Output:
PARAMETER_FILE = pr2392.par
Reading PROBLEM_FILE: "pr2392.tsp" ... done
ASCENT_CANDIDATES = 50
BACKBONE_TRIALS = 0
BACKTRACKING = NO
# CANDIDATE_FILE =
CANDIDATE_SET_TYPE = ALPHA
EXCESS = 0.00041806
EXTRA_CANDIDATES = 0
EXTRA_CANDIDATE_SET_TYPE = QUADRANT
GAIN23 = YES
GAIN_CRITERION = YES
INITIAL_PERIOD = 1196
INITIAL_STEP_SIZE = 1
INITIAL_TOUR_ALGORITHM = WALK
# INITIAL_TOUR_FILE =
INITIAL_TOUR_FRACTION = 1.000
# INPUT_TOUR_FILE =
KICK_TYPE = 0
KICKS = 1
# MAX_BREADTH =
MAX_CANDIDATES = 5
MAX_SWAPS = 2392
MAX_TRIALS = 2392
# MERGE_TOUR_FILE =
MOVE_TYPE = 5
NONSEQUENTIAL_MOVE_TYPE = 9
OPTIMUM = 378032
# OUTPUT_TOUR_FILE =
PATCHING_A = 2
PATCHING_C = 3
# PI_FILE =
# POPULATION_SIZE = 0
PRECISION = 100
PROBLEM_FILE = pr2392.tsp
RESTRICTED_SEARCH = YES
RUNS = 10
SEED = 1
STOP_AT_OPTIMUM = YES
SUBGRADIENT = YES
# SUBPROBLEM_SIZE =
# SUBPROBLEM_TOUR_FILE =
SUBSEQUENT_MOVE_TYPE = 5
SUBSEQUENT_PATCHING = YES
# TIME_LIMIT =
# TOUR_FILE =
TRACE_LEVEL = 1
Lower bound = 373488.5, Gap = 1.20%, Ascent time = 30.5 sec.
Cand.min = 2, Cand.avg = 5.0, Cand.max = 5
Preprocessing time = 33.0 sec.
* 1: Cost = 378224, Gap = 0.051%, Time = 0.8 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
Run 1: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
Run 2: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
Run 3: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
* 1: Cost = 378811, Gap = 0.206%, Time = 1.1 sec.
* 3: Cost = 378798, Gap = 0.203%, Time = 1.5 sec.
* 6: Cost = 378653, Gap = 0.164%, Time = 1.9 sec.
* 10: Cost = 378033, Gap = 0.000%, Time = 2.4 sec.
* 11: Cost = 378032, Gap = 0.000%, Time = 2.4 sec. =
Run 4: Cost = 378032, Gap = 0.000%, Time = 2.4 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
Run 5: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
* 1: Cost = 378136, Gap = 0.028%, Time = 0.7 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 0.9 sec. =
Run 6: Cost = 378032, Gap = 0.000%, Time = 0.9 sec. =
* 1: Cost = 378370, Gap = 0.089%, Time = 0.7 sec.
* 4: Cost = 378075, Gap = 0.011%, Time = 2.0 sec.
* 6: Cost = 378053, Gap = 0.006%, Time = 2.3 sec.
* 29: Cost = 378032, Gap = 0.000%, Time = 3.6 sec. =
Run 7: Cost = 378032, Gap = 0.000%, Time = 3.6 sec. =
* 1: Cost = 378683, Gap = 0.172%, Time = 1.2 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 1.7 sec. =
Run 8: Cost = 378032, Gap = 0.000%, Time = 1.7 sec. =
* 1: Cost = 378787, Gap = 0.200%, Time = 1.2 sec.
* 3: Cost = 378634, Gap = 0.159%, Time = 1.5 sec.
* 4: Cost = 378615, Gap = 0.154%, Time = 1.7 sec.
* 8: Cost = 378032, Gap = 0.000%, Time = 2.7 sec. =
Run 9: Cost = 378032, Gap = 0.000%, Time = 2.7 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 0.8 sec. =
Run 10: Cost = 378032, Gap = 0.000%, Time = 0.8 sec. =
Successes/Runs = 10/10
Cost.min = 378032, Cost.avg = 378032.00, Cost.max = 378032
Gap.min = 0.000%, Gap.avg = 0.000%, Gap.max = 0.000%
Trials.min = 1, Trials.avg = 5.8, Trials.max = 29
Time.min = 0.7 sec., Time.avg = 1.6 sec., Time.max = 3.6 sec.
Appendix C
GenBank Nucleotide Sequences

Depicted in Table C.1 are the accession numbers and fasta file descriptions for the DNA-sequences used as basis for the SSP-experiments.
Table C.1: SSP-experiments nucleotide sequences

Acc. Number    Description
D00723.1       HUMHYDCP Homo sapiens mRNA for hydrogen carrier protein, a component of an enzyme complex, glycine synthase (EC 2.1.2.10)
D11428.1       HUMPMP22A Homo sapiens mRNA for PMP-22 (PASII/SR13/Gas-3), complete cds
NT_004321.15   Hs1_4478 Homo sapiens chromosome 1 genomic contig
NT_004350.16   Hs1_4507 Homo sapiens chromosome 1 genomic contig
NT_004433.16   Hs1_4590 Homo sapiens chromosome 1 genomic contig
NT_004434.16   Hs1_4591 Homo sapiens chromosome 1 genomic contig
NT_004487.16   Hs1_4644 Homo sapiens chromosome 1 genomic contig
NT_004511.16   Hs1_4668 Homo sapiens chromosome 1 genomic contig
NT_004538.15   Hs1_4695 Homo sapiens chromosome 1 genomic contig
NT_004547.16   Hs1_4704 Homo sapiens chromosome 1 genomic contig
NT_004559.11   Hs1_4716 Homo sapiens chromosome 1 genomic contig
NT_004610.16   Hs1_4767 Homo sapiens chromosome 1 genomic contig
NT_004668.16   Hs1_4825 Homo sapiens chromosome 1 genomic contig
NT_004671.15   Hs1_4828 Homo sapiens chromosome 1 genomic contig
NT_004686.16   Hs1_4843 Homo sapiens chromosome 1 genomic contig
NT_004754.15   Hs1_4911 Homo sapiens chromosome 1 genomic contig
NT_004836.15   Hs1_4993 Homo sapiens chromosome 1 genomic contig
NT_004873.15   Hs1_5030 Homo sapiens chromosome 1 genomic contig
NT_019273.16   Hs1_19429 Homo sapiens chromosome 1 genomic contig
NT_021877.16   Hs1_22033 Homo sapiens chromosome 1 genomic contig
NT_021937.16   Hs1_22093 Homo sapiens chromosome 1 genomic contig
NT_021973.16   Hs1_22129 Homo sapiens chromosome 1 genomic contig
NT_022071.12   Hs1_22227 Homo sapiens chromosome 1 genomic contig
NT_026943.13   Hs1_27103 Homo sapiens chromosome 1 genomic contig
NT_028050.13   Hs1_28209 Homo sapiens chromosome 1 genomic contig
NT_029860.11   Hs1_30115 Homo sapiens chromosome 1 genomic contig
NT_030584.10   Hs1_30840 Homo sapiens chromosome 1 genomic contig
X00351.1       Human mRNA for beta-actin
X01098.1       Human mRNA for aldolase B (EC 4.1.2.13)
X02160.1       Human mRNA for insulin receptor precursor
X02874.1       Human mRNA for (2'-5') oligo A synthetase E (1,6 kb RNA)
X02994.1       Human mRNA for adenosine deaminase (adenosine aminohydrolase, EC 3.5.4.4)
X03350.1       Human mRNA for alcohol dehydrogenase beta-1-subunit (ADH1-2 allele)
X03444.1       Human mRNA for nuclear envelope protein lamin A precursor
X03445.1       Human mRNA for nuclear envelope protein lamin C precursor
X03663.1       Human mRNA for c-fms proto-oncogene
X03795.1       Human mRNA for platelet derived growth factor A-chain (PDGF-A)
X04217.1       Human mRNA for porphobilinogen deaminase (PBG-D, EC 4.3.1.8)
X04350.1       Human mRNA for liver alcohol dehydrogenase (EC 1.1.1.1) gamma 1 subunit (partial) from ADH3 locus
X04412.1       Human mRNA for plasma gelsolin
X04772.1       Human mRNA for lymphocyte IgE receptor (low affinity receptor Fc epsilon R)
X04808.1       Human mRNA for non-erythropoietic porphobilinogen deaminase (hydroxymethylbilane synthase; EC 4.3.1.8)
X06374.1       Human mRNA for platelet-derived growth factor PDGF-A
X06985.1       Human mRNA for heme oxygenase
X07173.1       Human mRNA for second protein of inter-alpha-trypsin inhibitor complex
X07577.1       Human mRNA for C1 inhibitor
X07696.1       Human mRNA for cytokeratin 15
X07743.1       Human mRNA for pleckstrin (P47)
X13097.1       Human mRNA for tissue type plasminogen activator
X13403.1       Human mRNA for octamer-binding protein Oct-1
X13440.1       Human mRNA for 17-beta-hydroxysteroid dehydrogenase (17-HSD) (EC 1.1.1.62)
X14034.1       Human mRNA for phospholipase C
X14758.1       Human mRNA for adenocarcinoma-associated antigen (KSA)
X15610.1       Human mRNA for proliferating cell nucleolar antigen P40 C-terminal fragment
X51841.1       Human mRNA for integrin beta(4) subunit
X52104.1       Human mRNA for p68 protein
X53279.1       Human mRNA for placental-like alkaline phosphatase (EC 3.1.3.1)
X56088.1       Human mRNA for cholesterol 7-alpha-hydroxylase
X57398.1       Human mRNA for pM5 protein
X58377.1       Human mRNA for adipogenesis inhibitory factor
Y00093.1       HSP15095 H.sapiens mRNA for leukocyte adhesion glycoprotein p150,95
Y00264.1       Human mRNA for amyloid A4 precursor of Alzheimer's disease
Y00503.1       Human mRNA for keratin 19
Y00649.1       Human mRNA for B-lymphocyte CR2-receptor (for complement factor C3d and Epstein-Barr virus)
Appendix D
SSP Test Data

To generate the test data according to the method description in Section 7.1 on page 76, the following calculations were made to determine the needed length N of the nucleotide sequence. Let the wanted values for coverage, collection size and string length be denoted cov, Ssize and l respectively; then

    N = (Ssize · l) / cov        (D.1)

The number of strings gen(N) of length l generated by the method using a nucleotide sequence of length N can be calculated as

    gen(N) = N − l + 1

and the number of strings to be randomly selected for removal, r, is given by

    r = gen(N) − Ssize

giving a reduction ratio rratio in percentage of size

    rratio = (r · 100) / gen(N)
Example 21
Let cov = 8, Ssize = 600, l = 16; then

    N = (600 · 16)/8 = 1200
    gen(N) = 1200 − 16 + 1 = 1185
    r = 1185 − 600 = 585
    rratio = (585 · 100)/1185 ≈ 49.367 %
In practice an integer value for the reduction ratio was preferred, leading to the following modification of Equation D.1 on the previous page:

    N′ = (Ssize · l)/cov − 1 + l = N + l − 1

Example 22
Let cov = 8, Ssize = 600, l = 16; then

    N′ = (600 · 16)/8 − 1 + 16 = 1215
    gen(N′) = 1215 − 16 + 1 = 1200
    r = 1200 − 600 = 600
    rratio = (600 · 100)/1200 = 50 %
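These calculations are easy to reproduce; a small Python sketch (the function name is chosen for illustration only):

    def test_data_parameters(cov, ssize, l):
        """Integer-ratio variant of Equation D.1 and the derived quantities."""
        n = (ssize * l) // cov - 1 + l     # N' = N + l - 1
        gen = n - l + 1                    # number of generated strings
        r = gen - ssize                    # strings randomly removed
        rratio = r * 100 / gen             # reduction ratio in percent
        cov_actual = ssize * l / n         # the approximated coverage cov'
        return n, gen, r, rratio, cov_actual

    print(test_data_parameters(8, 600, 16))
    # -> (1215, 1200, 600, 50.0, 7.901...), as in Example 22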
Note that the effect of this is that the value for cov is then actually only approximated, giving

    cov′ = (Ssize · l) / N′

In Example 22 we get

    cov′ = (600 · 16)/1215 ≈ 7.901

The values for the generated test data are listed in the following tables.
Table D.1: coverage 6 test data

  Ssize                     l     rratio
  100, 200, . . . , 1000    10    40
  100, 200, . . . , 1000    20    70
  100, 200, . . . , 1000    25    76
  100, 200, . . . , 1000    40    85
  100, 200, . . . , 1000    50    88
  100, 200, . . . , 1000    100   94
Table D.2: coverage 8 test data

  Ssize                                  l     rratio
  100, 200, . . . , 1000, 2000, 3000     8      0
  100, 200, . . . , 1000, 2000, 3000     10    20
  100, 200, . . . , 1000                 16    50
  100, 200, . . . , 1000                 20    60
  100, 200, . . . , 1000                 25    68
  100, 200, . . . , 1000                 32    75
  100, 200, . . . , 1000                 40    80
  100, 200, . . . , 1000                 50    84
  100, 200, . . . , 1000                 100   92
Table D.3: coverage 10 test data

  Ssize                             l     rratio (%)
  100, 200, ..., 1000               10    0
  100, 200, ..., 1000               20    50
  100, 200, ..., 1000               25    60
  100, 200, ..., 1000               40    75
  100, 200, ..., 1000               50    80
  100, 200, ..., 600, 900           100   90
Table D.4: coverage 20 test data

  Ssize       l     rratio (%)
  100, 200    50    60
  100, 200    100   80
Table D.5: coverage 30 test data

  Ssize                 l     rratio (%)
  100, 200, ..., 600    50    40
  100, 200              100   70
Table D.6: coverage 40 test data

  Ssize       l     rratio (%)
  100, 200    50    20
  100, 200    100   60