Multiple Sequence Alignment with Evolutionary Computation

Conrad Shyu
Luke Sheneman
James A. Foster
Initiatives for Bioinformatics and Evolutionary Studies (IBEST)
Department of Bioinformatics and Computational Biology
University of Idaho, Moscow, Idaho 83844-1010, USA
+1 208.885.7062
Abstract. In this paper we provide a brief review of current work in the area of multiple sequence
alignment (MSA) for DNA and protein sequences using evolutionary computation (EC). We detail the
strengths and weaknesses of EC techniques for MSA. In addition, we present two novel approaches for
inferring MSA using genetic algorithms. Our first novel approach utilizes a GA to evolve an optimal guide
tree in a progressive alignment algorithm and serves as an alternative to the more traditional heuristic
techniques such as neighbor-joining. The second novel approach facilitates the optimization of a consensus
sequence with a GA using a vertically scalable encoding scheme in which the number of iterations needed
to find the optimal solution is approximately the same regardless of the number of sequences being aligned.
We compare both of our novel approaches to the popular progressive alignment program Clustal W.
Experiments have confirmed that EC constitutes an attractive and promising alternative to traditional
heuristic algorithms for MSA.
Keywords: multiple sequence alignment, genetic algorithm, progressive alignments, DNA sequences
1. Introduction
Living things diverge from common ancestors through changes in deoxyribonucleic acid (DNA) and
millions of years of evolution (5). DNA plays a fundamental role in the processes of life. DNA contains the
template for the synthesis of proteins, which are crucial molecules for living systems. Moreover, DNA is
essential to life because it functions as a medium to transmit information from one generation to another
(10). The most important regions in DNA are generally conserved to ensure survival. Sequence alignment
is commonly used to detect and quantify similarities in DNA or protein sequences. Alignments of
biological sequences generated by computational algorithms are routinely used as a basis for inference
about sequences whose structures or functions are not well known. The most common approach is to find
the best-scoring alignment between a pair of sequences, where the alignment score is a measure of the edit
distance between the sequences in the context of a particular evolutionary model. An evolutionary model
can be represented as a scoring system which penalizes substitutions and gaps (5, 7). The best-scoring
(optimal) alignment can be found through the use of dynamic programming (DP) algorithms such as the
Smith-Waterman (28, 37) and Needleman-Wunsch algorithms (20). However, the time and space required by DP
grow exponentially with the number of sequences being aligned. Specifically, computing an optimal multiple
sequence alignment (MSA) has been shown to be NP-hard (36). Several heuristic approaches,
such as Clustal W (32, 33, 34), are frequently used to quickly approximate optimal alignments. In this
paper, we briefly review the current work in sequence alignment with evolutionary computation (EC). In
addition, we present two novel approaches that utilize EC to optimize multiple alignments.
Our first new approach employs a steady-state GA (13, 39) to evolve guide trees, which are a fundamental
component of progressive alignment algorithms (8). The population in the GA consists of viable guide trees
that are represented in an efficient, coalescing binary tree structure. This enables fast and meaningful
crossover and mutation. Variability operators such as crossover and mutation are constructed such that the
viability of an individual tree is never compromised. Fitness is objectively computed by performing the
progressive alignment in the pairwise ordering specified by the guide trees in the population. The fitness of
an individual tree is computed as the natural log of the alignment score of the final alignment produced by
performing the progressive alignment in the order specified by that tree. In this way, the fitness of a guide
tree is optimized only with respect to the most important result: the quality of the final multiple sequence
alignment. The second MSA approach facilitates the optimization of a consensus sequence (6) with a GA
with an encoding scheme that was designed such that the search complexity is independent of the number
of sequences being aligned. The search complexity of this approach primarily depends on the length of the
consensus sequence and the degree of similarity between sequences. The scheme encodes each possible
matching nucleotide at a given column with binary masks. This compact representation greatly reduces the
space requirement as well as the search complexity. The objective or evaluation function gives the
sum-of-pairs (SP) (4) score to determine the fitness of each chromosome in the population. The SP score has been
widely used to detect and quantify similarities between sequences; however, it does not provide any
probabilistic or biological justification (7). To further improve the performance of the GA, we have developed
a sequence profiling formulation that reduces the complexity for calculating the SP scores.
2. Sequence Alignment
There are diverse motivations behind the alignment of biological sequences. Genetic sequences are
inherited from common ancestors through millions of years of evolution. Therefore, it is of interest to trace
the evolutionary history of mutation and other evolutionary changes through sequencing (1, 5). Alignment of
biological sequences, in this context, is generally understood as a comparison based on the criteria of
evolution. For example, the number of mutations, insertions, and deletions of residues necessary to
transform one DNA sequence into another is a measure of phylogeny or evolutionary relatedness. On the
other hand, a comparison may pinpoint regions of common origin, which may in turn coincide with regions
of similar structure or function (10). Pairwise sequence alignment is a technique for arranging two
sequences so that the residues in corresponding positions are deemed to have a common evolutionary origin. In
other words, if the same residue occurs in both sequences at the same position, then it may have been
conserved during the course of evolution. If, however, the two residues differ, then it is generally assumed that
they were derived from a common ancestral residue through substitution. Homologous sequences, those related by common
descent, might have different lengths, which is generally explained through insertions or deletions (27).
Statistical approaches, such as hidden Markov models, have been commonly used to detect homologous
sequences and subsequently infer the alignments (7, 22). A hidden Markov model consists of a set of states
connected by probabilistic transitions. Each transition indicates the probability of moving from one state to
another. The transition structure consists of repeated elements, each made up of match, insert, and silent
delete states. The number of repeated elements is the length of the model. Each element, with its match, insert,
and delete states, models a position in the consensus sequence of the sequence family and describes sequence homology.
Another commonly used approach is dynamic programming. Dynamic programming is a mathematically
rigorous technique because it is guaranteed to find the optimal alignment (26). MSA is simply an extension
of pairwise sequence alignment. MSA is the process of aligning three or more sequences simultaneously to
bring as many similar residues into register as possible (4, 25). The resulting alignments are commonly
interpreted in two contexts: (a) to find regions that define a conserved pattern or domain; and (b) to derive
the possible phylogeny or evolutionary relationships among the sequences (12). The presence of similar
domains across multiple sequences implies a similar biochemical function or higher-level structure that
may be used as the basis for further experimental investigation.
2.1 Dynamic Programming
DP is a commonly used recurrence method for solving sequential or multi-stage decision problems (11, 22).
The essence of DP is the principle of optimality. DP has long been used to solve varieties of discrete
optimization problems such as scheduling, string-editing, packaging, and inventory management (11). It
views a problem as a set of interdependent sub-problems, solves these sub-problems, and uses the
results to solve ever-larger sub-problems. The solution to a sub-problem is expressed as a function of
solutions to one or more sub-problems at the preceding levels. DP expresses the problem in a recurrence
formulation. To make optimal decisions for the next and all future states, DP only needs to know the
current state and the state of its immediate predecessors. This is also known as the Markovian property (7).
For a process to be Markovian, future states must depend only on the present state and the past should not
have any effect on the future. The term programming in the name actually refers to the mathematical rules
that can be easily followed to solve a problem; it has nothing to do with writing a computer program. DP is
known to be an efficient programming technique for solving certain combinatorial problems. It is
particularly important in bioinformatics (27) as it is the basis of sequence alignments for comparing DNA
and protein sequences.
The recurrence equation (Eq. 1) is applied repeatedly to fill the matrix of F(i, j) values. This particular
formulation gives the global alignment of two sequences. F(i, j) is the maximum of three previous values,
namely F(i-1, j-1), F(i-1, j), and F(i, j-1). The value s(xi, yj) is the score for aligning the characters xi and yj
while d is the penalty for gap insertion. For pairwise sequence alignments, DP begins with the construction
of an alignment matrix F(i, j) with the indexes (i, j) for the two sequences Sx and Sy. The matrix is first
initialized with F(0, 0)=0. The value of F(i, j) is the score of the best alignment from the first character x1
to the character xi of sequence Sx and the first character y1 to the character yj of Sy. There are three possible
ways that xi and yj can be aligned: (a) xi can align with yj, which gives a match or mismatch; (b) xi is aligned
with a gap; or (c) yj is aligned to a gap. Since the matrix is built recursively, in order to calculate F(i, j), the
previous states F(i-1, j-1), F(i-1, j), and F(i, j-1) must be known beforehand. The following equation shows
the recurrence formulation of DP for sequence alignment.
\[
F(i, j) = \max \begin{cases}
F(i-1, j-1) + s(x_i, y_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d.
\end{cases}
\]
Eq. 1
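The recurrence in Eq. 1 can be illustrated with a minimal Python sketch. The match and mismatch values, the gap penalty d = 2, and the boundary initialization F(i, 0) = -id and F(0, j) = -jd are assumptions made for illustration; the text above only specifies F(0, 0) = 0. The function name is hypothetical.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, d=2):
    """Fill the matrix F of Eq. 1 and return the best global alignment score.

    The scoring values and the boundary rows/columns are illustrative
    assumptions, not parameters taken from the paper.
    """
    def s(a, b):
        # Substitution score s(x_i, y_j): a toy match/mismatch scheme.
        return match if a == b else mismatch

    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]

    # Assumed boundary: a prefix aligned entirely against gaps costs d per gap.
    for i in range(1, rows):
        F[i][0] = -d * i
    for j in range(1, cols):
        F[0][j] = -d * j

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                F[i - 1][j] - d,                          # x_i aligned with a gap
                F[i][j - 1] - d,                          # y_j aligned with a gap
            )
    return F[-1][-1]
```

Each cell depends only on its three predecessors, so the matrix can be filled row by row in time proportional to the product of the two sequence lengths.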
Simultaneous alignment of three or more sequences with DP, however, poses a difficult algorithmic
challenge (30). Determining the optimal alignment of more than a handful of sequences has a prohibitive
time complexity (36). Because of this, various heuristic approaches have been developed, many of which
are capable of producing good alignments in a relatively short period of time. The most commonly used
heuristic technique is known as progressive multiple sequence alignment (8, 32, 33, 34).
2.2 Progressive Alignment
Traditional progressive multiple sequence alignment algorithms involve at least a three-step process in
which input sequences are first compared to one another using dynamic programming (DP) (8) to
determine the edit distances between all possible pairs of sequences. The use of DP for computing pairwise
distances guarantees an optimal result for the pairwise comparisons, but has a time complexity of O(L^2) for
comparing just two sequences of length L (36). For n input sequences, the number of pairwise distance measurements
that must be taken is:
\[
\text{Number of Pairwise Distances} = \binom{n}{2} = \frac{n(n-1)}{2}
\]
Eq. 2
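To give a feel for how quickly this count grows, the following short Python sketch enumerates the sequence pairs and evaluates Eq. 2 directly. The function names and the sequence labels are hypothetical.

```python
from itertools import combinations
from math import comb


def sequence_pairs(names):
    """Return every unordered pair of sequence names (one pairwise alignment each)."""
    return list(combinations(names, 2))


def count_pairs(n):
    """Number of pairwise distances for n sequences, i.e. n choose 2."""
    return comb(n, 2)
```

For example, four sequences require 6 pairwise alignments, while 100 sequences already require 4950.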
Notably, to counter the obvious scalability issues of performing so many optimal pairwise alignments,
systems such as Clustal W offer the option of using faster, less-accurate forms of pairwise distance
measurements, but this ultimately results in the construction of less accurate guide trees, which can have a
deleterious impact on the overall quality of the entire multiple sequence alignment. After all pairwise
distances have been computed, the distances are used to construct a guide tree using techniques such as
Neighbor-Joining (NJ) (24, 31).
Figure 1: The traditional progressive alignment algorithm. (a) All possible pairs of sequences are optimally aligned using dynamic
programming to determine their edit distance. Then, (b) edit distance information is used by a neighbor-joining algorithm to estimate
and construct a guide tree. (c) Finally, the sequences are progressively aligned using the guide tree in order to produce an alignment
The process of constructing a guide tree (8) based on pairwise distances is simple and reasonably scalable,
but it is subject to certain limitations. NJ is a simple iterative clustering algorithm which is based on
the approach of using pairwise edit distance information to decompose an initial star-shaped tree into a
fully descriptive tree which represents, based on pairwise sequence distances, the phylogenetic
relationships between all of the taxa on the tree (24, 31). In such a tree, the most similar sequences are
clustered together first, followed by the most similar sub-alignments, and so on. Eventually, an entire tree is
built which represents the similarity relationships between all of the sequences. The tree built by
neighbor-joining (NJ) is subsequently used as the guide tree that ultimately describes an order of operations of
aligning sequences and sub-alignments. The quality of the final alignment is typically quantified by a
sum-of-pairs (SP) score.
2.3 Clustal W
Clustal W is a popular progressive alignment system. Since progressive alignment is a heuristic algorithm,
Clustal W is not guaranteed to find optimal alignments (8, 32, 33, 34). Clustal W exploits the fact that
homologous sequences are evolutionarily related. It builds up multiple alignments progressively with a
series of pairwise alignments, moving from the leaves upward in a guide tree that estimates the phylogeny
of the sequences (8). Although Clustal W does not always find optimal alignments, in most cases those
alignments give a good starting point for further automatic or manual refinement. This type of alignment is
generally useful for identifying regions that are highly conserved. The alignment can be further
improved through sequence weighting, position-specific gap penalties and choice of weight matrix (2).
Progressive alignment is prone to a local maxima problem, which stems from the nature of the strategy. As the algorithm
follows the guide tree and merges sequences together, the solution is never guaranteed to be globally
optimal, as defined by some overall measure of alignment quality. Any misaligned regions made early in
the alignment process cannot be corrected later as new information from other sequences is introduced.
This problem is frequently a result of an incorrect branching order in the guide tree. One way to correct this
is to use an iterative or stochastic sampling procedure such as bootstrapping (33). The choice of alignment
parameters is also problematic in Clustal W. If parameters are not chosen appropriately, alignments will not
converge to a globally optimal solution. For closely related sequences, any reasonable scoring matrix
should work well because matches usually receive the most weight. Therefore, when matches dominate an
alignment, almost any weight matrix will find a good solution. However, when aligning more divergent
sequences, the range of workable scores for gaps and mismatches narrows, and the exact values become critical because gaps and mismatches occur more frequently.
Moreover, for highly conserved sequences, the range of gap penalties that will find the correct or best
possible solution can be very broad. As more and more divergent sequences are added, however, the exact
values for gap penalties become critical for success (31). Our observations have confirmed that this is a
common problem in most MSA algorithms. As the number of sequences in an alignment increases, the
expected number of matches in each column also increases. For example, the probability of finding a
matching nucleotide in the column of ten sequences is much higher than that of three sequences. In general,
it is difficult to justify why one scoring matrix is better than the others (7).
2.4 Sum-of-Pairs (SP) Scores and Substitution Matrices
Carrillo and Lipman (4) first introduced the sum-of-pairs (SP) score function, which defines the score of a
multiple alignment of N sequences as the sum of the scores of the N(N-1)/2 pairwise alignments. Although
the SP score function has been widely used to evaluate MSA, it does not provide any biological or
probabilistic justification (7). Each sequence is scored as if it is descended from the N-1 other sequences,
instead of a single ancestor. As a result, evolutionary events are often overestimated. The problem worsens
as the number of sequences increases. A weighted SP score function (2) has been proposed to partially
compensate for this effect. Moreover, despite the simplicity of the SP score function, its sheer running time and
space consumption make it impractical even for modestly-sized sets of short sequences. It has been shown
that the problem of computing MSA with optimal SP score is NP-hard (36). Several fast approximations
and divide-and-conquer approaches (30) have been proposed to overcome the computational complexity.
In (2) and (6), the SP function, w(M), sums the pairwise substitution scores over all columns for every
pair of sequences p and q. Each column is evaluated with a scoring matrix. The substitution scoring function,
s(mpj, mqj), gives the score of aligning the residues of sequences p and q at column j.
The weight, αp,q, is intended to balance the
overestimation problem in the SP score function (2, 6, 7). The following equation shows the mathematical
formulation of the weighted SP score function.
\[
w(M) = \sum_{1 \le p < q \le k} \alpha_{p,q} \times \left[ \sum_{j=1}^{N} s(m_{pj}, m_{qj}) \right]
\]
Eq. 3
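A minimal Python sketch of the weighted SP score in Eq. 3 follows. Here k is read as the number of sequences and N as the number of alignment columns, as the summation limits suggest. The substitution values, the gap score, the zero score for a gap aligned with a gap, and the default weight of 1 are all assumptions chosen for illustration, and the function name is hypothetical.

```python
from itertools import combinations


def weighted_sp_score(alignment, weights=None, match=1, mismatch=-1, gap=-2):
    """Weighted sum-of-pairs score of an alignment given as equal-length rows.

    weights maps a pair (p, q) with p < q to alpha_{p,q}; when omitted every
    pair has weight 1. All scoring values are illustrative assumptions.
    """
    def s(a, b):
        if a == "-" and b == "-":
            return 0          # assumed: a gap-gap column contributes nothing
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    total = 0.0
    for p, q in combinations(range(len(alignment)), 2):   # 1 <= p < q <= k
        alpha = 1.0 if weights is None else weights[(p, q)]
        # Inner sum over the N columns of the alignment.
        total += alpha * sum(s(a, b) for a, b in zip(alignment[p], alignment[q]))
    return total
```

For the three rows "AC-T", "ACGT" and "A-GT", the pairwise column sums are 1, -2 and 1, so the unweighted score is 0.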
A major component in assessing the quality of a sequence alignment is the substitution matrix, which
assigns a cost for substituting any possible pair of residues. The substitution costs are evaluated using a
predefined evolutionary model in which a score is assigned to every possible substitution or conservation
according to its biological similarity (1, 4, 7). Each sequence receives a weight proportional to the amount
of independent information it contains. The overall cost of an entire multiple sequence alignment is the
sum of the costs of all of the pairwise substitutions. Amino acid substitution matrices, for example, can be
calculated empirically by examining which substitutions occur in correct alignments and a model for the
random protein sequences. These matrices can also be derived by scoring the relations of amino acid to
each other according to some of their features, such as size, charge, hydrophobicity and genetic code. The
theory of amino acid substitution matrices is described in (1) and applied to DNA sequence comparison in
(29). A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary
change. In general, different substitution matrices are tailored to detecting similarities among sequences
that are diverged by differing degrees. Moreover, substitution matrices are frequently used to simulate
evolutionary events and generate sequences for experimental studies (27).
2.5 Sequence Generation
Mutations in sequences are fixed into the population and consequently result in the substitutions from one
nucleotide to another at various sites (12, 16). The simplest model for such nucleic substitutions assumes
that all changes are equally likely. In order to predict the probability that a particular nucleotide at a
particular site will change to another over some time interval, we only need to know the instantaneous rate
of change (denoted by α) or the rate at which nucleotide substitutions occur. The Jukes-Cantor model (16)
has only one parameter and assumes that the substitution rates are the same for all nucleotides. Because the
one-parameter model assumes that all substitutions are equally likely, the probability that a site retains
nucleotide i after time t, and the probability that it changes from i to a different nucleotide j, can be written as follows:
\[
P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}
\quad \text{and} \quad
P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}
\]
Eq. 4
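The two probabilities in Eq. 4 are easy to evaluate numerically. The Python sketch below (function names are hypothetical) also makes two sanity checks visible: at t = 0 no change has occurred, and for any t the probability of staying plus three times the probability of changing to one specific other nucleotide equals 1.

```python
import math


def jc_same(alpha, t):
    """P_ii(t): probability that a site shows the same nucleotide after time t."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)


def jc_diff(alpha, t):
    """P_ij(t): probability that a site shows one particular different nucleotide."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * t)
```

As t grows, both values approach 1/4, which reflects the loss of all information about the ancestral nucleotide.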
With real biological data, there are typically several conserved regions within sequences (23), which signal
important biological functions. To better simulate the evolutionary process in nature and perform
experiments on controlled data sets, we devised a technique that closely follows the biological assumptions
for sequence generation.
The sequence generation procedure begins with a randomly generated sequence that serves as the template
or ancestral sequence. A trigonometric function shapes the probability that any given
nucleotide at a particular site will undergo a mutation. The trigonometric function is periodic, so
it works well for simulating conserved regions and allows site-specific rate heterogeneity. The
implementation employs a Markov model, and assumes that evolution is independent and identically
distributed at each site (10, 12). The simulation process randomly generates a number between 0 and 1. If
this number is less than the predefined probability density at a given site, then no mutation will occur.
Otherwise, the process invokes the evolutionary model and determines the most probable substitution. For
the purpose of our studies, only the Jukes-Cantor model (16) was implemented for simulation.
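The generation procedure described above might be sketched as follows. The exact trigonometric function is not given in the text, so the cosine profile, its period, and its lower bound (`period`, `floor`) are assumptions, as are all function names. Under the Jukes-Cantor model every substitution is equally likely, so this sketch simply draws the replacement uniformly from the three other nucleotides.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def conservation(site, period=20, floor=0.5):
    """Hypothetical periodic profile: probability that a site is left unchanged.

    Oscillates between `floor` and 1.0, so sites near multiples of `period`
    behave like conserved regions.
    """
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(2.0 * math.pi * site / period))


def evolve(ancestor, rng, period=20, floor=0.5):
    """Derive one descendant sequence from the ancestral template."""
    child = []
    for site, base in enumerate(ancestor):
        if rng.random() < conservation(site, period, floor):
            child.append(base)                      # no mutation at this site
        else:
            # Jukes-Cantor: all three substitutions are equally likely.
            child.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
    return "".join(child)


def simulate(n_sequences, length, seed=0):
    """Generate a random ancestor and n_sequences descendants of it."""
    rng = random.Random(seed)
    ancestor = "".join(rng.choice(NUCLEOTIDES) for _ in range(length))
    return ancestor, [evolve(ancestor, rng) for _ in range(n_sequences)]
```

Because the profile is evaluated independently at each site, the sketch keeps the assumption that sites evolve independently while still allowing site-specific rates.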
Figure 2: This figure shows an example set of sequences that are generated from our simulation program. Visual inspection can easily
pinpoint that several regions in the sequences are highly conserved, which closely resemble the real biological data
3. Literature Review
Evolutionary computation (EC) constitutes a very interesting alternative to the heuristic approaches for
multiple sequence alignment (MSA). It has been shown that iterative algorithms often offer highly accurate
alignments at the expense of runtime (9). Stochastic methods such as simulated annealing and EC have
been successfully applied to the problem of MSA (18). However, they have a tendency to stagnate at local
optima as the number of sequences increases. Notredame and Higgins (21) applied a GA to MSA with a tool
known as Sequence Alignment by Genetic Algorithm (SAGA). This is the best-known seminal work in this
particular area. SAGA evolves a population of alignments using a complex set of 22 different crossover and
mutation operators in an attempt to gradually improve the fitness of the alignments in the population.
Providing meaningful scores for sequence alignments can be somewhat problematic, and by default SAGA
relies on a weighted SP approach in which each pair of sequences in an alignment is compared and scored
and then the scores from all of the pairwise alignments are summed to produce a representative score for
the entire alignment. Although SAGA was shown to produce high-quality results which were comparable to
(or sometimes better than) those of other popular heuristic techniques, SAGA has a large time complexity, likely
due to the time complexity involved in the repeated use of the weighted SP fitness function.
Another approach for MSA with GA was later introduced by Zhang and Wong (40). The authors reported
that their implementation was highly efficient. These results, however, must be considered with great care
since their strategy assumes the presence of completely conserved regions, which are the sole evidence to
guide the assembly of alignments. Their chromosome encoding scheme codified the locations and numbers
of gaps in the alignment. In other words, their GA simply evolves the number and position of gaps within
conserved segments of an alignment. The assumption that such conserved segments always exist is never
realistic or biologically sound. This method therefore can only compare long, highly similar sequences.
Researchers proposed a technique that defines a chromosome as multiple number-strings of fixed lengths
(14). The number-strings represent the positions and number of gaps in the alignments. The authors
compared their approach with Clustal W and reported outstanding performance in terms of runtime and
alignment quality on a small set of sequences. However, their claims must be considered with scrutiny
since the complexity of their GA depends on the number of sequences being aligned as well as the length of
those sequences. Since gap positions are individually encoded, the search space increases exponentially as
the length of the chromosome increases. Isokawa et al. (15) and Wayama et al. (38) proposed a simple GA
that encodes the alignment as a bit matrix that consists of 0s and 1s. In the bit matrix, the positions of 1
correspond to the gaps and 0 corresponds to a nucleotide or residue. The concept of such a representation is
very similar to that of (14).
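The bit-matrix representation can be made concrete with a small Python sketch that turns one bit row per sequence back into an aligned row, placing a gap wherever the bit is 1 and the next residue wherever it is 0. The function name and the error handling are hypothetical and are not taken from (15) or (38).

```python
def decode_bit_matrix(bits, sequences, gap="-"):
    """Expand a 0/1 matrix into aligned rows: 1 marks a gap, 0 a residue."""
    rows = []
    for row_bits, seq in zip(bits, sequences):
        if list(row_bits).count(0) != len(seq):
            raise ValueError("number of 0 bits must equal the sequence length")
        residues = iter(seq)
        # Consume residues left to right, inserting gaps at the 1 positions.
        rows.append("".join(gap if bit else next(residues) for bit in row_bits))
    return rows
```

For instance, the rows [0, 1, 0, 0] and [0, 0, 0, 0] applied to the sequences "ACT" and "AGCT" decode to "A-CT" and "AGCT".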
Karadimitriou and Kraft (17) developed a program called MSA (not to be confused with multiple sequence
alignment). They first considered the alignments without internal gaps. The chromosome only encodes the
number of gaps at the beginning of the alignments. Their fitness function evaluates the number of matching
symbols in an alignment. The fitness function simply counts the total number of matches and assigns one
point to each match. This approach is not very meaningful because the alignments produced from their
implementation do not carry any inherent biological significance. They further considered the alignments
with internal gaps. The chromosome encodes the positions of gaps in the sequences, which is very similar
to (38) and (19). The second fitness function rewards a match with one point and penalizes every group of
consecutive gaps with four points. The alignments produced from this algorithm cannot be easily quantified
and compared because the algorithm employs a non-standard measurement.
Next, we examine two novel applications of GA to the problem of multiple sequence alignment.
4. Evolving Guide Trees for Progressive Alignment Algorithm
In a progressive alignment algorithm, the guide tree dictates the order of construction of a final alignment.
Final alignment quality is highly dependent on the correctness of this guide tree. We hypothesize that
evolving guide trees using a genetic algorithm can lead to higher-quality trees which will ultimately result
in higher-quality alignments. In addition, since we avoid exhaustive and repetitive calculations of pairwise
distances, we hypothesize that our approach is more scalable than other progressive alignment approaches
when aligning large numbers of long sequences.
4.1 Algorithm Implementation
This algorithm implements an iterative steady-state GA (13, 39). The population in the GA consists of
viable guide trees which are represented in an efficient, coalescing binary tree data structure which enables
meaningful crossover and mutation. Variability operators such as crossover and mutation are constructed
such that the viability of an individual tree is never compromised. Rank-based selection is implemented via
the use of a random number generator which samples from a carefully parameterized beta probability
distribution. This non-uniform random selection, when overlaid across a sorted table of fitness scores for all
individuals in the population, allows for strongly biased rank-based selection wherein highly fit parents are
far more likely to be selected for crossover, and their offspring replace low-fitness individuals on the opposite
end of the distribution. Elitism is implemented: the fittest individual in a population is never destroyed
by less-fit offspring. Fitness for any individual is objectively computed by performing the progressive
alignment in the pairwise order specified by the individual guide tree. The fitness of an individual tree is
computed as the log of the alignment score of the final alignment produced by performing the progressive
alignment in the order specified by that tree. In this way, the fitness of a guide tree is optimized only with
respect to the most important measurement: the absolute quality of the final multiple sequence alignment.
Because of this, extraneous optimality criteria and sources of possible errors (such as misleading
neighbor-joining trees) are ignored as the GA focuses only on maximizing progressive alignment scores by evolving
successively better guide trees.
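One way to picture the beta-distributed rank selection and the steady-state replacement is the Python sketch below. The shape parameters a = 1 and b = 4, the function names, and the generic `crossover` and `mutate` callbacks are assumptions; the text does not state the actual parameterization. The sketch re-sorts the population on every step for brevity, whereas the description above keeps a sorted fitness table.

```python
import random


def pick_rank(pop_size, rng, a=1.0, b=4.0):
    """Draw a rank in [0, pop_size) that is biased toward 0 (the fittest).

    With a < b, betavariate(a, b) concentrates its mass near 0; the shape
    values here are illustrative assumptions.
    """
    return min(int(rng.betavariate(a, b) * pop_size), pop_size - 1)


def steady_state_step(population, fitness, crossover, mutate, rng):
    """One steady-state iteration: select two parents, replace one weak individual."""
    population.sort(key=fitness, reverse=True)       # rank 0 is the fittest
    size = len(population)
    mother = population[pick_rank(size, rng)]
    father = population[pick_rank(size, rng)]
    child = mutate(crossover(mother, father, rng), rng)
    # Mirror the same bias onto the low-fitness end; rank 0 is never chosen,
    # so the fittest individual always survives (elitism).
    victim = size - 1 - pick_rank(size - 1, rng)
    population[victim] = child
```

Since only one individual changes per step, most of the population, including the current best tree, carries over unchanged between iterations.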
4.2 Guide Tree Encoding
We present a novel chromosome encoding for the individual guide trees in our GA. The encoding is
extremely efficient in the contexts of both space and time and allows for the application of fast and
meaningful crossover and mutation operators. One of the most important aspects of our chromosome
encoding is that it avoids the problem of dealing with duplicate leaves during branch swapping. Each
individual in the population represents a possible guide tree, and is stored as an integer vector describing
how nodes on one level of a coalescing tree connect to the next level of the same coalescing tree. At the
lowest level of the tree, level 0, there are n terminal nodes, where n is the number of sequences being
aligned. Each terminal node corresponds to a particular sequence from the n sequences being aligned. The
ordering of the terminal nodes is static. At the next level, there are n-1 nodes to which each of the terminal
nodes may connect. This forces at least one coalescence at level 1. In general, each level of the coalescing
tree has n-x nodes, where n is the number of sequences being aligned, and x is the level of the tree.
Figure 3: A coalescing binary tree with 8 sequences. Note that full coalescence occurs at an upper bound of n steps, but can often
occur sooner
Since a node is little more than the description of the edge from a given node to another node at a
subsequent level, these nodes (and therefore the tree itself) can be represented as an integer vector. If every
node in a column of such a tree is numbered from 0 through n-x, where n is the number of leaf nodes and x
is the column index, then the tree shown above in Figure 3 can be efficiently represented as:
2,0,5,3,4,7,7,3,3,(-1),1,3,5,0,4,1,3,(-1),1,4,4,(-1),0,(-1),3,1,0,2,(-1),2,1,(-1),0,0,0
In this encoding, each value represents a description of the edge from a node at level x in the coalescing
tree to another node at level x+1. The value -1 is used to represent edgeless nodes. Since we are essentially
evolving a bifurcating phylogenetic tree, we add the constraint that any one node can have no more than
two connections from the left. To efficiently enforce this constraint, at each node we also track the number
of connections from the previous level. For a binary coalescing tree, these values are either 0, 1, or 2. By
examining the number of left and right connections at each node, it is straightforward to quickly confirm
the validity of a given tree.
4.3 Evaluating Guide Tree Fitness
The initial population of trees in our GA consists of some number of randomly generated trees. These trees
are built in a bottom-up fashion in a completely random walk up to the root of the tree. For each node at
a given level of a tree, a node in a subsequent level is chosen entirely at random, constrained only by the
limitation that nodes at level x are allowed a maximum of two connections from level x-1. Completely
viable, random trees can be built very quickly using this approach.
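The random bottom-up construction can be sketched in Python as follows. This is an illustration under assumptions rather than the encoding used in our implementation: each level is kept as its own list instead of one flat vector, level x+1 is given one node fewer than level x, a node that received no connection is marked -1, and construction stops as soon as a single node remains in use. The function name is hypothetical.

```python
import random


def random_guide_tree(n, rng=random):
    """Build a random coalescing binary tree over n leaves.

    Returns a list of levels; levels[x][i] is the index of the node at level
    x+1 that node i of level x connects to, or -1 if node i is unused.
    """
    levels = []
    active = [True] * n            # every leaf takes part in the tree
    width = n
    while sum(active) > 1:
        next_width = width - 1     # one node fewer forces at least one coalescence
        incoming = [0] * next_width
        edges = []
        for i in range(width):
            if not active[i]:
                edges.append(-1)   # edgeless node
                continue
            # Only nodes with fewer than two incoming connections are eligible.
            candidates = [j for j in range(next_width) if incoming[j] < 2]
            target = rng.choice(candidates)
            incoming[target] += 1
            edges.append(target)
        levels.append(edges)
        active = [count > 0 for count in incoming]
        width = next_width
    return levels
```

Because every node accepts at most two connections and each level is narrower than the one below it, every tree produced this way is a valid binary tree that fully coalesces within n-1 levels.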
Tree fitness is computed for each individual in the initial random population as well as for each offspring
that results from crossover/mutation operations. Fitness is computed by first building an intermediate
evaluation tree which is a temporary data structure used to hold the sequences and partial alignments as the
progressive alignment is computed by a recursive depth-first traversal of the evaluation tree. At each node
in the evaluation tree, the fitness function either recursively descends or performs an alignment.
Alignments can occur between pairs of sequences, between a single sequence and a partial alignment, or
between two partial alignments. In this way, the complete progressive alignment is built up until the root
node of the evaluation tree contains the complete alignment and the score for that alignment. The natural
log of this alignment score is then computed and represents the objective fitness for the guide tree.
Fitness evaluation is the most computationally time-consuming component of this genetic algorithm,
especially towards the top of the evaluation tree, where large partial alignments are themselves being
aligned. Specifically, the time complexity of computing alignments is given in Table 1. To conserve
memory, evaluation trees are destroyed after the progressive alignment is complete. The fitness values of
the individual guide trees are maintained in a fitness table, which is sorted in descending order of relative
fitness. This sorted fitness table is used for the selection process, as a precursor to crossover and mutation.
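The recursive fitness evaluation can be illustrated with the following sketch. It assumes an evaluation tree built from nested pairs, with plain strings as leaves, and it takes the alignment routine as a parameter. The function toy_align is only a stand-in that pads rows and counts conserved columns; it is not the dynamic programming aligner used in our experiments, and all names here are hypothetical.

```python
import math


def toy_align(left, right):
    """Stand-in aligner: pad all rows to equal width and score conserved columns."""
    rows = left + right
    width = max(len(row) for row in rows)
    padded = [row.ljust(width, "-") for row in rows]
    conserved = sum(
        1 for col in zip(*padded) if len(set(col)) == 1 and col[0] != "-"
    )
    return padded, 10.0 * conserved


def evaluate(node, align):
    """Depth-first traversal returning (alignment, score) for the subtree at node."""
    if isinstance(node, str):      # leaf: a single sequence, nothing to align yet
        return [node], 0.0
    left, _ = evaluate(node[0], align)
    right, _ = evaluate(node[1], align)
    return align(left, right)      # sequence or partial alignment on either side


def guide_tree_fitness(root, align):
    """Fitness is the natural log of the score of the complete alignment."""
    _, score = evaluate(root, align)
    return math.log(score)


trees = {
    "t1": (("ACGT", "ACGT"), ("ACGA", "ACG")),
    "t2": (("ACGT", "ACGA"), ("ACGT", "ACG")),
}
# Fitness table sorted in descending order of fitness.
table = sorted(
    ((guide_tree_fitness(t, toy_align), name) for name, t in trees.items()),
    reverse=True,
)
print(table)
```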
Figure 4: The Process of Fitness Evaluation. The coalescing binary tree is first converted to an evaluation tree, and then a progressive
alignment is performed via a depth-first recursive traversal of the evaluation tree in which sequences and partial alignments are
progressively aligned into a complete alignment of all of the input sequences
Table 1: Time complexity for three kinds of dynamic programming alignments
Type of Alignment        Time Complexity
Sequence + Sequence      O(mn), where m and n are the lengths of the sequences being aligned
Sequence + Alignment     O(kn + min{s, k}mn), where m is the length of the sequence, n is the length of the
                         alignment, k is the number of sequences comprising the alignment, and s is the size of
                         the alphabet
Alignment + Alignment    O(km + ln + min{s^2, kl}mn), where m and n are the lengths of the two alignments, k
                         and l are the numbers of sequences in the alignments, and s is the size of the alphabet
4.4 Selection, Crossover, and Mutation
A rank-based selection process is implemented wherein a parameterized beta distribution is overlaid across
a sorted fitness table. Two unique parents and one replacement tree (for crossover offspring) are chosen at
random from this non-uniform probability distribution which has a strong bias for selecting parents with a
high fitness as well as a strong bias towards selecting lower-fit individuals to be replaced by the offspring
of crossover. The beta distribution is parameterized with α = 3.0, and β = 0.5.
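A minimal sketch of this biased selection is given below. It assumes that ranks run from 0 (fittest) to size-1 (least fit) and that a Beta(3.0, 0.5) draw, which concentrates near 1, is mirrored to choose parents and used directly to choose the individual to be replaced. That mapping, and the names biased_rank and select, are assumptions made for illustration.

```python
import random

ALPHA, BETA = 3.0, 0.5  # beta distribution parameters from the text


def biased_rank(size, rng, prefer_fit):
    """Draw a rank in [0, size-1] from a beta distribution over the sorted table."""
    b = rng.betavariate(ALPHA, BETA)           # most of the mass lies near 1
    position = (1.0 - b) if prefer_fit else b  # mirror the draw for parents
    return min(int(position * size), size - 1)


def select(size, rng):
    """Return three distinct ranks: (parent 1, parent 2, individual to replace)."""
    p1 = biased_rank(size, rng, prefer_fit=True)
    p2 = p1
    while p2 == p1:
        p2 = biased_rank(size, rng, prefer_fit=True)
    victim = p1
    while victim in (p1, p2):
        victim = biased_rank(size, rng, prefer_fit=False)
    return p1, p2, victim


rng = random.Random(42)
print(select(30, rng))
```

With a population of 30, most parent ranks fall near the top of the table and most replaced ranks near the bottom, while every individual retains a nonzero chance of being chosen.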
Once the rank-based selection process chooses two unique parents and one child (also unique), these
individuals are then processed by the crossover operator. The GA performs a type of one-point crossover in
which a crossover point is chosen at random for one of the parent trees. In effect, a crossover point is
simply one of the n levels on the coalescing tree data structure. A second crossover point is chosen on the
second parent using a linear search for a compatible matching level. Two binary coalescing trees are not
always compatible for crossover. Preliminary analysis indicates that an incompatible selection occurs in
fewer than 5% of all cases, and that this rate appears to decrease slowly as a function of tree size.
Compatible parent trees are trees in which there exists some internal level at which the internal nodes of
the two trees can be entirely connected via edges in order to produce a viable child. Since the selection of
crossover points is stochastically driven, we arbitrarily attempt 20 times to identify possible crossover
points between two randomly selected parent trees before deciding that the trees are incompatible for
crossover. In the event that two trees are found to be incompatible, we select new parents and retry until
compatible trees are found.
The crossover process is shown in Figure 5 below. The lower portion of the first parent at Crossover Point
1 is added to the upper portion of the second parent at Crossover Point 2. This preserves all of the
lower-level node relationships below Crossover Point 1 from the first parent, while mixing with the preserved
upper-level sub-tree ordering specified in the second parent at and above Crossover Point 2. At the point at
which the two different trees now intersect in the child tree, edges between nodes are constructed in such a
way that the child tree always remains viable. Sometimes edges will need to be constructed at the interface
which did not exist at all in the previous tree. In this case, we repair the graft by randomly selecting a viable
node from the next level in the coalescing tree such that no terminal nodes are ultimately orphaned, and the
tree remains fully-connected and viable. Mutation of a child tree after crossover is a simple process of
selecting some connected node on the tree and changing its upper edge to connect to a randomly selected,
but viable node. In some cases, removing an edge from a node which is connected higher up in the tree
requires a recursive repair mechanism which traverses up the tree from the newly orphaned node, removing
all edges from the visited nodes until a node is found that has two connected nodes. Once all such edges are
removed, the tree is again viable.
Figure 5: The crossover of two compatible coalescing binary trees. Note that a graft repair was needed in this example in order to
prevent orphaning the leaf node SEQ_3
4.5 Algorithm Parameters
The GA has a steady-state population of 30 individual guide trees. The algorithm is iterative and not
generational, and iterates for a configurable number of times before halting. For the tests and experiments
conducted with this GA, we chose to terminate the algorithm arbitrarily at 10,000 iterations. The
number of iterations required to reach convergence on the globally optimal guide tree is a function of both
the number of sequences and the length of the sequences. In real-world applications of this GA approach, the
termination condition for the GA could be dynamically calculated at run-time as a function of average
sequence length and the number of sequences being aligned.
Since this is an iterative, steady-state GA, a single selection and crossover happens at each iteration (39).
However, some minority of individual trees may not be compatible for crossover, and so there is an
effective crossover probability, which is something slightly smaller than 100% probability of crossover. In
each iteration, each child tree from crossover has a 10% chance of incurring a single point mutation.
Examining the effects of manipulating the mutation rate is an area of future work.
Table 2: Concise summary of experimental GA parameters
GA Parameter               Value
Population Type            Steady-State
Population Size            30
Population Initialization  Bottom-up randomly generated, viable guide tree
Number of Iterations       10,000
Selection Type             Biased, Rank-Based using Beta probability distribution
Crossover Type             Branch Swapping on Coalescing Binary Tree
Crossover Rate             0.9
Mutation Type              Random intra-tree, same-level branch migration
Random Number Generator    R250 from GNU Scientific Library (available at http://www.gnu.org/software/gsl)
In addition to the GA parameters, alignment parameters were also provided. As mentioned previously,
dynamic programming algorithms were used to align sequences and alignments based on the evolved guide
trees. The alignments and alignment scores are produced in the context of an evolutionary model which
takes the form of a scoring system that specifies the penalties for opening new gaps in an alignment or
extending an open series of gaps. The rationale for this distinction is that it should be more expensive to
open a new region of gaps than to extend an existing gap region. In addition, different scores are assigned
to residue matches and mismatches in an alignment. The scoring system used in this GA implements affine gap
penalties, and is fully parameterized as shown in Table 3. When comparing against Clustal W, the same scoring
system was used in order to compare results between our GA and Clustal W more directly.
Table 3: The GA alignment scoring system with affine gap penalties
Score Type                 Value
Gap opening penalty        -5.0
Gap extension penalty      -1.0
Nucleotide match score     10.0
Nucleotide mismatch score  2.0
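To make the affine scoring concrete, the sketch below scores two already-aligned rows with the values of Table 3. It assumes that the first column of a gap run is charged the opening penalty and each further column of the same run the extension penalty; other conventions charge both on the first column. The function name score_pair is hypothetical.

```python
GAP_OPEN, GAP_EXTEND = -5.0, -1.0  # affine gap penalties from Table 3
MATCH, MISMATCH = 10.0, 2.0        # nucleotide scores from Table 3


def score_pair(row_a, row_b):
    """Score two aligned rows of equal length under affine gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    open_side = None  # which row holds the gap run in progress, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column of two gaps carries no information
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            score += GAP_EXTEND if open_side == side else GAP_OPEN
            open_side = side
        else:
            score += MATCH if x == y else MISMATCH
            open_side = None
    return score


print(score_pair("AC--T", "ACGGT"))  # one gap run of length two
print(score_pair("A-C-T", "AGCGT"))  # two separate gap runs of length one
```

Under this convention a single run of two gap columns costs 6.0, whereas two separate runs cost 10.0, which is the intended preference for extending over opening.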
4.6 Experimental Results
All experimental runs were conducted on a workstation with a 1GHz Intel Pentium-III CPU, 640MB of
RAM, running Redhat Linux version 7.2. Over 30 tests were run, with each test working against a different
input sequence file which contained different numbers of sequences and/or different sequence lengths. The
input sequences for all of the experimental runs were constructed using the Jukes-Cantor sequence
generation mechanism described in section 2.5. All generated sequences are DNA nucleotide sequences,
although protein alignment works equally well.
For each experiment, alignments were performed both with our GA as well as with Clustal W (v1.82)
(31,32,33). Performance, in terms of both efficiency and apparent alignment quality, is summarized for
several of our experimental runs. In order to more accurately compare the results of Clustal W to our GA,
we attempted to identically parameterize each system. In order to do this, we identically configured affine
gap penalties and substitution costs, and disabled the delayed alignment of divergent sequences in Clustal
W. All of the experimental runs produced similar overall results for all of the input sequences, regardless of
the number of sequences being aligned, and the results presented in this section were chosen as
representative results for the entire experimental evaluation of our system. It is important to note that our
approach is still fundamentally a progressive alignment algorithm. Therefore, it suffers from the same
problems of progressive alignment mentioned earlier. To overcome this difficulty, the next section proposes a
different EC technique that optimizes the alignment on all sequences simultaneously.
[Figure 6 plot. Title: 20 Sequences, 100 bp in Length. Y-axis: Fitness, from 11.23 to 11.275. X-axis: Iterations x 10, from 1 to 901.]
Figure 6: The fitness trend of the fittest individual across 10,000 generations when evolving a guide tree to align 20 sequences of
length 100 bps
Figure 7: Visual comparison of the alignment produced by the GA (a) and the alignment produced by Clustal W (b). Although
extremely similar, the GA alignment is slightly better. For example, the alignment produced by Clustal W has one additional column,
indicating that more gaps were used to construct the alignment. The slight improvement in alignment quality is also apparent on a
closer visual inspection
5. Evolving Consensus Sequence with a Genetic Algorithm
Here, we present a second novel approach where a consensus sequence is evolved with a genetic algorithm.
This optimized consensus sequence can then be translated into an alignment.
5.1 Consensus Sequence
The consensus sequence is the most interesting and important feature of this GA approach. It is essentially
a compact formulation to represent all possible alignments for virtually any given number of sequences (6,
18). The consensus sequence borrows the idea from biology that sometimes it is necessary for certain
positions in a sequence to be made ambiguous when some residues simply cannot be resolved during
laboratory experiments. A sequence with ambiguity codes is actually a mix of sequences, each having one
of the nucleotides defined by the ambiguity at that position. For example, if an R is encountered in the
sequence, then the sequences in the assortment will have either an adenine or a guanine at that position.
The ambiguity enables conserved sequences to be condensed into one single representation. Figure 8 lists
the most commonly used ambiguity codes defined by the Nomenclature Committee of the International
Union of Biochemistry (IUB). The consensus sequence in essence is a condensed sequence with ambiguity
codes that shows what nucleotides are allowed in each column.
Figure 8: The most commonly used DNA ambiguity codes are defined by the International Union of Biochemistry (IUB). The
presence of ambiguity generally indicates that some residues cannot be resolved during the laboratory experiments. Ambiguity codes
also enable sequences to be represented in a more condensed form
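The way a consensus sequence condenses an alignment can be illustrated with the short sketch below, which maps the set of nucleotides observed in each column to its standard IUB ambiguity code. The function name consensus and the choice to emit a gap character for an all-gap column are assumptions made for this example.

```python
# Standard IUB ambiguity codes for sets of DNA nucleotides.
IUB_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(rows):
    """Condense aligned rows into one sequence of IUB codes, column by column."""
    codes = []
    for column in zip(*rows):
        bases = frozenset(c for c in column if c != "-")
        codes.append(IUB_CODES.get(bases, "-"))  # all-gap column stays a gap
    return "".join(codes)


print(consensus(["ACGT", "GCGA", "ACCT"]))
```

In the first column the rows contain A and G, so the consensus carries R, matching the adenine or guanine example in the text.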
5.2 Design of Encoding Scheme
To further enhance the design, our chromosomes are broken into four pieces according to the nucleotide
they represent. In other words, our GA uses four parallel chromosomes to represent four different
nucleotides. Each chromosome encodes the relative occurrences and locations of a nucleotide and is only
evolved with the chromosome that encodes the same nucleotide. The fitness of the full set of chromosomes is
determined by how well they fit together to derive the final alignment. The geometry of the parallel
chromosomes is very similar to a four-dimensional hyperplane. Each dimension is evolved and optimized
separately and independently. Figure 9 shows a hypothetical sequence and demonstrates how the procedure
works.
Figure 9: Sequences are split up into four subsequences according to the nucleotides. Each subsequence only encodes the information
where such a nucleotide can be found. Since each nucleotide is individually encoded, any possible ambiguities can be fully
represented. Further, chromosomes are represented as binary strings, where 1 signals the existence of the relevant nucleotide at the
given location while 0 signals the absence
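The splitting into four parallel binary chromosomes can be sketched as follows. The input is assumed to be a list of columns, each given as a string of the nucleotides allowed at that position, so that ambiguities such as "AG" are represented directly. The function name split_chromosomes is hypothetical.

```python
NUCLEOTIDES = "ACGT"


def split_chromosomes(columns):
    """Return one bit list per nucleotide; 1 marks columns where it is allowed."""
    return {
        nucleotide: [1 if nucleotide in column else 0 for column in columns]
        for nucleotide in NUCLEOTIDES
    }


# Second column is ambiguous: either A or G is allowed there.
chromosomes = split_chromosomes(["A", "AG", "C", "T"])
for nucleotide, bits in chromosomes.items():
    print(nucleotide, "".join(map(str, bits)))
```

Because each nucleotide has its own bit string, any of the 16 combinations of allowed nucleotides at a column, including none at all, can be expressed.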
The length of chromosomes is difficult to determine precisely. It depends on the evolutionary model and
the similarity of the sequences. If sequences are highly conserved, chromosomes can be relatively short.
Intuitively, any two randomly generated sequences are expected to have about 25% similarity. For this study, we
assumed that sequences must have at least 50% similarity to be biologically significant. Therefore, we arbitrarily
defined the length of the chromosome to be 1.5 times the length of the longest sequence. For most of our
studies, this assumption worked well. Furthermore, for implementation convenience, each chromosome is
further divided into smaller blocks called loci. Biologically, a locus is a block of alleles where genes can be
found. The length of a locus is determined by the length of an integer on the hardware platform. The
number of loci depends on the length of the chromosome.
5.3 Crossover and Mutation Operations
For this research, we implemented a simple one-point crossover. A point is randomly selected in each locus
for each chromosome and alleles are exchanged between two parent chromosomes to form an offspring. An
offspring is produced at each generation and then competes with the population. Since each chromosome
has a separate string for each of the four nucleotides, the crossover points are chosen separately. Furthermore, we
have implemented a bias function for selecting the parent chromosomes from the entire population. The
bias function is like an “unfair” random number generator. It is essentially a quadratic equation that
randomly generates a series of numbers with a bias toward the lower indexes. Since the initial population has
been sorted in descending order according to the fitness values, the individuals with
higher fitness values are more likely to be selected. Mutation is an important operator that prevents the
population from stagnating at local optima (38). In our implementation, the mutation is only applied to the
newly created offspring chromosomes. The GA first calculates the expected number of mutations for each
locus in the chromosomes with a random factor. It then iteratively picks random locations on each locus for
each of the four parallel chromosomes and changes the alleles. The mutation operator randomly flips the
alleles independently on each locus with the binary XOR operator. It inverts the alleles from 0 to 1 or 1 to
0.
Figure 10: The crossover operator retrieves the alleles from two parent chromosomes to create an offspring. A point is randomly
selected on each locus for each chromosome. The offspring receives approximately half of its alleles from each parent. The length of a
locus is 32 bits, which corresponds to the length of an integer on our machines
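The locus-level operators can be sketched as follows, with each locus held as a 32-bit integer. The quadratic form used in biased_index (the population size times the square of a uniform random number) is one plausible reading of the bias function and is an assumption, as are all function names.

```python
import random

LOCUS_BITS = 32
LOCUS_MASK = (1 << LOCUS_BITS) - 1


def crossover_locus(a, b, rng):
    """One-point crossover in a locus: high bits from a, low bits from b."""
    point = rng.randrange(1, LOCUS_BITS)  # 1..31, so both parents contribute
    low_mask = (1 << point) - 1
    return (a & ~low_mask & LOCUS_MASK) | (b & low_mask)


def crossover(parent_a, parent_b, rng):
    """Cross each locus of each nucleotide chromosome with its own random point."""
    return {
        nucleotide: [
            crossover_locus(x, y, rng)
            for x, y in zip(parent_a[nucleotide], parent_b[nucleotide])
        ]
        for nucleotide in parent_a
    }


def mutate_locus(locus, flips, rng):
    """Flip randomly chosen alleles with XOR, turning 0 into 1 or 1 into 0."""
    for _ in range(flips):
        locus ^= 1 << rng.randrange(LOCUS_BITS)
    return locus


def biased_index(size, rng):
    """Quadratic bias toward low indexes of a table sorted by descending fitness."""
    return int(size * rng.random() ** 2)


rng = random.Random(7)
child = crossover({"A": [LOCUS_MASK]}, {"A": [0]}, rng)
print(format(child["A"][0], "032b"))
```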
5.4 Objective Function
The objective function measures the quality of an MSA. Ideally, the better the score, the more
biologically relevant the multiple alignment is. The substitution costs are evaluated using a predefined
substitution matrix. The matrix scores every possible substitution or conservation according to its
biological likelihood. We used the nucleic scoring matrix defined by IUB, in which each match receives 10.0
points and each mismatch 0. The gap penalty is 10.0 for opening and 0.2 for extending a gap. The alignment
with the highest score is considered a potentially optimal solution. In addition, for chromosomes that do not
include all nucleotides, the objective function reduces the fitness value by the mismatch and gap scores
multiplied by the number of nucleotides that are missing from the alignment. Our experiments
have confirmed that this strategy worked quite well. The calculation of SP scores for N sequences takes
O(M×N^2) time (7,36), where M is the average length of the sequences. To further improve the GA
performance, we have devised a sequence profiling technique that simplifies the calculation of SP scores.
The objective function first computes the profile of the sequences for each column. A profile is simply the
occurrences or frequencies of each nucleotide. The profiling process accumulates the occurrences of each
nucleotide and reduces the calculations of SP scores into three smaller tasks. Matches are only possible
when two identical nucleotides are aligned together. Therefore, the matching score is simply the sum of all
possible combinations of the same nucleotides multiplied by the match score in the substitution matrix. The
number of mismatches, on the other hand, is the sum of all the combinations of two different nucleotides.
There are only six such combinations. The number of gap alignments is derived from the sum of the
frequencies of each nucleotide multiplied by the number of gaps. The result is then multiplied by the gap
penalty to obtain the overall gap score.
Figure 11: The figure shows the sequence profile for the isolated column. The profiling process simply accumulates the frequencies
or occurrences of each nucleotide on a given column. The process simplifies the calculations of the SP scores into three smaller tasks
and reduces the complexity. Matches are only possible when two identical nucleotides are aligned together. Therefore, the score for
matches is the sum of all possible combinations of identical nucleotides multiplied by the matching scores S(i, i) from the substitution
matrix. The number of mismatches is the sum of all combinations between two different nucleotides. There are only six such
combinations. The gap penalties are the sum of all arrangements between each nucleotide and gaps
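The profiling technique can be checked against a direct pairwise computation, as in the sketch below. For simplicity it assumes a single penalty for each nucleotide-gap pair and scores gap-gap pairs as zero, so it does not model the distinction between opening and extending a gap. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def column_sp_profile(column, match, mismatch, gap):
    """SP score of one column computed from nucleotide counts (the profile)."""
    counts = Counter(column)
    gaps = counts.pop("-", 0)
    freqs = list(counts.values())
    n_match = sum(f * (f - 1) // 2 for f in freqs)           # identical pairs
    n_mismatch = sum(f * g for f, g in combinations(freqs, 2))  # differing pairs
    n_gap = sum(freqs) * gaps                                  # nucleotide-gap pairs
    return n_match * match + n_mismatch * mismatch + n_gap * gap


def column_sp_pairwise(column, match, mismatch, gap):
    """Reference SP score of one column by enumerating every pair of rows."""
    score = 0.0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            score += gap
        else:
            score += match if x == y else mismatch
    return score


column = "AACG-T-"
print(column_sp_profile(column, 10.0, 0.0, -10.0))
print(column_sp_pairwise(column, 10.0, 0.0, -10.0))
```

The profile version touches each row once to build the counts and then works only with at most four frequencies, instead of visiting all pairs of rows.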
5.5 Alignment Construction
The construction of the final alignment is very similar to that of the DP algorithm. The GA derives the
alignment from the last nucleotide to the first and the chromosomes are accordingly decoded backward. If a
nucleotide is permitted at a given column, then it is consumed and added in the final alignment. The
process moves on to the preceding ones. Otherwise a gap is inserted into the alignment. If no nucleotides
are ever used in the column, then the allele is skipped. Alleles that are not used to derive the final alignment
are considered the non-coding regions. One of the very interesting and important features of the encoding
scheme is that the alleles that are used to derive the alignment do not have to be consecutive. In addition,
two different chromosomes can potentially give the same alignments. In other words, the alignment construction process picks the appropriate alleles as it moves along. The chromosomes do not have to encode
exact bit patterns for the alignments. This makes every allele in the chromosome a potential solution for the
alignment. Experiments have confirmed that the GA discovered the optimal alignment, as defined by the
substitution matrix, quickly and effectively. The alignments produced by the GA are at least as good as the
ones obtained from Clustal W. If the GA is allowed to continue evolving, better alignments are very likely
to be found. As the number of sequences increases, the effectiveness of the GA begins to surface. Our
experiments have shown that, regardless of the number of sequences being aligned, the GA performed
extremely well and produced alignments with competitive scores.
Figure 12: The figure shows how the final alignment is constructed from the chromosomes. The alignment is derived in reverse order.
The spaces in between are the alleles that did not match up any of the nucleotides in the sequences. If a particular nucleotide is not
present in the sequence, a gap is inserted. Chromosomes do not have to encode the exact bit patterns for the alignments. The alignment
process simply picks the “appropriate” alleles
5.6 Experiment Design
The objective of our experiments was to demonstrate that the GA could scale better as well as produce
competitive alignments. For the purpose of this study, we treated Clustal W as the standard method for
multiple sequence alignment. Therefore, sequences were first aligned with Clustal W and the resulting scores were
used as the stopping condition for the GA. We applied the standard IUB nucleic scoring matrix and used
gap penalties identical to those of Clustal W. For this study, we generated sequences of various lengths
with approximately 50% similarity. The mutation rate was 0.0625 and the average expected
number of mutations on each locus was about one. The GA began with a randomly generated population of
64 individuals. The population was first evaluated and sorted in the descending order according to the
fitness values. At each generation, the bias function randomly picked two individuals from the population
that served as the parent chromosomes. The crossover operator exchanged alleles from two parent
chromosomes and created an offspring. Mutation was applied to the offspring repeatedly until the fitness
was higher than both parent chromosomes. The offspring then competed with the entire population and
removed the individual with the lowest fitness. We gradually increased the number of sequences in each
trial. Due to the stochastic nature of GA, all trials were performed at least three times in order to obtain
more reliable results.
5.7 Results and Discussions
Experiments show interesting results for our GA approach. In most cases, the GA outperformed Clustal W
and produced alignments with higher SP scores. The number of generations needed to find the optimal
solutions remained approximately the same even though the number of sequences increased. This is a tribute
to the fact that the GA was able to utilize the guided search effectively and found the optimal alignments.
Furthermore, experiments have confirmed that the GA scales well with respect to the number of sequences. We
believe that the objective function does not have sufficient granularity to guide the GA in finding the optimal
solution. The increased number of sequences does not impact the performance, but the length does. For
sequences that average 60 base pairs in length, the GA converges stably and quickly to the optimal
alignment (as derived from Clustal W). The performance, however, began to fluctuate as the length of
the sequences increased. For 60 base-pair-long sequences, for example, the search
space is 16^60, or approximately 1.767×10^72. For 100 base-pair-long sequences, the search space increases
exponentially. Specifically, it is 1.46×10^48 times larger than that of the 60 base-pair-long sequences. We
have noticed that the GA became slow to converge on long sequences. In addition, we have confirmed that the
SP scoring function was never a good measurement for MSA. If the gap is not heavily penalized, the same
score can be easily achieved with more matches but excessive amounts of gaps. The relative difference in
score between the correct and incorrect alignments decreases as the number of sequences increases. Clearly
this is very counter-intuitive and not realistic. The difference should increase when more sequences are
introduced into the alignment (7).
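The search space figures quoted in this section can be reproduced with a few lines of arithmetic. The sketch assumes 16 possible states per column, which corresponds to the four binary alleles (one per nucleotide) at each position; the function name search_space is hypothetical.

```python
STATES_PER_COLUMN = 2 ** 4  # one binary allele for each of the four nucleotides


def search_space(length):
    """Number of distinct chromosome settings for a given number of columns."""
    return STATES_PER_COLUMN ** length


small = search_space(60)
large = search_space(100)
print(f"{float(small):.3e}")           # size for 60 base pairs
print(f"{float(large // small):.2e}")  # growth factor from 60 to 100 base pairs
```

Each additional column multiplies the search space by 16, so the 40 extra columns between the two cases account for the full factor of 16^40.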
Table 5: The following tables summarize the numbers of generations needed to find the optimal alignments, at least as good as Clustal
W, for various lengths of sequences
60 base pairs
Trial 1
Trial 2
Trial 3
70 base pairs
Trial 1
Trial 2
Trial 3
100 base pairs
Trial 1
Trial 2
Trial 3
10
54,627
48,408
59,044
Number of Sequences
25
50
75
48,470
41,697
57,705
53,502
55,387
57,642
46,543
54,028
53,447
100
55,495
51,763
59,955
162,982
164,209
131,512
159,427
156,715
134,724
167,843
184,960
176,781
165,949
186,131
108,757
180,613
175,887
154,084
1,254,233
1,197,003
1,124,561
1,298,503
1,432,034
1,090,024
1,178,904
1,470,175
1,373,209
1,119,092
1,405,332
1,181,392
1,135,987
1,157,061
1,219,451
Figure 13: The first figure (a) shows the alignment generated from the GA and the second (b) from Clustal W. (a) is about 3 base
pairs longer than (b) but it detects more conserved regions. Visual inspection of the beginning of (b) immediately shows that
the alignment is not optimal. Clustal W typically requires human intervention in order to produce an alignment with a higher score
6. Conclusion
We have presented novel and highly effective approaches to applying EC to the problem of MSA. In
addition, we devised a novel use of binary coalescent trees and consensus sequences. The binary coalescent
tree functions as a data structure which lends itself nicely to the problem of performing crossover on two
trees without duplicating terminal nodes. The consensus sequence approach enables the simultaneous
optimization of alignment on many sequences. In our experiments, we have demonstrated that our
approaches produced more refined alignments than Clustal W. We hypothesize that the first approach can
produce better alignments than Clustal W while also being more scalable than Clustal W for large datasets
(i.e. 500 sequences of length 2500 or larger). Additional testing needs to be done in order to evaluate this
hypothesis and determine the precise inflection point at which stochastic optimization techniques such as
genetic algorithms surpass Clustal W in both alignment quality and scalability. The second approach
overcomes the problems of progressive alignment algorithms and permits optimization of the alignments on
all sequences simultaneously. The experiments have confirmed that the GAs perform and scale better than
the heuristic techniques.
7. Future Work
For future work, we would like to extend this approach to align protein sequences and implement statistical
scoring techniques. Once this GA supports the alignment of amino acid sequences, we intend to perform
benchmarked comparisons against other techniques using the BAliBASE (34, 35) alignment database. We
are currently investigating an approach that incorporates several statistical or simulation techniques with
phylogeny in order to better quantify the significance of alignments. In addition, we intend to study the
parameterization of our GA in great detail to determine how best to converge to a near-optimal guide tree
in the minimum number of iterations. The effects of modifying tunable parameters such as population size
and mutation rate on solution convergence rates and solution quality will be fully explored. In addition, the
problem of sequence alignment and GA can be easily parallelized. Therefore, we intend to parallelize our
fitness functions in each approach such that they can be efficiently handled in parallel on a Beowulf cluster.
Acknowledgements
This research used equipment funded in part by NIH NCRR 1P20 RR16448, and NIH NCRR 1P20
RR16454. Shyu was partially funded by a grant from Procter & Gamble and Sheneman was partially
funded by NIH NCRR 1P20 RR16448. Foster was partially funded for this research by NIH NCRR 1P20
RR16448. The authors would like to thank Jason Evans for helpful reviews and comments.
References
1. S. F. Altschul, “Amino acid substitution matrices from an information theoretic perspective”. J. of
Mol. Biol., vol. 219, 1991, pp. 555-565.
2. S. F. Altschul, R. J. Carroll, and D. Lipman “Weights for data related by a tree”. J. of Mol. Biol., vol.
207, 1989, pp. 647-653.
3. S. F. Altschul, and D. Lipman, “Trees, stars, and multiple sequence alignment”. SIAM J. of Appl.
Math., vol. 49, 1989, pp. 197-209.
4. H. Carrillo, and D. Lipman, “The multiple sequence alignment problem in biology”. SIAM J. of Appl.
Math., vol. 48, 1988, pp. 1073-1082.
5. S. B. Carroll, J. K. Grenier, and S. D. Weatherbee, From DNA to diversity: molecular genetics and the
evolution of animal design, Malden, MA: Blackwell Science, 2001.
6. W. H. Day, and F. R. McMorris, “The computation of consensus patterns in DNA sequence”.
Mathematical and Computational Model. Vol. 17, 1993, pp. 49-52.
7. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of
proteins and nucleic acids. Cambridge, UK: Cambridge University, 1998.
8. D. Feng, and R. F. Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic
trees”, J. of Mol. Evol., vol. 25, 1987, pp. 351-360.
9. D. B. Fogel, and D. W. Corne (Eds.), Evolutionary Computation in Bioinformatics, San
Francisco, CA: Morgan Kaufmann Publishers, 2003.
10. D. Graur, and W. H. Li, Fundamentals of Molecular Evolution (2nd ed), Sunderland, MA: Sinauer
Associates, 2002.
11. D. Gusfield, Algorithms on strings, trees and sequences: computer science and computational biology,
New York, NY: Cambridge University Press, 1997.
12. B. G. Hall, Phylogenetic trees made easy: a how-to manual for molecular biologists, Sunderland, MA:
Sinauer Associates, 1997.
13. J. H. Holland, Adaptation in natural and artificial systems, Ann Arbor: University of Michigan Press,
1975.
14. J. T. Horng, C. M. Lin, B. J. Liu, and C. Y. Lao, “Using genetic algorithm to solve multiple sequence
alignment”, in E. Wingender, et al (eds.) Proc. Germ. Conf. on Bioinfo., 2001, pp. 883-890.
15. M. Isokawa, M. Wayama, and T. Shimizu, “Multiple sequence alignment using a genetic algorithm”,
Genome Informatics, vol. 7, 1997, pp. 176-177.
16. T. H. Jukes, and C. Cantor, Evolution of protein molecules. Mammalian Protein Metabolism (ed.), M.
N. Munro, 1969, p. 21-132. New York: Academic Press.
17. K. Karadimitriou, and D. H. Kraft, “Genetic algorithms and the multiple sequence alignment program
in biology”, in T. R. Tiersch, et al (eds.) Proc. 2nd Ann. Baton Rouge Area Mole. Biol. and Biotec.
Conf., 1996.
18. J. M. Keith, P. Adams, D. Bryant, D. P. Kroese, K. R. Mitchelson, D. A. E. Cochran, and G. H. Lala,
“A simulated annealing algorithm for finding consensus sequences”, Bioinformatics, vol. 18, no.11,
2002, pp. 1494-1499.
19. B. Morgenstern, “DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence
alignment”, Bioinformatics, vol. 15, no. 3, 1999, pp. 211-8.
20. S. B. Needleman, and C. D. Wunsch, “A general method applicable to the search for similarities in the
amino acid sequence of two proteins”, J. of Mol. Biol., vol. 48, 1970, pp. 443-453.
21. C. Notredame, and D. G. Higgins, “SAGA: sequence alignment by genetic algorithm”, Nucleic
Acids Research, vol. 24, no. 8, 1996, pp. 1515-1524.
22. C. N. S. Pedersen. Algorithms in Computational Biology. Ph.D. dissertation. BRICS, Department of
Computer Science, University of Aarhus, Aarhus, Denmark. March 2000.
23. A. Rambaut and N. C. Grassly, “Seq-Gen: An application for the Monte Carlo simulation of DNA
sequence evolution along phylogenetic trees”, Comput. Appl. Biosci., vol. 13, 1997, pp. 235-238.
24. N. Saitou and M. Nei, “The neighbor-joining method: a new method for reconstructing phylogenetic
trees”, Molecular Biology and Evolution, vol. 4, no. 4, 1987, pp. 406-425.
25. J. Sauder, J. Arthur, and R. Dunbrack, “Large-scale comparison of protein sequence alignment
algorithms with structure alignments”, Proteins: Structure, Function, and Genetics, vol. 40, 2000, pp. 6-22.
26. S. R. Eddy, “A memory-efficient dynamic programming algorithm for optimal alignment of a sequence
to an RNA secondary structure”, BMC Bioinformatics, vol. 3, 2002, pp. 13.
27. J. Setubal and J. Meidanis. Introduction to computational molecular biology. Boston, MA: PWS
Publishing, 1997.
28. T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences”, J. of Mol. Biol.,
vol. 147, 1981, pp. 195-197.
29. D. J. States, W. Gish, and S. F. Altschul, “Improved sensitivity of nucleic acid database searches using
application-specific scoring matrices”, Methods: A Companion to Methods in Enzymology, vol. 3, no.
1, 1997, pp. 66-70.
30. J. Stoye, S. W. Perrey, and A. W. M. Dress, “Improving the divide-and-conquer approach to
sum-of-pairs multiple sequence alignment”, Applied Mathematics Letters, vol. 10, no. 2, 1997, pp. 67-73.
31. J. Studier and K. Keppler, “A note on the neighbor-joining algorithm of Saitou and Nei”, Molecular
Biology and Evolution, vol. 5, 1988, pp. 729-731.
32. J. D. Thompson, T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins, “The Clustal X
windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools”,
Nucleic Acids Research, vol. 25, no. 24, 1997, pp. 4876-4882.
33. J. D. Thompson, D. G. Higgins, and T. J. Gibson, “Clustal W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position specific gap penalties
and weight matrix choice”, Nucleic Acids Research, vol. 22, 1994, pp. 4673-4680.
34. J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence
alignment programs”. Nucleic Acids Research, vol. 27, no. 13, 1999, pp. 2682-2690.
35. J. D. Thompson, F. Plewniak, and O. Poch, “BAliBASE: a benchmark alignment database for the
evaluation of multiple alignment programs”, Bioinformatics, vol. 15, no. 1, 1999, pp. 87-88.
36. L. Wang and T. Jiang, “On the complexity of multiple sequence alignment”. J. of Compu. Biol., vol. 1,
1994, pp. 337-348.
37. M. S. Waterman and M. Eggert, “A new algorithm for best subsequence alignments with application
to tRNA-rRNA comparisons”. J. of Mol. Biol., vol. 197, 1987, pp. 723-725.
38. M. Wayama, K. Takahashi, and T. Shimizu, “An approach to amino acid sequence alignment using a
genetic algorithm”. Genome Informatics, vol. 6, 1995, pp. 122-123.
39. D. Whitley, “A genetic algorithm tutorial”. Statistics and Computing, vol. 4, 1994, pp. 65-85.
40. C. Zhang and A. K. Wong, “A genetic algorithm for multiple molecular sequence alignment”,
Comput. Appl. Biosci., vol. 13, no. 6, 1997, pp. 565-581.