Multiple Sequence Alignment with Evolutionary Computation

Conrad Shyu, Luke Sheneman, James A. Foster
Initiatives for Bioinformatics and Evolutionary Studies (IBEST)
Department of Bioinformatics and Computational Biology
University of Idaho, Moscow, Idaho 83844-1010, USA
+1 208.885.7062

Abstract. In this paper we provide a brief review of current work in the area of multiple sequence alignment (MSA) for DNA and protein sequences using evolutionary computation (EC). We detail the strengths and weaknesses of EC techniques for MSA. In addition, we present two novel approaches for inferring MSA using genetic algorithms (GAs). Our first approach uses a GA to evolve an optimal guide tree in a progressive alignment algorithm and serves as an alternative to more traditional heuristic techniques such as neighbor-joining. The second approach optimizes a consensus sequence with a GA using a vertically scalable encoding scheme in which the number of iterations needed to find the optimal solution is approximately the same regardless of the number of sequences being aligned. We compare both approaches to the popular progressive alignment program Clustal W. Our experiments indicate that EC constitutes an attractive and promising alternative to traditional heuristic algorithms for MSA.

Keywords: multiple sequence alignment, genetic algorithm, progressive alignments, DNA sequences

1. Introduction

Living things diverge from common ancestors through changes in deoxyribonucleic acid (DNA) and millions of years of evolution (5). DNA plays a fundamental role in the processes of life. DNA contains the template for the synthesis of proteins, which are crucial molecules for living systems. Moreover, DNA is essential to life because it functions as a medium to transmit information from one generation to another (10).
The most important regions in DNA are generally conserved to ensure survival. Sequence alignment is commonly used to detect and quantify similarities in DNA or protein sequences. Alignments of biological sequences generated by computational algorithms are routinely used as a basis for inference about sequences whose structures or functions are not well known. The most common approach is to find the best-scoring alignment between a pair of sequences, where the alignment score is a measure of the edit distance between the sequences in the context of a particular evolutionary model. An evolutionary model can be represented as a scoring system which penalizes substitutions and gaps (5, 7). The best-scoring (optimal) alignment can be found through the use of dynamic programming (DP) algorithms such as the Smith-Waterman (28, 37) and Needleman-Wunsch algorithms (20). However, the cost of DP grows exponentially with the number of sequences being aligned. Specifically, computing an optimal multiple sequence alignment (MSA) has been shown to be NP-hard (36). Several heuristic approaches, such as Clustal W (32, 33, 34), are frequently used to quickly approximate optimal alignments.

In this paper, we briefly review the current work in sequence alignment with evolutionary computation (EC). In addition, we present two novel approaches that use EC to optimize multiple alignments. Our first new approach employs a steady-state GA (13, 39) to evolve guide trees, which are a fundamental component of progressive alignment algorithms (8). The population in the GA consists of viable guide trees that are represented in an efficient, coalescing binary tree structure. This enables fast and meaningful crossover and mutation. Variability operators such as crossover and mutation are constructed such that the viability of an individual tree is never compromised.
Fitness is objectively computed by performing the progressive alignment in the pairwise ordering specified by the guide trees in the population. The fitness of an individual tree is computed as the natural log of the alignment score of the final alignment produced by performing the progressive alignment in the order specified by that tree. In this way, the fitness of a guide tree is optimized only with respect to the most important result: the quality of the final multiple sequence alignment.

The second MSA approach optimizes a consensus sequence (6) with a GA, using an encoding scheme designed so that the search complexity is independent of the number of sequences being aligned. The search complexity of this approach depends primarily on the length of the consensus sequence and the degree of similarity between sequences. The scheme encodes each possible matching nucleotide at a given column with binary masks. This compact representation greatly reduces the space requirement as well as the search complexity. The objective (evaluation) function uses the sum-of-pairs (SP) (4) score to determine the fitness of each chromosome in the population. The SP score has been widely used to detect and quantify similarities between sequences; however, it does not provide any probabilistic or biological justification (7). To further improve the performance of the GA, we have developed a sequence profiling formulation that reduces the complexity of calculating the SP scores.

2. Sequence Alignment

There are diverse motivations behind the alignment of biological sequences. Genetic sequences are inherited from common ancestors through millions of years of evolution. Therefore, it is of interest to trace the evolutionary history of mutation and other evolutionary changes through sequencing (1, 5). Alignment of biological sequences, in this context, is generally understood as a comparison based on the criteria of evolution.
For example, the number of mutations, insertions, and deletions of residues necessary to transform one DNA sequence into another is a measure of phylogeny or evolutionary relatedness. On the other hand, a comparison may pinpoint regions of common origin, which may in turn coincide with regions of similar structure or function (10).

A pairwise sequence alignment is a technique for arranging two sequences so that the residues in certain positions are deemed to have a common evolutionary origin. In other words, if the same residue occurs in both sequences at the same position, then it may have been conserved during the course of evolution. If, however, two residues differ, then it is generally assumed that they may have been derived from a common ancestor. Homologous sequences, those related by common descent, might have different lengths, which is generally explained through insertions or deletions (27).

Statistical approaches, such as hidden Markov models, have been commonly used to detect homologous sequences and subsequently infer the alignments (7, 22). A hidden Markov model consists of a set of states connected by probabilistic transitions. Each transition indicates the probability of moving from one state to another. The transition structure consists of repeated elements of match, insert, and silent delete states. The number of repeated elements is the length of the model. Each element of match, insert, and delete states models a position in the consensus sequence of the sequence family and describes sequence homology.

Another commonly used approach is dynamic programming. Dynamic programming is a mathematically rigorous technique because it is guaranteed to find the optimal alignment (26). MSA is simply an extension of pairwise sequence alignment. MSA is the process of aligning three or more sequences simultaneously to bring as many similar residues into register as possible (4, 25).
The resulting alignments are commonly interpreted in two contexts: (a) to find regions that define a conserved pattern or domain; and (b) to derive the possible phylogeny or evolutionary relationships among the sequences (12). The presence of similar domains across multiple sequences implies a similar biochemical function or higher-level structure that may be used as the basis for further experimental investigation.

2.1 Dynamic Programming

DP is a commonly used recurrence method for solving sequential or multi-stage decision problems (11, 22). The essence of DP is the principle of optimality. DP has long been used to solve varieties of discrete optimization problems such as scheduling, string-editing, packaging, and inventory management (11). It views a problem as a set of interdependent sub-problems, solves these sub-problems, and uses the results to solve ever-larger sub-problems. The solution to a sub-problem is expressed as a function of solutions to one or more sub-problems at the preceding levels. DP expresses the problem in a recurrence formulation. To make optimal decisions for the next and all future states, DP only needs to know the current state and the state of its immediate predecessors. This is also known as the Markovian property (7). For a process to be Markovian, future states must depend only on the present state, and the past should not have any effect on the future. The term programming in the name refers to the mathematical rules that can be followed to solve a problem; it has nothing to do with writing a computer program.

DP is known to be an efficient technique for solving certain combinatorial problems. It is particularly important in bioinformatics (27) as it is the basis of sequence alignments for comparing DNA and protein sequences. The recurrence equation (Eq. 1) is applied repeatedly to fill the matrix of F(i, j) values. This particular formulation gives the global alignment of two sequences.
F(i, j) is the maximum of three candidate values derived from F(i-1, j-1), F(i-1, j), and F(i, j-1). The value s(x_i, y_j) is the score for aligning the characters x_i and y_j, while d is the penalty for gap insertion. For pairwise sequence alignments, DP begins with the construction of an alignment matrix F(i, j) with the indexes (i, j) for the two sequences S_x and S_y. The matrix is first initialized with F(0, 0) = 0. The value of F(i, j) is the score of the best alignment from the first character x_1 to the character x_i of sequence S_x and the first character y_1 to the character y_j of S_y. There are three possible ways that x_i and y_j can be aligned: (a) x_i can align with y_j, which gives a match or mismatch; (b) x_i is aligned with a gap; or (c) y_j is aligned with a gap. Since the matrix is built recursively, in order to calculate F(i, j), the previous states F(i-1, j-1), F(i-1, j), and F(i, j-1) must be known beforehand. The following equation shows the recurrence formulation of DP for sequence alignment.

$$
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j), \\
F(i-1, j) - d, \\
F(i, j-1) - d.
\end{cases}
\qquad \text{(Eq. 1)}
$$

Simultaneous alignment of three or more sequences with DP, however, poses a difficult algorithmic challenge (30). Determining the optimal alignment of more than a handful of sequences has a prohibitive time complexity (36). Because of this, various heuristic approaches have been developed, many of which are capable of producing good alignments in a relatively short period of time. The most commonly used heuristic technique is known as progressive multiple sequence alignment (8, 32, 33, 34).

2.2 Progressive Alignment

Traditional progressive multiple sequence alignment algorithms involve at least a three-step process in which input sequences are first compared to one another using dynamic programming (DP) (8) to determine the edit distances between all possible pairs of sequences.
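The recurrence in Eq. 1, which underlies these pairwise comparisons, can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name, the match/mismatch scores standing in for s(x_i, y_j), and the boundary values F(i, 0) = -i·d and F(0, j) = -j·d are assumptions (the text above only specifies F(0, 0) = 0).

```python
def global_alignment_score(sx, sy, match=1, mismatch=-1, d=2):
    """Fill the matrix F of Eq. 1 and return the global alignment score."""

    def s(a, b):
        # Hypothetical scoring function: +match for identity, mismatch otherwise.
        return match if a == b else mismatch

    rows, cols = len(sx) + 1, len(sy) + 1
    F = [[0] * cols for _ in range(rows)]

    # Assumed boundary conditions: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        F[i][0] = -d * i
    for j in range(1, cols):
        F[0][j] = -d * j

    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + s(sx[i - 1], sy[j - 1]),  # x_i aligned with y_j
                F[i - 1][j] - d,                            # x_i aligned with a gap
                F[i][j - 1] - d,                            # y_j aligned with a gap
            )
    return F[rows - 1][cols - 1]
```

Both loops run over the full matrix, so the cost is proportional to the product of the two sequence lengths, which is why repeating this for every pair of input sequences becomes expensive.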
The use of DP for computing pairwise distances guarantees an optimal result for the pairwise comparisons, but has a time complexity of O(n^2) for comparing just two sequences of length n (36). For n input sequences (where n now denotes the number of sequences rather than their length), the number of pairwise distance measurements that must be taken is:

$$
\text{Number of Pairwise Distances} = \binom{n}{2} = \frac{n(n-1)}{2}
\qquad \text{(Eq. 2)}
$$

Notably, to counter the obvious scalability issues of performing so many optimal pairwise alignments, systems such as Clustal W offer the option of using faster, less accurate forms of pairwise distance measurement, but this ultimately results in the construction of less accurate guide trees, which can have a deleterious impact on the overall quality of the entire multiple sequence alignment. After all pairwise distances have been computed, the distances are used to construct a guide tree using techniques such as Neighbor-Joining (NJ) (24, 31).

Figure 1: The traditional progressive alignment algorithm. (a) All possible pairs of sequences are optimally aligned using dynamic programming to determine their edit distance. Then, (b) edit distance information is used by a neighbor-joining algorithm to estimate and construct a guide tree. (c) Finally, the sequences are progressively aligned using the guide tree in order to produce an alignment.

The process of constructing a guide tree (8) based on pairwise distances is simple and reasonably scalable, but it is subject to certain limitations. NJ is a simplistic iterative clustering algorithm that uses pairwise edit distance information to decompose an initial star-shaped tree into a fully descriptive tree which represents, based on pairwise sequence distances, the phylogenetic relationships between all of the taxa on the tree (24, 31). In such a tree, the most similar sequences are clustered together first, followed by the most similar sub-alignments, and so on. Eventually, an entire tree is built which represents the similarity relationships between all of the sequences.
The tree built by neighbor joining (NJ) is subsequently used as the guide tree that ultimately describes the order in which sequences and sub-alignments are aligned. The quality of the final alignment is typically quantified by a sum-of-pairs (SP) score.

2.3 Clustal W

Clustal W is a popular progressive alignment system. Since progressive alignment is a heuristic algorithm, Clustal W is not guaranteed to find optimal alignments (8, 32, 33, 34). Clustal W exploits the fact that homologous sequences are evolutionarily related. It builds up multiple alignments progressively with a series of pairwise alignments, moving from the leaves upward in a guide tree that estimates the phylogeny of the sequences (8). Although Clustal W does not always find optimal alignments, in most cases those alignments give a good starting point for further automatic or manual refinement. This type of alignment is generally useful for identifying regions that are highly conserved. The alignment can be further improved through sequence weighting, position-specific gap penalties, and choice of weight matrix (2).

The local maxima problem stems from the nature of the progressive alignment strategy. As the algorithm follows the guide tree and merges sequences together, the solution is never guaranteed to be globally optimal, as defined by some overall measure of alignment quality. Any regions misaligned early in the alignment process cannot be corrected later as new information from other sequences is introduced. This problem is frequently a result of an incorrect branching order in the guide tree. One way to correct this is to use an iterative or stochastic sampling procedure such as bootstrapping (33).

The choice of alignment parameters is also problematic in Clustal W. If parameters are not chosen appropriately, alignments will not converge to a globally optimal solution.
For closely related sequences, any reasonable scoring matrix should work well because matches usually receive the most weight. Therefore, when matches dominate an alignment, almost any weight matrix will find a good solution. However, when aligning more divergent sequences, the range of suitable scores for gaps and mismatches becomes narrow and the choice becomes critical, because gaps and mismatches occur more frequently. Moreover, for highly conserved sequences, the range of gap penalties that will find the correct or best possible solution can be very broad. As more and more divergent sequences are added, however, the exact values for gap penalties become critical for success (31). Our observations have confirmed that this is a common problem in most MSA algorithms. As the number of sequences in an alignment increases, the expected number of matches in each column also increases. For example, the probability of finding a matching nucleotide in a column of ten sequences is much higher than in a column of three sequences. In general, it is difficult to justify why one scoring matrix is better than the others (7).

2.4 Sum-of-Pairs (SP) Scores and Substitution Matrices

Carrillo and Lipman (4) first introduced the sum-of-pairs (SP) score function, which defines the score of a multiple alignment of N sequences as the sum of the scores of the N(N-1)/2 pairwise alignments. Although the SP score function has been widely used to evaluate MSA, it does not provide any biological or probabilistic justification (7). Each sequence is scored as if it is descended from the N-1 other sequences, instead of a single ancestor. As a result, evolutionary events are often overestimated. The problem worsens as the number of sequences increases. A weighted SP score function (2) has been proposed to partially compensate for this effect. Moreover, despite the simplicity of the SP score function, the running time and space required to optimize it make it impractical even for modestly sized sets of short sequences.
It has been shown that the problem of computing MSA with optimal SP score is NP-hard (36). Several fast approximations and divide-and-conquer approaches (30) have been proposed to overcome the computational complexity. In (2) and (6), the SP function, w(M), sums all the pairwise substitution scores in the columns for the sequence pairs p and q. Each column is evaluated with a scoring matrix. The substitution scoring function, s(m_pj, m_qj), gives the score of the alignment at column j for sequences p and q. The weight, α_{p,q}, is intended to balance the overestimation problem in the SP score function (2, 6, 7). The following equation shows the mathematical formulation of the weighted SP score function, in which the outer sum runs over all sequence pairs (p, q) and the inner sum over the alignment columns j.

$$
w(M) = \sum_{1 \le p < q \le k} \alpha_{p,q} \times \sum_{j=1}^{N} s(m_{pj}, m_{qj})
\qquad \text{(Eq. 3)}
$$

A major component in assessing the quality of a sequence alignment is the substitution matrix, which assigns a cost for substituting any possible pair of residues. The substitution costs are evaluated using a predefined evolutionary model in which a score is assigned to every possible substitution or conservation according to its biological similarity (1, 4, 7). Each sequence receives a weight proportional to the amount of independent information it contains. The overall cost of an entire multiple sequence alignment is the sum of the costs of all of the pairwise substitutions. Amino acid substitution matrices, for example, can be calculated empirically by comparing the substitutions that occur in correct alignments against a model of random protein sequences. These matrices can also be derived by scoring the relations of amino acids to each other according to some of their features, such as size, charge, hydrophobicity, and genetic code. The theory of amino acid substitution matrices is described in (1) and applied to DNA sequence comparison in (29).
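To make the weighted SP score of Eq. 3 concrete, the following Python sketch evaluates it for an alignment stored as a list of equal-length gapped strings. It is only an illustration under stated assumptions: the names `weighted_sp_score` and `simple_score`, the gap symbol '-', and the toy substitution scores are all hypothetical and are not taken from (2) or (6).

```python
from itertools import combinations


def simple_score(a, b):
    """Hypothetical substitution score s(a, b) for one column of one pair."""
    if a == "-" and b == "-":
        return 0          # two gaps carry no information
    if a == "-" or b == "-":
        return -2         # residue aligned with a gap
    return 1 if a == b else -1


def weighted_sp_score(alignment, s=simple_score, weights=None):
    """Sum alpha_{p,q} * sum_j s(m_pj, m_qj) over all sequence pairs p < q."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of the alignment must have the same length")

    total = 0.0
    for p, q in combinations(range(len(alignment)), 2):
        # Unweighted SP corresponds to alpha_{p,q} = 1 for every pair.
        alpha = 1.0 if weights is None else weights[(p, q)]
        pair_score = sum(s(a, b) for a, b in zip(alignment[p], alignment[q]))
        total += alpha * pair_score
    return total
```

Every pair of rows is visited and every column is scored for each pair, so the work grows quadratically with the number of sequences; this repeated cost is what makes SP-based fitness functions expensive inside a GA.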
A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change. In general, different substitution matrices are tailored to detecting similarities among sequences that have diverged by differing degrees. Moreover, substitution matrices are frequently used to simulate evolutionary events and generate sequences for experimental studies (27).

2.5 Sequence Generation

Mutations in sequences are fixed into the population and consequently result in substitutions from one nucleotide to another at various sites (12, 16). The simplest model for such nucleotide substitutions assumes that all changes are equally likely. In order to predict the probability that a particular nucleotide at a particular site will change to another over some time interval, we only need to know the instantaneous rate of change (denoted by α), that is, the rate at which nucleotide substitutions occur. The Jukes-Cantor model (16) has only one parameter and assumes that the substitution rates are the same for all nucleotides. Because the one-parameter model assumes that all substitutions are equally likely, the probability that a site retains its nucleotide after time t, P_ii(t), and the probability that it changes to one particular other nucleotide, P_ij(t), can be written as follows:

$$
P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}
\quad \text{and} \quad
P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}
\qquad \text{(Eq. 4)}
$$

With real biological data, there are typically several conserved regions within sequences (23), which signal important biological functions. To better simulate the evolutionary process in nature and perform experiments on controlled data sets, we devised a technique that closely follows the biological assumptions for sequence generation. The sequence generation procedure begins with a randomly generated sequence that serves as the template or ancestral sequence. A trigonometric function shapes the probability that any given nucleotide at a particular site will undergo a mutation.
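As a quick numerical check of Eq. 4, the Jukes-Cantor probabilities can be computed directly. The sketch below is illustrative only; the function name `jukes_cantor` and the example values of α and t are assumptions.

```python
import math


def jukes_cantor(alpha, t):
    """Return (P_ii(t), P_ij(t)) from Eq. 4 for substitution rate alpha."""
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay   # site still shows the original nucleotide
    p_other = 0.25 - 0.25 * decay  # site shows one particular other nucleotide
    return p_same, p_other


# Example with assumed values: the four outcomes always sum to one.
p_same, p_other = jukes_cantor(alpha=0.01, t=10.0)
total = p_same + 3.0 * p_other
```

At t = 0 the site is certain to be unchanged, and as t grows both probabilities approach 1/4, which reflects the loss of information about the ancestral nucleotide.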
The trigonometric function is periodic, so it works well for simulating conserved regions and allows site-specific rate heterogeneity. The implementation employs a Markov model and assumes that evolution is independent and identically distributed at each site (10, 12). The simulation process randomly generates a number between 0 and 1. If this number is less than the predefined probability density at the given site, then no mutation will occur. Otherwise, the process invokes the evolutionary model and determines the most probable substitution. For the purpose of our studies, only the Jukes-Cantor model (16) was implemented for simulation.

Figure 2: An example set of sequences generated by our simulation program. Visual inspection easily shows that several regions in the sequences are highly conserved, which closely resembles real biological data.

3. Literature Review

Evolutionary computation (EC) constitutes a very interesting alternative to the heuristic approaches for multiple sequence alignment (MSA). It has been shown that iterative algorithms often offer highly accurate alignments at the expense of runtime (9). Stochastic methods such as simulated annealing and EC have been successfully applied to the problem of MSA (18). However, they have a tendency to stagnate at local optima as the number of sequences increases.

Notredame and Higgins (21) applied a GA to MSA with a tool known as Sequence Alignment by Genetic Algorithm (SAGA). This is the best-known seminal work in this particular area. SAGA evolves a population of alignments using a complex set of 22 different crossover and mutation operators in an attempt to gradually improve the fitness of the alignments in the population.
Providing meaningful scores for sequence alignments can be somewhat problematic, and by default SAGA relies on a weighted SP approach in which each pair of sequences in an alignment is compared and scored, and the scores from all of the pairwise alignments are then summed to produce a representative score for the entire alignment. Although SAGA was shown to produce high-quality results that were comparable to (or sometimes better than) those of other popular heuristic techniques, SAGA has a large time complexity, likely due to the repeated use of the weighted SP fitness function.

Another approach for MSA with a GA was later introduced by Zhang and Wong (40). The authors reported that their implementation was highly efficient. These results, however, must be considered with great care since their strategy assumes the presence of completely conserved regions, which are the sole evidence used to guide the assembly of alignments. Their chromosome encoding scheme codified the locations and numbers of gaps in the alignment. In other words, their GA simply evolves the number and position of gaps within conserved segments of an alignment. The assumption that such conserved segments always exist is neither realistic nor biologically sound. This method therefore can only compare long, highly similar sequences.

Researchers proposed a technique that defines a chromosome as multiple number-strings of fixed lengths (14). The number-strings represent the positions and number of gaps in the alignments. The authors compared their approach with Clustal W and reported outstanding performance in terms of runtime and alignment quality on a small set of sequences. However, their claims must be scrutinized since the complexity of their GA depends on the number of sequences being aligned as well as the length of those sequences. Since gap positions are individually encoded, the search space increases exponentially as the length of the chromosome increases.
Isokawa et al. (15) and Wayama et al. (38) proposed a simple GA that encodes the alignment as a bit matrix that consists of 0s and 1s. In the bit matrix, each 1 corresponds to a gap and each 0 corresponds to a nucleotide or residue. The concept of such a representation is very similar to that of (14).

Karadimitriou and Kraft (17) developed a program called MSA (not to be confused with multiple sequence alignment). They first considered alignments without internal gaps. The chromosome only encodes the number of gaps at the beginning of the alignments. Their fitness function simply counts the total number of matching symbols in an alignment and assigns one point to each match. This approach is not very meaningful because the alignments produced by their implementation do not carry any inherent biological significance. They further considered alignments with internal gaps. The chromosome encodes the positions of gaps in the sequences, which is very similar to (38) and (19). The second fitness function rewards a match with one point and penalizes every group of consecutive gaps with four points. The alignments produced by this algorithm cannot be easily quantified and compared because the algorithm employs a non-standard measurement.

Next, we examine two novel applications of GAs to the problem of multiple sequence alignment.

4. Evolving Guide Trees for Progressive Alignment Algorithm

In a progressive alignment algorithm, the guide tree dictates the order of construction of the final alignment. Final alignment quality is highly dependent on the correctness of this guide tree. We hypothesize that evolving guide trees using a genetic algorithm can lead to higher-quality trees which will ultimately result in higher-quality alignments.
In addition, since we avoid exhaustive and repetitive calculations of pairwise distances, we hypothesize that our approach is more scalable than other progressive alignment approaches when aligning large numbers of long sequences.

4.1 Algorithm Implementation

This algorithm implements an iterative steady-state GA (13, 39). The population in the GA consists of viable guide trees which are represented in an efficient, coalescing binary tree data structure which enables meaningful crossover and mutation. Variability operators such as crossover and mutation are constructed such that the viability of an individual tree is never compromised. Rank-based selection is implemented via the use of a random number generator which samples from a carefully parameterized beta probability distribution. This non-uniform random selection, when overlaid across a sorted table of fitness scores for all individuals in the population, allows for strongly biased rank-based selection wherein highly fit parents are far more likely to be selected for crossover, and their offspring replace low-fit individuals on the opposite end of the distribution. Elitism is implemented, as the fittest individual in a population is never destroyed by less-fit offspring.

Fitness for any individual is objectively computed by performing the progressive alignment in the pairwise order specified by the individual guide tree. The fitness of an individual tree is computed as the log of the alignment score of the final alignment produced by performing the progressive alignment in the order specified by that tree. In this way, the fitness of a guide tree is optimized only with respect to the most important measurement: the absolute quality of the final multiple sequence alignment.
Because of this, extraneous optimality criteria and sources of possible errors (such as misleading neighbor-joining trees) are ignored as the GA focuses only on maximizing progressive alignment scores by evolving successively better guide trees.

4.2 Guide Tree Encoding

We present a novel chromosome encoding for the individual guide trees in our GA. The encoding is extremely efficient in the contexts of both space and time and allows for the application of fast and meaningful crossover and mutation operators. One of the most important aspects of our chromosome encoding is that it avoids the problem of dealing with duplicate leaves during branch swapping.

Each individual in the population represents a possible guide tree, and is stored as an integer vector describing how nodes on one level of a coalescing tree connect to the next level of the same coalescing tree. At the lowest level of the tree, level 0, there are n terminal nodes, where n is the number of sequences being aligned. Each terminal node corresponds to a particular sequence from the n sequences being aligned. The ordering of the terminal nodes is static. At the next level, there are n-1 nodes to which each of the terminal nodes may connect. This forces at least one coalescence at level 1. In general, each level of the coalescing tree has n-x nodes, where n is the number of sequences being aligned, and x is the level of the tree.

Figure 3: A coalescing binary tree with 8 sequences. Note that full coalescence occurs at an upper bound of n steps, but can often occur sooner.

Since a node is little more than the description of the edge from a given node to another node at a subsequent level, these nodes (and therefore the tree itself) can be represented as an integer vector.
If every node in a column of such a tree is numbered from 0 through n-x, where n is the number of leaf nodes and x is the column index, then the tree shown above in Figure 3 can be efficiently represented as:

2,0,5,3,4,7,7,3,3,(-1),1,3,5,0,4,1,3,(-1),1,4,4,(-1),0,(-1),3,1,0,2,(-1),2,1,(-1),0,0,0

In this encoding, each value represents a description of the edge from a node at level x in the coalescing tree to another node at level x+1. The value -1 is used to represent edgeless nodes. Since we are essentially evolving a bifurcating phylogenetic tree, we add the constraint that any one node can have no more than two connections from the left. To efficiently enforce this constraint, at each node we also track the number of connections from the previous level. For a binary coalescing tree, these values are either 0, 1, or 2. By examining the number of left and right connections at each node, it is straightforward to quickly confirm the validity of a given tree.

4.3 Evaluating Guide Tree Fitness

The initial population of trees in our GA consists of some number of randomly generated trees. These trees are built in a bottom-up fashion in a completely random walk up to the root of the tree. For each node at a given level of a tree, a node in a subsequent level is chosen entirely at random, constrained only by the limitation that nodes at level x are allowed a maximum of two connections from level x-1. Completely viable, random trees can be built very quickly using this approach.

Tree fitness is computed for each individual in the initial random population as well as for each offspring that results from crossover/mutation operations. Fitness is computed by first building an intermediate evaluation tree, which is a temporary data structure used to hold the sequences and partial alignments as the progressive alignment is computed by a recursive depth-first traversal of the evaluation tree.
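The random bottom-up construction and the two-connection constraint described above can be sketched as follows. This is a simplified reading of the encoding, not the authors' data structure: it assumes that level x holds exactly n-x nodes indexed from 0, stores one list of edges per level rather than a single flat vector, and marks nodes that received no connection from below as edgeless (-1). The function names are hypothetical.

```python
import random


def random_guide_tree(n, rng=None):
    """Build a random coalescing tree for n leaves as a list of per-level edge lists."""
    rng = rng or random.Random()
    tree = []
    active = list(range(n))  # nodes on the current level that carry sequences
    for level in range(n - 1):
        width_next = n - level - 1
        incoming = [0] * width_next        # connections received by each next-level node
        edges = [-1] * (n - level)         # -1 marks an edgeless node
        for node in active:
            # Only next-level nodes with fewer than two connections are eligible.
            candidates = [t for t in range(width_next) if incoming[t] < 2]
            target = rng.choice(candidates)
            edges[node] = target
            incoming[target] += 1
        tree.append(edges)
        active = [t for t in range(width_next) if incoming[t] > 0]
    return tree


def is_valid_tree(tree, n):
    """Check the two-connection limit and that the tree coalesces to a single root."""
    active = set(range(n))
    for level, edges in enumerate(tree):
        width_next = n - level - 1
        if len(edges) != n - level:
            return False
        incoming = [0] * width_next
        for node, target in enumerate(edges):
            if node in active:
                if not 0 <= target < width_next:
                    return False
                incoming[target] += 1
            elif target != -1:
                return False
        if max(incoming) > 2:
            return False
        active = {t for t, count in enumerate(incoming) if count > 0}
    return len(active) == 1
```

Under these assumptions a tree for 8 sequences stores 8 + 7 + ... + 2 = 35 edge values, which matches the length of the example vector above. Because the top level contains a single node, every random walk is forced to coalesce by that level at the latest.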
At each node in the evaluation tree, the fitness function either recursively descends or performs an alignment. Alignments can occur between pairs of sequences, between a single sequence and a partial alignment, or between two partial alignments. In this way, the complete progressive alignment is built up until the root node of the evaluation tree contains the complete alignment and the score for that alignment. The natural log of this alignment score is then computed and represents the objective fitness for the guide tree. Fitness evaluation is the most computationally time-consuming component of this genetic algorithm, especially towards the top of the evaluation tree, where large partial alignments are themselves being aligned. Specifically, the time complexity of computing alignments is given in Table 1. To conserve memory, evaluation trees are destroyed after the progressive alignment is complete. The fitness of each individual guide tree is maintained in a fitness table which is sorted in descending order of relative fitness. This sorted fitness table is used for the selection process, as a precursor to crossover and mutation.

Figure 4: The Process of Fitness Evaluation.
The coalescing binary tree is first converted to an evaluation tree, and then a progressive alignment is performed via a depth-first recursive traversal of the evaluation tree in which sequences and partial alignments are progressively aligned into a complete alignment of all of the input sequences.

Table 1: Time complexity for three kinds of dynamic programming alignments

Type of Alignment | Time Complexity
Sequence + Sequence | $O(mn)$, where m and n are the lengths of the sequences being aligned
Sequence + Alignment | $O(kn + \min\{s, k\}\,mn)$, where m is the length of the sequence, n is the length of the alignment, k is the number of sequences comprising the alignment, and s is the size of the alphabet
Alignment + Alignment | $O(km + ln + \min\{s^2, kl\}\,mn)$, where m and n are the lengths of the two alignments, k and l are the numbers of sequences in the alignments, and s is the size of the alphabet

4.4 Selection, Crossover, and Mutation

A rank-based selection process is implemented wherein a parameterized beta distribution is overlaid across a sorted fitness table. Two unique parents and one replacement tree (for crossover offspring) are chosen at random from this non-uniform probability distribution, which has a strong bias for selecting parents with a high fitness as well as a strong bias towards selecting lower-fit individuals to be replaced by the offspring of crossover. The beta distribution is parameterized with α = 3.0 and β = 0.5. Once the rank-based selection process chooses two unique parents and one child (also unique), these individuals are then processed by the crossover operator. The GA performs a type of one-point crossover in which a crossover point is chosen at random for one of the parent trees. In effect, a crossover point is simply one of the n levels on the coalescing tree data structure. A second crossover point is chosen on the second parent using a linear search for a compatible matching level.
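As a rough illustration of the rank-based selection step, the sketch below draws ranks from a Beta(3.0, 0.5) distribution over a fitness table sorted in descending order. The text does not spell out how the distribution is mapped onto the table, so the mapping here is an assumption: a draw b (which falls mostly near 1) is mirrored to 1-b to pick parents near the top of the table, and used directly to pick the replacement near the bottom. The function names are hypothetical.

```python
import random

def biased_rank(size, rng, favor_top, alpha=3.0, beta=0.5):
    """Return a rank in [0, size); rank 0 is the fittest individual."""
    b = rng.betavariate(alpha, beta)        # mass concentrated near 1.0
    position = (1.0 - b) if favor_top else b
    return min(int(position * size), size - 1)

def select_ranks(size, rng):
    """Pick two distinct parents (biased to high fitness) and one distinct
    replacement slot (biased to low fitness)."""
    parent_1 = biased_rank(size, rng, favor_top=True)
    parent_2 = parent_1
    while parent_2 == parent_1:
        parent_2 = biased_rank(size, rng, favor_top=True)
    replaced = parent_1
    while replaced in (parent_1, parent_2):
        replaced = biased_rank(size, rng, favor_top=False)
    return parent_1, parent_2, replaced
```

With a population of 30, the mean of Beta(3.0, 0.5) is 3/3.5, or about 0.86, so under this assumed mapping parents come on average from roughly the top seventh of the table and replacements from roughly the bottom seventh.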
Two binary coalescing trees are not always compatible for crossover. Preliminary analysis indicates that an incompatible selection occurs in < 5% of all cases, and appears to decrease slowly as a function of tree size. Compatible parent trees are trees in which there exists some internal level at which the internal nodes on the trees can be entirely connected via edges in order to produce a viable child. Since the selection of crossover points is stochastically driven, we arbitrarily attempt 20 times to identify possible crossover points between two randomly selected parent trees before deciding that the trees are incompatible for crossover. In the event that two trees are found to be incompatible, we re-select new parents and try again until compatible trees are found. The crossover process is shown in Figure 5 below. The lower portion of the first parent at Crossover Point 1 is added to the upper portion of the second parent at Crossover Point 2. This preserves all of the lower-level node relationships below Crossover Point 1 from the first parent, while mixing with the preserved upper-level sub-tree ordering specified in the second parent at and above Crossover Point 2. At the point at which the two different trees now intersect in the child tree, edges between nodes are constructed in such a way that the child tree always remains viable. Sometimes edges will need to be constructed at the interface which did not exist at all in the previous tree. In this case, we repair the graft by randomly selecting a viable node from the next level in the coalescing tree such that no terminal nodes are ultimately orphaned, and the tree remains fully connected and viable. Mutation of a child tree after crossover is a simple process of selecting some connected node on the tree and changing its upper edge to connect to a randomly selected, but viable node.
In some cases, removing an edge from a node which is connected higher up in the tree requires a recursive repair mechanism which traverses up the tree from the newly orphaned node, removing all edges from the visited nodes until a node is found that has two connected nodes. Once all such edges are removed, the tree is again viable.

Figure 5: The crossover of two compatible coalescing binary trees. Note that a graft repair was needed in this example in order to prevent orphaning the leaf node SEQ_3.

4.5 Algorithm Parameters

The GA has a steady-state population of 30 individual guide trees. The algorithm is iterative and not generational, and iterates for a configurable number of times before halting. For the tests and experiments conducted with this GA, we chose to terminate the algorithm arbitrarily at 10,000 iterations. The number of iterations required to reach convergence on the globally optimal guide tree is a function of both the number of sequences and the length of the sequences. In real-world applications of this GA approach, the termination condition for the GA could be dynamically calculated at run-time as a function of average sequence length and the number of sequences being aligned. Since this is an iterative, steady-state GA, a single selection and crossover happens at each iteration (39). However, some minority of individual trees may not be compatible for crossover, and so the effective crossover probability is slightly smaller than 100%. In each iteration, each child tree from crossover has a 10% chance of incurring a single point mutation. Examining the effects of manipulating the mutation rate is an area of future work.
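The overall control flow of a steady-state GA with these parameters can be sketched as below. This is a generic skeleton under stated assumptions, not the authors' code: the operators are passed in as callables, a simple "parents from the fitter half, replacement from the weaker half" rule stands in for the beta-distributed rank selection, and a toy bit-counting problem replaces guide-tree fitness so that the sketch is self-contained. All names are hypothetical.

```python
import random

def steady_state_ga(init, fitness, crossover, mutate, rng,
                    pop_size=30, iterations=10_000, mutation_rate=0.10):
    """One selection, one crossover and one replacement per iteration."""
    population = [init(rng) for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]
    for _ in range(iterations):
        # ranks in descending order of fitness (index 0 is the fittest)
        order = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        # stand-in for rank-biased selection (assumption, see lead-in)
        parent_1, parent_2 = rng.sample(order[:pop_size // 2], 2)
        replaced = rng.choice(order[pop_size // 2:])
        child = crossover(population[parent_1], population[parent_2], rng)
        if rng.random() < mutation_rate:    # 10% chance of a point mutation
            child = mutate(child, rng)
        population[replaced] = child
        scores[replaced] = fitness(child)
    best = max(range(pop_size), key=lambda i: scores[i])
    return population[best], scores[best]

# Toy problem used only to exercise the loop: maximize the number of 1 bits.
def toy_init(rng, length=20):
    return [rng.randint(0, 1) for _ in range(length)]

def toy_fitness(bits):
    return sum(bits)

def toy_crossover(a, b, rng):
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def toy_mutate(bits, rng):
    flipped = list(bits)
    position = rng.randrange(len(flipped))
    flipped[position] ^= 1
    return flipped
```

A call such as `steady_state_ga(toy_init, toy_fitness, toy_crossover, toy_mutate, random.Random(0), iterations=500)` returns the fittest individual and its score. Because only individuals from the weaker half are ever replaced in this sketch, the best score never decreases from one iteration to the next.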
Table 2: Concise summary of experimental GA parameters

GA Parameter | Value
Population Type | Steady-State
Population Size | 30
Population Initialization | Bottom-up randomly generated, viable guide tree
Number of Iterations | 10,000
Selection Type | Biased, Rank-Based using Beta probability distribution
Crossover Type | Branch Swapping on Coalescing Binary Tree
Crossover Rate | 0.9
Mutation Type | Random intra-tree, same-level branch migration
Random Number Generator | R250 from GNU Scientific Library (available at http://www.gnu.org/software/gsl)

In addition to the GA parameters, alignment parameters were also provided. As mentioned previously, dynamic programming algorithms were used to align sequences and alignments based on the evolved guide trees. The alignments and alignment scores are produced in the context of an evolutionary model which takes the form of a scoring system that specifies the penalties for opening new gaps in an alignment or extending an open series of gaps. This distinction rests on the idea that it should be more expensive to open a new region of gaps than to simply extend an existing gap region. In addition, different scores are assigned to residue matches and mismatches in an alignment. The scoring system used in this GA implements affine gap penalties, and is fully parameterized as shown in Table 3. When comparing against Clustal W, the same scoring system was used in order to more directly compare results between our GA and Clustal W.

Table 3: The GA alignment scoring system with affine gap penalties

Score Type | Value
Gap opening penalty | -5.0
Gap extension penalty | -1.0
Nucleotide match score | 10.0
Nucleotide mismatch score | 2.0

4.6 Experimental Results

All experimental runs were conducted on a workstation with a 1GHz Intel Pentium-III CPU, 640MB of RAM, running Redhat Linux version 7.2.
Over 30 tests were run, with each test working against a different input sequence file which contained different numbers of sequences and/or different sequence lengths. The input sequences for all of the experimental runs were constructed using the Jukes-Cantor sequence generation mechanism described in section 2.5. All generated sequences are DNA nucleotide sequences, although protein alignment works equally well. For each experiment, alignments were performed both with our GA and with Clustal W (v1.82) (31,32,33). Performance, in terms of both efficiency and apparent alignment quality, is summarized for several of our experimental runs. In order to more accurately compare the results of Clustal W to our GA, we attempted to identically parameterize each system. To do this, we identically configured affine gap penalties and substitution costs, and disabled the delayed alignment of divergent sequences in Clustal W. All of the experimental runs produced similar overall results for all of the input sequences, regardless of the number of sequences being aligned, and the results presented in this section were chosen as representative results for the entire experimental evaluation of our system. It is important to note that our approach is still fundamentally a progressive alignment algorithm. Therefore, it suffers from the same problems that we mentioned before. To overcome this difficulty, in the next section we propose a different EC technique that optimizes the alignment on all sequences simultaneously.

[Plot: "20 Sequences, 100 bp in Length"; vertical axis: Fitness (approximately 11.23 to 11.275); horizontal axis: Iterations x 10]

Figure 6: The fitness trend of the fittest individual across 10,000 generations when evolving a guide tree to align 20 sequences of length 100 bp.

Figure 7: Visual comparison of the alignment produced by the GA (a) and the alignment produced by Clustal W (b).
Although extremely similar, the GA alignment is slightly better. For example, the alignment produced by Clustal W has one additional column, indicating that more gaps were used to construct the alignment. The slight improvement in alignment quality is also apparent on a closer visual inspection.

5. Evolving Consensus Sequence with a Genetic Algorithm

Here, we present a second novel approach where a consensus sequence is evolved with a genetic algorithm. This optimized consensus sequence can then be translated into an alignment.

5.1 Consensus Sequence

The consensus sequence is the most interesting and important feature of this GA approach. It is essentially a compact formulation to represent all possible alignments for virtually any given number of sequences (6, 18). The consensus sequence borrows the idea from biology that sometimes it is necessary for certain positions in a sequence to be made ambiguous when some residues simply cannot be resolved during laboratory experiments. A sequence with ambiguity codes is actually a mix of sequences, each having one of the nucleotides defined by the ambiguity at that position. For example, if an R is encountered in the sequence, then the sequences in the assortment will have either an adenine or a guanine at that position. The ambiguity enables conserved sequences to be condensed into one single representation. Figure 8 lists the most commonly used ambiguity codes defined by the Nomenclature Committee of the International Union of Biochemistry (IUB). The consensus sequence in essence is a condensed sequence with ambiguity codes that shows what nucleotides are allowed in each column.

Figure 8: The most commonly used DNA ambiguity codes are defined by the International Union of Biochemistry (IUB). The presence of ambiguity generally indicates that some residues cannot be resolved during the laboratory experiments.
Ambiguity codes also enable sequences to be represented in a more condensed form.

5.2 Design of Encoding Scheme

To further enhance the design, our chromosomes are broken into four pieces according to the nucleotide they represent. In other words, our GA uses four parallel chromosomes to represent four different nucleotides. Each chromosome encodes the relative occurrences and locations of a nucleotide and is only evolved with the chromosome that encodes the same nucleotide. The fitness of the whole set of chromosomes is determined by how well they fit together to derive the final alignment. The geometry of the parallel chromosomes is very similar to a four-dimensional hyperplane. Each dimension is evolved and optimized separately and independently. Figure 9 shows a hypothetical sequence and demonstrates how the procedure works.

Figure 9: Sequences are split up into four subsequences according to the nucleotides. Each subsequence only encodes the information where such a nucleotide can be found. Since each nucleotide is individually encoded, any possible ambiguities can be fully represented. Further, chromosomes are represented as binary strings, where 1 signals the existence of the relevant nucleotide at the given location while 0 signals its absence.

The length of chromosomes is difficult to determine precisely. It depends on the evolutionary model and the similarity of the sequences. If sequences are highly conserved, chromosomes can be relatively short. Intuitively, any two randomly generated sequences will have at least 25% similarity. For this study, we assumed that sequences must have at least 50% similarity to be biologically significant. Therefore, we arbitrarily defined the length of the chromosome to be 1.5 times the length of the longest sequence. For most of our studies, this assumption worked well. Furthermore, for implementation convenience, each chromosome is further divided into smaller blocks called loci.
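To make the four-chromosome encoding concrete, the following sketch converts a consensus string written with IUB ambiguity codes into four parallel bit strings and back. It is an illustration only: the function names are hypothetical, bits are stored as plain Python lists rather than packed loci, and using "-" for a column that permits no nucleotide is our own convention.

```python
# Standard IUB ambiguity codes and the nucleotides each one permits.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(allowed): code for code, allowed in IUB.items()}

def split_consensus(consensus):
    """Return one bit list per nucleotide; bit i is 1 when that
    nucleotide is allowed in column i of the consensus."""
    return {nt: [1 if nt in IUB[symbol] else 0 for symbol in consensus]
            for nt in "ACGT"}

def merge_chromosomes(chromosomes):
    """Inverse of split_consensus: rebuild the ambiguity-coded string."""
    symbols = []
    for column in zip(*(chromosomes[nt] for nt in "ACGT")):
        allowed = frozenset(nt for nt, bit in zip("ACGT", column) if bit)
        symbols.append(CODE_FOR.get(allowed, "-"))   # "-" : nothing allowed
    return "".join(symbols)

def chromosome_length(sequences, factor=1.5):
    """Chromosome length as 1.5 times the longest input sequence."""
    return int(factor * max(len(s) for s in sequences))
```

For example, `split_consensus("ARN")` sets the A bit in all three columns, the G bit in the last two, and the C and T bits only in the last one, and `merge_chromosomes` maps that back to "ARN".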
Biologically, a locus is a block of alleles where genes can be found. The length of a locus is determined by the length of an integer on the hardware platform. The number of loci depends on the length of the chromosome.

5.3 Crossover and Mutation Operations

For this research, we implemented a simple one-point crossover. A point is randomly selected in each locus for each chromosome and alleles are exchanged between two parent chromosomes to form an offspring. An offspring is produced at each generation and then competes with the population. Since each chromosome has a separate string for each of the four nucleotides, the crossover points are chosen separately. Furthermore, we have implemented a bias function for selecting the parent chromosomes from the entire population. The bias function is like an “unfair” random number generator. It is essentially a quadratic equation that randomly generates a series of numbers with bias toward the lower indexes. Since the initial population has been sorted in descending order according to the fitness values, the individuals with higher fitness values are more likely to be selected. Mutation is an important operator that prevents the population from stagnating at local optima (38). In our implementation, mutation is only applied to the newly created offspring chromosomes. The GA first calculates the expected number of mutations for each locus in the chromosomes with a random factor. It then iteratively picks random locations on each locus for each of the four parallel chromosomes and changes the alleles. The mutation operator randomly flips the alleles independently on each locus with the binary XOR operator. It inverts the alleles from 0 to 1 or 1 to 0.

Figure 10: The crossover operator retrieves the alleles from two parent chromosomes to create an offspring. A point is randomly selected on each locus for each chromosome. The offspring receives approximately half of its alleles from each parent.
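The operators described in this section can be sketched on integer-packed loci as follows. This is a hedged illustration rather than the authors' code: all names are hypothetical, the bias function is assumed to be the square of a uniform random number (one simple quadratic form that favors low indexes), and the "random factor" in the number of mutations is assumed to be a uniform draw between 0 and twice the expected value.

```python
import random

LOCUS_BITS = 32                        # one locus per 32-bit integer
FULL = (1 << LOCUS_BITS) - 1

def biased_index(size, rng):
    """Quadratically biased index: low indexes (fitter individuals in a
    table sorted by descending fitness) are drawn more often."""
    u = rng.random()
    return min(int(size * u * u), size - 1)

def crossover_locus(a, b, rng):
    """One-point crossover inside a locus: high bits from a, low bits from b."""
    point = rng.randrange(1, LOCUS_BITS)        # cut strictly inside the locus
    low_mask = (1 << point) - 1
    return ((a & ~low_mask) | (b & low_mask)) & FULL

def crossover(parent_a, parent_b, rng):
    """Each parent maps a nucleotide to its list of loci; every locus of
    every nucleotide chromosome gets its own crossover point."""
    return {nt: [crossover_locus(x, y, rng)
                 for x, y in zip(parent_a[nt], parent_b[nt])]
            for nt in "ACGT"}

def mutate(chromosomes, rng, expected_per_locus=1.0):
    """Flip randomly chosen alleles in place with XOR."""
    for nt in "ACGT":
        loci = chromosomes[nt]
        for i in range(len(loci)):
            flips = rng.randint(0, int(2 * expected_per_locus))
            for _ in range(flips):
                loci[i] ^= 1 << rng.randrange(LOCUS_BITS)
    return chromosomes
```

Since the cut point is uniform within each locus, an offspring inherits on average about half of its alleles from each parent, in line with the caption of Figure 10.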
The length of a locus is 32 bits, which corresponds to the length of an integer on our machines.

5.4 Objective Function

The objective function measures the quality of the MSA. Ideally, the better the score, the more biologically relevant the multiple alignment is. The substitution costs are evaluated using a predefined substitution matrix. The matrix scores every possible substitution or conservation according to its biological likelihood. We used the nucleic scoring matrix defined by IUB, in which each match receives 10.0 points and each mismatch 0. The gap penalty is 10.0 for opening and 0.2 for extending a gap. The alignment with the highest score is considered a potentially optimal solution. In addition, for chromosomes that do not include all nucleotides, the objective function subtracts from the fitness value both the mismatch and gap scores multiplied by the number of nucleotides that are missing from the alignment. Our experiments have confirmed that this strategy worked quite well. The calculation of sum-of-pairs (SP) scores for N sequences takes $O(M \times N^2)$ time (7,36), where M is the average length of the sequences. To further improve the GA performance, we have devised a sequence profiling technique that simplifies the calculation of SP scores. The objective function first computes the profile of the sequences for each column. A profile is simply the occurrences or frequencies of each nucleotide. The profiling process accumulates the occurrences of each nucleotide and reduces the calculation of SP scores to three smaller tasks. Matches are only possible when two identical nucleotides are aligned together. Therefore, the matching score is simply the sum of all possible combinations of the same nucleotides multiplied by the match score in the substitution matrix. The number of mismatches, on the other hand, is the sum of all the combinations of two different nucleotides. There are only six such combinations.
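The profile shortcut for a single column can be sketched as follows, together with the brute-force pairwise sum it replaces. This is an illustrative sketch with hypothetical names and two simplifying assumptions that are ours, not the paper's: every residue-gap pair is charged one flat gap score (so affine opening and extension costs are not modeled), and gap-gap pairs score zero.

```python
from itertools import combinations

def column_score_profile(column, match=10.0, mismatch=0.0, gap=-10.0):
    """SP score of one column computed from nucleotide counts."""
    counts = {nt: column.count(nt) for nt in "ACGT"}
    gaps = column.count("-")
    residues = sum(counts.values())
    # pairs of identical nucleotides: c choose 2 for each nucleotide
    match_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    # pairs of different nucleotides: the six unordered combinations
    mismatch_pairs = sum(counts[a] * counts[b]
                         for a, b in combinations("ACGT", 2))
    # every residue paired with every gap in the column
    gap_pairs = residues * gaps
    return match * match_pairs + mismatch * mismatch_pairs + gap * gap_pairs

def column_score_pairwise(column, match=10.0, mismatch=0.0, gap=-10.0):
    """Reference version: visit all N(N-1)/2 pairs explicitly."""
    total = 0.0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                      # gap-gap pairs are ignored
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total
```

Both functions agree on any column, for instance `column_score_profile("AACG-T-")` and `column_score_pairwise("AACG-T-")`, but the profile version only needs the five counts, so the work per column no longer grows quadratically with the number of sequences once the counts are available.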
The number of gap alignments is derived from the sum of the frequencies of each nucleotide multiplied by the number of gaps. The result is then multiplied by the gap penalty to obtain the overall gap score.

Figure 11: The figure shows the sequence profile for the isolated column. The profiling process simply accumulates the frequencies or occurrences of each nucleotide on a given column. The process simplifies the calculations of the SP scores into three smaller tasks and reduces the complexity. Matches are only possible when two identical nucleotides are aligned together. Therefore, the score for matches is the sum of all possible combinations of identical nucleotides multiplied by the matching scores S(i, i) from the substitution matrix. The number of mismatches is the sum of all combinations between two different nucleotides. There are only six such combinations. The gap penalties are the sum of all arrangements between each nucleotide and gaps.

5.5 Alignment Construction

The construction of the final alignment is very similar to that of the DP algorithm. The GA derives the alignment from the last nucleotide to the first and the chromosomes are accordingly decoded backward. If a nucleotide is permitted at a given column, then it is consumed and added to the final alignment. The process moves on to the preceding ones. Otherwise a gap is inserted into the alignment. If no nucleotides are ever used in the column, then the allele is skipped. Alleles that are not used to derive the final alignment are considered the non-coding regions. One of the very interesting and important features of the encoding scheme is that the alleles that are used to derive the alignment do not have to be consecutive. In addition, two different chromosomes can potentially give the same alignments. In other words, the alignment construction process picks the appropriate alleles as it moves along. The chromosomes do not have to encode exact bit patterns for the alignments.
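A minimal sketch of this backward decoding is given below. It reflects our reading of the description rather than the authors' code: the name `decode_alignment` is hypothetical, chromosomes are plain bit lists keyed by nucleotide, input sequences are assumed to contain only A, C, G and T, and the count of nucleotides that never found a permitted column is returned so that it could feed the penalty mentioned in the objective function.

```python
def decode_alignment(chromosomes, sequences):
    """Decode four parallel bit lists into aligned rows, right to left.

    chromosomes: dict mapping "A", "C", "G", "T" to equal-length 0/1 lists.
    Returns (aligned_rows, missing).
    """
    length = len(chromosomes["A"])
    position = [len(seq) - 1 for seq in sequences]   # next unread nucleotide
    rows = [[] for _ in sequences]
    for col in range(length - 1, -1, -1):
        picks = []
        for k, seq in enumerate(sequences):
            if position[k] >= 0 and chromosomes[seq[position[k]]][col]:
                picks.append(seq[position[k]])        # nucleotide is permitted
            else:
                picks.append("-")                     # otherwise a gap
        if all(p == "-" for p in picks):
            continue                                  # unused allele: non-coding
        for k, p in enumerate(picks):
            rows[k].append(p)
            if p != "-":
                position[k] -= 1                      # consume the nucleotide
    missing = sum(p + 1 for p in position)            # nucleotides left over
    return ["".join(reversed(row)) for row in rows], missing
```

With columns that permit A, C, G and T in that order, the sequences "ACGT" and "AGT" decode to "ACGT" and "A-GT". Inserting an all-zero column anywhere in the chromosomes leaves the result unchanged, which illustrates how two different chromosomes can yield the same alignment.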
This makes every allele in the chromosome a potential solution for the alignment. Experiments have confirmed that the GA discovered the optimal alignment, as defined by the substitution matrix, quickly and effectively. The alignments produced by the GA are at least as good as the ones obtained from Clustal W. If the GA is allowed to continue evolving, better alignments are very likely to be found. As the number of sequences increases, the effectiveness of the GA begins to surface. Our experiments have shown that, regardless of the number of sequences being aligned, the GA performed extremely well and produced alignments with competitive scores.

Figure 12: The figure shows how the final alignment is constructed from the chromosomes. The alignment is derived in reverse order. The spaces in between are the alleles that did not match up with any of the nucleotides in the sequences. If a particular nucleotide is not present in the sequence, a gap is inserted. Chromosomes do not have to encode the exact bit patterns for the alignments. The alignment process simply picks the “appropriate” alleles.

5.6 Experiment Design

The objective of our experiments was to demonstrate that the GA could scale better as well as produce competitive alignments. For the purpose of this study, we took Clustal W as the standard method for multiple sequence alignment. Therefore, sequences were first aligned with Clustal W and the scores were used as the stopping condition for the GA. We applied the standard IUB nucleic scoring matrix and used gap penalties identical to those of Clustal W. For this study, we generated sequences of various lengths with approximately 50% similarity. The mutation rate was 0.0625 and the average expected number of mutations on each locus was about one. The GA began with a randomly generated population of 64 individuals. The population was first evaluated and sorted in descending order according to the fitness values.
At each generation, the bias function randomly picked two individuals from the population that served as the parent chromosomes. The crossover operator exchanged alleles from two parent chromosomes and created an offspring. Mutation was applied to the offspring repeatedly until its fitness was higher than that of both parent chromosomes. The offspring then competed with the entire population and replaced the individual with the lowest fitness. We gradually increased the number of sequences in each trial. Due to the stochastic nature of the GA, all trials were performed at least three times in order to obtain more reliable results.

5.7 Results and Discussions

Experiments show interesting results for our GA approach. In most cases, the GA outperformed Clustal W and produced alignments with higher SP scores. The number of generations needed to find the optimal solutions remained approximately the same even though the number of sequences increased. We attribute this to the fact that the GA was able to utilize the guided search effectively and found the optimal alignments. Furthermore, experiments have confirmed that the GA scales well with respect to the number of sequences. We believe that the objective function does not have sufficient granularity to guide the GA in finding the optimal solution. An increased number of sequences does not impact the performance, but the length does. For sequences averaging 60 base pairs in length, the GA converges stably and quickly to the optimal alignment (as derived from Clustal W). However, performance began to fluctuate as the length of the sequences increased. For 60 base-pair-long sequences, for example, the search space is $16^{60}$, or approximately $1.767 \times 10^{72}$. For 100 base-pair-long sequences, the search space increases exponentially. Specifically, it is $1.46 \times 10^{48}$ times larger than that of the 60 base-pair-long sequences. We have noticed that the GA became slow to converge on long sequences.
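The search-space figures quoted above can be checked with a few lines of integer arithmetic. The sketch assumes that the base 16 comes from the four presence bits per column (2^4 = 16 states), which is our interpretation and is not stated explicitly in the text; the function name is hypothetical.

```python
STATES_PER_COLUMN = 2 ** 4      # assumption: four presence bits per column

def search_space(length):
    """Number of distinct bit patterns for `length` columns."""
    return STATES_PER_COLUMN ** length

size_60 = search_space(60)
size_100 = search_space(100)
ratio = size_100 // size_60     # equals 16**40 exactly

print(f"{size_60:.3e}")         # search space for 60 bp
print(f"{ratio:.3e}")           # growth factor from 60 bp to 100 bp
```

Since 16^60 = 2^240 and 16^100 / 16^60 = 16^40 = 2^160, the two printed values come out near 1.767 x 10^72 and 1.46 x 10^48 respectively, consistent with the numbers given in the text.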
In addition, we have confirmed that the SP scoring function is not a good measure of MSA quality. If gaps are not heavily penalized, the same score can easily be achieved with more matches but an excessive number of gaps. The relative difference in score between the correct and incorrect alignments decreases as the number of sequences increases. Clearly this is counter-intuitive and not realistic. The difference should increase when more sequences are introduced into the alignment (7).

Table 5: The following tables summarize the numbers of generations needed to find the optimal alignments, at least as good as Clustal W, for various lengths of sequences

60 base pairs Trial 1 Trial 2 Trial 3 70 base pairs Trial 1 Trial 2 Trial 3 100 base pairs Trial 1 Trial 2 Trial 3 10 54,627 48,408 59,044 Number of Sequences 25 50 75 48,470 41,697 57,705 53,502 55,387 57,642 46,543 54,028 53,447 100 55,495 51,763 59,955 162,982 164,209 131,512 159,427 156,715 134,724 167,843 184,960 176,781 165,949 186,131 108,757 180,613 175,887 154,084 1,254,233 1,197,003 1,124,561 1,298,503 1,432,034 1,090,024 1,178,904 1,470,175 1,373,209 1,119,092 1,405,332 1,181,392 1,135,987 1,157,061 1,219,451

Figure 13: The first figure (a) shows the alignment generated from the GA and the second (b) from Clustal W. (a) is about 3 base pairs longer than (b) but it detects more conserved regions. Visual inspection of the beginning of (b) immediately shows that the alignment is not optimal. Clustal W typically requires human intervention in order to produce an alignment with a higher score.

6. Conclusion

We have presented novel and highly effective approaches to applying EC to the problem of MSA. In addition, we devised a novel use of binary coalescent trees and consensus sequences. The binary coalescent tree functions as a data structure which lends itself nicely to the problem of performing crossover on two trees without duplicating terminal nodes.
The consensus sequence approach enables the simultaneous optimization of alignment on many sequences. In our experiments, we have demonstrated that our approaches produced more refined alignments than Clustal W. We hypothesize that the first approach can produce better alignments than Clustal W while also being more scalable than Clustal W for large datasets (e.g. 500 sequences of length 2500 or larger). Additional testing needs to be done in order to evaluate this hypothesis and determine the precise inflection point at which stochastic optimization techniques such as genetic algorithms surpass Clustal W in both alignment quality and scalability. The second approach overcomes the problems of progressive alignment algorithms and permits optimization of the alignment on all sequences simultaneously. The experiments have confirmed that the GAs perform and scale better than the heuristic techniques.

7. Future Work

For future work, we would like to extend this approach to align protein sequences and implement statistical scoring techniques. Once this GA supports the alignment of amino acid sequences, we intend to perform benchmarked comparisons against other techniques using the BAliBASE (34, 35) alignment database. We are currently investigating an approach that incorporates several statistical or simulation techniques with phylogeny in order to better quantify the significance of alignments. In addition, we intend to study the parameterization of our GA in great detail to determine how best to converge to a near-optimal guide tree in the minimum number of iterations. The effects of modifying tunable parameters such as population size and mutation rate on solution convergence rates and solution quality will be fully explored. In addition, both sequence alignment and GAs can be easily parallelized. Therefore, we intend to parallelize our fitness functions in each approach such that they can be efficiently handled in parallel on a Beowulf cluster.
Acknowledgements

This research used equipment funded in part by NIH NCRR 1P20 RR16448, and NIH NCRR 1P20 RR16454. Shyu was partially funded by a grant from Procter and Gamble and Sheneman was partially funded by NIH NCRR 1P20 RR16448. Foster was partially funded for this research by NIH NCRR 1P20 RR16448. The authors would like to thank Jason Evans for helpful reviews and comments.

References

1. S. F. Altschul, “Amino acid substitution matrices from an information theoretic perspective”, J. of Mol. Biol., vol. 219, 1991, pp. 555-565.
2. S. F. Altschul, R. J. Carroll, and D. Lipman, “Weights for data related by a tree”, J. of Mol. Biol., vol. 207, 1989, pp. 647-653.
3. S. F. Altschul and D. Lipman, “Trees, stars, and multiple sequence alignment”, SIAM J. of Appl. Math., vol. 49, 1989, pp. 197-209.
4. H. Carrillo and D. Lipman, “The multiple sequence alignment problem in biology”, SIAM J. of Appl. Math., vol. 48, 1988, pp. 1073-1082.
5. S. B. Carroll, J. K. Grenier, and S. D. Weatherbee, From DNA to diversity: molecular genetics and the evolution of animal design, Malden, MA: Blackwell Science, 2001.
6. W. H. Day and F. R. McMorris, “The computation of consensus patterns in DNA sequences”, Mathematical and Computer Modelling, vol. 17, 1993, pp. 49-52.
7. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge, UK: Cambridge University Press, 1998.
8. D. Feng and R. F. Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic trees”, J. of Mol. Evol., vol. 25, 1987, pp. 351-360.
9. D. B. Fogel and D. W. Corne (Eds.), Evolutionary Computation in Bioinformatics, San Francisco, CA: Morgan Kaufmann Publishers, 2003.
10. D. Graur and W. H. Li, Fundamentals of Molecular Evolution (2nd ed), Sunderland, MA: Sinauer Associates, 2002.
11. D. Gusfield, Algorithms on strings, trees and sequences: computer science and computational biology, New York, NY: Cambridge University Press, 1997.
12. B. G. Hall, Phylogenetic trees made easy: a how-to manual for molecular biologists, Sunderland, MA: Sinauer Associates, 1997.
13. J. H. Holland, Adaptation in natural and artificial systems, Ann Arbor: University of Michigan Press, 1975.
14. J. T. Horng, C. M. Lin, B. J. Liu, and C. Y. Lao, “Using genetic algorithm to solve multiple sequence alignment”, in E. Wingender, et al (eds.) Proc. Germ. Conf. on Bioinfo., 2001, pp. 883-890.
15. M. Isokawa, M. Wayama, and T. Shimizu, “Multiple sequence alignment using a genetic algorithm”, Genome Informatics, vol. 7, 1997, pp. 176-177.
16. T. H. Jukes and C. Cantor, “Evolution of protein molecules”, in H. N. Munro (ed.) Mammalian Protein Metabolism, New York: Academic Press, 1969, pp. 21-132.
17. K. Karadimitriou and D. H. Kraft, “Genetic algorithms and the multiple sequence alignment program in biology”, in T. R. Tiersch, et al (eds.) Proc. 2nd Ann. Baton Rouge Area Mole. Biol. and Biotec. Conf., 1996.
18. J. M. Keith, P. Adams, D. Bryant, D. P. Kroese, K. R. Mitchelson, D. A. E. Cochran, and G. H. Lala, “A simulated annealing algorithm for finding consensus sequences”, Bioinformatics, vol. 18, no. 11, 2002, pp. 1494-1499.
19. B. Morgenstern, “DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment”, Bioinformatics, vol. 15, no. 3, 1999, pp. 211-218.
20. S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, J. of Mol. Biol., vol. 48, 1970, pp. 443-453.
21. C. Notredame and D. G. Higgins, “SAGA: sequence alignment by genetic algorithm”, Nucleic Acids Research, vol. 24, no. 8, 1996, pp. 1515-1524.
22. C. N. S. Pedersen, Algorithms in Computational Biology, Ph.D. dissertation, BRICS, Department of Computer Science, University of Aarhus, Aarhus, Denmark, March 2000.
23. A. Rambaut and N. C. Grassly, “Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees”, Compu. Appl. Biosci., vol. 13, 1997, pp. 235-238.
24. N. Saitou and M. Nei, “The neighbor-joining method: a new method for reconstructing phylogenetic trees”, Molecular Biology and Evolution, vol. 4, no. 4, 1987, pp. 406-425.
25. J. Sauder, J. Arthur, and R. Dunbrack, “Large-scale comparison of protein sequence alignment algorithms with structure alignments”, Proteins: structure, function, and genetics, vol. 40, 2000, pp. 632.
26. S. R. Eddy, “A memory-efficient dynamic programming algorithm for optimal alignment of sequence to an RNA secondary structure”, BMC Bioinformatics, vol. 3, 2002, pp. 13.
27. J. Setubal and J. Meidanis, Introduction to computational molecular biology, Boston, MA: PWS Publishing, 1997.
28. T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences”, J. of Mol. Biol., vol. 147, 1981, pp. 195-197.
29. D. J. States, W. Gish, and S. F. Altschul, “Improved sensitivity of nucleic acid database searches using application-specific scoring matrices”, Methods: A Companion to Methods in Enzymology, vol. 3, no. 1, 1991, pp. 66-70.
30. J. Stoye, S. W. Perrey, and A. W. M. Dress, “Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment”, Applied Mathematics Letters, vol. 10, no. 2, 1997, pp. 67-73.
31. J. Studier and K. Keppler, “A note on the neighbor-joining algorithm of Saitou and Nei”, Molecular Biology and Evolution, vol. 5, 1988, pp. 729-731.
32. J. D. Thompson, T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins, “The Clustal X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools”, Nucleic Acids Research, vol. 25, 1997, pp. 4876-4882.
33. J. D. Thompson, D. G. Higgins, and T. J. Gibson, “Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice”, Nucleic Acids Research, vol. 22, 1994, pp. 4673-4680.
34. J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive comparison of multiple sequence alignment programs”, Nucleic Acids Research, vol. 27, no. 13, 1999, pp. 2682-2690.
35. J. D. Thompson, F. Plewniak, and O. Poch, “BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs”, Bioinformatics, vol. 15, no. 1, 1999, pp. 87-88.
36. L. Wang and T. Jiang, “On the complexity of multiple sequence alignment”, J. of Compu. Biol., vol. 1, 1994, pp. 337-348.
37. M. S. Waterman and M. Eggert, “A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons”, J. of Mol. Biol., vol. 197, 1987, pp. 723-725.
38. W. Wayama, K. Takahashi, and T. Shimizu, “An approach to amino acid sequence alignment using a genetic algorithm”, Genome Informatics, vol. 6, 1995, pp. 122-123.
39. D. Whitley, “A genetic algorithm tutorial”, Statistics and Computing, vol. 4, 1994, pp. 65-85.
40. C. Zhang and A. K. Wong, “A genetic algorithm for multiple molecular sequence alignment”, Computer Applications in the Biosciences, vol. 13, no. 6, 1997, pp. 565-581.