The construction of phylogenetic trees using ACO

Thies Gehrmann (1054147)

Abstract

In this paper, the field of generating phylogenetic trees is reviewed, and in particular a method using an Ant Colony Optimization (ACO) algorithm is described. An implementation of an ACO method is presented and tested. The methods described by Lopes and Perretto [9] are applied to the mtDNA dataset described by Cao et al. [3] to construct a phylogenetic tree for the species in the dataset. It is found that the ACO method gives a fast approximation to an optimal phylogenetic tree, with correctness similar to that of other accepted methods such as Neighbor Joining.

Contents

1 Introduction
2 Biological background
3 Phylogenetic trees
4 Inference methods
5 Our method
6 Constructing the tree
7 Implementation
8 Experiments
9 Results
10 Conclusion

1 Introduction

Reaching a compromise between accuracy and tractability is something which often has to be done in computer science. The problem of inferring phylogenetic trees for a set of taxa is another instance of this. This paper presents the following topics, in order:

1. Biological background: A brief segment explaining what is meant by biological classification.
2. Inference methods: Some traditional methods are explained, and some non-traditional methods are mentioned. They are explained to show the influence that these methods had on the ACO method, and to highlight differences.
3. Our method: Our method, that of Lopes and Perretto [9], is explained in depth.
4. Experiments and results: The method is tested on two datasets, and the results are described.

2 Biological background

Establishing evolutionary lineage among species (the exact genealogy), also called phylogeny, is a problem which has been, and still is, not easily solved. Beyond the computational limitations, we also have to consider that the entire line of ancestors connecting the species is usually not known (lack of transitional fossils) [13]. The first step in establishing these links is finding the distances between species. Instead of species, a more general term is introduced:

Definition 1. Taxon (plural taxa): A named unit which encompasses a distinct group of organisms placed in a taxonomic category, be it species, genus, family, etc.

Historically, the data used to determine these distances were morphological data¹. Since the advent of sequencing methods, we can use molecular data, usually DNA or RNA sequences from homologous genes or proteins.

Definition 2. Homologous characters: Characters whose similarity in different taxa is due to their descent from a common ancestor.

In contrast, homoplasy:

Definition 3. Homoplasy: Character similarities which arose in different taxa through parallelism or convergence.

Establishing homologous relationships between characters is not easy, since similar characters can develop through different mechanisms of evolution [13].

¹ Data which describes the anatomical structure of a species, e.g. beak shape, feather color, foot shape, etc.

2.1 Evolutionary distances

Given two taxa and sequences of homologous genes, we want to find some kind of distance between them. An example of a representative sequence could be the amino acid sequence of proteins in two different organisms which perform the same function. In our case, we will use mitochondrial DNA as our feature. This is explained in more detail later.

There are many ways to find distances between taxa, but two will be discussed here. Multiple sequence alignment is the most common method, and is explained first. Secondly, a method described by Li et al. [8], and used in [9], which uses Kolmogorov complexity to calculate distances, is described. This method is also used in this paper.

Multiple Sequence Alignment

Multiple Sequence Alignment (MSA) is explained by way of example:

Example 1. Take as an example the amino acid sequences of the p53 protein, a tumor suppressor common in many multicellular organisms. A common way to find the distances between more than two such sequences is to first perform an MSA on them. Figure 1 shows a segment of MSA output.

Figure 1: MSA segment for eight p53 protein sequences

Multiple sequence alignments produce alignment tables, which highlight patterns of amino acid (or nucleotide) conservation among taxa. Typically, it is a good idea to use amino acid sequences rather than nucleotide sequences, since they are much shorter, and can be converted back to their DNA counterparts even after the alignment. There are many problems associated with MSA, but that is a different subject of study altogether. The MSA tool used in this paper is the ClustalW³ application of the European Bioinformatics Institute².

Tools from the Phylip⁴ package (such as protdist or dnadist) can be used to calculate the evolutionary distance matrix, given an MSA. These distances are called the "pairwise projections" of the alignment. Typically, these tools measure how much change (i.e. how many amino acid or nucleotide mutations) is needed to get from one sequence to another (i.e. a form of Hamming distance). This gives us a distance matrix like the following:

d     Delphinapterus l.   Bos p.     Canis l. f.
DL    0                   .122304    .122051
BP    .122304             0          .190062
CLF   .122051             .190062    0

△

Performing an MSA takes a long time; waiting a few hours for the results is normal.

Kolmogorov Complexity

Li et al. [8] described the use of Kolmogorov complexity to determine a valid distance (i.e. one meeting the symmetry and triangle inequality conditions) between two sequences.

Definition 4. Kolmogorov Complexity: A measure of the complexity needed to describe something. Often defined as the length of the smallest way to describe a sequence of characters, a minimal descriptor: |d(x)|.

Li described a function dist(x, y) in his paper using Kolmogorov complexity, shown in equation 1.

\[ \mathrm{dist}(x, y) = 1 - \frac{K(x) - K(x \mid y)}{K(xy)} \tag{1} \]

K(x) is the Kolmogorov complexity of the string x, equivalent to K(x|ε); K(x|y) is the conditional Kolmogorov complexity of x given y; and K(xy) is the complexity of the concatenation of x and y. We define K(x) = |d(x)|, the length of the minimal descriptor of x, which is taken to be the size of the compressed sequence using GenCompress⁵. The numerator of equation 1, K(x) − K(x|y), measures the extent to which y "knows about" x, i.e. their similarity. The denominator simply normalizes the numerator, and we subtract the result from 1 to find a measure of their difference, their distance.

A program was written to take as input multiple sequences in FASTA⁶ format and output a distance matrix.

Example 2. An example of the distances seen is shown in the following table:

d     Homo sapiens   Pan paniscus   Halichoerus grypus
HS    0              .66086         .97643
PP    .65913         0              .98176
HG    .97682         .97560         0

The distance matrix is not quite symmetric. This is because K(x) − K(x|y) ≈ K(y) − K(y|x) holds only approximately. As long as one side of the matrix is used consistently, this does not become an issue.
△

² EBI: http://www.ebi.ac.uk
³ ClustalW2: http://www.ebi.ac.uk/Tools/clustalw2/
⁴ Phylip phylogenetic software package: http://evolution.genetics.washington.edu/phylip.html
⁵ GenCompress: http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
⁶ FASTA: a text format for storing sequences with header information

2.2 Phylogenetic classification

In the construction of phylogenies, we are typically interested in representing them as cladograms⁷. Such an example tree is given in figure 2(a), which shows an example phylogenetic tree generated from the p53 protein expressed in various organisms.

Often, the goal of phylogenetic classification is defined as the parsimony principle, or the principle of minimal evolution. This principle works along the same lines as "Occam's razor", in that the best (most correct) solution is most likely the one which minimizes the distances between the taxa, while remaining true to the observed data.

Figure 2: 2(a) Rooted phylogenetic tree for 8 species based on expressed p53 proteins. 2(b) Unrooted phylogenetic tree for 8 species based on expressed p53 proteins.

⁷ Also known as evolutionary trees or phylogenetic trees

2.3 Our mitochondrial dataset

The dataset we will use is based on the dataset created by Cao et al. [3]. Mitochondrial DNA is used from 20 individual species, from several mammalian branches (primates, ferungulates, rodents). This is the mtDNA dataset.

Definition 5. Mitochondria: Mitochondria are small organelles found in the cells of all animals; they are thought to descend from once free-living organisms. In this symbiotic relationship, they produce energy for the cell from various food sources.

All animals have mitochondria, and they can give us a good indication of their ancestral relationships. Detailed here is the dataset, in the format {Latin name (common name; Nucleotide ID⁸)}.

◦ Bos taurus (Cow; V00654)
◦ Balaenoptera physalus (Fin whale; X61145)
◦ Balaenoptera musculus (Blue whale; X72204)
◦ Phoca vitulina (Harbor seal; X63726)
◦ Halichoerus grypus (Gray seal; X72004)
◦ Felis catus (Cat; U20753)
◦ Equus caballus (Horse; X79547)
◦ Rhinoceros unicornis (Indian rhinoceros; X97336)
◦ Mus musculus (Mouse; V00711)
◦ Rattus norvegicus (Rat; X14848)
◦ Homo sapiens (Human; D38112)
◦ Pan troglodytes (Chimpanzee; D38113)
◦ Pan paniscus (Bonobo; D38116)
◦ Gorilla gorilla (Gorilla; D38114)
◦ Pongo pygmaeus p. (Bornean orangutan; D38115)
◦ Pongo pygmaeus a. (Sumatran orangutan; X97707)
◦ Hylobates lar (Common gibbon; X99256)
◦ Didelphis virginiana (Virginia opossum; Z29573)
◦ Macropus robustus (Wallaroo; Y10524)
◦ Ornithorhynchus anatinus (Platypus; X83427)

⁸ NCBI Nucleotide database: http://www.ncbi.nlm.nih.gov/nuccore

3 Phylogenetic trees

There exist two representations of phylogenetic trees, rooted and unrooted. Unrooted trees express the relationships between taxa better, whereas rooted trees describe a clear direction of evolution. There follows an exact definition of phylogenetic trees.

Definition 6. Rooted Phylogenetic Tree: A rooted tree is a directed weighted graph G(V, E, d) in which there is a root node r. Leaves are nodes with an indegree of 1 and an outdegree of 0. Interior nodes have an indegree of 1 and an outdegree of 2, with the exception of the root node, which has an indegree of 0 and an outdegree of 2.

Definition 7. Unrooted Phylogenetic Tree: An unrooted tree is an undirected weighted graph G(V, E, d) in which there is no (traditional) root node; the "root" of the tree is considered to be the node which has an outdegree of 3. Leaves are nodes with a degree⁹ of 1, and interior nodes have a degree of 3.

In all cases, leaves represent the known taxa, and the interior nodes usually represent the unknown, inferred ancestor taxa. An unrooted tree can be converted to a rooted tree by considering an ancestor node, intermediate to an internal node, as the root node.

⁹ degree = indegree + outdegree
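The distance of equation 1 (section 2.1) can be approximated with any general-purpose compressor. The following is a minimal sketch, not the program used in this paper. It substitutes Python's zlib for GenCompress purely so that the example is self-contained, and it approximates the conditional complexity as K(x|y) ≈ K(yx) − K(y), which is an assumption of this sketch. The function names and the toy sequences are hypothetical.

```python
import random
import zlib


def K(seq: str) -> int:
    """Approximate K(x) by the compressed size of x in bytes (zlib, not GenCompress)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def K_cond(x: str, y: str) -> int:
    """Approximate K(x|y) as K(yx) - K(y); an assumption of this sketch."""
    return K(y + x) - K(y)


def dist(x: str, y: str) -> float:
    """Equation 1: dist(x, y) = 1 - (K(x) - K(x|y)) / K(xy)."""
    return 1.0 - (K(x) - K_cond(x, y)) / K(x + y)


def distance_matrix(seqs: dict) -> dict:
    """Pairwise distances for a {name: sequence} mapping, with a zero diagonal."""
    return {a: {b: 0.0 if a == b else dist(seqs[a], seqs[b]) for b in seqs}
            for a in seqs}


# Hypothetical toy data: a random sequence, a lightly mutated copy, an unrelated one.
rng = random.Random(0)
base = "".join(rng.choice("ACGT") for _ in range(2000))
mutant = list(base)
for pos in rng.sample(range(len(base)), 20):
    mutant[pos] = rng.choice("ACGT")
seqs = {
    "base": base,
    "mutant": "".join(mutant),
    "unrelated": "".join(rng.choice("ACGT") for _ in range(2000)),
}
matrix = distance_matrix(seqs)
print(matrix["base"]["mutant"], matrix["base"]["unrelated"])
```

The lightly mutated copy should come out much closer to the original than the unrelated sequence does. As noted in Example 2, a matrix computed this way is generally not exactly symmetric.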
4 Inference methods

There are many traditional methods which try to produce phylogenetic trees. Some are more tractable than others, at a cost in accuracy. For example, figure 2(b) was created using the Neighbor Joining algorithm (see section 4.2), whereas figure 2(a) was created by our algorithm (see section 5).

These algorithms are clustering algorithms: they group the taxa in a way similar to an agglomerative clustering algorithm. However, they differ from it because they do not consider a "linkage" criterion. Differences among the methods lie mainly in the assumptions they make about the data, and the ways in which distances are calculated. There are two types of algorithms: those that consider distance-based metrics, and those that consider characters. Distance-based algorithms apply what was described in section 2.1, whereas character-based algorithms use any kind of data, even just raw sequence data.

As a result of the intractability of many of the methods described here, other methods have been explored, among them swarm intelligence methods. Described here are four traditional methods, and (very) briefly the Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) methods.

4.1 Fitch-Margoliash method

Described first in [5], this method constructs either rooted or unrooted trees. The process relies on the fact that the distances between two taxa i, j and their ancestral root µ can be approximated by knowing the distances between i, j and a third taxon k. The three taxa can then be grouped into a tree. The following figure shows the relationship between the distances.

[Figure: the taxa i, j and k connected through their ancestral root µ]

It becomes clear that we can ascertain the distances between µ and i, j and k with the following calculation:

d(µ, i) = 0.5 (d(i, j) + (d(i, k) − d(j, k)))

A similar calculation can be used to find d(µ, j). In the case of a problem with more than three taxa, the third taxon k is represented by the average of the remaining taxa. i and j can then be paired into a subtree, which is added to the set of taxa, replacing i and j. The process then repeats until only 3 taxa remain. There follows a small example:

Example 3. Finding the distance between i, j, k and their ancestor µ.

d    i    j    k
i    0    24   28
j    24   0    32
k    28   32   0

d(µ, i) = 0.5 (d(i, j) + (d(i, k) − d(j, k))) = 0.5 (24 + (28 − 32)) = 10
d(µ, j) = d(i, j) − d(µ, i) = 14
d(µ, k) = d(i, k) − d(µ, i) = d(j, k) − d(µ, j) = 18
△

4.2 UPGMA and Neighbor Joining

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is an algorithm which clusters nodes from the bottom up. It assumes a constant evolutionary clock, i.e. all taxa are equally distant from the root (the tree is ultrametric). Starting at the bottom (each node representing a taxon), it groups the two nodes which have the smallest distance between them, by adding an ancestor node to represent them. UPGMA does this until the entire level is joined, recalculates the distances for these ancestor nodes (by considering arithmetic averages of the distances between the two taxa), and repeats the same process on the set of ancestor nodes. This results in rooted trees. UPGMA is rarely used today, but it paved the way for the Neighbor Joining algorithm (NJ). The exact functionality of UPGMA is described in [2].

Neighbor Joining is similar to UPGMA, in that it recalculates distances at every step, but it does not assume a constant evolutionary clock. The Neighbor Joining algorithm produces unrooted trees. The principle is the same: it is a bottom-up algorithm. Neighbor Joining iteratively selects a taxon pair, pairs them to form a new subtree, and groups the subtree back into the taxon set, reducing the size of the taxon set by one.

Starting from the bottom, NJ groups from the set of nodes S the two nodes i, j which have the smallest (rate-corrected) distance between them, and represents them as an intermediate node u. Rather than doing this for all nodes on the level, it removes i and j from S, and adds u. It recalculates the distances for the new node u, and performs this step again until all nodes have been grouped. The exact functionality of NJ is described in [6].
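The three-point calculation of the Fitch-Margoliash method (section 4.1, Example 3) can be written out as a short sketch. The function names are hypothetical. The second function illustrates the remark that, with more than three taxa, the third taxon k is replaced by the average over the remaining taxa.

```python
def three_point(d_ij: float, d_ik: float, d_jk: float):
    """Return (d(mu, i), d(mu, j), d(mu, k)) for three taxa joined at ancestor mu."""
    d_mu_i = 0.5 * (d_ij + (d_ik - d_jk))
    d_mu_j = d_ij - d_mu_i
    d_mu_k = d_ik - d_mu_i
    return d_mu_i, d_mu_j, d_mu_k


def branches_with_rest(d: dict, i: str, j: str):
    """For more than three taxa, use the average distance to all remaining taxa as k."""
    rest = [k for k in d if k not in (i, j)]
    d_ik = sum(d[i][k] for k in rest) / len(rest)
    d_jk = sum(d[j][k] for k in rest) / len(rest)
    return three_point(d[i][j], d_ik, d_jk)


# The distance matrix of Example 3.
d = {
    "i": {"i": 0, "j": 24, "k": 28},
    "j": {"i": 24, "j": 0, "k": 32},
    "k": {"i": 28, "j": 32, "k": 0},
}
branches = three_point(24, 28, 32)
print(branches)  # (10.0, 14.0, 18.0)
print(branches_with_rest(d, "i", "j"))  # same values, since only k remains
```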
4.3 Maximum parsimony

The Maximum parsimony algorithm tests all the trees, scoring them along the way, and picks the tree with the best score (i.e. the smallest: the tree describing the smallest number of steps between taxa). There are far too many trees to test for even a small number of taxa, and pruning methods must therefore be applied. It has been shown that knowing the shortest TSP path for the set can reduce the number of trees which need to be searched [7].

Unfortunately, this method will often return multiple trees with the same score. This is often a result of using characters which are not homologous and exhibit a high degree of homoplasy.

4.4 PSO

Only one paper was found that discusses using the Particle Swarm Optimization algorithm to construct phylogenetic trees [10]. Unfortunately, this paper was so poorly written that it is nearly impossible to understand it. It is not even clear how the trees are represented or arrived at. After reading the paper, it becomes clear why the method was not described in detail in the review paper [12].

4.5 ACO

The use of Ant Colony Optimization, however, is extensively studied [9], [14], [1]. In this paper, the algorithm described by Perretto and Lopes [9] will be presented and implemented.

5 Our method

Our method is composed of multiple steps.

1. Obtain a distance matrix (described in section 2.1).
2. Use an ACO algorithm to obtain a best path p and a pheromone matrix τ.
3. Use τ to construct the tree.

The ACO algorithm was implemented based on the work of Perretto and Lopes [9]. They used a modified version of the traveling salesman problem (TSP). First, the method used for the TSP will be explained, and then the changes will be highlighted.

5.1 ACO applied to the TSP

The ACO algorithm is a meta-heuristic which models the behaviour of ants while searching for food. It has been observed that ants, collectively, can find the shortest path between a food source and their nest by laying pheromones along the ground. The paths upon which high pheromone levels develop will be preferred by ants, and it is by this mechanism that shorter paths accumulate a larger pheromone level than others.

The TSP is a very famous problem in which a salesman has to visit a number of cities, and wishes to find out how he can do this in the least distance possible. The problem was approached using ACO by Dorigo [4]. Algorithm 1 shows the general outline of the algorithm. There are some important points to note.

Transition probabilities

A random value q is drawn from [0, 1]. If q is less than q0 (a parameter), then the node with the highest transition probability is chosen. If q > q0, then the next node is chosen from the distribution pa. Equation 2 shows the distribution used for each ant to pick the next node.

\[ p_a(c, n) = \frac{\tau(c, n) \cdot \eta(c, n)^{\beta}}{\sum_{u \notin V_a} \tau(c, u) \cdot \eta(c, u)^{\beta}} \tag{2} \]

Equation 2 finds the probability of proceeding to the next node n from the current node c. Va is the set of nodes which ant a has visited so far. τ(i, j) is the pheromone value on the edge between nodes i and j. η(i, j) is a heuristic chosen to be d(i, j)^−1, where d(i, j) is the distance between nodes i and j.

Example 4. At a given node c, pa(c, A) represents the probability of moving to node A, whereas pa(c, B) represents the probability of moving to node B. Other nodes may already have been visited; they are therefore in the set of nodes the ant has already visited, and will not be assigned a probability.
△

Scoring function

The score of a path is the sum of the lengths of its edges, seen in equation 3, for a given ant a.

\[ \mathrm{score}(a) = \sum_{(i, j) \in \mathrm{path}(a)} d(i, j) \tag{3} \]
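The node selection rule built on equation 2 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this paper: the function names, the nested-dictionary representation of τ and d, and the toy values are hypothetical, and the defaults β = 2 and q0 = 0.9 are simply the default parameter values listed in section 7.

```python
import random


def transition_probabilities(c, unvisited, tau, d, beta=2.0):
    """Equation 2: p(c, n) is proportional to tau(c, n) * eta(c, n)^beta, eta = 1/d."""
    weights = {n: tau[c][n] * (1.0 / d[c][n]) ** beta for n in unvisited}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def pick_next_node(c, unvisited, tau, d, q0=0.9, beta=2.0, rng=random):
    """If q < q0 take the most probable node, otherwise sample from the distribution."""
    p = transition_probabilities(c, unvisited, tau, d, beta)
    if rng.random() < q0:
        return max(p, key=p.get)
    nodes = list(p)
    return rng.choices(nodes, weights=[p[n] for n in nodes], k=1)[0]


# Hypothetical toy situation in the spirit of Example 4: the ant is at node c
# and nodes A and B have not been visited yet.
tau = {"c": {"A": 1.0, "B": 1.0}}
d = {"c": {"A": 1.0, "B": 2.0}}
p = transition_probabilities("c", ["A", "B"], tau, d)
print(p)
print(pick_next_node("c", ["A", "B"], tau, d))
```

With equal pheromone on both edges, the closer node A receives the larger probability, because only the heuristic term differs.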
Pheromone updating

Another point of interest is the pheromone updating functions. There are two: the local update, in equation 4, and the global update, seen in equation 5:

\[ \tau(i, j) = (1 - \rho)\,\tau(i, j) + \rho \cdot \tau_0 \tag{4} \]

\[ \tau(i, j) = (1 - \rho)\,\tau(i, j) + \frac{\rho}{b_A} \tag{5} \]

ρ is a pheromone evaporation parameter. τ0 is the initial value of the pheromone matrix, defined as (n · Lnn)^−1, where n is the number of nodes and Lnn is the length of the path found by the Nearest Neighbor heuristic¹⁰. bA is the score of the best path found at the end of the cycle.

¹⁰ NN heuristic: at each node, pick the closest neighbor to be your next.

Algorithm 1 ACO algorithm to solve TSP
Input:
  ◦ Complete weighted graph G(V, E, d), V = {vertices}, E = {edges}, d : e → R, e ∈ E
  ◦ ρ: Pheromone evaporation factor
Output: p : N → V : Best path

  τ(i, j) = τ0 = (n · Lnn)^−1, ∀i, j                          # Pheromone matrix
  p = ∅                                                        # Path is empty
  A = {k ants}                                                 # Set of ants
  pathA : a → (N → V²)                                         # Set of paths created by each ant
  scoreA : a → R                                               # Set of path scores for each ant
  repeat
    for all a ∈ A do
      pathA(a) = construct_path(G, τ)
      scoreA(a) = score(pathA(a))
      τ(i, j) = (1 − ρ)τ(i, j) + ρ · τ0, ∀(i, j) ∈ pathA(a)    # Local pheromone update
    end for
    bA = min(scoreA)                                           # Best tour length
    τ(i, j) = (1 − ρ)τ(i, j) + ρ / bA, ∀i, j                   # Global pheromone update
  until all cycles completed
  return p = pathA(a), s.t. scoreA(a) = min(scoreA)

5.2 Our modified TSP

There are a few differences between the solution in section 5.1 and the one we applied. The changes can be seen in Algorithm 2. The goal of the modified TSP is to produce a pheromone matrix where the weights are representative of the distances between the species.

Transition probabilities

Transition probabilities are calculated in a similar fashion to equation 2. Equation 6 shows the probability function for the modified TSP.

\[ p_a(c, n) = \frac{\tau(c, n)^{\alpha} \cdot d(c, n)^{-\beta}}{\sum_{u \notin V_a} \tau(c, u)^{\alpha} \cdot d(c, u)^{-\beta}} \tag{6} \]

The only differences are that the distance is used explicitly in the function, whereas previously it was only a suggested choice of heuristic, and the introduction of another scaling factor α. Together, α and β make a trade-off between exploration and exploitation of good edges in the path.

Scoring function

The score of a path is given as the sum of the transition probabilities. The higher the score of the path, the better. Equation 7 shows how to calculate the score.

\[ \mathrm{score}_A(a) = \sum_{(i, j) \in \mathrm{path}_A(a)} p_a(i, j) \tag{7} \]

Pheromone updating

Pheromones are updated differently. Lopes and Perretto do not perform a local update of the pheromone levels, and instead only update globally at the end of each cycle. Equation 8 is used to update the pheromones.

\[ \tau(i, j) = \rho \cdot \tau(i, j) + (1 - \rho) \cdot \Delta\tau(i, j) \tag{8} \]

\[ \Delta\tau(i, j) = \sum_{k} \begin{cases} \mathrm{score}_A(k) \cdot \mathrm{score}_{best}^{-1} & \text{if } (i, j) \in \mathrm{path}_A(k) \\ 0 & \text{otherwise} \end{cases} \tag{9} \]

where score_best is the score of the best path.

Distance recalculation

The most important difference is that the distances between each taxon and the current taxon have to be recalculated at every step. This is because the distance between an ancestor node µ of a taxon t1 and another taxon tn is different from the distance between t1 and tn, i.e. d(t1, tn) ≠ d(µ, tn). Distances are recalculated according to the following rule:

\[ d(\mu, n) = d_{\mu n}(i, j) = \begin{cases} d(i, n) + [d(i, n) - d(j, n)] \cdot \delta & \text{if } d(j, n) > d(i, n) \\ d(j, n) + [d(j, n) - d(i, n)] \cdot \delta & \text{if } d(j, n) < d(i, n) \end{cases} \tag{10} \]

These calculations are similar to those used in the Fitch-Margoliash algorithm (see section 4.1). We are interested in the distance between the ancestral node µ and the next node n we pick, d(µ, n). This can be calculated using equation 10, where i and j represent the descendants of µ. δ is a parameter between 0 and 1, which defines the closeness of the ancestor node to its children. The larger δ is, the closer the ancestor will be. The justification for distance recalculation is given as an example:

Example 5. In figure 2(a), the species Bos primigenius and Delphinapterus leucas are grouped to form an ancestor node. The distances between this new species and the others (Macaca fascicularis, etc.) are obviously different from those of the child nodes. This is the justification for the distance recalculation.
△

There follows an example of the distance recalculation:

Example 6. Imagine three taxa with the distance matrix:

d    A    B    C
A    0    1    2
B    1    0    3
C    2    3    0

If we go from A to B, we get a new, inferred taxon µ1, and do the following calculations:

δ = 0.5
d(µ1, C) = dµ1C(A, B) = 2 + (2 − 3) · δ = 1.5
d(µ1, A) = d(A, C) − dµ1C(A, B) = 0.5
d(µ1, B) = d(B, C) − dµ1C(A, B) = 1.5

The new distance matrix will look like this:

d     A     B     C     µ1
A     0     1     2     0.5
B     1     0     3     1.5
C     2     3     0     1.5
µ1    0.5   1.5   1.5   0
△

Figure 3 outlines the process of creating the path while changing the distances.

Algorithm 2 ACO for the modified TSP
Input:
  ◦ Complete weighted graph G(V, E, d), V = {vertices}, E = {edges}, d : e → R, e ∈ E
  ◦ ρ: Pheromone evaporation factor
Output:
  ◦ p : N → V : Best path
  ◦ τ : Pheromone matrix

  τ(i, j) = τ0 = (n · Lnn)^−1, ∀i, j                  # Pheromone matrix
  p = ∅                                                # Path is empty
  A = {k ants}                                         # Set of ants
  pathA : a → (N → V)                                  # Set of paths created by each ant
  scoreA : a → R                                       # Set of path scores for each ant
  repeat
    for all a ∈ A do
      for i = 1 to |V| do
        pathA(a)(i) = pick_next_node()
        recalculate_distances()
      end for
      scoreA(a) = score(pathA(a))
    end for
    scorebest = min(scoreA)                            # Best tour length
    ∆τ(i, j) = Σk scoreA(k) · scorebest^−1 if (i, j) ∈ pathA(k), 0 otherwise    # Pheromone update
    τ(i, j) = ρ · τ(i, j) + (1 − ρ) · ∆τ(i, j), ∀i, j
  until all cycles completed
  return τ and p = pathA(a), s.t. scoreA(a) = min(scoreA)
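The distance recalculation of equation 10 can be checked against Example 6 with a short sketch. The function names and the dictionary representation are hypothetical. Two choices below are assumptions of the sketch, since the text does not fix them: when d(i, n) = d(j, n) the common value is returned, and when several other nodes remain, the distances from the ancestor to its two children are derived from the first remaining node, as Example 6 does with C.

```python
def ancestor_distance(d_in: float, d_jn: float, delta: float = 0.5) -> float:
    """Equation 10: distance from the ancestor mu of i and j to another node n."""
    if d_jn > d_in:
        return d_in + (d_in - d_jn) * delta
    # Covers d(j, n) < d(i, n); for equal distances this returns the common value
    # (an assumption, equation 10 leaves that case open).
    return d_jn + (d_jn - d_in) * delta


def add_ancestor(d: dict, i: str, j: str, name: str, delta: float = 0.5) -> dict:
    """Return a copy of the distance matrix d extended with the ancestor of i and j."""
    new = {a: dict(row) for a, row in d.items()}
    new[name] = {name: 0.0}
    others = [n for n in d if n not in (i, j)]
    for n in others:
        new[name][n] = new[n][name] = ancestor_distance(d[i][n], d[j][n], delta)
    # As in Example 6, place the children relative to one remaining node
    # (assumption: the first one is used when there are several).
    ref = others[0]
    for child in (i, j):
        new[name][child] = new[child][name] = d[child][ref] - new[name][ref]
    return new


# The distance matrix of Example 6.
d = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"A": 1, "B": 0, "C": 3},
    "C": {"A": 2, "B": 3, "C": 0},
}
new = add_ancestor(d, "A", "B", "mu1", delta=0.5)
print(new["mu1"])  # distances from mu1 to C, A and B: 1.5, 0.5 and 1.5
```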
Figure 3: Recalculating distances. 3(a): The initial graph G, with A picked as the starting node. Node B is chosen as the next node, and an ancestor node µ1 is created. 3(b): Distances from µ1 to C and D are calculated. C is chosen as the next node, and a new ancestor µ2 is created. 3(c): Distances from µ2 to D are calculated. D is chosen as the last node. The final path is ABCD.

6 Constructing the tree

After the ACO algorithm has finished, a pheromone matrix whose values are representative of the distances between the nodes (including the changes made by updating the distances) is returned. Using this matrix, we can construct a tree with algorithm 3.

Algorithm 3 differs slightly from the one presented in [9], but only to enhance the understanding of the algorithm.

The algorithm recursively groups subtrees with a new, inferred ancestor node until all taxa have been grouped.

Algorithm 3 Algorithm to construct the phylogenetic tree
Input:
  ◦ τ: pheromone matrix
  ◦ T: set of taxa
Output: t: rooted binary tree

  τsorted(n) = (i, j)                 # Sort fields in pheromone matrix in decreasing order.
  grouped = 0                         # The number of taxa that have been grouped.
  iterator = 0
  while grouped < (|T| − 1) do
    (i, j) = τsorted(iterator)
    iterator = iterator + 1
    ia = oldest_ancestor(i)           # If s has no ancestor, oldest_ancestor(s) = s
    ja = oldest_ancestor(j)
    if ia == ja then                  # i.e. if i and j have the same ancestor (previously grouped).
      continue
    end if
    group(ia, ja)                     # Group the two taxa to form a new, inferred ancestor
    grouped = grouped + 1
  end while
  return oldest_ancestor(s), s ∈ T    # Return the oldest ancestor in the set of taxa

An example tree construction follows:

Example 7. Constructing a tree using this pheromone matrix:

τ    A    B    C
A    0    1    3
B    1    0    2
C    3    2    0

τsorted(0) = (A, C)
τsorted(1) = (B, C)
τsorted(2) = (A, B)

As a result, we first group A and C, and finally B and C (where C is replaced by the ancestor it shares with A).
△
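The grouping procedure of Algorithm 3 can be sketched in a few lines and checked against Example 7. This is an illustrative sketch, not the C implementation described in this paper: the function name, the nested-dictionary pheromone matrix, and the representation of an inferred ancestor as a tuple of its two children are all assumptions made for the example.

```python
def construct_tree(tau: dict, taxa):
    """Group taxa along edges of decreasing pheromone value, following Algorithm 3.

    Returns the tree as nested tuples; each tuple is an inferred ancestor.
    """
    taxa = list(taxa)
    # All unordered pairs of taxa, sorted by pheromone value in decreasing order.
    pairs = sorted(
        ((i, j) for idx, i in enumerate(taxa) for j in taxa[idx + 1:]),
        key=lambda edge: tau[edge[0]][edge[1]],
        reverse=True,
    )
    parent = {}  # maps a subtree to its inferred ancestor

    def oldest_ancestor(node):
        while node in parent:
            node = parent[node]
        return node

    grouped = 0
    for i, j in pairs:
        if grouped >= len(taxa) - 1:
            break
        ia, ja = oldest_ancestor(i), oldest_ancestor(j)
        if ia == ja:  # i and j were already grouped under the same ancestor
            continue
        ancestor = (ia, ja)  # new inferred ancestor of the two subtrees
        parent[ia] = parent[ja] = ancestor
        grouped += 1
    return oldest_ancestor(taxa[0])


# The pheromone matrix of Example 7.
tau = {
    "A": {"A": 0, "B": 1, "C": 3},
    "B": {"A": 1, "B": 0, "C": 2},
    "C": {"A": 3, "B": 2, "C": 0},
}
tree = construct_tree(tau, ["A", "B", "C"])
print(tree)  # ('B', ('A', 'C'))
```

A and C are grouped first, and B is then joined to their common ancestor, which matches the order derived in Example 7.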
7 Implementation

Together with Matteo Brunati¹¹ and Alberto Testolin¹², Thies Gehrmann implemented an algorithm mostly similar to the Lopes and Perretto method. The implementation closely follows the algorithms presented previously.

There were several components of the implementation, not all in the same language. Here, a brief overview of them is given, together with a mention of authorship, in order of contribution.

1. Distance calculation (written in Shell script)
2. Program initialization (written in C)
3. ACO algorithm (written in C)
4. Tree construction (written in C)

The main part of the program, composed of components 2, 3 and 4, is constructed from 3 main modules: init → solve → tree. init is responsible for loading the distance matrix and parameters, and for initializing memory. solve produces a usable pheromone matrix. tree computes the tree from a pheromone matrix, and can provide output in a variety of different formats.

¹¹ Matteo Brunati, MSc student
¹² Alberto Testolin, MSc student

Distance calculation

Perretto and Lopes [9] used a distance matrix calculated by Li et al. [8]. This file was no longer available on the internet, so the matrix had to be re-calculated.

Initially, distances were calculated using clustalw and dnadist/protdist, aided by online tools¹³. But these distances were incorrect: the algorithm arrived at strange results which were not expected. Therefore an attempt was made to recalculate the data made by Li et al.

Li et al. used a program called GenCompress, which compresses a DNA sequence into a smaller space. The size of that space was taken to be the Kolmogorov complexity of the sequence. A program was written to automate the calculation of distances between species using raw FASTA sequence data taken from the NCBI Nucleotide database, or converted amino acid sequences taken from the NCBI protein database. This program was written in shell script.

Since the calculations can take a while, the distance matrix is precomputed before the ACO-based mechanism is employed.

This component was written by Thies Gehrmann.

Initialization

This module is a relatively small component, dealing mainly with initializing parameters, reading command line arguments, and loading the distance matrix. It assumes a correct distance matrix is presented; if not, all values are filled with zero distances. To set the default pheromone matrix values, the Nearest Neighbor heuristic distance is calculated.

There are several parameters which can be set (default values in parentheses):

◦ α (1). See equation 6.
◦ β (2). See equation 6.
◦ δ (0.5). See equation 10.
◦ ρ (0.9). See equation 8.
◦ Number of ants (500).
◦ Number of cycles (50).
◦ Favor for best node, q0 (0.9). See equation 2.

This component was written by Thies Gehrmann.

ACO Algorithm

The algorithm was implemented as shown in algorithm 2.

This module is passed a distance matrix, a pheromone matrix, and several other parameters; it returns the final pheromone matrix to the tree module.

This component was written by Thies Gehrmann and Alberto Testolin.

Tree construction

The tree is produced from the pheromone matrix, as in algorithm 3. This component can output in two different formats, Graphviz DOT format¹⁴ and Newick tree format¹⁵.

This component was written by Thies Gehrmann.

¹³ Mobyle@pasteur: http://mobyle.pasteur.fr
¹⁴ Graphviz: http://www.graphviz.org/
¹⁵ Newick tree format: http://evolution.genetics.washington.edu/phylip/newicktree.html

8 Experiments

Two datasets were constructed:

• A small dataset consisting of the p53 protein coding genes from 8 different species. These species were chosen for no particular reason, other than that they came up first in a search of the NCBI protein database¹⁶.
• The mtDNA dataset, described by Cao et al. [3], and explained previously in section 2.3.

The field of phylogenetic tree construction is quite inexact. The most any method can really try to do is to determine which species are closer to one another, or to state which groups "belong to each other" more than other groups. The groupings are therefore what interests us.

When examining the trees, note that the ones produced by Neighbor Joining are unrooted, whereas those produced by our algorithm are rooted.

¹⁶ NCBI Protein database: http://www.ncbi.nlm.nih.gov/protein/

8.1 p53 dataset

Figure 2(a) was produced by our ACO algorithm and figure 2(b) was produced by the traditional Neighbor Joining algorithm. We can see an identical grouping, indicating that our method has performed as well as the Neighbor Joining method.
For the Neighbor Joining algorithm, a different distance matrix (produced from the same initial data) by protdist¹⁷ was used.

¹⁷ Protdist: http://cmgm.stanford.edu/phylip/protdist.html

8.2 mtDNA dataset

Examine figure 4(a) (produced by our algorithm) and figure 4(b) (produced by Neighbor Joining). Again, similarities in grouping can be seen. Observe, for example, the grouping of the branches containing Didelphis virginiana and Mus musculus, which are identical in each tree.

There are some discrepancies between the trees, such as between the Homo sapiens branches, but these are probably due to slight differences in the calculation of the distances. Since the original data from Cao et al. cannot be found, this is the best we could do.

Figure 4: 4(a) ACO-produced tree for mtDNA dataset. 4(b) Neighbor Joining tree for mtDNA dataset.

9 Results

As we saw in section 8, the method produces trees with groupings similar to those of traditional methods.

Lopes and Perretto have worked for years on their implementation, since their first paper in 2005 [11]. Using our implementation to compare against the traditional methods would not give accurate indications of the time growth characteristics of the algorithm. Lopes and Perretto summed up their presentation with the figures shown in figure 5. They also reported some time-growth characteristics which suggest that their ACO solution gives better performance.

Figure 5(a) shows that they measured fairly low execution times for even up to almost 500 species, which is remarkable. Figure 5(b) highlights the perceived increased performance of the algorithm over traditional methods, such as Fitch-Margoliash (see section 4.1) and Neighbor Joining (see section 4.2).

The performance increase arises because the algorithm depends not so much on the number of taxa as on the number of ants and iterations. In our method, clustering of the tree is determined by the pheromone matrix, whereas in traditional methods, the clustering is done by distances between the taxa. Because of this, traditional methods will be more computationally expensive, especially for larger numbers of taxa.

Figure 5: Growth characteristics reported by Lopes and Perretto in [9]. 5(a): Growth of the time needed to complete the run for up to almost 500 species. 5(b): Comparing the execution time between the proposed method and traditional methods.

10 Conclusion

In this paper, a method using ACO to produce phylogenetic trees, as described by Perretto and Lopes, was presented. It was described in detail how the method works. Furthermore, the method was implemented and tested on two entirely different datasets.

It was found that the method performed acceptably compared to the Neighbor Joining algorithm. Since it can offer considerable time performance improvements, the method may be a viable option for problems with very large numbers of taxa.

References

[1] Shin Ando and Hitoshi Iba. Ant algorithm for construction of evolutionary tree. In Proceedings of GECCO 02. IEEE, 2002.

[2] Hans J. Böckenhauer and Dirk Bongartz. Algorithmic Aspects of Bioinformatics (Natural Computing Series). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.

[3] Y. Cao, A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. Journal of Molecular Evolution, 47(3):307–322, September 1998.

[4] M. Dorigo. Ant colonies for the travelling salesman problem. Biosystems, 43(2):73–81, July 1997.

[5] W. M. Fitch and E. Margoliash. Construction of Phylogenetic Trees. Science, 155(3760):279–284, January 1967.

[6] Olivier Gascuel and Mike Steel. Neighbor-Joining Revealed. Molecular Biology and Evolution, 23(11):1997–2000, November 2006.

[7] Chantal Korostensky and Gaston H. Gonnet. Using traveling salesman problem algorithms for evolutionary tree construction. Bioinformatics, 16(7):619–627, July 2000.

[8] Ming Li, Jonathan H. Badger, Xin Chen, Sam Kwong, Paul Kearney, and Haoyong Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149–154, February 2001.

[9] Heitor S. Lopes and Mauricio Perretto. An Ant Colony system for large-scale phylogenetic tree reconstruction. Journal of Intelligent and Fuzzy Systems, 18(6):575–583, January 2007.

[10] Hui-Ying Lv, Wen-Gang Zhou, and Chun-Guang Zhou. A discrete particle swarm optimization algorithm for phylogenetic tree reconstruction. pages 2650–2654, 2004.

[11] Mauricio Perretto and Heitor Silvério Lopes. Reconstruction of phylogenetic trees using the ant colony optimization paradigm. Genetics and Molecular Research, 4(3):581–589, 2005.

[12] Jeffrey Rizzo and Eric C. Rouchka. Review of Phylogenetic Tree Construction. Technical report, University of Louisville, November 2007.

[13] Monroe W. Strickberger. Evolution. Jones and Bartlett, 2000.

[14] Karla Vittori, Alexandre C. B. Delbem, and Sergio L. Pereira. Ant-Based Phylogenetic Reconstruction (ABPR): A new distance algorithm for phylogenetic estimation based on ant colony optimization. Genetics and Molecular Biology, 31(4), December 2008.