The construction of phylogenetic trees using ACO

Thies Gehrmann 1054147
∗
Abstract
In this paper, the field of generating phylogenetic trees is reviewed, and in particular a method using an Ant Colony Optimization (ACO) algorithm is described. An implementation of an ACO method is presented and tested. The methods described by Lopes and Perretto [9] are applied to the mtDNA dataset described by Cao et al. [3] to construct a phylogenetic tree for the species in the dataset. It is found that the ACO method gives a fast approximation to an optimal phylogenetic tree which exhibits correctness similar to other accepted methods such as Neighbor Joining.
Contents
1 Introduction
2 Biological background
3 Phylogenetic trees
4 Inference methods
5 Our method
6 Constructing the tree
7 Implementation
8 Experiments
9 Results
10 Conclusion
∗ [email protected],
[email protected]
1 Introduction

Reaching a compromise between accuracy and tractability is often necessary in computer science. Inferring phylogenetic trees for a set of taxa is one such problem. This paper covers the following topics, in order:

1. Biological background
A brief section explaining what is meant by biological classification and what it entails.
2. Inference methods
Some traditional methods are explained, and some non-traditional methods are mentioned. They are explained to show the influence that these methods had on the ACO method, and to highlight differences.
3. Our method
Our method, that of Lopes and Perretto [9], is explained in depth.
4. Experiments and results
The method is tested on two datasets, and the results are described.

2 Biological background

Establishing the evolutionary lineage among species (the exact genealogy), also called phylogeny, is a problem which has been, and still is, not easily solved. Beyond the computational limitations, we also have to consider that the entire line of ancestors that connects every species is usually not known (lack of transitional fossils) [13]. The first step in establishing these links is finding the distances between species. Instead of species, a more general term is introduced:

Definition 1. Taxon (plural taxa): A named unit which encompasses a distinct group of organisms placed in a taxonomic category, be it species, genus, family, etc.

Historically, these distances were determined from morphological data¹. Since the advent of sequencing methods, we can use molecular data, usually DNA or RNA sequences from homologous genes or proteins.

Definition 2. Homologous characters: Characters whose similarity in different taxa is because of their descent from a common ancestor.

In contrast, homoplasy:

Definition 3. Homoplasy: Character similarities which arose in different taxa through parallelism or convergence.

Establishing homologous relationships between characters is not easy, since similar characters can develop through different mechanisms of evolution [13].

2.1 Evolutionary distances

Given two taxa and sequences of homologous genes, we want to find some kind of distance between them. An example of a representative sequence could be the amino acid sequence of proteins in two different organisms which perform the same function. In our case, we will use mitochondrial DNA as our feature. This is explained in more detail later.

There are many ways to find distances between taxa, but two will be discussed here. Multiple sequence alignment is the most common method, and is explained first. Secondly, a method described by Li et al. [8], and used in [9], which uses Kolmogorov complexity to calculate distances, is described. This method is also used in this paper.

Multiple Sequence Alignment

Multiple Sequence Alignment (MSA) is explained by way of example:

Example 1. Given two sequences, let us take as an example the amino acid sequences for the p53 protein, a tumor suppressor common in many multicellular organisms.

A common method to find the distances between more than two such sequences is to first perform an MSA on them. Figure 1 shows a segment of MSA output. Multiple sequence alignments produce alignment tables, which highlight patterns of amino acid (or nucleotide) conservation among taxa. Typically, it is a good idea to use amino acid sequences over nucleotide sequences, since they are much shorter and can be converted back to their DNA counterparts even after the alignment. There are many problems associated with MSA, but that is a different subject of study altogether. The MSA tool used in this paper is the ClustalW³ application of the European Bioinformatics Institute².

1 Data which describes the anatomical structure of a species, e.g. beak shape, feather color, foot shape, etc.
2 EBI: http://www.ebi.ac.uk
3 ClustalW2: http://www.ebi.ac.uk/Tools/clustalw2/
4 Phylip phylogenetic software package: http://evolution.genetics.washington.edu/phylip.html
Tools from the Phylip⁴ package (such as protdist or dnadist) can be used to calculate the evolutionary distance matrix, given an MSA. These distances are called the "pairwise projections" of the alignment. Typically, these tools measure how much change (i.e. how many amino acid or nucleotide mutations) is needed to get from one sequence to another (i.e. a form of Hamming distance).

This gives us a distance matrix like the following:

d                   DL        BP        CLF
Delphinapterus l.   0         .122304   .122051
Bos p.              .122304   0         .190062
Canis l. f.         .122051   .190062   0

Performing an MSA takes a long time; waiting a few hours for the results is normal.

Kolmogorov Complexity

Li et al. [8] described the use of Kolmogorov complexity to determine a valid distance (i.e. one meeting the symmetry and triangle inequality conditions) between two sequences.

Definition 4. Kolmogorov Complexity: A measure of the complexity needed to describe something. Often defined as the length of the shortest way to describe a sequence of characters, a minimal descriptor: |d(x)|.

Li et al. described a function dist(x, y) using Kolmogorov complexity, shown in equation 1.

$$\mathrm{dist}(x, y) = 1 - \frac{K(x) - K(x|y)}{K(xy)} \qquad (1)$$

K(x) is the Kolmogorov complexity of the string x, equivalent to K(x|ε), K(x|y) is the conditional Kolmogorov complexity of x given y, and K(xy) is the complexity of the concatenation of x and y. We define K(x) = |d(x)|, the length of the minimal descriptor of x, which is taken to be the size of the compressed sequence using GenCompress⁵.

The numerator of equation 1, K(x) − K(x|y), measures the extent to which y "knows about" x, i.e. their similarity. The denominator simply normalizes the numerator, and we subtract the result from 1 to find a measure of their difference, their distance.

A program was written to take as input multiple sequences in FASTA⁶ format and output a distance matrix.

Example 2. An example of the distances seen is shown in the following table:

d                     HS        PP        HG
Homo sapiens          0         .65913    .97682
Pan paniscus          .66086    0         .97560
Halichoerus grypus    .97643    .98176    0

The distance matrix is not quite symmetric, because K(x) − K(x|y) is only approximately equal to K(y) − K(y|x). As long as one side of the matrix is used consistently, this does not become an issue.
△

2.2 Phylogenetic classification

In the construction of phylogenies, we are typically interested in representing them as cladograms⁷. Such an example tree is given in figure 2(a), which shows an example phylogenetic tree generated from the p53 protein expressed in various organisms.

Often, the goal of phylogenetic classification is defined as the parsimony principle, or the principle of minimal evolution. This principle works along the same lines as "Occam's razor", in that the best (most correct) solution is most likely the one which minimizes the distances between each taxon, while keeping true to the perceived data.

2.3 Our mitochondrial dataset

The dataset we will use is based on the dataset created by Cao et al. [3]. The mitochondrial DNA of 20 individual species is used, from several mammalian branches (primates, ferungulata, rodents). This is the mtDNA dataset.

Definition 5. Mitochondria: Mitochondria are small organelles present in animal cells. Originating from a symbiotic relationship, they produce energy for the cell from various food sources.
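The distance of equation 1 (section 2.1), which is applied to the sequences of this dataset, can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in this paper: zlib stands in for GenCompress, and the conditional complexity K(x|y) is approximated as C(yx) − C(y), where C is the compressed size. The function names `c`, `dist` and `distance_matrix` are hypothetical.

```python
import zlib


def c(s: bytes) -> int:
    """Compressed size in bytes; a stand-in for K(.) (zlib, not GenCompress)."""
    return len(zlib.compress(s, 9))


def dist(x: bytes, y: bytes) -> float:
    """Equation 1, with K(x|y) approximated as C(yx) - C(y) (an assumption)."""
    k_x = c(x)
    k_x_given_y = c(y + x) - c(y)
    k_xy = c(x + y)
    return 1.0 - (k_x - k_x_given_y) / k_xy


def distance_matrix(seqs: dict[str, bytes]) -> dict[tuple[str, str], float]:
    """Pairwise distances for named sequences; the diagonal is set to 0."""
    names = list(seqs)
    return {
        (a, b): 0.0 if a == b else dist(seqs[a], seqs[b])
        for a in names
        for b in names
    }
```

As in Example 2, the resulting matrix is generally not exactly symmetric, so one triangle should be used consistently.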
All animals have mitochondria, and mitochondrial DNA can give us a good indication of their ancestral relationships. Detailed here is the dataset, in the format {Latin name (common name; Nucleotide ID⁸)}.

◦ Bos taurus (Cow; V00654)
◦ Balaenoptera physalus (Fin whale; X61145)
◦ Balaenoptera musculus (Blue whale; X72204)
◦ Phoca vitulina (Harbor seal; X63726)
◦ Halichoerus grypus (Gray seal; X72004)
◦ Felis catus (Cat; U20753)
◦ Equus caballus (Horse; X79547)
◦ Rhinoceros unicornis (Indian rhinoceros; X97336)
◦ Mus musculus (Mouse; V00711)
◦ Rattus norvegicus (Rat; X14848)
◦ Homo sapiens (Human; D38112)
◦ Pan troglodytes (Chimpanzee; D38113)
◦ Pan paniscus (Bonobo; D38116)
◦ Gorilla gorilla (Gorilla; D38114)
◦ Pongo pygmaeus p. (Bornean orangutan; D38115)
◦ Pongo pygmaeus a. (Sumatran orangutan; X97707)
◦ Hylobates lar (Common gibbon; X99256)
◦ Didelphis virginiana (Virginia opossum; Z29573)
◦ Macropus robustus (Wallaroo; Y10524)
◦ Ornithorhynchus anatinus (Platypus; X83427)

5 GenCompress: http://www1.spms.ntu.edu.sg/~chenxin/GenCompress/
6 FASTA: a text format for storing sequences with header information
7 Also known as evolutionary trees or phylogenetic trees
8 NCBI Nucleotide database: http://www.ncbi.nlm.nih.gov/nuccore

Figure 1: MSA segment for eight p53 protein sequences

Figure 2: 2(a) Rooted phylogenetic tree for 8 species based on expressed p53 proteins. 2(b) Unrooted phylogenetic tree for 8 species based on expressed p53 proteins.

3 Phylogenetic trees

There follows an exact definition of phylogenetic trees. There exist two representations of phylogenetic trees, rooted and unrooted. Unrooted trees express the relationships between taxa better, whereas rooted trees describe a clear direction of evolution.

Definition 6. Rooted Phylogenetic Tree: A rooted tree is a directed weighted graph G(V, E, d) in which there is a root node r. Leaves are nodes with an indegree of 1 and an outdegree of 0. Interior nodes have an indegree of 1 and an outdegree of 2, with the exception of the root node, which has an indegree of 0 and an outdegree of 2.

Definition 7. Unrooted Phylogenetic Tree: An unrooted tree is an undirected weighted graph G(V, E, d), in which there is no (traditional) root node; the "root" of the tree is considered to be the node which has a degree of 3. Leaves are nodes with a degree⁹ of 1, and interior nodes have a degree of 3.

In all cases, leaves represent the known taxa, and the interior nodes usually represent the unknown, inferred ancestor taxa. An unrooted tree can be converted to a rooted tree by considering an ancestor node as a root node, which is intermediate to an internal node.

9 degree = indegree + outdegree
4 Inference methods

There are many traditional methods which try to produce phylogenetic trees. Some are more tractable than others, at a cost in accuracy. For example, figure 2(b) was created using the Neighbor Joining algorithm (see section 4.2), whereas figure 2(a) was created by our algorithm (see section 5).

These algorithms are clustering algorithms: they group the taxa in a way similar to an agglomerative clustering algorithm. However, they differ from it because they do not consider a "linkage" criterion. Differences among the methods lie mainly in the assumptions they make about the data, and the ways in which distances are calculated.

There are two types of algorithms: those that consider distance-based metrics, and those that consider characters. Distance-based algorithms apply what was described in section 2.1, whereas character-based algorithms use any kind of data, even just raw sequence data.

As a result of the intractability of many of the methods described here, other methods have been explored, among them swarm intelligence methods. Described here are four traditional methods, and (very) briefly described are the Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) methods.

4.1 Fitch-Margoliash method

Described first in [5], the Fitch-Margoliash method constructs either rooted or unrooted trees. The process relies on the fact that the distances between two taxa i, j and their ancestral root µ can be approximated by knowing the distances between i, j and a third taxon k. The three taxa can then be grouped into a tree.

The following figure shows the relationship between the distances.

It becomes clear that we can ascertain the distances between µ and i, j and k with the following calculation: d(µ, i) = 0.5(d(i, j) + (d(i, k) − d(j, k)))

A similar calculation can be used to find d(µ, j).

In the case of a problem with more than three taxa, the third taxon k is represented by the average of the remaining taxa. i and j can then be paired into a subtree, which is added to the set of taxa, replacing i and j. The process then repeats until only 3 taxa remain.

There follows a small example:

Example 3. Finding the distance between i, j, k and their ancestor µ.

d    i    j    k
i    0    24   28
j    24   0    32
k    28   32   0

d(µ, i) = 0.5(d(i, j) + (d(i, k) − d(j, k)))
        = 0.5(24 + (28 − 32))
        = 10
d(µ, j) = d(i, j) − d(µ, i)
        = 14
d(µ, k) = d(i, k) − d(µ, i)
        = d(j, k) − d(µ, j)
        = 18
△

4.2 UPGMA and Neighbor Joining

UPGMA, the Unweighted Pair Group Method with Arithmetic Mean, is an algorithm which clusters nodes from the bottom up. It assumes a constant evolutionary clock, i.e. all leaves are equally distant from the root (the tree is ultrametric). Starting at the bottom (each node representing a taxon), it groups the two which have the smallest distance between them, by adding an ancestor node to represent them. UPGMA does this until the entire level is joined, recalculates the distances for these ancestor nodes (by considering arithmetic averages of the distances between the two taxa), and repeats the same process on the set of ancestor nodes. This results in rooted trees. UPGMA is not used today, but it paved the way for the Neighbor Joining algorithm (NJ). The exact functionality of UPGMA is described in [2].

Neighbor Joining is similar to UPGMA, in that it recalculates distances at every step, but it does not assume a constant evolutionary clock. The Neighbor Joining algorithm produces unrooted trees. The principle is the same: it is a bottom-up algorithm. Neighbor Joining iteratively selects a taxon pair, pairs them to form a new subtree, and groups them back into the taxon set, reducing the size of the taxon set by one.

Starting from the bottom, NJ groups from the set of nodes S the two nodes i, j which have the smallest distance between them, and represents them as an intermediate node u. Rather than doing this for all nodes on the level, it removes i and j from S, and adds u. It recalculates the distances for the new node u, and performs this step again until all nodes have been grouped. The exact functionality of NJ is described in [6].

4.3 Maximum parsimony

The Maximum parsimony algorithm tests all the trees, scoring them along the way, and picks the tree with the best score (i.e. the smallest, the tree describing the smallest number of steps between taxa). There are far too many trees to test for even a small number of taxa, and pruning methods must therefore be applied. It has been shown that knowing the shortest TSP path for the set can reduce the number of trees which need to be searched [7].

Unfortunately, this method will often return multiple trees with the same score. This is often a result of using characters which are not homologous and exhibit a high degree of homoplasy.

4.4 PSO

Only one paper discussed using the Particle Swarm Optimization algorithm to construct phylogenetic trees [10]. Unfortunately, this paper was so poorly written that it is nearly impossible to understand it. It is not even clear how the trees are represented or arrived at. After reading the paper, it becomes clear why the method was not described in detail in the review paper [12].

4.5 ACO

The use of Ant Colony Optimization, however, has been extensively studied [9], [14], [1]. In this paper, the algorithm described by Perretto and Lopes [9] will be presented and implemented.

5 Our method

Our method is composed of multiple steps.

1. Obtain a distance matrix (described in section 2.1)
2. Use an ACO algorithm to obtain a best path p and a pheromone matrix τ.
3. Use τ to construct the tree.

The ACO algorithm was implemented based on Perretto and Lopes' work [9]. They used a modified version of the traveling salesman problem (TSP). First, the method used for the TSP will be explained, and then the changes will be highlighted.

5.1 ACO applied to the TSP

The ACO algorithm is a meta-heuristic which models the behaviour of ants while searching for food. It has been observed that ants, collectively, can find the shortest distance between a food source and their nest by laying pheromones along the ground. The paths upon which high pheromone levels develop will be preferred by ants, and it is by this mechanism that shorter paths accumulate a larger pheromone level than others.

The TSP is a very famous problem in which a salesman has to visit a number of cities, and wishes to find out how he can do this in the least distance possible. The problem was approached using ACO by Dorigo [4]. Algorithm 1 shows the general outline of the algorithm. There are some important points to note.

Transition probabilities

If a random value q ∈ [0, 1] is less than q0 (a parameter), then the node with the highest transition probability is chosen. If q > q0, then the next node is chosen from the distribution pa. Equation 2 shows the distribution used for each ant to pick the next node.

$$p_a(c, n) = \frac{\tau(c, n) \cdot \eta(c, n)^{\beta}}{\sum_{u \notin V_a} \tau(c, u) \cdot \eta(c, u)^{\beta}} \qquad (2)$$

Equation 2 finds the probability of proceeding to the next node n from the current node c. V_a is the set of nodes which ant a has visited so far. τ(i, j) is the pheromone value on the edge between node i and j. η(i, j) is a heuristic chosen to be d(i, j)^−1, where d(i, j) is the distance between node i and j.

Example 4. At a given node c, pa(c, A) represents the probability of moving to node A, whereas pa(c, B) represents the probability of moving to node B.

Other nodes may already have been visited, and are therefore in the set of nodes the ant has already visited, and will not be assigned a probability.
△
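The node selection rule around equation 2 can be sketched as follows. This is an illustrative sketch, not the C implementation described later: the function names, the nested-dictionary layout of `tau` and `d`, and the use of Python's `random` module are assumptions, while the defaults β = 2 and q0 = 0.9 are the default parameter values listed in section 7.

```python
import random


def transition_probabilities(c, unvisited, tau, d, beta=2.0):
    """Equation 2: probability of moving from node c to each unvisited node."""
    weights = {n: tau[c][n] * (1.0 / d[c][n]) ** beta for n in unvisited}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def pick_next_node(c, unvisited, tau, d, q0=0.9, beta=2.0, rng=random):
    """With probability q0 take the most probable node, otherwise sample from p_a."""
    p = transition_probabilities(c, unvisited, tau, d, beta)
    if rng.random() < q0:
        return max(p, key=p.get)
    nodes = list(p)
    return rng.choices(nodes, weights=[p[n] for n in nodes], k=1)[0]
```

Nodes the ant has already visited are simply left out of `unvisited`, so they are never assigned a probability, as in Example 4.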
Scoring function

The score of a path is the sum of its edges, seen in equation 3, for a given ant a.

$$\mathrm{score}(a) = \sum_{(i,j) \in \mathrm{path}(a)} d(i, j) \qquad (3)$$

Pheromone updating

Another point of interest is the pheromone updating functions. There are two: the local update, in equation 4, and the global update, seen in equation 5:

$$\tau(i, j) = (1 - \rho)\,\tau(i, j) + \rho \cdot \tau_0 \qquad (4)$$

$$\tau(i, j) = (1 - \rho)\,\tau(i, j) + \frac{\rho}{b_A} \qquad (5)$$

ρ is a pheromone evaporation parameter. τ0 is the initial value of the pheromone matrix, defined as (n · Lnn)^−1, where n is the number of nodes, and Lnn is the path length given by the Nearest Neighbor Heuristic¹⁰. b_A is the score of the best path found at the end of the cycle.

5.2 Our modified TSP

There are a few differences between the solution in section 5.1 and the one we applied. The changes can be seen in Algorithm 2. The goal of the modified TSP is to produce a pheromone matrix where the weights are representative of the distances between the species.

Transition probabilities

Transition probabilities are calculated in a similar fashion to equation 2. Equation 6 shows the probability function for the modified TSP.

$$p_a(c, n) = \frac{\tau(c, n)^{\alpha} \cdot d(c, n)^{-\beta}}{\sum_{u \notin V_a} \tau(c, u)^{\alpha} \cdot d(c, u)^{-\beta}} \qquad (6)$$

The only differences are that the distance is explicitly used in the function, whereas previously it was only a suggested heuristic, and the introduction of another scaling factor α. Together, α and β make a trade-off between exploration and exploitation of good edges in the path.

Scoring function

The score of a path is given as the sum of the transition probabilities. The higher the score of the path, the better. Equation 7 shows how to calculate the score.

$$\mathrm{score}_A(a) = \sum_{(i,j) \in \mathrm{path}_A(a)} p_a(i, j) \qquad (7)$$

Pheromone updating

Pheromones are updated differently. Lopes and Perretto do not perform a local update of the pheromone levels, and instead only update globally at the end of each cycle. Equation 8 is used to update the pheromones.

$$\tau(i, j) = \rho \cdot \tau(i, j) + (1 - \rho) \cdot \Delta\tau(i, j) \qquad (8)$$

$$\Delta\tau(i, j) = \sum_{k} \begin{cases} \mathrm{score}_A(k) \cdot \mathrm{score}_{best}^{-1} & \text{if } (i, j) \in \mathrm{path}_A(k) \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

Here score_best is the score of the best path.

Distance recalculation

The most important difference is that the distances between each taxon and the current taxon have to be recalculated at every step. This is because the distance between an ancestor node µ of a taxon t1 and another taxon tn is different from the distance between t1 and tn, i.e. d(t1, tn) ≠ d(µ, tn). Distances are recalculated according to the following rule:

$$d(\mu, n) = d_{\mu n}(i, j) = \begin{cases} d(i, n) + [d(i, n) - d(j, n)] \cdot \delta & \text{if } d(j, n) > d(i, n) \\ d(j, n) + [d(j, n) - d(i, n)] \cdot \delta & \text{if } d(j, n) < d(i, n) \end{cases} \qquad (10)$$

These calculations are similar to those used in the Fitch-Margoliash algorithm (see section 4.1).

We are interested in the distance between the ancestral node µ and the next node n we pick, d(µ, n). This can be calculated using equation 10, where i and j represent the descendants of µ. δ is a parameter between 0 and 1, which defines the closeness of the ancestor node to its children. The larger δ is, the closer the ancestor will be.

The justification for distance recalculation is given as an example:

Example 5. In figure 2(a), the species Bos primigenius and Delphinapterus leucas are grouped to form an ancestor node. The distances between this new species and the others (Macaca fascicularis, etc.) are obviously different from those of the child nodes. This is the justification for the distance recalculation.
△

10 NN Heuristic: at each node, pick the closest neighbor to be your next.
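The recalculation rule of equation 10 can be sketched in a few lines. The function names and the dictionary-based matrix layout are assumptions made for illustration, and the tie case d(i, n) = d(j, n), which equation 10 does not cover, is handled here by returning the common value.

```python
def recalc_distance(d_in: float, d_jn: float, delta: float = 0.5) -> float:
    """Equation 10: distance from the ancestor µ of i and j to a node n."""
    if d_jn > d_in:
        return d_in + (d_in - d_jn) * delta
    if d_jn < d_in:
        return d_jn + (d_jn - d_in) * delta
    return d_in  # tie: not covered by equation 10, assumed here


def ancestor_distances(d, i, j, others, delta=0.5):
    """Distances from the new ancestor of i and j to every node in `others`."""
    return {n: recalc_distance(d[i][n], d[j][n], delta) for n in others}
```

For instance, with δ = 0.5, d(i, n) = 2 and d(j, n) = 3, the rule gives 2 + (2 − 3) · 0.5 = 1.5.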
Algorithm 1 ACO algorithm to solve TSP
Input:
◦ Complete weighted graph G(V, E, d), V = {vertices}, E = {edges}, d : e → R, e ∈ E
◦ ρ: Pheromone evaporation factor
Output: p : N → V : Best path
τ(i, j) = τ0 = (n · Lnn)^−1, ∀i, j                          # Pheromone matrix
p = ∅                                                        # Path is empty
A = {k ants}                                                 # Set of ants
pathA : a → (N → V²)                                         # Set of paths created by each ant
scoreA : a → R                                               # Set of path scores for each ant
repeat
    for all a ∈ A do
        pathA(a) = construct_path(G, τ)
        scoreA(a) = score(pathA(a))
        τ(i, j) = (1 − ρ)τ(i, j) + ρ · τ0, ∀(i, j) ∈ pathA(a)   # Local pheromone update
    end for
    bA = min(scoreA)                                         # Best tour length
    τ(i, j) = (1 − ρ)τ(i, j) + ρ / bA, ∀i, j                 # Global pheromone update
until all cycles completed
return p = pathA(a), s.t. scoreA(a) = min(scoreA)
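The initial pheromone value τ0 = (n · Lnn)^−1 used in Algorithm 1 depends on the Nearest Neighbor heuristic. A minimal sketch follows; the function names, the list-of-lists matrix layout, the choice of start node and the option to close the tour are assumptions for illustration.

```python
def nearest_neighbor_length(d, start=0, closed=True):
    """Length of the tour built by always moving to the closest unvisited node."""
    n = len(d)
    unvisited = [v for v in range(n) if v != start]
    current, length = start, 0.0
    while unvisited:
        nxt = min(unvisited, key=lambda v: d[current][v])
        length += d[current][nxt]
        unvisited.remove(nxt)
        current = nxt
    if closed:
        length += d[current][start]  # return to the start, as in the classic TSP
    return length


def initial_pheromone(d, start=0, closed=True):
    """Pheromone matrix filled with tau_0 = 1 / (n * L_nn)."""
    n = len(d)
    tau0 = 1.0 / (n * nearest_neighbor_length(d, start, closed))
    return [[tau0] * n for _ in range(n)]
```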
There follows an example of the distance recalculation:

Example 6. Imagine three taxa with the distance matrix:

d    A    B    C
A    0    1    2
B    1    0    3
C    2    3    0

If we go from A to B, we get a new, inferred taxon µ1, and we perform the following calculations:

δ = 0.5
d(µ1, C) = dµ1C(A, B)
         = 2 + (2 − 3) · δ
         = 1.5
d(µ1, A) = d(A, C) − dµ1C(A, B)
         = 0.5
d(µ1, B) = d(B, C) − dµ1C(A, B)
         = 1.5

The new distance matrix will look like this:

d     A     B     C     µ1
A     0     1     2     0.5
B     1     0     3     1.5
C     2     3     0     1.5
µ1    0.5   1.5   1.5   0
△

Figure 3 outlines the process of creating the path while changing the distances.

6 Constructing the tree

After the ACO algorithm has finished, a pheromone matrix whose values are representative of the distances between the nodes (including the changes made by updating the distances) is returned. Using this matrix, we can construct a tree with algorithm 3.

Algorithm 3 differs slightly from the one presented in [9], but only to enhance the understanding of the algorithm.

The algorithm repeatedly groups subtrees under a new, inferred ancestor node until all taxa have been grouped.

An example tree construction follows:

Example 7. Constructing a tree using this pheromone matrix:

τ    A    B    C
A    0    1    3
B    1    0    2
C    3    2    0
Algorithm 2 ACO for the modified TSP
Input:
◦ Complete weighted graph G(V, E, d), V = {vertices}, E = {edges}, d : e → R, e ∈ E
◦ ρ: Pheromone evaporation factor
Output:
◦ p : N → V : Best path
◦ τ : Pheromone matrix
τ(i, j) = τ0 = (n · Lnn)^−1, ∀i, j                          # Pheromone matrix
p = ∅                                                        # Path is empty
A = {k ants}                                                 # Set of ants
pathA : a → (N → V)                                          # Set of paths created by each ant
scoreA : a → R                                               # Set of path scores for each ant
repeat
    for all a ∈ A do
        for i = 1 to |V| do
            pathA(a)(i) = pick_next_node()
            recalculate_distances()
        end for
        scoreA(a) = score(pathA(a))
    end for
    scorebest = min(scoreA)                                  # Best score
    ∆τ(i, j) = Σ_k { scoreA(k) · scorebest^−1 if (i, j) ∈ pathA(k); 0 otherwise }   # Pheromone update
    τ(i, j) = ρ · τ(i, j) + (1 − ρ) · ∆τ(i, j), ∀i, j
until all cycles completed
return τ and p = pathA(a), s.t. scoreA(a) = scorebest
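The global pheromone update at the end of each cycle of Algorithm 2 (equations 8 and 9) can be sketched as follows. The representation is an assumption made for illustration: each path is a list of node indices whose consecutive pairs are taken as its edges, the pheromone matrix is treated as symmetric, and `score_best` is passed in by the caller rather than computed here.

```python
def global_update(tau, paths, scores, score_best, rho=0.9):
    """Equations 8 and 9: evaporate, then deposit along every ant's path."""
    n = len(tau)
    delta = [[0.0] * n for _ in range(n)]
    for path, score in zip(paths, scores):
        for i, j in zip(path, path[1:]):
            deposit = score / score_best
            delta[i][j] += deposit
            delta[j][i] += deposit  # symmetric matrix (an assumption)
    return [
        [rho * tau[i][j] + (1.0 - rho) * delta[i][j] for j in range(n)]
        for i in range(n)
    ]
```

Edges that no ant used receive no deposit, so their pheromone simply decays by the factor ρ.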

Algorithm 3 Algorithm to construct the phylogenetic tree
Input:
◦ τ : pheromone matrix
◦ T : set of taxa
Output: t: rooted binary tree
τsorted(n) = (i, j)                  # Sort fields in pheromone matrix in decreasing order
grouped = 0                          # The number of groupings performed so far
iterator = 0
while grouped < (|T| − 1) do
    (i, j) = τsorted(iterator)
    iterator = iterator + 1
    ia = oldest_ancestor(i)          # If s has no ancestor, oldest_ancestor(s) = s
    ja = oldest_ancestor(j)
    if ia == ja then                 # i.e. if i and j have the same ancestor (previously grouped)
        continue
    end if
    group(ia, ja)                    # Group the two subtrees to form a new, inferred ancestor
    grouped = grouped + 1
end while
return oldest_ancestor(s), s ∈ T     # Return the oldest ancestor in the set of taxa
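Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch rather than the C implementation described in section 7: the nested-tuple tree representation and the function names are assumptions. The Newick string is included because Newick is one of the output formats mentioned in section 7.

```python
def build_tree(tau, taxa):
    """Group taxa by decreasing pheromone value, as in Algorithm 3."""
    n = len(taxa)
    cluster = list(range(n))                   # cluster id of each taxon
    subtree = {i: taxa[i] for i in range(n)}   # cluster id -> nested tuple
    pairs = sorted(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: tau[p[0]][p[1]],
        reverse=True,
    )
    for i, j in pairs:
        ci, cj = cluster[i], cluster[j]
        if ci == cj:                           # already share an oldest ancestor
            continue
        subtree[ci] = (subtree[ci], subtree.pop(cj))   # new inferred ancestor
        cluster = [ci if c == cj else c for c in cluster]
        if len(subtree) == 1:
            break
    (root,) = subtree.values()
    return root


def to_newick(tree):
    """Serialize the nested-tuple tree as a Newick string (topology only)."""
    def fmt(t):
        if isinstance(t, tuple):
            return "(" + ",".join(fmt(c) for c in t) + ")"
        return str(t)
    return fmt(tree) + ";"
```

Applied to the pheromone matrix of Example 7, A and C are grouped first, and B is then grouped with their ancestor.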
Figure 3: Recalculating distances. 3(a): The initial graph G, with A picked as the starting node. Node B is chosen as the next node, and an ancestor node µ1 is created. 3(b): Distances from µ1 to C and D are calculated. C is chosen as the next node, and a new ancestor µ2 is created. 3(c): Distances from µ2 to D are calculated. D is chosen as the last node. The final path is ABCD.

τsorted(0) = (A, C)
τsorted(1) = (B, C)
τsorted(2) = (A, B)

As a result, we first group A and C, and finally B and C (where C is replaced by its ancestor, the node grouping A and C).
△

7 Implementation

Together with Matteo Brunati¹¹ and Alberto Testolin¹², Thies Gehrmann implemented an algorithm mostly similar to the Lopes and Perretto method. The implementation closely follows the algorithms presented previously.

11 Matteo Brunati, MSc student
12 Alberto Testolin, MSc student

There were several components of the implementation, not all in the same language. Here, a brief overview of them is given, with a mention of authorship, in order of contribution.

1. Distance calculation (written in shell script)
2. Program initialization (written in C)
3. ACO algorithm (written in C)
4. Tree construction (written in C)

The main part of the program, composed of components 2, 3 and 4, is constructed from 3 main modules:

init → solve → tree

init is responsible for loading the distance matrix and parameters, and for initializing memory. solve produces a usable pheromone matrix. tree computes the tree from a pheromone matrix, and can provide output in a variety of different formats.

Distance calculation

Perretto and Lopes [9] used a distance matrix calculated by Li et al. [8]. This file was no longer available on the internet, so the matrix had to be re-calculated.

Initially, distances were calculated using clustalw and dnadist/protdist, aided by online tools¹³. But these distances were incorrect: the algorithm arrived at strange, unexpected results. Therefore an attempt was made to recalculate the data made by Li et al.

Li et al. used a program called GenCompress, which compresses a DNA sequence into a smaller space. The size of that space was taken to be the Kolmogorov complexity of the sequence. A program was written to automate the calculation of distances between species using raw FASTA sequence data taken from the NCBI Nucleotide database, or converted amino acid sequences taken from the NCBI protein database. This program was written in shell script.

Since the calculations can take a while, the distance matrix is precomputed before the ACO-based mechanism is employed.

This component was written by Thies Gehrmann.

Initialization

This module is a relatively small component, dealing mainly with initializing parameters, reading command line arguments, and loading the distance matrix. It assumes a correct distance matrix is presented; if not, all values are filled with zero distances.

To set the default pheromone matrix values, the Nearest Neighbor heuristic distance is calculated.

There are several parameters which can be set (default values in parentheses):

◦ α (1) See equation 6.
◦ β (2) See equation 6.
◦ δ (0.5) See equation 10.
◦ ρ (0.9) See equation 8.
◦ Number of ants (500).
◦ Number of cycles (50).
◦ Favor for best node, q0 (0.9). See equation 2.

This component was written by Thies Gehrmann.

ACO Algorithm

The algorithm was implemented as shown in algorithm 2.

This module is passed a distance matrix, a pheromone matrix, and several other parameters; it returns the final pheromone matrix to the tree module.

This component was written by Thies Gehrmann and Alberto Testolin.

Tree construction

The tree is produced from the pheromone matrix, as in algorithm 3. This component can output in two different formats, Graphviz DOT format¹⁴ and Newick tree format¹⁵.

This component was written by Thies Gehrmann.

8 Experiments

Two datasets were constructed:

• A small dataset consisting of the p53 protein coding genes from 8 different species. These species were chosen for no particular reason, other than that they came up first in a search of the NCBI protein database¹⁶.
• The mtDNA dataset, described by Cao et al. in [3], and explained previously in section 2.3.

The field of phylogenetic tree construction is quite inexact. The most any method can really try to do is to determine which species are closer to one another, or to state which groups "belong to each other" more than other groups. The groupings are therefore what interests us. When examining the trees, note that the ones produced by Neighbor Joining are unrooted, whereas those produced by our algorithm are rooted.

13 Mobyle@pasteur: http://mobyle.pasteur.fr
14 Graphviz: http://www.graphviz.org/
15 Newick tree format: http://evolution.genetics.washington.edu/phylip/newicktree.html
16 NCBI Protein database: http://www.ncbi.nlm.nih.gov/protein/
8.1 p53 dataset

Figure 2(a) was produced by our ACO algorithm and figure 2(b) was produced by the traditional Neighbor Joining algorithm. We can see an identical grouping, indicating that our method has performed as well as the Neighbor Joining method. For the Neighbor Joining algorithm, a different distance matrix (produced from the same initial data) produced by protdist¹⁷ was used.

8.2 mtDNA dataset

Examine figure 4(a) (produced by our algorithm), and figure 4(b) (produced by Neighbor Joining). Again, similarities in grouping can be seen. Observe, for example, the grouping of the branches containing Didelphis virginiana and Mus musculus, which are identical in each tree.

There are some discrepancies between the trees, such as between the Homo sapiens branches, but these are probably due to slight differences in the calculation of the distances. Since the original data from Cao et al. cannot be found, this is the best we could do.

Figure 4: 4(a) ACO-produced tree for mtDNA dataset. 4(b) Neighbor Joining tree for mtDNA dataset.

9 Results

As we saw in section 8, the method produces trees with groupings similar to those of traditional methods. Lopes and Perretto also reported some time-growth characteristics which suggest that their ACO solution gives better performance.

Lopes and Perretto have worked for years on their implementation, since their first paper in 2005 [11]. Using our implementation to compare against the traditional methods would therefore not give an accurate indication of the time-growth characteristics of the algorithm. Lopes and Perretto summed up their presentation with the plots shown in figure 5.

Figure 5(a) shows that they measured fairly low execution times for even up to almost 500 species, which is remarkable. Figure 5(b) highlights the perceived increased performance of the algorithm over traditional methods, such as Fitch-Margoliash (see section 4.1) and Neighbor Joining (see section 4.2).

The performance increase arises because the algorithm is not dependent on the number of taxa, but rather on the number of ants and iterations. In our method, clustering of the tree is determined by the pheromone matrix, whereas in traditional methods, the clustering is done by distances between the taxa. Because of this, traditional methods will be more computationally expensive, especially for larger numbers of taxa.

Figure 5: Growth characteristics reported by Lopes and Perretto in [9]. Figure 5(a): Growth of the time needed to complete the run for up to almost 500 species. Figure 5(b): Comparing the execution time between the proposed method and traditional methods.

10 Conclusion

In this paper, a method using ACO to produce phylogenetic trees, as described by Perretto and Lopes, was presented. It was described in detail how the method works. Furthermore, the method was implemented and tested on two entirely different datasets.

It was found that the method performed acceptably compared to the Neighbor Joining algorithm. Since it can offer considerable time performance improvements, the method may be a viable option for problems with very large numbers of taxa.

17 Protdist: http://cmgm.stanford.edu/phylip/protdist.html

References

[1] Shin Ando and Hitoshi Iba. Ant algorithm for construction of evolutionary tree. In Proceedings of GECCO 02. IEEE, 2002.
[2] Hans J. Böckenhauer and Dirk Bongartz. Algorithmic Aspects of Bioinformatics (Natural Computing Series). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
[3] Y. Cao, A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. Journal of Molecular Evolution, 47(3):307–322, September 1998.
[4] M. Dorigo. Ant colonies for the travelling salesman problem. Biosystems, 43(2):73–81, July 1997.
[5] W. M. Fitch and E. Margoliash. Construction of Phylogenetic Trees. Science, 155(3760):279–284, January 1967.
[6] Olivier Gascuel and Mike Steel. Neighbor-Joining Revealed. Molecular Biology and Evolution, 23(11):1997–2000, November 2006.
[7] Chantal Korostensky and Gaston H. Gonnet. Using traveling salesman problem algorithms for evolutionary tree construction. Bioinformatics, 16(7):619–627, July 2000.
[8] Ming Li, Jonathan H. Badger, Xin Chen, Sam Kwong, Paul Kearney, and Haoyong Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149–154, February 2001.
[9] Heitor S. Lopes and Mauricio Perretto. An Ant Colony system for large-scale phylogenetic tree reconstruction. Journal of Intelligent and Fuzzy Systems, 18(6):575–583, January 2007.
[10] Hui-Ying Lv, Wen-Gang Zhou, and Chun-Guang Zhou. A discrete particle swarm optimization algorithm for phylogenetic tree reconstruction. pages 2650–2654, 2004.
[11] Mauricio Perretto and Heitor Silvério Lopes. Reconstruction of phylogenetic trees using the ant colony optimization paradigm. Genetics and Molecular Research: GMR, 4(3):581–589, 2005.
[12] Jeffrey Rizzo and Eric C. Rouchka. Review of Phylogenetic Tree Construction. Technical report, University of Louisville, November 2007.
[13] Monroe W. Strickberger. Evolution. Jones and Bartlett, 2000.
[14] Karla Vittori, Alexandre C. B. Delbem, and Sergio L. Pereira. Ant-Based Phylogenetic Reconstruction (ABPR): A new distance algorithm for phylogenetic estimation based on ant colony optimization. Genetics and Molecular Biology, 31(4), December 2008.