Ant Algorithm for Construction of Evolutionary Tree

Ant Algorithm for Construction of Evolutionary
Tree
Shin Ando
School of Engineering,
University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
Abstract - This paper proposes an Ant algorithm
implemented with stack count for the purpose of designing suffix representations of tree structures. The
algorithm is applied to the problem of reconstructing
an evolutionary tree from nucleotide sequences. The
algorithm shows compatible results in simulated experiment and alignment of protein sequences from 15
species against the existing algorithms.
I. Introduction
In this paper, we present a new ant algorithm to
construct similarity tree structures from given set of
objects and their mutual distance matrix. The method
is based on the extension of ant colony optimization for
traveling salesman problem [2].
The construction of evolutionary tree has become one of
the major problems in computational biology since DNA
and protein sequences have become easier to obtain. The
objective in evolutionary tree construction is optimizing
the tree score, which is calculated from the sequences
at the leaves and the structure of the tree. Better score
means higher probability of the tree to be the indication
of actual evolutionary history for the nucleotide or DNA
sequence.
Since the number of tree topologies grows exponentially
with the number of nodes, finding the optimal topology
is generally an intractable problem. The exact construction of evolutionary trees are known to be NP-complete.
Our algorithm searches for a tree structure, which
minimizes the score for a given set of DNA sequences.
It uses the mutual distance matrix of the leaves as the
input.
While the ant colony optimization for traveling salesman
problem visits the respective cities once to construct
a round trip, ants in tree constructing algorithm visit
leaves and vertices of the tree to construct a suffix
representation of the bifurcating tree. When the ants
connect two leaves by visiting a vertex, it is indicated
by the +, a suffix to connect the leaf.
Our method competes with conventional method of the
0-7803-7282-4/02/$10.00 ©2002 IEEE
Hitoshi Iba
Grad. School of Frontier Science,
University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
exhaustive search or the sequential insertion method,
taken by the most popular methods. In the simulated
experiment, the algorithms showed compatible results
against existing software. Our method can also be
used for constructing cluster trees for any object where
the mutual distance matrix of the trees are given and
the objective is to minimize the size of the tree edges.
Bioinformatics tasks such as clustering DNA microarray
expression patterns and tissue classifications also require
the construction of bifurcating trees.
II. CONSTRUCTING EVOLUTIONARY
TREE
Reconstructing the evolutionary history of the proteins
and nucleic acids is one of the very important analyses
of genome sequence data. It is a fundamental aspect for
understanding the structure, functionality, and the evolution of the proteins and DNA. There has been many
previous works on the subject[6][5].
An evolutionary tree is an unrooted binary tree such as
Figure1, which is composed of leaves (T x1 − 4) of degree
1, or vertex (u, v, w) of degree 3, and the edges which
connect them. An evolutionary tree with n leaves has
n − 2 vertex and 2n − 3 edges. The number of trees
n
Q
with n leaves is
2i − 5 [13]. Exploration of tree space
l=3
by exhaustive search is computationally very expensive.
Figure1 shows the example of DNA sequences and evolutionary tree structure. In evolutionary tree, each leaf
indicates an evolutionary object, a sequence of a protein
or DNA. The tree structure represents the familial relationships or the evolutionary history of the objects. The
lengths of the edges correspond to how similar or far the
sequences are.
The objective is to find a tree structure that minimizes
the score function. The total length of the edges can
signify the score for the evolutionary tree. In the
maximum likelihood method, the shorter edge lengths
indicate higher likeliness for the transition of ancestral
of the edges that maximize the likelihood. This method
is used to calculate the distance matrix between the
sequences in later simulation.
IV. SCORING THE TREE
The total length of the edges indicate the evaluation
of the tree, since the shorter edge lengths indicate higher
likeliness for the transition of ancestral sequence to
present sequence in the maximum likelihood method.
Total edge length of a tree can be obtained from circular
tour length, C(S), along the tree edges. The detailed
description of C(S) as the tree scoring is given in [8].
Tx0:
Tx1:
Tx2:
Tx3:
Tx4:
TAAGTTTACATGCGATTTCT
CTGGTCCACACTTCTAAATA
GAGCAAGGACTCGACTTCGT
AATATAGTTGCCTGACGTCC
TACAAATTGATTGGGCCGCG
Fig. 1. An evolutionary tree
sequence to present sequence.
The construction of optimal evolutionary trees is a very
challenging problem, since most versions of the problem
are NP-complete [7][9]. Even though the problem has
been studied extensively, evolutionary tree construction
still remains an open problem.
We consider only the unrooted trees in this paper,
though the method is completely compatible with
rooted tree as well.
III. Definition of the Matrix
There have been several proposed methodologies for
determining the tree structure. The details of these
methods can be found in [4].
The parsimony method usually counts the number of
amino acid or nucleotide substitutions in a weighted
or unweighted manner. They take a multiple sequence
alignment (MSA) as an input and take the number of
replacement, which could explain the corresponding
evolutionary tree as the score of that tree.
The maximum likelihood approach scores the tree by the
probability that the observed data would have occurred
from the tree. The conventional strategy is to search
over all trees and for each topology T, find the length
0-7803-7282-4/02/$10.00 ©2002 IEEE
Fig. 2. The circular tour along the tree edges
Equation(1) gives the C(S) for Figure2, which is the
sum of the trail indicated by broken arrows.
C(S) = d0w + d1u + d2u + d3v + d4w + duv + dvw
(1)
Since the lengths of the edges represent the distances
between the leaves, C(S) can be calculated from distance
matrix [δij ] (equation 2).
C (S) = (δ01 + δ12 + δ23 + δ34 + δ40 ) × 2
(2)
A correct circular tour establishes the upper bound for
the best score that can possibly be achieved by a given
set of protein or DNA sequences. Both the bound and
the circular tour can be derived without an explicit
knowledge of the correct evolutionary tree. The circular
tour length is equal for all isomorphic trees.
In our method, C(S) is used for scoring the tree, thus
a distance matrix that shows the distance between
every pair of sequence is required to calculate the
score. Further, our method can be used for any scoring
function that correlates to the amount of changes along
the branches of an evolutionary tree.
V. The Extension of Ant Algorithm for Tree
Design
In this section, we will describe the implementation of
our extended Ant algorithm.
A. Basic Ant Algorithms
The ant colony optimization metaheuristic was inspired
by studied behaviors of the ants. Ants communicate
among themselves through pheromone, a substance they
deposit on the ground as they move about. It has been
observed that the ants are attracted to pheromones deposited by the other ants, creating a mechanism similar to reinforcement learning system by concentration of
stream of ants to a particular path. [1][3][2] offer detailed
information on the works of the algorithm and the choice
of the various parameters.
In the standard algorithm of ant algorithm for traveling salesman problem, an ant at city i chooses the next
city j on probability pkij (t) for pheromone τij (t) and inversed distance ηij (3). The pheromone tables are updated based on equation4 and equation5 for evaporation
constant ρ.
pkij
α
(t) = [τij (t)] [ηij ]
X
α
h∈Jik
[τij (t)] [ηij ]
τij (t + 1) = (1 − ρ) · τij (t) + ∆τij (t)
∆τij (t) =
n
X
Q (k)
T x0 T x1 T x2 + T x3 + +T x4+
(6)
To make the suffix representation to actually form a tree,
certain requirements have to be met. This constraint can
be easily explained with the idea of the Stack Count [10].
We assign leaves, the terminals, with the Stack Value of
+1 and assign non-terminal vertices with -1. The Stack
Count of a tree is the sum of the Stack Value of all its
vertices. If the suffix representation of the tree structure
is valid, Stack Count should be 1. Further, any subtree
of that tree would have +1 as its Stack Count. Thus,
while adding up the vertices of a valid representation in
sequential order, Stack Count should never subceed 1.
(Figure3)
Fig. 3. Stack Count of the Tree from Figure1
While the ants are making the round trip, they hold
the history of leaves it has visited, so as not to visit a leaf
more than once. At any point during the trip, the Stack
Count for the visited leaves and non-terminal vertices
cannot subceed 1. This Stack Count constraint prohibits
the ant from visiting the non-terminal vertices, until the
Stack Count has accumulated up to 2 or more.
(3)
C. Choice of the Vertex
(4)
(5)
k=1
Each of the N ants will make the round trip times
before the iteration is complete.
B. Suffix Representation
The rules in previous section are inherited in our new
algorithm as well. In addition, to represent the tree in
suffix representation, an ant must visit N leaves once and
the vertices N-1 times in between the cities while making
the round trip. The vertices are not specified (as u, v, w),
when the ants are making the rounds, and all indicated
as + in the representation, since the tree structure is
indecisive until the rounds are complete. An evolutionary
0-7803-7282-4/02/$10.00 ©2002 IEEE
tree shown in Figure1 is represented as equation?? in
suffix representation.
The Stack Count constrains the ant to not to go to vertices at inappropriate timing. The following rule determines the probability that the ant will choose to go to
the vertex, thus connecting the branches to form larger
ones. For sequential order of leaves S = {T x0, , , T x3} ,
which appears in the order of the index, we define ∆(S)
by eq. 7.
∆ (S) = δ01 + δ23 − δ02 − δ13
(7)
In the evolutionary tree, the edges indicate the distances
between the leaves. Thus, as in Figure4, if Tx1 and Tx2
are connected by a vertex, then ∆(S) = 0. When inner
two leave are not connected by a vertex, for example
∆ (S 0 ) = (S 0 = {T x1, , , T x4}), ∆ (S) > 0 as in Figure5.
When the ant, currently at leaf i, chose j as the next
visiting leaf according to the equation, and has the latest
history of visiting g, h (Figure6), calculate ∆(S) for {S =
g, h, i, j} .
Fig. 4. ∆ (S) = δ01 + δ23 − δ02 − δ13 = 0
Fig. 7.
∆ (S) = δ01 + δ14 − δ03 − δ14
{S = T x0, subtree(T x1, T x2), T x3, T x4}
=
0
If the ∆(S) is small, the ant is highly probable to go
to the vertex before going to next leaf j. The probability
P is determined using the largest distance D of in the
distance matrix and k, an empirical parameter.
P = k∆(S)/2D(2D > ∆(S))
Fig. 5. ∆ (S) = δ12 + δ34 − δ13 − δ24 = 2 × δuv > 0
(8)
When the history of the ant includes a vertex,
the calculation of ∆(S) is slightly modified. As in
Figure7, when a vertex connects a leaf (T x3) and a
subtreeSt(T x1, T x2), ∆(S) for S = T x0, St, T x3, T x4 is
equal to 0. ∆(S) is calculated by representing the subtree
by one of the consisting leaf. This enables the calculation
of ∆(S) for all S.
Note that a leaf is a form of a subtree. The revised
algorithm for deciding whether to go to a vertex, would
be as follows: An ant at a vertex in the current subtree
I, which chose j as the next leaf to visit, and whose two
previously visited subtree, G and H, will go to a vertex
before going to j by probability of P = k∆(S)/2D(2D >
∆(S)) for {S = G, H, I, j} . The visited vertex can be
identified by subsumption of the Stack Count (Figure 8).
VI. Simulated Experiment
Fig. 6. History of visited leaves and choosing the vertex
0-7803-7282-4/02/$10.00 ©2002 IEEE
The simulation is done as follows: Based on a specified evolutionary tree structure, random sequences of nucleotide are generated. We used Seq-Gen, a sequence generator software. It simulates the evolution of nucleotide
sequences along a phylogeny, using common models of
the substitution process. Thus we know the correct tree
for the set of sequences in each simulation.
Fig. 8. Subsumption of the Stack Count
ρ = 0.1
K=0.2
α = 1.2
N=16
M = 1000
TABLE I
Ant Algorithm Parameters
Fig. 10. 16-leaf target tree
Then, distance matrix between all the sequences is computed by dnadist, of the Phylip program. It computes
a distance measure for DNA sequences, using maximum
likelihood estimates based on the Dayhoff PAM matrix.
The input of sequence and distance matrix were also fed
to FITCH and Neighbor algorithm of the PHYLIP programs[11][12] for comparison.
The simulation was executed under different evolutionary tree parameter settings, which include maximum distance in the matrix , sequence length, and number of
leaves. The structures of the trees for 8-leaf and 16-leaf
simulation are shown in Figure9 and 10.
The parameters for ant algorithms are shown in TableI,
but they have not been proven to be the best parameters.
Results of the simulation are summarized in Table II
and Table III. In the first experiment for a Figure9 tree,
the ant algorithm gave the best results in general. For
16 leaves, FITCH program gave the best result when the
D
-3
3-6
6-
FITCH
85%
60%
35%
Neighbor
78%
33%
12%
Ant
82%
78%
64%
TABLE II
8-leaf simulation result
maximum distance was small, and otherwise the ant algorithm gave the best. More simulation should be done on
larger trees to see which particular method is suited for
specific parameter settings. We draw conclusion that this
ant algorithms is useful and produces good results compatible to popularly used tree construction algorithms.
VII. Multiple Alignment Examples
We have also tested the method on an example protein family of the Cytochrome P450 CYP050A. The
sequences of 15 species (Table IV) were taken from
Cytochrome P450 Database (http://cpd.ibmh.msk.su/).
The sequence lengths differ from 414 to 561, as indicated
in Table IV. The multiple sequence alignment is obtained
by ClustalW software[14]. The distance matrix is calculated by protdist of Phylip programs. Again, the comD
-3
3-6
6-
Fig. 9. 8-leaf target tree
0-7803-7282-4/02/$10.00 ©2002 IEEE
FITCH
85%
60%
35%
Neighbor
78%
33%
12%
Ant
82%
78%
64%
TABLE III
16-leaf simulation result
arabidopsis thaliana
penicillium italicum
candida albicans
rat
candida glabrata
baker yeast
tomato
schizosaccharomyces pombe
human
wheat
issatchenkia orientalis
smut fungus
mycobacterium tuberculosis
grape powdery mildew fungus
rice
489
515
528
503
533
530
485
495
509
453
414
561
451
524
427
TABLE IV
15 species and the length of their CYP sequences
Fig. 11. Cytochrome P450 tree constructed with new Ant
Algorithms
parison is made to FITCH and Neighbor programs, and
also tree drawing algorithms of ClustalW. All the algorithms agreed on the tree topology, shown in Fig. 11.
This result represents a very feasible structure, and the
Ant algorithms gave the best score of 4023.8.
VIII. Discussion
We have introduced a new tree construction method
that constructs the tree with the minimum score for
a given set of sequences. By using the ants, we have
introduced a metaheuristic search to a NP problem.
By suffix representation and vertex choosing scheme,
we have extended the possibilities of the ant colony
0-7803-7282-4/02/$10.00 ©2002 IEEE
optimization application.
The method was compatible with existing software in
the presented experiments. We will further test the
algorithm for the scalability and accuracy by applying it
to real MSAs. We should also discuss the noise factor in
our further work.
References
[1] Colorni, A., M. Dorigo, et al. (). Distributed optimization by
ant-colonies. European Conference on Artificial Life (ECAL91).
[2] Dorigo, M. and G. D. Caro (1999). ”The Ant Colony Optimization Meta-Heuristic.” New Ideas in Optimization.
[3] Dorigo, M. and L. M. Gambardella (1997). ”Ant colonies for
the travelingsalesman problem.” Biosystems 43: 73-81.
[4] Durbin, R., S. Eddy, et al. (1998). Biological Sequence Analysis:
Probabilistic Models of Protein and Nucleic Acids, Cambridge
University Press.
[5] Felsenstein, J. (1996). ”Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods.” Methods Enzymol 266: 418-27.
[6] Fitch, W. M. and E. Margoliash (1967). ”Construction of phylogenetic trees.” Science 155(760): 279-84.
[7] Foulds, L. R. and R. L. Graham (1982). ”The steiner problem in
phylogen is np-complete.” Proc. Natl. Academ Science 3: 43-49.
[8] Gonnet, G., C. Korostensky, et al. (1999). ”Evaluation measures of multiple sequence alignments.” Journal of Computational Biochemistry.
[9] Jiang, T. and L. Wang (1994). ”On the complexity of multiple
sequence alignment.” J. Comp. Biol. 1: 337.
[10] Keith, M. J. and M. C. Martin (1997). ”Genetic Programming in C++: Implementation Issues.” Advances in Genetic
Programming.
[11] Lim, A. and L. Zhang (1999). ”WebPHYLIP: a web interface
to PHYLIP.” Bioinformatics 15(12): 1068-9. A web interface to
PHYLIP (version 3.57 C) is implemented using CGI/Perl programming. It enables users to do phylogenetic analysis through
the Internet.
[12] Retief, J. D. (2000). ”Phylogenetic analysis using PHYLIP.”
Methods Mol Biol 132: 243-58.
[13] Rohlf, F. J. (1983). ”Numbering binary trees with labeled terminal vertices.” Bull Math Biol 45(1): 33-40.
[14] Thompson, J. D., D. G. Higgins, et al. (1994). ”CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties
and weight matrix choice.” Nucleic Acids Res 22(22): 4673-80.