Analysis and Predictions of DNA Sequence Transformations on Grids

A Thesis
Submitted for the Degree of
Master of Science (Engineering)
in the Faculty of Engineering
By
Yadnyesh R. Joshi
Supercomputer Education and Research Centre
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA
August 2007
Acknowledgments
First of all I would like to extend my sincere thanks to my research supervisor Dr. Sathish
Vadhiyar for his constant guidance and support during the entire period of my post-graduation
at IISc. He was always approachable, supportive and ready to help in any sort of problem. I am
very thankful to him for being extremely patient and understanding about the silly mistakes that
I had made. Under his guidance I learned to approach problems in an organized manner and
set realistic goals for my research. I thank him for his extreme patience and excellent technical
guidance in writing and presenting research. Finally, he was and continues to be my role model
for his hard work and passion for research. I am also thankful to Dr. Nagasuma Chandra, Dr.
Debnath Pal from S.E.R.C. and Dr. Narendra Dixit from Chemical Engineering department for
their very useful and interesting insights into the biological domain of our research. I am also
thankful to all the faculty of S.E.R.C. for always inspiring us with their motivational talks.
I would like to mention the names of my colleagues Sandip, Sanjay, Rakhi, Sundari, Antoine
and Roshan for their technical and emotional support. Special thanks to the vatyaa kya
group members for the adventures and the routines inside and outside the institute. I would also
like to thank the Marathi Mandal for making the institute a homely place.
Back home, I would like to thank my parents for being my pillars of strength. I would also
like to thank Yamini tai and Dhanashree, my sisters for supporting and guiding me to make
important decisions. I would like to thank my friends, Vijay, Vishwanath, Pushkaraj, Prashant,
Akshay, Sunder, Anusha and Neha for always being there for me. Last but never the least, I am
very thankful to my grandfather Laxman Rao for being the strongest motivator all the time.
Abstract
Phylogenetics is the study of evolution of organisms. Evolution occurs due to mutations of DNA
sequences. The reasons behind these seemingly random mutations are largely unknown. There
are many algorithms that build phylogenetic trees from DNA sequences. However, there are
certain uncertainties associated with these phylogenetic trees. Fine level analysis of these phylogenetic trees is both important and interesting for evolutionary biologists. In this thesis, we
try to model evolutions of DNA sequences using Cellular Automata and resolve the uncertainties associated with the phylogenetic trees. In particular, we determine the effect of neighboring
DNA base-pairs on the mutation of a base-pair. Cellular Automata can be viewed as an array
of cells which modifies itself in discrete time-steps according to a governing rule. The state of
the cell at the next time-step depends on its current state and state of its neighbors. We have
used cellular automata rules for analysis and predictions of DNA sequence transformations on
computational grids.
In the first part of the thesis, DNA sequence evolution is modeled as a cellular automaton
with each cell having one of four possible states, corresponding to the four bases. Phylogenetic
trees are explored in order to find the cellular automata rules that may have guided the evolutions. A master-worker paradigm is used to exploit the parallelism in the sequence transformation
analysis. Load balancing and fault-tolerance techniques are developed to enable the execution
of the explorations on grid resources. The analysis of the sequence transformations is used to
resolve uncertainties associated with the phylogenetic trees namely, intermediate sequences in
the phylogenetic tree and the exact number of time-steps required for the evolution of a branch.
The model is further used to find out various statistics such as most popular rules at a particular time-step in the evolution history of a branch in a phylogenetic tree. We have observed
some interesting statistics regarding the unknown base pairs in the intermediate sequences of
the phylogenetic tree and the most popular rules used for sequence transformations.
The next part of the thesis deals with the prediction of future sequences from previous sequences. First, we try to find the preserved sequences so that cellular automata rules can
be applied selectively. Then, random strategies are developed as baseline benchmarks. A roulette
wheel strategy is used for predicting future DNA sequences. Though the prediction strategies
outperform the random benchmarks in most cases, the average performance improvement over the random strategies is not significant. The possible reasons are discussed.
Contents
List of Figures
List of Tables
1 Introduction
1.1 Cellular Automata
1.2 DNA and Cellular Automata
1.3 Phylogenetics
1.4 Grid Computing
1.5 Motivation and Problem Formulation
2 Related Work
2.1 DNA Sequence Evolution
2.2 Grid Computing Applications
2.2.1 Applications in Mathematics and Earth Sciences
2.2.2 Applications in Astronomy, Physics and Chemistry
2.2.3 Applications in Biology and Bioinformatics
3 Sequence Transformation on Grids
3.1 Sequence Transformation on a Branch
3.1.1 Naive Approach
3.1.2 Selective Application of Cellular Automata Rules
3.1.3 Dynamic Formation of Cellular Automata Rules
3.2 Sequence Transformer
3.3 Pseudo Molecular Clock Assumption
3.4 Design
3.4.1 Master-Worker Paradigm
3.4.2 Phases of Execution
3.5 Grid Computing Techniques
3.5.1 Load Balancing
3.5.2 Fault Tolerance
3.6 Database Design
3.7 Statistics Collection
3.7.1 Timesteps
3.7.2 Unknown base-pairs
3.7.3 Rules
3.7.4 Differential rule analysis
3.7.5 Popularity of transitions
4 Experiments and Results
4.1 Grid Infrastructure
4.2 Timesteps
4.3 Popular Rules
4.4 Base Pairs Corresponding to Unknown Positions
4.5 Potential of Grid Computing
5 Predictions in phylogenetic trees
5.1 Determining the Preserved Segments
5.1.1 Calculation of PSSM
5.1.2 Strategies for Determining Preserved Sequences
5.1.3 Evaluation of Strategies
5.1.4 Determination of Threshold Values for Flexible Strategies
5.2 Analysis of Random Strategies
5.3 Methods Used for Prediction
5.3.1 Roulette Wheel Method
5.3.2 Roulette Wheel Method with Random Component
5.3.3 History Sizes
5.3.4 Experiments and Results
5.4 Analysis
6 Conclusions and Future work
6.1 Conclusions
6.2 Future Work
References
List of Figures
1.1 Evolution of Cellular Automata through time steps
1.2 Rule that governs the evolution of cellular automata shown in Figure 1.1
1.3 Double helix structure of DNA (Courtesy : U.S. National Library of Medicine)
1.4 Example Phylogenetic Tree with Gag Sequences
3.1 Application of Random Cellular Automata Rules
3.2 Selective Application of Cellular Automata Rules
3.3 Example : Dynamic Formation of Cellular Automata Rules
3.4 Dynamic Formation and Selective Application of Cellular Automata Rules
3.5 Illustration of the Greedy Algorithm
3.6 The Master-Worker Design
3.7 Phase I in Master
3.8 Phase II in Master
5.1 Analysis of threshold values : Flexible-1
5.2 Analysis of threshold values : Flexible-2
5.3 Analysis of random strategies
List of Algorithms
1 Algorithm for Sequence Transformer
2 Greedy Algorithm for Chain Formation
3 Calculation of Position Specific Scoring Matrix
List of Tables
1.1 Left-Hand Sides of 64 Transitions of Cellular Automata with Neighborhood Size of 1
3.1 strands
3.2 working strand
3.3 branches
3.4 ruletable
3.5 chains
4.1 The Distributed Infrastructure
4.2 Summary of time step information for Gag sequences
4.3 Summary of time step information for GagPol sequences
4.4 Summary of time step information for env sequences
4.5 Differential Rule Analysis for Gag Sequences
4.6 Differential Rule Analysis for GagPol Sequences
4.7 Differential Rule Analysis for env Sequences
4.8 Popular Rules for a Branch for Gag Sequences
4.9 Popular Rules for a Branch for GagPol Sequences
4.10 Popular Rules for a Branch for env Sequences
4.11 Resolution of Unknown Positions for Gag Sequences
4.12 Resolution of Unknown Positions for GagPol Sequences
4.13 Resolution of Unknown Positions for env Sequences
4.14 Usefulness of Large Number of Runs
Chapter 1
Introduction
In this chapter, we give a brief background on cellular automata, the relationship between cellular
automata and DNA evolution, and the concept of phylogenetic trees.
1.1 Cellular Automata
A cellular automaton is a regular array of identical finite state automata where the next states of
the array elements are determined solely by their current states and the states of their neighbors.
A one-dimensional cellular automaton consists of a line of cells, each having a particular state. The state
of each of these cells changes over discrete time-steps. At every time-step, a definite
rule determines the next state of a given cell based on the current state of the cell and its
neighboring cells[36].
As an example, the evolution of one dimensional cellular automata is shown in Figure 1.1.
Each cell can have one of the two possible states - 0 or 1. In this example, the neighborhood
size is 1, i.e. state of each cell at the next time step is dependent on the state of that cell, the
state of its single left neighbor and the state of its single right neighbor. The rule that governs
this evolution is depicted in Figure 1.2.
Time step   Cells
0           0 1 0 1 0 0 0
1           1 1 0 1 1 0 0
2           0 0 0 0 0 1 1
Figure 1.1: Evolution of Cellular Automata through time steps
Neighborhood   000  001  010  011  100  101  110  111
Next state      0    1    1    0    1    0    0    1
Figure 1.2: Rule that governs the evolution of cellular automata shown in Figure 1.1
In Figure 1.1, the evolution of the cellular automaton is shown for two time-steps. The 0th time-step
corresponds to the original configuration of the cellular automaton. Let us consider the second
cell in this original configuration. The current state of this cell is 1. The state of both of its
neighbors (left and right) is 0, forming the triplet 010. As can be seen in Figure 1.2, the transition
corresponding to this state (010) tells us that the next state of this particular cell should be 1,
which is reflected in time-step 1. In the same way, all the remaining cells in the array get
transformed to their respective next states to form the complete array at time-step 1. The same
procedure is repeated to get the state of the array at time-step 2 and so on.
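The update procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `step`, the dictionary encoding of the rule and the wrap-around treatment of the two end cells are assumptions made here, and the initial row is the one read off Figure 1.1.

```python
# Rule table read off Figure 1.2: (left, cell, right) -> next state.
RULE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 1,
}


def step(cells, rule):
    """Apply one synchronous update to a one-dimensional cellular automaton.

    Assumption: the array wraps around, so the first and the last cell
    are treated as neighbors of each other.
    """
    n = len(cells)
    return [
        rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
        for i in range(n)
    ]


row0 = [0, 1, 0, 1, 0, 0, 0]  # time-step 0 in Figure 1.1
row1 = step(row0, RULE)       # time-step 1
row2 = step(row1, RULE)       # time-step 2
print(row1)
print(row2)
```

Every cell of the new row is computed from the old row only, which is why the update builds a fresh list instead of modifying `cells` in place.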
As can be seen, the cellular automata rule consists of eight transitions corresponding to the eight
possible left-hand side states. The right-hand side of each of these eight transitions can assume two values,
0 or 1. Thus, the number of rules possible for this particular cellular automaton is $2^8 = 256$.
These 256 CAs are generally referred to using Wolfram notation, a standard naming convention
invented by Wolfram[36]. The name of a CA is the decimal number which, in binary, gives the
right-hand sides of the rule table, with the eight possible states listed in reverse counting order, i.e. 111, 110, . . . , 000.
Thus, the rule in Figure 1.2 is the “rule 150 CA” (the binary representation of 150 is 10010110).
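The numbering can be checked mechanically. The sketch below, in which the function names are illustrative and not part of this work, encodes the table of Figure 1.2 and computes its number under Wolfram's convention, where the output for neighborhood 111 is the most significant bit and the output for 000 the least significant. Under that convention the table evaluates to 150; reading the same outputs in the order 000, 001, . . . , 111 gives the bit string 01101001, which is 105 in decimal.

```python
from itertools import product

# Next-state column of Figure 1.2, keyed by the (left, cell, right) triplet.
RULE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 1,
}


def wolfram_number(rule):
    """Rule number with neighborhoods listed as 111, 110, ..., 000."""
    bits = "".join(str(rule[t]) for t in product((1, 0), repeat=3))
    return int(bits, 2)


def ascending_bits(rule):
    """Outputs listed for neighborhoods 000, 001, ..., 111."""
    return "".join(str(rule[t]) for t in product((0, 1), repeat=3))


print(wolfram_number(RULE))
print(ascending_bits(RULE))
```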
One-dimensional cellular automata with two states have a rule consisting of $2^{2n+1}$ transitions, where $n$ is the neighborhood size, and the total number of possible rules is $2^{2^{2n+1}}$. In
general, a one-dimensional cellular automaton with $P$ states has a rule consisting of $P^{2n+1}$ transitions and the total number of possible rules is $P^{P^{2n+1}}$.
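As a quick numerical check of these counts, the following sketch evaluates both expressions for a given number of states and neighborhood size. The function names are chosen here for illustration only.

```python
def num_transitions(states, n):
    """Number of left-hand sides: one per assignment of the 2n + 1 cells."""
    return states ** (2 * n + 1)


def num_rules(states, n):
    """Number of rules: each left-hand side maps to one of `states` values."""
    return states ** num_transitions(states, n)


# Two states, neighborhood size 1: the 256 rules discussed above.
print(num_transitions(2, 1), num_rules(2, 1))

# Four states, neighborhood size 1: 64 transitions, and a rule count
# that is far too large to enumerate exhaustively.
print(num_transitions(4, 1), len(str(num_rules(4, 1))), "decimal digits")
```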
Cellular automata have been used to model physical, economic, and sociological
systems[12]. They have replaced partial differential equations in the area of system modeling fairly successfully. Evolution has been modeled using partial differential equations in the work
by Smith et al.[26]. This suggests that cellular automata can be potentially powerful
tools for modeling molecular evolution.
Hence, cellular automata can prove to be very powerful tools to analyze DNA mutations.
Rules found by modeling DNA mutations using cellular automata can give us useful insights
about the effects of neighboring base pairs on the evolution of DNA segments.
1.2 DNA and Cellular Automata
DNA is a nucleic acid that contains the genetic instructions for the development and function of
living things. DNA is often compared to the blueprint of a building, as it contains the information
for the construction of the other components of the cell, such as proteins and RNA molecules.
The building blocks of the DNA polymer are nucleotides, which in turn consist of a phosphate group, a sugar ring group and either a purine or a pyrimidine base group. The two possible
purines are guanine (G) and adenine (A) and the two possible pyrimidines are thymine (T) and
cytosine (C). DNA is a double-stranded molecule (see Figure 1.3), where the two strands are
connected to each other through hydrogen bonding between a purine on one strand and a pyrimidine on the other. Furthermore, adenine (A) is always paired with thymine (T)
and guanine (G) is always paired with cytosine (C). Thus, if the sequence of one of the strands of
DNA is known, the sequence of the other strand can be easily determined.
A sequence of these bases forms a strand. Three consecutive base-pairs in a DNA sequence
form one codon. Each codon corresponds either to one of the 20 amino acids or to a control
codon (start codon or end codon). There are 64 ($4^3$) possible combinations for a codon, but
there are only 20 amino acids, so more than one codon maps to the same amino acid.
This means that there is redundancy in the mapping. This mapping from codons to amino acids
is called the “genetic code” and is more or less the same for all organisms. A chain of amino acids
forms a protein. Proteins are the basic functional blocks of organisms. We can view a DNA
strand as a line of cells with each cell having one of four values (A, G, C or T). This array
of cells essentially contains the information stored in DNA. This strand is copied exactly to
produce another identical strand in the process of DNA replication. Sometimes, mutations
occur during replication, giving rise to a DNA strand which is different from the original strand.
These mutations are the basic cause of evolution.
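The pairing rule and the codon count can be made concrete with a short sketch. The function names below are illustrative. Note that `complement` returns the position-by-position partner of each base; since the two strands of DNA run in opposite directions, `reverse_complement` gives the partner strand as it would conventionally be written.

```python
from itertools import product

# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Base paired with each position of `strand`, in the same order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Partner strand read in its own direction (the strands are antiparallel)."""
    return complement(strand)[::-1]


def codons(strand):
    """Split a strand into consecutive triplets, dropping an incomplete tail."""
    usable = len(strand) - len(strand) % 3
    return [strand[i:i + 3] for i in range(0, usable, 3)]


# All possible codons: 4 choices at each of 3 positions.
ALL_CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]

print(complement("ATGC"))
print(codons("ATGGCCTA"))
print(len(ALL_CODONS))
```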
Figure 1.3: Double helix structure of DNA (Courtesy : U.S. National Library of Medicine)
There are strong indications that mutations of DNA base-pairs are affected by neighboring
base-pairs[13, 4, 2, 17, 29]. The exact effects of the neighboring base-pairs on the mutation of
an individual base-pair are still unknown. We attempt to find this relationship by
modeling DNA as a cellular automaton where the DNA mutations are governed by the cellular
automata rules. The DNA molecule can be viewed as a one-dimensional cellular automaton,
with four states per cell, corresponding to the four base-pairs. Thus, the base of this
cellular automaton is 4. The number of transitions in a given rule is $4^{2n+1}$, where $n$ is the number
of left/right neighbors. The total number of rules that may govern the DNA mutations is thus $4^{4^{2n+1}}$.
Even for $n = 1$, the number of possible cellular automata rules is $4^{64}$, which is an astronomical
number. Thus, the task of finding rules which could have been followed during evolution
requires the exploration of a huge search space.
In this work, we consider only those rules with a neighborhood size of 1, i.e. the transition
of a base-pair in a DNA sequence during evolution depends on the base-pair and its left and
right neighboring base-pairs. The 64 left-hand sides of the transitions corresponding to the
neighborhood size of 1 are listed in Table 1.1. The right-hand side of each transition can be
one of the 4 base-pairs, giving rise to $4^{64}$ rules. We also assume that during one evolution step,
a single rule is applied to the entire sequence, i.e. transitions in cells i and j of the sequence
are governed by the same rule. Although these assumptions do not encapsulate the myriad
mechanisms that could have been followed during evolution, the assumptions are reasonable
since there is evidence that a base-pair is more strongly affected by its immediate neighbors than
by its more distant neighbors[13, 2].
Table 1.1: Left-Hand Sides of 64 Transitions of Cellular Automata with Neighborhood Size of 1
S.No.  LHS    S.No.  LHS    S.No.  LHS    S.No.  LHS
1.     AAA    17.    AAC    33.    AAG    49.    AAT
2.     CAA    18.    CAC    34.    CAG    50.    CAT
3.     GAA    19.    GAC    35.    GAG    51.    GAT
4.     TAA    20.    TAC    36.    TAG    52.    TAT
5.     ACA    21.    ACC    37.    ACG    53.    ACT
6.     CCA    22.    CCC    38.    CCG    54.    CCT
7.     GCA    23.    GCC    39.    GCG    55.    GCT
8.     TCA    24.    TCC    40.    TCG    56.    TCT
9.     AGA    25.    AGC    41.    AGG    57.    AGT
10.    CGA    26.    CGC    42.    CGG    58.    CGT
11.    GGA    27.    GGC    43.    GGG    59.    GGT
12.    TGA    28.    TGC    44.    TGG    60.    TGT
13.    ATA    29.    ATC    45.    ATG    61.    ATT
14.    CTA    30.    CTC    46.    CTG    62.    CTT
15.    GTA    31.    GTC    47.    GTG    63.    GTT
16.    TTA    32.    TTC    48.    TTG    64.    TTT
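Under these two assumptions (neighborhood size 1, one rule for the whole sequence), a single evolution step can be sketched as follows. This is an illustration rather than the implementation used in this work: the names `LHS`, `random_rule` and `apply_rule` are introduced here, the triplets are taken to be written as left neighbor, cell, right neighbor, and the handling of the two end positions, which lack one neighbor, is an assumption (they are copied unchanged).

```python
import random
from itertools import product

BASES = "ACGT"

# The 64 left-hand sides in the order of Table 1.1: the left neighbor
# varies fastest, then the cell itself, then the right neighbor.
LHS = [l + c + r for r, c, l in product(BASES, repeat=3)]


def random_rule(rng):
    """Draw one of the 4**64 possible rules uniformly at random."""
    return {lhs: rng.choice(BASES) for lhs in LHS}


def apply_rule(seq, rule):
    """One evolution step with the same rule applied at every position.

    Assumption: the first and last bases are copied unchanged.
    """
    if len(seq) < 3:
        return seq
    middle = "".join(rule[seq[i - 1:i + 2]] for i in range(1, len(seq) - 1))
    return seq[0] + middle + seq[-1]


# The identity rule maps every triplet to its own middle base.
identity = {lhs: lhs[1] for lhs in LHS}

# Changing a single transition (ACG -> T) mutates only matching contexts.
mutating = dict(identity)
mutating["ACG"] = "T"

print(apply_rule("ACGTAC", identity))
print(apply_rule("ACGTAC", mutating))
print(apply_rule("ACGTAC", random_rule(random.Random(0))))
```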
1.3 Phylogenetics
In biology, the study of evolutionary relatedness among various groups of organisms (e.g.,
species, populations) is called phylogenetics. A phylogenetic tree, an example of which is shown in Figure 1.4, also
called an evolutionary tree or a tree of life, is a tree showing the evolutionary interrelationships
among various species or other entities that are believed to have a common ancestor. The leaves
of the tree represent various organisms, species, or genomic sequences. Each internal node of
the tree stands for an abstract organism (species, sequence) whose existence is presumed and
whose evolution led to the organisms in the leaves.
Figure 1.4: Example Phylogenetic Tree with Gag Sequences
Evolution is visualized with the help of phylogenetic trees corresponding to a set of organisms. Phylogenetic trees give a picture of the relatedness between various organisms. A rooted
phylogenetic tree has a root that corresponds to the most recent common ancestor of all the
sequences under consideration. A branch in a rooted phylogenetic tree connecting an ancestor
and a progeny indicates that one sequence (the progeny) evolved from the other (the ancestor). Unrooted trees illustrate the relatedness of the leaf sequences without making assumptions about
ancestry.
Various efforts have been made to construct phylogenetic trees for a given set of DNA
sequences[19, 33]. However, there are certain uncertainties associated with these phylogenetic
trees. The reconstruction of the sequences corresponding to intermediate nodes is not complete
in the phylogenetic trees. There are several positions in the intermediate sequences where the
exact base-pairs are not known. The number of time steps required for the mutations to occur
is also not known explicitly. Finally, these trees do not provide any indications about the rules
that may have been followed during the evolution of the different sequences in the given tree.
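To give a feel for the uncertainty in intermediate sequences, the sketch below enumerates every way of filling the unknown positions of a sequence. The placeholder character `?` and the function names are hypothetical choices made for this illustration; the point is only that k unknown positions give $4^k$ candidate sequences.

```python
from itertools import product

BASES = "ACGT"
UNKNOWN = "?"  # hypothetical placeholder for an unresolved position


def unknown_positions(seq):
    """Indices of the positions whose base is not known."""
    return [i for i, base in enumerate(seq) if base == UNKNOWN]


def candidate_sequences(seq):
    """Yield every completion of `seq`; k unknowns give 4**k candidates."""
    holes = unknown_positions(seq)
    for fill in product(BASES, repeat=len(holes)):
        chars = list(seq)
        for pos, base in zip(holes, fill):
            chars[pos] = base
        yield "".join(chars)


intermediate = "A?G?"
print(unknown_positions(intermediate))
print(list(candidate_sequences(intermediate)))
```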
1.4 Grid Computing
Grid computing involves a collection of computational resources brought together in order to solve problems of
large magnitude. These computational resources may be diverse in terms of their computing
power or architecture and may be under different administrative domains. These resources
may be shared, resulting in resource dynamics in the system. They may be spread over a large
geographical area. Grid computing seamlessly organizes these resources in order to solve the
problem. More importantly, grid computing can utilize unused computing power in order to
give a low-cost solution to problems which are otherwise solved on an expensive single
high-performance system.
Though grids can have many possible architectures, one of the main features of grids is
that many computing resources¹ are connected to each other through low-bandwidth and highly
shared networks. Hence, communication is almost always a performance bottleneck for a grid
(when compared with a single high-performance system). As a result, applications which involve
little communication, or applications which can be decomposed into tasks which involve little
or no communication between them, are suitable for grids. Often, popular applications in grid
computing have a client-server architecture where the client contacts the server for some information
and the actual processing of the information is done at the client side.
The problem considered in this thesis has client-server characteristics and hence is suitable
for grid computing. The entire problem can be divided into different computational tasks. These
tasks can be assigned to any number of available computing resources over the grid. Modeling DNA
sequence mutations or transformations using cellular automata where each cell can assume one
of 4 possible states requires exploration of a huge search space and needs large amounts of
computing cycles. Grid computing has previously been used successfully in order to solve
problems of very large magnitude.
¹ The computing resources here can be personal computers, clusters of servers or high-performance
supercomputers.
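The decomposition into independent tasks can be illustrated on a single machine. In the sketch below, a thread pool stands in for grid resources, each task is a random search over rules with its own seed, and a candidate rule is scored by the number of positions at which the transformed ancestor agrees with the progeny. All of these choices, including the function names and the treatment of the end positions, are assumptions made for illustration and are not the design described later in this thesis.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product

BASES = "ACGT"
LHS = [l + c + r for r, c, l in product(BASES, repeat=3)]


def apply_rule(seq, rule):
    """One step of the automaton; end bases are copied unchanged (assumption)."""
    middle = "".join(rule[seq[i - 1:i + 2]] for i in range(1, len(seq) - 1))
    return seq[0] + middle + seq[-1]


def worker(task):
    """Explore `trials` random rules and report the best one found.

    A task carries everything the worker needs, so workers never have
    to communicate with each other.
    """
    seed, trials, ancestor, progeny = task
    rng = random.Random(seed)
    best_matches, best_rule = -1, None
    for _ in range(trials):
        rule = {lhs: rng.choice(BASES) for lhs in LHS}
        result = apply_rule(ancestor, rule)
        matches = sum(a == b for a, b in zip(result, progeny))
        if matches > best_matches:
            best_matches, best_rule = matches, rule
    return seed, best_matches, best_rule


def master(ancestor, progeny, n_tasks=4, trials=200, max_workers=2):
    """Split the search into independent tasks and keep the best result."""
    tasks = [(seed, trials, ancestor, progeny) for seed in range(n_tasks)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, tasks))
    return max(results, key=lambda r: r[1])


seed, matches, rule = master("ACGTACGT", "ACGAACGT")
print(seed, matches)
```

Because every task derives its random numbers from its own seed, the outcome does not depend on which worker runs which task or in what order, which is the property that makes such a search easy to spread over loosely coupled resources.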
Grid computing has been found to be useful in fields that vary from mathematics to
biology and medicine. In the field of mathematics, grids have been used to solve the satisfiability
problem[7] and to find primes of the form $k \cdot 2^n \pm 1$. In the field of Earth sciences, grid computing
is being used to produce a forecast of the climate in the 21st century in the ClimatePrediction.Net[8]
project. In physics, projects such as SETI@Home[28], µFluids[18] and Einstein@Home[10] are
making use of unused processor cycles on various desktops in order to achieve high computing
power. In the area of bioinformatics, SIMAP[30], Rosetta@home[25] and Predictor@Home[22]
use the power of grid computing to analyze the structures and functions of proteins. The evolution
of arthropods has been extensively studied using grid computing by Stewart et al.[33].
In these applications, vast and ever-expanding grid resources have been used successfully
for parallel exploration and analysis of different parameter values. The particular parameters
that are of interest in our work on DNA sequence evolutions are cellular automata rules for
depicting the effects of neighboring base-pairs on evolutions, unknown bases in the intermediate
sequences and the exact number of time steps involved in the evolution of one sequence to
another.
1.5 Motivation and Problem Formulation
The study of the evolution of different species or organisms is important to biologists since it has many
practical applications including drug discovery, population monitoring and management[11].
The availability of DNA sequence databases[27] in the last few decades has enabled the study of
evolution at the molecular level. There have been many studies on the evolution of species
using DNA sequences[13, 26, 4, 2, 14]. In terms of molecular biology, evolution can be viewed
as a mutation event in which a particular DNA segment of the organism undergoes some change.
During evolution, a DNA segment consisting of a sequence of purines and pyrimidines (also
called base-pairs) changes to a different sequence of base-pairs.
While the effects of some neighborhood base-pairs on the evolution of a DNA segment are
known, there has been very little work[5, 31], to our knowledge, that comprehensively analyzes
the effects of different neighborhoods on evolution. The exact effect of base-pairs on mutation is still unknown and remains a challenge for evolutionary biologists. While phylogenetic
trees constructed by existing packages[20] give an overall picture of the relationships, these
packages do not give fine-level details of the way evolution might have progressed. The trees
do not give the exact number of time steps required for mutation and also do not give
any indication of the effect of neighboring base-pairs on mutations. These packages, while
constructing the phylogenetic tree, produce incomplete hypothetical sequences for intermediate
nodes. The exact base-pairs at some positions in the intermediate sequences are not known.
Our work tries to resolve the uncertainties associated with the phylogenetic trees. In particular, our work tries to determine the rules for neighborhood based mutations that may have
been followed during the evolutions of sequences. In addition, our work also tries to resolve the
uncertainties related to the number of time steps and unknown base-pairs in the intermediate
sequences of the phylogenetic trees. We use the vast number of resources available in computational grids to perform parameter searches associated with the phylogenetic trees with the
intention of narrowing the ranges of the parameters.
In this work, we model DNA sequence mutations using cellular automata to find rules for
neighborhood-based mutations for a particular phylogenetic tree on computational grids. By
parallel guided exploration of a large number of cellular automata rules on grid resources, we also
attempt to resolve uncertainties associated with the phylogenetic tree, namely, finding unknown
base-pairs in intermediate sequences and the number of time steps for evolution. This analysis
of mutations using cellular automata rules can be utilized by evolutionary biologists to better
predict mutations of DNA sequences. Thus, formally, the problem can be stated as follows: to develop
solutions based on cellular automata to restrict and/or resolve the uncertainties associated
with phylogenetic trees by parallel guided exploration of a vast number of cellular automata
rules on different resources of computational grids. This cellular automata model of DNA
gives us various important statistics about the phylogenetic tree. We then attempt to use these
statistics to predict future DNA sequences of the phylogenetic tree. The methods used for
prediction depend on calculating preserved sequences using a position specific scoring matrix
and the statistics collected about the phylogenetic tree during the analysis.
The rest of the thesis is organized as follows. In Chapter 2, we look at the existing works
that are relevant to the problem. In particular, we look at the works that suggest the effect of
neighboring bases on the mutation of a DNA strand. In Chapter 3, we look at the design of
the cellular automata model of DNA mutations. We also see how this design can be used to resolve
the uncertainties associated with phylogenetic trees. Chapter 4 describes the various experiments
we have performed and their corresponding results. We also look at the interesting statistics that
we have collected. In Chapter 5, we try to extend the cellular automata model in order to predict
future sequences. Chapter 6 concludes the thesis and gives directions for future work.
Chapter 2
Related Work
In this chapter, we look at some of the efforts that have investigated DNA sequence evolution
by taking context dependency into account. We then look at the efforts that illustrate how Grid
Computing can help large applications.
2.1 DNA Sequence Evolution
There have been a number of studies on the evolution of DNA sequences[14, 4, 13, 2, 17, 29, 31,
33]. The work by Korber et al.[14] studies the evolution of HIV sequences using the molecular
clock assumption; this hypothesis postulates that molecular change is a linear function of time
and that substitutions accumulate according to a Poisson distribution. HIV-1 sequences were
analyzed to estimate the timing of the ancestral sequence of the main group of HIV-1, the strains
responsible for the AIDS pandemic. Using parallel supercomputers and assuming a constant
rate of evolution, maximum-likelihood phylogenetic methods were applied to unprecedented
amounts of data for this calculation. Results were validated by correctly estimating the timing
of two historically documented points.
There are a number of studies that analyze the impact of neighboring bases on the mutation
of a particular base. The work by Bulmer[4] finds that there is a marked increase in the frequency of transitions from the doublet CG. There are also some smaller effects of neighboring
bases on the frequencies of transitions from adenine and thymine. They also determine that the
transition frequency from either of these bases is reduced by having G on the right (or C on the
left) and increased by having T on the right (or A on the left).
Hess[13] also concludes that substitution rates, representing averages over those for different regions of the genome, are distributed over a 60-fold range with strong biases in particular
neighbor-pair environments. Studies indicate that substitution rates vary for the same base-pair
for different neighbor-pair environments. They found that, in general, the rates are fastest in
alternating purine-pyrimidine sequences and slowest in purine-pyrimidine tracts. This clearly
indicates that the mutation rates are affected by neighboring base pairs.
Arndt et al.[2] introduce a model of DNA sequence evolution which can account for biases
in mutation rates that depend on the identity of the neighboring bases. They have developed an
analytical model of evolution by adopting the methods of non-linear dynamics. They conclude
that phylogenetic analysis should be extended to include neighbor-dependent effects.
All the above efforts clearly indicate that neighboring bases have some effect on the mutation of a particular base. However, none of these studies analyzes the fine-grain effects of neighboring
bases during each step of evolution. A cellular automata model can exploit this neighborhood
dependency during DNA sequence evolution.
Morton et al.[17] have analyzed 1776 aligned SNP sequences generated from nuclear genes
of maize to study the effect of neighborhood compositions on mutation dynamics. Their studies
have found that the A+T content of flanking nucleotides has an influence on various aspects of
mutation dynamics. Overall, the polarized SNP data yielded a G and C nucleotide mutation rate
(the GC rate) that is 1.6 times the rate of mutation for A and T nucleotides (the AT rate). The
sequences used in their study for the analysis were pre-generated while the sequences in our
study are dynamically generated due to change in states of cellular automata.
Siepel and Haussler[29] incorporate context-dependence in phylogenetic models to improve
the quality of phylogenetic trees. Thus the motivation of their work is similar to ours. Their
work indicates that the patterns of context-dependent substitutions are complex in both coding and noncoding regions. They build models of different orders and higher ordered models
produce better results than lower ordered models. Third-order models suggest that important
context effects occur at the level of nucleotide triplets. Their work focuses on using their improved context-dependent phylogenetic models to estimate the pattern and rates of substitutions
on the branches of a given phylogenetic tree. Their work also reports results on better estimates
of branch lengths. In addition, their work has the potential to refine phylogenetic tree construction. They have estimated substitution rates and context effects for 160,000 noncoding sites and
3 million sites in coding regions in mammalian genomes. Our work based on cellular automata
tries to determine finer-grained context-dependent effects on individual steps of mutations.
DNA evolution has been modeled as Cellular Automata in the work by Sirakoulis et al.[31].
In this work, the application of a cellular automata rule to a DNA strand is treated as a matrix multiplication modulo 4. This strategy, however, cannot consider all possible cellular automata rules.
The neighborhood effects for mutations are less clear after the modulo 4 operation is performed.
This tool was created to visualize the evolution according to a particular cellular automata rule.
The work by Stewart et al.[33] prepared a global grid for studying arthropod evolution. The effort implemented, on a global grid, a parallel version of the fastDNAml[19] algorithm,
which uses a maximum likelihood approach to construct better phylogenetic trees. While these efforts
deal with constructing a better phylogenetic tree, we try to refine the phylogenetic tree using
grid computing technologies. We also try to find additional information about these trees by
modeling the DNA sequence evolution as cellular automata rules.
2.2 Grid Computing Applications
Grid Computing has been found useful in fields that vary from mathematics to biology and
medicine. In the following subsections, we take a brief look at some selected applications
where Grid Computing has helped produce results that require a large amount of resources.
2.2.1 Applications in Mathematics and Earth Sciences
ClimatePrediction.Net[8] is an example where Grid Computing has been found useful to solve
a highly computation-intensive problem. The project is the largest experiment to try to produce
a forecast of the climate in the 21st century. As in our application, ClimatePrediction.Net
evaluates the effects of different values of various uncertain parameters. The application starts
with fixed values of parameters such as ice fall speed and coefficients of diffusion. As more and
more experiments are performed, better values of the parameters are found.
GridSAT[7] is another example where Grid Computing enabled the finding of answers to
previously unsolved satisfiability problems. This application uses a master-client model along
with effective rescheduling techniques to make use of idle workstations to solve difficult
problems. The solution involves breaking the big problem into smaller subproblems at every
step. These subproblems are then distributed among the available clients based on their capacity.
ABC@Home[1] aims at finding triples of integers $a, b, c$ such that $a + b = c$, $a < b < c$, the integers $a, b, c$ have no common divisors, and $c > \mathrm{rad}(abc)$, where $\mathrm{rad}(abc)$ is the product of the distinct
primes dividing $abc$. The aim of the SZTAKI Desktop Grid project[34] is to find all the generalized binary
number systems up to dimension 11. These applications are ever greedy for computational
resources: the more resources are available, the better the results produced. The Riesel Sieve
project[24] tries to find primes of the form $k \cdot 2^n - 1$. PrimeGrid[23] is generating a public
database of sequential prime numbers and searching for twin primes of the form $k \cdot 2^n - 1$, $k \cdot 2^n + 1$.
2.2.2 Applications in Astronomy, Physics and Chemistry
SETI@Home[28] is a typical example in which unused processor cycles all over the world are
used to search for signs of intelligent signals from space. The µFluids project[18] is a massively
distributed computer simulation of two-phase fluid behavior in microgravity and microfluidics
problems. Einstein@Home[10] is a program that uses the idle time of personal computers to search
for spinning neutron stars (also called pulsars). These applications break down a large problem into smaller subproblems and then distribute these subproblems over a large number of grid
resources. Similarly, the goal of Spinhenge@home[32] is to study molecular magnets and controlled nanoscale magnetism with the help of grid computing. Quantum Monte Carlo@Home[6]
studies the structure and reactivity of molecules using quantum chemistry.
LHC@home (Large Hadron Collider at Home)[15] runs simulations for the Large Hadron Collider, a particle accelerator being built at CERN,
the European Organization for Nuclear Research. These examples use the master-worker paradigm
and illustrate the power of Grid Computing to solve problems that require a large amount
of computing power.
2.2.3 Applications in Biology and Bioinformatics
Rosetta@home[25] aims to determine the 3-dimensional shapes of proteins. The SIMAP project
aims to calculate similarities between proteins. SIMAP[30] provides a public database
of the resulting data, which plays a key role in many bioinformatics research projects.
Predictor@Home[22] attempts to predict the folded and functioning form of a protein. Predicting the structure of an unknown protein is a critical problem in enabling structure-based
drug design to treat new and existing diseases. Huge amounts of computational resources
made available by grid computing have helped the field of bioinformatics to come up with
complex multi-dimensional protein structures. Malariacontrol.net[16] runs simulation models of the transmission dynamics and health effects of malaria, which are an important tool for
malaria control. They can be used to determine optimal strategies for delivering mosquito nets,
chemotherapy, or new vaccines which are currently under development and testing. Such modeling is extremely computer intensive, requiring simulations of large human populations with a
diverse set of parameters related to biological and social factors that influence the distribution of
the disease. The goal of World Community Grid[35] is to further critical non-profit research on
some of humanity’s most pressing problems by creating the world’s largest volunteer computing grid. Research includes HIV/AIDS, cancer, muscular dystrophy, dengue fever, and many
more. The work by Stewart et al.[33] also made use of the potential of Grid Computing for
the analysis of the evolution of arthropods, research which would not otherwise have been
undertaken. In this case, the master-worker paradigm helped to achieve coarse-grain and fine-grain
parallelism with fault tolerance techniques. The Biological General Repository for Interaction
Datasets (BioGRID)[3] is a curated biological database of protein-protein interactions. It strives
to provide a comprehensive resource of Protein-Protein interactions for all major species while
attempting to remove redundancy to create a single mapping of protein interactions.
These applications have been helped by the large computing resources made available by
Grid infrastructures. Most of these applications exploit parallelism using a master-client architecture where many clients perform, in parallel, the computations assigned by the master. We also
use a master-client approach along with Grid Computing to model DNA mutations using cellular
automata rules.
Chapter 3
Sequence Transformation on Grids
DNA mutations can be modeled using cellular automata since DNA mutations are related to the
neighboring base pairs. To study these mutations, we make use of phylogenetic trees. For each
branch of the phylogenetic tree, we try to transform ancestor sequence to progeny sequence using cellular automata rules. In this chapter, we describe the techniques used for transformation
of sequences from an ancestor to a progeny, a program for performing the transformations, and
our assumptions regarding evolutionary rates to manage the number of successful transformations. Further, we also see how these techniques are realized on a grid using some of the grid
computing techniques.
3.1 Sequence Transformation on a Branch
We use Phylip[20] to construct phylogenetic trees. DNA sequences were downloaded from
the HIV Sequence database at Los Alamos[27]. The sequences were aligned using the ClustalW web
interface[9]. The aligned sequences were then input to Phylip to obtain a phylogenetic tree for
a given set of sequences. The ‘dnamlk’ program of the Phylip package was used as it generates the phylogenetic tree assuming a molecular clock. The transition/transversion ratio was kept
as 2.0. Transitions are the mutations which change purines into purines and pyrimidines into
pyrimidines. Transversions are the mutations which change purines into pyrimidines and vice
versa. Empirical base frequency was used. The option for intermediate sequence generation
was kept ON. At the end of the program, a Phylip output file was generated containing the entire
tree, branches with their branch lengths, intermediate sequences and other debugging information. Intermediate sequences were output in interleaved format. These sequences were separated from the phylip output file and stored sequentially. The branches and their corresponding
branch lengths were also obtained from the output file.
Each branch in a phylogenetic tree corresponds to an ancestor-progeny pair. For each
branch, we apply a set of cellular-automata rules for transforming an ancestor sequence to
the progeny sequence. We compare a sequence, produced during the transformation, with the
progeny sequence using a similarity value metric defined as the fraction of
base pairs in the sequence that match the corresponding base pairs in the progeny sequence.
Thus, a similarity value of 1 indicates that the transformation has resulted in the progeny sequence.
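The similarity value can be written down in a few lines. The following is a minimal sketch, assuming that both sequences are aligned strings of equal length; the function name `similarity` is illustrative and does not come from the thesis code.

```python
def similarity(sequence: str, progeny: str) -> float:
    """Fraction of positions at which `sequence` matches `progeny`.

    Assumption: both strings are aligned and have the same non-zero length.
    """
    if len(sequence) != len(progeny) or not progeny:
        raise ValueError("sequences must be aligned and non-empty")
    matches = sum(a == b for a, b in zip(sequence, progeny))
    return matches / len(progeny)


# Three of the four positions match.
print(similarity("ACGT", "ACGA"))
```

A value of 1.0 is returned only when every position matches, which corresponds to a completed transformation.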
The following subsections describe the methods used for sequence transformation of one branch
in the phylogenetic tree.
3.1.1 Naive Approach
One naive approach for transformation is to randomly choose a cellular-automata rule at each
time step and apply the rule to the current sequence. We then monitor the progress of similarity
value metric over the time-steps. This approach can lead to the sequences deviating from the
progeny sequence as illustrated in Figure 3.1. The figure shows the similarity values of the
sequences when using the naive approach for an ancestor-progeny branch corresponding to
a phylogenetic tree constructed for gag sequences of HIV virus using Phylip package. The
progeny sequence in this branch has accession number X52154 and the ancestor sequence is
one of the intermediate sequences generated by the Phylip package.
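The naive approach can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: a rule is represented as a table from every window of three bases to a new base (neighborhood size one), sequences contain only A, C, G and T, and the ends of the sequence wrap around so that every position has two neighbors. All function names are hypothetical.

```python
import itertools
import random

BASES = "ACGT"


def random_rule(rng, n=1):
    """Map every window of 2n+1 bases to a randomly chosen base."""
    return {"".join(w): rng.choice(BASES)
            for w in itertools.product(BASES, repeat=2 * n + 1)}


def apply_rule(sequence, rule, n=1):
    """Apply the rule to all positions at once (circular boundary: assumption)."""
    size = len(sequence)
    return "".join(
        rule["".join(sequence[(i + d) % size] for d in range(-n, n + 1))]
        for i in range(size)
    )


def similarity(sequence, progeny):
    return sum(a == b for a, b in zip(sequence, progeny)) / len(progeny)


def naive_transform(ancestor, progeny, steps, seed=0):
    """Apply a fresh random rule at every time step and record the similarity."""
    rng = random.Random(seed)
    current, history = ancestor, []
    for _ in range(steps):
        current = apply_rule(current, random_rule(rng))
        history.append(similarity(current, progeny))
    return history
```

Because every position is rewritten at every step, positions that already match the progeny are not protected, which is consistent with the lack of convergence reported for this approach.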
Figure 3.1: Application of Random Cellular Automata Rules (plot of similarity value against timesteps)
3.1.2 Selective Application of Cellular Automata Rules
Figure 3.2: Selective Application of Cellular Automata Rules (plot of similarity value against timesteps)
For successful transformation of an ancestor to a progeny, we can make use of the fact that not
all the base-pairs mutate at each time step. Indeed, many subsequences in the ancestor sequence
match exactly with the progeny sequence at their corresponding positions. Thus, at a given time
step, we apply a random cellular automata rule only to those base-pairs in the current sequence
which differ from the corresponding base-pairs in the progeny sequence. This selective application of cellular automata rules helps in the convergence of sequences to a progeny sequence
as shown in Figure 3.2. This figure shows the similarity value for each time step when using
selective application of cellular automaton rules for the same branch. As seen in the figure, this
approach completes the transformation in 141 time steps. This approach can also be biologically justified since most of the successful sequences (which form complete proteins) are less
prone to mutations than the others.
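Selective application can be sketched as below. As before, this is an illustration only: the rule representation (a table over windows of three bases), the circular boundary and all names are assumptions rather than details taken from the thesis program.

```python
import itertools
import random

BASES = "ACGT"


def random_rule(rng, n=1):
    """Map every window of 2n+1 bases to a randomly chosen base."""
    return {"".join(w): rng.choice(BASES)
            for w in itertools.product(BASES, repeat=2 * n + 1)}


def apply_selectively(current, progeny, rule, n=1):
    """Rewrite only the positions that still differ from the progeny."""
    size = len(current)
    out = []
    for i in range(size):
        if current[i] == progeny[i]:
            out.append(current[i])  # matching positions are left untouched
        else:
            window = "".join(current[(i + d) % size] for d in range(-n, n + 1))
            out.append(rule[window])
    return "".join(out)


def selective_transform(ancestor, progeny, max_steps=10_000, seed=0):
    """Return the final sequence and the number of time steps used."""
    rng = random.Random(seed)
    current, steps = ancestor, 0
    while current != progeny and steps < max_steps:
        current = apply_selectively(current, progeny, random_rule(rng))
        steps += 1
    return current, steps
```

Since matching positions are never rewritten, the similarity value cannot decrease from one time step to the next in this sketch.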
Rules formed from windows of the current sequence (left hand side) and base-pairs of the progeny sequence (right hand side):
1. CGT → T
2. GTA → A
3. TAC → A
4. ACG → G
5. CGT → A
Figure 3.3: Example : Dynamic Formation of Cellular Automata Rules
3.1.3 Dynamic Formation of Cellular Automata Rules
In dynamic formation of cellular automata rules, we try to dynamically create a cellular automata rule using a sequence obtained during the transformation and the progeny sequence.
Figure 3.3 illustrates the dynamic formation of a rule.
We try to form a complete rule by forming the individual transitions with the use of current
and progeny sequences. The current sequence on which a rule is to be applied forms the left
hand side of the transitions. The progeny sequence forms the right hand side of the transitions.
For example, in Figure 3.3, the left hand side of a transition is formed by the first three base-pairs of the current sequence, namely, CGT; and the right hand side of the transition is formed
by the corresponding base-pair in the progeny sequence, T. We begin the formation of the rule
from the first few base-pairs of the current and the progeny sequences. These base-pairs form a
window in the current sequence and map to a single base-pair in the progeny sequence. The size
of this window is 2 · n + 1 where n is the neighborhood size used in the cellular automata rule.
In our example, the neighborhood size is one, hence the window consists of three base-pairs
including a neighbor on either side of the base-pair under consideration for mutation. We slide
this window over the entire current sequence or until we find a contradicting transition during
the formation of the rule.
A contradicting transition is found when two windows in the current sequence containing
exactly the same sub-sequence map to different base-pairs in the corresponding progeny sequence. In our example, transitions 1 and 5 contradict each other as they try to map the
same sequence, CGT, to different base-pairs. If a contradicting transition is found, we selectively apply a random cellular automata rule to the current sequence, leading to a new sequence,
and repeat the procedure of dynamic rule formation on the new sequence. With dynamic rule
formation and selective application of cellular automata rules, the number of time steps required
for the transformation between the same ancestor-progeny pair is reduced considerably, as shown
in Figure 3.4.
Figure 3.4: Dynamic Formation and Selective Application of Cellular Automata Rules (plot of similarity value against timesteps)
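The sliding-window construction and the contradiction check can be sketched as follows. The function name `form_rule` is hypothetical, and the sketch makes two assumptions: positions without a full neighborhood at either end are skipped, and the example strings are built from the rule windows listed in Figure 3.3, with the first and last progeny bases chosen arbitrarily as placeholders.

```python
def form_rule(current, progeny, n=1):
    """Try to build a rule mapping windows of `current` to bases of `progeny`.

    Returns a dict {window: base}, or None as soon as two identical windows
    would have to map to different bases (a contradicting transition).
    """
    rule = {}
    for i in range(n, len(current) - n):
        window = current[i - n:i + n + 1]
        target = progeny[i]
        # setdefault stores the first target seen for this window
        if rule.setdefault(window, target) != target:
            return None
    return rule


# Windows CGT, GTA, TAC map to T, A, A without conflict.
print(form_rule("CGTAC", "CTAAC"))
# CGT would have to map to both T and A, so no rule can be formed.
print(form_rule("CGTACGT", "CTAAGAT"))
```

Windows that never occur in the current sequence remain unconstrained in this sketch; how the thesis program completes such entries is not specified here.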
Based on the above principles, we have written a program, the sequence transformer,
that uses selective application and dynamic formation of cellular automata rules for transformation on an ancestor-progeny branch. We describe the program in the next section.
3.2 Sequence Transformer
The sequence transformer is a fundamental component in our infrastructure. The pseudo-code is
given in Algorithm 1. It takes as input the sequences for an ancestor and a progeny and produces as
output the number of time steps required for the transformation from the ancestor to the progeny
and the cellular automata rules applied during the transformation. The array rule_arr maintains
the cellular automata rules applied for transformations and the array rule_change indicates the
time step at which a rule was applied. (rule_change[i + 1] − rule_change[i]) is the number
of time steps for which the same rule rule_arr[i] is applied. The input variable sametol is a
tolerance factor, i.e. the number of time steps for which the same rule can be applied without an increase
in similarity value. Initially, after alignment of the sequences, the base pairs corresponding
to some segments of the ancestor sequences are not known. The sequence transformer also
fills these unknown segments in the ancestor sequence with random base pairs (line 2). This
assignment of unknown segments is also recorded as output of the sequence transformer.
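The relation between the two output arrays can be illustrated with a small helper. This is a sketch only: the function name and the `total_steps` argument (used to close the interval of the last rule) are assumptions, and the numbers in the example are made up.

```python
def rule_durations(rule_change, total_steps):
    """Number of time steps for which each recorded rule stayed in use.

    rule_change[i] is the time step at which rule i was first applied;
    total_steps (an assumed extra input) ends the interval of the last rule.
    """
    boundaries = list(rule_change) + [total_steps]
    return [boundaries[i + 1] - boundaries[i] for i in range(len(rule_change))]


# Rules started at steps 0, 5 and 12 in a run of 20 steps.
print(rule_durations([0, 5, 12], 20))  # [5, 7, 8]
```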
Multiple runs of sequence transformer on a branch can lead to evolution of the ancestor to
the progeny of the branch with different rules for transformations and different timesteps. In
order to work with a manageable number of solutions, assumptions were made relating the branch
lengths of the branches and the time steps. These assumptions are explained in the next section.
3.3 Pseudo Molecular Clock Assumption
The molecular clock is an assumption of a constant rate of evolution[14], i.e. the rate of evolutionary
change of any specified protein is approximately constant over time and over different lineages
in the phylogenetic tree. Thus, according to the strict molecular clock, there exists a single rate of
mutation $\alpha$ such that
$$b_1 = \alpha \cdot t_1;\quad b_2 = \alpha \cdot t_2;\quad \cdots;\quad b_n = \alpha \cdot t_n \tag{3.1}$$
where $b_1, b_2, b_3, \ldots, b_n$ are the branch lengths of the $n$ branches in the phylogenetic tree and $t_1, t_2, t_3, \ldots, t_n$ are the time steps taken for mutations of the corresponding branches. Branch length
is a measure of the difference between the ancestor and progeny of a branch and is obtained
along with the phylogenetic tree from the Phylip package. The time steps are outputs from
our sequence transformer program. The strict molecular clock assumption has been used in
some phylogenetic inferences[14]. In a non-molecular clock assumption, mutation rates for the
different sequences in the phylogenetic tree are different, i.e.
$$b_1 = \alpha_1 \cdot t_1;\quad b_2 = \alpha_2 \cdot t_2;\quad \cdots;\quad b_n = \alpha_n \cdot t_n \tag{3.2}$$
In our work, we use a pseudo-molecular clock assumption. According to this,
$\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_n$ are related such that
$$b_1 = \alpha_1 \cdot t_1 < b_2 = \alpha_2 \cdot t_2 < \cdots < b_n = \alpha_n \cdot t_n \tag{3.3}$$
and
$$t_1 < t_2 < t_3 < \cdots < t_n \tag{3.4}$$
This assumption is reasonable since the greater the branch length, i.e. the greater the difference between
the ancestor and progeny sequences, the more time steps are required to transform an ancestor
into a progeny.
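The pseudo-molecular clock condition amounts to a simple ordering test: when the branches are sorted by branch length, the corresponding time steps must be strictly increasing. A minimal sketch follows; the function name is hypothetical, distinct branch lengths are assumed, and the numbers in the example are invented for illustration.

```python
def satisfies_pseudo_clock(branch_lengths, time_steps):
    """True if sorting branches by length also sorts their time steps strictly."""
    order = sorted(range(len(branch_lengths)), key=lambda i: branch_lengths[i])
    ordered_steps = [time_steps[i] for i in order]
    return all(a < b for a, b in zip(ordered_steps, ordered_steps[1:]))


# Branch lengths 0.01 < 0.02 < 0.03 take 10 < 20 < 40 time steps: accepted.
print(satisfies_pseudo_clock([0.01, 0.03, 0.02], [10, 40, 20]))
# The longer branch takes fewer time steps: rejected.
print(satisfies_pseudo_clock([0.01, 0.02], [20, 10]))
```

Unlike the strict molecular clock, the individual rates $b_i / t_i$ are free to differ between branches as long as this ordering holds.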
After many invocations of the sequence transformer for different branches, we accept only those
outputs of the sequence transformer that adhere to the pseudo-molecular clock assumption. To
find such valid transformations, we use a greedy algorithm. The greedy algorithm starts with the
branches of the phylogenetic tree sorted by their corresponding branch lengths.
For each branch, a linked list is maintained. Every node in a linked list corresponds to one
invocation of the sequence transformer and contains the inputs and outputs for the invocation,
including the number of time steps taken for mutations on the branch. The greedy algorithm,
shown in Algorithm 2, finds a node, prev_node, in the first linked list having the smallest number
of time steps. The prev_node is inserted in a chain. The algorithm then considers the next
linked list and finds a node with the smallest time step value greater than the time step value of
prev_node. This node now becomes the prev_node and is added to the chain. The entire
procedure is then repeated for all linked lists corresponding to all the branches. During this
algorithm, a branch whose linked list does not have a node with a time step value greater than
that of prev_node may be found. Such branches are not included in the chain. We then try
to form another chain which may contain the remaining branches. We repeat this procedure
so that a node of every branch is included in some chain. At the end of this algorithm, we
obtain different chains, each having a different length. A chain containing all the branches in
the phylogenetic tree is called a complete chain. Note that the greedy algorithm is bound to find
a complete chain in the current snapshot of time-steps for branches if one exists.

Time steps in the linked-list nodes (first four nodes of each list):
Branch 1    11  13  10  14
Branch 2    20  25  16  21
Branch 3    25  20  28  21
Branch 4    18  19  17  18
Branch 5    42  40  41  45
Branch 6    36  35  35  36
Branch 7    41  48  51  49
Chains[0] = 10, 16, 20, 40, 41 (Longest chain); Chains[1] = 17, 35
Figure 3.5: Illustration of the Greedy Algorithm
The working of the greedy algorithm is illustrated in Figure 3.5. The figure shows linked
lists for 7 branches. The numbers in the nodes of the linked lists represent the time steps. The
cells which contain the numbers either in bold font or italic font are part of a chain. The figure
also shows the 2 chains that are produced from the greedy algorithm on the example. The
numbers in bold font indicate one chain and the numbers in italics indicate the other. We can
see that there is no complete chain in the current snapshot. The chain marked with bold font is
the current longest chain.
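The chain formation described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 2: plain Python lists stand in for the linked lists, the function name is hypothetical, and the input data are a partial transcription of Figure 3.5 (four time-step values per branch), which is an assumption.

```python
def greedy_chains(time_steps):
    """Form chains of strictly increasing time steps.

    time_steps[k] holds the time steps recorded for branch k, with branches
    already sorted by branch length. Returns a list of chains, each a list of
    (branch_index, time_step) pairs.
    """
    remaining = list(range(len(time_steps)))
    chains = []
    while remaining:
        chain, left_over, prev = [], [], None
        for branch in remaining:
            candidates = [t for t in time_steps[branch]
                          if prev is None or t > prev]
            if candidates:
                prev = min(candidates)  # smallest value above prev_node
                chain.append((branch, prev))
            else:
                left_over.append(branch)  # retried in a later chain
        if not chain:  # only possible if the remaining lists are empty
            break
        chains.append(chain)
        remaining = left_over
    return chains


lists = [
    [11, 13, 10, 14], [20, 25, 16, 21], [25, 20, 28, 21], [18, 19, 17, 18],
    [42, 40, 41, 45], [36, 35, 35, 36], [41, 48, 51, 49],
]
for chain in greedy_chains(lists):
    print([t for _, t in chain])
```

On this input the sketch yields the two chains 10, 16, 20, 40, 41 and 17, 35, so no complete chain exists in the snapshot.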
Having described the various mechanisms, assumptions and algorithms, in the next section,
we see how they fit into the overall design of the entire solution.
3.4 Design
The number of input and output parameters of the sequence transformer leading to the formation
of complete chains can be very large, making the problem suitable for grid computing. In order
to explore the vast space of parameters and to converge on the most likely values for these
parameters, we use the distributed resources of a computational grid. This method is similar to
the ensemble method used for climate prediction in ClimatePrediction.Net[8].
Figure 3.6: The Master-Worker Design
The following subsections describe the architecture of the entire design.
3.4.1 Master-Worker Paradigm
The master-worker paradigm is used for invocations of the sequence transformer, formation of
complete chains, and insertion of parameters corresponding to the complete chains into a
database. The overall design of our master-worker infrastructure is illustrated in Figure 3.6.
The master is responsible for assigning branches to the workers and collecting results from
them when they complete their calculations. The master assigns branches to the workers in
round-robin fashion. The master, after assigning a branch to a worker, does not wait for the
worker to complete its calculations, but proceeds to the next worker. Thus there is parallelism
in the calculations by the workers. When the first worker completes its calculations, the master
is notified of the completion and the worker sends the results back to the master. The master
stores the results from the workers in a database. We use a PostgreSQL[21] database for data
insertion and querying. Additionally, the master periodically invokes the greedy algorithm for
chain formation with the available snapshot of linked list values for various branches. If a
complete chain could be formed from the current set of linked list values, the master forks
another process to insert the parameter values corresponding to the complete chain into the
database and to possibly form additional chains from the same set of values. Note that the
greedy algorithm gives us only one chain from the available set of parameter values while there
may be many complete chains in a given snapshot of linked list values. To obtain additional
chains, the forked process deletes an element from the original complete chain and invokes
the greedy algorithm to find another complete chain. This procedure is repeated by deleting
different sets of elements from the original complete chain. During this process, whenever a
complete chain is formed, it is inserted into the database.
A worker takes a branch from the master, fills the unknown base-pairs randomly and invokes
the sequence transformer. When the transformation is complete, it sends the results back to the
master. The results consist of the number of time steps required for transformations, the rules
used during the transformations and the assignment of base-pairs to the unknown portions of
the ancestor sequence. The worker then waits for a new branch from the master. The number
of worker processes is not related to the number of branches. Hence our framework can make
use of any number of available grid resources for the execution of worker processes.
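The interaction between master and workers can be sketched on a single machine. In this illustration a thread pool stands in for the grid workers, and the `worker` function returns a random number of time steps instead of running the sequence transformer; both are assumptions made only to keep the example self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def worker(branch, seed):
    """Placeholder for a worker: pretends to transform one branch."""
    rng = random.Random(seed)
    return {"branch": branch, "time_steps": rng.randint(10, 60)}


def master(branches, rounds, num_workers=3):
    """Hand out branches round-robin without waiting, then collect results."""
    results = {b: [] for b in branches}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # submit() returns immediately, so the master never blocks on a worker
        futures = [pool.submit(worker, b, r * len(branches) + i)
                   for r in range(rounds)
                   for i, b in enumerate(branches)]
        for future in as_completed(futures):
            out = future.result()
            results[out["branch"]].append(out["time_steps"])
    return results


snapshot = master(["b1", "b2", "b3"], rounds=4)
print({b: len(v) for b, v in snapshot.items()})
```

The number of workers is independent of the number of branches, as in the design described above; `num_workers` can be changed without touching the rest of the sketch.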
Depending upon the state of the master process, two phases can be identified. We look at
these phases in the next subsection.
3.4.2 Phases of Execution
In phase I, shown in Figure 3.7, the master continuously gives new branches to the worker
processes and collects the results from them. The master initially considers all branches for allocation to workers for invocation of the sequence transformer on the branch. Once the number of
branches in the longest chain exceeds 60% of the total number of branches, the master considers
only those branches that are not in the longest chain for allocation to workers. Thus, more resources are utilized for the difficult branches for complete chain formation. To avoid a potentially large
difference between the numbers of invocations of the sequence transformer for any two branches,
we fix a threshold, allocation_threshold, for the maximum difference between the maximum and
minimum invocations for any two branches in the phylogenetic tree. If this difference exceeds
the threshold, the branch with the minimum number of invocations is assigned to the workers till
the difference becomes less than the threshold. During phase I, the master also invokes the
chain formation algorithm periodically.
Figure 3.7: Phase I in Master (flowchart: give branches to workers; accept results from workers; calculate the length of the longest chain; if a complete chain is found, proceed to Phase II, otherwise repeat)
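The 60% rule for choosing which branches to hand out can be sketched as follows. The function name and the `fraction` parameter are hypothetical; only the 60% figure comes from the text.

```python
def candidate_branches(all_branches, longest_chain, fraction=0.6):
    """Branches the master considers for allocation in phase I.

    Once the longest chain covers more than `fraction` of all branches, only
    the branches missing from it are considered; otherwise all of them are.
    """
    in_chain = set(longest_chain)
    if len(in_chain) > fraction * len(all_branches):
        return [b for b in all_branches if b not in in_chain]
    return list(all_branches)


branches = [1, 2, 3, 4, 5, 6, 7]
# Five of seven branches (about 71%) are in the longest chain.
print(candidate_branches(branches, [1, 2, 3, 5, 7]))  # [4, 6]
# Four of seven branches (about 57%) are not enough to narrow the set.
print(candidate_branches(branches, [1, 2, 3, 5]))
```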
Once a complete chain is formed, the master initiates phase II. Phase II, shown in Figure
3.8, involves insertion of the complete chains into the database and formation of additional
complete chains from the same data. Note that, for phase II to start, phase I must complete. But,
once phase II has been started, the next round of phase I can be started immediately ensuring
pipelined parallelism between phase I and phase II in the master.
3.5 Grid Computing Techniques
As discussed in the introduction, the state of grid resources may be highly dynamic. We use
load balancing and fault tolerance techniques to adapt to the resource dynamics in grids.
Figure 3.8: Phase II in Master (flowchart: insert values into the database; delete element(s) from the chain; try to find a new chain; if a complete chain is found, repeat, otherwise quit)
3.5.1 Load Balancing
During phase I, when the number of branches in the longest chain is less than 60% of all the
branches, all branches are allocated to the workers for calculations. There is a possibility that a
particular branch is always assigned to a slow worker, hampering the progress for the branch. A
threshold, loadbalance_threshold, is fixed for the maximum allowed difference between the numbers
of invocations of the sequence transformer for any two branches. If the difference exceeds the
threshold, the branch with the minimum number of invocations is allocated to two workers during
each iteration till the difference becomes less than the threshold. This technique allows uniform
progress for all branches irrespective of the different loads on the grid resources where the
workers are executing.
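The load-balancing rule can be sketched as a function that decides which branches are handed out in one iteration. The function name and the dictionary of invocation counts are assumptions made for illustration; the rule itself (the lagging branch goes to two workers while the gap exceeds the threshold) follows the description above.

```python
def assignments_for_iteration(invocations, threshold):
    """Branches to hand out in one iteration of phase I.

    invocations maps each branch to the number of completed transformer runs.
    Every branch is assigned once; if the gap between the most and least
    invoked branches exceeds the threshold, the lagging branch is assigned
    to a second worker as well.
    """
    branches = list(invocations)
    least = min(branches, key=invocations.get)
    most = max(branches, key=invocations.get)
    assignments = list(branches)
    if invocations[most] - invocations[least] > threshold:
        assignments.append(least)
    return assignments


counts = {"b1": 10, "b2": 3, "b3": 9}
print(assignments_for_iteration(counts, threshold=5))   # b2 appears twice
print(assignments_for_iteration(counts, threshold=10))  # every branch once
```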
Theoretically, any number of phase I and phase II processes can be executing at any given
point. However, both phase I and phase II processes consume memory on the machine where
the master is executing. Hence, phase I and phase II processes must be started after verifying
whether the required amount of memory is available at the master machine. We follow a simplistic approach where only one phase I and one phase II process may be active at any given
point of time. This controls the load on the master machine allowing efficient execution of the
master.
3.5.2 Fault Tolerance
For each worker, the master forks a process that sends the parameters to and receives the results
from the worker. The forked process maintains connection with the worker throughout the
calculation of the branch. After the results are sent back by the worker, the forked child process
notifies the parent master of the arrival of the results. Hence, even if the worker fails during
its execution, only the forked process gets killed and the master will be able to continue its
execution.
Also, the master forks a new process for phase II operations. The forked phase II process
does not need to communicate with the main master process. This achieves not only parallelism
but also fault tolerance since even if the phase II process fails, it doesn’t affect the execution of
the main master process. The master process will eventually fork off another phase II process.
Finally, even if the master fails after some time, the results obtained so far are still accessible
since they are inserted into the database as and when the complete chains are formed. The user
can still collect the statistics for the results from the database irrespective of whether the master
is alive. Due to the durability property of the databases, we can continue to build upon the
previous successful transformations even in the case of the failure of master process.
The master collects the results from the workers and stores them in the database. In the next
section, we look at the design of this database.
3.6 Database Design
There are five different tables that are used to maintain the details of the different complete
chains. These tables are branches, chains, ruletable, strands and working strand. They are
illustrated in Tables 3.1 to 3.5.
Table strands stores the information about the original strands. All the strands are given a
specific id which is later used for referencing that particular strand. The field name stores the
name of that particular strand. This is the same name as generated by the Phylip package. The
field seq stores the actual sequence, whose length is stored in the field length.

Table 3.1: strands
id        name      length    seq
integer   char[]    integer   char[]

Table 3.2: working strand
id        no uncert   pos uncert   uncert    seq id
integer   integer     integer[]    char[]    integer
The original strand has many positions where the exact base pair is not known. The worker
process fills these positions and then performs the sequence transformation. Thus, each
invocation of the sequence transformer forms a different sequence corresponding to the different
bases filled in at the unknown positions. These sequences are stored in the table working strand.
id is the primary key of the table. no uncert stores the number of positions where the exact base
pair is not known. pos uncert[] is an integer array that stores the positions of these base pairs.
uncert[] is the character array that stores the bases filled in by the workers. seq id is
the foreign key to the table strands, referring to the strand in which these bases are filled in.
The table branches stores the information about the branches in the phylogenetic tree. Each
instance or record corresponds to one invocation of the sequence transformer that was included in
some complete chain. The id field is the primary key of the table. from id and to id are the
foreign keys referring to the id field of the table working strand. from id corresponds to the id
in the table working strand of the ancestor sequence. to id corresponds to the id in the table
working strand of the progeny sequence. length is the branch length generated by the Phylip
package. no ts is the number of time-steps required for the complete transformation. swpt is
the switching parameter used. Recall that a particular rule is applied to the strand until the
similarity value stops changing for some fixed number of time-steps. If the similarity value does
not change for swpt time-steps, the rule to be applied is changed. seq no is the foreign key to
the field id of ruletable, which stores the information about the rules used in the transformation.
Table 3.3: branches
id        from id   to id     length   no ts     swpt      seq no
integer   integer   integer   real     integer   integer   integer

Table 3.4: ruletable
id        no rc     no ts rc    rules
integer   integer   integer[]   char[][]
In the table ruletable, id is the primary key. no rc is an integer storing the number of rules
used in the transformation. no ts rc[] is the integer array storing the time-steps at which the
rules were changed. rules[][] is the two-dimensional character array which stores the actual
rules. Since each rule has 64 transitions for a neighborhood size of 1, there are 64
characters in each rule, corresponding to the right hand sides of the transitions. The position
of each character defines the three left hand side bases of a transition.
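The positional encoding of a rule can be illustrated with a short Python sketch. The ordering of the 64 left hand sides is an assumption here: we assume lexicographic order over the bases A, C, G, T (AAA, AAC, ..., TTT). The actual ordering is the one given in Table 1.1, and the function name decode_rule is hypothetical.

```python
from itertools import product
from typing import Dict

BASES = "ACGT"  # assumed base ordering; the real one is fixed by Table 1.1


def decode_rule(rule: str) -> Dict[str, str]:
    """Map each three-base left hand side to its right hand side base."""
    if len(rule) != 64:
        raise ValueError("a neighborhood-1 rule must have exactly 64 characters")
    left_hand_sides = ("".join(triple) for triple in product(BASES, repeat=3))
    # Position k in the rule string holds the right hand side of the k-th triple.
    return dict(zip(left_hand_sides, rule))
```

Under this assumed ordering, the triple GTC sits at index 16*2 + 4*3 + 1 = 45, so the character at position 45 of the rule string would be the base that replaces the middle T.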
The table chains relates records of the table branches to one another. It stores the ids of all
the instances in branches that belong to a single complete chain. The field id acts as the primary
key and length specifies the number of instances of branches in that chain. The field ids[] is an
integer array which refers to all the ids of the table branches involved in that particular chain.
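As a rough illustration of how a chain refers to its branch records, the two tables can be mirrored with Python dataclasses. This is a sketch only: the underscored attribute names are our rendering of the field names above (for example from_id for from id), and the helper branches_of is hypothetical; the actual storage is in PostgreSQL tables.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Branch:
    id: int
    from_id: int   # id of the ancestor record in working strand
    to_id: int     # id of the progeny record in working strand
    length: float  # branch length generated by Phylip
    no_ts: int     # number of time-steps of the transformation
    swpt: int      # switching parameter
    seq_no: int    # id of the ruletable record


@dataclass
class Chain:
    id: int
    length: int     # number of branch instances in the chain
    ids: List[int]  # ids of the branch instances


def branches_of(chain: Chain, branches: Dict[int, Branch]) -> List[Branch]:
    """Resolve the ids[] array of a chain into branch records."""
    if chain.length != len(chain.ids):
        raise ValueError("length must equal the number of ids")
    return [branches[branch_id] for branch_id in chain.ids]
```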
Thus, the above database makes it easy and efficient to collect various statistics generated
through the sequence transformation of all the branches. We discuss the statistic collection
programs in the next section.
3.7 Statistics Collection
We have developed various statistic collection programs that retrieve values from the database
and compute various kinds of statistics. There can be different complete chains with different
parameters satisfying the pseudo molecular clock assumption shown in Equation 3.3. The statistic
collection programs extract collective statistics from all the complete chains. These programs
can be executed offline by interested users at any point of time. The statistics that are of
interest are discussed in the following subsections.
Table 3.5: chains
id        length    ids
integer   integer   integer[]
3.7.1 Timesteps
This statistic is the number of time steps required for a transformation. This information can be
used to obtain more accurate measures of the rates of evolution. Different invocations of the
sequence transformer on the same branch by different worker processes may give different numbers
of time-steps. Further, some of these instances may not be stored as they may have violated the
pseudo molecular clock assumption. The number of time-steps of each stored instance is kept in the
database. We obtain the average and standard deviation of the time steps for a transformation
corresponding to a particular branch across different complete chains.
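A minimal sketch of this statistic in Python follows. It assumes the stored instances have already been fetched as (branch, no ts) pairs and uses the population standard deviation; the text does not say whether the population or the sample formula was used, so that choice, like the function name, is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def timestep_stats(records: Iterable[Tuple[int, int]]) -> Dict[int, Tuple[float, float]]:
    """Average and standard deviation of time-steps per branch."""
    per_branch = defaultdict(list)
    for branch, no_ts in records:
        per_branch[branch].append(no_ts)
    # One (mean, standard deviation) pair per branch over all stored instances.
    return {b: (mean(v), pstdev(v)) for b, v in per_branch.items()}
```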
3.7.2 Unknown base-pairs
The probabilities of base-pair assignments to the unknown segments of the ancestor DNA
sequences are calculated. Base-pair assignments to unknown segments are stored for each run
of the sequence transformer. To calculate the probability of a particular base at a certain
position, we count all the instances where that base was assigned to that position and divide
this count by the total number of instances. These probabilities may help in re-building the
complete intermediate sequences of the phylogenetic tree. For each position in the sequence at
which a base is not known, we obtain a probability for each of the four possible bases.
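The count-and-divide step can be sketched as follows for a single unknown position. The input format (a list with one filled-in base per stored run) and the function name are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def base_probabilities(assignments: Sequence[str]) -> Dict[str, float]:
    """Probability of each base at one unknown position over all runs."""
    total = len(assignments)
    if total == 0:
        raise ValueError("no assignments recorded for this position")
    counts = Counter(assignments)
    # Bases never assigned get probability 0.0.
    return {base: counts[base] / total for base in "ACGT"}
```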
3.7.3 Rules
This statistic covers the various rules used during transformation. These rules may give insights
into the impact of neighboring base-pairs on the evolution of DNA sequences. The statistic
collection programs try to collect popular rules within a particular branch across all the
complete chains and for a single complete chain across all branches. To calculate the popularity
of a rule in a branch across all the complete chains, we count the number of times the rule is
used for the transformation of the branch. This count is then divided by the total number of
rules used. Similarly, popular branches are also determined across the complete tree.
The rules that were used more often than others are of more interest as these rules may guide
in finding the reasons behind certain types of mutations.
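The popularity calculation for one branch can be sketched as below. The input is assumed to be the list of rule strings applied to the branch over all complete chains. The expected popularity is taken to be one over the number of distinct rules, which corresponds to all rules being applied equally often; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def rule_popularity(rules_used: Sequence[str]) -> Tuple[Dict[str, float], float]:
    """Popularity of each rule on a branch and the uniform expectation."""
    counts = Counter(rules_used)
    total = sum(counts.values())
    popularity = {rule: c / total for rule, c in counts.items()}
    # If every distinct rule were applied equally often, each would get 1/k.
    expected = 1.0 / len(counts)
    return popularity, expected
```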
3.7.4 Differential rule analysis
This analysis gives the probability of a particular rule being used at a given time step of
transformation in a given branch. For a given branch, at a given time-step, all the rules used
across all complete chains are collected. The number of times those rules were used is also
noted. Then, the popularity of a given rule at that time-step is calculated by dividing the count
of the rule (the number of times the rule was used at that time-step in all the complete chains)
by the sum of all counts. This is similar to the analysis performed for the calculation of
popular rules. The difference is that the calculation is now conducted at each time-step. This
analysis may give finer insights at each time step of mutation.
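A sketch of the per-time-step calculation is given below. It assumes each stored instance of the branch is available as a triple (change steps, rules, number of time-steps), mirroring no ts rc[], rules[][] and no ts, and that the k-th rule is active from its change step until the next change. These representation details and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Instance = Tuple[List[int], List[str], int]


def rule_at(change_steps: Sequence[int], rules: Sequence[str], t: int) -> Optional[str]:
    """Rule active at time-step t, given the steps at which rules changed."""
    active = None
    for step, rule in zip(change_steps, rules):
        if step > t:
            break
        active = rule
    return active


def rule_probabilities_at(instances: Sequence[Instance], t: int) -> Dict[str, float]:
    """Probability of each rule at time-step t over all stored instances."""
    counts = Counter()
    for change_steps, rules, no_ts in instances:
        if t < no_ts:  # the instance was still transforming at step t
            rule = rule_at(change_steps, rules, t)
            if rule is not None:
                counts[rule] += 1
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()} if total else {}
```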
3.7.5 Popularity of transitions
This statistic is the number of times a particular transition is used for a given branch. It may
be useful in determining the exact effects of neighboring base-pairs on mutations. There are 64
possible left hand sides for a transition. For each of these 64 left hand sides, there are 4
possible right hand sides, giving 256 possible transitions. Hence, the average popularity of a
transition is 1/256. To calculate popular transitions, we count the number of times a particular
transition is used in a particular branch. We divide this number by the total number of
transitions used in all the time-steps of the transformations of that branch. A similar statistic
can be calculated for the entire branch by aggregating over the time-steps.
The statistic collection programs help in finding the popularity of a transition within a
particular complete chain across all the branches and for a particular branch across all the
complete chains.
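The counting can be sketched as follows. Two assumptions are made that the text does not confirm: a transition counts as used whenever it is part of a rule that was applied, and the 64 left hand sides are ordered lexicographically over A, C, G, T (the real ordering is the one in Table 1.1). The function name is hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Sequence, Tuple

# Assumed ordering of the 64 left hand sides: AAA, AAC, ..., TTT.
LEFT_HAND_SIDES = ["".join(t) for t in product("ACGT", repeat=3)]


def transition_popularity(rules_used: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Fraction of all counted transitions taken by each (lhs, rhs) pair."""
    counts = Counter()
    for rule in rules_used:
        # Each 64-character rule contributes one transition per left hand side.
        for lhs, rhs in zip(LEFT_HAND_SIDES, rule):
            counts[(lhs, rhs)] += 1
    total = sum(counts.values())
    return {transition: c / total for transition, c in counts.items()}
```

A transition whose popularity is well above 1/256 would then stand out against the uniform average mentioned above.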
Algorithm 1 Algorithm for Sequence Transformer
 1: Read the ancestor and progeny sequences of the branch
 2: Fill the unknown base-pairs in ancestor sequence randomly and record in unknwn[]
 3: current sequence ⇐ ancestor sequence; sametolv ⇐ 0
 4: prev value ⇐ similarity value between the ancestor and progeny sequences
 5: timestep ⇐ 0; rule ⇐ random Cellular Automata Rule
 6: for i ⇐ 0 to maximum number of time steps do
 7:     Try to form a drule which transforms the current sequence into progeny sequence
 8:     if the drule could be formed then
 9:         rules arr[timestep] ⇐ drule; rule change[timestep] ⇐ i; timestep ⇐ timestep + 1
10:         break
11:     end if
12:     Apply the rule to current sequence
13:     if i == 0 then
14:         rules arr[timestep] ⇐ rule; rule change[timestep] ⇐ i
15:     end if
16:     Calculate similarity value
17:     if similarity value == 1.0 then
18:         break
19:     end if
20:     if similarity value > prev value then
21:         continue
22:     else
23:         sametolv ⇐ sametolv + 1
24:         if sametolv > sametol then
25:             rule ⇐ random Cellular Automata Rule; rules arr[timestep] ⇐ rule
26:             rule change[timestep] ⇐ i; timestep ⇐ timestep + 1; sametolv ⇐ 0
27:         end if
28:     end if
29: end for
30: time steps ⇐ i
Algorithm 2 Greedy Algorithm for Chain Formation
Require: Array of linked lists L[] corresponding to each branch. N, the number of branches
Ensure: Longest chain
{Chains[]: Array of linked lists with each element holding a chain}
{inChain[]: an array of N elements where inChain[i] indicates if branch i belongs to a chain}
 1: chainCount ⇐ 0
 2: for i ⇐ 0 to N − 1 do
 3:     inChain[i] ⇐ 0
 4: end for
 5: for i ⇐ 0 to N − 1 do
 6:     if inChain[i] == 1 then
 7:         continue
 8:     end if
 9:     prev node ⇐ the node in L[i] such that node.length is smallest
10:     Insert prev node in Chains[chainCount]
11:     inChain[i] ⇐ 1
12:     tmp node ⇐ NULL
13:     for j ⇐ i + 1 to N − 1 do
14:         tmp node ⇐ the node in L[j] such that node.length > prev node.length and node.length is smallest
15:         if tmp node ≠ NULL then
16:             prev node ⇐ tmp node
17:             Insert prev node in Chains[chainCount]
18:             inChain[j] ⇐ 1
19:         end if
20:     end for
21:     chainCount ⇐ chainCount + 1
22: end for
23: Longest chain ⇐ Chains[i] such that Chains[i].length is highest in the array of linked lists Chains[]
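Algorithm 2 can be rendered in Python roughly as follows. This is a sketch under simplifying assumptions: each linked list L[i] is represented as a plain list of node.length values, a chain is a list of (branch index, length) pairs, and a branch with no stored nodes is skipped, a case the pseudocode does not address.

```python
from typing import List, Sequence, Tuple

Chain = List[Tuple[int, float]]


def longest_greedy_chain(L: Sequence[Sequence[float]]) -> Chain:
    """Greedy chain formation over per-branch lists of node lengths."""
    n = len(L)
    in_chain = [False] * n
    chains: List[Chain] = []
    for i in range(n):
        if in_chain[i] or not L[i]:
            continue
        prev = min(L[i])  # start with the smallest length of branch i
        chain: Chain = [(i, prev)]
        in_chain[i] = True
        for j in range(i + 1, n):
            # Smallest length in branch j that is strictly larger than prev.
            candidates = [x for x in L[j] if x > prev]
            if candidates:
                prev = min(candidates)
                chain.append((j, prev))
                in_chain[j] = True
        chains.append(chain)
    # The longest chain is the one that covers the most branches.
    return max(chains, key=len) if chains else []
```

As in the pseudocode, a branch that already belongs to an earlier chain can still be added to a later one; only the starting branch of a chain is required to be unused.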
Chapter 4
Experiments and Results
In this chapter, we present some promising results obtained for 3 HIV sequence types, namely,
gag, gagpol and env sequences. The results were obtained by executing the statistics collection
programs on the PostgreSQL databases formed for the sequences. Since the grid infrastructure
that we had developed can execute the transformations for long periods of time and produce
better statistics, the results presented in this chapter should be treated as representing some
good samples and demonstrating the potential of the infrastructure.
4.1 Grid Infrastructure
The sequences were downloaded from the HIV Sequence database at Los Alamos[27] and were
aligned by ClustalW web interface[9]. For each of the 3 types of aligned sequences, a phylogenetic tree was obtained using the Phylip[20] package. We utilized a grid infrastructure
consisting of 23 machines distributed in 3 countries for our experiments. These machines operated in non-dedicated modes and were used by local users for different purposes. Table 4.1
describes the experiment infrastructure. The worker processes were executed on all the machines. The master process and the PostgreSQL database, where information pertaining to the
complete chains of a sequence type are stored by the master, were started on one of the AMD
machines in Indian Institute of Science, India. The results in this chapter correspond to
statistics obtained with 7332, 4759 and 3251 complete chains collected in the PostgreSQL databases
for gag, gagpol and env sequences, respectively, after 13, 4 and 3 days, respectively, from the
start of the corresponding experiments on the distributed infrastructure.

Table 4.1: The Distributed Infrastructure
Location                                          Number of machines   Specifications
Torc cluster, University of Tennessee (UT), USA   8                    GNU/Linux 2.6.8, Dual PIII 933 MHz, 512 MB RAM, 40 GB Hard Drive, 100 Mbps Ethernet
DAS-2, Vrije Universiteit, Netherlands            9                    GNU/Linux 2.4.21, Dual PIII 996 MHz, 1 GB RAM, 20 GB Hard Drive, 100 Mbps Fast Ethernet
AMD cluster, Indian Institute of Science, India   6                    AMD Opteron 246 based 2.21 GHz servers, Fedora Core 4.0, 1 GB RAM, 160 GB Hard Drive, Gigabit Ethernet
4.2 Timesteps
Table 4.2 shows the averages and standard deviations of the number of time steps taken for
mutations from the ancestor to progeny sequences of some branches of the phylogenetic tree
for the gag sequences. The averages and standard deviations were obtained over the 7332
complete chains formed for the gag sequences. The low standard deviation values for a large
number of branches indicate high convergence of the time step values for the different branches.
Tables 4.3 and 4.4 show similar results for time steps for gagpol and env sequences, respectively.
Table 4.2: Summary of time step information for Gag sequences
Branch Number (Ancestor-Progeny)   Average Number of Time steps   Standard Deviation
0(8-9)                             15.657665                      1.961112
4(19-14)                           31.242908                      2.085160
5(43-44)                           32.274139                      2.096791
6(20-21)                           34.051281                      1.852813
7(13-24)                           35.154255                      1.833268
8(24-36)                           36.161758                      1.827739
9(5-4)                             37.184532                      1.814149
10(29-30)                          38.198307                      1.816491
11(15-16)                          39.253136                      1.898609
12(1-45)                           40.259411                      1.904292
13(12-13)                          41.311649                      1.974289
14(14-18)                          42.380253                      1.991383
15(25-26)                          43.405754                      1.986514
16(22-8)                           44.409302                      1.990247
17(22-15)                          45.484451                      2.040152

Table 4.3: Summary of time step information for GagPol sequences
Branch Number (Ancestor-Progeny)   Average Number of Time steps   Standard Deviation
0(3-4)                             16.003782                      2.499934
1(6-10)                            21.138685                      1.995396
2(14-6)                            23.404707                      1.913676
3(10-15)                           25.672621                      1.507439
4(4-20)                            30.176718                      3.335719
5(14-17)                           32.025635                      3.510833

Table 4.4: Summary of time step information for env sequences
Branch Number (Ancestor-Progeny)   Average Number of Time steps   Standard Deviation
0(10-15)                           21.201477                      2.505688
1(8-10)                            25.257153                      2.474922
2(19-22)                           31.203014                      2.654692
3(4-18)                            33.275299                      2.591264
4(6-7)                             35.227314                      2.508755
5(16-12)                           36.850201                      2.525077
6(14-13)                           38.647186                      2.503064
7(16-17)                           40.231621                      2.366600
8(12-15)                           42.673332                      2.888016
9(10-9)                            44.145802                      2.879150
10(8-6)                            46.235619                      3.420729

4.3 Popular Rules
Table 4.5 shows the most likely cellular automata rules for neighborhood-dependent mutations
that could have been followed at certain discrete time steps of mutations on some branches of
the phylogenetic tree formed from gag sequences. The 64 characters in column 2 of the table
represent the 64 right-hand sides of the transitions shown in Table 1.1. The last column of Table
4.5 shows the probability of application of the rule at the specified time step(s) and is obtained
by dividing the number of times the rule was applied by the total number of applications
of rules for the time step(s). Determining the most likely rules for neighborhood-dependent
mutations is the primary objective of our work, and the high probability values obtained over
about 7000 samples, shown in Table 4.5, indicate the potential benefits of our methodologies.
Similar results for rules at discrete time steps were obtained for gagpol and env sequences.
Tables 4.6 and 4.7 show results for differential rule analysis for gagpol and env sequences,
respectively.
Table 4.5: Differential Rule Analysis for Gag Sequences
Branch Number (Ancestor-Progeny)   Rule Used                                                            Time-Steps   Probability
11(15-16)                          CAGGCAAACGCCTGTTACATT AATTTCGTGCCGTTAAAAGTG TGCGGCTGATGCGCAAATGCCC   42-43        0.951841
12(1-45)                           GGGTCAATTTTGGCATGACCA TATTGTCCTCAGATACTAAGT TCATAGAGAAGAACGATTACTT   43-44        0.951841
29(26-27)                          CCGAAGTGATTGAAGCGCTTG TTTCTGGCGATTTTTGTGGTC CACTCACCTTATTCGCAAAATA   59-61        0.981132
32(16-02 AG.NG.x)                  TCGCAGCGACCGCATAAACAT ACCGGCTGGCGATAAGCTGGA CTCAACACGAGTGCCAAATCTT   65-66        0.929577
83(6-21)                           ATATGTGCGGTTACTATCGGT CTCCGGAGGTGCACTTACCGC CGGATGGCACTAGAGAACTATA   198-199      0.943114

Table 4.6: Differential Rule Analysis for GagPol Sequences
Branch Number (Ancestor-Progeny)   Rule Used                                                            Time-Steps   Probability
0(3-4)                             GTGTCCGATAAAGATTTAAAT GTGCAAACATTTCTCTCCGGC TAGTACAAGTTCGACGCTTAGG   21-22        0.901235
11(9-8)                            CAGATTATTCCAAGTACAGAA ACCCTCGACAGGGGCGTCCAG CCCCGGGCGGCATGTCCAGGGC   51-53        0.946429
19(18-19)                          AGGGGCGTGAATGGACGAGTG ATGCCGCTCCATAGGGCGAGG AACAGTCATTATGCAAGAGAAG   69-71        0.962963
45(4-AF382828)                     AGGTGAGAGCGGTAGCTTCAT TCCGCTGAGTAACATCGAATT GTTACTATTTGCCAGAGTTCTA   88-88        0.821429
25(21-L20587)                      TTTCAGTAATACTCCTCTGTA ATTTGGGAGAACACGTCGTAT ATACTCCGCGGACGTAATGCTA   85-85        0.763780

Table 4.7: Differential Rule Analysis for env Sequences
Branch Number (Ancestor-Progeny)   Rule Used                                                            Time-Steps   Probability
43(23-U42720)                      GCCAGAGGGATATGTCTGGAC AGTGTGACCGAGACCATTGTT TTGGACAGCGTAGGACGGGCGC   158-161      0.919271
45(22-20)                          AAAACCGACGGCCAGCTACCT ACTGTACTCCTGGATAGTCCG TTGCAGCGTCGCCAACGAGATA   170-173      0.921960
45(22-20)                          AAAACCGACGGCCAGCTACCT ACTGTACTCCTGGATAGTCCG TTGCAGCGTCGCCAACGAGATA   174-174      0.921960
6(14-13)                           CGCACGCTAGGATACATCGCA TTGCTGGACACAGCGATGACA ACAGCTTAATAAGCACAAGAGT   43-43        0.864583
33(20-L20587)                      TAATGATATCTGGGTGAACTC GTAAACCACGTATGCGAA TAAAACTCCACTACACTCCC TCTAG-  108-109      0.819672
41(21-X52154)                      TACAGCCCCTGCCAAATAACT CGCTTTGCCACCGGATAGCAA GCCTCAACGTTAAGTAACCCTT   138-147      0.844498

Table 4.8 shows some of the most and least popular rules that were applied for some
branches of the phylogenetic tree obtained from the gag sequences. Column 3 of the table
shows the popularity of a rule for a branch, calculated by dividing the number of times
the rule was applied at different time steps for the branch over all the complete
chains by the total number of applications of rules for the branch. Column 4 shows the expected
popularity, assuming that all rules were applied an equal number of times on the branch.
The first 5 entries of the table show that certain rules are more popular, with roughly 2 to 3
times the expected popularity. The last entry shows that the corresponding rule
is less likely to have been used for the branch, with a popularity about 40 times lower than the
expected popularity. Similar popular rules for the gagpol and env sequences are shown in
Tables 4.9 and 4.10, respectively.

Table 4.8: Popular Rules for a Branch for Gag Sequences
Branch Number (Ancestor-Progeny)   Rule                                                                 Popularity   Expected Popularity
0(8-9)                             CCTTAGGGCGGTGAGCTGAGGA CAGGTAACTGTGCAAAGGGAC ACACCTGTCGCGATCCATAGG   0.012445     0.005555
0(8-9)                             CCGAGGTCGGATAAGACATAGG AAGTTGTACCCTCTGAACTAA TCGTTGGTCATCGGCAGCGTT   0.010139     0.005555
0(8-9)                             CGAGACAGAACTGCTCCGAAAG GCAGAGTGGAAGGCTTATTCT CGGTAAACCTTTAGTGGCATG   0.012784     0.005555
1(18-15)                           GTATAACAAGTCCTTCTGTGTC TCCTCCAGAGGACGATTCCGT GCGTACGGAACGCCTTCGTAA   0.010041     0.004291
2(10-37)                           GTCAGGTATACATCTGGTCTCG AATGGTAGCTCGATCAACCCC ATCGCAACCTGGGACGCATCC   0.013078     0.004608
2(10-37)                           CGCCGAAAACCTCTTCTTTAAC ACGCGTTGCCCGTTTTATCCT GCTTATTCTAGCTTTGTGACT   0.000105     0.004608
Table 4.9: Popular Rules for a Branch for GagPol Sequences
Branch Number (Ancestor-Progeny)   Rule                                                                 Popularity   Expected Popularity
0(3-4)                             AGGGCTGGGTGGGTACAACTAT GTAACTCTGCGATAAGGGCCCA CGCCTTATGTTTTAGAAAGT   0.016373     0.006289
0(3-4)                             CTGAGAACACGTATGTGTCGGT CTCATTTCAACTCTTCCATATT TCTCCATTACTCCTCCAGGC   0.017153     0.006289
0(3-4)                             CTATGTCTTGCAGTTCTACACC CTCTTCGTTTGTGATAATGGTC CTGCAAGCCACTTCCCCAAT   0.018941     0.006289
1(6-10)                            ACCGCGAAAGGTGCGCTGTGGT ACACGTCGAGAGAGCCTCACAT TTATCAGATGCATTACGAAT   0.016120     0.006289
1(6-10)                            CGACCCATAGACTGAGGCCCAC ACACAGGCCATGCCAGAAACCT TCGCTTGACGCCAGTTTCTT   0.016158     0.006289
1(6-10)                            GGCCTTATCCTCGTTATCGCGG AGACGCTGCCAGTAGTTACTCT GCACGGTCATTCAGTAACGC   0.016538     0.006289

Table 4.10: Popular Rules for a Branch for env Sequences
Branch Number (Ancestor-Progeny)   Rule                                                                 Popularity   Expected Popularity
0(10-16)                           CCCCTGTGCATAGGGCGAGCGG ATACCAATTGACAAATTGAGGG ACTTGCCAAGAAAGAAGAGG   0.026367     0.009345
1(8-10)                            AGAACGACAATAATTCCCTCTT GGATCCTGTACAATGCATCATG ATTCAGTAGCTGAATATTAC   0.022690     0.008474
1(8-10)                            CAGAGCGCTAAGTCATCTGGTA CTGGTTACTGGGTAACCTCGGC TGAAGGTGTACCTACTGACA   0.020014     0.008474
5(16-12)                           ATATGCGACCCAGCATCCGACA GACCCTTGTACTATAGACTCGC GGCAATAGTTGCCTCACCAT   0.020333     0.005780
7(16-17)                           GCTGTGATTCAAACGAGGACGG AGCGTGTAAGCTAAGCAGTAGT GCAATACCTCTAATTTAAGC   0.022057     0.006134

4.4 Base Pairs Corresponding to Unknown Positions
Table 4.11 shows the resolutions of unknown positions of the intermediate gag sequences.
Columns 3, 4, 5, and 6 show the probabilities that a particular unknown position in an
intermediate sequence is occupied by a particular base pair. Each probability is calculated by
dividing the number of resolutions of the position with the base-pair by the total number of
resolutions of the position. Entries 1, 3 and 6 of the table show that the corresponding
positions of the sequences are more likely to be occupied by purines (A and G) than by
pyrimidines (C and T). The other entries show that the corresponding positions of the sequences
are more likely to be occupied by pyrimidines than by purines. Tables 4.12 and 4.13 show similar
results for gagpol and env sequences, respectively.
Table 4.11: Resolution of Unknown Positions for Gag Sequences
Seq Number (Name)   Unknown Position   prob(A) (1)   prob(C) (2)   prob(G) (3)   prob(T) (4)   prob(purines) (1+3)   prob(pyrimidines) (2+4)
26 (3)              1390               0.343699      0.149891      0.405074      0.101337      0.748773              0.251228
26 (3)              375                0.173895      0.242089      0.082788      0.501227      0.256683              0.743316
4 (1)               410                0.430306      0.113339      0.305101      0.151255      0.735407              0.264594
9 (14)              411                0.126978      0.340971      0.139935      0.392117      0.266913              0.733088
10 (15)             386                0.125887      0.415985      0.141980      0.316148      0.267867              0.732133
17 (21)             1251               0.262411      0.076650      0.469040      0.191899      0.731451              0.268549
4 (1)               419                0.127250      0.289416      0.147709      0.435625      0.274959              0.725041
18 (22)             413                0.221086      0.281915      0.061648      0.435352      0.282734              0.717267
18 (22)             428                0.133933      0.399209      0.151391      0.315466      0.285324              0.714675

Table 4.12: Resolution of Unknown Positions for GagPol Sequences
Seq Number (Name)   Unknown Position   prob(A) (1)   prob(C) (2)   prob(G) (3)   prob(T) (4)   prob(purines) (1+3)   prob(pyrimidines) (2+4)
4 (13)              1182               0.145921      0.349749      0.125733      0.378597      0.271654              0.728346
10 (19)             1239               0.330819      0.177354      0.375594      0.116234      0.706413              0.293588

Table 4.13: Resolution of Unknown Positions for env Sequences
Seq Number (Name)   Unknown Position   prob(A) (1)   prob(C) (2)   prob(G) (3)   prob(T) (4)   prob(purines) (1+3)   prob(pyrimidines) (2+4)
10 (19)             1973               0.149503      0.421882      0.112165      0.316450      0.261668              0.738332
8 (17)              1077               0.343036      0.222814      0.389156      0.044993      0.732192              0.267807
12 (20)             1709               0.121063      0.299489      0.147342      0.432106      0.268405              0.731595
22 (9)              914                0.381259      0.181177      0.349785      0.087778      0.731044              0.268955
0 (1)               463                0.090723      0.421704      0.191265      0.296308      0.281988              0.718012
9 (18)              914                0.168916      0.335992      0.113804      0.381288      0.282720              0.717280
17 (4)              2302               0.371154      0.148012      0.341511      0.139323      0.712665              0.287335
6 (15)              1685               0.135072      0.365951      0.152454      0.346524      0.287526              0.712475
0 (1)               638                0.167945      0.333436      0.122021      0.376598      0.289966              0.710034
4.5 Potential of Grid Computing
Table 4.14: Usefulness of Large Number of Runs
December 23, 2006, Number of complete chains = 1347
Branch   Average Number of Time Steps   Standard Deviation
0        14.625093                      3.565527
1        20.830734                      3.153974
January 3, 2007, Number of complete chains = 7607
Branch   Average Number of Time Steps   Standard Deviation
0        15.832522                      1.890947
1        22.199816                      1.708907
In order to show that our ever-running computations can have potential long term benefits in
resolving uncertainties associated with mutations, we conducted experiments with gag sequences
on the 6 AMD machines in India. We then observed the average time steps in mutations of 2
branches of the phylogenetic tree at 2 different points in time separated by about 10 days.
Table 4.14 shows the results for the 2 branches with 1347 complete chains collected in the
PostgreSQL database on December 23, 2006 and with 7607 complete chains collected on January
3, 2007. The lower standard deviation values for the results collected on January 3 show that
the average number of time steps converges with an increasing number of executions. Thus, our
work can give more definite findings regarding mutations as time progresses.
Chapter 5
Predictions in phylogenetic trees
In Chapter 3, we described the modeling of the DNA sequence mutations in order to gain
insights into the mutation process. In this chapter, we describe our attempts to extend this
model in order to predict the future sequences. To verify the accuracy of the predictions, we try
to predict the already existing sequences of the phylogenetic tree.
5.1 Determining the Preserved Segments
Not all the base-pairs in a DNA sequence undergo mutations. In the transformation methods
described in Chapter 3, the cellular automata rules are applied only on those segments
of the current sequence which differ from the corresponding segments of the progeny sequence.
Thus, while predicting, we need to find the segments of the current sequence to which cellular
automata rules should not be applied (preserved segments). However, during prediction, we
do not have a priori knowledge of the progeny sequence. Hence, we need to find the preserved
segments using some other method. We use a Position Specific Scoring Matrix (PSSM) for this
purpose.
A Position Specific Scoring Matrix is calculated over a set of sequences. The matrix gives us
an idea about the preserved segments. It contains four columns corresponding to the four bases
and as many rows as there are bases in the strands. The entry at row i and column j is
the probability of occurrence of the base corresponding to column j at position i of a strand.
We use this probability to determine whether a base at a particular position in a strand should
be preserved.
5.1.1 Calculation of PSSM
For a specific position, i.e. a row in the PSSM, we count the occurrences of each of the bases
across all the sequences. Then, we divide each of these counts by the total number of sequences
under consideration to get the probabilities of occurrence of the four bases at that particular
position. Thus, the four columns corresponding to a position or row of the PSSM are filled. This
procedure is repeated for all the positions in order to obtain the complete PSSM. The
calculation is shown in Algorithm 3.
Algorithm 3 Calculation of Position Specific Scoring Matrix
Require: n sequences in sequences[] each having numpositions bases.
Ensure: pssm[][]
for i ⇐ 0 to numpositions − 1 do
    acount ⇐ 0; ccount ⇐ 0; gcount ⇐ 0; tcount ⇐ 0
    for j ⇐ 0 to n − 1 do
        if sequences[j].bases[i] == A then
            acount ⇐ acount + 1
        end if
        if sequences[j].bases[i] == C then
            ccount ⇐ ccount + 1
        end if
        if sequences[j].bases[i] == G then
            gcount ⇐ gcount + 1
        end if
        if sequences[j].bases[i] == T then
            tcount ⇐ tcount + 1
        end if
    end for
    pssm[i][0] ⇐ acount/n
    pssm[i][1] ⇐ ccount/n
    pssm[i][2] ⇐ gcount/n
    pssm[i][3] ⇐ tcount/n
end for
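The same calculation can be written compactly in Python. This is a sketch, assuming the sequences are equal-length strings over A, C, G and T; any other character at a position is simply not counted, which is our choice rather than something the text specifies. The counts are reset for every position so that each row holds the probabilities for that position only.

```python
from typing import List, Sequence

BASES = "ACGT"  # column order of the matrix: A, C, G, T


def compute_pssm(sequences: Sequence[str]) -> List[List[float]]:
    """Position Specific Scoring Matrix: one row of four probabilities per position."""
    n = len(sequences)
    if n == 0:
        raise ValueError("at least one sequence is required")
    num_positions = len(sequences[0])
    pssm = []
    for i in range(num_positions):
        counts = {base: 0 for base in BASES}  # fresh counts for every position
        for seq in sequences:
            if seq[i] in counts:
                counts[seq[i]] += 1
        pssm.append([counts[base] / n for base in BASES])
    return pssm
```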
The PSSM, as calculated in Algorithm 3, will be used for predicting a sequence in a phylogenetic
tree. As shown in Algorithm 3, a PSSM is calculated over a set of sequences. Thus, we have
to define this set of sequences. For our methodology, we have defined three such sets.
1. whole-pssm : Here, we consider all the sequences in the phylogenetic tree for calculating
the PSSM. This PSSM gives the overall probability of a particular base being at a
particular position across all the sequences. This PSSM is the same for all the sequences.
The same whole-pssm will be used for predicting all nodes in the phylogenetic tree.
2. subtree-pssm : In this case, we use a subset of the entire phylogenetic tree. If the
sequence for which we are going to predict bases is at level n, all the sequences at levels
from 0 to n − 1 form a subtree which existed before this sequence. We use this subset
to calculate subtree-pssm. Subtree-pssm summarizes the history of the sequence under
consideration. Subtree-pssm is the same for all the nodes at a particular level. However, it
has to be re-calculated for sequences at different levels of the phylogenetic tree.
3. path-pssm : In this case, we isolate the set of branches that lead from the root to the
sequence. We form a set of the sequences on these branches to calculate the PSSM. path-pssm
uses the exact history of a particular sequence, is different for different sequences, and can
help in more specialized predictions.
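The three sets can be illustrated on a small tree. The sketch below assumes the phylogenetic tree is given as a hypothetical mapping from each node to its parent (None for the root), that the level of a node is its distance from the root, and that the path set excludes the sequence being predicted. None of these representation choices comes from the text.

```python
from typing import Dict, Optional, Set

Tree = Dict[str, Optional[str]]  # node -> parent, None for the root


def level(node: str, parent: Tree) -> int:
    """Distance of a node from the root."""
    depth = 0
    while parent[node] is not None:
        node = parent[node]
        depth += 1
    return depth


def whole_set(parent: Tree) -> Set[str]:
    """whole-pssm: every sequence in the tree."""
    return set(parent)


def subtree_set(parent: Tree, target: str) -> Set[str]:
    """subtree-pssm: all sequences at levels 0 to n - 1 for a target at level n."""
    n = level(target, parent)
    return {node for node in parent if level(node, parent) < n}


def path_set(parent: Tree, target: str) -> Set[str]:
    """path-pssm: the sequences on the path from the root to the target."""
    path = set()
    node = parent[target]
    while node is not None:
        path.add(node)
        node = parent[node]
    return path
```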
5.1.2 Strategies for Determining Preserved Sequences
We derived four different strategies for determining preserved sequences using PSSM. Each of
these strategies decides whether to apply rule for transformations at a particular position in the
strand.
1. Conservative : This strategy decides not to apply rule only at positions where PSSM
value is 1.0. This strategy assumes that if a particular base pair appears in all the sequences, it may also appear at the same position in the next sequence.
2. Extreme-Conservative : This strategy decides not to apply rules at positions where
PSSM value is 1.0 and the base corresponding to that position also matches with the
base in the strand. PSSM used can be one of whole-pssm, subtree-pssm or path-pssm.
C HAPTER 5. P REDICTIONS
IN PHYLOGENETIC TREES
51
3. Flexible-1: This strategy decides not to apply the rule at positions where the PSSM value is
greater than or equal to a certain threshold value. The higher the threshold value, the fewer
the positions where the rule is not applied. Various experiments were conducted to
determine this threshold value. This strategy allows a lot of flexibility and its performance
varies depending upon the value of the chosen threshold.
4. Flexible-2: This strategy decides not to apply the rule at positions where
(a) the PSSM value is greater than or equal to a certain threshold value and
(b) the base pair corresponding to the highest PSSM value at that position equals the
base pair in the sequence under consideration.
This strategy offers more flexibility than the Flexible-1 strategy, as the positions where the strategy decides not to apply the rule may vary as we progress through the time-steps. This is
because of the second condition. This increased flexibility improves the performance of the strategy and hence we use this strategy for our prediction experiments.
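The four strategies can be sketched as predicates over a per-position frequency table. The sketch below is only an illustration under stated assumptions: it treats the PSSM as plain per-position base frequencies over a set of aligned, equal-length sequences, and the names `build_pssm` and `preserved_positions` are hypothetical rather than taken from the thesis implementation.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(sequences):
    """Assumed PSSM: per-position base frequencies of aligned sequences."""
    n_seq = len(sequences)
    pssm = []
    for column in zip(*sequences):
        counts = Counter(column)
        pssm.append({b: counts.get(b, 0) / n_seq for b in BASES})
    return pssm

def preserved_positions(pssm, strand, strategy, threshold=0.5):
    """Return one boolean per position; True means 'do not apply the rule'."""
    mask = []
    for freqs, base in zip(pssm, strand):
        # Base with the highest PSSM value at this position (ties: first in BASES).
        top_base = max(BASES, key=lambda b: freqs[b])
        top_value = freqs[top_base]
        if strategy == "conservative":
            keep = top_value == 1.0
        elif strategy == "extreme-conservative":
            keep = top_value == 1.0 and top_base == base
        elif strategy == "flexible-1":
            keep = top_value >= threshold
        elif strategy == "flexible-2":
            keep = top_value >= threshold and top_base == base
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        mask.append(keep)
    return mask

pssm = build_pssm(["ACGT", "ACGA", "ACTA"])
print(preserved_positions(pssm, "ATGT", "flexible-2", threshold=0.5))
```

Because Flexible-2 compares against the current strand, calling `preserved_positions` again after a mutation step can return a different mask, which is the time-step dependence described above.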
5.1.3 Evaluation of Strategies
To compare the strategies described in Section 5.1.2, we define five parameters for a strategy: true positives, false positives, true negatives, false negatives and a decision parameter. True positives
are the positions where a strategy decides to exclude the position from applying the rule and the
corresponding bases in both ancestor and progeny are equal; i.e. the decision taken by the strategy
was correct. False positives are the positions where a strategy decides not to apply the rule for
transformations but the corresponding two bases are not the same in ancestor and progeny;
i.e. the decision taken by the strategy was not correct. True negatives and false negatives
are defined similarly: true negatives are the positions where a strategy decides to apply the rule and the two bases differ, and false negatives are the positions where it decides to apply the rule although the two bases are equal.
The number of false positives defines the upper bound on the similarity value (see Chapter
3) that can be attained due to the transformations. This is because the positions corresponding
to false positives are not going to be transformed even though the base pairs in these positions
do not match with the corresponding base pairs in the progeny sequence. This upper bound can
be calculated as
$$\mathit{upperbound} = \frac{n - \mathit{fp}}{n} \qquad (5.1)$$
where,
n = number of bases in the sequence.
fp = number of false positives.
The decision parameter is calculated by subtracting the number of wrong decisions from the
number of correct decisions and then normalizing the result by the total number of bases in
the sequence. Equation 5.2 gives the formula for the decision parameter.
$$\mathit{decpar} = \frac{(\mathit{tp} + \mathit{tn}) - (\mathit{fp} + \mathit{fn})}{n} \qquad (5.2)$$
where,
tp = number of true positives.
tn = number of true negatives.
fp = number of false positives.
fn = number of false negatives.
We evaluate the fitness of a strategy using Equation 5.3
$$\mathit{fit}(\mathit{strategy}) = 0.5 \cdot \mathit{decpar} + 0.5 \cdot \mathit{upperbound} \qquad (5.3)$$
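Equations 5.1 to 5.3 can be evaluated directly once the preservation decisions are known. The following is a minimal sketch, assuming the decisions are given as a list of booleans (True meaning "do not apply the rule") and assuming the reading of true and false negatives that mirrors the definitions of true and false positives; the function name `evaluate_strategy` is hypothetical.

```python
def evaluate_strategy(mask, ancestor, progeny):
    """Compute tp, fp, tn, fn and Equations 5.1 to 5.3 for one branch."""
    n = len(ancestor)
    tp = fp = tn = fn = 0
    for keep, a, p in zip(mask, ancestor, progeny):
        same = a == p
        if keep and same:
            tp += 1   # preserved, and the base really did not change
        elif keep and not same:
            fp += 1   # preserved, but the base changed in the progeny
        elif not keep and not same:
            tn += 1   # assumed definition: rule applied, base did change
        else:
            fn += 1   # assumed definition: rule applied, base was unchanged
    upperbound = (n - fp) / n                      # Equation 5.1
    decpar = ((tp + tn) - (fp + fn)) / n           # Equation 5.2
    fit = 0.5 * decpar + 0.5 * upperbound          # Equation 5.3
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "upperbound": upperbound, "decpar": decpar, "fit": fit}

print(evaluate_strategy([True, True, False, False], "ACGT", "ATGA"))
```

In this toy case each of the four counts is 1, so the upper bound is 0.75, the decision parameter is 0 and the fitness is 0.375.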
The Conservative strategy gives a low percentage of false positives, resulting in high upperbound
values, but it also gives many false negatives, which lowers the fitness of the strategy.
The Extreme-Conservative strategy gives results similar to the Conservative strategy when whole-pssm
is used, but the results differ when subtree-pssm or path-pssm is used. When whole-pssm
is used, the second condition of this strategy becomes redundant and the Extreme-Conservative
strategy exhibits the same behavior as the Conservative strategy. However, when subtree-pssm
or path-pssm is used, the base pair in the ancestor and progeny sequences may not match even
if the PSSM value at that position is 1.0, giving different results. This strategy also suffers from
the drawback that it gives a high number of false negatives.
[Figure 5.1: Analysis of threshold values : Flexible-1. fit(Flexible-1) plotted against threshold values from 0.1 to 1 for path-pssm, sub-tree pssm and entire tree pssm.]
5.1.4 Determination of Threshold Values for Flexible Strategies
We performed experiments for the strategies Flexible-1 and Flexible-2 to determine the threshold
value discussed in Section 5.1.2. Figure 5.1 and Figure 5.2 show the fitness of the strategies
Flexible-1 and Flexible-2 respectively, for various threshold values for a particular branch. The
Y-axis shows the fitness values calculated using Equation 5.3.
As we can see from both graphs, the threshold values have similar effects on the performance of Flexible-1 and Flexible-2. For higher threshold values, the number of false positives
is low. However, the number of false negatives is very high, which decreases the fitness value.
As we decrease the threshold value, many of the false negatives become true positives, increasing the fitness value. The performance of the strategy reaches its peak at a threshold value of 0.5.
For threshold values less than 0.5, the number of false positives increases so much that the fitness value starts decreasing. After a certain stage, the fitness value stabilizes irrespective of the
threshold value. This happens because most of the preserved positions remain constant when
the threshold values are less than most of the PSSM entries.
[Figure 5.2: Analysis of threshold values : Flexible-2. fit(Flexible-2) plotted against threshold values from 0.1 to 1 for path-pssm, sub-tree pssm and entire tree pssm.]
The threshold values can be varied for different time-steps and their effects on the fitness values can then be analyzed. This results in a more complicated strategy but can provide
better results. For the initial time-steps, we want the number of false positives to be as low as possible
so that the limit on the similarity value that can be achieved stays high. This requires a high threshold value. However, this also means that the actual similarity value remains low. Hence, to improve
upon this similarity value, the threshold value can be decreased in the later time-steps. Calculating a different threshold value for each of the time-steps can be a complicated and cumbersome task. Moreover, it
also depends upon the number of time-steps, which differs from branch to branch. Hence it is
impossible to come up with a fixed generic strategy. We follow a simplistic approach with two
threshold values, thr1 and thr. In this approach, thr1 is used only for the first time-step and thr
is used for the later time-steps. From the above discussion, it follows that the value of thr1 should
be as high as possible and the value of thr should be lower. We performed an analysis over different
ranges of both the thresholds. In this analysis, we actually used Flexible-1 and Flexible-2, and
applied random rules to come up with the similarity value for each of the branches. The optimal values for thr1 and thr determined from this analysis were 0.9 and 0.5 respectively. As
expected, the value of thr1 is higher and the value of thr is the same as the best threshold value
obtained from the analysis of Figure 5.1 and Figure 5.2.
Also, as seen from Figures 5.1 and 5.2, Flexible-2 performs better than Flexible-1 most of
the time. Moreover, Flexible-2 improves as the time-steps progress. Hence we have used
Flexible-2 for determining the preserved sequences in further work.
To summarize, during transformations, rules are not applied to the preserved segments.
PSSM is used to calculate these preserved segments of a DNA sequence. Any of the whole,
subtree or path PSSMs can be used for this purpose. Of the four strategies that we have formulated, Flexible-2 performs best in terms of the fitness defined by Equation 5.3. The Flexible-2
strategy decides not to apply the rule at a particular position if the PSSM value is greater than or equal
to a certain threshold value and the base pair corresponding to the highest PSSM value at that
position equals the base pair in the current sequence. The fitness of Flexible-2 is a function of
the threshold value. In order to limit the effect of a high number of false positives, two threshold values are used: thr1 for the first time-step and thr for the rest of the time-steps during the
transformations. From several experiments, the best values for thr1 and thr are determined as
0.9 and 0.5 respectively.
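The two-threshold scheme summarized above can be sketched in a few lines. This is an illustration only: the time-step index is assumed to be 0-based, and the PSSM is assumed to be pre-reduced to a hypothetical list of (highest-valued base, PSSM value) pairs, one per position.

```python
def threshold_for_step(step, thr1=0.9, thr=0.5):
    """thr1 for the first time-step (step 0), thr for every later one."""
    return thr1 if step == 0 else thr

def flexible2_mask(top_entries, strand, threshold):
    """Flexible-2: preserve a position if its top PSSM value reaches the
    threshold and the top base equals the base currently in the strand."""
    return [value >= threshold and base == current
            for (base, value), current in zip(top_entries, strand)]

top_entries = [("A", 1.0), ("C", 0.6), ("G", 0.95), ("T", 0.4)]
for step in range(2):
    thr_now = threshold_for_step(step)
    print(step, thr_now, flexible2_mask(top_entries, "ACTT", thr_now))
```

With the strict first threshold only the first position is preserved. From the second time-step on, the position with value 0.6 is preserved as well, while the mismatching third position stays open to mutation at both thresholds.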
5.2 Analysis of Random Strategies
Having determined the preserved segments with the Flexible-2 strategy described in the previous section, we mutate the rest of the segments in a given ancestor sequence for a certain number of time-steps. The number of time-steps is equal to the average
number of time-steps for the branch whose ancestor is the ancestor sequence under consideration. After mutating for these time-steps, we compare the sequence obtained with
the progeny sequence. Various techniques can be used for mutating the non-preserved segments.
In order to evaluate these techniques, we use three base strategies.
1. Upper bound: This is not really a strategy but the maximum upper bound that we can
achieve. This upper bound is calculated by using Equation 5.1. This upper bound is
calculated for each of the branches using Flexible-2.
2. Time-step invariant : In this strategy, we assume that the preserved segments of the sequence are calculated only at the first time-step. These same segments are then preserved
in the further time-steps. With this, we can analytically calculate the expected similarity
value as follows. By definition, positions that are false positives for the ancestor sequence
always differ between the ancestor and progeny sequences. True negatives and false negatives,
however, can converge to the correct base with a probability of 0.25. Thus,
$$\mathit{exp\_sim} = \frac{n - \{\mathit{fp} + 0.75 \cdot (\mathit{tn} + \mathit{fn})\}}{n} \qquad (5.4)$$
This analytical value was verified by conducting experiments where the non-preserved
segments are randomly mutated.
3. Time-step variant : In this case, an analytical value cannot be obtained, as we allow
the preserved segments of the sequences to change at each time-step. Hence, we conducted
50 experiments and obtained the average of the similarity values after the given number of
time-steps, for each of the branches. For each branch, we calculated the preserved segments using Flexible-2 (see Section 5.1). As discussed in the previous section, the threshold used
for the first time-step was 0.9 and the threshold used for the rest of the time-steps was 0.5.
We observed that the similarity values obtained for almost all the branches in the time-step variant case were better than the analytical values obtained using Equation 5.4 for the
time-step invariant case.
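The analytical value of Equation 5.4 for the time-step invariant case can be checked with a small simulation. The sketch below makes a simplifying assumption that is not taken from the thesis: each non-preserved position ends up as a uniformly random base, so it matches the progeny with probability 0.25. The function names are hypothetical.

```python
import random

def expected_similarity(n, fp, tn, fn):
    """Equation 5.4: expected similarity for the time-step invariant case."""
    return (n - (fp + 0.75 * (tn + fn))) / n

def simulate_invariant(ancestor, progeny, mask, trials=2000, seed=0):
    """Average similarity when preserved positions keep the ancestor base
    and every other position is drawn uniformly from A, C, G, T."""
    rng = random.Random(seed)
    n = len(ancestor)
    total = 0.0
    for _ in range(trials):
        matches = 0
        for keep, a, p in zip(mask, ancestor, progeny):
            base = a if keep else rng.choice("ACGT")
            matches += base == p
        total += matches / n
    return total / trials

ancestor = "A" * 100
progeny = "A" * 40 + "C" * 60
mask = [True] * 50 + [False] * 50          # tp=40, fp=10, tn=50, fn=0
print(expected_similarity(100, 10, 50, 0))  # analytical value
print(simulate_invariant(ancestor, progeny, mask))
```

For this toy branch the analytical value is 0.525, and the simulated average should fall close to it.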
The three curves shown in Figure 5.3 are obtained using the above random strategies. These
curves act as benchmarks for our strategies. The upper bound curve sets the maximum limit that
we can reach. The challenge is to derive prediction strategies for mutations of non-preserved
segments that can yield similarity values better than the time-step variant and time-step invariant
cases for all the branches.
[Figure 5.3: Analysis of random strategies. Similarity value plotted per branch (branches 0 to 80) for the upper bound, time-step invariant and time-step variant strategies.]
5.3 Methods Used for Prediction
We have collected various statistics for HIV sequences as discussed in Chapter 4. We use these
statistics to predict future nucleotide sequences. To measure whether our
prediction methods produce correct sequences, we prune the complete tree to a smaller
tree and then try to predict the sequences in the complete tree using the pruned tree. The phylogenetic
tree under consideration has 16 levels. Thus, we split the tree at each level and try to predict
the sequences for the next level. For example, suppose we consider only five levels from the root of
the phylogenetic tree. The leaves of this five-level tree are intermediate nodes in the original
tree. From this five-level tree, we try to predict the sequences at level six. These sequences
are then compared with the level-six nodes of the complete or original phylogenetic tree to derive
the similarity value. The sequences at the sixth level are also obtained using the base strategies
mentioned in the previous section.
5.3.1 Roulette Wheel Method
We tried various methods, of which we describe here the most effective. To predict
the progeny sequence of branch A, we extract from the database the rules used at each of the time-steps
in the history of branch A. The history consists of all the branches that lead from the root of the tree
to the ancestor sequence of branch A. We also obtain the average number of time-steps
for branch A. The problem is to choose a rule to apply at each time-step of branch A in order to reach
the progeny sequence.
To decide which rule should be used for a time-step, we form a roulette wheel using a
fraction of the most popular rules for that particular time-step in the previous branches. The entire
roulette wheel is divided into sectors such that the probabilities (sizes) of all the
sectors sum to one. Each sector of the wheel corresponds to a rule extracted from the database,
and the probability (size) of that sector is proportional to the popularity of that particular rule. We then
convert these probabilities into cumulative probabilities, so that the value stored for a particular sector is
the sum of its own probability and the probabilities of all the previous sectors. Thus, the value for the last sector
equals 1.0. Then a number between 0 and 1 is randomly generated. That number falls in one of
the sectors of the roulette wheel. The rule corresponding to that sector is used for the time-step.
Note that the probability of picking a particular rule is directly proportional to the popularity of
that rule.
The fraction of the most popular rules that should be used to form a roulette wheel is an
interesting parameter to search for. We have tried values ranging from 0.1 to 0.6 in the steps of
0.1 i.e. from top 10% to top 60% of the popular rules.
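Roulette wheel selection over the most popular rules can be sketched as follows. This is a minimal illustration, not the thesis implementation: rule popularity is assumed to be available as a dictionary of counts, the rule identifiers are placeholders, and rounding the number of retained rules to the nearest integer is an assumption.

```python
import bisect
import itertools
import random

def build_wheel(rule_counts, top_fraction):
    """Keep the top fraction of rules by popularity and return the rules
    together with their cumulative probabilities (last entry is 1.0)."""
    ranked = sorted(rule_counts.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(round(top_fraction * len(ranked))))
    selected = ranked[:keep]
    total = sum(count for _, count in selected)
    rules = [rule for rule, _ in selected]
    cumulative = list(itertools.accumulate(count / total for _, count in selected))
    cumulative[-1] = 1.0  # guard against floating-point rounding
    return rules, cumulative

def spin(rules, cumulative, rng=random):
    """Draw x in [0, 1) and return the rule whose sector contains x."""
    x = rng.random()
    return rules[bisect.bisect_right(cumulative, x)]

rules, cumulative = build_wheel({"r1": 6, "r2": 3, "r3": 1, "r4": 0}, top_fraction=0.5)
print(rules, cumulative)
print(spin(rules, cumulative, random.Random(0)))
```

Storing cumulative values lets a single binary search locate the sector, and a rule with twice the popularity occupies a sector twice as wide.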
5.3.2 Roulette Wheel Method with Random Component
For the roulette wheel method discussed in the previous section, all the rules used to form a
roulette wheel were selected from the database. In this method, however, we create a part of the
roulette wheel using randomly generated rules. Thus, a roulette wheel now has two parts:
1. A part consisting of the rules extracted from the database.
2. A part consisting of the randomly generated rules.
The same procedure (described in the previous subsection) is followed to pick a rule for a particular
time-step. The rule may belong to either of the above parts, with probability equal to the size
of that part. For example, if we keep the size of the random part at 0.1 of the total roulette
wheel, the probability that the rule picked for a time-step is randomly generated is 0.1. For the
part consisting of the rules extracted from the database, the probabilities of the sectors
are now adjusted in such a way that their sum is 0.9 instead of 1.0. The fraction that consists
of randomly generated rules is an interesting parameter to search. We tried different values
ranging from 0.0 to 0.5 in steps of 0.05.
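The random component can be sketched as a two-stage draw: first decide which part of the wheel is hit, then pick within that part. This is equivalent to scaling the database sectors so that they sum to one minus the random fraction. The encoding of a rule as a string of 64 output bases is a hypothetical choice for illustration only, as are the function names.

```python
import random

def random_rule_table(rng):
    """Hypothetical rule encoding: one output base per 3-base neighborhood."""
    return "".join(rng.choice("ACGT") for _ in range(64))

def pick_rule(rules, weights, random_fraction, make_random_rule, rng=random):
    """With probability random_fraction return a freshly generated rule;
    otherwise pick a database rule in proportion to its popularity."""
    if rng.random() < random_fraction:
        return make_random_rule(rng)
    return rng.choices(rules, weights=weights, k=1)[0]

rng = random.Random(42)
picks = [pick_rule(["r1", "r2"], [3, 1], 0.1, random_rule_table, rng)
         for _ in range(10)]
print(picks)
```

Setting `random_fraction` to 0.0 reduces this to the plain roulette wheel method.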
5.3.3 History Sizes
For both prediction methods, we use the history of the branch under consideration. The naive strategy
is to consider all the branches from the root to that branch. We can also limit this history to
only a certain number of immediate predecessors from the history of the branch. We
call this number the history size. Thus, if we are performing experiments with a history size
of 3, the three immediate predecessor branches in the path from the root to the ancestor of the branch
under consideration are selected. The roulette wheel is then formed only from the popular rules
of these 3 branches. History size does have an impact on the similarity value that we can attain.
History sizes of 3 and 4 seemed to produce better results than other history sizes.
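Limiting the history amounts to taking the tail of the root-to-ancestor path before pooling rule popularity. The sketch below assumes the path is an ordered list of branch identifiers (root first) and that rules are stored per branch in a dictionary; both structures and the function names are hypothetical.

```python
from collections import Counter

def history_window(path_branches, history_size=None):
    """Return the last history_size branches of the path, or the whole
    path when no history size is given (the naive strategy)."""
    if history_size is None:
        return list(path_branches)
    if history_size <= 0:
        raise ValueError("history_size must be positive")
    return list(path_branches[-history_size:])

def pooled_rule_counts(path_branches, rules_by_branch, history_size=None):
    """Count how often each rule occurs in the selected history."""
    counts = Counter()
    for branch in history_window(path_branches, history_size):
        counts.update(rules_by_branch.get(branch, []))
    return counts

path = ["b1", "b2", "b3", "b4", "b5"]
rules_by_branch = {"b1": ["x"], "b3": ["y", "z"], "b4": ["y"], "b5": ["z"]}
print(history_window(path, 3))
print(pooled_rule_counts(path, rules_by_branch, 3))
```

The pooled counts can then be used as the popularity values from which the roulette wheel is built.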
5.3.4 Experiments and Results
Using the above techniques, we have tried to predict the sequences for the branches from level 3 onwards. There are
a total of 77 such branches. For each branch, we have four similarity values, corresponding to the upper
bound, the time-step variant random strategy, the time-step invariant random strategy, and the roulette wheel method
with random component. We have performed experiments for history sizes ranging from 3 to
10. For each of these history sizes, we have varied the random component of the roulette wheel
from 0.0 to 0.5 in steps of 0.05. For each of these random components, we have varied the
fraction of popular rules to be extracted from the database from 0.1 to 0.6 in steps of 0.1.
The goal is to improve the value obtained from the roulette wheel method over both values
obtained from the random methods.
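The parameter ranges above define a grid that can be enumerated explicitly. The following sketch only lists the combinations and counts the branches where one method beats a baseline; the similarity values in the usage line are made-up placeholders, not results from the thesis.

```python
from itertools import product

history_sizes = range(3, 11)                                # 3 to 10
random_fractions = [round(0.05 * i, 2) for i in range(11)]  # 0.0 to 0.5
top_fractions = [round(0.1 * i, 1) for i in range(1, 7)]    # 0.1 to 0.6

grid = list(product(history_sizes, random_fractions, top_fractions))
print(len(grid))  # number of parameter combinations per branch

def count_improved(method_values, baseline_values):
    """Number of branches where the method's similarity beats the baseline."""
    return sum(m > b for m, b in zip(method_values, baseline_values))

print(count_improved([0.80, 0.70, 0.90], [0.75, 0.72, 0.85]))
```

With 8 history sizes, 11 random fractions and 6 popular-rule fractions, the grid has 528 combinations for each branch.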
The roulette wheel method outperforms the time-step invariant random strategy in all the
77 branches, with an average improvement of 11.41%. However, the average improvement
over the time-step variant random strategy is quite low. Though, for a
certain set of parameter values, we have been able to better the values obtained from the random methods in 51 branches out of 77,
the average improvement over the time-step variant random strategy is
only 0.003. There can be various reasons for this low average value.
5.4 Analysis
Clearly, the predictions made using the methods described so far do not yield significantly better results
than the random strategies. In this section we analyze the reasons behind this.
1. New Rules : In all the methods described so far, we use the popular rules from the
previous history. A small analysis was performed to find out whether a particular rule
was used more than once in the tree. It was found that rules were never repeated in any
part of the tree. This was expected, as the search space consists of $4^{64}$ rules and it is highly
unlikely that two randomly generated rules will be the same. This analysis suggests that
a method that derives a new rule could be more effective. This new rule could be formed from
the popular rules using extrapolation. In function extrapolation, we use previous samples
to derive a new value; we do not simply reuse the most popular sample for the next prediction. We
are not yet certain how to derive this new rule from the previous history, but it
may give better results than using the most popular rule as it is.
2. Preserved Sequences for Selective Application of Rules: While performing sequence
transformation, we saw that selective application of cellular automata rules quickly gave
us convergence to the progeny sequence. This happens because we had used the knowledge of the progeny sequence itself to form the preserved sequence. This is acceptable when
we are only doing the analysis. However, while making predictions, we do not have this
powerful tool at our disposal. Though we use PSSMs for determining the preserved sequences, the rules that are used do not correspond to the same PSSMs. If, during analysis,
PSSMs had been used for selective application of cellular automata rules, we would have
had a direct connection between the rules and the preserved sequences.
We are not certain whether using PSSMs for selective application of cellular automata rules
would yield the same convergence that we obtained. It may take more time-steps to converge,
requiring a higher number of grid resources. However, it may give better results as far as
predictions are concerned.
3. Biological Knowledge? : The rules obtained using the sequence transformation analysis
are used directly for prediction. Some kind of biological knowledge may give
insights for filtering these rules. The analogy here is with the construction of the phylogenetic tree
itself. The phylogenetic tree building packages give a large number of trees. These
trees are then filtered using bootstrapping. Sometimes even manual observation by a
biological expert is needed in order to obtain the correct phylogenetic tree. Such post-analysis of the rules was not performed. Bootstrapping of the rules may reduce the noise,
giving us a better idea of the popular rules. Again, the methodology and the amount of
resources required for bootstrapping remain to be seen.
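The size of the rule space mentioned under "New Rules" can be made concrete with a short calculation. It assumes, as an interpretation rather than a statement from this section, that a rule maps each neighborhood of three bases (the cell and its two neighbors) to one of four output bases, and that rules are drawn uniformly at random:

```latex
\[
  N_{\mathrm{rules}} = 4^{\,4^{3}} = 4^{64} = 2^{128} \approx 3.4 \times 10^{38}
\]
\[
  P(\text{some rule repeats among } k \text{ random rules})
  \;\le\; \binom{k}{2} \frac{1}{4^{64}} = \frac{k(k-1)}{2 \cdot 4^{64}}
\]
```

Under these assumptions, even k = 10^6 stored rules give a bound of roughly 1.5 x 10^{-27}, which is consistent with the observation that no rule was ever repeated in the tree.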
Chapter 6
Conclusions and Future work
6.1 Conclusions
In this thesis, we have developed techniques based on cellular automata to analyze the rules
for neighborhood-based mutations on branches of a phylogenetic tree. Fine-level analysis of
rules at different time steps of mutations is one of the primary contributions of our work. We
have also used the cellular automata rules to resolve the uncertainties regarding phylogenetic
trees, including the number of time steps of mutations and the base-pairs in certain positions of the
intermediate sequences. Due to the vast rule space involved in our analyses, we adopt
coarse-level parallelism by conducting parallel exploration of rules on different grid resources.
We have also built mechanisms for load-balancing and fault-tolerance that are necessary for
sustaining our ever-running computations on grid resources. We have built a database in order
to store the results generated during the parallel sequence transformations. Experiments were
performed on gag, gagPol and env sequences of HIV using a distributed setup of 23 machines
across three countries. Various statistics collection programs were written to extract the useful
results from the database. These programs can be invoked offline at any time by the user, during
or after the execution of the application. Based on the results collected, we have shown some
interesting statistics related to time steps, popular cellular automata rules for a branch and the
resolution of unknown positions in the intermediate sequences.
In the second part of the work, we have laid foundations for predictions of sequences in phylogenetic trees. Various strategies for calculating the preserved sequences have been formulated.
Base strategies have been developed in order to compare the results of any prediction strategy.
The Roulette Wheel strategy has been developed for actual predictions. Though the Roulette Wheel
strategy performed better than the base strategies in a large number of cases, the average performance improvement was not significant. Not using PSSM during the sequence transformations
and the lack of bootstrap filtering on rules may have hampered the performance of the strategies.
6.2 Future Work
Our current work can adequately deal with short sequences such as those of HIV. This work
can be extended to deal with long sequences such as those in human genomes. To perform
computations and collect results for long sequences in a reasonable amount of time, fine-grain
parallelism, in which the individual transformations from an ancestor to a progeny are themselves
parallelized, can be employed. Robust scheduling mechanisms can be developed to map the
individual sequence transformations to the processors of a grid. Existing scheduling techniques
that need expected times to completion will not be adequate, since the number of time steps
needed for sequence transformations cannot be determined a priori.
This work can be generalized to deal with different rules for different patterns of
neighborhood-based mutations. In particular, neighborhood sizes greater than 1 can be
explored. Higher order neighborhood-dependencies have to be managed by using simplifying
biological assumptions. In addition, the assumption that the fitness of the strand always increases
during the transformation can be relaxed. Certain fluctuations in the fitness value may be allowed during the transformations in order to get more realistic analyses.
Different prediction strategies can be built in order to get more accurate results for the
predictions. These prediction strategies may either depend on the realistic analysis discussed
above or be built by addressing the reasons discussed in Section 5.4. Specifically, the
analysis itself can be improved by incorporating PSSMs for selective application of cellular
automata rules. The techniques used for finding the preserved sequences can be tuned so that
they can also be used for selective application of cellular automata rules. Rules obtained from
the analyses can be filtered using bootstrapping so that these rules can be used more confidently
for predictions. New prediction strategies may use techniques similar to function extrapolation
to derive new rules instead of using exactly the same rules that were used in the history.
References
[1] Abc@home. http://abcathome.com/.
[2] P. Arndt, C. Burge, and T. Hwa. DNA Sequence Evolution with Neighbor-Dependent
Mutation. Journal of Computational Biology, 10:313–322, 2003.
[3] Biogrid. http://biogrid.net.
[4] M. Bulmer. Neighboring Base Effects on Substitution Rates in Pseudogenes. Molecular
Biology and Evolution, 3(4):322–329, 1986.
[5] C. Burks and D. Farmer. Towards Modeling DNA Sequences as Automata. Physica D:
Nonlinear Phenomena, 10(Issue 1-2):157–167, 1984.
[6] Quantum monte carlo@home. http://qah.uni-muenster.de/.
[7] Wahid Chrabakh and Richard Wolski. Gridsat: Design and implementation of a computational grid application. J. Grid Comput., 4(2):177–193, 2006.
[8] Climateprediction.net. http://www.climateprediction.net.
[9] ClustalW. http://www.ebi.ac.uk/clustalw.
[10] Einstein@home. http://einstein.phys.uwm.edu/.
[11] Understanding Evolution. http://evolution.berkeley.edu.
[12] N. Ganguly, B. Sikdar, A. Deutsch, G. Canright, and P. Chaudhuri. A Survey on Cellular
Automata. Technical report, Centre for High Performance Computing, Dresden University
of Technology, December 2003.
65
R EFERENCES
66
[13] S. Hess, D. Jonathan, and R. Blake. Wide Variations in Neighbor-Dependent Substitution
Rates. Journal of Molecular Biology, 236:1022–1033, 1994.
[14] B. Korber, M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B. H. Hahn, S. Wolinsky, and T. Bhattacharya. Timing the Ancestor of the HIV-1 Pandemic Strains. Science,
288(5472):1789–1796, June 2000.
[15] Lhc@home. http://lhcathome.cern.ch/lhcathome/.
[16] Malariacontrol.net. http://www.malariacontrol.net.
[17] B. Morton, I. Bi, M. McMullen, and B. Gaut. Variation in Mutation Dynamics Across
the Maize Genome as a Function of Regional and Flanking Base Composition. Genetics,
172(1):569–577, January 2006.
[18] Mufulid. http://www.ufluids.net/.
[19] G. Olsen, H. Matsuda, R. Hagstrom, and R. Overbeek. fastDNAmL: a Tool for Construction of Phylogenetic Trees of DNA Sequences using Maximum Likelihood. Computer
Applications in the Biosciences, 10(1):41–48, 1994.
[20] Phylip Package. http://evolution.genetics.washington.edu/phylip.
html.
[21] PostgreSQL. http://www.postgresql.org.
[22] Predictor@home. http://predictor.scripps.edu/.
[23] Primegrid. http://www.primegrid.com.
[24] Riesel sieve. http://boinc.rieselsieve.com/.
[25] Rosetta@home. http://boinc.bakerlab.org/rosetta/.
[26] H.-P. Schwefel. Deep Insight from Simple Models of Evolution. BioSystems, 64(1):189–
198, January 2002.
R EFERENCES
67
[27] HIV Sequence Database. http://hiv.lanl.gov.
[28] SETI@Home. http://setiathome.berkeley.edu.
[29] A. Siepel and D. Haussler. Phylogenetic Estimation of Context-Dependent Substitution
Rates by Maximum Likelihood. Molecular Biology and Evolution, 21(3):468–488, March
2004.
[30] Simap. http://boinc.bio.wzw.tum.de/boincsimap/.
[31] G. Sirakoulis, I. Karafyllidis, Ch. Mizas, V. Mardiris, A. Thanailakis, and Ph. Tsalides.
A Cellular Automaton Model for the Study of DNA Sequence Evolution. Computers in
Biology and Medicine, 33(5):439–453, September 2003.
[32] Spinhenge@home. http://spin.fh-bielefeld.de/.
[33] C. Stewart, D. Hart, M. Aumuller, R. Keller, M. Muller, H. Li, R. Repasky, R. Sheppard,
D. Berry, M. Hess, U. Wossner, and J. Colbourne. A Global Grid for Analysis of Arthropod Evolution. In Proceedings of the Fifth IEEE/ACM International Workshop on Grid
Computing, pages 328–337, 2004.
[34] Sztaki desktop grid. http://szdg.lpds.sztaki.hu/szdg/.
[35] World community grid. http://www.worldcommunitygrid.org/.
[36] S. Wolfram. A New Kind of Science. Wolfram Media, Inc., 2002.