Computational Problems in Modeling Evolution and Inferring Gene

Computational Problems in Modeling Evolution
and Inferring Gene Families
MEHMOOD ALAM KHAN
Doctoral Thesis
Stockholm, Sweden 2016
TRITA-CSC-A-2016:24
ISSN-1653-5723
ISRN-KTH/CSC/A-16/24-SE
ISBN 978-91-7729-131-2
KTH School of Computer Science
and Communication
SE-100 44 Stockholm
SWEDEN
Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av Akademisk avhandling tisdag den
18 oktober 2016 klockan 14.00 i Air, SciLifeLab, Tomtebodavägen 23A, Stockholm.
© Mehmood Alam Khan, October 2016
Tryck: Universitetsservice US AB
iii
To my family and the complicated simplicity!
iv
Abstract
Over the last few decades, phylogenetics has emerged as a very promising field,
facilitating a comparative framework to explain the genetic relationships among
all the living organisms on earth. These genetic relationships are typically represented by a bifurcating phylogenetic tree — the tree of life. Reconstructing a
phylogenetic tree is one of the central tasks in evolutionary biology. The different
evolutionary processes, such as gene duplications, gene losses, speciation, and lateral gene transfer events, make the phylogeny reconstruction task more difficult.
However, with the rapid developments in sequencing technologies and availability of genome-scale sequencing data, give us the opportunity to understand these
evolutionary processes in a more informed manner, and ultimately, enable us to
reconstruct genes and species phylogenies more accurately. This thesis is an attempt to provide computational methods for phylogenetic inference and give tools
to conduct genome-scale comparative evolutionary studies, such as detecting homologous sequences and inferring gene families.
In the first project, we present FastPhylo as a software package containing fast
tools for reconstructing distance-based phylogenies. It implements the previously
published efficient algorithms for estimating a distance matrix from the input sequences and reconstructing an un-rooted Neighbour Joining tree from a given distance matrix. Results on simulated datasets reveal that FastPhylo can handles hundred of thousands of sequences in a minimum time and memory efficient manner.
The easy to use, well-defined interfaces, and the modular structure of FastPhylo
allows it to be used in very large Bioinformatic pipelines.
In the second project, we present a synteny-aware gene homology method,
called GenFamClust (GFC) that uses gene content and gene order conservation to
detect homology. Results on simulated and biological datasets suggest that local
synteny information combined with the sequence similarity improves the detection
of homologs.
In the third project, we introduce a novel phylogeny-based clustering method,
PhyloGenClust, which partitions a very large gene family into smaller subfamilies.
ROC (receiver operating characteristics) analysis on synthetic datasets show that
PhyloGenClust identify subfamilies more accurately. PhyloGenClust can be used as
a middle tier clustering method between raw clustering methods, such as sequence
similarity methods, and more sophisticated Bayesian-based phylogeny methods.
Finally, we introduce a novel probabilistic Bayesian method based on the DLTRS model, to sample reconciliations of a gene tree inside a species tree. The
method uses MCMC framework to integrate LGTs, gene duplications, gene losses
and sequence evolution under a relaxed molecular clock for substitution rates. The
proposed sampling method estimates the posterior distribution of gene trees and
provides the temporal information of LGT events over the lineages of a species
tree. Analysis on simulated datasets reveal that our method performs well in identifying the true temporal estimates of LGT events. We applied our method to the
genome-wide gene families for mollicutes and cyanobacteria, which gave an interesting insight into the potential LGTs highways.
v
Sammanfattning
Fylogenetik har kommit fram som ett mycket lovande fält de senaste decennierna, vilket har möjliggjort ett jämförande perspektiv för att förklara samband mellan jordens organismer. Dessa samband är typiskt representerade
med ett binärt grenande fylogenetiskt träd — Livets träd. Att rekonstruera ett
fylogenetiskt träd är ett av de centrala uppgifterna i evolutionsbiologi. De olika
evolutionära processerna, exempelvis genduplikation, genförlust, och lateral
transfer mellan arter, gör fylogenirekonstruktionsproblemet mycket svårt. Den
snabba tekniska utvecklingen av sekvenseringsteknik och tillgängligheten av
storskaliga sekvensdata gör det möjligt att förstå dessa evolutionära processer på ett grundligt sätt och, slutligen, korrekt rekonstruera gener och arters
evolution. Denna avhandling är ett försök att bidra med beräkningsmetoder
för fylogenetisk inferens och verktyg till att genomföra storskaliga jämförande
evolutionära studier, som till exempel att detektera homologa sekvenser och
indela gener i familjer.
I det första projektet presenteras FastPhylo, ett programpaket som innehåller verktyg för att rekonstruera fylogenier med avståndsbaserade metoder. Det
implementerar tidigare publicerade algoritmer för att beräkna en avståndsmatriser givet sekvensdata och rekonstruera orotade träd med metoden Neighbor
Joining. Resultat på simulerade data visar att FastPhylo kan hantera hundratusentals sekvenser som indata på ett effektivt sätt, både vad gäller datorminne
och tidsåtgång. Användargränssnittet är enkelt med en modulär struktur som
gör FastPhylo lätt att använda i systematiska bioinformatiska beräkningar.
I det andra projektet presenterar vi en metod för att avgöra homologi som
tar hänsyn till information om synteni. Metoden, GenFamClust (GFC), använder information om vilka gener och deras inbördes ordning i arvsmassor. Resultatet på simulerade och biologiska datamängder visar att information om
lokal synteni kombinerat med sekvenslikhet förbättrar detektionen av homologer.
I det tredje projektet introducerar vi en ny fylogeni-baserad clustringsmetod, PhyloGenClust, som partitionerar en mycket stor genfamilj i mindre delfamiljer. Analys på syntetiska data visar att PhyloGenClust identifierar delfamiljer bra. PhyloGenClust kan användas som ett mellan-alternativ till enkla
clustringsmetoder, exempelvis baserade på likhetsmått, och mer sofistikerade
metoder baserade på Bayesianska fylogenimetoder.
Slutligen introducerar vi en ny probabilistisk metod för att datera gentransfer baserad på DLTRS-modellen. Metoden använder MCMC som ett ramverk
för att integrera lateral transfer (LGT), genduplikation, genförlust, och sekvensevolution med relaxerad molekylär klocka i en och samma modell. Den föreslagna metoden estimerar posteriorifördelningen av genträd och bidrar med
temporal information för LGT-händelser på kanterna i ett artträd. Analys på
simulerade data visar att vår metod är bra på att identifiera tiderna för LGT.
Vi använde vår metod på gener från fullständiga arvsmassor för cyanobakterier och mollicuter, vilket gav insikt i potentiella välanvända vägar för lateral
transfer.
Contents
Contents
vi
Acknowledgements
1
List of Publications
3
1
Introduction
1.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
7
7
2
Evolution: a Biological Perspective
2.1 Basic Biological Background . . .
2.2 Mutations and Natural Selection
Local Point Mutations . . . . . .
Global mutations . . . . . . . . .
Gene Duplication . . . . .
Lateral Gene Transfer . .
2.3 Homology . . . . . . . . . . . . .
2.4 Gene family . . . . . . . . . . . .
3
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
11
11
12
12
13
15
16
17
18
Introduction to Computational Methods in Evolution
3.1 Basic phylogenetic concepts . . . . . . . . . . . . . . . . .
Multiple sequence alignment, gene tree, and species tree
Trees within trees: Reconciliation vs Realization . . . . .
3.2 Computational methods for phylogeny reconstruction . .
Distance-based methods . . . . . . . . . . . . . . . . . . .
Character-based methods . . . . . . . . . . . . . . . . . .
Parsimony . . . . . . . . . . . . . . . . . . . . . . .
Maximum-likelihood methods . . . . . . . . . . .
Bayesian methods . . . . . . . . . . . . . . . . . .
MCMC-based methods . . . . . . . . . . . . . . .
3.3 Reconciliation-based phylogeny reconstruction methods
Parsimony-based methods . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
21
21
21
23
24
25
27
27
27
28
29
31
32
vi
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
CONTENTS
3.4
4
Probabilistic methods . . . . . . . . . . . . . . . . . .
Duplication-loss models . . . . . . . . . . . .
Duplication-loss-transfer models . . . . . . .
Computational methods for inferring gene families
Sequence similarity methods . . . . . . . . . . . . .
Phylogeny-based methods . . . . . . . . . . . . . . .
Overview of the Included Papers
Bibliography
vii
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
33
34
34
36
36
37
39
43
Acknowledgements
To begin with, I would like to thank my supervisor Lars Arvestad for his gratitude and support throughout my PhD studies, Thank you Lars! Jens Lagergren, who is one the smartest
and cool researcher I have worked with in my career. It took me a while to tune to Jens’s
frequency, but once you get adapted to his frequency range, you will enjoy both research
and his funny jokes. Pekka – a true researcher and one of the most rational human thinkers
I have ever encountered in my research career. I really enjoyed your fika-table puzzles. It
was a pleasure to work with you and learn from you. I should say that I have learned a lot
from you, academically, during our combined project. Thank you Pekka! Afshin – A good
mentor, thank you Afshin for your valuable advise; it helped me a lot :)
I would like to thank all my colleagues at SciLifeLab: Ikram – for every story X, there
exists a story Y such that the conditional distribution of Y can be determined from a generative model through Ikram’s rule. It has been such a please to meet you and learn from you,
socially. In fact, the most important knowledge I learned from you is how to miss a flight :).
It is noteworthy to mention that after Pairs flight, I missed three other fights following your
doorsteps :-P Owais – Thanks for all the good times and good discussions. Erik – A local
music lover; local doesn’t mean Swedish at all :) Erik’s tendency to learn Pashto words is
phenomenal. I remember he always comes up with a new Pashto word as a surprise. In
fact, the most interesting Pashto sentence was "Nowar neshta, khuwand na kai". To know
more about Erik, please Google SBC user guide :-P Joel– an organized java coder with a
great aesthetic sense. Mathias – Khair sha bacha! Mathias is a cool researcher with a great
tendency to become a good teacher. Whenever, I asked him a mathematical problem, he always came up with a simpler solution. Once I asked him a badminton-challenge-problem,
and he still has to figure out a simpler solution :-P ohh! I don’t want to miss this: your
blueberries pie was awesome! Kristoffer – Yao man! An almost perfectionist, not in math
but also in street dancing and gym as well. Linus – one of the most social swedes ever at
Gamma 6, with exceptionally good eye vision. If you want to know more about his eye
vision, just check his computer’s font or his handwriting. Yrin – also shares the Nobel Prize
for the most social swedes ever, along with Linus. He has great aptitude towards Art. Lumi
– I enjoyed your fika-table discussion about feminism and I still have the same stance: both
women and men are equal but their responsibilities can be different. Women rocks! Victor
– a funny and very punctual person; I couldn’t beat his punctuality to come early to the
office. Jose, Simon, and Nemo – the chill guys! Matthew and Wenjing – my badminton
crew! Thanks to all the great people that I have meet at SciLifeLab: Bengt, Lukas, Hossein,
Jovana, Hanna, Katja, Awun, and Hashim.
I would like to thank my friends within and outside academia who contributed indirectly to this thesis. Dr. Mansoor – my homie, thank you for listening to my academic
1
2
CONTENTS
frustrations and the support you gave me during and after my medical surgery. I enjoyed
our intellectual discussions about life and science in general. Dr. Saqib – my non-biological
brother, I think thanks would be a really small word for you; I simply cannot repay the
medical and social services that you provided to my parents while I was away. I also want
to thanks Dr. Nouman (my cricket mate) for taking care of my Mother’s health, thank you!
Dr Nouman azam – my childhood and a philosophical friend. Umar Gandapur – my nonbiological brother who is a living king, Dr. Naveed – the prince, and Dr. Faheem – a humble
friend. Fikri – thanks for the Turkish tea and food. I really enjoyed our discussions about
societal problems and issues. Dr. Niazi – a silent but humble person; thanks for all the great
food :) I don’t want to miss Akber The Great! I think, if you were not around, Stockholm
would have been such a boring place; your funny jokes, hospitality, fried-eggs with Paratas
and tea will never be forgotten. Thank you brother! I would like to thank my saturdayevenings-food-party group: Ibrar, Akbar, Ikram, Izhar, Arif, Usman and Sami. My LUMS
friends: Jameel, Hammad, Shujaat, Sanan and Fawad-la. My Stockholm friends: Ibrahim,
Zia Ullah, Sakel, Fahad, Naveed, Waris, Shabir, and Dr Imadur Rahman.
At the end, I would like to thank my Father – one of the most principled persons I ever
know, my most knowledgeable, humble, and rational Mother, and my siblings for their
continuous support and help. Thank you and I love you all!
List of Publications
I: Khan MA., Elias I., Sjölund E., Nylander K., Guimera RV., Schobesberger R.,
Schmitzberger P., Lagergren J., & Arvestad L.
Fastphylo: Fast tools for phylogenetics.
BMC Bioinformatics 2013; 14: 334.
II: Ali RH., Muhammad SA., Khan MA., & Arvestad L.
Quantitative synteny scoring improves homology inference and partitioning
of gene families.
BMC Bioinformatics 2013; 14(Suppl 15):S12.
III: Khan MA., & Arvestad L.
Phylogenetic partitioning of gene families.
Manuscript.
IV: Khan MA., Mahmudi O., Ullah I., Arvestad L., & Lagergren J.
Probabilistic inference of lateral gene transfer events
Accepted at RECOMB-CG, 2016.
SOFTWARE PACKAGES
1: FastPhylo: A software suits for distance matrix estimation and generating distancebased phylogenetic trees.
http://fastphylo.sourceforge.net
2: PhyloGenClust: A python package for phylogenetic clustering of gene families.
https://github.com/malagori/PhyloGenClust
3: JPrIME: A library of phylogenetic methods and tools.
https://github.com/arvestad/jprime
3
List of Abbreviations
RNA Ribonucleic acid
tRNA Transfer Ribonucleic acid
mRNA Messenger Ribonucleic acid
DNA Deoxyribonucleic acid
ILS incomplete lineage sorting
LCA Last common ancestor
MSA multiple sequence alignment
NJ Neighbor-Joining
MAP maximum a posteriori
ML Maximum likelihood
MLE Maximum likelihood estimation
MCMC Markov chain Monte Carlo
WGD whole-genome duplication
LGT Lateral gene transfer
DL Duplication loss
DLT Duplication loss transfer
MPR Most parsimonious reconciliation
5
Chapter 1
Introduction
“When randomness adopts pattern...”
—Anonymous
1.1
Fundamentals
There has been always a continuous conflict between human’s curiosity and satisfaction. The former, being more dominant, has given raise to a multitude of
amazing discoveries we witness in the world today. The later, however, is mostly
responsible for putting a bound over curiosity’s level in order to maintain balance.
Being driven by curiosity, many asks the very basic question about the existence
of life; whether to call it a design or a random chance? There has been a multitude of debates and discussions about such topic and every one has their own
favourite opinion. We don’t know how life started on earth, but irrespective of
whichever mechanism is behind it, is commonly agreed that it is somehow a rare
event [109, 172]. One of the intrinsic properties of life is that, once started, it tends
to reproduce. This process of reproduction, along with the notion of adaptation to
an environment, are prerequisites to biological evolution. Evolution is a dynamic
process mainly depended on the notion of “survival of the fittest” which is central
to the process of natural selection. Thus, making it possible for some forms of life
(extremophiles) to survive in harsh environments while other goes extinct. Over
the time, populations grow, remain constant, shrink, and go extinct. In this thesis, we will study molecular evolution and discuss computational methods for its
study.
1.2
Thesis Overview
This thesis is an attempt to extend further computational methods for phylogenetic inference and provide tools to conduct genome-scale comparative evolution7
8
CHAPTER 1. INTRODUCTION
ary studies.
Reconstructing genes and species phylogenies are crucial to many evolutionary
studies. Traditionally, evolutionists were using morphological data to infer species
evolution, but with the advent of modern molecular biology, the focus has been
shifted towards the molecular data that gives more insight into the historic relationship of genes and species. These relationships are typically depicted using
trees. A gene tree represents homologous relationship between set of genes, while
species tree represents the evolutionary history of a given set of species. Over the
last few decades, a plethora of methods for reconstructing genes and species trees
have been proposed. These methods can be broadly categorized into distancebased and character-based methods. In distance based methods, all the pair-wise
distances between a given set of molecular sequences are calculated and stored
in a matrix representation. The resultant distance matrix is then used to reconstruct a gene or species phylogeny. In Paper I of this thesis, we provide a software
package, called FastPhylo, containing fast tools for reconstructing distance-based
phylogenies. On the contrary, character-based phylogeny reconstruction methods
use the actual genetic information, instead of pre-computed pairwise distances,
to reconstruct gene or species trees. It includes methods like parsimony, ML, and
Bayesian methods. Section 3.2 of this thesis, gives a detail overview of the phylogeny reconstruction methods.
Detecting homologous relationships among set of genes is an important step in
many comparative genomic studies, such as inferring gene families, protein structural and function predictions etc. There are number of methods available that
use sequence similarity measure to detect homology, either by employing a simple
threshold or using graph theocratic approaches. Recent advancements in research
revealed that most of the gene families comes from tandem duplications [122, 51],
which strengthens the support of using syntenic information as one of the parameters for inferring homology [168]. In Paper II of this thesis, we work with the
homology detection problem and present a novel synteny-aware gene homology
method, called GenFamClust (GFC), which uses local-synteny information along
with the sequence similarity to detect homologous sequences.
One of the common problems with sequence similarity methods is that they are
sensitive to the sequence similarity thresholds, and usually end up in very large
families that are not feasible to conduct the downstream analysis, such as Bayesian
phylogenetic analysis. In Paper III, we develop a novel phylogeny-based gene
family inference method, called PhyloGenClust, that uses maximum parsimony
and ML approaches to partition a very large gene family into smaller subfamilies.
PhyloGenClust can be viewed as a middle tier clustering method between a raw
clustering method, such as sequence similarity methods, and a more sophisticated
Bayesian phylogeny inference method.
1.2. THESIS OVERVIEW
9
Recently, a significant number of studies have been attributed to a biological phenomena, known as lateral gene transfer (LGT) or horizontal gene transfer (HGT),
that is responsible for transferring genetic material from parents to offsprings in a
way that is different from the traditional vertical decent fashion. Today, LGTs are
considered to be an important mode of evolution not only among prokaryotes but
also in eukaryotes [92, 78, 29]. In LGT, genes are transferred horizontally between
closely or distantly related species that do not share an ancestor-descendant relationship. It shifted the traditional tree-based view of the evolution of life to a ‘network of life’ [30]. There are number of methods proposed to model LGTs, mostly
rely on parsimony approaches. Despite of the computational efficiency, parsimony
approches, however, lack explicit model of sequence evolution and thus, considered as less biologically realistic. Probabilistic models are also proposed to model
LGTs, giving detailed models of gene and sequence evolution [147, 155, 156]. The
models presented in [147, 155, 156], however, do not provide relative positions of
LGT events which are crucial to many evolutionary studies, involving the identification of LGT highways. In Paper IV of this thesis, we present a probabilistic
method, based on the DLTRS model [147], to sample reconciliations of the gene
tree inside a species tree that gives temporal information of LGT events over the
lineages of a species tree. The method is applied to some of the genome-wide gene
families for mollicutes and cyanobacteria that give an interesting insight into the
potential LGTs highways.
The rest of the thesis is organized as follows. Chapter 2 provides biological background on evolution and discusses different evolutionary concepts that are frequently used in this thesis. Chapter 3 introduces some of the basic phylogenetic
terminologies and discusses the major computational techniques that our methods are based on. Chapter 4 gives a brief summary of the papers included in the
thesis.
Chapter 2
Evolution: a Biological Perspective
In this chapter, we will provide a brief overview to few of the biological phenomena that are central to the evolutionary problems that we will discuss in the thesis.
Section 2.1 discusses some of the basic biological concepts; Section 2.2 explains the
different modes of mutation and selection processes. In Section 2.3 and Section
2.4, we give a brief biological introduction to homology and gene family concepts,
respectively.
2.1
Basic Biological Background
This section provides basic background to some of the biological concepts that are
fundamental to the evolutionary problems that we will discuss later in this thesis.
For more in-depth knowledge about molecular biology, the reader is referred to
the famous book by Alberts and colleagues: Essential cell biology [3].
The cell is the basic structural and functional unit of all kinds of living organisms.
It is the smallest unit of life, often referred as the building blocks of life that can
replicate independently. An organism can either be classified as unicellular or
multi-cellular organism, based on their cellular composition. Unicellular organism, also known as prokaryotes, consists of single cell (bacteria and virus), while
multicellular organisms (eukaryotes) comprised of many cells, such as plants, animals, and fungi. Along with the cell-composition differences, eukaryotes carry
their genetic material inside an enclosed compartment called nucleus, while in
prokaryotes, there is no such nucleus instead the genetic material is contained in
the cytosol. The genetic material comprises of long sequence of molecules called
Deoxyribonucleic acid (DNA).
DNA carries the genetic instructions to develop and maintain an organism. It has a
double helix structure and consists of four bases: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T). These four bases combined with a sugar and phosphate
11
12
CHAPTER 2. EVOLUTION: A BIOLOGICAL PERSPECTIVE
group forms a nucleotide. A sequence of nucleotides, known as gene, encodes a
functional RNA or protein products. The gene is the basic unit of heredity. It is
transcribed into mRNA and mRNA is translated into a protein product but the
reverse is not possible. This phenomenon is known as the central dogma of molecular biology. Genes are located in chromosomes, a chromosome consists of many
genes, and many chromosomes constitutes a genome of an organism. Among different organisms, the size of a genome can vary from few hundred nucleotides
to billions of nucleotides. For instance, the smallest genome size known today
is about 112 kbp, belonging to Nasuia deltocephalinicola, while other organisms,
such as conifers, have genome size ranging from 20 to 30 Gbp [16, 119]. The human genome is 3 billion nucleotides. Although, human genome is pretty long,
only less then 2% of it is actually protein coding, while the rest constitutes of junk
DNA whose function is not known [44] or non-coding RNA genes.
In the following sections, we will discuss some of the evolutionary phenomena
that influence the genome of an organism.
2.2
Mutations and Natural Selection
Mutations refer to random changes in a DNA sequence of an organism. These
changes may happen due to copying errors during the cell’s replication process or
due to DNA’s exposure to radiation, chemical or viral attacks. Mutation usually
represents an advantageous change that may help the cell to survive but it may
also, sometimes, harm the cell and ultimately, results in a cell death. These phenomena are congruent with the Darwinian law of natural selection, which argues
that mutations may occur randomly and survival of individual depends upon its
ability to adopt to the environment.
Mutations and selection go hand in hand. The genome of an organism continuously
experience changes that helps the species to survive and adopt to its environment,
through advantageous mutations, but it may also cause an individual less fit and
more susceptible to environmental changes via deleterious mutations; thus causing
the specie to go extinct. Mutations occur randomly at equal rate all over genome.
Some mutations are local to an individual, such as somatic mutations, while other
mutations that occur in the germ cells are passed on to the offspring.
Mutations can be broadly categorized into two groups: local point mutations and
global mutations, explained in the next subsections.
Local Point Mutations
Local point mutation is type of mutation that occurs due to a nucleotide change
at a single point in DNA. Typically, these changes occur due to DNA’s exposure
to radiation or mistakes in DNA replication process. It may result in replacing
2.2. MUTATIONS AND NATURAL SELECTION
13
one nucleotide with another via substitution, or may insert or delete nucleotides by
introducing indels in a DNA sequence.
Point mutations can be functionally categorized into three major groups: synonymous mutations, non-synonymous mutations, and nonsense mutations. Synonymous
mutations are those point mutations that occur in a codon (triplet of nucleotides)
and do not effect the translation of codon to a specific amino acid. Non-synonymous
mutations, on other hand, changes the translation of codon to a different amino
acid; thus resulting in a completely different protein product. These types of mutations are harmful and sometimes cause diseases. Nonsense mutations convert an
amino acid to a stop codon which results in a truncated protein. The resulting
protein may malfunction or lose function completely.
Other categorizations of point mutations are transitions and transversions. Transitions involve nucleotide change from purine to purine (adenine ↔ guanine), or
pyrimidine to pyrimidine (cytosine ↔ thymine). Transversions accounts for replacement of a pyrimidine base with a purine base or vice versa.
Ideally, mutations should occur randomly at any region in a genome with equal
probability but in practice, we don’t observe such evidence. There are regions
in a genome that are highly susceptible to changes due to low selective pressure,
while others remain highly conserved because of the high selective pressure. For
instance, genes responsible for encoding the regulatory machinery of the cell, such
as promoters and enhancers etc., remain highly conserved throughout the evolutionary history of an organism.
Selective pressure, acting on a protein-coding region of a genome, can be measured
through the ratio ω = d N /dS , where d N represents the number of nonsynonymous
substitutions per site while, dS indicates the number of synonymous substitutions
per site. If ω > 1, then the sequence is believed to have experienced positive or
diversifying selection. Similarly, ω < 1 implies negative or purifying selection, while
ω ≈ 1 suggests that the sequence has evolved neutrally over the evolutionary time
[174].
Global mutations
In contrary to the local point mutations, global mutations, also known as macromolecular mutations, are those mutations that occurs at large scale on a genome.
These mutations mainly include genome rearrangement, chromosomal translocation, gene duplication (tandem, segmental, and whole genome duplications), and
lateral gene transfer.
Genome rearrangement is one of the major global mutations; usually occur due to
errors in the cell division process. It was first discovered, in 1930s, by Dobzhansky
14
CHAPTER 2. EVOLUTION: A BIOLOGICAL PERSPECTIVE
and Sturtevant [36], during their study of the gene rearrangement of Drosophila
pseudoobscura. Later in 1980s, many scientists confirmed such phenomena in other
species [81, 129, 128]. Genome rearrangement has played a vital role in the genome
evolution of fruit fly and Drosophila. It is typical driven by three types of events:
transpositions, reversals (also know as inversion), and trans-reversals [45].
• Transposition event causes a segment of genome to displace from its position
to a random position in a genome. The segment is transposed linearly, forinstance:
transposition
(1 2 3 4 5 6 7) −−−−−−−→ (1 4 5 2 3 6 7)
where, the number represents different genes and underlined numbers represents genes that are displaced to different position through transposition
event.
• In Reversal, a segment is reversed:
reversal
(1 2 3 4 5 6 7) −−−−→ (1 5 4 3 2 6 7)
• Trans-reversal, a segment is cut-out in a reverse order and inserted into a different region in genome:
transreversal
(1 2 3 4 5 6 7) −−−−−−−→ (1 2 6 7 5 4 3).
The segments of genome that move around in a genome are known as transposons
or mobile genetic elements. Barbara McClintock first discovered it in 1940s while
studying color patterns of the corn, Zea mays. Her discovery was awarded with
a Nobel Prize in 1983. She first called transposons jumping genes. These jumping
genes change positions in the genome and replicate themselves; consequently, altering the size of a cell’s genome. Around 60 to 80% of the maize genome [139]
and about 45% of the human genome [100] consists of transposons. Transposons
are classified into two classes based on their mechanism of transposition: Class I
transposons, also known as retrotransposition, (copy and paste) and Class II transposons (cut and paste).
Chromosomal translocation is another type of global mutation where a portion
of chromosome is exchanged with another chromosome. Chromosomal translocation has two main types: reciprocal translocation and robertsonian translocation.
Reciprocal translocation happens due to exchange of segments between nonhomologous chromosomes, while in robertsonian translocation, a whole chromosome is
transferred and attached to the centromere of two acrocentric chromosomes. Chromosomal translocations are usually harmless but it may sometimes lead to miscarriages or children with Down syndrome [99].
2.2. MUTATIONS AND NATURAL SELECTION
15
Gene Duplication
Gene duplication is one of the major evolutionary events where a region of DNA,
that contains a gene, gets duplicated and a new copy of a gene is created.
Gene duplication is an important mechanism responsible for genetic novelty. Many
scientists have recognized the role of gene duplication in the early 1930s [22, 21,
116, 141]; however, in 1970s, Susumu Ohno through a series of bold and controversial statements popularized the research on gene and genome duplication. In 1967,
Ohno proposed gene duplication as the single most important factor in evolution
[120]. Few years later, he reiterated this point in the book Evolution by Gene Duplication [121], postulating gene duplication as the only precursor that lead to the
creation of metazoans, animals and mammals from unicellular organisms; without
duplicated genes, multicellular life would have been impossible. He further hypothesized that at least one whole genome duplication event caused the evolution
of vertebrates.
According to Ohno, a gene after duplication may gets one of the two possible fates.
One of the two copies of genes may carry ancestral function, while the other copy
is free to evolve, acquiring a new function. The process of acquiring new functionality in a fresh gene copy is called neo-functionalization. The ancestral copy of a gene
is more likely to retain its original function due of high seletive pressure, while the
duplicate gene experience less selective pressure and consequently, develop new
function. The second fate of a duplicate gene, in most cases, is to become a pseudogene and ultimately, lost during the evolution.
In contrary to neo-functionalization, the third fate of a duplicated gene, introduced
by Stoltzfus [152] and Force et al [59], is sub-functionalization. In sub-functionalization,
each copy of the gene after duplication specializes in a subset of ancestral functionality (see Figure 2.1). It is a neutral mutation process in which no new adaptations
is formed rather the existing ancestral functionality is divided, mutually exclusive,
between the paralogs after divergence.
The main biological mechanisms responsible for a gene to duplicate are unequal
chromosomal cross-overing (chromosomal translocation) and chromosomal rearrangement (discussed above). Gene duplication can be further categorized into
three kinds, based on the scale of duplication: tandem duplication, segmental duplication, and whole genome duplication.
Tandem duplication refers to the duplication of a small portion of a chromosome
such that the duplicate copies remain adjacent to each other. Due to proximity, the
resultant paralogs share the same regulatory elements and most likely perform
similar functions [51]. Tandem duplication is believed to be responsible for the
evolution of clustered gene families [122]; for-instance: CG32708, CG32706, and
16
CHAPTER 2. EVOLUTION: A BIOLOGICAL PERSPECTIVE
Figure 2.1: Three possible functional fates of duplicated genes after a duplication event:
a) neofunctionalization, b) subfunctionalization and (c) deletion. Blue and red represent
different functions, white represents no function [162].
CG6999 are three genes in a cluster residing in the X chromosome of Drosophila
melanogaster that resulted from two rounds of tandem gene duplication in the last
5 Myr [51].
Segmental duplications (SDs), also know as low copy repeats (LCRs), involves
duplication of large segments of DNA, with typical size ranges from 1 to 400 kbp
in length and share high level of sequence identify (>90%) [142]. SDs can either be
tandem or interspersed, and can be interchromosomal or intrachromosomal. It is
usually caused by errors in chromosomal splicing during genetic recombination.
SDs played an important role in the evolution of primates and often linked to
cause genomic instability and diseases [12]. In humans, around 5% of the genome
consists of segmental duplications [12, 13].
In whole genome duplication, also known as polyploidization, the whole genome
gets duplicated such that the cell of an organism acquires one or more additional
sets of chromosomes. It usually occurs due to abnormalities in the cell division
process. WGD played an important role in the evolution of animal, fungi, and
other organisms [106, 121, 42], especially plants [88]. Ohno suggested that the
early vertebrate lineage underwent two rounds of whole genome duplication [121].
It is also believed that the ray-finned fish underwent three round of WGDs [110].
Lateral Gene Transfer
In contrary to gene duplication, Lateral Gene Transfer (LGT), also known as Horizontal Gene Transfer, is an evolutionary process responsible for transferring genetic material from parents to offspring in a way that is different from the traditional vertical decent fashion. In LGT, genes are transferred horizontally between
closely or distantly related species that do not share an ancestor-descendant relationship. It shifted the traditional tree-based view of the evolution of life to a
2.3. HOMOLOGY
17
‘network of life’ [30]. LGT is commonly mediated by viruses, plasmids and transposons and is common among prokaryotes [29, 98]. It can occur between eukaryotes [92] and from prokaryotes to eukaryotes [78, 29], but less frequently as in the
prokaryotic domain.
LGT has played a crucial role in the evolution of prokaryotes. Around 0.05–80% of
genes in bacterial and archaeal genomes have been subjected to LGT [101]. LGTs
are reasoned to be one of the primary sources for the expansion of gene families in prokaryotes [163, 149] and the evolution of antibiotic resistance genes in
pathogenic bacteria [104]. Due lack of sexual recombination in bacteria, LGTs
are sometime referred as “Bacterial sex” [118]. There is three main well studied
mechanisms for LGTs in prokaryotes: transformation, transduction, and conjugation. Transformation was first discovered in 1928 by Griffith [68] as gene exchange
method where a short fragments of naked DNA is taken up directly from the environment by naturally transformable bacteria and incorporates into their genomes.
Transduction involves exchange of foreign DNA from one bacterium to another
via bacteriophage [179]. In conjugation, genetic materials between bacteria are
exchanged via sexual pilus that requires direct cell-to-cell contact [158].
In contrast to prokaryotes, LGTs seems to have occurred less frequently in eukaryotes. However, this notion is gradually changing with the availability of whole
genome data of diverse eukaryotes. It is now evident that LGTs have occurred in
all major eukaryotic lineages [84]. Many studies have revealed that LGTs are not
only frequent in unicellular eukaryotes [7, 92, 8] but also occur in several multicellular eukaryotes [34, 26, 67, 82, 114, 66, 177]. In eukaryotes, there are two modes of
gene transfer suggested [160, 7]: a gene may transfer from the endosymbionts (mitochondria or chloroplasts-the photosynthetic plastid) to the nucleus of eukaryotic
cell; such a gene transfer is also known as endosymbiontic gene transfer [160]. Secondly, it may transfer between unrelated species [7].
2.3
Homology
Homology comes from a Greek word homologos, where homo means “same” and
logos means “relation”. Thus, in the context of molecular evolution, homology
is referred as the relatedness between two or a set of genes that shares a common ancestry. The concept of homology is considered as pre-evolutionary and
was first introduced by Richard Owen in 1843 [75]. Owen defined a homologue as
“The same organ in different animals under every variety of form and function”
[125]. His definition was considered as non-transformative that did not accommodate Darwin’s transformative theory of evolution by natural selection and was
restricted only to animals [75]. Later in 1870, Ray Lankester, an English anatomist
and embryologist, redefined homology by considering the shared evolutionary
ancestry into account: “Structures which are genetically related, in so far as they
have a single representative in a common ancestor, may be called homogeneous”
18
CHAPTER 2. EVOLUTION: A BIOLOGICAL PERSPECTIVE
[102]. Over the past few decades, homology has emerged as a hierarchical concept that has been studied at different levels in the evolutionary process [153, 73];
for-instance, Structure homology (reflects the final patterns of evolution) and Developmental homology (deals with the processes of evolution) are important concerns
in comparative evolutionary biology today [75, 166, 74]. Here in this thesis, we
will mainly discuss Genetic homology, a major class of developmental biology that
concerns relatedness of genes based on shared ancestry.
In the absence of fossil records for extinct species, it is, however, not possible to
determine exactly the common ancestor and all the intermediate forms for a given
set of genes. Consequently, detecting homology becomes an inference problem
[97], where sequence similarity can be treated as an observable variable expressed
in numeric with certain probability assigned to it. A somewhat related concept to
homology is synteny that refers to the co-localization of two or a set of genes on a
genomic landscape. The evidence that most of the gene families comes from tandem duplications [122, 51], rather convinced scientists to use syntenic information
as one of the parameters for inferring homology [168, 4].
Homology between two or a set of genes can be classified into different kinds,
based on the type of evolutionary events at the last common ancestor (LCA): paralogy, orthology, or xenology. Paralogy refers to the homologous relationship between two or a set of genes such that their last common ancestor stems from a
speciation event; the set of genes are referred as paralogous genes. Similarly, orthology and xenolgy relationships are defined by duplication and LGT events, respectively. Paralogous genes are further classified into two groups, based on temporal difference of duplication events: In-paralogs and out-paralogs. In-paralogs,
also known as super-paralogs or co-orthologs, are homologues that derived from
a LCA by duplication postdating the speciation in question; while in the outparalogs, duplication predates the speciation.
2.4
Gene family
A gene family is a group of homologous genes that tends to perform similar function; although it may not always true, such as the case with neo-functionalization.
Like homology, determining a gene family is an inference problem since we don’t
know how evolution has happened over the tree of life? And also, as mentioned
before, many considers evolution as a network or web of life [30, 71]. There has
been a continuous debate on tree-based vs network based homology; readers are
encouraged to read a review paper on such topic [71]. Consequently, it is rather
difficult to have a concrete definition of a gene family. However, there is a mutual
consensus among the biologists that orthologous genes tend to perform similar
function then paralogous gene, also known as orthology conjecture. Thus, many try
to infer groups of orthologous genes and call them gene families. In literature, one
2.4. GENE FAMILY
19
may find several definitions of a gene family, based on either sequence or structure similarities or sometimes sharing similar domains. The term gene family is
often interchangeable with the protein family, where a family is defined as a group
of protein sequences having similar structure [28]. Similarly, other aliases of gene
family are domain family (group of protein sequences sharing one or more common
domains) [24] and super family [130] etc.
Sailing along the tree based thinking, the most plausible and Fitch-inspired [57]
definition of gene family was given by Wapinski, et al [168], where they suggest
that a set of genes that descended from a single common ancestral gene constitutes
a gene family and termed it as Orthogroup [168]. In order to avoid confusion, from
now onward, we will use the term gene family that corresponds to genes evolved
through vertical decent from a common ancestor. In paper-III of this thesis, we
discuss the problem of defining gene families in detail, formalizing the definition
of gene family and provide a novel algorithm to infer gene families.
Chapter 3
Introduction to Computational
Methods in Evolution
“A model is not the ultimate truth; it’s rather a representation how we
see the world behaves.”
—Anonymous
In this chapter, we will discuss some of the state-of-art phylogeny reconstruction
methods. Section 3.1 introduces few basic phylogenetic terminology used in the
rest of the chapter. Section 3.2 discusses distance-based and character-based approaches for phylogeny reconstruction. Section 3.3 covers reconciliation-based
phylogeny reconstruction methods. Finally, Section 3.4 describes computational
methods and tools for inferring gene families.
3.1
Basic phylogenetic concepts
In this section, we will briefly discuss basic phylogenetic terminology that will be
used later in this chapter.
Multiple sequence alignment, gene tree, and species tree
One of the crucial steps in comparative genomics and evolutionary studies is the
multiple sequence alignment (MSA). Multiple sequence alignment involves comparing more than two sequences to identify regions of similarity. These regions
of similarity, also referred to as conserved regions, are important indicators in exploring the structural, functional and evolutionary relationship between a given
set of sequences. Almost all phylogeny reconstruction programs take an MSA as
input. It is commonly represented by a matrix with one sequence in each row.
Each column of an MSA represents homologous sites, with gaps representing insertions or deletions. Constructing a biologically plausible MSA for a given set of
21
22
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
EVOLUTION
sequences is a computationally difficult task and most formulations of the problem are shown to be NP-complete [167, 46]. However, there exists several heuristic
methods, including iterative refinement methods [115], progressive methods [79],
and progressive-iterative hybrid methods [41, 91], that work well in practice. Errors in a MSA propagates to the downstream analysis, therefore it is important to
choose best method for a given set of sequences. It is often recommended to use
visualization tools, such as Jalview [170], to check the quality of an MSA before
conducting a downstream analysis.
A gene tree in a phylogenetic jargon refers to a bifurcating tree that represents a
homologous relationship between a set of genes sharing common ancestry. A gene
tree is usually represented by a rooted tree, with leaves representing extant genes
and a root vertex represents the ancestral gene from which all the extant genes
originated. Each internal vertex represents an intermediate ancestral gene that has
evolved through one of the evolutionary events, i.e., duplication, speciation, or
LGT events. The branch or edge lengths measures the amount of evolution along
the branches and is typically expressed in units of expected number of nucleotide
substitutions per site.
Figure 3.1: shows Darwin’s first known sketch of an evolutionary tree [31].
Similarly, a species tree is also a bifurcating tree, representing the evolutionary
history of a given set of species. The root vertex represents the ancestral species of
all the given set of species, while leaves represents all the extant species. Internal
vertices represent speciation events. A speciation event is an evolutionary event
that causes parts of a population to evolve independently and eventually become
a distinct species. There are several reasons for a speciation event to occur: a geographic isolation that may results in different selective pressures, isolation due to
some other reasons, and genetic drifts that may cause the population, even without isolation, to evolve independently. Inferring a correct species tree is a difficult
3.1. BASIC PHYLOGENETIC CONCEPTS
23
task in evolutionary biology; perhaps, it has been studied for long time now. The
first species tree was given by Darwin (see Figure 3.1). It uses the morphological
traits to describe the relationship between organisms. Since the discovery of DNA,
scientists are using molecular data to infer a species tree. There are a number of
methods available to infer a species tree and we will discuss them later in this
chapter.
Trees within trees: Reconciliation vs Realization
In the last few decades, there is an extensive use of trees within trees data structure
in phylogenetic field. Although, the concept is applied earlier to other biological disciplines, e.g., biogeography and parasitology, independently [127]. Trees
within trees have been used to model some sort of historical associations between
a host entity and a guest entity. Depending of the field, these historical associations
can be between genes and organisms (phylogenetics), organisms and organisms
(parasitology), and organism and areas (biogeography) [126]. Trees-within-trees
data structure encapsulates the idea that a guest tree evolves inside a host tree using certain historical events. In phylogenetics, these historical events may be gene
duplication, gene loss, speciation, and lateral gene transfer.
Reconciliation is a trees-within-trees data structure that is used to model the evolutionary history of a gene tree inside a species tree. More formally, a reconciliation
refers to a mapping that associates each gene tree vertex with a species tree vertex,
implying speciation, or a species tree edge, representing either a duplication or
LGT event. Each leaf of a reconciled gene tree is mapped to a leaf of species tree,
representing species to which the gene belongs. Figure 3.2 shows a pictorial representation of two reconciliations of the same gene tree with a species tree. Due to
the continuous evolutionary time, there can be numerous reconciliations of a gene
tree G with a species tree S that makes it is difficult to determine the expected time
of evolutionary events over the edges of a species tree. In such scenarios, we work
with a time constraint reconciliation which we will discuss in the next paragraph.
Later in this chapter, we will discuss some of the popular phylogeny reconstruction methods that uses reconciliation approach to infer the evolutionary history of
a gene tree inside the species tree.
A realization is a type of trees-within-trees data structure that maps each vertex of a
gene tree either to the vertex or to an edge of a discretized species tree. A discretized
species tree refers to a species tree that is discretized in time, either equidistance
or some kind of discretization scheme. Realization is a time constrained reconciliation where the occurrence of each evolutionary event can be specified in an
exact evolutionary time. It is a subset of reconciliation i.e., each reconciliation
contains one or more realizations that are unique to that reconciliation. Both in
reconciliations and realizations, the child-parent relationship of gene tree vertices
are preserved and remain consistent with that of the gene tree, i.e., a gene tree
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
EVOLUTION
24
Figure 3.2: Two different reconciliations of the same gene tree (thin tree) with the species
tree (fat tree), each of which contains two different realizations. Vertices colored in green
represents duplication events, blue represents LGTs, and black represents speciation events
[147].
vertex is never mapped closer to the root in the species tree than its parent. Along
with this, in a realization, a child and its parents never get mapped to the same
discretization time. Figure 3.2 shows two different realizations for each of the two
reconciliations. In Paper-IV, we introduce a novel method to sample realizations
from the posterior distribution to estimate the time of LGT events over the discretized species tree.
3.2
Computational methods for phylogeny reconstruction
In this section, we will discuss some of the current state-of-art computational
methods used to reconstruct the evolutionary history of life. Phylogeny reconstruction methods, in general, can be broadly categorized into two groups: distancebased methods and character-based methods.
In distance based methods, all the pairwise distances between a given set of sequences are calculated and stored in a matrix representation. The resultant distance matrix is then used to reconstruct a gene or species phylogeny. In contrary
to distance-based methods, character-based methods use the actual genetic information instead of pre-computed pairwise distances. It uses genetic information
at single character1 in the sequence alignment to simultaneously compare all the
sequences and compute a tree score for each gene tree. A tree scoring function
uses these tree scores to select the best tree among other competing tees. Examples of character-based methods are maximum parsimony, maximum-likelihood,
and Bayesian methods which we will discuss in the later sections.
1 Character
here refers to a site in the sequence alignment
3.2. COMPUTATIONAL METHODS FOR PHYLOGENY RECONSTRUCTION25
Distance-based methods
Distance-based methods are one of the major classes of phylogeny reconstruction
methods. It involves two basic steps: first, all the pairwise distances between a
given set of sequences are calculated and stored in a matrix D. These distances
are calculated assuming a Markov model of nucleotide/amino-acids substitution.
A Markov model of nucleotide/amino-acids substitution measures the evolution
of molecular sequences under study. There are number of Markov substitution
models proposed for DNA and protein sequences, for instance, JC69 [90], K80 [96]
, HKY85 [77], GTR [159] etc. These substitution models differ in parameters that
are used to estimate the rate of substitution from one nucleotide to another.
In the second step, a tree inference method is applied to the resultant distance matrix D to reconstruct a phylogenetic tree T with branch lengths. A distance matrix
method tries to minimize the difference between the input distances and the distances induced by the phylogenetic tree. Distance matrix methods can be categorized into three groups, based on the measure of closeness between the calculated
distances and estimated evolutionary distances: Least square methods, minimum
evolution methods, and neighbour joining methods.
Least square methods [25, 58] try to minimize the difference between the calcuT induced by a
lated distance Di,j in the first step and the evolutionary distance Di,j
tree T (i.e., the sum of branch lengths on the tree connecting i and j), where i and j
are two sequences. Least square methods selects the best tree with minimum least
square difference i.e.:
T 2
arg min ∑( Di,j − Di,j
)
(3.1)
T
i,j
In contrary to least square methods, minimum evolution (ME) methods [43, 95,
136, 35] use the tree length2 to select the best tree. However, it uses the least square
criterion for calculating the branch lengths on a tree. ME methods are biased towards trees with shorter tree lengths than the trees with longer tree lengths. Trees
with minimum tree lengths are likely to be selected as the correct trees under the
minimum evolution criterion.
Neighbour joining (NJ)[138] is one of the most popular and widely used distance
matrix methods. It is an agglomerative clustering method used to reconstruct phy0
logenetic trees. It starts with a star tree T , having a root vertex x and iteratively
picks two vertices i and j based on the taxon distances in a matrix Q (3.2) (where Q
is calculated from input distance matrix D) and join them together. The two vertices i and j are replaced by a new ancestor vertex y, thus reducing the number of
vertices at the root vertex x by one (see Figure 3.3). New distances between y and
2 Tree
length is the sum of branch lengths on a tree.
26
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
EVOLUTION
Figure 3.3: shows one iteration of the neighbour joining algorithm. Two vertices (in this
case, 1 and 2) are collapsed into a new vertex y, thus reducing the number of nodes at the
root (vertex x) by one [175].
the rest of vertices are calculated and updated accordingly in the distance matrix.
This process continues until a fully resolved tree is obtained.
n
Q(i, j) = (n − 2) D (i, j) −
∑
n
D (i, k) −
k =1
∑ D( j, k)
(3.2)
k =1
where D (i, j) is the distance between taxa i and j.
NJ has an interesting property that it most often reconstruct the correct tree if the
input distance matrix D is nearly additive [11]. A distance matrix D is said to be
nearly additive if there is a distance D T induced by a phylogenetic tree T such that:
| D − D T |∞ <
µ( T )
2
(3.3)
where µ( T ) is the minimum branch length in T. There are a number of NJ based
methods [83, 171, 60, 55, 108] that uses this property; however, due to the computational cost associated with each of these methods, it is sometime not feasible to
use such methods on very large datasets. However, there are other NJ methods
with fast implementation, such as [94, 143], that can handle hundred of thousands
of taxa to compute phylogentic trees.
Despite the algorithmic simplicity and speed of NJ, it is, however, sensitive to
alignment errors and sequences that are evolutionary more diverge. In the next
section, we will study more realistic evolutionary models, such as Bayesian and
likelihood methods, that are less prone to these problems and incorporate detail
models of sequence evolution, gene tree evolution, and species evolution.
3.2. COMPUTATIONAL METHODS FOR PHYLOGENY RECONSTRUCTION27
Character-based methods
Parsimony
In evolutionary studies, maximum parsimony is one of the most widely used phylogeny reconstruction method. Parsimony, in general, refers to the simplest scientific explanation that fits the observed data. It relies on the principle of Occam’s
razor that prefers a hypothesis with fewest assumptions over the other competing hypotheses. In phylogenetics, the same principle is applied to select a gene
tree topology that requires minimum number of evolutionary events to explain
the evolutionary relationship among the given taxa. Edwards and Cavalli-Sforza
were among the firsts who advocated the use of parsimony for phylogenetic tree
reconstruction [43]. Parsimony methods are most often used due to its simplicity and computational efficiency over the more sophisticated and computationally demanding probabilistic methods for phylogeny reconstruction. Despite of
the popularity, parsimony-based methods suffers from the long-branch attraction3
problem that restricts its usage in scenarios where there is more among-site rate
variation in the alignment. There are number of tools avaiable that provide the
implementations of pasimony-based phylogeny reconstruction methods but the
most commonly used are PHYLIP [55], MEGA6 [157], NONA [63], and PAUP*
[154].
Maximum-likelihood methods
Maximum likelihood is one of the most widely used methods to estimate the parameters of a statistical model given the observed data. It was first introduced
by the famous English statistician, R. A. Fisher, around 1920. However, it was
first incorporated into the phylogenetic studies by Edwards and Cavalli-Sforza
[43] and later, popularized by Joseph Felsenstein [52, 53]. Felsenstein introduced
an algorithm, commonly known as Felsenstein’s pruning algorithm [52, 53], for
computing the likelihood of an evolutionary tree with branch lengths given the
sequence data.
In general, the likelihood function is defined as the probability of the observed
data D given the parameters Θ. Mathematically, it can be expressed as:
L(Θ | D ) = p( D | Θ)
(3.4)
where L(.) denotes the likelihood.
It is important to note that the L(.) is not a
R
probability distribution, i.e. Θ L (Θ | D ) 6= 1, rather it is the function of Θ only
since D is treated as fixed.
3 Long-branch attraction involves the problem of inferring an incorrect tree with long branches
grouped together by character-based methods under simplistic models [175].
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
EVOLUTION
28
The maximum likelihood estimate (MLE) of Θ is defined as the Θ∗ that maximizes
the likelihood function L(.):
Θ∗ = arg max L(Θ | D )
Θ
(3.5)
For some models, it is possible to determine the ML estimate as an explicit function of the observed data D. As an example, let the data D be generated from a
Gaussian distribution with mean µ and standard deviation σ, and θ = µ, σ2 .
We can estimate the parameters θ by solving Eq. 3.5 analytically for µ and σ, i.e.,
set the Eq.3.5 to zero, take its derivative with respect to µ and σ and ultimately,
solve the resulting equations. However, there are many other models for which
there are no closed form solutions available. In such cases, MLEs are computed
numerically using optimization methods, such as Expectation Maximization (EM)
method [33].
MLEs, however, have some interesting asymptotic properties that convinced the
evolutionary biologists to use ML methods as a tool of choice, i.e, under the sufficiently large data size, MLEs are consistent (approach the true values), unbiased,
and statistically efficient (having the smallest variance among unbiased estimates)
[175]. ML based methods have greater advantage over the parsimony and distance based methods in predicting the correct evolutionary tree; however, it is
more computationally demanding [56]. There are a number of software packages
available that implement ML-based methods, with the most popular are: PHYLIP
[55], PAUP* [154], RAxML [151], PhyML [70], PAML [173], and GARLI [180].
Bayesian methods
In recent years, Bayesian inference has been introduced as a powerful tool to reconstruct phylogenetic trees. It is based on Bayes’ theorem (3.6) developed by
Thomas Bayes. Bayesian inference is related to maximum likelihood but answers
totally different question. In ML methods, the basic question asked is what is the
probability of data D given that a certain hypothesis H is true? While, Bayesian methods answer question such as what is the probability of a hypothesis H being true given
that we observe the data D? i.e.,
P( H | D ) =
P( D | H ) P( H )
P( D )
(3.6)
where
• P( H ) is the prior probability of H being true before observing the data D.
• P( D | H ) is the likelihood of the data D given the hypothesis H.
• P( D ) represents the probability of data D ignoring which H is true.
3.2. COMPUTATIONAL METHODS FOR PHYLOGENY RECONSTRUCTION29
• P( H | D ) denotes the posterior and represents our updated belief that H is
true after observing the data D.
In the above equation, P( D ) is often intractable, involving high dimensional integrals over all the possible H. Bayesian methods, in such cases, relies on sampling
methods such as MCMC ( see MCMC-based methods section for more details) that
allows P( X ) to factor out as a normalizing constant, yielding Eq. 3.6 as:
P( H | D ) ∝ P( D | H ) P( H )
(3.7)
Unlike the ML methods, the parameters of the prior distribution may not be fixed
i.e., the parameters may come from a distribution with its own set of parameters,
known as hyperparameters. The use of a prior, however, is a debatable topic [17] but
it is one of the strengths of Bayesian methods that allows the researchers to incorporate prior belief about the parameters of a model. In the absence of prior knowledge about the parameters, it is often recommended [87] to use non-informative
(flat) priors and let the data governs the posterior but in such cases, the resulting
posterior becomes proportional to the likelihood. However, it has been shown that
with the increasing amount of data, informed priors often have less influence on
the results i.e., yielding similar posterior distributions [18].
MCMC-based methods
As mentioned in the previous subsection, Bayesian inference often involves various integration and optimization problems, especially when dealing with high
dimensional parametric space. Computing the full joint density in such scenarios
become intractable and the only way is to find an approximate solution. MCMC
provides such solution by sampling from a probability distribution based on constructing a Markov chain that has the target distribution ρ as its stationary distribution µ.
A Markov chain is a sequence of random variables X1 , X2 , ... on some finite state
space S = {s1 , ..., sn } such that the conditional distribution of Xn+1 given X1 , X2 , ...,
Xn depends on Xn only i.e.,
P( Xn+1 | Xn , ..., X1 ) = P( Xn+1 | Xn )
(3.8)
Eq. 3.8 is also known as the Markov property or sometimes referred to as memoryless property since the distribution over the states for the next step is only dependent on the current state, instead on the states preceding it. A Markov chain has a
transition matrix T that describes the probability of changes of state. For instance,
the probability Tij of transitioning from state i to state j is given in the ijth position
in the matrix T, where Tij = P Xn = s j | Xn−1 = si and ∑ j Tij = 1 (in discrete
case).
30
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
EVOLUTION
Now, suppose that the system is at an initial state si taken randomly from an arbitrary initial distribution π = h P (s1 ) , ..., P (sn )i, where π corresponds to a row
vector. The probability distribution over the states after one step can be computed
by multiplying row vector π with the transition matrix T, i.e., πT. Similarly, we
can compute the probability of the system at any state s j after j steps by multiplying the jth order transition matrix T i with the initial row vector π, i.e., πT i .
One of the desirable properties of Markov chain is that the distribution on the
states converges to a unique invariant distribution µ, πT n → µ as n → ∞, provided that it is irreducible and aperiodic. A Markov chain is said to be irreducible,
if any state si is reachable from any other state s j in a finite number of steps. To
define aperiodicity, we need to define periodicity first. A state si has a period k, if the
same state si is always reachable in a multiple of k time steps.
d(i ) = gcd {n > 0 : P( Xn = i | X0 = i ) > 0}
(3.9)
where “gcd” is the greatest common divisor. As an example, the periodicity of
a complete bipartite graph is k = 2 since it is possible to reach the same sate in
{2, 4, 6, 8, 10, ...} time steps. Any state si with d(i ) = 1 is said to be aperiodic and a
Markov chain is aperiodic if all its states are aperiodic.
A sufficient condition of a Markov chain that ensures µ as the target distribution ρ
from which we can sample is known as the detailed balance equation:
µi Tij = µ j Tji
(3.10)
When dealing with high dimensional problems that involves infinite state space
and where direct sampling from a probability distribution is difficult to achieve,
it is often recommended to use a Metropolis-Hastings algorithm. MetropolisHastings is an MCMC-based algorithm that constructs a random walk on a graph
representing the search space. Instead of using the transition matrix, it uses a set of
0
proposal distributions Qk ( Xk | X ) to construct a Markov chain, where 1 < k ≤ n
and X denotes n-dimensional state space, i.e., X = { X1 , ..., Xn }. The components
of X can either be discrete or continuous variables; each component in itself can
also be a multi-dimensional random variable, Xk = { x1 , ..., xn }. Given that we
0
are in the current state X, the new state X is proposed according to some pro0
0
posal distribution Qk ( Xk | X ), where the proposed state X differs only in the
kth component from the current state X. In each iteration of a Markov chain, the
perturbation of a component is taken randomly. The proposed state is accepted
according to the acceptance probability,
!
0
0
0
P ( X ) Q k ( Xk | X )
(3.11)
A( X, X ) = min 1,
0
P ( X ) Q k ( Xk | X )
3.3. RECONCILIATION-BASED PHYLOGENY RECONSTRUCTION
METHODS
31
It takes some time for a Markov chain to eventually converge to the desired distribution. This initial time period is called as the burn-in period of a Markov chain
and is mainly attributed to a lousy starting point of a random walk. The argument
is that a ‘bad’ starting point may result in over-sampling of the regions with low
probabilities under the stationary distribution before it converges to the desired
distribution. It is, thus, recommended to throw away the initial k samples before
start collecting the points. Subsequently, the size of k becomes the next decision
problem. There are number of diagnostic tests proposed to decide upon the burnin value, with the most commonly used are Geweke [62], Gelman-Rubin [61], and
Sahlin-Höhna’s Max-ESS method [137].
Although, MCMC based methods provide nice theoretical guaranties to converge
to the desired distribution, however, it comes at the cost of a high computation
burden. However, with the availability of computational resources and code parallelization techniques, it is now feasible to apply the MCMC framework to real
world problems. For instance, in phylogenetics, the MCMC framework has been
extensively used in the Bayesian inference of phylogenetic trees. Some of the most
popular and commonly used tools that use MCMC to reconstruct the evolutionary
history are MrBayes [134], BEAST [39], PhyloBayes [103], PrIME [2] and JPrIME
[146]. In Paper IV of this thesis, we introduce a novel method that employ the
MCMC framework to reconstruct gene trees and infer the time estimates of evolutionary events (gene duplications, speciations and lateral gene transfers) over the
edges of the species tree.
3.3
Reconciliation-based phylogeny reconstruction methods
In this section, we will discuss some of the state-of-art reconciliation based phylogeny reconstruction methods. Reconciliation methods consider an MSA, the
species tree, and the mapping of extant genes to the leaves of the species tree as an
input. It uses a reconciliation approach (discussed in Section 3.1) to map the evolutionary events onto gene trees in such a way that they are concordant with the
species tree. The evolutionary events can either be gene duplications, gene losses,
speciations, or lateral gene transfers.
In phylogenetics, the reconciliation technique is widely used for inferring gene
trees [40, 133], estimating the DLT rates [147], and inferring orthology relationships [169, 164]. Apart from phylogenetics, the concept of reconciliation is widely
used in biogeography and parasitology [127, 23].
In phylogenetics, reconciliation based methods are mainly categorized into two
groups: parsimony-based reconciliation methods and probabilistic reconciliation
methods. We will discuss each of them in the following subsections.
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
last decade (see, for instance, [138–142]). In parsimony, LGT adds a EVOLUTION
level of
complexity compared to gene duplication and loss, partly because multiple MPR
Parsimony-based
methods
optimas often coexist.
Initially having limitations such as being applicable only on
monocopy families
or allowing LGT
backwards
in time, therereconstruction
are today efficientrelies on
A parsimony-based
reconciliation
method
for phylogeny
the parsimony
principle
(discussedloss
in and
Section
3.2) sound
to reconcile
a gene
tree with
methods that
model duplication,
temporally
LGT events,
for inthe species
tree.
The
gene
tree
that
allows
the
evolution
of
sequences
with
stance the work by Tofigh and Lagergren [143], David and Alm [144], Bansal et fewest
changes is seleted. These changes may be gene duplications, gene losses, or latal. [145,146], Daubin et al. [28,147], and Doyon et al. [148,149]. A major shorteral gene transfer events. The concept of gene-species tree reconciliation was first
coming
many combinatorial
is that sequence
evolution and
introduced
byofGoodman
et al in frameworks
[64] and provided
an algorithm
thatreconuses parsiciliation constraints
are notthe
taken
intoparsimonious
account simultaneously.
Also,(MPR)
LGT from
mony approach
to compute
most
reconciliation
of a gene
tree with
the
species
tree.
Most
parsimonious
reconciliation
is
a
reconciliation,
species not included in the analysis can complicate gene family analysis, either by
with a minimum
cost,those
that stemming
uniquelyfrom
maps
of a gene tree
introducing reconciliation
homologs alongside
thethe
mostvertices
recent common
to the vertices or edges of a species tree. Figure 3.4 shows two reconciliations, i.e.,
ancestor of the studied species, or by giving rise to the entire family.
32
an MPR and a non-MPR scenarios, of the same gene tree.
Figure 3.4:
shows
reconciliations,
an MPR
non-MPR
scenarios,
of the same gene
Figure
10.two
Shows
two ways to reconcile
theand
samea gene
tree topology.
The non-MPR
tree [144].
solution induces two implicit losses (note that most methods work with the pruned
topology only).
In maximum parsimony reconciliation methods, each event is associated a cost,
speciation events are usually considered as null events and are therefore assigned
a zero cost
duplication,
and
transfer
assignedand
non-zero
Not while
surprisingly,
there is a loss,
twilight
zone
betweenevents
purely are
combinatorial
prob- costs.
A reconciliation
cost
may
be
defined
as
a
simple
duplication
cost,
involving
the
abilistic approaches. For example, sampling among equally optimal solutions in
number of inferred duplication events on a reconciled gene tree; a duplication-loss
LGT-enabled parsimony to achieve better estimates was suggested recently [146].
(DL) cost, i.e., the sum of inferred duplication and loss events; a transfer-loss (TL)
Another approach
to transfer
let duplication-loss
improvements of
cost involving
the sumisof
and lossreconciliations
events, or aguide
duplication-loss-transfer
an
a
priori
estimated
tree
inferred
using
sequence
data
[150–152].
(DLT) cost that accounts for the number of inferred duplication, loss, and transfer
events [105].
Probabilistic
effortsthere has been a plethora of reconciliation-based method
Over the
last few decades,
proposed. Each of them comes with different algorithmic flavours and trying to
More orthodox probabilistic gene-species tree reconciliation methods have mostly
minimize one of the aforementioned reconciliation costs. Page et al. [126] introduced
relied onthat
natural
extensions
of a trees
birth-death
[153]tree
to model
an algorithm
reconciles
gene
with process
a species
usingthe
DLpropensicost. Guigo et
ty for duplication, loss [154–157] and – in some cases – LGT events [48]. Such
models are employed in the papers of this thesis. A common way to escape the
computational burden that all possible reconciliations may pose under duplication-loss models is to confine the reconciliation space to MPR or near-MPR solutions, see, e.g., the work by Rasmussen and Kellis [123,158], and Boussau et al.
39
3.3. RECONCILIATION-BASED PHYLOGENY RECONSTRUCTION
METHODS
33
al. [69] used Goodman et al. [64] framework to find the species tree, with minimum
number of duplication events, from a given set of gene trees. Under the model
of duplication-loss parsimonious reconciliation, polynomial time algorithms are
provided in [178, 15]. Górecki et al. [65] proposed an architecture that allows the
whole set of reconciliations between a gene tree G and a species tree S and gave
a polynomial time algorithm to find one of the MPRs. Wapinski et al. [168] introduced a parsimony based method that uses synteny information along with the
sequence similarity to reconstruct gene tree with minimum DL reconciliation cost.
Analogous to the case of DL models, there are a number of gene-species tree reconciliation methods proposed that focuses on transfer-loss events [76, 19, 80, 89, 117].
Similarly, several algorithms have been introduced to compute parsimonious reconciliations under the DLT model, accounting for gene duplications, gene losses,
and lateral gene transfer events [27, 38, 124, 161, 14, 38].
Hahn showed in his research [72] that reconciliation based methods are biased
when the inferred gene tree is not correct. He conducted analysis on multiple
mammal and Drosophila genomes and demonstrated that reconciliation bias places
the duplication events near the root of the tree and losses towards the tips of the
tree. This bias is more in cases where there is a low bootstrap support for branches
on the gene tree. To address such uncertainty in gene tree, the branches with low
bootstrap support are collapsed, introducing polytomous nodes. Durand et al.
[40] address this problem by introducing an exact algorithm, implemented in a
software called NOTUNG, that uses the DL model to find the most parsimonious
reconciliation when gene tree is polytomous.
Probabilistic methods
Probabilistic methods for phylogeny reconstruction can be grouped into two categories: Maximum likelihood methods and Bayesian inference methods (discussed
in Section 3.2). In this section, we will focus more on Bayesian methods for phylogeny reconstruction that are based on generative models.
In phylogenetic settings, a generative model refers to a model that generates a gene
tree and sequences on a given dated species tree4 . There are number of generative
models proposed, describing the evolution of gene tree inside species tree. These
models can be categorized into groups based on the type of evolutionary events
that governs the process of evolution, i.e., duplication-loss model (DL), transferloss model (TL), duplication-loss-transfer model (DLT).
The basic underlying machinery of the aforementioned generative models are
based on a birth-death process that generates gene duplications, gene losses, and/or
lateral gene transfer events over the edges of a species tree, yielding a gene tree at
4 Dated
species tree is a species tree with times associated with the species tree edges.
34
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
EVOLUTION
the end of the birth-death process. DL models use a standard birth-death process
[93] to generate duplication and loss events, while DLT models uses an extension
of the birth-death process that incorporates transfer events along with duplications and losses (in Paper IV, we use such a birth-death process). Once a gene tree
is obtained, gene/protein sequences are generated using any standard substitution model of choice. In the following subsections, we will discuss some of the
probabilistic DL and DLT models.
Duplication-loss models
Duplication-loss models explain the discordance between the gene tree and the
species tree that arise due to gene duplications and gene losses. The first probabilistic duplication-loss model, also known as the gene evolution model, was given
by Arvestad et al. [9], where they modeled the evolution of a gene tree inside a
species tree, using a linear birth-death process [93]. In [10], the same group further
extended the gene evolution model to the gene sequence evolution model by incorporating the sequence evolution model that uses a strict molecular clock. Later,
Åkerborg et al. [2] relaxed the strict molecular clock assumption by using identical independent gamma distributions to model rate variation and termed the gene
sequence evolution model as GSR model (now known as the DLRS model). DLRS
uses MCMC framework to integrate three probabilistic submodels: a duplicationloss model that describes gene evolution, a substitution rate model explaining the
rate variation over the gene tree, and a sequence evolution model that describes
substitution events. The software implementation of DLRS model is provided in
PrIME [140] and JPrIME [145].
These probabilistic models are, however, computationally expensive in time to
cover the full posterior probability landscape, but there are other methods that
constrain the gene tree reconciliation space to a near-MPR space [133, 20]; thereby
achieving similar accuracy with a significant computational speedup.
Duplication-loss-transfer models
As the name suggests, duplication-loss-transfer models incorporate gene duplications, gene losses, and lateral gene transfer events. The first comprehensive
probabilistic DLT model, known as DLTRS5 model, was given by Sjöstrand et al.
[147]. The DLTRS model is an extension of the DLRS model that integrates DLT
events, substitution rate variation over a gene tree, and sequence evolution model.
It uses an extension of a linear birth-death process that consider lateral gene transfer events along with the gene duplication and gene loss events, to generate a gene
tree. DLTRS model allows the lineages of a gene tree to jump from one species lineage to another in a contemporaneous time. Figure 3.5 illustrates the basic compo5 DLTRS is an acronym of Duplication, Loss, Transfer, substitution Rate variation over gene tree,
and Sequence evolution.
3.3. RECONCILIATION-BASED PHYLOGENY RECONSTRUCTION
METHODS
35
nents of the model. For a given a multiple sequence alignment D, a dated species
Figure 3.5: An illustration of the DLTRS model (adopted from [107]), where a) shows the
evolution of a gene tree (thin tree) inside the species tree (fat tree) according to the DLT
model, capturing gene duplications (green), speciations (black), gene loss (red), and lateral
gene transfer event (blue), b) Illustrates the molecular clock relaxation using the Rate model,
and c) Shows the sequence evolution model.
tree S, and a gene-species tree mapping σ, where σ : L( G ) → L(S), the DLTRS
model computes a posterior distribution over all the gene trees G, branch lengths
l over the gene trees, and other parameters θ = (µ, δ, τ, m, ν) of the model, where
µ corresponds to birth rate, δ refers to loss rate, τ is the transfer rate, m and ν are
the edge rate mean and coefficient of variance, respectively. Mathematically, the
probability density of the DLTRS model can be expressed as:
p( G, l, θ | D, S) =
P( D | G, l ) p( G, l | θ, S) p(θ )
P( D | S)
(3.12)
Where the numerator is factorized into the joint likelihood and prior probability
densities. The factor P( D | G, l ) of the likelihood can be computed using the
standard DP algorithm introduced by Felsenstein [54], while the second factor
p( G, l | θ, S) can be computed using a dynamic programming algorithm presented
in [147]; p(θ ) is the prior probability distribution of parameters θ. The denominator P( D | S) is intractable and gets cancel out when computing the ratio between
two successive posterior probabilities. In Paper IV of this thesis, we extend the
36
CHAPTER 3. INTRODUCTION TO COMPUTATIONAL METHODS IN
EVOLUTION
DLTRS method and propose a dynamic programming algorithm for sampling reconciliations and computing the MAP reconciliations.
Recently, Szöllősi et al. [155, 156] introduced probabilistic DLT models that reconstructs the chronological ordered species phylogeny with the assumption that
some of the LGTs may come from the unsampled or extinct species.
3.4
Computational methods for inferring gene families
As discussed in section 2.4, identifying gene families is a central complex task in
evolutionary studies. Over the last two decades, a significant amount of efforts
have been done in this area and many methods have been proposed to infer gene
families. These methods can be broadly categorized into sequence similarity methods and phylogeny-based methods. In the following subsections, we will discuss
each of them in more details.
Sequence similarity methods
Sequence similarity methods, also known as hit-based methods, uses sequence similarity criteria to determine gene families. Hit-based methods usually involve two
steps: first, pairwise comparison of all the given sequences are performed to detect
homology, using a local alignment method such as BLAST [6] or Smith-Waterman
[148]. In the second step, a choice of clustering scheme is applied to group homologs into families. The type of clustering algorithms vary in the choice of similarity measure, i.e., BLAST bitscores, E-values, alignment length, or a combination of these parameters, or it may differ in the methodological way; for-example,
some sequence similarity methods are based on agglomerative clustering technique, typically single-linkage clustering, with a simple threshold criteria to cluster genes into families (e.g., BlastClust [37], SiLiX [113], ProtoMap [176], and GeneRage [49] etc), while other utilizes network or graph properties to find strongly
connected components in the graph and delimit them into gene families (e.g., NC
[150], HiFiX [112], TribMCL [50], Inparnoid [123]).
Sequence similarity methods are computationally fast but prone to relatively higher
errors. It may happen that a random gene with high sequence similarity may ends
up in a gene family, with no common ancestor; thus resulting in high false positives. Similarly, working with a single threshold, may sometimes result in missing
the de fecto homologies because the sequences may no longer agree with the level
of similarity that is greater than expected by random chance. These problems,
however, are to a great extent addressed in phylogeny based gene family methods
that uses explicit models of gene and sequence evolution to detect more reliable
homologs and ultimately, identify evolutionary diverse gene families, more accurately.
3.4. COMPUTATIONAL METHODS FOR INFERRING GENE FAMILIES
37
Phylogeny-based methods
In contrary to sequence similarity methods, phylogeny based methods uses sequence information and a statistical model that uses phylogenetic information to
infer gene families. It treats each gene member in a family as an evolved copy of
the last common ancestor. Phylogeny based methods incorporates explicit models of gene, species and sequence evolution to determine homolgous relationships
that are more accurate and reliable. It identifies clades of interest and creates families based on them.
Over the last few decades, there have been a number of phylogeny-based methods
proposed, with the most common methods are BranchClust [132], GreenphylDB
[135], Panther [111], HOGENOM [131], HoverGen [131], Ensembl [165], PhyloFacts [1], PhIGs [32], eggNOG [86], OMA [5] and PhylomeDB [85]. The main
methodological differences are found in the different approaches used for initial
clustering of sequences (for example using Blast-hits or hidden Markov models)
and how the actual phylogenetic clustering of genes or proteins into families is
performed.
Although, phylogeny-based methods require more computationally resources but
it gives more accurate results in inferring gene families, as compared to sequence
similarity methods; especially in scenarios where there are high differential gene
losses and varying rates of evolution. In Paper III of this thesis, we introduced a
phylogenetic-based gene family method that takes an MSA and a species tree as
an input and output gene families that results from each internal vertices of the
species tree.
Chapter 4
Overview of the Included Papers
This chapter presents a short summary of the published articles and a manuscript
included in this thesis. On all the papers, I have made contributions at different
stages of the problem solving process, i.e, identifying the problem, devising the
algorithm, implementation, data collection and simulation, analysis, results and
evaluation, and finally, writing a research article. For specifics, readers are referred
to the Authors’ contributions section in each of the included article in this thesis.
Paper I: This is a software article presenting FastPhylo as a software package that
contains the implementations of previously published algorithms in [48] and
[47]. [48] estimates a distance matrix from a given alignment and [47] reconstructs a phylogeny from the estimated distance matrix. In distance-based
phylogeny reconstruction methods, one of the main computational bottlenecks comes from computing the pairwise distance matrix for a given set of
sequences. We address this problem by using independent row operations
in a parallel fashion to estimate the whole distance matrix.
We used synthetic datasets to evaluate the performance of FastPhylo with
other NJ based methods, using the two performance matrices: speed and
memory consumption. Results on synthetic datasets revealed that FastPhylo
outperformed other NJ based methods in terms of speed and memory consumption but shows comparable speed with RapidNJ. We showed in the
paper that FastPhylo can work with very large problem instances.
The easy to use, well-defined interfaces, scalable and the modular structure
of FastPhylo allow it to be used in very large Bioinformatics pipelines. FastPhylo can be a good tool of choice in many studies; for instance, in MCMC
and maximum likelihood (ML) methods for phylogeny reconstruction, it can
be used to generate a good starting tree.
Paper II: In this article, we introduced a novel inference method, called GenFamClust, to detect homologous relationship among a given set of sequences.
39
40
CHAPTER 4. OVERVIEW OF THE INCLUDED PAPERS
Homology dectection is a crucial task to many comparative genomic studies, such as gene family inference, structural and functional prediction of
novel proteins etc. A common strategy to detect homology is to use sequence
similarity criteria or use some graph theocratic approaches. These methods,
however, are sensitive to the sequence similarity thresholds that most often
result in a large number of false positives. In this paper, we employ local
synteny information along with the sequence similarity measure to detect
homology.
GenFamClust computes two different scores, i.e., sequence similarity score
and synteny score, by applying the neighborhood correlation measure [150]
to the all-versus-all BLAST bitscores and local synteny information, respectively. An elliptical function is then used that takes the sequence similarity
and synteny scores as an input and outputs a one-dimensional evaluation
score which is used to cluster genes into different gene families.
We access the performance of GenFamClust using synthetic datasets and
the biological gold standard dataset. Results on both the datasets reveal
that our method performs better then the standalone sequence-similarityneighbourhood-correlation score and results in less number of false positives; the resultant gene families are more accurate and show similar properties. GenFamClust can be used as a part of large bioinformatics pipeline. It
can handle very large and diverse datasets spread across a variety of species,
as well as across varying protein domain architectures from single domain
to conserved and varying multi-domain proteins.
Paper III: This paper is an attempt to provide a novel method for partitioning a
very large gene family into more manageable subfamilies. It is not uncommon that some gene families are quite large. Some have hundreds of members even within a species and sequence clustering software, for example
GenFamClust in paper II, can faithfully return them. Unfortunately, some
downstream analysis cannot easily handle such large problem instances and
it is necessary to break them down into subfamilies. Such analysis includes,
in particular, Bayesian phylogenetic analysis like JPrIME DLRS [146]. This
can be done manually, but is time consuming and tedious task. The challenge is to find an objective and automatable procedure.
We argued that a method for creating subfamilies in a reliable way needs to
be based on phylogenetic inference and that subtrees are natural subfamilies.
In fact, this is a natural way of defining a gene family, which has been used
by, for example, [168]. To find good cut-points in the phylogeny, we chose
to reconcile phylogenies with a species tree. This is problematic since phylogeny inference is error-prone and even a small misplacement of a branch
can have large consequences when reconciling it with a species tree. We have
solved this using a bootstrap procedure, which effectively gives support values for inferred subfamilies.
41
We present a method, with two variants, that take homologous sequences
(assumed to be a family) as an input and output suggestions for final gene
families. The suggestions form a hierarchy, enabling the user to make a decision on cut points based on auxiliary knowledge about the biology.
The performance of our method was evaluated on synthetic datasets using
ROC analysis and also applied it to the standard biological datasets. We
found that our method performed well in identifying gene families with
high accuracy. The new method can solve issues in large-scale computations,
where it can be applied to problematic parts of the data.
Paper IV: In this paper, we propose probabilistic methods that sample reconciliations of a gene tree with the species tree and compute maximum a posteriori probabilities. The proposed methods are based on the DLTRS model,
introduced by Sjöstrand et al. [147] that uses a MCMC framework to integrate LGTs, gene duplications, gene losses and sequence evolution under a
relaxed molecular clock for substitution rates. Using the proposed sampling
method, we can estimate the posterior distribution of gene trees and the temporal information of LGT events over the species tree lineages. This temporal information, in turn, can be used to identify potential LGT highways
between different lineages of the species tree. In this paper, we evaluated
the sampling method using simulated and biological datasets. However, we
gave theoretical motivation of a probabilistic method that computes the most
likely reconciliation of gene tree with the species tree. Analysis on the simulated datasets reveals that our method performs well in identifying the true
LGT events over the species tree lineages. Using the biological datasets, we
managed to identify potential LGT highways that are reported by previously
published studies.
Bibliography
[1]
C. Afrasiabi, B. Samad, D. Dineen, C. Meacham, and K. Sjolander. The
PhyloFacts FAT-CAT web server: ortholog identification and function prediction using fast approximate tree classification. Nucleic Acids Research, 41
(Web Server issue):W242–8, Jul 2013.
[2]
Örjan Åkerborg, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis.
Proceedings of the National Academy of Sciences, 106(14):5714–5719, 2009.
[3]
Bruce Alberts, Dennis Bray, Karen Hopkin, Alexander Johnson, Julian
Lewis, Martin Raff, Keith Roberts, and Peter Walter. Essential cell biology.
Garland Science, 2013.
[4]
Raja H Ali, Sayyed A Muhammad, Mehmood A Khan, and Lars Arvestad.
Quantitative synteny scoring improves homology inference and partitioning of gene families. BMC Bioinformatics, 14(Suppl 15):S12, 2013.
[5]
A.M. Altenhoff, N. Skunca, N. Glover, C.M. Train, A. Sueki, I. Pilizota,
K. Gori, B. Tomiczek, S. Muller, H. Redestig, G.H. Gonnet, and C. Dessimoz. The OMA orthology database in 2015: function predictions, better
plant support, synteny view and other improvements. Nucleic Acids Research, 43(Database issue):D240–9, Jan 2015.
[6]
Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and
David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.
[7]
Jan O Andersson. Lateral gene transfer in eukaryotes. Cellular and Molecular
Life Sciences CMLS, 62(11):1182–1197, 2005.
[8]
Jan O Andersson. Gene transfer and diversification of microbial eukaryotes.
Annual Review of Microbiology, 63:177–193, 2009.
[9]
Lars Arvestad, Ann-Charlotte Berglund, Jens Lagergren, and Bengt
Sennblad. Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics, 19(suppl 1):i7–i15, 2003.
43
44
BIBLIOGRAPHY
[10]
Lars Arvestad, Ann-Charlotte Berglund, Jens Lagergren, and Bengt
Sennblad. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In Proceedings of the
Eighth Annual International Conference on Research in Computational Molecular
Biology, pages 326–335. ACM, 2004.
[11]
Kevin Atteson. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25(2-3):251–278, 1999.
[12]
Jeffrey A Bailey and Evan E Eichler. Primate segmental duplications: crucibles of evolution, diversity and disease. Nature Reviews Genetics, 7(7):552–
564, 2006.
[13]
Jeffrey A Bailey, Zhiping Gu, Royden A Clark, Knut Reinert, Rhea V Samonte, Stuart Schwartz, Mark D Adams, Eugene W Myers, Peter W Li, and
Evan E Eichler. Recent segmental duplications in the human genome. Science, 297(5583):1003–1007, 2002.
[14]
Mukul S Bansal, Eric J Alm, and Manolis Kellis. Efficient algorithms for the
reconciliation problem with gene duplication, horizontal transfer and loss.
Bioinformatics, 28(12):i283–i291, 2012.
[15]
Michael A Bender and Martin Farach-Colton. The LCA problem revisited.
In LATIN 2000: Theoretical Informatics, pages 88–94. Springer, 2000.
[16]
Gordon M Bennett and Nancy A Moran. Small, smaller, smallest: the origins
and evolution of ancient dual symbioses in a phloem-feeding insect. Genome
Biology and Evolution, 5(9):1675–1688, 2013.
[17]
James O Berger and Donald A Berry. Statistical analysis and the illusion of
objectivity. American Scientist, pages 159–165, 1988.
[18]
David Blackwell and Lester Dubins. Merging of opinions with increasing
information. The Annals of Mathematical Statistics, pages 882–886, 1962.
[19]
Alix Boc, Hervé Philippe, and Vladimir Makarenkov. Inferring and validating horizontal gene transfer events using bipartition dissimilarity. Systematic
Biology, page syp103, 2010.
[20]
Bastien Boussau, Gergely J Szöllősi, Laurent Duret, Manolo Gouy, Eric Tannier, and Vincent Daubin. Genome-scale coestimation of species and gene
trees. Genome Research, 23(2):323–330, 2013.
[21]
Calvin B Bridges. Salivary chromosome maps with a key to the banding
of the chromosomes of drosophila melanogaster. Journal of Heredity, 26(2):
60–64, 1935.
[22]
CB Bridges. Duplication. Anat. Rec, 15:357–358, 1919.
BIBLIOGRAPHY
45
[23]
Daniel R Brooks and Amanda L Ferrao. The historical biogeography of coevolution: emerging infectious diseases are evolutionary accidents waiting
to happen. Journal of Biogeography, 32(8):1291–1299, 2005.
[24]
Marija Buljan and Alex Bateman. The evolution of protein domain families.
Biochemical Society Transactions, 37(4):751–755, 2009.
[25]
Luigi L Cavalli-Sforza and Anthony WF Edwards. Phylogenetic analysis.
models and estimation procedures. American Journal of Human Genetics, 19(3
Pt 1):233, 1967.
[26]
Jarrod A Chapman, Ewen F Kirkness, Oleg Simakov, Steven E Hampson,
Therese Mitros, Thomas Weinmaier, Thomas Rattei, Prakash G Balasubramanian, Jon Borman, Dana Busam, et al. The dynamic genome of Hydra.
Nature, 464(7288):592–596, 2010.
[27]
MA Charleston. Jungles: a new solution to the host/parasite phylogeny
reconciliation problem. Mathematical Biosciences, 149(2):191–223, 1998.
[28]
Loredana Lo Conte, Bart Ailey, Tim JP Hubbard, Steven E Brenner, Alexey G
Murzin, and Cyrus Chothia. SCOP: a structural classification of proteins
database. Nucleic Acids Research, 28(1):257–259, 2000.
[29]
Diego Cortez, Patrick Forterre, and Simonetta Gribaldo. A hidden reservoir
of integrative elements is the major source of recently acquired foreign genes
and orfans in archaeal and bacterial genomes. Genome Biology, 10(6):R65,
2009.
[30]
Tal Dagan and William Martin. Getting a better picture of microbial evolution en route to a network of genomes. Philosophical Transactions of the Royal
Society B: Biological Sciences, 364(1527):2187–2196, 2009.
[31]
Charles Darwin. Notebook B: Transmutation, 1837.
[32]
P.S. Dehal and J.L. Boore. A phylogenomic gene cluster resource: the phylogenetically inferred groups (PhIGs) database. BMC Bioinformatics, 7:201,
2006.
[33]
Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977.
[34]
Elsa Denker, Eric Bapteste, Hervé Le Guyader, Michaël Manuel, and Nicolas
Rabet. Horizontal gene transfer and the evolution of cnidarian stinging cells.
Current Biology, 18(18):R858–R859, 2008.
[35]
Richard Desper and Olivier Gascuel. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. Journal of
Computational Biology, 9(5):687–705, 2002.
46
BIBLIOGRAPHY
[36]
Theodosius Dobzhansky and Alfred H Sturtevant. Inversions in the chromosomes of drosophila pseudoobscura. Genetics, 23(1):28, 1938.
[37]
I Dondoshansky and Y Wolf. BLASTClust, 2002.
[38]
Jean-Philippe Doyon, Celine Scornavacca, K Yu Gorbunov, Gergely J Szöllősi, Vincent Ranwez, and Vincent Berry. An efficient algorithm for
gene/species trees parsimonious reconciliation with losses, duplications
and transfers. In Comparative Genomics, pages 93–108. Springer, 2010.
[39]
Alexei J Drummond, Marc A Suchard, Dong Xie, and Andrew Rambaut.
Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology
and Evolution, 29(8):1969–1973, 2012.
[40]
Dannie Durand, Bjarni V Halldórsson, and Benjamin Vernot. A hybrid
micro-macroevolutionary approach to gene tree reconstruction. Journal of
Computational Biology, 13(2):320–335, 2006.
[41]
Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy
and high throughput. Nucleic Acids Research, 32(5):1792–1797, 2004.
[42]
Patrick P Edger and J Chris Pires. Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosome Research,
17(5):699–717, 2009.
[43]
Anthony WF Edwards and Cavalli LL Sforza. The reconstruction of evolution. Heredity, 18, 1963.
[44]
Greg Elgar and Tanya Vavouri. Tuning in to the signals: noncoding sequence
conservation in vertebrate genomes. Trends in genetics, 24(7):344–352, 2008.
[45]
Isaac Elias. Computational problems in evolution: Multiple alignment, genome
rearrangements, and tree reconstruction. PhD thesis, KTH Royal Institute of
Technology, 2006.
[46]
Isaac Elias. Settling the intractability of multiple alignment. Journal of Computational Biology, 13(7):1323–1339, 2006.
[47]
Isaac Elias and Jens Lagergren. Fast neighbor joining. In Automata, Languages
and Programming, pages 1263–1274. Springer, 2005.
[48]
Isaac Elias and Jens Lagergren. Fast computation of distance estimators.
BMC Bioinformatics, 8(1):89, 2007.
[49]
Anton J Enright and Christos A Ouzounis. GeneRAGE: a robust algorithm
for sequence clustering and domain detection. Bioinformatics, 16(5):451–457,
2000.
BIBLIOGRAPHY
47
[50]
Anton J Enright, Stijn Van Dongen, and Christos A Ouzounis. An efficient
algorithm for large-scale detection of protein families. Nucleic Acids Research,
30(7):1575–1584, 2002.
[51]
Chuanzhu Fan, Ying Chen, and Manyuan Long. Recurrent tandem gene duplication gave rise to functionally divergent genes in Drosophila. Molecular
Biology and Evolution, 25(7):1451–1458, 2008.
[52]
Joseph Felsenstein. Maximum likelihood and minimum-steps methods for
estimating evolutionary trees from data on discrete characters. Systematic
Zoology, pages 240–249, 1973.
[53]
Joseph Felsenstein. Maximum-likelihood estimation of evolutionary trees
from continuous characters. American Journal of Human Genetics, 25(5):471,
1973.
[54]
Joseph Felsenstein. Evolutionary trees from DNA sequences: a maximum
likelihood approach. Journal of Molecular Evolution, 17(6):368–376, 1981.
[55]
Joseph Felsenstein. PHYLIP–phylogeny inference package (version 3.2).
Cladistics, 5:163–166, 1989.
[56]
Joseph Felsenstein. Inferring phylogenies, volume 2. Sinauer Associates Sunderland, 2004.
[57]
Walter M Fitch. Distinguishing homologous from analogous proteins. Systematic Biology, 19(2):99–113, 1970.
[58]
Walter M Fitch and Emanuel Margoliash. Construction of phylogenetic
trees. Science, 155(3760):279–284, 1967.
[59]
Allan Force, Michael Lynch, F Bryan Pickett, Angel Amores, Yi-lin Yan, and
John Postlethwait. Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531–1545, 1999.
[60]
Olivier Gascuel. BIONJ: an improved version of the NJ algorithm based
on a simple model of sequence data. Molecular Biology and Evolution, 14(7):
685–695, 1997.
[61]
Andrew Gelman and Donald B Rubin. Inference from iterative simulation
using multiple sequences. Statistical Science, pages 457–472, 1992.
[62]
John Geweke. Evaluating the accuracy of sampling-based approaches to the
calculation of posterior moments. In Bayesian Statistics, 1992.
[63]
Pablo A Goloboff. Analyzing large data sets in reasonable times: solutions
for composite optima. Cladistics, 15(4):415–428, 1999.
48
BIBLIOGRAPHY
[64]
Morris Goodman, John Czelusniak, G William Moore, AE Romero-Herrera,
and Genji Matsuda. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences.
Systematic Biology, 28(2):132–163, 1979.
[65]
Paweł Górecki and Jerzy Tiuryn. DLS-trees: a model of evolutionary scenarios. Theoretical Computer Science, 359(1):378–399, 2006.
[66]
Laurie A Graham, Stephen C Lougheed, K Vanya Ewart, and Peter L Davies.
Lateral transfer of a lectin-like antifreeze protein gene in fishes. PLoS ONE,
3(7):e2616, 2008.
[67]
Miodrag Grbić, Thomas Van Leeuwen, Richard M Clark, Stephane Rombauts, Pierre Rouzé, Vojislava Grbić, Edward J Osborne, Wannes Dermauw,
Phuong Cao Thi Ngoc, Félix Ortego, et al. The genome of Tetranychus urticae reveals herbivorous pest adaptations. Nature, 479(7374):487–492, 2011.
[68]
Fred Griffith. The significance of pneumococcal types. Journal of Hygiene, 27
(02):113–159, 1928.
[69]
Roderic Guigo, Ilya Muchnik, and Temple F Smith. Reconstruction of ancient molecular phylogeny. Molecular Phylogenetics and Evolution, 6(2):189–
213, 1996.
[70]
Stéphane Guindon and Olivier Gascuel. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic
Biology, 52(5):696–704, 2003.
[71]
Leanne S Haggerty, Pierre-Alain Jachiet, William P Hanage, David Fitzpatrick, Philippe Lopez, Mary J O’Connell, Davide Pisani, Mark Wilkinson, Eric Bapteste, and James O McInerney. A pluralistic account of homology: adapting the models to the data. Molecular Biology and Evolution,
page mst228, 2013.
[72]
Matthew W Hahn. Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biology, 8(7):1–9, 2007.
[73]
Brian K Hall. Homology and embryonic development. In Evolutionary Biology, pages 1–37. Springer, 1995.
[74]
Brian K Hall. Homology: The hierarchial basis of comparative biology. Academic
Press, 2012.
[75]
Brian K Hall. Homology, homoplasy, novelty, and behavior. Developmental
Psychobiology, 55(1):4–12, 2013.
[76]
Michael T Hallett and Jens Lagergren. Efficient algorithms for lateral gene
transfer problems. In Proceedings of the Fifth Annual International Conference
on Computational biology, pages 149–156. ACM, 2001.
BIBLIOGRAPHY
49
[77]
Masami Hasegawa, Hirohisa Kishino, and Taka-aki Yano. Dating of the
human-ape splitting by a molecular clock of mitochondrial DNA. Journal
of Molecular Evolution, 22(2):160–174, 1985.
[78]
J.A. Heinemann and G.F. Sprague, Jr. Bacterial conjugative plasmids mobilize DNA transfer between bacteria and yeast. Nature, 340(6230):205–209,
1989.
[79]
Desmond G Higgins and Paul M Sharp. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73(1):237–244,
1988.
[80]
Tobias Hill, Karl JV Nordström, Mikael Thollesson, Tommy M Säfström,
Andreas KE Vernersson, Robert Fredriksson, and Helgi B Schiöth. SPRIT:
Identifying horizontal gene transfer in rooted phylogenetic trees. BMC Evolutionary Biology, 10(1):1, 2010.
[81]
Sara B Hoot and Jeffrey D Palmer. Structural rearrangements, including
parallel inversions, within the chloroplast genome of anemone and related
genera. Journal of Molecular Evolution, 38(3):274–281, 1994.
[82]
Julie C Dunning Hotopp, Michael E Clark, Deodoro CSG Oliveira, Jeremy M
Foster, Peter Fischer, Mónica C Muñoz Torres, Jonathan D Giebel, Nikhil
Kumar, Nadeeza Ishmael, Shiliang Wang, et al. Widespread lateral gene
transfer from intracellular bacteria to multicellular eukaryotes. Science, 317
(5845):1753–1756, 2007.
[83]
Kevin Howe, Alex Bateman, and Richard Durbin. QuickTree: building huge
neighbour-joining trees of protein sequences. Bioinformatics, 18(11):1546–
1547, 2002.
[84]
Jinling Huang. Horizontal gene transfer in eukaryotes: The weak-link
model. Bioessays, 35(10):868–875, 2013.
[85]
J. Huerta-Cepas, S. Capella-Gutierrez, L.P. Pryszcz, M. Marcet-Houben, and
T. Gabaldon. PhylomeDB v4: zooming into the plurality of evolutionary
histories of a genome. Nucleic Acids Research, 42(Database issue):D897–902,
Jan 2014.
[86]
J. Huerta-Cepas, D. Szklarczyk, K. Forslund, H. Cook, D. Heller, M.C. Walter, T. Rattei, D.R. Mende, S. Sunagawa, M. Kuhn, L.J. Jensen, C. von Mering, and P. Bork. eggNOG 4.5: a hierarchical orthology framework with
improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res, 44(D1):D286–93, Jan 2016.
[87]
Edwin T Jaynes. Prior probabilities. Systems Science and Cybernetics, IEEE
Transactions on, 4(3):227–241, 1968.
50
BIBLIOGRAPHY
[88]
Yuannian Jiao, Norman J Wickett, Saravanaraj Ayyampalayam, André S
Chanderbali, Lena Landherr, Paula E Ralph, Lynn P Tomsho, Yi Hu, Haiying
Liang, Pamela S Soltis, , Douglas E. Soltis, Sandra W. Clifton, Scott E. Schlarbaum, Stephan C. Schuster, Hong Ma, Jim Leebens-Mack, and Claude W.
dePamphilis. Ancestral polyploidy in seed plants and angiosperms. Nature,
473(7345):97–100, 2011.
[89]
Guohua Jin, Luay Nakhleh, Sagi Snir, and Tamir Tuller. Parsimony score
of phylogenetic networks: hardness results and a linear-time heuristic.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(3):495–
505, 2009.
[90]
Thomas H Jukes and Charles R Cantor. Evolution of protein molecules.
Mammalian Protein Metabolism, 3:21–132, 1969.
[91]
Kazutaka Katoh and Hiroyuki Toh. Parallelization of the MAFFT multiple
sequence alignment program. Bioinformatics, 26(15):1899–1900, 2010.
[92]
Patrick J Keeling and Jeffrey D Palmer. Horizontal gene transfer in eukaryotic evolution. Nature Reviews Genetics, 9(8):605–618, 2008.
[93]
David G Kendall. On the generalized ”birth-and-death” process. The Annals
of Mathematical Statistics, pages 1–15, 1948.
[94]
Mehmood A Khan, Isaac Elias, Erik Sjölund, Kristina Nylander, Roman V
Guimera, Richard Schobesberger, Peter Schmitzberger, Jens Lagergren, and
Lars Arvestad. Fastphylo: Fast tools for phylogenetics. BMC Bioinformatics,
14(1):334, 2013.
[95]
Kenneth K Kidd and Laura A Sgaramella-Zonta. Phylogenetic analysis: concepts and methods. American Journal of Human Genetics, 23(3):235, 1971.
[96]
Motoo Kimura. A simple method for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences. Journal
of Molecular Evolution, 16(2):111–120, 1980.
[97]
Eugene V Koonin and Michael Y Galperin. Evolutionary concept in genetics
and genomics. In Sequence Evolution Function, pages 25–49. Springer, 2003.
[98]
Eugene V Koonin, Kira S Makarova, and L Aravind. Horizontal gene transfer in prokaryotes: quantification and classification 1. Annual Review of Microbiology, 55(1):709–742, 2001.
[99]
Julie R Korenberg, Hiroko Kawashima, Stefan-M Pulst, T Ikeuchi, N Ogasawara, K Yamamoto, Steven A Schonberg, Ruth West, Leland Allen, Ellen
Magenis, K Ikawa, N Taniguchi, and Charles J Epstein. Molecular definition
of a region of chromosome 21 that causes features of the Down syndrome
phenotype. American Journal of Human Genetics, 47(2):236, 1990.
BIBLIOGRAPHY
51
[100] Eric S Lander, Lauren M Linton, Bruce Birren, Chad Nusbaum, Michael C
Zody, Jennifer Baldwin, Keri Devon, Ken Dewar, Michael Doyle, William
FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, 2001.
[101] Andrew S Lang, Olga Zhaxybayeva, and J Thomas Beatty. Gene transfer
agents: phage-like elements of genetic exchange. Nature Reviews Microbiology, 10(7):472–482, 2012.
[102] E Ray Lankester. On the use of the term homology in modern zoology, and
the distinction between homogenetic and homoplastic agreements. The Annals and Magazine of Natural History, 6(31):34–43, 1870.
[103] Nicolas Lartillot, Thomas Lepage, and Samuel Blanquart. PhyloBayes 3: a
Bayesian software package for phylogenetic reconstruction and molecular
dating. Bioinformatics, 25(17):2286–2288, 2009.
[104] Emmanuelle Lerat, Vincent Daubin, Howard Ochman, and Nancy A Moran.
Evolutionary origins of genomic repertoires in bacteria. PLoS Biology, 3(5):
e130, 2005.
[105] Ran Libeskind-Hadas, Yi-Chieh Wu, Mukul S Bansal, and Manolis Kellis.
Pareto-optimal phylogenetic tree reconciliation. Bioinformatics, 30(12):i87–
i95, 2014.
[106] Michael Lynch and Bruce Walsh. The origins of genome architecture, volume 98.
Sinauer Associates Sunderland, 2007.
[107] Owais Mahmudi. Probabilistic Reconciliation Analysis for Genes and Pseudogenes. PhD thesis, 2015.
[108] Thomas Mailund and Christian NS Pedersen. QuickJoin–fast neighbourjoining tree reconstruction. Bioinformatics, 20(17):3261–3262, 2004.
[109] Maxim A. Makukov and Vladimir I. shCherbak. Space ethics to test directed
panspermia. Life Sciences in Space Research, 3:10–17, 2014.
[110] Axel Meyer and Yves Van de Peer. From 2R to 3R: evidence for a fish-specific
genome duplication (FSGD). Bioessays, 27(9):937–945, 2005.
[111] H. Mi, A. Muruganujan, and P.D. Thomas. PANTHER in 2013: modeling
the evolution of gene function, and other gene attributes, in the context of
phylogenetic trees. Nucleic Acids Research, 41(Database issue):D377–86, Jan
2013.
[112] Vincent Miele, Simon Penel, Vincent Daubin, Franck Picard, Daniel Kahn,
and Laurent Duret. High-quality sequence clustering guided by network
topology and multiple alignment likelihood. Bioinformatics, 28(8):1078–1085,
2012.
52
BIBLIOGRAPHY
[113] Vincent Miele, Simon Penel, and Laurent Duret. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics, 12(1):1, 2011.
[114] Nancy A Moran and Tyler Jarvik. Lateral transfer of genes from fungi underlies carotenoid production in aphids. Science, 328(5978):624–627, 2010.
[115] Burkhard Morgenstern, Kornelie Frech, Andreas Dress, and Thomas
Werner. DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14(3):290–294, 1998.
[116] HJ Muller. The origination of chromatin deficiencies as minute deletions
subject to insertion elsewhere. Genetica, 17(3):237–252, 1935.
[117] Luay Nakhleh, Tandy Warnow, and C Randal Linder. Reconstructing reticulate evolution in species: theory and practice. In Proceedings of the Eighth
Annual International Conference on Resaerch in Computational Molecular Biology,
pages 337–346. ACM, 2004.
[118] Hema Prasad Narra and Howard Ochman. Of what use is sex to bacteria?
Current Biology, 16(17):R705–R710, 2006.
[119] Björn Nystedt, Nathaniel R Street, Anna Wetterbom, Andrea Zuccolo, YaoCheng Lin, Douglas G Scofield, Francesco Vezzi, Nicolas Delhomme, Stefania Giacomello, Andrey Alexeyenko, et al. The Norway spruce genome
sequence and conifer genome evolution. Nature, 497(7451):579–584, 2013.
[120] Susumu Ohno. Sex chromosomes and sex-linked genes. Springer-Verlag, 1967.
[121] Susumu Ohno. Evolution by gene duplication. London: George Alien & Unwin
Ltd. Berlin, Heidelberg and New York: Springer-Verlag., 1970.
[122] Tomoko Ohta. Gene families: multigene families and superfamilies. eLS,
2006.
[123] Gabriel Östlund, Thomas Schmitt, Kristoffer Forslund, Tina Köstler,
David N Messina, Sanjit Roopra, Oliver Frings, and Erik LL Sonnhammer.
InParanoid 7: new algorithms and tools for eukaryotic orthology analysis.
Nucleic Acids Research, 38(suppl 1):D196–D203, 2010.
[124] Yaniv Ovadia, Daniel Fielder, Chris Conow, and Ran Libeskind-Hadas. The
cophylogeny reconstruction problem is NP-complete. Journal of Computational Biology, 18(1):59–65, 2011.
[125] Richard Owen. Lectures on the Comparative Anatomy and Physiology of the Vertebrate Animals: Delivered at the Royal College of Surgeons of England in 1844
and 1846, volume 1. Longman, Brown, Green, and Longmans, 1846.
BIBLIOGRAPHY
53
[126] Roderic DM Page. Maps between trees and cladistic analysis of historical
associations among genes, organisms, and areas. Systematic Biology, 43(1):
58–77, 1994.
[127] Roderic DM Page and Michael A Charleston. Trees within trees: phylogeny
and historical associations. Trends in Ecology & Evolution, 13(9):356–359, 1998.
[128] Jeffrey D Palmer and Laura A Herbo. Unicircular structure of the Brassica
hirta mitochondrial genome. Current Genetics, 11(6-7):565–570, 1987.
[129] Jeffrey D Palmer and Laura A Herbon. Tricircular mitochondrial genomes
of Brassica and Raphanus: reversal of repeat configurations by inversion.
Nucleic Acids Research, 14(24):9755–9764, 1986.
[130] Frances Pearl, Annabel Todd, Ian Sillitoe, Mark Dibley, Oliver Redfern,
Tony Lewis, Christopher Bennett, Russell Marsden, Alistair Grant, David
Lee, Adrian Akpor, Michael Maibaum, Andrew Harrison, Timothy Dallman, Gabrielle Reeves, Ilhem Diboun, Sarah Addou, Stefano Lise, Caroline
Johnston, Antonio Sillero, Janet Thornton, and Christine Orengo. The CATH
domain structure database and related resources Gene3D and DHS provide
comprehensive domain family information for genome analysis. Nucleic
Acids Research, 33(suppl 1):D247–D251, 2005.
[131] S. Penel, A.M. Arigon, J.F. Dufayard, A.S. Sertier, V. Daubin, L. Duret,
M. Gouy, and G. Perriere. Databases of homologous gene families for comparative genomics. BMC Bioinformatics, 10 Suppl 6:S3, 2009.
[132] Maria S Poptsova and J Peter Gogarten. BranchClust: a phylogenetic algorithm for selecting gene families. BMC Bioinformatics, 8(1):120, 2007.
[133] Matthew D Rasmussen and Manolis Kellis. A Bayesian approach for fast
and accurate gene tree reconstruction. Molecular Biology and Evolution, 28(1):
273–290, 2011.
[134] Fredrik Ronquist and John P Huelsenbeck. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572–1574, 2003.
[135] Mathieu Rouard, Valentin Guignon, Christelle Aluome, Marie-Angélique
Laporte, Gaëtan Droc, Christian Walde, Christian M Zmasek, Christophe
Périn, and Matthieu G Conte. GreenPhylDB v2. 0: comparative and functional genomics in plants. Nucleic Acids Research, 2010.
[136] Andre Rzhetsky and Masatoshi Nei.
Theoretical foundation of the
minimum-evolution method of phylogenetic inference. Molecular Biology
and Evolution, 10(5):1073–1095, 1993.
[137] Kristoffer Sahlin. Estimating convergence of Markov chain Monte Carlo
simulations. Master’s thesis, Stockholm University, 2011.
54
BIBLIOGRAPHY
[138] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new
method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987.
[139] Phillip SanMiguel, Alexander Tikhonov, Young-Kwan Jin, Natasha
Motchoulskaia, Dmitrii Zakharov, Admasu Melake-Berhan, Patricia S
Springer, Keith J Edwards, Michael Lee, Zoya Avramova, et al. Nested
retrotransposons in the intergenic regions of the maize genome. Science, 274
(5288):765–768, 1996.
[140] Bengt Sennblad, Lars Arvestad, Joel Sjöstrand, et al. PrIME: A set of C++
libraries for probabilistic integrated models of evolution., 2013. URL http:
//prime.scilifelab.se/. Accessed: 2016-03-16.
[141] AS Serebrovsky. Genes scute and achaete in Drosophila melanogaster and a
hypothesis of gene divergency. C. R. Acad. Sci.URSS, 19:77–81, 1938.
[142] Andrew J Sharp, Devin P Locke, Sean D McGrath, Ze Cheng, Jeffrey A
Bailey, Rhea U Vallente, Lisa M Pertz, Royden A Clark, Stuart Schwartz,
Rick Segraves, Vanessa V. Oseroff, Donna G. Albertson, Daniel Pinkel, and
Evan E. Eichler. Segmental duplications and copy-number variation in the
human genome. The American Journal of Human Genetics, 77(1):78–88, 2005.
[143] Martin Simonsen, Thomas Mailund, and Christian NS Pedersen. Rapid
neighbour-joining. In Algorithms in Bioinformatics, pages 113–122. Springer,
2008.
[144] Joel Sjöstrand. Reconciling gene family evolution and species evolution. PhD
thesis, Stockholm University, 2013.
[145] Joel Sjöstrand, Mehmood Alam Khan, Lars Arvestad, Bengt Sennblad, Jens
Lagergren, et al. JPrIME: a Java library with tools for phylogenetics., 2013.
URL https://github.com/arvestad/jprime/. Accessed: 2016-03-16.
[146] Joel Sjöstrand, Bengt Sennblad, Lars Arvestad, and Jens Lagergren. DLRS:
gene tree evolution in light of a species tree. Bioinformatics, 28(22):2994–2995,
2012.
[147] Joel Sjöstrand, Ali Tofigh, Vincent Daubin, Lars Arvestad, Bengt Sennblad,
and Jens Lagergren. A Bayesian method for analyzing lateral gene transfer.
Systematic Biology, 63(3):409–420, 2014.
[148] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[149] Berend Snel, Peer Bork, and Martijn A Huynen. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Research, 12(1):
17–25, 2002.
BIBLIOGRAPHY
55
[150] Nan Song, Jacob M Joseph, George B Davis, and Dannie Durand. Sequence
similarity network reveals common ancestry of multidomain proteins. PLoS
Computational Biology, 4(5):e1000063, 2008.
[151] Alexandros Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics,
22(21):2688–2690, 2006.
[152] Arlin Stoltzfus. On the possibility of constructive neutral evolution. Journal
of Molecular Evolution, 49(2):169–181, 1999.
[153] Georg F Striedter and R Glenn Northcutt. Biological hierarchies and the
concept of homology. Brain, Behavior and Evolution, 38(4-5):177–189, 1991.
[154] David L Swofford. PAUP*. Phylogenetic analysis using parsimony (* and
other methods). Version 4., 2003.
[155] Gergely J Szöllősi, Bastien Boussau, Sophie S Abby, Eric Tannier, and Vincent
Daubin. Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations. Proceedings of the National Academy of
Sciences, 109(43):17513–17518, 2012.
[156] Gergely J Szöllősi, Eric Tannier, Nicolas Lartillot, and Vincent Daubin. Lateral gene transfer from the dead. Systematic Biology, 62(3):386–397, 2013.
[157] Koichiro Tamura, Glen Stecher, Daniel Peterson, Alan Filipski, and Sudhir Kumar. MEGA6: molecular evolutionary genetics analysis version 6.0.
Molecular Biology and Evolution, 30(12):2725–2729, 2013.
[158] EL Tatum and Joshua Lederberg. Gene recombination in the bacterium Escherichia coli. Journal of Bacteriology, 53(6):673, 1947.
[159] Simon Tavaré. Some probabilistic and statistical problems in the analysis of
DNA sequences. Lectures on mathematics in the life sciences, 17:57–86, 1986.
[160] Jeremy N Timmis, Michael A Ayliffe, Chun Y Huang, and William Martin.
Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nature Reviews Genetics, 5(2):123–135, 2004.
[161] Ali Tofigh, Michael Hallett, and Jens Lagergren. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Transactions on
Computational Biology and Bioinformatics, 8(2):517–535, 2011.
[162] Todd J Treangen, Anne-Laure Abraham, Marie Touchon, and Eduardo PC
Rocha. Genesis, effects and fates of repeats in prokaryotic genomes. FEMS
Microbiology Reviews, 33(3):539–571, 2009.
56
BIBLIOGRAPHY
[163] Todd J Treangen and Eduardo P Rocha. Horizontal transfer, not duplication,
drives the expansion of protein families in prokaryotes. PLoS Genetics, 7(1):
e1001284, 2011.
[164] René TJM van der Heijden, Berend Snel, Vera van Noort, and Martijn A
Huynen. Orthology prediction at scalable resolution by phylogenetic tree
analysis. BMC Bioinformatics, 8(1):83, 2007.
[165] Albert J Vilella, Jessica Severin, Abel Ureta-Vidal, Li Heng, Richard Durbin,
and Ewan Birney. EnsemblCompara GeneTrees: Complete, duplicationaware phylogenetic trees in vertebrates. Genome Research, 19(2):327–335,
2009.
[166] Günter P Wagner. The developmental genetics of homology. Nature Reviews
Genetics, 8(6):473–479, 2007.
[167] Lusheng Wang and Tao Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1(4):337–348, 1994.
[168] Ilan Wapinski, Avi Pfeffer, Nir Friedman, and Aviv Regev. Automatic
genome-wide reconstruction of phylogenetic gene trees. Bioinformatics, 23
(13):i549–i558, 2007.
[169] Ilan Wapinski, Avi Pfeffer, Nir Friedman, and Aviv Regev. Natural history
and evolutionary principles of gene duplication in fungi. Nature, 449(7158):
54–61, 2007.
[170] Andrew M Waterhouse, James B Procter, David MA Martin, Michèle Clamp,
and Geoffrey J Barton. Jalview version 2—a multiple sequence alignment
editor and analysis workbench. Bioinformatics, 25(9):1189–1191, 2009.
[171] Travis J Wheeler. Large-scale neighbor-joining with NINJA. In Algorithms in
Bioinformatics, pages 375–389. Springer, 2009.
[172] M. Wu and P.G. Higgs. The origin of life is a spatially localized stochastic
transition. Biol Direct, 7:42, 2012.
[173] Ziheng Yang. PAML: a program package for phylogenetic analysis by maximum likelihood. Computer Applications in the Biosciences, 13(5):555–556, 1997.
[174] Ziheng Yang and Joseph P Bielawski. Statistical methods for detecting
molecular adaptation. Trends in Ecology & Evolution, 15(12):496–503, 2000.
[175] Ziheng Yang and Bruce Rannala. Molecular phylogenetics: principles and
practice. Nature Reviews Genetics, 13(5):303–314, 2012.
[176] Golan Yona, Nathan Linial, and Michal Linial. ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids
Research, 28(1):49–55, 2000.
BIBLIOGRAPHY
57
[177] Satoko Yoshida, Shinichiro Maruyama, Hisayoshi Nozaki, and Ken Shirasu.
Horizontal gene transfer by the parasitic plant striga hermonthica. Science,
328(5982):1128–1128, 2010.
[178] Louxin Zhang. On a Mirkin-Muchnik-Smith conjecture for comparing
molecular phylogenies. Journal of Computational Biology, 4(2):177–187, 1997.
[179] Norton D Zinder and Joshua Lederberg. Genetic exchange in Salmonella.
Journal of Bacteriology, 64(5):679, 1952.
[180] D. Zwickl. Genetic algorithm approaches for the phylogenetic analysis of large
biological sequence datasets under the maximum likelihood criterion. PhD thesis,
The University of Texas at Austin, 2006.