Inference of ancestral recombination graphs

Inference of ancestral recombination graphs through
topological data analysis
arXiv:1505.05815v1 [q-bio.QM] 21 May 2015
P. G. Cámara1,2 , A. J. Levine2 and R. Rabadán1
1 Department
of Systems Biology and Department of Biomedical Informatics,
Columbia University College of Physicians and Surgeons,
1130 St. Nicholas Ave., New York
2 The
Simons Center for Systems Biology,
Institute for Advanced Study, Princeton, NJ 08540.
Abstract
The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relations within and across species.
Recombination, reassortment, horizontal gene transfer, and species hybridization constitute
examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from tens or hundreds of genomes, we are interested in the reconstruction
of potential evolutionary histories leading to the observed data. Ancestral recombination
graphs (ARGs) represent potential histories that explicitly accommodate recombination and
mutation events across orthologous genomes. However, ARGs are computationally costly
to reconstruct and usually become infeasible for more than few tens of genomes. Recently,
Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build on previous
TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal
histories of mutation and recombination events, quantifying the scales and identifying the
genomic locations of recombinations. For that aim, we extend the notion of barcodes in persistent homology, largely increasing their sensitivity to recombination, and present a new type
of summary graph (topological ARG, or tARG), analogous to ARGs, that capture ensembles of minimal recombination histories. We implement this framework in a software package,
called TARGet, and apply it to several examples, including small migration between different
populations and horizontal evolution in finches inhabiting the Galápagos Islands.
Contents
1 Introduction
1
2 Topological aspects of minimal ARGs
4
3 Persistent homology and the extended barcode of a sample
7
4 Examples
12
4.1
A simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12
4.2
Genetic exchange between two divergent populations . . . . . . . . . .
14
4.3
Darwin’s finches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16
5 Discussion
16
6 Methods
18
1
6.1
Algorithm implementation . . . . . . . . . . . . . . . . . . . . . . . . . .
18
6.2
Population genetics simulations
. . . . . . . . . . . . . . . . . . . . . . .
18
6.3
Darwin’s finches genotyping . . . . . . . . . . . . . . . . . . . . . . . . .
19
Introduction
Since the publication of the first draft of the human genome in 2001 [1,2], there has been
an explosion in genomic data. The genomes of thousands of different human individuals
across the globe have been sequenced [3], cancer patients are getting their genome routinely studied [4–7], several hundreds of eukaryotic genomes have been characterized,
and new viral, bacterial and archaeal species are being sequenced on an almost daily
basis [8, 9]. The rapid progression of single cell sequencing [10] permits genomic analysis of many uncultivated bacterial species, providing a glimpse into what has been
denoted as microbial dark matter [11]. Even before Darwin [12], taxonomists have
been classifying organisms into different taxa, from species to domains. But it was
Darwin who provided a historical dimension to the taxonomical enterprise, proposing that closely related species in the hierarchical taxonomy share ancestors. Since
then, tree-like structures have been proposed to represent the evolutionary/historical
relationship between organisms. In the last few years, however, the richer and more
comprehensive genomic characterization of many organisms have underscored the need
of representations that are not strictly tree-like. Phenomena such as horizontal gene
transfer in bacteria [13], the ability of viruses to borrow and lend genes across species,
1
the exchange of genomic material by archea and bacteria [14], and hybridization in
metazoa (in plants, in particular [15,16]) are exposing some of the limitations imposed
by tree-like phylogenetic structures. The definition of species itself becomes cumbersome in bacteria and viruses [17]. Needless to say that within many species, including
humans, genetic recombination is so pervasive that tree-like representations are useless.
It is then natural to wonder what other frameworks could be used to capture phylogenetic relations without losing the interpretability and simplicity of trees [18–20]. Of
particular interest are representations that reduce to trees when evolution is tree-like;
that capture genetic relations (genetic scales) between ancestors, and identify genomic
regions originating from different ancestral lineages; and, more generally, that allow for
an interpretation of the observed data in terms of a chronological sequence of events.
Several such frameworks have been proposed in the last two decades. The study of
phylogenetic networks has been an area particularly active [21–23]. Phylogenetic networks provide representations that extend trees to graphs (networks), generating loops
when the data does not fit into a tree. Some of those methods can easily be applied to
more than one hundred genomes [24–29] providing the opportunity for large-scale representations. However, the biological interpretation of these representations is limited,
as loops represent inconsistencies with trees, but it is unclear how these inconsistencies
arose historically, what genomic regions were involved, or how frequently an exchange
happened. Other types of representations, sometimes named explicit networks [21, 30],
do aim to provide a historical account in terms of a chronology of events. Ancestral
recombination graphs, or ARGs, provide potential explanations of the observed data
in terms of a progression of recombination and mutation events. As in trees, mutations are represented as events along the branches. Recombinations, however, appear
as the fusion of two parental branches into one offspring branch. ARGs provide simple histories that can be used in association mapping [31–33], SNP genotyping [34] or
statistical inference of the frequency and scale of recombination [35]. However, these
applications are hindered by the computational infeasibility of constructing ARGs that
explain hundreds of sequences. The construction of minimal ARGs, containing the
minimum number of recombination events required to explain the sample in absence of
convergent evolution and back-mutation, is known to be an NP-hard problem [36–38].
Several approximations have been developed in the last few years, including galled
trees [39, 40], branch and bound [41], heuristic [31] and sequentially Markov coalescent
(SMC) approaches [42].
Recently, a new framework to study genomic relations has been proposed [43–45],
2
based on Topological Data Analysis [46–48]. Topology is the area of mathematics
that aims to characterize properties of spaces up to continuous deformations. TDA
extends the concepts and tools of topology to finite metric spaces, that is, finite sets of
points and distances between them. Stability results [43, 49, 50] guarantee that small
fluctuations in the data only create small changes in the inferred topological features,
providing robust characterizations of the data. In this framework, genomes can be
viewed as points in a high dimensional space, and relations between these genomes are
characterized by distances. Recombination events appear as loops in these spaces, and
their frequency and scale is summarized in a set of open intervals, named barcodes.
These provide a basic structure on which statistics of genomic exchange can be built.
TDA methods are particularly well suited for large datasets. In the context of
molecular phylogenetics and evolution, they have been applied to the study of viral
recombination and reassortment [43], bacterial species [44] and point estimators in population genetics [45]. However, these implementations have limitations. Specifically,
they only use information about genetic distances between sequences, and so they
discard the full structure of segregating characters, missing numerous recombination
events that are required to explain the data. Relatedly, it is unclear which specific
evolutionary histories explaining the data TDA informs about, and what is the precise
relation between barcodes and these histories.
Here we address these two important aspects, improving on the scalable capabilities of persistent homology to extract robust information on the possible evolutionary
histories of a sample of genetic sequences. In particular, we show that by systematically sampling subsets of segregating sites and performing TDA, we are able to identify
most of the necessary recombination events identified by bound methods [41, 51, 52],
providing a significant improvement of past methods [43] in terms of interpretation and
sensitivity. Moreover, we introduce a novel type of graph, closely related to minimal
ARGs, that we name topological ARG (tARG); and show that barcodes inform about
the topological features and genetic scales of these graphs.
Like minimal ARGs [30,31], tARGs can be considered as explicit, parsimonious, interpretable phylogenetic representations. The main advantage of tARGs and barcodes
versus minimal ARGs is, however, the possibility of obtaining such phylogenetic information in polynomial time, which allows us to deal with hundreds of sequences. We
have implemented this method in a software, called TARGet, and have illustrated it with
several examples, including horizontal evolution of finches inhabiting the Galápagos
archipelago. The software, instructions and example files used in the manuscript can
3
be obtained from https://github.com/RabadanLab/TARGet.
2
Topological aspects of minimal ARGs
An ARG is an explicit phylogenetic network representing a possible evolutionary history
of a sample of genetic sequences, where only mutation and recombination events are
present and convergent evolution is not considered and so never occurs [30, 53, 54].
ARGs are very useful constructs in population genetics and phylogenetics. However,
the problem of building a minimal ARG from a set of genetic sequences is known to be
NP-hard [36–38]. The use of ARGs has therefore been traditionally limited to small
samples, consisting of a handful of sequences.
In this section, we introduce a particular class of minimal ARGs and a set of related
graphs. Then, using computational algebraic topology, in the next section we show that
it is possible to extract, in polynomial time, phylogenetic information from this class
of minimal ARGs, without having to explicitly construct them. Thus, by restricting to
this specific class of graphs, we are able to extend the realm of ARGs to large samples
of sequences.
To be specific, we consider a sample S consisting of n distinct genetic sequences with
m binary segregating characters. The latter can be single nucleotide polymorphisms
(SNPs), indels, gene duplications or any other genetic trait that takes one of two
possible states, 0 or 1, in each sequence. An ARG is then formally defined as a directed
acyclic graph N with n leaf nodes and a unique root node, where every node other than
the root has in-degree one (tree node) or two (recombination node) and every character
or sequence in S labels respectively a unique edge or leaf in N . Moreover, each node
in N is labelled by a m-length binary sequence, such that 1) the sequence labelling a
tree node differs from the sequence of the parent node only at the character labelling
the edge that connects the two nodes; and 2) the sequence labelling a recombination
node is a single-crossover combination of the sequences labelling the two parent nodes.
There is an infinite number of ARGs that can explain a particular sample S. An
stochastic model, such as the coalescent model with recombination [53,55], would assign
probabilities to each possible ARG. Here, however, we adopt a parsimony approach
and consider ARGs that are minimal (in a sense defined below), without assuming an
underlying probabilistic model. Such an approach has proven useful in summarizing all
the information contained in the sequences into evolutionary histories where all events
are required.
4
Figure 1: Minimal ARGs and condensed graphs. Two examples of minimal ARGs
and the condensed graphs that result from collapsing their unlabelled edges. The root
node is marked red whereas sampled nodes are marked green. Mutations in the r-th
character are indicated by mr . Edges pointing to a recombination node are labelled
with the letter P or S, depending on whether they contribute to the prefix or suffix
of the recombinant sequence. Recombinant nodes are marked with the position of
the recombination breakpoint. All nodes are labelled by their sequence of characters.
Condensed graphs of ARGs can be embedded into m-dimensional hypercubes and their
diagonals.
Specifically, we consider ARGs that contain exactly the minimum number Rmin of
single-crossover recombinations required to explain the sample, and that minimize the
function
D(N ) =
R
min
X
dr
(2.1)
r=0
where the sum runs over all recombination events in N , and dr is the genetic distance
between the two parental sequences involved in the r-th recombination. This is a
slightly more restricted definition of minimal ARG than the one that usually appears
in population genetics literature [30], where the condition on D(N ) is generally not
required. By construction, a minimal ARG explaining any given sample always exists.
5
Examples of minimal ARGs are given in figs. 1 and 2.
An ARG can be condensed by collapsing all unlabelled edges, so that the resulting
graph can be embedded into an m-dimensional hypercube and its diagonals, except
for the axial ones (fig. 1). The number of edges and vertices of such a condensed
representation is m + 2Rmin and m + Rmin + 1, respectively, whereas the number of
independent loops is Rmin .
Figure 2: Minimal ARGs and eq. (2.1). Two examples of ARGs containing the
minimum number of recombination events, Rmin = 3, required to explain the sample.
The ARG in a) minimizes eq. (2.1). It is therefore a minimal ARG according to the
restricted definition used here. The ARG in b), on the contrary, does not minimize
eq. (2.1).
The general question that we want to address is, given a set S of sampled sequences,
what information can we obtain about the structure of minimal ARGs that explain S,
without having to explicitly construct them? To address this question, we consider the
6
union graph of all undirected condensed minimal ARGs explaining S and sharing the
same set of vertices (fig. 3). We call this construction topological ARG, or tARG, for
reasons that will become clear in next section. Note that, contrary to ARGs, a tARG
is uniquely specified in terms of its vertices. Denoting by Rmin the number of loops of
a tARG, by construction it follows that Rmin ≥ Rmin . In particular, Rmin > Rmin when
some recombination node in a minimal ARG explaining the sample can be alternatively
obtained from another pair of nodes in the ARG, without affecting (2.1). In terms of
the sample, Rmin > Rmin when there are three characters for which all eight possible
allele combinations are present in S. In models like the coalescent with recombination,
this only happens at very large recombination rates (see fig. 5 below). In the most
extreme case, where the sample consists of all possible 2m sequences with m segregating
characters, Rmin = 2m − m − 1 and Rmin = 2m−1 m − 2m + 1.
3
Persistent homology and the extended barcode
of a sample
Topological data analysis (TDA) has emerged during the last decade as a branch of applied topology that attempts to infer topological features of spaces from sets of sampled
points [46]. Topological invariants are algebraic structures (groups, modules, etc) of
the space preserved under continuous deformations. One example of such invariants are
homology groups. Generators of the zero-th homology group correspond to connected
components of the space; generators of the first homology group correspond to noncontractible loops; and in general, generators of the n-th homology group correspond
to non-contractible n-dimensional voids. In the case where only a finite set of points of
the space is known, there is still a well-defined notion of homology, known as persistent
homology [47, 48]. At each scale , a simplicial complex (a finite set representation of
the topology of the space) can be constructed by taking balls of radius centred at the
sampled points, and considering their intersections (fig. 4). Two points are connected
by an edge in the simplicial complex if the corresponding balls intersect. This process
produces a filtration of simplicial complexes parametrized by , from which homology
groups can be computed. Remarkably, the computation time of persistent homology
is polynomial in the number of points [48].
In our current context, we exploit the use of persistent homology to infer topological
features of an unknown tARG from the known set of sampled nodes. A distance
function between pairs of nodes of the tARG can be defined by taking the hamming
7
Figure 3: Topological ARGs. Examples of condensed minimal ARGs (left) and their
corresponding tARGs (right). In a tARG the edges are completely determined by the
vertices. In these two examples the topology of the resulting tARG differs from that of
the original condensed minimal ARGs. In example a) the minimal ARG has Rmin = 4,
whereas the corresponding tARG has Rmin = 5. In example b) the minimal ARG has
Rmin = 11, whereas the corresponding tARG has Rmin = 17.
distance between the corresponding sequences of characters that label the nodes. This
is the simplest possible notion of genetic distance.
8
Figure 4: Persistent homology of a sample of genetic sequences. Barcode and
simplicial complexes at several values of the filtration parameter , for the sample of
sequences S = {000, 010, 101, 111}. Only the zero-th (H0 ) and first (H1 ) homology
groups are shown. At small the four sampled points are disconnected and H0 consists of 4 elements. Increasing leads to a loop, that appears as a single element of
H1 . Further increasing fills in the loop, leading to a single connected surface, that
is represented by a single element of H0 . A minimal ARG explaining S and the corresponding tARG were presented in figure 1b. The barcode only captures one of the
Rmin = Rmin = 2 recombination events. The generators of the loop in this case are the
four sequences present in S, that form a loop enclosing the two recombination loops
present in the minimal ARG.
Barcodes are a useful representation of persistent homology [56]. These are graphical representations where each element of a homology group is represented by a segment
spanning the interval [b , d ], where b and d are the values of the parameter at which
the corresponding feature is formed and destroyed, respectively. Hence, the barcode of
a sample contains information about the topology and scales of the tARG underlying
the sample (fig. 4). Specifically, elements of the first homology group correspond to
9
recombination loops in the tARG, whereas b and d provide information about the genetic scales involved in the recombination events. The number of generators of the first
homology group or persistent first Betti number, b1 , is thus a lower bound of Rmin . First
homology generators permit to identify a set of sequences in S that enclose the corresponding recombination loop in the tARG, and can be used to partially reconstruct
the underlying tARG.
Whereas tARGs provide an interpretation of persistent homology barcodes in terms
of minimal evolutionary histories and ARGs, the sensitivity of persistent homology to
recombination is in general low, and b1 is a loose lower bound of Rmin . To address
a similar problem, Myers and Griffiths showed in ref. [52] that, given a lower bound
coming from a sample of aligned sequences, it is possible to build a more stringent
bound by suitably combining the local bounds that result from intervals within the
alignment. In this way, information about the ordering of characters is incorporated,
and the location of crossover breakpoints in the sequence is constrained. This general
idea was applied in [52] to the haplotype bound to built a stronger lower bound,
n − m − 1 ≤ RMG ≤ Rmin .
A similar idea can be applied in the context of persistent homology barcodes to build
an extended first-homology barcode, given by the disjoint union of the first-homology
barcodes of a set of optimally chosen, non-overlapping intervals within the sequence
alignment. From a geometric perspective, this corresponds to projecting the original
space on a optimal set of mutually orthogonal hyperplanes in the ambient hypercube,
and computing persistent homology in each of those projections. The extended barcode
incorporates information about the full structure of characters in the sample, largely
increasing the sensitivity of persistent homology and providing information on the
location of the recombination breakpoints in the sequence.
To extend the construction of ref. [52] to persistent homology barcodes, we need to
establish an ordering relation on barcodes. Being sets of intervals, it is natural to take
the maximum of two barcodes to be given by the one with largest L0 -norm, namely
largest b1 . If both barcodes have the same L0 -norm, we may successively compare
other norms (e.g. other Lp -norms), until the tie is broken. The algorithm of [52] is
then generalized to persistent homology barcodes as follows:
1. Let Bik be the first-homology barcode of the sequences that result from the i-th
to k-th characters in S. Set Rij = 0 and k = 2.
2. For j = 1, . . . , k − 1, set Rjk = max{Rji ∪ Bik : i = j, ..., k − 1}
10
3. If k < m, increment k by 1 and go to step 2.
The extended first-homology barcode of S is the union barcode R1m that results from
this algorithm.
Figure 5: Comparison between lower bounds b̄1 ≤ Rmin and RMG ≤ Rmin in
coalescent simulations. Values of b̄1 and RMG for simulated samples of 40 sequences
with 12 segregating sites, sampled from a population under the coalescent model with
recombination. 4000 samples were simulated in total. The colored band represents the
interdecile range, whereas the central line represents the mean. The values of b̄1 and
RMG are highly correlated (Pearson’s r = 0.98). At very high recombination rates, b̄1
tends to be bigger than RMG , as cases like those of fig. 3 occur more frequently.
The number of bars in the extended barcode, b̄1 , is a optimized lower bound of
Rmin , in the same way as RMG is a optimized lower bound of Rmin ,
tARG
x


−−−→
Rmin ≥ b̄1 ≥ b1
minimal ARG −−−→ Rmin ≥ RMG ≥ n − m − 1
In biological data, b̄1 and RMG are in general very close to each other (fig. 5), as the
probability of generating minimal ARGs that produce tARGs with a different topology
(e.g. those of fig. 3) is low. However, as opposed to RMG , the extended first-homology
barcode provides additional phylogenetic information about the sample, such as the
genetic scales and sequences involved in the recombination events. These features put
extended first-homology barcodes at the very interesting interface between the fast,
but phylogenetically limited, existing lower bounds to Rmin ; and the slow, but phylogenetically rich methods for reconstructing minimal ARGs. We have implemented the
11
computation of extended first-homology barcodes of genetic samples in publicly available software, called TARGet, that also attempts to partially reconstruct the tARG
using persistent homology generators. Although TARGet makes use of the above algorithm inspired in that of ref. [52], more efficient integer linear programming algorithms,
like that of ref. [41], can in principle be also generalized to the computation of extended
first-homology barcodes.
4
Examples
We consider three examples that illustrate how the formal developments presented in
previous sections can be used to extract useful phylogenetic information from samples of genetic sequences. The first example is a simple toy model where an explicit
minimal ARG can be easily constructed. It displays how the information contained
in the extended first-homology barcode of the sample directly maps to features of
minimal ARGs. The second example, based on simulated data of two sexually reproducing populations exchanging genetic material at low rate, shows the applicability
of persistent homology to large datatsets, consisting of several hundreds of sequences.
It also demonstrates the use of phylogenetic information contained in the extended
first-homology barcode to distinguish among various biological settings with similar
recombination rates. The last example consists of a 9 megabase scaffold in the genome
of 112 Darwin’s finches [57]. This example illustrates the applicability of the extended
first-homology barcode to real datasets.
4.1
A simple example
We illustrate the use and interpretation of extended barcodes with a simple example, consisting of a sample of 4 genetic sequences with 7 binary characters: 1111001,
1111111, 0000110 and 0000000. Minimal ARGs explaining this sample require two
single-crossover recombination events. One such minimal ARG is presented in fig. 6a.
The most ancestral recombination event involves genetically distant parental gametes,
leading to a large recombination loop. On the contrary, the most recent recombination
event involves genetically close parental gametes, leading to a small recombination
loop in the ARG. These features are captured by the extended first-homology barcode of the 4 sequences (fig. 6b), which consists of two bars, corresponding to the
two recombination events. The length and position of the bars represent the genetic
scales associated to the recombination events, with the high (low) genetic scale bar
12
corresponding respectively to the large (small) recombination loop. The position of
the crossover breakpoints associated to these recombination events is also correctly
reproduced. Hence, taking as input the 4 sequences, the extended first-homology barcode extracts phylogenetic information from minimal ARGs that explain the sample,
without requiring complete reconstruction of the ARGs.
Figure 6: Minimal ARG, extended first-homology barcode and reconstructed
tARG of a sample of 4 sequences. The four sampled sequences are represented by
green leaf nodes in the minimal ARG depicted in a). The minimal ARG involves two
single-crossover recombination loops. Both recombination loops and their genetic scales
are correctly captured by the extended first-homology barcode of the samples, shown
in b). Intervals containing the location of recombination breakpoints are indicated over
each bar. Persistent first-homology generators can be used to reconstruct the topology
of the tARG, as depicted in c). Without adding any extra sequences to the sample,
the two bars are associated to the same four generators, allowing only to reconstruct
the large envelope of the two recombination loops. Adding sequences E and F to the
sample (represented by blue leaf nodes in (a)) disentangles the generators of the two
recombination loops, fully reconstructing the topology of the tARG. The persistent
first-homology generators, and hence the reconstructed tARG, strongly depend on the
specific sampled nodes.
We note here the importance of using the extended barcode instead of the ordi13
nary barcode, used in previous applications of persistent homology [43]. In this simple
example b1 = 1, and only one of the two recombination events would have been detected if the ordinary barcode had been used. The extended barcode, implementing
a composite bound algorithm similar to that of Myers and Griffiths, largely increases
the sensitivity to recombination events.
Finally, we can attempt to reconstruct the tARG of the sample by using persistent
homology generators (fig. 6c). Whereas there are theorems ensuring the stability of
barcodes against small perturbations [49, 50], the generators of persistent homology
identified by TDA strongly depend on the sample, and many different choices of basis
are possible. In this simple example, both bars in the extended first-homology barcode
are generated by the four sampled sequences. Therefore, the reconstructed loop enclosing each recombination event is the same in both cases and corresponds to the large
enveloping loop in the minimal ARG (fig. 6a). Adding the internal nodes 1111000 and
1111110 to the sample permits disentangling the generators of the two recombination
loops (fig. 6c), fully reconstructing the topology of the underlying tARG. Hence, if
sampling frequency is high, this method also allows to approximate the tARG of the
sample with some accuracy.
4.2
Genetic exchange between two divergent populations
We now consider a more involved example consisting of two sexually-reproducing populations, simulated under the coalescent with recombination. The two populations
diverged 24N generations before present. Their effective population sizes are taken to
be constant and given by N and N/5. We consider two different cases, depicted in
fig. 7. In the first case (fig. 7a), the two populations are completely isolated from each
other. In the second case (fig. 7b), on the contrary, there is a small migration rate
between the two populations, with a migration event occurring every 2N generations
on average. The recombination rate is the same in both cases. Alternatively, in a phylogenetic context, this setting describes the incomplete lineage sorting of two species,
with or without the presence of gene flow.
We randomly sample 250 sequences from the large population and 50 sequences from
the small population. The full sample consists of 300 sequences with 300 segregating
sites. We present in fig. 7 the extended first-homology barcode for simulated samples
in each of the above two cases. Whereas the number of detected recombination events
in the tARG, counted by the number bars, is similar in both cases, their genetic scales
are very different. In particular, in presence of migration the scale of some of the
14
Figure 7:
Extended first-homology barcode of two divergent sexually-
reproducing populations. The case in a) assumes the two populations are completely isolated. All recombination events present in the extended first-homology barcode involve genetically close parental gametes. The case in b) considers a small migration rate between the two populations. Some of the recombination events present in
the extended first-homology barcode involve genetically distant parental strains. The
total number of detected recombination events is similar in both cases and uniform
across the entire genome. Intervals with the location of the recombination breakpoints
are indicated for each recombination event, where positions refer to segregating sites.
recombination loops is large, corresponding to migration events that are followed by a
recombination event (fig. 7b).
Note that in this example the extended first-homology barcode provides rich phylogenetic information that can be hardly obtained by other methods. Methods that
attempt to construct a minimal (or nearly-minimal) ARG are computationally inefficient for such large sample sizes [41, 58]. Fast bound methods [41, 51, 52], on the
15
other hand, do not provide enough phylogenetic information to distinguish between
the cases with and without migration, as the total recombination rate is the same in
both situations. Sequentially Markov coalescent (SMC) approaches [42] produce an
ARG that is far from being minimal but is a good approximation to the maximum
likelihood. However, these methods require an underlying coalescent model, whose
mutation, recombination and population structure parameters are not given with the
sample. Finally, algorithms for constructing phylogenetic split networks [29] are fast
and provide very different outputs in each of the above two cases. However, the interpretation of their output in terms of recombination and migration events is obscure.
4.3
Darwin’s finches
The previous examples served to illustrate the relation between features of the extended
first-homology barcode of a genetic sample and those of the minimal ARGs explaining
the sample. However, both examples were based on simulated data. We now consider
a more realistic example, consisting of the genetic sequences of 112 Darwin’s finches,
belonging to 15 different species inhabiting the Galápagos archipielago and Cocos Island [57]. We aligned and genotyped a 9 megabase scaffold of their genome and, after
filtering for high-quality variants, we focussed on a set of 140 SNPs that were homozygous across the 112 samples, thus avoiding potential phasing artefacts. By considering
this set, we mostly restrict to very ancestral recombination/gene flow events, close to
the origin of radiation from a common ancestor 1.5 million years ago [59]. We used
TARGet to obtain the extended first-homology barcode of the sample, as well as the
partially reconstructed tARG.
The extended first-homology barcode (fig. 8a) contains 13 recombination events,
mostly involving samples from multiple species and usually including samples from
the genus Certhidea (fig. 8b), the most ancestral lineage among the genera present in
the sample [57]. These results add support to the evidence for genetic introgression
found in [57]. Our analysis also reveals that the crossover breakpoints of these events
localize at four different genomic regions within the 9 megabase scaffold that we have
considered in this example (fig. 8c).
5
Discussion
As the famous title of the essay by Dobzhansky “Nothing in Biology Makes Sense Except in the Light of Evolution” underscores, evolutionary processes are central orches16
Figure 8: Extended first-homology barcode and partially reconstructed
tARG of a sample of 112 Darwin’s finches. The extended first-homology barcode
is shown in a), based on 140 homozygous SNPs present in a 9 megabase scaffold. In
total, 13 recombination/gene flow events are captured in the extended barcode, with
different genetic scales. Bars are colored according to the position of the corresponding
recombination breakpoint in the genome, as depicted in c). We also indicate the number of recombination events detected at each genomic interval, as well as some of the
orthologous genes present at regions where recombination events are detected. The reconstructed tARG is presented in b). Recombination loops in the reconstructed tARG
are outlined using the same code of colors. We have also included leaf nodes that do
not participate in any recombination event, using a nearest neighbour algorithm based
on genetic distance. Edge lengths are arbitrary.
trating themes in biology. Mutations, recombinations and other evolutionary processes
get imprinted into genomes through selection, reflecting the accumulated history giving
rise to an organism. Phylogenetics try to reconstruct the evolutionary history through
the comparison of genomes of related organisms. In addition to reporting relationships and elucidating particular histories, one would like to understand and quantify
17
how different evolutionary processes have occurred. The identification and quantification of evolutionary processes can be challenging due to the lack of a well-established
universal framework to capture evolutionary relationships beyond trees. In addition,
robust statistical inference needs to exploit the large number of genomes that are now
becoming available, aggravating the computational burden and obscuring interpretations. Ideally, we would like to have a biologically interpretable framework able to
quantify different evolutionary processes by analyzing large numbers of genomes.
In this paper we have proposed a few steps in this direction. We have extended the
notion of barcodes in persistent homology to identify the genetic scale and number of
recombination events. We have shown that, by correctly studying persistent homology in subsets of segregating sites, it is possible to characterize the genomic regions
where recombination takes place and identify the gametes involved in particular recombination events. The persistent homology barcodes derived from each of these sets
can be structured as an “extended barcode” where each bar captures a recombination
event. Extended barcodes can be interpreted as counting and quantifying the scale
of recombination events in a variation of Ancestral Recombination Graphs (ARGs).
Topological ARGs represent a summary of potential recombination histories that can
explain the data. The method proposed, TARGet, is scalable to hundreds of genomes.
As an alternative to some phylogenetic networks, extended barcodes provide robust
quantification of events, the distribution of genetic scales, computational scalability
and interpretative graphs.
6
6.1
Methods
Algorithm implementation
We implemented the algorithm for the computation of the extended first-homology barcode
and tARG reconstruction in publicly available multi-threaded software, TARGet, which is
distributed under the GNU General Public License (GPL v3). The application is fully written
in Python 2.7, and relies on Dionysus C++ library for persistent homology computations
(http://www.mrzv.org/software/dionysus).
6.2
Population genetics simulations
We performed 4000 simulations of a sample of 40 sequences with 12 segregating sites, using
the software ARGweaver [42]. The population was simulated using a coalescent infinite sites
model with recombination. Population-scaled recombination rate, ρ, was randomly generated
18
in each simulation, taking values from a uniform distribution between 0 and 110. For each
simulated sample, Myers and Griffiths lower bound RMG ≤ Rmin was computed using the
software RecMin [52], with parameters -s 12 -w 12. Lower bounds b̄1 ≤ Rmin were computed
using our application TARGet, with parameters -s 12 -w 12.
Samples of genetic exchange between two divergent populations were simulated using the
software ms [60], using the commands,
ms 300 1 -s 300 -r 40 10000 -I 2 250 50 -ej 6.0 1 2 -n 2 0.2 -m 1 2 0.5
and,
ms 300 1 -s 300 -r 40 10000 -I 2 250 50 -ej 6.0 1 2 -n 2 0.2
respectively for the cases with and without migration. The extended first-homology barcode
of each sample was computed using TARGet with parameters -s 12 -w 14 -e.
6.3
Darwin’s finches genotyping
Raw paired-end reads from 112 Darwin finches [57] were obtained from SRA archive (accession number PRJNA263122) and aligned against the consensus sequence of Geospiza Fortis,
version GeoFor 1.0/geoFor1, scaffold JH739904. We followed essentially the same procedure
than that of ref. [57] for the alignment, SNP calling, genotyping and filtering. In short, the
alignment was performed with Burrows-Wheeler aligner (BWA) [61], version 0.7.5, using
BWA-MEM algorithm and default parameters. PCR duplicates were marked using Picard
tools (http://picard.sourceforge.net/). Indel realignment, SNP discovery and simultaneous genotyping across the 112 samples was performed using Genome Analysis Toolkit
(GATK) [62], following GATK best practice recommendations [63]. SNP calls were filtered
by keeping variants with SNP quality > 100, total depth of coverage > 117 and < 1750,
ratio between SNP quality and depth of coverage > 2, Fisher strand bias < 60, mapping
quality > 50, mapping quality rank > -4 and read position rank sum > -2. To avoid phasing
errors, we only considered SNPs that were homozygous across the 120 samples. The resulting
genotypes were processed with TARGet for extended first-homology barcode computation and
tARG reconstruction, using the options -s 14 -w 14.
Acknowledgments: We thank D. S. Rosenbloom, K. Emmett, M. Lesnick, S. Zairis
and V. Aleksandrov for useful discussions and comments. The work of the authors is
supported by NIH grant 1 U54 CA121852-05.
19
References
[1] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial
sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921.
[2] Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The
sequence of the human genome. science. 2001;291(5507):1304–1351.
[3] Consortium GP, et al. An integrated map of genetic variation from 1,092 human
genomes. Nature. 2012;491(7422):56–65.
[4] Chin L, Andersen JN, Futreal PA. Cancer genomics: from discovery science to
personalized medicine. Nature medicine. 2011;17(3):297–303.
[5] Brennan CW, Verhaak RG, McKenna A, Campos B, Noushmehr H, Salama SR,
et al. The somatic genomic landscape of glioblastoma. Cell. 2013;155(2):462–477.
[6] Frattini V, Trifonov V, Chan JM, Castano A, Lia M, Abate F, et al. The integrated landscape of driver genomic alterations in glioblastoma. Nature genetics.
2013;45(10):1141–1149.
[7] Zairis S, Khiabanian H, Blumberg AJ, Rabadan R. Moduli Spaces of Phylogenetic
Trees Describing Tumor Evolutionary Patterns. In: Brain Informatics and Health.
Springer; 2014. p. 528–539.
[8] Edwards RA, Rohwer F. Viral metagenomics. Nature Reviews Microbiology.
2005;3(6):504–510.
[9] Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH.
Genome sequences of rare, uncultured bacteria obtained by differential coverage
binning of multiple metagenomes. Nature biotechnology. 2013;31(6):533–538.
[10] Rinke C, Lee J, Nath N, Goudeau D, Thompson B, Poulton N, et al. Obtaining genomes from uncultivated environmental microorganisms using FACS–based
single-cell genomics. Nature Protocols. 2014;9(5):1038–1048.
[11] Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng JF, et al.
Insights into the phylogeny and coding potential of microbial dark matter. Nature.
2013;499(7459):431–437.
[12] Darwin C. On the origins of species by means of natural selection. London:
Murray. 1859;.
20
[13] Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of
bacterial innovation. Nature. 2000;405(6784):299–304.
[14] Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, et al. Evidence
for lateral gene transfer between Archaea and bacteria from genome sequence of
Thermotoga maritima. Nature. 1999;399(6734):323–329.
[15] Burke JM, Arnold ML. Genetics and the fitness of hybrids. Annual review of
genetics. 2001;35(1):31–52.
[16] Adams KL, Wendel JF. Polyploidy and genome evolution in plants. Current
opinion in plant biology. 2005;8(2):135–141.
[17] Doolittle WF, Papke RT. Genomics and the bacterial species problem. Genome
biology. 2006;7(9):116.
[18] Doolittle WF.
Phylogenetic classification and the universal tree.
Science.
1999;284(5423):2124–2128.
[19] Woese CR. A new biology for a new century. Microbiology and Molecular Biology
Reviews. 2004;68(2):173–186.
[20] OMalley MA, Koonin EV. How stands the Tree of Life a century and a half after
The Origin. Biol Direct. 2011;6:32.
[21] Huson DH, Rupp R, Scornavacca C. Phylogenetic networks: concepts, algorithms
and applications. Cambridge University Press; 2010.
[22] Huson DH, Scornavacca C. A survey of combinatorial methods for phylogenetic
networks. Genome biology and evolution. 2011;3:23–35.
[23] Morrison DA. Introduction to phylogenetic networks. RJR Productions. 2011;.
[24] Bandelt HJ, Dress AW. Split decomposition: a new and useful approach to
phylogenetic analysis of distance data. Molecular phylogenetics and evolution.
1992;1(3):242–252.
[25] Bandelt HJ, Forster P, Sykes BC, Richards MB. Mitochondrial portraits of human
populations using median networks. Genetics. 1995;141(2):743–753.
[26] Huson DH. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics. 1998;14(1):68–73.
21
[27] Bandelt HJ, Forster P, Röhl A. Median-joining networks for inferring intraspecific
phylogenies. Molecular biology and evolution. 1999;16(1):37–48.
[28] Bandelt HJ, Macaulay V, Richards M. Median networks: speedy construction
and greedy reduction, one simulation, and two case studies from human mtDNA.
Molecular phylogenetics and evolution. 2000;16(1):8–28.
[29] Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Molecular biology and evolution. 2006;23(2):254–267.
[30] Gusfield D. ReCombinatorics: The Algorithmics of Ancestral Recombination
Graphs and Explicit Phylogenetic Networks. MIT Press; 2014.
[31] Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. The American Journal of Human Genetics. 2006;79(5):910–922.
[32] Sutter NB, Bustamante CD, Chase K, Gray MM, Zhao K, Zhu L, et al. A
single IGF1 allele is a major determinant of small size in dogs.
Science.
2007;316(5821):112–115.
[33] Wu Y. Association mapping of complex diseases with ancestral recombination
graphs: models and efficient algorithms.
Journal of Computational Biology.
2008;15(7):667–684.
[34] Le SQ, Durbin R. SNP detection and genotyping from low-coverage sequencing
data on multiple diploid samples. Genome research. 2011;21(6):952–960.
[35] Wall JD. A comparison of estimators of the population recombination rate. Molecular Biology and Evolution. 2000;17(1):156–163.
[36] Wang L, Zhang K, Zhang L. Perfect phylogenetic networks with recombination.
Journal of Computational Biology. 2001;8(1):69–78.
[37] Bordewich M, Semple C. On the computational complexity of the rooted subtree
prune and regraft distance. Annals of combinatorics. 2005;8(4):409–423.
[38] Bordewich M, Semple C.
Computing the minimum number of hybridization
events for a consistent evolutionary history.
2007;155(8):914–928.
22
Discrete Applied Mathematics.
[39] Gusfield D, Eddhu S, Langley C. Optimal, efficient reconstruction of phylogenetic
networks with constrained recombination. Journal of bioinformatics and computational biology. 2004;2(01):173–213.
[40] Gusfield D. Optimal, efficient reconstruction of root-unknown phylogenetic networks with constrained and structured recombination. Journal of Computer and
System Sciences. 2005;70(3):381–398.
[41] Song YS, Wu Y, Gusfield D. Efficient computation of close lower and upper
bounds on the minimum number of recombinations in biological sequence evolution. Bioinformatics. 2005;21(suppl 1):i413–i422.
[42] Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS genetics. 2014;10(5):e1004342.
[43] Chan JM, Carlsson G, Rabadan R. Topology of viral evolution. Proceedings of
the National Academy of Sciences. 2013;110(46):18566–18571.
[44] Emmett KJ, Rabadan R. Characterizing Scales of Genetic Recombination and
Antibiotic Resistance in Pathogenic Bacteria Using Topological Data Analysis.
In: Brain Informatics and Health. Springer; 2014. p. 540–551.
[45] Emmett K, Rosenbloom D, Camara P, Rabadan R. Parametric inference using persistence diagrams: A case study in population genetics. arXiv preprint
arXiv:14064582. 2014;.
[46] Carlsson G. Topology and data. Bulletin of the American Mathematical Society.
2009;46(2):255–308.
[47] Edelsbrunner H, Letscher D, Zomorodian A. Topological persistence and simplification. Discrete and Computational Geometry. 2002;28(4):511–533.
[48] Zomorodian A, Carlsson G. Computing persistent homology. Discrete & Computational Geometry. 2005;33(2):249–274.
[49] Cohen-Steiner D, Edelsbrunner H, Harer J. Stability of persistence diagrams.
Discrete & Computational Geometry. 2007;37(1):103–120.
[50] Chazal F, Cohen-Steiner D, Guibas LJ, Mémoli F, Oudot SY. Gromov-Hausdorff
Stable Signatures for Shapes using Persistence. In: Computer Graphics Forum.
vol. 28. Wiley Online Library; 2009. p. 1393–1403.
23
[51] Hudson RR, Kaplan NL. Statistical properties of the number of recombination
events in the history of a sample of DNA sequences. Genetics. 1985;111(1):147–
164.
[52] Myers SR, Griffiths RC. Bounds on the minimum number of recombination events
in a sample history. Genetics. 2003;163(1):375–394.
[53] Griffiths RC, Marjoram P. Ancestral inference from samples of DNA sequences
with recombination. Journal of Computational Biology. 1996;3(4):479–502.
[54] Griffiths RC, Marjoram P. An ancestral recombination graph. Institute for Mathematics and its Applications. 1997;87:257.
[55] Hudson RR. Properties of a neutral allele model with intragenic recombination.
Theoretical population biology. 1983;23(2):183–201.
[56] Ghrist R. Barcodes: the persistent topology of data. Bulletin of the American
Mathematical Society. 2008;45(1):61–75.
[57] Lamichhaney S, Berglund J, Almén MS, Maqbool K, Grabherr M, Martinez-Barrio
A, et al. Evolution of Darwin/’s finches and their beaks revealed by genome
sequencing. Nature. 2015;518(7539):371–375.
[58] Song YS, Hein J. Parsimonious reconstruction of sequence evolution and haplotype
blocks. In: Algorithms in Bioinformatics. Springer; 2003. p. 287–302.
[59] Petren K, Grant P, Grant B, Keller L. Comparative landscape genetics and the
adaptive radiation of Darwin’s finches: the role of peripheral isolation. Molecular
Ecology. 2005;14(10):2943–2957.
[60] Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic
variation. Bioinformatics. 2002;18(2):337–338.
[61] Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics. 2009;25(14):1754–1760.
[62] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al.
The Genome Analysis Toolkit: a MapReduce framework for analyzing nextgeneration DNA sequencing data. Genome research. 2010;20(9):1297–1303.
24
[63] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al.
A framework for variation discovery and genotyping using next-generation DNA
sequencing data. Nature genetics. 2011;43(5):491–498.
25