Maximum Likelihood of Phylogenetic Networks

Maximum Likelihood of Phylogenetic Networks
Guohua Jin1 , Luay Nakhleh1 , Sagi Snir2 , and Tamir Tuller3
?
1
2
Dept. of Computer Science Rice University Houston, TX, USA
Dept. of Mathematics University of California Berkeley, CA, USA
3
School of Computer Science. Tel-Aviv University, Israel.
Abstract. Horizontal gene transfer (HGT) is believed to be ubiquitous among
bacteria, and plays a major role in their genome diversification as well as their
ability to develop resistance to antibiotics. In light of its evolutionary significance
and implications in human health, developing accurate and efficient methods for
detecting and reconstructing HGT is imperative. In this paper we provide a first
likelihood framework for phylogeny-based HGT detection and reconstruction.
Beside the formulation of various likelihood criteria, we offer novel hardness results and heuristics for efficient and accurate reconstruction of HGT under these
criteria. We implemented our heuristics and used them to analyze biological as
well as synthetic data. Our methods exhibited very good performance on the biological data, and preliminary results on the synthetic data show a similar trend.
Further, preliminary results on synthetic data show good promise for detecting
chimeric (partial) HGT, which has been mostly ignored by existing computational methods. Implementation of the criteria as well as heuristics is available
from the authors upon request.
1
Introduction
Unlike eukaryotes, which evolve largely through vertical lineal descent, bacteria acquire
genetic material through the transfer of DNA segments across species boundaries—a
process known as horizontal gene transfer (HGT). This process plays a major role in
bacterial genome diversification [10, 11, 26, 25, 5, 22], and is a significant mechanism
by which bacteria develop resistance to antibiotics [12]. In the presence of HGT, the
evolutionary history of a set of organisms is modeled by a phylogenetic network, which
is a directed acyclic graph obtained by positing a set of edges between pairs of the
branches of an organismal tree to model the horizontal transfer of genetic material [29].
Therefore, to reconstruct an accurate Tree (or Network) of Life and to unravel bacterial genomic complexities, developing accurate criteria and efficient methods for reconstructing and assessing the quality of phylogenetic networks is imperative. A large
body of work has been introduced in recent years to address phylogenetic network reconstruction and evaluation. In general, three categories of non-treelike models have
been addressed, all of which have been introduced under the umbrella concept of phylogenetic networks. However, major differences exist among the three categories. Splits
networks are graphical models that capture incompatibilities in the data due to various
factors, not necessarily HGT or hybrid speciation; examples of methods that address
these networks are given in [38, 6, 21]. The second category is that of recombination
networks, which are used to model the evolution of haplotypes and genes at the population level; see [2, 19] for examples. Phylogenetic networks are the extension of phylogenetic trees to enable the modeling of reticulation events, such as HGT and hybrid
?
Corresponding author, email [email protected]
2
Jin et al.
speciation (these are also called reticulate networks in [21]); examples of methods that
address these networks are given in [20, 29, 33, 28, 30]. See [27, 28] for detailed surveys
of the various phylogenetic network models and methodologies.
One of the most accurate and commonly used criteria for reconstructing phylogenetic trees is maximum likelihood (ML) [14]. Roughly speaking, this criterion considers
a phylogenetic tree from a probabilistic perspective as a generative model, and seeks the
model (i.e., tree) that maximizes the likelihood of observing a given set of sequences at
the leaves of the tree. Likelihood in the general network setting has been investigated
in the past by various works. However, no HGT-specific likelihood framework has ever
been suggested. von Haeseler and Churchill [39] provided a framework for evaluating
likelihoods on networks and subsequently [38] provided an approach to assess this likelihood. These works consider a network as an arbitrary set of splits and therefore fall
into the first category. They are characterized by the combined analysis approach, which
entails combining all gene datasets first (by sequence concatenation), and then analyze
the combined data set. A serious drawback of the latter approach is that when individual
genes are governed by different evolutionary mechanisms and models (a scenario that
is very common in reticulate evolution), combining multiple data sets is problematic
[32]. Likelihood on networks has also been considered in the setting of recombination
networks (see e.g. [18, 8]). These methods, similarly to ours, are indeed tailored to identify breakpoints along the given sequences, however, their underlying model is different
from ours as they model a different process.
In this work, we extend the ML criterion to handle specifically HGT-oriented (i.e.
phylogenetic) networks, and propose a set of criteria and efficient heuristics for computing them. Our extension is based on the fundamental observation that, barring recombination, the evolutionary history of a gene is modeled by a tree, such that a phylogenetic
network can be modeled by its constituent trees [31]. We propose a set of ML criteria for phylogenetic networks; these criteria differ in how the tree information is used,
which variant of the ML criterion is used, and finally what input is provided. Further,
we investigate the computational complexity of some of these criteria and devise a set
of efficient heuristics for reconstructing and evaluating phylogenetic networks based on
them. In particular, we prove that scoring the likelihood of a phylogenetic network is
NP-hard in general, and provide an empirically efficient exact algorithm for the problem, relying on the notion of bi-connected components. Further, we devise an efficient
branch-and-bound heuristic for the problem of adding a number of HGT edges to a tree
to obtain an optimal phylogenetic network. Finally, we make very preliminary progress
on the consistency of the ML criterion for phylogenetic networks, and show that, given
long enough sequences, ML outperforms the parsimony criterion for networks [30].
We have implemented our criteria and heuristics and studied their performance
on biological as well as synthetic data. For the biological data, we analyzed the Rubisco gene in eubacteria and plastids, which was previously analyzed by Delwiche and
Palmer, who postulated a set of HGT events for it [9]. For the synthetic data, we simulated multiple data sets with various HGT events and applied our techniques to the
data. In the case of the biological data, our criteria and heuristics performed very well
with respect to the identification of the correct number of HGT events as well as their
placements on the organismal trees. In preliminary results that we have obtained on
ML of Phylogenetic Networks
3
the synthetic data, our ML criteria have shown similar trends in detecting the correct
number and placement of the HGT edges, and exhibited very good performance in
identifying HGT breakpoints within genes, contributing towards addressing chimeric
HGT [3].
2
Maximum Likelihood of Phylogenetic Networks
2.1 Preliminaries and Definitions
Let T = (V, E) be a tree, where V and E are the tree nodes and tree edges, respectively,
and let L(T ) denote its leaf set and I(T ) its internal nodes. Further, let χ be a set of
taxa (species). Then, T is a phylogenetic tree over χ if there is a bijection between χ
and L(T ). Henceforth, we will identify the taxa set with the leaves they are mapped to,
and let [n] = {1, . . . , n} denote the set of leaf-labels. A tree T is said to be rooted if
the set of edges E is directed and there is a single distinguished internal vertex r with
in-degree 0.
A phylogenetic network N = N (T ) = (V 0 , E 0 ) over the taxa set χ is derived from
T = (V, E) by adding a set H of edges to T , where each edge h ∈ H is added as
follows: (1) split an edge e ∈ E by adding new node, ve ; (2) split an edge e0 ∈ E by
adding new node, ve0 ; (3) finally, add a directed reticulation edge from ve to ve0 . The
edges in H must satisfy that the phylogenetic network should be acyclic and satisfy
additional temporal constraints [29]. Fig. 1(a) shows a phylogenetic network obtained
by adding the edge (X, Y ) to the underlying organismal tree. The tree in Fig. 1(b)
Y
X
A
B
(a)
C
D
A
B
C
D
A
B
(b)
C
D
(c)
Fig. 1. (a) A phylogenetic network with a single HGT even from X to Y . (b) The underlying
organismal (species) tree. (c) The tree of a horizontally transferred gene.
models the evolution of all genetic material that is vertically inherited from the ancestral
organism, whereas the tree in Fig. 1(c) models the evolution of horizontally transferred
genetic material. We denote by T(N ) the set of all trees contained inside network N .
Each such tree is obtained by the following two steps: (1) for each node of in-degree
2, remove one of the incoming edges, and then (2) for every node x of in-degree and
out-degree 1, whose parent is u and child is v, remove node x and its two adjacent
edges, and add a new edge from u to v. For example, the set T (N ) of the network N in
Fig. 1(a) contains only the two trees which are shown in Fig. 1(b) and 1(c).
2.2 Likelihood of Phylogenetic Networks
Under the ML criterion, a phylogenetic tree is viewed as a probabilistic model from
which input sequences S are assumed to be sampled. For given input sequences S the
ith site, Si , is the set of values at the ith position for every sequence in S 4 . In this
4
can be viewed as the ith column when the sequences are aligned
4
Jin et al.
work we assume that sites evolve Independently and Identically Distributed (iid). Since
the parameters of the phylogenetic tree, M , are unknown, they are usually estimated
from the observed sequences by maximizing the likelihood function P (S|M ) [14]. In
general, the overall likelihood of the aligned sequences S given the model M is obtained
by the product of the likelihood of every site i given M as follows:
L(S|M ) =
k
Y
L(Si |M ),
(1)
i=1
where k is the sequence length. Therefore, we will consider the likelihood of a single
site henceforth.The most likely model is the one maximizing Equation 1.
When calculating the likelihood of a tree per a given site, two variants are considered: average likelihood [37] and ancestral likelihood [36] (the former is the more
popular of the two). The ML criterion assumes a model of evolution. We consider here
the Jukes-Cantor model of sequence evolution [24]. However, all the results here can be
generalized to any other model of sequence evolution. Given a set of aligned sequences
S ∈ Σ n×k , a tree T with I(T ) internal nodes (|I(T )| ≤ n − 2), and the edge transition
probabilities p, Lav (Si |T, p), the average likelihood of obtaining site Si under T is
defined as
X
Y
Lav (Si |T, p) =
m(pe , Si , a) ,
(2)
a∈Σ |I(T )| e∈E(T )
where a ranges over all combinations of assigning labels to the I(T ) internal nodes of
T . Each term m(pe , Si , a) is either pe /(|Σ|−1) or (1−pe ), depending on whether in the
i-th site of S and a, the two endpoints of e are assigned different character states (and
then m(pe , Si , a) = pe /(|Σ| − 1)) or the same character state (and then m(pe , Si , a) =
1−pe ). In the other variant of likelihood for phylogenetic trees, the ancestral likelihood,
we replace the summation in Equation 2 with maximization as follows:
Y
Lanc (Si |T, p) = max
m(pe , Si , a).
(3)
a∈Σ |I(T )|
e∈E(T )
That is, we seek the unique labeling to internal nodes to maximize the expression. The
ML solution (or solutions) for a specific tree T is the point (or points) in the edge space
p = [pe ]e∈E(T ) that maximizes the expression L(S|T, p) of Equations 2 and 3, where
M = (T, p). We will refer to both criteria when evaluating the likelihood of a network.
The natural way for generalizing this setting to networks is as follows. The topology
of a phylogenetic network is defined as above, however in this case since tree edges have
transition probabilities, when adding a reticulation edge between the edges e, e0 ∈ E
we should mention where along the edges e, and e0 we add the two new vertices. The
transition probability of the reticulation edge is always 0, meaning there are no substitutions along it (the reason being that HGT is instantaneous at the scale of evolution).
However each reticulation edge r = (e, e0 ) has reticulation probability br associated
with it. This probability denotes the probability of a DNA segment being transferred
along that edge. Fig. 2 describes a simple phylogenetic network.
Let re(T ) denote the set of reticulation edges used to obtain tree T in the network
N , and let H(N ) denote the set of all reticulation edges in N . Let
Y
Y
P (T ) =
br
(1 − br ),
r∈re(T )
r∈H(N )\re(T )
ML of Phylogenetic Networks
5
i
P(
i,f)
e,i)
P(
e
f
P(f,Y)
P(X,B)
P(Y,C)
P(
A,
Y
A
B
C
)
X
f,D
P(
e)
P(e,X)
D
Fig. 2. Simple example of phylogenetic network under the likelihood setting.
where b denotes the reticulation edge probabilities.
The likelihood of a network is obtained as a function of the likelihoods of the trees
contained in it. Here again we consider two variants. In the first, the likelihood is the
sum of the likelihood of all the trees of the network, where for each tree T we also
need to multiply the resultant likelihood by P (T ). Again, we can choose between the
two tree likelihood functions described above (Equations 2 and 3). Thus we get the
following equation:
Lall (Si |N, p, b) =
X
P (T ) · L(Si |T, p)
(4)
T ∈T(N )
In the other variant we want to reconstruct the sequence of reticulation events. Thus we
want to find for each site one tree such that the likelihood of the leaf labels is maximized;
we get the following equation:
Lbest (Si |N, p, b) = max (P (T ) · L(Si |T, p))
T ∈T(N )
(5)
We stress that a tree likelihood can be computed only once the network is given as it is
a component in the network. In order to complete the definition of the maximum likelihood of phylogenetic networks, we add the last criterion which is the type of the input
provided. Therefore, we can define multiple ML criteria, depending on three issues:
1. Likelihood criterion for the trees: Is ancestral likelihood or average likelihood
used to assess each tree likelihood in the network?
2. Tree criterion: At each site, is the best tree likelihood or the sum of the likelihoods
over all trees taken?
3. Input data: Which of the network’s parameters (topology and/or probabilities) are
given? We consider three possibilities:
(a) The tiny problem: The network topology, transition probabilities and reticulation probabilities are given.
(b) The small problem: The initial tree and reticulation edges (i.e. network topology) are given but not the transition or reticulation probabilities.
(c) The big problem: An initial tree is given and we want to add reticulation edges.
The above gives rise to 12 variants of the network maximum likelihood problem. We
will refer to each of these variants by a triplet X–Y –Z, where X ∈ {ancestral, average},
Y ∈ {all, best}, and Z ∈ {tiny, small, big}.
6
3
Jin et al.
Properties of the ML Criterion
In this section we provide consistency and hardness results to some of the problems
we formulated above. For simplicity we assume here the Neyman 2-state model of
sequence evolution [34]; however all the results here can be easily generalized to more
complex models. For lack of space all the proofs are deferred to the appendix.
3.1 Optimality and Identifiability
A reconstruction method is said to be consistent if the probability that it reconstructs the
true tree tends to 1 as the sequence length tends to infinity. A model (tree or network)
generating some data D is said to be identifiable by D if D has enough information to
identify the model uniquely. Consistency and identifiability in phylogenetics are very
topical issues, sparked by Felsenstein’s seminal observation that MP is not consistent.
Here we address two related aspects of these issues for the best–tiny (ancestral and
average) versions of the problem. We define a reconstruction method for the best–tiny
problem to be deterministic if for the given network N , and a specific character λ0 , the
method will always return the same tree T0 ∈ T (N ).
It is easy to see that both ML (average and ancestral) and MP are deterministic under the the best–tiny problem. It is also easy to see that when the length of the sequences
generated by the network tends to infinity, all network’s probabilities are greater than
zero, and the network is not a tree, the probability of a deterministic method to reconstruct correctly all sites tends to a number strictly less than 1. Therefore, the notion of
consistency does not apply to the best–tiny problem. Nevertheless, we can aim at reconstructing correctly the maximum number of sites and we say that a method µ is optimal
if when the number of sites (sequence lengths) tends to infinity, there is no other deterministic method µ0 that reconstructs more sites than µ. In the appendix we prove that
ML is optimal under average–best–tiny problem while MP is not.
3.2 ML on Networks is NP-hard
Finding a phylogenetic networks when only the sequences are given is naturally hard
as the big average and ancestral problems are hard for trees [1, 7]. However in our case
we assume an initial phylogenetic tree and seek to improve the likelihood by adding
reticulation edges. We conjecture that these variants of the problems are hard. For a first
step we prove the hardness of the following versions of the problem (the proofs are in
the appendix): Tiny best tree ancestral, tiny best tree average, tiny all tree ancestral, tiny
all tree average, and small best ancestral tree. The results are established by a reduction
from the 3SAT problem and the max-2-sat problem [16].
4
Algorithms for the Tiny Versions
The algorithms we describe in this section handle the tiny versions of the ML problem.
As we showed, a polynomial time algorithm is unlikely to be found. Therefore we here
devise algorithms that are not polynomial but may have good empirical performance.
4.1 The Naive Algorithm
As is well-known, the tiny versions of ML problem on trees can be solved by the simple
dynamic programming algorithm of Felsenstein [14] for average ML and by the algorithm of Pupko et al. [36] for the ancestral ML. Therefore the most naive solution is to
ML of Phylogenetic Networks
7
search over all the network’s trees separately and either sum their likelihood or select
the optimal. The complexity of such an approach is O(n)2r for each site where r is the
number of reticulation edges in the network and n is the size of the tree. As can be easily
seen, this approach is fairly naive and can result in a large unnecessary computations.
4.2
The Component-Wise Naive Algorithm
We now describe a more efficient algorithm that takes into consideration independent
components in the network. For a network N and a node v ∈ V (N ), let Nv denote the
graph induced by the nodes reachable from v. Let u, v be two nodes in N . We say that u
and v are unrelated if u ∈
/ Nv and v ∈
/ Nu . In order to compute likelihood in a bottomup fashion, we need the following property to hold: for every two internal nodes u, v,
either Nu ⊆ Nv or Nu ∩ Nv = ∅. Indeed, this is the underlying principle in Felsenstein
pruning algorithm to compute likelihood on a tree [14].
Definition 1. A biconnected component (bi-component) [17] is a subgraph induced by
a maximal set of vertices W , such that in the underlying graph, there are two vertex
disjoint paths between any two vertices in W (we assume all the edges are undirected).
A node in a bi-component B is a leaf in B if it has no children in (the directed graph
of) B. Otherwise it is internal in B. Also, a node is a root in B if it has no ancestor in
B. It is easy to see that every bi-component has at least one leaf and exactly one root.
Also, every two bi-components are internal-vertex-disjoint.
Observation 1 Let B be a bi-component with a root r. Then for every internal node
u ∈ V (B) \ {r} there is an unrelated internal node v ∈ V (B) s.t. Nu ∩ Nv 6= ∅ (and
by definition Nu 6⊆ Nv ). In particular, every leaf in a bi-component has at least two
parents.
The above observation yields that in a bi-component, we cannot run the simple dynamic
programming algorithm. On the other hand, applying the naive algorithm is costly and
unnecessary as it can be avoided between different bi-components.
Definition 2. Let N be a network. Then B(N ), the bi-component graph of N is a
graph (V (B), E(B)) where V is the set of bi-components in N and two bi-components
B1 , B2 are connected by a (directed) edge in B(N ) if (1) there is an edge in N between
v1 ∈ B1 and v2 ∈ B2 , and (2) there is a vertex v ∈ B1 , B2 , and v is a leaf in B1 (and
necessarily internal in B2 ).
Observation 2 Let N be a phylogenetic network. Then B(N ) is a tree.
Based on the above observation, a better solution is to decompose the network into
bi-components, run the naive exhaustive algorithm inside every bi-component and the
tree algorithm in the bi-component. Let r(B) be the number of reticulation edges in
a bi-component B, B∗ the largest bi-component in N and r∗ the maximum of r(B)
over all bi-components of N . Then, the above improvement reduces the complexity of
the algorithm to O(|B(N )|2r∗ |B ∗ |). This improvement can be quite important as the
bi-components can be sparse in a network with few reticulation edges.
8
5
Jin et al.
Heuristics for the Small and the big Versions
In contrast to the tiny problem, in the small versions the network topology is given
without the probabilities. In this case we have used the standard hill climbing technique
to assess the optimal parameters the given topology.
For accelerating the running time for the big problems, we use a branch and bound
(B&B) heuristic [35, 15]. In general, the B&B technique is based on the decreasing
monotonicity of the objective function with respect to partial inputs. In other words,
the value of the objective function for a certain input is not greater than any valid part
of that input. The B&B principal asserts that if the value of the objective function for
some partial input is smaller than some known value on the total input, it is worthless
to explore all inputs extending this partial input. B&B is normally applied to NP-hard
problem and can lead to very efficient running times. In our case, the input is a phylogenetic network and a set of characters and the objective function is the likelihood of
the network WRT the character set.
Let N = (V, E) denote a phylogenetic network. A phylogenetic network N 0 =
0
(V , E 0 ) is defined as edge separated subnetwork of N if V 0 ⊆ V , ∀(v1 , v2 ) ∈ E 0 ⇒
(v1 , v2 ) ∈ E, and there is a cut of a single edge e between (V 0 , V \ V 0 ). The following
observations establish an upper bound for the likelihood of a phylogenetic network.
Observation 3 If a network N 0 is a valid phylogenetic network and is an edge separated subnetwork of phylogenetic network N , then P (S|N 0 ) > P (S|N ). 5
For a set S of sequences of length k, we denote by Snm (n ≥ 0, m ≤ k) the set of these
sequences restricted only to the positions n ≤ i ≤ m
Observation 4 For any phylogenetic network, N and set S of sequences, we have
P (Snm |N ) ≥ P (S|N ).
Based on these observations, we propose the following heuristic: (1) Optimize the likelihood of sub-networks; if a candidate network contains an edge separated sub-network
with low likelihood score, it is not the network with the ML score; (2) Optimize the
likelihood of a network on part of the sites; if a candidate network has low likelihood
score for these sites, it is not the network with the ML score.
6
Empirical Performance
We implemented our ML criteria and algorithms and tested them on both biological
as well as synthetic data. For the biological data, we considered a 15-taxon dataset of
plastids, cyanobacteria, and proteobacteria, which is a subset of the dataset considered
by Delwiche and Palmer [9] and for which multiple HGT events were conjectured by
the authors. Due to space limitations and given that our results on synthetic data are
very preliminary, we omit these results; nonetheless, complete experimental results on
the synthetic data will be available for the final version of the paper. Details about the
biological dataset are in Section B in the appendix.
5
In the full version of this paper we prove a stronger claim. We show that a tree with τ reticulation edges, has a subtree such that when adding to this subtree τ reticulation edges it has
higher likelihood than the original tree
ML of Phylogenetic Networks
6.1
9
Data Analysis
Since the organismal tree and sequence alignment are given for both the biological
and synthetic datasets, our analysis amounted to solving the big version of ML on the
datasets. In other words, for each dataset (biological and synthetic), we sought edges
whose addition to the organismal trees gave optimal solutions to the big version of ML.
Our goal was to verify whether solving this problem would identify the HGT events
that Delwiche and Palmer conjectured for the biological data [9]. We also compared
our results to the results of [4]. We assessed the quality of our method with respect
to the number of HGT events it identified, and the actual placement of the HGT events
themselves on the organismal trees. To estimate the mutation and reticulation probabilities, we used the heuristic described in Section 5, where we obtained internal sequences
using ancestral ML, and then used normalized Hamming distances as probabilities.
6.2
Results and Discussion
We implemented a heuristic for solving the big ML problem with ancestral and average
likelihood, considering all trees inside a network as well as the best tree. Due to space
limitations, we show only the results of the version that uses ancestral likelihood and
sums over all trees. Results are shown in Fig. 3. We investigated the performance of ML
with respect to two main questions: (1) Does the ML criterion infer the correct number
Rhodobacter capsulatus
Rhodobacter sphaeroides I
4
x 10
H2
H1
H3
Nitrobacter
1.2
Thiobacillus denitrificans I
β Proteobacteria
H4
Alcaligenes H16 plasmid
Hydrogenovibrio II
Chromatium L
γ Proteobacteria
H6
H5
Thiobacillus ferrooxidans ATCC 19859
Prochlorococcus
Synechococcus
Gonyaulax
Cyanophora
1.15
1.1
1.05
Cyanobacteria
1
Dinoflagellate plastid
Glaucophyte Plastid
Cyanidium
Red and Brown Plastid
Euglena
Green Plastid
(a)
−log Likelihood
Thiobacillus denitrificans II
H0: 12212.2
+H1: 11538.8
+H2: 10885.5
+H3: 10341.0
+H4: 9931.3
+H5: 9519.71
+H6: 9432.43
1.25
α Proteobacteria
0.95
0
1
2
3
4
Number of HGT edges
5
6
(b)
Fig. 3. (a) The species tree of the 15 organisms, as reported by Delwiche and Palmer, and the 6
HGT edges inferred by our heuristic for the big ML problem. (b) The improvement in the likelihood score as a function of the number of HGT edges added to the species tree. The improvement
achieved by adding the sixth HGT edge is smaller compared to that achieved by adding the first
five events, which indicates that five HGT events suffice to model the evolution of the rbcL gene
for these 15 organisms. Likelihood criterion = ancestral, and tree criterion = all.
of HGT events? (2) Does the ML criterion infer the correct location of the HGT events?
For the 15-taxon biological data, Delwiche and Palmer hypothesized the occurrence
of several HGT events of the rubisco genes as opposed to ancient gene duplication and
loss. Our findings, based on the ML criterion, agree with the grouping of the organisms
based on he forms I and II rubisco gene, as hypothesized by Delwiche and Palmer. Detailed discussion of the results, shown in 3(a) is presented in Section B in the appendix.
From the definition of the ML criterion for networks it follows that as more edges
are added to the network, the likelihood score either remains the same or improves, but
10
Jin et al.
never becomes worse. Therefore, a significant question to investigate was whether the
improvement becomes relatively negligible at some point so that the addition of edges
could be stopped (stopping rule). To answer this question, we plotted the improvement
in the likelihood score as a function of the number of HGT edges added; the results
are shown in Fig. 3(b). The figure shows that while adding the first five edges achieves
drastic improvements in the likelihood score, adding the sixth edge results in a much
lower improvement, which is indicated by the slow decrease in the likelihood score.
appendix.
7
Conclusions and Future Research
Phylogenetic networks model evolutionary histories of sets of organisms in the presence of non-treelike evolutionary events such as HGT and hybrid speciation. In this
paper, we introduced a first maximum likelihood framework for reconstructing and
evaluating phylogenetic networks. This framework gave rise to an array of computational problems. In this paper, we addressed some of these problems, namely proving
the NP-hardness of the “tiny” variants, established mathematical properties of several
variants, and devised efficient heuristics and algorithms. We implemented our criteria and methods and analyzed a biological dataset of eubacteria and plastids [9], and
a large set of simulated data. We have obtained very good results on the biological
dataset, and preliminary results on the synthetic data have shown very good promise
as well. Many issues related to statistical analysis have not been addressed such as the
existence of multiple maxima or the relaxation of the i.i.d. assumption. However, as this
is the first paper to define this ML approach, we believe the most important points, such
as rigorous definitions, hardness of computations, preliminary heuristics and biological
relevance, have been addressed.
Naturally, there are still many open questions. Given the large size of the prokaryotic branch of the Tree of Life, and the estimated high rates of HGT in this branch, developing more computationally efficient algorithms and heuristics is imperative. Since
horizontal transfer involves large segments of DNA, it is clear that the neighboring sites
are not independent. In the future we intend to model this phenomena by using HMMs.
In this work we assume each reticulation edge has a reticulation probability associated
with it. We assume that the probabilities of different edges are independent. We intend
to explore models with distributions of reticulation probabilities where different reticulation edges are dependent. The empirical analysis done here was aimed to demonstrate
the viability of the ML criterion. In the future, we intend to use our method for analyzing new and larger biological datasets.
Acknowledgments
The authors would like to thank Derek Ruths for providing them with the simulated
data. We also like to thank Satish Rao for pointing out the issue of lateral edge independence. This work was supported in part by the Rice Terascale Cluster funded by NSF
under grant EIA-0216467, Intel, and HP. Sagi Snir was supported in part by NSF grant
CCR-0105533.
References
1. L. Addario-Berry, B. Chor, M. Hallett, J. Lagergren, A. Panconesi, and T. Wareham. Ancestral maximum likelihood of evolutionary trees is hard. Jour. of Bioinformatics and Comp.
Biology, 2(2):257–271, 2004.
ML of Phylogenetic Networks
11
2. V. Bafna and V. Bansal. Improved recombination lower bounds for haplotype data. In
Proceedings of the Ninth Annual International Conference on Computational Molecular Biology, pages 569–584, 2005.
3. U. Bergthorsson, K.L. Adams, B. Thomason, and J.D. Palmer. Widespread transfer of mitochondrial genes in flowering plants. Nature, 424:197–201, 2003.
4. Alix Boc and Vladimir Makarenkov. New efficient algorithm for detection of horizontal
gene transfer events. In WABI, pages 190–201, 2003.
5. J.R. Brown. Ancient horizontal gene transfer. Nat. Rev. Genet., 4:121–132, 2003.
6. D. Bryant and V. Moulton. NeighborNet: An agglomerative method for the construction of
planar phylogenetic networks. In R. Guigo and D. Gusfield, editors, Proc. 2nd Workshop Algorithms in Bioinformatics (WABI’02), volume 2452 of Lecture Notes in Computer Science,
pages 375–391. Springer Verlag, 2002.
7. B. Chor and T. Tuller. Maximum likelihood of evolutionary trees is hard. RECOMB, pages
296–310, 2005.
8. Husmeier D. and McGuire G. Detecting recombination with mcmc. Bioinformatics, 18:345–
353, 2002.
9. C. F. Delwiche and J. D. Palmer. Rampant horizontal transfer and duplicaion of rubisco
genes in eubacteria and plastids. Mol. Biol. Evol, 13(6), 1996.
10. W.F. Doolittle, Y. Boucher, C.L. Nesbo, C.J. Douady, J.O. Andersson, and A.J. Roger. How
big is the iceberg of which organellar genes in nuclear genomes are but the tip? Phil. Trans.
R. Soc. Lond. B. Biol. Sci., 358:39–57, 2003.
11. J.A. Eisen. Assessing evolutionary relationships among microbes from whole-genome analysis. Curr. Opin. Microbiol., 3:475–480, 2000.
12. I.T. Paulsen et al. Role of mobile DNA in the evolution of Vacomycin-resistant Enterococcus
faecalis. Science, 299(5615):2071–2074, 2003.
13. J. Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool., 22:240–249, 1978.
14. J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach.
J. Mol. Evol., 17:368–376, 1981.
15. J. Felsenstein. Phylip (phylogeny inference package) version 3.5c. distributed by the author.
department of genetics, university of washington, seattle. 1993.
16. M. R. Garey and D. S. Johnson. Computer and Intractability. Bell Telephone Laboratories,
incorporated, 1979.
17. M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980.
18. N.C. Grassley and E.C. Holmes. A likelihood method for the detection of selection and
recombination using nucleotide sequences. Mol Biol Evol, 14:239–247, 1997.
19. D. Gusfield and V. Bansal. A fundamental decomposition theory for phylogenetic networks
and incompatible characters. In Proceedings of the Ninth Annual International Conference
on Computational Molecular Biology, pages 217–232, 2005.
20. M. Hallett, J. Lagergren, and A. Tofigh. Simultaneous identification of duplications and
lateral transfers. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, pages 347–356, 2004.
21. D. H. Huson and D. Bryant. Application of phylogenetic networks in evolutionary studies.
Mol. Biol. Evol, 23(2):254–267, 2006.
22. R. Jain, M.C. Rivera, J.E. Moore, and J.A. Lake. Horizontal gene transfer accelerates genome
innovation and evolution. Molecular Biology and Evolution, 20(10):1598–1602, 2003.
23. G. Jin, L. Nakhleh, S. Snir, and T. Tuller. Parsimony of phylogenetic networks: Hardness
results and efficient algorithms and heuristics. submitted, 2006.
24. T. H. Jukes and C. R. Cantor. Evolution of protein molecules. in mammalian-protein
metabolism. H. N. Munro(Ed), 121-132 New York: Academic Press, 1969.
12
Jin et al.
25. C.G. Kurland. Something for everyone—horizontal gene transfer in evolution. Embo Reports, 1(2):92–95, 2000.
26. J.A. Lake, R. Jain, and M.C. Rivera. Mix and match in the tree of life. Science, 283:2027–
2028, 1999.
27. C.R. Linder, B.M.E. Moret, L. Nakhleh, and T. Warnow. Network (reticulate) evolution:
biology, models, and algorithms. In The Ninth Pacific Symposium on Biocomputing (PSB),
2004. A tutorial.
28. V. Makarenkov, D. Kevorkov, and P. Legendre. Phylogenetic network reconstruction approaches. Applied Mycology and Biotechnology (Genes, Genomics and Bioinformatics), 6,
2005. To appear.
29. B.M.E. Moret, L. Nakhleh, T. Warnow, C.R. Linder, A. Tholse, A. Padolina, J. Sun, and
R. Timme. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM
Transactions on Computational Biology and Bioinformatics, 1(1):13–23, 2004.
30. L. Nakhleh, G. Jin, F. Zhao, and J. Mellor-Crummey. Reconstructing phylogenetic networks
using maximum parsimony. Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference (CSB2005), 393:440–442, Jun 2005.
31. L. Nakhleh, D. Ruths, and H. Innan. Gene trees, species trees, and species networks. In
R. Guerra and D. Allison, editors, Meta-analysis and Combining Information in Genetics.
Chapman & Hall, CRC Press, 2005. In press.
32. L. Nakhleh, J. Sun, T. Warnow, C.R. Linder, B.M.E. Moret, and A. Tholse. Towards the development of computational tools for evaluating phylogenetic network reconstruction methods. In Proceedings of the Eighth Pacific Symposium on Biocomputing (PSB 03), 2003.
33. L. Nakhleh, T. Warnow, and C.R. Linder. Reconstructing reticulate evolution in species:
theory and practice. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, pages 337–346, 2004.
34. J. Neyman. Molecular studies of evolution: A source of novel statistical problems. In S.
Gupta and Y. Jackel, editors, Statistical Decision Theory and Related Topics, pages 1–27.,
1971. Academic Press, New York.
35. D. Penny and M. D. Hendy. Turbotree: a fast algorithm for minimal trees. Comput Appl
Biosci, 3(3):183–187, 1987.
36. T. Pupko, I. Pe’er, R. Shamir, and D. Graur. A fast algorithm for joint reconstruction of
ancestral amino-acid sequences. Mol. Biol. Evol, 17(6):890–896, 2000.
37. M. Steel and D. Penny. Parsimony, likelihood and the role of models in molecular phylogenetics. Mol. Biol. Evol., 17:839–850, 2000.
38. K. Strimmer and V. Moulton. Likelihood analysis of phylogenetic networks using directed
graphical models. Mol. Biol. Evol., 17(6):875–881, 2000.
39. A. von Haeseler and G. A. Churchill. Network models for sequence evolution. J. Mol. Evol.,
37:77–85, 1993.
A
A.1
Properties of the ML Criterion
Optimality and Identifiability
Proposition 1. ML is optimal under average–best–tiny problem while MP is not.
Proof. Let λ0 be some character (state assignment) on the leaves of the network N . Let
T0 ∈ T (N ) be the tree maximizing the probability of obtaining λ0 . That is Lav (λ0 |T0 , p) =
maxT ∈T (N ) Lav (λ0 |T, p). By definition of the average–best–tiny problem, ML will always choose T0 for a site Si exhibiting λ0 . By definition, λ0 will be mostly generated
ML of Phylogenetic Networks
13
by T0 . Since the above arguments are true for every state assignment λ0 , we obtain the
first part of the proposition.
It remains to show that MP fails to maximize the correct number of reconstructed
sites for even a single character. Consider a quartet tree ((1, 2), (3, 4)) with the pendant
edges leading to taxa 1 and 4 having probability 12 − ε and the pendant edges leading to
2 and 3 and the internal edge having probability ε. We now add a reticulation edge from
the pendant edge leading to 4 to the pendant edge leading to 1, whose two endpoints
coincide with the endpoints of the tree edges (i.e., between the taxa 4 and 1. See Fig. 4).
We denote the underlying tree by T and the tree obtained by the reticulation event by
T 0 . The proof proceeds along lines similar to the classical result of Felsenstein [13]. The
probability of obtaining a character on T is a sum of four expressions depending on the
internal assignment at the internal nodes. In particular, for the character ((0, 1), (1, 0))
we have at every such expression either the factor ( 21 − ε) or ( 12 + ε) for the pendant
edge leading to 1. This factor is absent in the ML expressions for T 0 . Therefore, we can
bound the ratio between the likelihood of obtaining ((0, 1), (1, 0)) on both trees by a
function of ε alone. Now, for every such ε we can find a reticulation event probability
ε2 such that the overall probability of obtaining ((0, 1), (1, 0)) on T is bigger than on
T 0 . Specifically, let P0 and P00 be the probabilities of obtaining λ0 given the trees T and
T 0 , respectively. Then we observe
P0 =
)|
2|I(T
X
i=0
1
+ (−1)i ε · ti
2
(6)
, where ti is the product of the appropriate probabilities on the tree edges (other than
|I(T )|
the pendant edge to 1), which stems from the internal assignment a ∈ {0, 1}2
.
Therefore,
)|
2|I(T
X
1
+ (−1)i ε · ti
P0 =
2
i=0
1
− ε P00 .
=
2
>
1
−ε
2
)|
2|I(T
X
ε · ti
i=0
(7)
Now, the probability of obtaining λ0 on T and T 0 given the network N is (1 −
ε2 )P0 and ε2 P00 respectively, where ε2 is the reticulation edge probability. We require
(1 − ε2 )P0 > ε2 P00 ; however, by Equation 7 it suffices to require
1
(1 − ε2 )
− ε P00 ≥ ε2 P00 .
(8)
2
A simple calculation leads to the conclusion that for every
ε2 ≤
1 − 2ε
3 − 2ε
the requirement holds.
We end by commenting that MP will always choose the reticulation edge and return
the tree ((1, 4), (3, 2)) (regardless of the probability ε2 ).
14
Jin et al.
2
3
1
4
Fig. 4. Although most ((0, 1), (1, 0)) characters will be produced on the underlying tree, MP will
choose the tree containing the reticulation edge.
Eventually, it can be shown that for any other location of the reticulation edge along
these two pendant edges, an even larger ε2 suffices. In order to show that MP fails to
satisfy network identifiability, we use the same network as in Fig. 4.
Observation 5 Let N be the network in Fig. 4 and let N 0 be the network obtained from
N by swapping taxa 2 and 4. Then for every character λ0 , the parsimony score on N
and N 0 is the same.
The above observation yields that MP cannot distinguish between N and N 0 even when
the number of sites is very large.
A.2
ML on Networks is NP-hard
Problem 1. 3-Satisfiability (3SAT) [16]
Input: A formula F over a set U of variables, collection C of clauses over U such that
each clause c ∈ C has |c| = 3.
Question: Is there a truth assignment for U that simultaneously satisfies all clauses in
C?
Given such a formula F we construct a network N (F ) as follows. In the sequel, we use
x is connected to y to indicate that x is a child of y.
1. A root R with a 1-leaf child.
2. For every variable we crate a node connected to a diamond loop as depicted in
Fig. 5.
to R
0
0
0
0
1
0
x
1
1
Fig. 5. For every variable we create a diamond gadget.
ML of Phylogenetic Networks
15
3. For every clause ci = (i∨j∨k) we generate a 1-leaf, called clause leaf, and connect
it by a tree edge to variable node i and by a reticulation edge to an intermediate
node c0i . Intermediate node c0i is connected by a tree edge to variable node j and
by a reticulation edge to an intermediate node c00i . Finally, intermediate node c00i is
connected by a tree edge to variable node k (see Fig. 6). The length of a (tree) edge
emanating from a variable node is 1 if the literal is negated and 0 otherwise.
x
y
z
0
0
1
0
0
1
Fig. 6. A clause gadget for the clause (x̄ ∨ y ∨ z). (complementing 1-leaves removed).
4. Finally, we create a complementing 1-leaf child connected by a 0 long edge to every
internal node with out-degree 1.
A complete network representing the formula (x̄ ∨ y ∨ z) ∧ (x ∨ ȳ ∨ w̄) (without the
complementing 1-leaves) is shown in Fig. 7.
R
0
0
0
0
0
diamond
gudget
x
diamond
gudget
y
1
0
1
0
diamond
gudget
z
diamond
gudget
w
0
1
0
0
0
0
1
1
1
Fig. 7. The reduction from 3-SAT to the tiny best ancestral likelihood. Each variable has a node to
which all literals of that variable are connected. In the figure we see the sub network representing
the formula (x̄ ∨ y ∨ z) ∧ (x ∨ ȳ ∨ w̄).
Observation 6 The network N (F ) is a valid phylogenetic network.
16
Jin et al.
Proof. We observe the following:
– There is a single internal node with in-degree 0.
– Every internal node has in-degree > 1.
– Every node with in-degree > 1 has exactly one entering tree edge and the rest are
0-length reticulation edges.
– The temporal constraint property holds.
Let S denote the leaf assignment under the reduction. Then we get the following
claim:
Claim. F has a satisfying assignment if and only if there is a tree T ∈ T(N ) and
internal assignment a ∈ {0, 1}r s.t. L(S|T, a, p) > 0.
Proof. We begin with some auxiliary observations.
Observation 7 For any tree T in the network and internal assignment a, L(S|T, a, p) >
0 if and only if the root R and all non variable nodes (nodes which are not assigned to
variables) must have internal assignment 1.
We first show that if F has a satisfying assignment, then there is a tree T ∈ T(N )
and internal assignment a ∈ {0, 1}r s.t. L(S|T, p) > 0. For every variable we assign
its value from the satisfying assignment. For every clause ci , we choose the path from
the 1-leaf representing ci to the literal satisfying it. Note that if that literal is negated,
then the edge emanating from it has substitution probability 1. Now, every variable is
connected to the root by either the 0 − 0 path in the diamond if the variable is assigned
1 or the 1 − 0 path otherwise. Finally, if we set the probability of every reticulation edge
to be positive, we get the desired result.
To prove the other direction, assume there exists a tree T ∈ T(N ) and internal assignment a ∈ {0, 1}r s.t. L(S|T, p) > 0. Then by Observation 7, all non-variable internal
nodes are forced to the value 1.
Observation 8 For every variable v in T , all emanating edges from v in T have the
same probability which is either 0 if v has assignment 1 or 1 otherwise.
Since every leaf must be connected to the root we get that every clause is satisfied.
We comment that in the final tree, few internal nodes may remain with out-degree
0 and hence disappear in the resulting tree. In addition, other internal nodes can remain
with out-degree 1 and contracted with the resulting obvious probability on the new
edge.
Theorem 9. The tiny best tree ancestral sequences of phylogenetic networks is NPhard.
Proof. It is easy to see that the problem is in NP, since given a tree T it is easy to check
if T ∈ T(N ) and subsequently calculate its likelihood. Additionally, the reduction of
Claim A.2 is performed in time polynomial in the size of F .
Corollary 1. The tiny best tree average likelihood of phylogenetic networks is NP-hard.
ML of Phylogenetic Networks
17
Proof. Using the same reduction as in Claim A.2, and Observation 7, we see that for a
tree to obtain likelihood greater than zero, all internal nodes are forced to 1. Therefore,
the only freedom is in the variable nodes and by Observation 8 each such assignment
defines a truth assignment to the variables in F .
By using similar arguments to Corollary 1 we also obtain the two following hardness
results:
Corollary 2. The tiny all trees ancestral likelihood and the tiny all trees average likelihood of phylogenetic networks are NP-hard.
All the above results set the complexity of the tiny versions of the network likelihood problems. The next reduction deals with the “small” problem where the network
is given but we seek to find the tree and the edge probabilities that maximize the likelihood over the whole trees of the network. For the reduction we use the same construct
we used in [23]. We also restrict the edge probabilities to some interval [0, p] for p < 1.
We prove the hardness of the small best tree ancestral ML problem by a reduction
from the Maximum 2-Satisfiability (max-2-sat) problem [16], which is formally defined
as follows.
Problem 2. Maximum 2-Satisfiability (max-2-sat)
Input: Set U of variables, collection C of clauses over U such that each clause c ∈ C
has |c| = 2, and a positive integer K ≤ |C|.
Question: Is there a truth assignment for U that simultaneously satisfies at least K of
the clauses in C?
We start with a lemma which will be used in our main proof. Let a “True-True” denote
a clause that has no negated literals, “True-False” denote a clause that has exactly one
negated literal, and “False-False” denote a clause in which both literals are negated. For
the “True-True” clause we generate the subnetwork shown in Fig. 8 on the right, For
the “True-False” clause we generate the subnetwork shown in Fig. 8 on the left and for
the “False-False” we generate the same network as for “True-True” but flip the value of
the leaves.
Lemma 1. [23]
(1) The minimal number of substitutions is 3 for a “True-True” network is obtained by
labelling x = 1, y = 1, or both. Otherwise,it is 4.
(2) The minimal number of substitutions is 3 for a “True-False” network is obtained by
labelling x = 0, y = 1, or both. Otherwise, it is 4.
(3) The minimal number of substitutions is 3 for a “False-False” network is obtained
by labelling x = 0, y = 0, or both. Otherwise, it is 4.
Given a formula F as input to max-2-sat, we create a node for every variable and
connect every variables to two ancestral nodes and these two to the root. In addition, for
every clause ci we generate the appropriate subnetwork as in Fig. 8 and connect it to
the variables involved. Fig. 9 shows a complete network generated for a specific clause.
Since this version of problem deals with the ancestral version where every internal
node is assigned a value of either 0 or 1, we get the following observation:
18
Jin et al.
Fig. 8. Part of the reduction from max-2-sat to small best tree ancestral likelihood.
Fig. 9. The complete network generated for the clauses (x ∨ w), (w ∨ ȳ), (x ∨ z̄).
Observation 10 At any optimal tree, the probability of every edge will be set to either
zero or p.
Observation 11 Unless all variables have the same value (0 or 1) there is exactly one
substitution on the subnetwork above the variable nodes.
Therefore, WLOG, we will assume the optimal assignment has both values.
Claim. Given a set of clauses C over variables X, as input to max-2-Sat, X has an
assignment which k clauses are satisfied, if and only if the network constructed has a
tree with likelihood p4|C|−k+1 .
Proof. ⇒ By Lemma 1 there are 4|C| − k substitutions at the networks below the variable nodes and by Observation 11 exactly one more above the variable nodes. Now, by
ML of Phylogenetic Networks
19
Observation 10 the edges on which a substitution occurs get probability p and the others
0. This yields the desired result.
The proof of the other direction proceeds similarly.
Theorem 12. The small best ancestral tree is NP-hard.
B
B.1
Empirical Performance: Further Analysis
Data: further details
The 15-taxon rbcL dataset of [9] consists of two Form I sequences from the α, β, and
γ-proteobacteria groups, two from cyanobacteria, one from green plastids, one from
red plastids, one cyanophora, and four Form II rubisco sequences. For this dataset, we
obtained the species (organismal) tree which was reported in [9, 4]. The species tree
is based 16S rRNA and other evidence. We analyzed the rubisco gene rbcL of these
15 organisms. The gene dataset consists of 15 aligned amino acid sequences, each of
length 532 (the alignment is available from
http://www.life.umd.edu/labs/delwiche/alignments/rbcLgb7-95.distrib.txt).
B.2
Analysis of the results in Fig. 3
(1) The two most significant HGT edges are H1 and H2 in Fig. 3(a), and they group the
form II rubisco (Thiobacillus denitrificans II and Hydrogenovibrio II) of the β and γ
proteobacteria, respectively, together with the form II rubisco (Rhodobacter capsulatus)
of the α proteobacteria. This result agrees with the grouping of these three organisms,
based on the form II rubisco, into one clade, as shown in Fig. 2 of [9]. (2) The third most
significant HGT edge is H3 in Fig. 3(a), and it indicates an HGT of the red type form I
rubisco from the α proteobacteria (Rhodobacter sphaeroides I) to the β proteobacteria
(Alcaligenes H16 plasmid). This result agrees with the grouping of all red type form I
rubisco genes in one clade in Fig. 2 of [9]. (3) The fourth most significant HGT edge is
H4 in Fig. 3(a), which completes the grouping of the form II rubisco genes, along with
H1 and H2, by indicating an HGT of the form II rubisco from Rhodobacter capsulatus to
Gonyaulax (an α proteobacteria and plastid, respectively). This result is also supported
by the single clade of all form II rubisco in Fig. 2 of [9]. (4) The fifth most significant HGT edge is H5 in Fig. 3(a), which indicates an HGT of the the form I rubisco
from the the red and brown plastids (Cyanidium) to the α proteobacteria (Rhodobacter
sphaeroides I). This HGT event is in agreement with the grouping of the red type form
I rubisco in Fig. 2 of [9]. (5) The last HGT edge is H6 in Fig. 3(a) which groups the
red and brown plastids with the α, β, and γ proteobacteria. Such grouping is supported
by the red type form I rubisco group in Fig. 2 of [9]. To summarize, the HGT edges
computed by our heuristic agree with the grouping of the organisms based on the forms
I and II rubisco, as hypothesized by Delwiche and Palmer.
We also compared our results to the results reported by Boc and Makarenkov [4],
they used a distance method for discovering HGT and analyzed a dataset similar to ours.
Three of our edges (H1 , H3 , H4 ) appeared in [4]. Two other edges (H2 , H5 ) appeared
also in [4], but in the opposite direction. One edge, H6 , appeared in our results but didn’t
20
Jin et al.
appear in the results of [4]. In general, our results are similar to the result reported by
Boc and Makarenkov, but simpler. Our results included 6 HGTs, while the solution of
Boc and Makarenkov included 8 HGTs.