Syst. Biol. 57(2):286–293, 2008
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150802044045
Phylogenetic Mixture Models Can Reduce Node-Density Artifacts
CHRIS VENDITTI, ANDREW MEADE, AND MARK PAGEL
School of Biological Sciences, University of Reading, Reading, RG6 6AJ, United Kingdom
Abstract.—We investigate the performance of phylogenetic mixture models in reducing a well-known and pervasive artifact
of phylogenetic inference known as the node-density effect, comparing them to partitioned analyses of the same data. The
node-density effect refers to the tendency for the amount of evolutionary change in longer branches of phylogenies to be
underestimated compared to that in regions of the tree where there are more nodes and thus branches are typically shorter.
Mixture models allow more than one model of sequence evolution to describe the sites in an alignment without prior
knowledge of the evolutionary processes that characterize the data or how they correspond to different sites. If multiple
evolutionary patterns are common in sequence evolution, mixture models may be capable of reducing node-density effects by
characterizing the evolutionary processes more accurately. In gene-sequence alignments simulated to have heterogeneous
patterns of evolution, we find that mixture models can reduce node-density effects to negligible levels or remove them
altogether, performing as well as partitioned analyses based on the known simulated patterns. The mixture models achieve
this without knowledge of the patterns that generated the data and even in some cases without specifying the full or true
model of sequence evolution known to underlie the data. The latter result is especially important in real applications, as
the true model of evolution is seldom known. We find the same patterns of results for two real data sets with evidence
of complex patterns of sequence evolution: mixture models substantially reduced node-density effects and returned better
likelihoods compared to partitioning models specifically fitted to these data. We suggest that the presence of more than one
pattern of evolution in the data is a common source of error in phylogenetic inference and that mixture models can often
detect these patterns even without prior knowledge of their presence in the data. Routine use of mixture models alongside
other approaches to phylogenetic inference may often reveal hidden or unexpected patterns of sequence evolution and
can improve phylogenetic inference. [Mixture models; molecular evolution; node-density effect; phylogeny reconstruction;
simulation.]
When the models of sequence evolution that are used
to infer phylogenetic trees misrepresent the true underlying processes that generated the data, phylogenetic inference can be misled and return incorrect or biased trees.
A pervasive artifact of phylogenetic inference that arises
from such model misspecification is the node-density effect (Fitch and Beintema, 1990; Fitch and Bruschi, 1987;
Venditti et al., 2006; Webster et al., 2003). Fitch and
Bruschi (1987) were the first to describe the effect, which
manifests as a positive association between the path
length, defined as the sum of all the branch lengths along
a path from the root to the tip of a phylogeny, and number of nodes along a path, plotted across all of the paths
in the tree. These authors speculated that the amount of
evolutionary change in longer branches of the tree may
often be underestimated, owing to the chance of “multiple hits” or more than one substitution occurring at a
site along the path that the branch describes. Other things
equal, as the number of nodes along a path increases, individual branch lengths will be shorter, and the amount
of evolution in each branch is better estimated. Summed
over branches this gives the impression of more total
evolution along paths with more nodes. In addition to
its effect on phylogenetic inference, the node-density artifact has the potential to confound comparative evolutionary studies, particularly those involving the use of
branch lengths to infer evolutionary rates, or studies in
which relative amounts of evolution along various paths
are important (e.g., Pagel et al., 2006; Xiang et al., 2004;
Webster et al., 2003).
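The multiple-hits argument above can be illustrated numerically. The sketch below is not part of the original analysis: it assumes the Jukes-Cantor (JC69) model and uses the uncorrected per-branch proportion of differing sites as a crude stand-in for an estimator that fails to account for multiple hits. The function names are hypothetical.

```python
import math

def jc69_observed_fraction(d):
    """Expected fraction of differing sites after d substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def apparent_path_length(total, n_branches):
    """Sum of uncorrected per-branch differences when a path of true length
    `total` is split into `n_branches` equal branches (an assumed layout)."""
    return n_branches * jc69_observed_fraction(total / n_branches)

# The same true path length looks longer when it is broken up by more nodes.
true_length = 1.0
for k in (1, 2, 4, 8):
    print(k, round(apparent_path_length(true_length, k), 3))
```

Under these assumptions the apparent length rises with the number of branches but never exceeds the true length, which is the pattern the node-density effect describes.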
Evolutionary changes may be underestimated because
of saturation or from misspecifying the model of sequence evolution. In the former, once more than one
substitution occurs per site along a branch, the evolutionary history begins to be lost. Model misspecification,
on the other hand, can underestimate the true amount
of evolution simply by failing to specify a model that
is complex enough to describe the evolutionary process.
For example, it may often be the case that different sites
in a gene-sequence alignment have evolved according
to qualitatively different evolutionary processes. Elsewhere we have called this “pattern heterogeneity” (Pagel
and Meade, 2004) to distinguish it from rate heterogeneity, in which sites simply differ in their rate but not their
pattern of evolution.
Investigators often attempt to account for heterogeneity in the patterns of evolution among sites by partitioning the data, assigning a different model of evolution to
different sites. Partitioning by codon position, by gene,
or by the stems and loops of ribosomal genes is a common approach. Although partitioning often leads to substantially improved fits of models to the data, several
studies have now shown that sites within a given partition often evolve as heterogeneously as sites between
partitions (Fitch and Beintema, 1990; Hickson et al., 1996;
Lartillot and Philippe, 2004; Pagel and Meade, 2004, 2005;
Ronquist et al., 2006; Simon et al., 2006). To the extent
this is true in general, we might expect partitioning approaches to suffer from node density and other artifacts
of phylogenetic inference.
Phylogenetic mixture models provide an alternative
to the partitioning approach. Mixture models allow each
site of a gene-sequence alignment to be characterized by
more than one model of evolution. In a conventional homogeneous model of sequence evolution, all sites in a
data alignment are assumed to arise from a single evolutionary process. In the case of nucleotide data, this process is represented by the familiar 4 × 4 matrix (or Q) of
transition rates among A, C, G, T. The likelihood of the
data is calculated as the product over sites of the individual probabilities of the data at each site:
$$P(D \mid Q, T) = \prod_{i} P(D_i \mid Q, T),$$
where the probability of the data D (a set of aligned sequences) is conditional upon the model of evolution Q
and the topology T, and the product is over the i sites in
the alignment.
The mixture model approach (Pagel and Meade, 2004,
2005) allows each site to be described by two or more
Q’s specifying different patterns and rates of substitution. Defining different Q matrices as $Q_1, Q_2, \ldots, Q_J$,
the probability of the data under the pattern heterogeneity mixture model can now be written as
$$P(D \mid Q_1, Q_2, \ldots, Q_J, T) = \prod_{i} \sum_{j} w_j P(D_i \mid Q_j, T),$$
where D and T are as above, and the summation over
j (1 ≤ j ≤ J) specifies that the likelihood of the data
at each site is summed over J separate rate or Q matrices. The separate Q matrices are weighted by the
$w_j$, where $w_1 + w_2 + \ldots + w_J = 1.0$. The mixture model
can also be combined with Yang’s (1994) popular rate-heterogeneity model (see Pagel and Meade, 2004, 2005).
This model is implemented within a Bayesian framework in the computer package BayesPhylogenies (available from www.evolution.reading.ac.uk), using uniform
prior distributions throughout except for an exponential
(with a mean of 10) prior distribution on branch lengths.
Lartillot and Philippe (2004) introduced a similar mixture model for amino acid evolution.
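As a minimal sketch of how the mixture likelihood is assembled, the function below takes per-site likelihoods that are assumed to have been computed already for each Q matrix on a fixed topology (in practice this requires a tree-pruning calculation, which is not shown). It is not the BayesPhylogenies implementation; the function name and the input layout are hypothetical.

```python
import math

def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a pattern-heterogeneity mixture.

    site_likelihoods[j][i] is P(D_i | Q_j, T) for matrix j and site i.
    weights[j] is w_j; the weights must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_sites = len(site_likelihoods[0])
    total = 0.0
    for i in range(n_sites):
        # Weighted sum over the J matrices at site i ...
        site = sum(w * lik[i] for w, lik in zip(weights, site_likelihoods))
        # ... then the product over sites, accumulated on the log scale.
        total += math.log(site)
    return total

# Made-up per-site likelihoods for two matrices and three sites.
example = [[0.20, 0.01, 0.30], [0.02, 0.25, 0.28]]
print(mixture_log_likelihood(example, [0.5, 0.5]))
```

With a single matrix and a weight of 1.0 the function reduces to the conventional homogeneous likelihood, the sum over sites of the log site likelihoods.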
Typically sites in the alignment will find their best description under one of the models of evolution, but the
best model will be different among sites. An attractive
feature of mixture models is that they do not require a priori assignment of each site to a particular model or of the
model’s prior probability. Instead, the mixture model approach automatically identifies the site patterns from the
variation in the data and estimates their weights. There
is a growing weight of evidence that suggests that mixture models can often better characterize sequence evolution, resulting in improved likelihood scores, topological
changes, increased tree length, and reduced long-branch
attraction (Lartillot and Philippe, 2004; Lewis et al., 2006;
Pagel and Meade, 2004, 2005; Philippe et al., 2005; Simon
et al., 2006).
Our interest here is to investigate pattern heterogeneity as a source of node-density artifacts in phylogenetic
inference and whether mixture models can detect it well
enough to reduce or even remove their effects altogether,
even without prior knowledge of the true patterns in the
data. Of particular interest is to discover whether node-density effects can be effectively removed even when the
model of sequence evolution is “incomplete”; that is, not
a full description of the model used to generate the data.
This question is important because researchers will seldom know what the true model of evolution is, and yet
they may still wish to use inferred trees to investigate
historical events that rely on accurate branch length information, such as ancestral states and adaptive trends
(e.g., Organ et al., 2007), or to retrieve dates from molecular clocks.
We use simulated and real data to answer these
questions. The simulations are designed to identify the
strength and degree of node-density effects that arise
when pattern heterogeneity is incorrectly characterized
and whether mixture models of increasing complexity
can eliminate or reduce node-density artifacts. We make
no attempt to simulate the specific evolutionary patterns
that might be expected from, for example, codon-based
models, models of secondary structure, or models with
invariable sites. We expect that the effects of poorly characterizing the variation or patterns that these models give
rise to will be of a similar nature to the effects we observe
in our simulations, even if the precise details differ from
one combination of simulation parameters to another.
The simulations provide a best-case scenario that can be
used as a benchmark against which to compare the performance of mixture models in real data. Accordingly,
we also analyze two real data sets assumed to harbor
complex patterns of sequence evolution.
SIMULATED DATA SETS
Phylogeny
We require a phylogeny and sequence data simulated to have pattern heterogeneity. We used PhyloGen
(Rambaut, 2002) with the extinction rate set to half the
speciation rate (birth = 0.2, death = 0.1) to produce a random ultrametric tree of 50 species. We added
an “artificial” outgroup to the tree to ensure that, at the
phylogenetic inference step, all branches leading to the
root were estimated properly. We use a single tree because our goal is to identify what effects may arise rather
than to establish generality. Elsewhere (Venditti et al.,
2006), we have shown that the node-density artifact can
arise in almost any topology for a simulated tree of this
size.
Pattern Heterogeneity in Simulated Gene-Sequence Data
We used Seq-Gen (Rambaut and Grassly, 1997) to simulate four gene-sequence alignments of 5000 sites each,
using the tree described above. To produce pattern heterogeneity in a simulated sequence, we drew 1000 sites
each from five different general time reversible (GTR)
rate matrices. It is the qualitative differences among the
rate parameters in the successive GTR matrices that produce pattern heterogeneity, and we varied the degree of
this heterogeneity among the four simulated alignments.
Thus we presume that each site was generated by a single process but that different sites derive from different
processes.
TABLE 1. The categories assigned to the rates of the GTR matrices used to generate the simulated alignments. The categories indicate the uniform distribution from which the rate was drawn. How variable the alignment was depended on the number of categories used to generate the alignment (see text). The table shows which categories were used in each data set.

Uniform interval from     One-interval   Two-interval   Three-interval   Four-interval
which rate was drawn      data set       data set       data set         data set
0–0.1                     -              -              √                √
0–1                       -              √              √                √
0–10                      √              √              √                √
0–100                     -              -              -                √

For the least variable alignment, we drew the rate parameters for each of the five GTR matrices, denoted by
Q, at random from the same underlying uniform interval (0–10). We call this the one-interval data set. Pattern
heterogeneity can arise among the sites in this alignment
even though all of the rate parameters are drawn from
the same distribution. For example, the A ↔ C rate might
be greater than the A ↔ T rate in one matrix but smaller
in another. For the two-interval data set, we first randomly
assigned each rate parameter in each of the five matrices
to one of two uniform intervals (0–1 or 0–10) and then
drew at random a value from the appropriate interval
for each rate parameter. For the three-interval data set, we
assigned each rate in the GTR matrices to one of three
uniform intervals (0–0.1 or 0–1 or 0–10) and then drew
at random a value from that interval. We carried out the
same procedure for the most variable four-interval data set,
except each rate was assigned to one of four uniform intervals (0–0.1 or 0–1 or 0–10 or 0–100). Different uniform
intervals and/or different numbers of categories of intervals would lead to different results. Table 1 shows which
categories were used in each of the simulated alignments.
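The sampling scheme for the rate parameters can be sketched as follows. This is an illustration rather than the procedure actually run: it assumes that all six GTR exchangeability rates were drawn independently, that intervals were chosen with equal probability, and it says nothing about how the values were passed to Seq-Gen. The names below are hypothetical.

```python
import random

# Upper bounds of the uniform intervals used for each data set (Table 1).
INTERVALS = {
    1: [10.0],
    2: [1.0, 10.0],
    3: [0.1, 1.0, 10.0],
    4: [0.1, 1.0, 10.0, 100.0],
}
RATE_NAMES = ("AC", "AG", "AT", "CG", "CT", "GT")

def draw_gtr_rates(n_intervals, rng):
    """Assign each rate to a random interval, then draw a value from it."""
    rates = {}
    for name in RATE_NAMES:
        upper = rng.choice(INTERVALS[n_intervals])
        rates[name] = rng.uniform(0.0, upper)
    return rates

def draw_matrices(n_intervals, n_matrices=5, seed=0):
    """Draw the rate sets for the five matrices behind one simulated alignment."""
    rng = random.Random(seed)
    return [draw_gtr_rates(n_intervals, rng) for _ in range(n_matrices)]

for rates in draw_matrices(4):
    print({name: round(value, 3) for name, value in rates.items()})
```

Each of the five rate sets would then generate 1000 sites of the 5000-site alignment.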
Phylogenetic Reconstruction
We used the Bayesian mixture model described in
Pagel and Meade (2004, 2005) to produce posterior samples of phylogenetic trees from each of the simulated
alignments. The mixture model calculates the likelihood
of the data by summing the likelihood at each site over
more than one model of sequence evolution, without
prior partitioning of the data. We identified a model of
sequence evolution by Q. For each data set we inferred
trees using five different models, ranging from a simple 1Q model (the conventional nonmixture or homogeneous GTR model) through to mixture models with two,
three, four, and five Q’s. We did not specify the (known)
values of the rate coefficients or the weights in advance,
but rather estimated them from the data. Our expectation
is that when the model is underspecified (fewer than five
Qs), the inferred trees will suffer from node-density effects, but that these will diminish as the models become
more complex.
We also analyzed each data set using a Bayesian
reversible-jump (Green, 1995) implementation of the
Pagel and Meade (2004, 2005) mixture model to determine how many different rate matrices were required
to explain the data (implemented in the BayesPhylogenies
package, available from www.evolution.reading.ac.uk).
We used a uniform prior distribution on Q’s; although
a Dirichlet prior process gave the same answers, it converged more slowly. The reversible-jump procedure automatically moves among Markov chains with different
numbers of rate matrices, and at convergence estimates
the posterior support for these different chains (there is
no a priori limit to the number of matrices that can be estimated). Even though we generated the alignments from
five distinct rate matrices, the reversible-jump model will
reveal whether these leave sufficiently distinct patterns
in the data to be included in the inference model. To compare the mixture models to the true model, we estimated
posterior samples of trees for each alignment based on
partitioning according to the known pattern heterogeneity in the data (corresponding to perfect prior knowledge
of sites).
For each simulated alignment, we ran several independent Markov chains (at least 5,000,000 iterations) to check that they moved
to the same region of tree space. We report results from a posterior sample of 1000
trees sampled at wide intervals (every 10,000 iterations) from
a single chain for each of our analyses.
Testing for the Node-Density Artifact
We used the delta test (Pagel et al., 2006; Venditti et al.,
2006; Webster et al., 2003) to analyze each tree in each posterior sample for evidence of the node-density effect. The
delta test examines the form of the relationship between
path lengths and nodes in a regression model of the form
$x = \beta n^{1/\delta}$, where x is the path length, n is the number of
nodes, β measures the strength of the effect, and δ is the
parameter controlling the curvature of the relationship.
This equation is fitted in a fully phylogenetic context controlling for nonindependence among the species in the
tree that arises from their shared phylogenetic histories
(see Pagel, 1997, 1999). Fitting the delta test model requires a rooted tree and so each of the inferred trees was
rooted using the artificial outgroup. We removed this
outgroup before any further analysis (see Venditti et al.,
2006).
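The two quantities that enter the delta test, root-to-tip path length and the number of nodes along each path, can be extracted from a rooted tree as in the sketch below. The tuple-based tree representation and the convention of counting every internal node on the path, including the root, are assumptions made for illustration and may differ from the convention used in the published test.

```python
def root_to_tip_paths(node, length=0.0, nodes=0, out=None):
    """Return {tip_name: (path_length, n_nodes)} for a rooted tree.

    A node is a tuple (name, branch_length, children); tips have no children.
    """
    if out is None:
        out = {}
    name, branch_length, children = node
    length += branch_length
    if not children:
        out[name] = (length, nodes)
    else:
        for child in children:
            # Passing through an internal node adds one node to the path.
            root_to_tip_paths(child, length, nodes + 1, out)
    return out

# A small ultrametric example tree (hypothetical branch lengths).
tree = ("root", 0.0, [
    ("A", 0.30, []),
    ("n1", 0.10, [
        ("B", 0.20, []),
        ("n2", 0.05, [("C", 0.15, []), ("D", 0.15, [])]),
    ]),
])
for tip, (path_length, n_nodes) in sorted(root_to_tip_paths(tree).items()):
    print(tip, round(path_length, 3), n_nodes)
```

In an ultrametric tree such as this one all path lengths are equal, so any association between path length and node count in a tree inferred from data simulated on it reflects the artifact.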
As the true tree used in the simulations was ultrametric, any association between path length and nodes (i.e.,
any β significantly >0) can be attributed to the artifact. In
real data, a positive association between path length and
nodes can arise for reasons other than the artifact (Pagel
et al., 2006; Webster et al., 2003; Xiang et al., 2004), and so
for this reason it is necessary to distinguish a real association from the relationship caused by the node-density
effect. This is achieved by simultaneously estimating the
parameter δ in the equation above. Any value of δ > 1 in
conjunction with a significant regression coefficient implies that path length increases at a decreasing rate as
number of nodes continues to increase and is considered
evidence for the node-density artifact. This method has
been shown to detect the artifact in over 95% of phylogenies where it is present (Venditti et al., 2006). A more
detailed description of the delta test can be found in Venditti et al. (2006) and Webster et al. (2003). Trees can be
submitted online to test for the node-density artifact at
www.evolution.reading.ac.uk.
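To make the role of δ concrete, the curve x = βn^(1/δ) can be fitted by ordinary least squares on the log scale, since log x = log β + (1/δ) log n. The sketch below is a deliberately simplified stand-in: it ignores the phylogenetic correction for nonindependence and the significance test on β that the delta test relies on, so it should not be read as the published method. The function name is hypothetical.

```python
import math

def fit_delta(nodes, path_lengths):
    """Estimate (beta, delta) in x = beta * n**(1/delta) by log-log regression.

    Assumes all node counts and path lengths are positive and that the
    fitted slope is nonzero. No phylogenetic correction is applied.
    """
    lx = [math.log(n) for n in nodes]
    ly = [math.log(x) for x in path_lengths]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx              # slope estimates 1/delta
    beta = math.exp(my - slope * mx)
    delta = 1.0 / slope
    return beta, delta

# Synthetic paths generated with beta = 0.5 and delta = 2 (a curved relationship).
n = [1, 2, 4, 8, 16]
x = [0.5 * k ** 0.5 for k in n]
beta, delta = fit_delta(n, x)
print(round(beta, 3), round(delta, 3))
```

A fitted δ above 1 means that path length grows more and more slowly as nodes are added, which is the signature treated as evidence of the artifact.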
REAL DATA SETS
We analyzed two published nucleotide sequence
alignments (Jansa et al., 2006, and Wahlberg, 2006), inferring Bayesian posterior samples of trees from the models of sequence evolution reported in the original papers and from mixture models we estimated from the
data. It is not our intent in reanalyzing these data to
suggest different phylogenetic hypotheses from those
these authors report but to ask whether a mixture model
approach can reduce node-density artifacts. Jansa et al.
(2006) used the mitochondrial cytochrome b gene and the
nuclear-encoded IRBP exon 1 to infer the phylogenetic
relationships among Philippine murine rodents. These
authors analyzed their data with a single substitution
model and a partitioned model, although here we report
only the latter as it returned substantially better likelihood scores. The partitioned model assigned a different
1Q+Γ+I model to each gene. Wahlberg (2006) studied
the phylogeny of the butterfly subfamily Nymphalinae
using two nuclear genes (elongation factor 1-alpha and
wingless) and one mitochondrial gene (cytochrome oxidase subunit I). Wahlberg also partitioned the data, estimating a
different 1Q+Γ+I model for each gene.
Phylogenetic Reconstruction
We ran a number of independent Markov chains to
check that the chain moved to the same region of tree
space for each analysis. We report results for each of our
analyses from a posterior sample of 1000 trees sampled
at wide intervals (10,000 iterations) from a single chain.
To perform the partitioned analyses, we partitioned the
data and applied models to the partitions following the
original authors’ specifications and generated posterior
samples of trees from the Markov chain. For the mixture
model analyses we used the Bayesian reversible-jump
implementation of the Pagel and Meade (2004, 2005) mixture model to determine how many different rate matrices were required. Rate heterogeneity is included in the
mixture model using Yang’s (1994) discrete-gamma rate
heterogeneity model (Γ).
In both real data sets the reversible-jump model found
that five rate matrices were required to explain the data,
and we generated posterior samples from these models.
We then re-ran the mixture model but restricted it to one
fewer Q each time until we reached the simple 1Q model
without Γ. All phylogenies were rooted with the outgroup specified in the original paper. The outgroup was
then removed before any further analysis.
RESULTS
Simulated Data
Each of the four simulated alignments contains five
different patterns of site evolution. We expect the likelihood of the data to improve with number of Q’s fitted
in the mixture model. Figure 1a shows this to be the case
and, as expected, the data set with the most variation,
the four-interval data set, records the greatest improvement. Less variable data sets return a smaller overall improvement as well as a smaller improvement with each
additional Q. However, the range of the y-axis masks the
numerical improvement in the likelihoods. Even in the
least variable data set the 5Q model improves the likelihood by 2247 log units. We observe a similar pattern
in the tree lengths (Fig. 1b), with the greatest increase in
tree length corresponding to the most variable data sets.
Figure 1c shows that the percentage of trees in our
posterior samples that have the node-density artifact declines as number of Q’s in the mixture model is increased.
The rate and magnitude of the reduction in node-density
effects depends on the variability of the data set, but in
all cases the node-density artifact is reduced to a negligible level using between three and five Q’s. For the one- and three-interval data sets, node-density effects fall to
nearly zero using just a 3Q model of sequence evolution.
This shows that node-density effects can all but disappear even when the estimated model is less complex than
the true model of evolution.
The percentage of trees showing node-density effects
when using the 5Q mixture model is 0.2, 1.05, 0.4, and
2.6 for the one-, two-, three-, and four-interval data sets,
respectively (see Table 2). This is comparable to the true
partitioned model, which returns 0.0%, 0.6%, 0.4%, and
3.0% node-density effects for the one-, two-, three-, and
four-interval data sets, respectively (see Table 2). Table 2
also shows the percentage of trees with node-density effects obtained from applying the reversible-jump model.
These analyses showed that a 4Q model was adequate to
explain the data for the one- and two-interval data sets,
whereas the three- and four-interval data sets required a
5Q model.

FIGURE 1. Filled circles indicate the results from the trees inferred from the simulated four-interval data set, open circles indicate the three-interval data set, filled squares the two-interval data set, and the open squares the one-interval data set (see text and Table 1). (a) The mean
likelihood value for each data set as model complexity increases. The range of the y-axis masks the numerical improvement in the likelihoods.
Even in the least variable data set the 5Q model improves the likelihood by 2247.3 log units. (b) The tree length (expected nucleotide substitutions
per site) for each of the data sets. (c) The percentage of trees with β significantly > 0 and δ > 1 falls as Q’s are added to the model.

TABLE 2. The percentage of trees in the posterior sample showing
node-density effects in the partitioned, reversible-jump, and the 5Q
models. The brackets indicate the number of Q’s (rate matrices) as
indicated by the reversible-jump analyses.

                          5Q mixture   Reversible-jump   5Q partition
                          model        mixture model     model
One-interval data set     0.2          0.2 (4Q)          0.0
Two-interval data set     1.05         3.6 (4Q)          0.6
Three-interval data set   0.4          0.4 (5Q)          0.4
Four-interval data set    2.6          2.6 (5Q)          3.0
Real Data
Jansa et al. (2006) data set.—The upper panel of Table 3 reports the mean likelihood score and tree lengths
over the posterior sample of trees as derived from the
partitioned model used in the original paper and from
the mixture model as derived from the reversible-jump
approach. The partitioned model estimates a separate
1Q+Γ+I model for each gene and yields a mean likelihood score of −39,424.8, a mean tree length of 10.1,
and 32.7% of the trees have the node-density artifact.
The reversible-jump mixture model found that a 5Q+Γ
model was required for these data. This improved the
likelihood over the partitioned analysis (log Bayes factor
test of the harmonic means [mixture model versus partition model] = 1880.5; a value >10 is considered “strong
evidence” for a model; Raftery, 1996) and returned
fewer than one-third the number of node-density effects
(Table 3).

TABLE 3. Results from applying the reversible-jump mixture model and the partitioned models to real data sets. The upper panel
Jansa et al. (2006) data set, lower panel Wahlberg (2006) data set (see
text). The mixture model improves the likelihood and reduces the percentage of trees with the node-density artifact. The conventional partitioning model results in longer trees (see text and Appendix 1).

                                                    5Q+Γ        Partition model
Jansa et al. (2006)
  Number of parameters                              34          20
  Mean −lnL                                         −38485.5    −39424.8
  Harmonic mean −lnL                                −38509.1    −39449.3
  Ln Bayes factor                                   1880.5
  Tree length                                       9.3         10.1
  Percent of trees with the node-density artifact   10.4        32.7
Wahlberg (2006)
  Number of parameters                              34          30
  Mean −lnL                                         −34198.8    −34942.5
  Harmonic mean −lnL                                −34222.4    −34965.2
  Ln Bayes factor                                   1485.6
  Tree length                                       4.1         4.7
  Percent of trees with the node-density artifact   4           20.6
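The text does not spell out how the log Bayes factor is computed from the harmonic means. A plausible reading, assumed here, is twice the difference between the harmonic-mean log-likelihoods of the two models; this reproduces the values in Table 3 to within rounding. The function name is hypothetical.

```python
def log_bayes_factor(harmonic_lnl_a, harmonic_lnl_b):
    """Twice the difference in harmonic-mean log-likelihoods (model A versus B).

    This formula is an assumption inferred from the reported values.
    """
    return 2.0 * (harmonic_lnl_a - harmonic_lnl_b)

# Harmonic-mean log-likelihoods from Table 3: mixture model versus partition model.
jansa = log_bayes_factor(-38509.1, -39449.3)
wahlberg = log_bayes_factor(-34222.4, -34965.2)
print(round(jansa, 1), round(wahlberg, 1))
```

On this scale both comparisons far exceed the threshold of 10 that the text cites as strong evidence for a model.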
The partitioned model tree is longer than the mixture-model tree, an effect we speculate arises as an artifact of
the invariable sites model. In Appendix 1 we show how
varying the proportion of invariable sites in this model alters the tree lengths we obtain from these data. This may
occur because the length of the tree does not influence the
likelihood of a site evaluated under the invariable component of this model. As more sites are assigned to this
category, the tree length can increase without affecting
the likelihood.
From the perspective of the mixture model, the invariable sites model should emerge naturally as one of the Q
matrices in the mixture model if this site pattern is truly
present in the data. In fact, what we normally observe is
that a matrix of very slow rates emerges, but not an invariable matrix, in which all rates are zero. This accords
with intuition that many sites that are invariant in the
data are not in fact invariable in the sense of incapable
of changing. Characterizing them with a “slow” matrix
will not produce the tree-length bias we think is inherent to the invariable sites model. We observe a matrix of
slow rates in both data sets investigated here (one Q
matrix has mean rates that are on average only 6.6% and
1.1% of the nearest other slow matrix in the Jansa et al.
and Wahlberg data sets, respectively).
Figure 2a reports the mean likelihood scores for mixture models applied to the Jansa et al. data, showing
how each additional Q in the mixture model improves
the likelihood (the results for the partitioned model are
plotted on the right of the figure). The mean tree lengths (Fig.
2b) increase with model complexity, although the majority of the increase is between the 1Q and 1Q+Γ models.
In some cases, trees get shorter (for example, 2Q+Γ in
Fig. 2b) despite an improved likelihood, and we find that
these are normally associated with topological shifts in
the trees (e.g., Pagel and Meade, 2005).
Figure 2c plots the percentage of trees in which β was
significantly greater than zero and separately the percentage of trees in which β was significantly greater
than zero and δ > 1. The former measures the percentage of trees with a positive relationship between nodes
and path length and the latter measures the percentage
of trees with the node-density artifact. If all the trees
with a positive association have the node-density effects,
then these numbers will be the same. Both percentages
are very high for the simplest models, indicating a high
percentage of node-density effects, but both decline to
∼10% by the 3Q+Γ model, indicating that most node-density effects have been removed. By comparison, the
partitioned model returns β significantly greater than zero
and δ greater than one 32.7% of the time, indicating that
roughly a third of the trees have node-density effects.
Wahlberg (2006) data set.—The lower panel of Table
3 tells a similar story for the Wahlberg alignment. The
reversible-jump mixture model settled on a 5Q+Γ solution, yielding a better likelihood than the partitioned
analysis (log Bayes factor test of harmonic mean =
1485.6) and about 80% fewer node-density effects. The
mixture model tree is shorter, which we think again is an
artifact of the invariable sites model (see Appendix 1).

FIGURE 2. Results of the analysis of the real data. Panels a, b, and c correspond to the Jansa et al. (2006) data set, panels d, e, and f to the
Wahlberg (2006) data set. (a) and (d) The likelihood values of the trees inferred for the mixture model analyses (filled circles) and the partitioned
model (filled square). (b) and (e) Tree length for each of the mixture model analyses (filled circles) and the original partition model (filled square).
(c) and (f) The percentage of trees with β significantly > 0 (filled circles for the mixture model analyses and filled square for the partition model) and the
percentage with β significantly > 0 and δ > 1 (open circles for the mixture model analyses and open square for the partition model) for the models
analyzed.
As with the Jansa et al. (2006) alignment, we observe
a general increase in the likelihood and in tree length
for the Wahlberg data (Fig. 2d) as model complexity increases, with fluctuations in tree length (Fig. 2e). However, in contrast to the Jansa et al. (2006) data set, the
percentage of trees in the Wahlberg sample that retain a
significant positive relationship between nodes and path
length (i.e., β significantly > 0) only falls slightly with
model complexity (Fig. 2f); at the same time, the percentage showing the node-density effect (β significantly
> 0 and δ > 1) declines rapidly to the final value of 4%.
This shows that there is a relationship between nodes
and path length in the Wahlberg tree that is independent
of node-density effects. We cannot say why this relationship arises here, although in other work (Pagel et al.,
2006) we have shown that this pattern can sometimes
be interpreted as evidence for punctuational episodes of
molecular evolution associated with speciation.
DISCUSSION
Our results show that pattern heterogeneity in gene
sequence alignments can cause significant node-density
artifacts in inferred trees. Mixture models can substantially reduce or even remove these artifacts, despite the
fact that the model is not based upon any prior knowledge of the genes or the patterns of evolution that exist in the data. Encouragingly, the mixture models can
do this even when the model is an incomplete description of the data, a result we observe in both the real and
simulated data sets. Hugall and Lee (2007) recently suggested that currently available model selection and optimization procedures are not sufficient to characterize
evolution adequately enough to reduce or remove node-density artifacts. This led them to question some of the
trees we have found to be free of these effects (Webster
et al., 2003), in effect asserting that they must suffer from
node-density artifacts. However, our finding that it is
possible to reduce to negligible levels or even remove
node-density effects altogether with an incomplete mixture model shows this worry to be overstated (see also
Venditti and Pagel, 2008).
Part of the mixture model’s success seems to derive
from an ability to find patterns of sequence evolution
that are not detected in partitioned analyses of the same
data. Partitioned analyses of gene-sequence data make
sense in principle, but at least in the two real data sets
we studied here, the mixture models lead to substantial
improvements in the likelihood of the data and greatly
reduced node-density effects. Anecdotally, we have used mixture models to analyze the more than 120 well-sampled
alignments reported in a previous study (Pagel et al.,
2006). The majority of these require more than one model
of sequence evolution. In 12 of these data sets, the original authors reported their partitioning strategies and
the likelihoods of their data in sufficient detail for us to
make comparisons to the mixture model. In all but one
case, the mixture model improves on these partitioned
analyses. In that one case, the analysis relies upon 42 partitions. These results suggest that pattern heterogeneity
is widespread and that mixture models provide an attractive approach to detect it.
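The contrast between the two approaches can be written out explicitly. The notation is introduced here for illustration and is not taken from the text: D_i is the data at site i, T the tree, Q_j the j-th rate matrix with weight w_j, and S_k the set of sites assigned in advance to partition k.

```latex
% Partitioned analysis: each site is assigned a priori to exactly one partition
L_{\text{partition}} = \prod_{k=1}^{K} \prod_{i \in S_k} P(D_i \mid Q_k, T)

% Mixture model: every site is evaluated under every matrix, weighted by w_j
L_{\text{mixture}} = \prod_{i} \sum_{j=1}^{J} w_j \, P(D_i \mid Q_j, T),
\qquad \sum_{j=1}^{J} w_j = 1
```

Because the sum runs over all matrices at every site, the mixture model needs no prior assignment of sites to models.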
We have not explicitly simulated or investigated the
patterns of site evolution that might be expected of
codon-based models (e.g., Goldman and Yang, 1994;
Muse and Gaut, 1994; Yang and Nielsen, 2000), of models that presume that evolution is constrained by correlations among sites that arise from secondary structure
(e.g., Hudelot et al., 2003; Telford et al., 2005), or of
other models that incorporate selection or at least nonrandom patterns of nucleotide substitutions. These models are, like the pattern heterogeneity mixture model,
homogeneous in applying the same model of evolution
throughout the tree. Our expectation then is that they will
in general each produce their own characteristic patterns
of site evolution and that mixture models will be able to
detect them. In accord with this expectation, we have
shown elsewhere (Pagel and Meade, 2004) using a mixture model approach applied to ribosomal data that stem
and loop sites cannot be easily assigned to different evolutionary models: many stem sites evolve like loops and
vice versa. Equally, we and others (Bofkin and Goldman,
2007; Pagel and Meade, 2004) have shown that there is often as much variation in the evolutionary patterns within
codon positions as there is between.
The relatively poor performance of the partitioned
models suggests that investigators’ hunches about the
way gene sequences evolve are often not upheld in the
real world. Our analysis of the Wahlberg data shows
how a potentially interesting trend in the data can
be missed by partitioned analysis. Where differences
among conventional partitions do exist, mixture models will find them anyway. At the very least, then, we
suggest that mixture models should become a routine
component of the phylogeneticist’s armory, fitted alongside more conventional models, and become part of the
standard model testing and selection procedure. Where
mixture models improve upon these other approaches
they should be used. If this is shown generally to be the
case, real computational benefits could emerge: Felsenstein (2004) has calculated that a codon-based model
is 3547 times more computationally intensive than a nucleotide model!
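One way to arrive at a figure of this size, offered here as an assumption rather than as Felsenstein's stated derivation, is to suppose that the dominant matrix operations scale with the cube of the number of character states, and to compare the 61 sense codons of the standard genetic code with the 4 nucleotides. The function name below is hypothetical.

```python
def relative_cost(n_states_a, n_states_b, exponent=3):
    """Ratio of costs if cost grows as n_states ** exponent (cubic is assumed)."""
    return (n_states_a / n_states_b) ** exponent

codon_states = 61       # sense codons in the standard genetic code
nucleotide_states = 4   # A, C, G, T
ratio = relative_cost(codon_states, nucleotide_states)
print(round(ratio))     # 3547
```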
Nevertheless, we do not suggest that mixture models
offer a panacea for gene-sequence analysis. For example,
we should not necessarily expect a pattern heterogeneity
mixture model to detect the variability in gene sequence
data that arises from nonhomogeneous processes, such
as nonstationarity and heterotachy. The former arises
from, for example, directional tendencies in GC content
among lineages, whereas heterotachy refers to the phenomenon of sites evolving at different rates in different
regions of the tree. Neither is commonly accounted for
in phylogenetic studies, and both may bias inferences
(Lake, 1994; Lockhart et al., 1994, 2006; Lopez et al.,
2002; Mooers and Holmes, 2000). Processes such as
these may account for the 4% to 10% of trees in the real
data sets that suffer from node-density artifacts (see
Table 3). If researchers suspect that nonhomogeneous
processes have operated in their data, then they should
apply models specifically designed for these processes
rather than ones that assume homogeneous evolution
throughout the tree.
The uses of phylogenies extend far beyond simply describing how organisms are related to each other. Many
evolutionary comparative studies, including those analyzing evolutionary rates, making ancestral reconstructions, or attempting to date divergence times, rely
on the accurate reconstruction of branch lengths. Our results
show that mixture models often make it possible to characterize and interpret complex signals that exist in molecular sequence data and that are invisible to many conventional models, reducing artifacts and producing trees
with accurately estimated branch lengths.
ACKNOWLEDGMENTS
This work was supported by grant NE/C51992X/1 from the
Natural Environment Research Council, United Kingdom, to M.P.
REFERENCES
Bofkin, L., and N. Goldman. 2007. Variation in evolutionary processes
at different codon positions. Mol. Biol. Evol. 24:513–521.
Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Fitch, W. M., and J. J. Beintema. 1990. Correcting parsimonious trees
for unseen nucleotide substitutions: The effect of dense branching as
exemplified by ribonuclease. Mol. Biol. Evol. 7:438–43.
Fitch, W. M., and M. Bruschi. 1987. The evolution of prokaryotic
ferredoxins—With a general method correcting for unobserved substitutions in less branched lineages. Mol. Biol. Evol. 4:381–94.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide
substitution for protein-coding DNA sequences. Mol. Biol. Evol.
11:725–736.
Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732.
Hickson, R. E., C. Simon, A. Cooper, G. S. Spicer, J. Sullivan, and D.
Penny. 1996. Conserved sequence motifs, alignment, and secondary
structure for the third domain of animal 12S rRNA. Mol. Biol. Evol.
13:150–169.
Hudelot, C., V. Gowri-Shankar, H. Jow, M. Rattray, and P. G. Higgs.
2003. RNA-based phylogenetic methods: Application to mammalian
mitochondrial RNA sequences. Mol. Phylogenet. Evol. 28:241–
252.
Hugall, A. F., and M. S. Y. Lee. 2007. The likelihood node density effect and consequences for evolutionary studies of molecular rates.
Evolution 61:2293–2307.
Jansa, S. A., F. K. Barker, and L. R. Heaney. 2006. The pattern and timing of diversification of Philippine endemic rodents: Evidence from
mitochondrial and nuclear gene sequences. Syst. Biol. 55:73–88.
Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and
protein sequences: Paralinear distances. Proc. Natl Acad. Sci. USA
91:1455–1459.
Lartillot, N., and H. Philippe. 2004. A Bayesian mixture model for
across-site heterogeneities in the amino-acid replacement process.
Mol. Biol. Evol. 21:1095–1099.
Lewis, R. L., A. T. Beckenbach, and A. O. Mooers. 2006. The phylogeny
of the subgroups within the melanogaster species group: Likelihood
tests on COI and COII sequences and a Bayesian estimate of phylogeny. Mol. Biol. Evol. 37:15–24.
Lockhart, P., P. Novis, B. G. Milligan, J. Riden, A. Rambaut, and T.
Larkum. 2006. Heterotachy and tree building: A case study with
plastids and eubacteria. Mol. Biol. Evol. 23:40–45.
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence
evolution. Mol. Biol. Evol. 11:605–612.
Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important
process of protein evolution. Mol. Biol. Evol. 19:1–7.
Mooers, A., and E. C. Holmes. 2000. The evolution of base composition
and phylogenetic inference. Trends Ecol. Evol. 15:356–365.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates,
with application to the chloroplast genome. Mol. Biol. Evol. 11:715–
724.
Organ, C. L., A. M. Shedlock, A. Meade, M. Pagel, and S. V. Edwards.
2007. Origin of avian genome size and structure in non-avian dinosaurs. Nature 446:180–4.
Pagel, M. 1997. Inferring evolutionary processes from phylogenies.
Zool. Scripta 26:331–348.
Pagel, M. 1999. Inferring the historical patterns of biological evolution.
Nature 401:877–84.
Pagel, M., and A. Meade. 2004. A phylogenetic mixture model for
detecting pattern-heterogeneity in gene sequence or character-state
data. Syst. Biol. 53:571–81.
Pagel, M., and A. Meade. 2005. Mixture models in phylogenetic inference. Pages 121–139 in Mathematics of evolution and phylogeny (O.
Gascuel, ed.). Oxford University Press, New York.
Pagel, M., A. Meade, and D. Barker. 2004. Bayesian estimation of ancestral character states on phylogenies. Syst. Biol. 53:673–84.
Pagel, M., C. Venditti, and A. Meade. 2006. Large punctuational contribution of speciation to evolutionary divergence at the molecular
level. Science 314:119–21.
Philippe, H., Y. Zhou, H. Brinkmann, N. Rodrigue, and F. Delsuc. 2005.
Heterotachy and long-branch attraction in phylogenetics. BMC Evol.
Biol. 5:50–58.
Raftery, A. E. 1996. Hypothesis testing and model selection. Pages 163–
187 in Markov chain Monte Carlo in practice (W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds.). Chapman & Hall, London.
Rambaut, A. 2002. PhyloGen: Phylogenetic tree simulator package, version 1.1. Department of Zoology, University of Oxford.
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: An application for the
Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–8.
Ronquist, F., B. Larget, J. P. Huelsenbeck, J. B. Kadane, D. Simon, and P.
van der Mark. 2006. Comment on “Phylogenetic MCMC algorithms
are misleading on mixtures of trees.” Science 312:367.
Simon, C., T. R. Buckley, F. Frati, J. Stewart, and B. A. 2006. Incorporating
molecular evolution into phylogenetic analysis, and a new compilation of conserved polymerase chain reaction primers for animal
mitochondrial DNA. Annu. Rev. Ecol. Syst. 37:545–579.
Telford, M. J., M. J. Wise, and V. Gowri-Shankar. 2005. Consideration
of RNA secondary structure significantly improves likelihood-based
estimates of phylogeny: Examples from the bilateria. Mol. Biol. Evol.
22:1129–1136.
Venditti, C., and M. Pagel. 2008. Model misspecification not the node-density effect. Evolution In press.
Venditti, C., A. Meade, and M. Pagel. 2006. Detecting the node-density
artifact in phylogeny reconstruction. Syst. Biol. 55:637–643.
Wahlberg, N. 2006. That awkward age for butterflies: Insights from the
age of the butterfly subfamily Nymphalinae (Lepidoptera: Nymphalidae). Syst. Biol. 55:703–714.
Webster, A. J., R. J. Payne, and M. Pagel. 2003. Molecular phylogenies
link rates of evolution and speciation. Science 301:478.
Xiang, Q. Y., W. H. Zhang, R. E. Ricklefs, H. Qian, Z. D. Chen, J. Wen,
and J. L. Hua. 2004. Regional differences in rates of plant speciation
and molecular evolution: A comparison between eastern Asia and
eastern North America. Evolution 58:2175–84.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from
DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306–314.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models.
Mol. Biol. Evol. 17:32–43.
First submitted 11 October 2007; reviews returned 7 December 2007;
final accept 16 January 2008
Associate Editor: Thomas Buckley
APPENDIX
INVARIABLE SITES AND TREE LENGTH
The invariable site model is a mixture model based on a conventional model matrix (GTR, for example) and the invariable model matrix (which assumes sites do not change). The weight assigned to the
invariable model is estimated from the data during the analysis.
We suggest that the invariable sites model can lead to longer trees
because the length of the tree does not influence the likelihood of a site
evaluated under the invariable component of this model. Other models,
either pure single matrix models or alternative mixture models, have to
take account of the sites the invariable model assumes do not change.
They do so by assuming that these sites evolve slowly. This translates
to shorter trees. If this is true, one would expect tree lengths to get
shorter as the weight afforded to the invariable sites model is reduced.
We used the Jansa et al. (2006) data set to examine this.
We inferred a sample of trees as before using the 1Q+Γ+I model,
and the proportion of invariable sites (and the weight the model contributes to the likelihood) was estimated to be 0.36. We also inferred
three more samples of phylogenies, two of which used the 1Q+Γ+I
model. In one, the weight afforded to the invariable sites model was
fixed at 0.2, and in the other at 0.1. We also ran a simple
1Q+Γ model and compared the tree lengths. Figure A1 shows the
mean inferred tree length from each model. As we expect, as the
weight given to the invariable sites model is reduced the trees get
shorter.
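The argument can be illustrated with a deliberately simple case that is not the model used above: two sequences separated by a branch of length t under the Jukes-Cantor model, with a fixed weight w on an invariable class. A site in the invariable class can never differ, so the expected proportion of differing sites is (1 - w)(3/4)(1 - exp(-4t/3)). Setting this equal to the observed proportion and solving for t gives the maximum-likelihood branch length for a given w. The function name and the observed proportion are hypothetical, and t is measured in expected substitutions per variable site.

```python
import math

def jc69_branch_length(p_diff, w_invariable):
    """ML branch length for two sequences under JC69 plus a fixed invariable weight."""
    # Proportion of differences among the sites that are free to vary
    p_variable = p_diff / (1.0 - w_invariable)
    if not 0.0 <= p_variable < 0.75:
        raise ValueError("observed differences too large for this weight")
    return -0.75 * math.log(1.0 - 4.0 * p_variable / 3.0)

p_observed = 0.2  # hypothetical proportion of differing sites
for w in (0.36, 0.2, 0.1, 0.0):
    print(w, jc69_branch_length(p_observed, w))
```

In this toy setting the estimated branch length falls as the invariable weight is reduced, because the same observed differences are spread over a larger pool of sites that are allowed to change.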
FIGURE A1. Trees get shorter as weight attributed to the invariable
sites model in phylogenetic inference is decreased.