Syst. Biol. 57(2):286–293, 2008
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150802044045

Phylogenetic Mixture Models Can Reduce Node-Density Artifacts

CHRIS VENDITTI, ANDREW MEADE, AND MARK PAGEL
School of Biological Sciences, University of Reading, Reading, RG6 6AJ, United Kingdom

Abstract.—We investigate the performance of phylogenetic mixture models in reducing a well-known and pervasive artifact of phylogenetic inference known as the node-density effect, comparing them to partitioned analyses of the same data. The node-density effect refers to the tendency for the amount of evolutionary change in longer branches of phylogenies to be underestimated compared to that in regions of the tree where there are more nodes and thus branches are typically shorter. Mixture models allow more than one model of sequence evolution to describe the sites in an alignment without prior knowledge of the evolutionary processes that characterize the data or how they correspond to different sites. If multiple evolutionary patterns are common in sequence evolution, mixture models may be capable of reducing node-density effects by characterizing the evolutionary processes more accurately. In gene-sequence alignments simulated to have heterogeneous patterns of evolution, we find that mixture models can reduce node-density effects to negligible levels or remove them altogether, performing as well as partitioned analyses based on the known simulated patterns. The mixture models achieve this without knowledge of the patterns that generated the data and even in some cases without specifying the full or true model of sequence evolution known to underlie the data. The latter result is especially important in real applications, as the true model of evolution is seldom known.
We find the same patterns of results for two real data sets with evidence of complex patterns of sequence evolution: mixture models substantially reduced node-density effects and returned better likelihoods compared to partitioning models specifically fitted to these data. We suggest that the presence of more than one pattern of evolution in the data is a common source of error in phylogenetic inference and that mixture models can often detect these patterns even without prior knowledge of their presence in the data. Routine use of mixture models alongside other approaches to phylogenetic inference may often reveal hidden or unexpected patterns of sequence evolution and can improve phylogenetic inference. [Mixture models; molecular evolution; node-density effect; phylogeny reconstruction; simulation.]

When the models of sequence evolution that are used to infer phylogenetic trees misrepresent the true underlying processes that generated the data, phylogenetic inference can be misled and return incorrect or biased trees. A pervasive artifact of phylogenetic inference that arises from such model misspecification is the node-density effect (Fitch and Beintema, 1990; Fitch and Bruschi, 1987; Venditti et al., 2006; Webster et al., 2003). Fitch and Bruschi (1987) were the first to describe the effect, which manifests as a positive association between the path length, defined as the sum of all the branch lengths along a path from the root to the tip of a phylogeny, and number of nodes along a path, plotted across all of the paths in the tree. These authors speculated that the amount of evolutionary change in longer branches of the tree may often be underestimated, owing to the chance of “multiple hits” or more than one substitution occurring at a site along the path that the branch describes. Other things equal, as the number of nodes along a path increases, individual branch lengths will be shorter, and the amount of evolution in each branch is better estimated.
Summed over branches this gives the impression of more total evolution along paths with more nodes.

In addition to its effect on phylogenetic inference, the node-density artifact has the potential to confound comparative evolutionary studies, particularly those involving the use of branch lengths to infer evolutionary rates, or studies in which relative amounts of evolution along various paths are important (e.g., Pagel et al., 2006; Xiang et al., 2004; Webster et al., 2003).

Evolutionary changes may be underestimated because of saturation or from misspecifying the model of sequence evolution. In the former, once more than one substitution occurs per site along a branch, the evolutionary history begins to be lost. Model misspecification, on the other hand, can underestimate the true amount of evolution simply by failing to specify a model that is complex enough to describe the evolutionary process. For example, it may often be the case that different sites in a gene-sequence alignment have evolved according to qualitatively different evolutionary processes. Elsewhere we have called this “pattern heterogeneity” (Pagel and Meade, 2004) to distinguish it from rate heterogeneity, in which sites simply differ in their rate but not their pattern of evolution.

Investigators often attempt to account for heterogeneity in the patterns of evolution among sites by partitioning the data, assigning a different model of evolution to different sites. Partitioning by codon position, by gene, or by the stems and loops of ribosomal genes is a common approach. Although partitioning often leads to substantially improved fits of models to the data, several studies have now shown that sites within a given partition often evolve as heterogeneously as sites between partitions (Fitch and Beintema, 1990; Hickson et al., 1996; Lartillot and Philippe, 2004; Pagel and Meade, 2004, 2005; Ronquist et al., 2006; Simon et al., 2006).
To the extent this is true in general, we might expect partitioning approaches to suffer from node-density and other artifacts of phylogenetic inference.

Phylogenetic mixture models provide an alternative to the partitioning approach. Mixture models allow each site of a gene-sequence alignment to be characterized by more than one model of evolution. In a conventional homogeneous model of sequence evolution, all sites in a data alignment are assumed to arise from a single evolutionary process. In the case of nucleotide data, this process is represented by the familiar 4 × 4 matrix (or Q) of transition rates among A, C, G, T. The likelihood of the data is calculated as the product over sites of the individual probabilities of the data at each site:

$$P(D \mid Q, T) = \prod_{i} P(D_i \mid Q, T),$$

where the probability of the data D (a set of aligned sequences) is conditional upon the model of evolution Q and the topology T, and the product is over the i sites in the alignment.

The mixture model approach (Pagel and Meade, 2004, 2005) allows each site to be described by two or more Q’s specifying different patterns and rates of substitution. Defining different Q matrices as $Q_1, Q_2, \ldots, Q_J$, the probability of the data under the pattern heterogeneity mixture model can now be written as

$$P(D \mid Q_1, Q_2, \ldots, Q_J, T) = \prod_{i} \sum_{j} w_j P(D_i \mid Q_j, T),$$

where D and T are as above, and the summation over j (1 ≤ j ≤ J) specifies that the likelihood of the data at each site is summed over J separate rate or Q matrices. The separate Q matrices are weighted by the $w_j$, where $w_1 + w_2 + \cdots + w_J = 1.0$. The mixture model can also be combined with Yang’s (1994) popular rate-heterogeneity model (see Pagel and Meade, 2004, 2005).
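As an illustration of the mixture likelihood written above, the following minimal Python sketch combines per-site likelihoods across J rate matrices. It assumes that the per-site probabilities P(D_i | Q_j, T) have already been computed by some other routine (the tree-based calculation is not shown), and the function name `mixture_log_likelihood` is hypothetical; it is not taken from BayesPhylogenies or any other package.

```python
import math

def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a J-component mixture.

    site_likelihoods[i][j] is P(D_i | Q_j, T), assumed precomputed elsewhere.
    weights[j] is the mixture weight w_j; the weights must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    total = 0.0
    for per_model in site_likelihoods:
        # Weighted sum over the J rate matrices at this site
        site = sum(w * p for w, p in zip(weights, per_model))
        # Product over sites becomes a sum of logs
        total += math.log(site)
    return total

# Toy example: two sites, two rate matrices (illustrative numbers only)
example = [[0.2, 0.4], [0.1, 0.3]]
lnl_mixture = mixture_log_likelihood(example, [0.5, 0.5])
lnl_single = mixture_log_likelihood(example, [1.0, 0.0])
```

Setting one weight to 1 and the others to 0 recovers the conventional homogeneous likelihood, which is one way to see that the 1Q model is a special case of the mixture.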
The mixture model is implemented within a Bayesian framework in the computer package BayesPhylogenies (available from www.evolution.reading.ac.uk), using uniform prior distributions throughout except for an exponential (with a mean of 10) prior distribution on branch lengths. Lartillot and Philippe (2004) introduced a similar mixture model for amino acid evolution. Typically sites in the alignment will find their best description under one of the models of evolution, but the best model will be different among sites. An attractive feature of mixture models is that they do not require a priori assignment of each site to a particular model or of the model’s prior probability. Instead, the mixture model approach automatically identifies the site patterns from the variation in the data and estimates their weights. A growing body of evidence suggests that mixture models can often better characterize sequence evolution, resulting in improved likelihood scores, topological changes, increased tree length, and reduced long-branch attraction (Lartillot and Philippe, 2004; Lewis et al., 2006; Pagel and Meade, 2004, 2005; Philippe et al., 2005; Simon et al., 2006).

Our interest here is to investigate pattern heterogeneity as a source of node-density artifacts in phylogenetic inference and whether mixture models can detect it well enough to reduce or even remove their effects altogether, even without prior knowledge of the true patterns in the data. Of particular interest is to discover whether node-density effects can be effectively removed even when the model of sequence evolution is “incomplete”; that is, not a full description of the model used to generate the data.
This question is important because researchers will seldom know what the true model of evolution is, and yet they may still wish to use inferred trees to investigate historical events that rely on accurate branch length information, such as ancestral states and adaptive trends (e.g., Organ et al., 2007), or to retrieve dates from molecular clocks.

We use simulated and real data to answer these questions. The simulations are designed to identify the strength and degree of node-density effects that arise when pattern heterogeneity is incorrectly characterized and whether mixture models of increasing complexity can eliminate or reduce node-density artifacts. We make no attempt to simulate the specific evolutionary patterns that might be expected from, for example, codon-based models, models of secondary structure, or models with invariable sites. We expect that the effects of poorly characterizing the variation or patterns that these models give rise to will be of a similar nature to the effects we observe in our simulations, even if the precise details differ from one combination of simulation parameters to another. The simulations provide a best-case scenario that can be used as a benchmark against which to compare the performance of mixture models in real data. Accordingly, we also analyze two real data sets assumed to harbor complex patterns of sequence evolution.

SIMULATED DATA SETS

Phylogeny

We require a phylogeny and sequence data simulated to have pattern heterogeneity. We used PhyloGen (Rambaut, 2002) with the speciation rate set to twice that of the extinction parameter (birth = 0.2, death = 0.1) to produce a random ultrametric tree of 50 species. We added an “artificial” outgroup to the tree to ensure that, at the phylogenetic inference step, all branches leading to the root were estimated properly. We use a single tree because our goal is to identify what effects may arise rather than to establish generality.
Elsewhere (Venditti et al., 2006), we have shown that the node-density artifact can arise in almost any topology for a simulated tree of this size.

Pattern Heterogeneity in Simulated Gene-Sequence Data

We used Seq-Gen (Rambaut and Grassly, 1997) to simulate four gene-sequence alignments of 5000 sites each, using the tree described above. To produce pattern heterogeneity in a simulated sequence, we drew 1000 sites each from five different general time reversible (GTR) rate matrices. It is the qualitative differences among the rate parameters in the successive GTR matrices that produce pattern heterogeneity, and we varied the degree of this heterogeneity among the four simulated alignments. Thus we presume that each site was generated by a single process but that different sites derive from different processes.

TABLE 1. The categories assigned to the rates of the GTR matrices used to generate the simulated alignments. The categories indicate the uniform distribution from which the rate was drawn. How variable the alignment was depended on the number of categories used to generate the alignment (see text). The table shows which categories were used in each data set (√ = used, - = not used).

Uniform interval from     One-interval   Two-interval   Three-interval   Four-interval
which rate was drawn      data set       data set       data set         data set
0–0.1                     -              -              √                √
0–1                       -              √              √                √
0–10                      √              √              √                √
0–100                     -              -              -                √

For the least variable alignment, we drew the rate parameters for each of the five GTR matrices, denoted by Q, at random from the same underlying uniform interval (0–10). We call this the one-interval data set. Pattern heterogeneity can arise among the sites in this alignment even though all of the rate parameters are drawn from the same distribution. For example, the A ↔ C rate might be greater than the A ↔ T rate in one matrix but smaller in another.
For the two-interval data set, we first randomly assigned each rate parameter in each of the five matrices to one of two uniform intervals (0–1 or 0–10) and then drew at random a value from the appropriate interval for each rate parameter. For the three-interval data set, we assigned each rate in the GTR matrices to one of three uniform intervals (0–0.1 or 0–1 or 0–10) and then drew at random a value from that interval. We carried out the same procedure for the most variable four-interval data set, except each rate was assigned to one of four uniform intervals (0–0.1 or 0–1 or 0–10 or 0–100). Different uniform intervals and/or different numbers of categories of intervals would lead to different results. Table 1 shows which categories were used in each of the simulated alignments.

Phylogenetic Reconstruction

We used the Bayesian mixture model described in Pagel and Meade (2004, 2005) to produce posterior samples of phylogenetic trees from each of the simulated alignments. The mixture model calculates the likelihood of the data by summing the likelihood at each site over more than one model of sequence evolution, without prior partitioning of the data. We identified a model of sequence evolution by Q. For each data set we inferred trees using five different models, ranging from a simple 1Q model (the conventional nonmixture or homogeneous GTR model) through to mixture models with two, three, four, and five Q’s. We did not specify the (known) values of the rate coefficients or the weights in advance, but rather estimated them from the data. Our expectation is that when the model is underspecified (fewer than five Q’s), the inferred trees will suffer from node-density effects, but that these will diminish as the models become more complex.
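The interval scheme used to generate the rate parameters of the simulated alignments (Table 1) can be sketched in Python as follows. This is an illustration only, not the script used for the study: it assumes that each of the six GTR exchange rates is assigned to one of the available intervals with equal probability, which the text does not state, and the names `INTERVALS`, `GTR_RATES`, and `draw_gtr_matrices` are hypothetical. Base frequencies and the Seq-Gen step itself are not shown.

```python
import random

# Upper bounds of the uniform intervals (lower bound 0) for each data set, per Table 1
INTERVALS = {
    "one": [10.0],
    "two": [1.0, 10.0],
    "three": [0.1, 1.0, 10.0],
    "four": [0.1, 1.0, 10.0, 100.0],
}

# The six exchange rates of a GTR matrix
GTR_RATES = ["AC", "AG", "AT", "CG", "CT", "GT"]

def draw_gtr_matrices(scheme, n_matrices=5, rng=None):
    """Draw rate parameters for n_matrices GTR matrices under one interval scheme."""
    rng = rng or random.Random()
    matrices = []
    for _ in range(n_matrices):
        rates = {}
        for name in GTR_RATES:
            # Assumption: intervals are chosen with equal probability
            upper = rng.choice(INTERVALS[scheme])
            # Draw the rate uniformly within the chosen interval
            rates[name] = rng.uniform(0.0, upper)
        matrices.append(rates)
    return matrices

# Example: five matrices for the most variable (four-interval) data set
four_interval = draw_gtr_matrices("four", rng=random.Random(1))
```

Under this sketch, the four-interval scheme can place rates within the same matrix several orders of magnitude apart, which is the sense in which that data set is the most variable.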
We also analyzed each data set using a Bayesian reversible-jump (Green, 1995) implementation of the Pagel and Meade (2004, 2005) mixture model to determine how many different rate matrices were required to explain the data (implemented in the BayesPhylogenies package, available from www.evolution.reading.ac.uk). We used a uniform prior distribution on Q’s; although a Dirichlet prior process gave the same answers, it converged more slowly. The reversible-jump procedure automatically moves among Markov chains with different numbers of rate matrices, and at convergence estimates the posterior support for these different chains (there is no a priori limit to the number of matrices that can be estimated). Even though we generated the alignments from five distinct rate matrices, the reversible-jump model will reveal whether these leave sufficiently distinct patterns in the data to be included in the inference model.

To compare the mixture models to the true model, we estimated posterior samples of trees for each alignment based on partitioning according to the known pattern heterogeneity in the data (corresponding to perfect prior knowledge of sites). For each simulated alignment, we ran a number of independent Markov chains (at least 5,000,000 iterations) to check that the chain moved to the same region of tree space. We report results from a posterior sample of 1000 trees sampled at wide intervals (10,000 iterations) from a single chain for each of our analyses.

Testing for the Node-Density Artifact

We used the delta test (Pagel et al., 2006; Venditti et al., 2006; Webster et al., 2003) to analyze each tree in each posterior sample for evidence of the node-density effect. The delta test examines the form of the relationship between path lengths and nodes in a regression model of the form $x = \beta n^{1/\delta}$, where x is the path length, n is the number of nodes, β measures the strength of the effect, and δ is the parameter controlling the curvature of the relationship.
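To make the shape of the delta-test curve concrete, the sketch below fits x = β n^(1/δ) to a set of (nodes, path length) pairs by ordinary least squares, using a grid search over δ. It is a simplified illustration under stated assumptions: it treats the paths as independent, performs no significance test on β, and is not the phylogenetically controlled procedure used in the delta test. The function name `fit_delta_curve` and the grid of δ values are hypothetical.

```python
def fit_delta_curve(nodes, path_lengths, deltas=None):
    """Least-squares fit of x = beta * n**(1/delta) by grid search over delta.

    Assumes independent data points; this ignores shared phylogenetic history.
    """
    if deltas is None:
        # Candidate delta values from 0.5 to 5.0 in steps of 0.05
        deltas = [i / 20 for i in range(10, 101)]
    best = None
    for delta in deltas:
        basis = [n ** (1.0 / delta) for n in nodes]
        # For a fixed delta the model is linear in beta, so beta has a closed form
        beta = sum(b * x for b, x in zip(basis, path_lengths)) / sum(b * b for b in basis)
        sse = sum((x - beta * b) ** 2 for b, x in zip(basis, path_lengths))
        if best is None or sse < best[2]:
            best = (beta, delta, sse)
    return best[0], best[1]

# Noise-free toy data generated with beta = 0.5 and delta = 2
toy_nodes = list(range(1, 11))
toy_paths = [0.5 * n ** 0.5 for n in toy_nodes]
beta_hat, delta_hat = fit_delta_curve(toy_nodes, toy_paths)
```

With δ > 1, as in the toy data, each additional node adds less path length than the one before it, which is the curvature taken as the signature of the artifact.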
The delta-test equation is fitted in a fully phylogenetic context controlling for nonindependence among the species in the tree that arises from their shared phylogenetic histories (see Pagel, 1997, 1999). Fitting the delta test model requires a rooted tree and so each of the inferred trees was rooted using the artificial outgroup. We removed this outgroup before any further analysis (see Venditti et al., 2006).

As the true tree used in the simulations was ultrametric, any association between path length and nodes (i.e., any β significantly >0) can be attributed to the artifact. In real data, a positive association between path length and nodes can arise for reasons other than the artifact (Pagel et al., 2006; Webster et al., 2003; Xiang et al., 2004), and so it is necessary to distinguish a real association from the relationship caused by the node-density effect. This is achieved by simultaneously estimating the parameter δ in the equation above. Any value of δ > 1 in conjunction with a significant regression coefficient implies that path length increases at a decreasing rate as number of nodes continues to increase and is considered evidence for the node-density artifact. This method has been shown to detect the artifact in over 95% of phylogenies where it is present (Venditti et al., 2006). A more detailed description of the delta test can be found in Venditti et al. (2006) and Webster et al. (2003). Trees can be submitted online to test for the node-density artifact at www.evolution.reading.ac.uk.

REAL DATA SETS

We analyzed two published nucleotide sequence alignments (Jansa et al., 2006; Wahlberg, 2006), inferring Bayesian posterior samples of trees from the models of sequence evolution reported in the original papers and from mixture models we estimated from the data.
It is not our intent in reanalyzing these data to suggest different phylogenetic hypotheses from those these authors report but to ask whether a mixture model approach can reduce node-density artifacts.

Jansa et al. (2006) used the mitochondrial cytochrome b gene and the nuclear-encoded IRBP exon 1 to infer the phylogenetic relationships among Philippine murine rodents. These authors analyzed their data with a single substitution model and a partitioned model, although here we report only the latter as it returned substantially better likelihood scores. The partitioned model assigned a different 1Q+Γ+I model to each gene. Wahlberg (2006) studied the phylogeny of the butterfly subfamily Nymphalinae using two nuclear genes (elongation factor 1-alpha and wingless) and one mitochondrial gene (cytochrome oxidase subunit I). Wahlberg also partitioned the data, estimating a different 1Q+Γ+I model for each gene.

Phylogenetic Reconstruction

We ran a number of independent Markov chains to check that the chain moved to the same region of tree space for each analysis. We report results for each of our analyses from a posterior sample of 1000 trees sampled at wide intervals (10,000 iterations) from a single chain. To perform the partitioned analyses, we partitioned the data and applied models to the partitions following the original authors’ specifications and generated posterior samples of trees from the Markov chain. For the mixture model analyses we used the Bayesian reversible-jump implementation of the Pagel and Meade (2004, 2005) mixture model to determine how many different rate matrices were required. Rate heterogeneity is included in the mixture model using Yang’s (1994) discrete-gamma rate heterogeneity model (Γ).

In both real data sets the reversible-jump model found that five rate matrices were required to explain the data, and we generated posterior samples from these models.
We then re-ran the mixture model but restricted it to one fewer Q each time until we reached the simple 1Q model without Γ. All phylogenies were rooted with the outgroup specified in the original paper. The outgroup was then removed before any further analysis.

RESULTS

Simulated Data

Each of the four simulated alignments contains five different patterns of site evolution. We expect the likelihood of the data to improve with number of Q’s fitted in the mixture model. Figure 1a shows this to be the case and, as expected, the data set with the most variation, the four-interval data set, records the greatest improvement. Less variable data sets return a smaller overall improvement as well as a smaller improvement with each additional Q. However, the range of the y-axis masks the numerical improvement in the likelihoods. Even in the least variable data set the 5Q model improves the likelihood by 2247 log units. We observe a similar pattern in the tree lengths (Fig. 1b), with the greatest increase in tree length corresponding to the most variable data sets.

Figure 1c shows that the percentage of trees in our posterior samples that have the node-density artifact declines as number of Q’s in the mixture model is increased. The rate and magnitude of the reduction in node-density effects depends on the variability of the data set, but in all cases the node-density artifact is reduced to a negligible level using between three and five Q’s. For the one- and three-interval data sets, node-density effects fall to nearly zero using just a 3Q model of sequence evolution. This shows that node-density effects can all but disappear even when the estimated model is less complex than the true model of evolution. The percentage of trees showing node-density effects when using the 5Q mixture model is 0.2%, 1.05%, 0.4%, and 2.6% for the one-, two-, three-, and four-interval data sets, respectively (see Table 2).
This is comparable to the true partitioned model, which returns 0.0%, 0.6%, 0.4%, and 3.0% node-density effects for the one-, two-, three-, and four-interval data sets, respectively (see Table 2). Table 2 also shows the percentage of trees with node-density effects obtained from applying the reversible-jump model. These analyses showed that a 4Q model was adequate to explain the data for the one- and two-interval data sets, whereas the three- and four-interval data sets required a 5Q model.

FIGURE 1. Filled circles indicate the results from the trees inferred from the simulated four-interval data set, open circles indicate the three-interval data set, filled squares the two-interval data set, and the open squares the one-interval data set (see text and Table 1). (a) The mean likelihood value for each data set as model complexity increases. The range of the y-axis masks the numerical improvement in the likelihoods. Even in the least variable data set the 5Q model improves the likelihood by 2247.3 log units. (b) The tree length (expected nucleotide substitutions per site) for each of the data sets. (c) The percentage of trees with β significantly > 0 and δ > 1 falls as Q’s are added to the model.

TABLE 2. The percentage of trees in the posterior sample showing node-density effects in the partitioned, reversible-jump, and the 5Q models. Values in parentheses indicate the number of Q’s (rate matrices) as indicated by the reversible-jump analyses.

Model                            One-interval   Two-interval   Three-interval   Four-interval
                                 data set       data set       data set         data set
5Q mixture model                 0.2            1.05           0.4              2.6
Reversible-jump mixture model    0.2 (4Q)       3.6 (4Q)       0.4 (5Q)         2.6 (5Q)
5Q partition model               0.0            0.6            0.4              3.0

Real Data

Jansa et al. (2006) data set.—The upper panel of Table 3 reports the mean likelihood score and tree lengths over the posterior sample of trees as derived from the partitioned model used in the original paper and from the mixture model as derived from the reversible-jump approach.
The partitioned model estimates a separate 1Q+Γ+I model for each gene and yields a mean likelihood score of −39,424.8, a mean tree length of 10.1, and 32.7% of the trees have the node-density artifact. The reversible-jump mixture model found that a 5Q+Γ model was required for these data. This improved the likelihood over the partitioned analysis (log Bayes factor test of the harmonic means [mixture model versus partition model] = 1880.5; a value >10 is considered “strong evidence” for a model; Raftery, 1996) and returned fewer than one-third the number of node-density effects (Table 3).

TABLE 3. Results from applying the reversible-jump mixture model and the partitioned models to real data sets. Upper panel: Jansa et al. (2006) data set; lower panel: Wahlberg (2006) data set (see text). The mixture model improves the likelihood and reduces the percentage of trees with the node-density artifact. The conventional partitioning model results in longer trees (see text and Appendix 1).

Jansa et al. (2006)                                   5Q+Γ         Partition model
Number of parameters                                  34           20
Mean lnL                                              −38485.5     −39424.8
Harmonic mean lnL                                     −38509.1     −39449.3
Ln Bayes factor                                       1880.5
Tree length                                           9.3          10.1
Percent of trees with the node-density artifact       10.4         32.7

Wahlberg (2006)                                       5Q+Γ         Partition model
Number of parameters                                  34           30
Mean lnL                                              −34198.8     −34942.5
Harmonic mean lnL                                     −34222.4     −34965.2
Ln Bayes factor                                       1485.6
Tree length                                           4.1          4.7
Percent of trees with the node-density artifact       4            20.6

The partitioned model tree is longer than the mixture model tree, an effect we speculate arises as an artifact of the invariable sites model. In Appendix 1 we show how varying the proportion of invariable sites in this model alters the tree lengths we obtain from these data. This may occur because the length of the tree does not influence the likelihood of a site evaluated under the invariable component of this model. As more sites are assigned to this category, the tree length can increase without affecting the likelihood.
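The log Bayes factors reported for the real data sets can be checked against the harmonic-mean log-likelihoods in Table 3. The Python sketch below assumes that the reported statistic is twice the difference between the two harmonic-mean log-likelihoods; this convention is an inference from the tabulated numbers rather than a formula stated in the text. It also shows one numerically stable way to compute the log of a harmonic mean from a posterior sample of log-likelihoods. Both function names are hypothetical.

```python
import math

def harmonic_mean_lnl(log_likelihoods):
    """Log of the harmonic mean of the likelihoods in a posterior sample."""
    n = len(log_likelihoods)
    # Harmonic mean = n / sum(1 / L_i); work with -lnL_i and log-sum-exp for stability
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(n) - log_sum

def log_bayes_factor(hm_lnl_a, hm_lnl_b):
    """Assumed convention: twice the difference in harmonic-mean log-likelihoods."""
    return 2.0 * (hm_lnl_a - hm_lnl_b)

# Harmonic-mean log-likelihoods from Table 3 (mixture model versus partition model)
jansa_bf = log_bayes_factor(-38509.1, -39449.3)
wahlberg_bf = log_bayes_factor(-34222.4, -34965.2)
```

With the rounded values in Table 3 this gives 1880.4 for the Jansa et al. data and 1485.6 for the Wahlberg data, which agrees with the reported 1880.5 and 1485.6 to within the rounding of the tabulated means.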
From the perspective of the mixture model, the invariable sites model should emerge naturally as one of the Q matrices in the mixture model if this site pattern is truly present in the data. In fact, what we normally observe is that a matrix of very slow rates emerges, but not an invariable matrix, in which all rates are zero. This accords with intuition that many sites that are invariant in the data are not in fact invariable in the sense of incapable of changing. Characterizing them with a “slow” matrix will not produce the tree-length bias we think is inherent to the invariable sites model. We observe a matrix of slow rates in both data sets investigated here (one Q matrix has mean rates that are on average only 6.6% and 1.1% of the nearest other slow matrix in the Jansa et al. and Wahlberg data sets, respectively).

Figure 2a reports the mean likelihood scores for mixture models applied to the Jansa et al. data, showing how each additional Q in the mixture model improves the likelihood (the results for the partitioned model are plotted on the right of the figure). The mean tree lengths (Fig. 2b) increase with model complexity, although the majority of the increase is between the 1Q and 1Q+Γ models. In some cases, trees get shorter (for example, 2Q+Γ in Fig. 2b) despite an improved likelihood, and we find that these are normally associated with topological shifts in the trees (e.g., Pagel and Meade, 2005).

Figure 2c plots the percentage of trees in which β was significantly greater than zero and separately the percentage of trees in which β was significantly greater than zero and δ > 1. The former measures the percentage of trees with a positive relationship between nodes and path length and the latter measures the percentage of trees with the node-density artifact. If all the trees with a positive association have the node-density effects, then these numbers will be the same.
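The two percentages plotted in Figure 2c can be tallied from per-tree delta-test results as in the following Python sketch. It assumes that each tree in the posterior sample has already been summarized by an estimate of β, a p-value for the test that β exceeds zero, and an estimate of δ. The 0.05 significance threshold and the function name `summarize_posterior` are assumptions made for illustration; the text does not state the threshold used.

```python
def summarize_posterior(trees, alpha=0.05):
    """Return (% trees with a positive association, % trees with the artifact).

    trees: list of (beta, p_value, delta) tuples, one per sampled tree.
    alpha: significance threshold for beta > 0 (assumed, not from the source).
    """
    n = len(trees)
    # Positive association: beta significantly greater than zero
    positive = [t for t in trees if t[0] > 0 and t[1] < alpha]
    # Node-density artifact: positive association and delta > 1
    artifact = [t for t in positive if t[2] > 1]
    return 100.0 * len(positive) / n, 100.0 * len(artifact) / n

# Toy posterior sample of four trees (illustrative numbers only)
sample = [(0.4, 0.01, 1.8), (0.3, 0.02, 0.9), (0.1, 0.40, 1.5), (0.2, 0.03, 2.1)]
pct_positive, pct_artifact = summarize_posterior(sample)
```

In this toy sample the second tree contributes to the first percentage but not the second, which is the kind of gap between the two curves that indicates a real association independent of the artifact.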
Both percentages are very high for the simplest models, indicating a high percentage of node-density effects, but both decline to ∼10% by the 3Q+Γ model, indicating that most node-density effects have been removed. By comparison, the partitioned model returns β significantly greater than zero and δ greater than one 32.7% of the time, indicating that roughly a third of the trees have node-density effects.

FIGURE 2. Result of the analysis of the real data. Panels a, b, and c correspond to the Jansa et al. (2006) data set, panels d, e, and f to the Wahlberg (2006) data set. (a) and (d) The likelihood values of the trees inferred for the mixture model analyses (filled circles) and the partitioned model (filled square). (b) and (e) Tree length for each of the mixture model analyses (filled circles) and the original partition model (filled square). (c) and (f) The percentage of β significantly > 0 (filled circles for the mixture model analyses and filled square for the partition model) and the percentage of β significantly > 0 and δ > 1 (open circles for the mixture model analyses and open square for the partition model) for the models analyzed.

Wahlberg (2006) data set.—The lower panel of Table 3 tells a similar story for the Wahlberg alignment. The reversible-jump mixture model settled on a 5Q+Γ solution, yielding a better likelihood than the partitioned analysis (log Bayes factor test of harmonic mean = 1485.6) and about 80% fewer node-density effects. The mixture model tree is shorter, which we think again is an artifact of the invariable sites model (see Appendix 1). As with the Jansa et al. (2006) alignment, we observe a general increase in the likelihood and in tree length for the Wahlberg data (Fig. 2d) as model complexity increases, with fluctuations in tree length (Fig. 2e). However, in contrast to the Jansa et al.
(2006) data set, the percentage of trees in the Wahlberg sample that retain a significant positive relationship between nodes and path length (i.e., β significantly >0) only falls slightly with model complexity (Fig. 2f); at the same time, the percentage showing the node-density effect (β significantly > 0 and δ > 1) declines rapidly to the final value of 4%. This shows that there is a relationship between nodes and path length in the Wahlberg tree that is independent of node-density effects. We cannot say why this relationship arises here, although in other work (Pagel et al., 2006) we have shown that this pattern can sometimes be interpreted as evidence for punctuational episodes of molecular evolution associated with speciation.

DISCUSSION

Our results show that pattern heterogeneity in gene sequence alignments can cause significant node-density artifacts in inferred trees. Mixture models can substantially reduce or even remove these artifacts, despite the fact that the model is not based upon any prior knowledge of the genes or the patterns of evolution that exist in the data. Encouragingly, the mixture models can do this even when the model is an incomplete description of the data, a result we observe in both the real and simulated data sets. Hugall and Lee (2007) recently suggested that currently available model selection and optimization procedures are not sufficient to characterize evolution adequately enough to reduce or remove node-density artifacts. This led them to question some of the trees we have found to be free of these effects (Webster et al., 2003), in effect asserting that they must suffer from node-density artifacts. However, our finding that it is possible to reduce to negligible levels or even remove node-density effects altogether with an incomplete mixture model shows this worry to be overstated (see also Venditti and Pagel, 2008).
Part of the mixture model's success seems to derive from an ability to find patterns of sequence evolution that are not detected in partitioned analyses of the same data. Partitioned analyses of gene-sequence data make sense in principle, but at least in the two real data sets we studied here, the mixture models lead to substantial improvements in the likelihood of the data and greatly reduced node-density effects. Anecdotally, we have used mixture models to analyze the more than 120 well-sampled alignments reported in a previous study (Pagel et al., 2006). The majority of these require more than one model of sequence evolution. In 12 of these data sets, the original authors reported their partitioning strategies and the likelihoods of their data in sufficient detail for us to make comparisons to the mixture model. In all but one case, the mixture model improves on these partitioned analyses. In that one case, the analysis relies upon 42 partitions. These results suggest that pattern heterogeneity is widespread and that mixture models provide an attractive approach to detect it.

We have not explicitly simulated or investigated the patterns of site evolution that might be expected of codon-based models (e.g., Goldman and Yang, 1994; Muse and Gaut, 1994; Yang and Nielsen, 2000), of models that presume that evolution is constrained by correlations among sites that arise from secondary structure (e.g., Hudelot et al., 2003; Telford et al., 2005), or of other models that incorporate selection or at least nonrandom patterns of nucleotide substitutions. These models are, like the pattern heterogeneity mixture model, homogeneous in applying the same model of evolution throughout the tree. Our expectation then is that they will in general each produce their own characteristic patterns of site evolution and that mixture models will be able to detect them.
In accord with this expectation, we have shown elsewhere (Pagel and Meade, 2004) using a mixture model approach applied to ribosomal data that stem and loop sites cannot be easily assigned to different evolutionary models: many stem sites evolve like loops and vice versa. Equally, we and others (Bofkin and Goldman, 2007; Pagel and Meade, 2004) have shown that there is often as much variation in the evolutionary patterns within codon positions as there is between them. The relatively poor performance of the partitioned models suggests that investigators’ hunches about the way gene sequences evolve are often not upheld in the real world. Our analysis of the Wahlberg data shows how a potentially interesting trend in the data can be missed by partitioned analysis. Where differences among conventional partitions do exist, mixture models will find them anyway.

At the very least, then, we suggest that mixture models should become a routine component of the phylogeneticist’s armory, fitted alongside more conventional models, and become part of the standard model testing and selection procedure. Where mixture models improve upon these other approaches they should be used. If this is shown generally to be the case, real computational benefits could emerge: Felsenstein (2004) has calculated that a codon-based model is 3547 times more computationally intense than a nucleotide model!

Nevertheless, we do not suggest that mixture models offer a panacea for gene-sequence analysis. For example, we should not necessarily expect a pattern heterogeneity mixture model to detect the variability in gene sequence data that arises from nonhomogeneous processes, such as nonstationarity and heterotachy. The former arises from, for example, directional tendencies in GC content among lineages, whereas heterotachy refers to the phenomenon of sites evolving at different rates in different regions of the tree.
Neither is commonly accounted for in phylogenetic studies, and both may bias inferences (Lake, 1994; Lockhart et al., 1994, 2006; Lopez et al., 2002; Mooers and Holmes, 2000). Processes such as these may account for the 4% to 10% of trees in the real data sets that suffer from node-density artifacts (see Table 3). If researchers suspect that nonhomogeneous processes have operated in their data, then they should apply models specifically designed for these processes rather than ones that assume homogeneous evolution throughout the tree.

The uses of phylogenies extend far beyond simply describing how organisms are related to each other. Many evolutionary comparative studies, including those analyzing evolutionary rates, making ancestral reconstructions, or attempting to date divergence times, rely on the accurate reconstruction of branch lengths. Our results show that mixture models often make it possible to characterize and interpret complex signals that exist in molecular sequence data and that are invisible to many conventional models, reducing artifacts and producing trees with accurately estimated branch lengths.

ACKNOWLEDGMENTS

This work was supported by grant NE/C51992X/1 from the Natural Environment Research Council, United Kingdom, to M.P.

REFERENCES

Bofkin, L., and N. Goldman. 2007. Variation in evolutionary processes at different codon positions. Mol. Biol. Evol. 24:513–521.
Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Fitch, W. M., and J. J. Beintema. 1990. Correcting parsimonious trees for unseen nucleotide substitutions: The effect of dense branching as exemplified by ribonuclease. Mol. Biol. Evol. 7:438–43.
Fitch, W. M., and M. Bruschi. 1987. The evolution of prokaryotic ferredoxins—With a general method correcting for unobserved substitutions in less branched lineages. Mol. Biol. Evol. 4:381–94.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.
Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732.
Hickson, R. E., C. Simon, A. Cooper, G. S. Spicer, J. Sullivan, and D. Penny. 1996. Conserved sequence motifs, alignment, and secondary structure for the third domain of animal 12S rRNA. Mol. Biol. Evol. 13:150–169.
Hudelot, C., V. Gowri-Shankar, H. Jow, M. Rattray, and P. G. Higgs. 2003. RNA-based phylogenetic methods: Application to mammalian mitochondrial RNA sequences. Mol. Phylogenet. Evol. 28:241–252.
Hugall, A. F., and M. S. Y. Lee. 2007. The likelihood node density effect and consequences for evolutionary studies of molecular rates. Evolution 61:2293–2307.
Jansa, S. A., F. K. Barker, and L. R. Heaney. 2006. The pattern and timing of diversification of Philippine endemic rodents: Evidence from mitochondrial and nuclear gene sequences. Syst. Biol. 55:73–88.
Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455–1459.
Lartillot, N., and H. Philippe. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. 21:1095–1099.
Lewis, R. L., A. T. Beckenbach, and A. O. Mooers. 2006. The phylogeny of the subgroups within the melanogaster species group: Likelihood tests on COI and COII sequences and a Bayesian estimate of phylogeny. Mol. Biol. Evol. 37:15–24.
Lockhart, P., P. Novis, B. G. Milligan, J. Riden, A. Rambaut, and T. Larkum. 2006. Heterotachy and tree building: A case study with plastids and eubacteria. Mol. Biol. Evol. 23:40–45.
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605–612.
Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19:1–7.
Mooers, A., and E. C. Holmes. 2000. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. 15:356–365.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724.
Organ, C. L., A. M. Shedlock, A. Meade, M. Pagel, and S. V. Edwards. 2007. Origin of avian genome size and structure in non-avian dinosaurs. Nature 446:180–4.
Pagel, M. 1997. Inferring evolutionary processes from phylogenies. Zool. Scripta 26:331–348.
Pagel, M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877–84.
Pagel, M., and A. Meade. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53:571–81.
Pagel, M., and A. Meade. 2005. Mixture models in phylogenetic inference. Pages 121–139 in Mathematics of evolution and phylogeny (O. Gascuel, ed.). Oxford University Press, New York.
Pagel, M., A. Meade, and D. Barker. 2004. Bayesian estimation of ancestral character states on phylogenies. Syst. Biol. 53:673–84.
Pagel, M., C. Venditti, and A. Meade. 2006. Large punctuational contribution of speciation to evolutionary divergence at the molecular level. Science 314:119–21.
Philippe, H., Y. Zhou, H. Brinkmann, N. Rodrigue, and F. Delsuc. 2005. Heterotachy and long-branch attraction in phylogenetics. BMC Evol. Biol. 5:50–58.
Raftery, A. E. 1996. Hypothesis testing and model selection. Pages 163–187 in Markov chain Monte Carlo in practice (W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds.). Chapman & Hall, London.
Rambaut, A. 2002. PhyloGen: Phylogenetic tree simulator package, version 1.1. Department of Zoology, University of Oxford.
Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–8.
Ronquist, F., B. Larget, J. P. Huelsenbeck, J. B. Kadane, D. Simon, and P. van der Mark. 2006. Comment on “Phylogenetic MCMC algorithms are misleading on mixtures of trees.” Science 312:367.
Simon, C., T. R. Buckley, F. Frati, J. Stewart, and B. A. 2006. Incorporating molecular evolution into phylogenetic analysis, and a new compilation of conserved polymerase chain reaction primers for animal mitochondrial DNA. Annu. Rev. Ecol. Syst. 37:545–579.
Telford, M. J., M. J. Wise, and V. Gowri-Shankar. 2005. Consideration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: Examples from the bilateria. Mol. Biol. Evol. 22:1129–1136.
Venditti, C., and M. Pagel. 2008. Model misspecification not the node-density effect. Evolution In press.
Venditti, C., A. Meade, and M. Pagel. 2006. Detecting the node-density artifact in phylogeny reconstruction. Syst. Biol. 55:637–643.
Wahlberg, N. 2006. That awkward age for butterflies: Insights from the age of the butterfly subfamily Nymphalinae (Lepidoptera: Nymphalidae). Syst. Biol. 55:703–714.
Webster, A. J., R. J. Payne, and M. Pagel. 2003. Molecular phylogenies link rates of evolution and speciation. Science 301:478.
Xiang, Q. Y., W. H. Zhang, R. E. Ricklefs, H. Qian, Z. D. Chen, J. Wen, and J. L. Hua. 2004. Regional differences in rates of plant speciation and molecular evolution: A comparison between eastern Asia and eastern North America. Evolution 58:2175–84.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306–314.
Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32–43.
First submitted 11 October 2007; reviews returned 7 December 2007; final accept 16 January 2008
Associate Editor: Thomas Buckley

APPENDIX 1
INVARIABLE SITES AND TREE LENGTH

The invariable sites model is a mixture model based on a conventional model matrix (GTR, for example) and the invariable model matrix (which assumes sites do not change). The weight assigned to the invariable model is estimated from the data during the analysis. We suggest that the invariable sites model can lead to longer trees because the length of the tree does not influence the likelihood of a site evaluated under the invariable component of this model. Other models, either pure single matrix models or alternative mixture models, have to take account of the sites the invariable model assumes do not change. They do so by assuming that these sites evolve slowly. This translates to shorter trees. If this is true, one would expect tree lengths to get shorter as the weight afforded to the invariable sites model is reduced.

We used the Jansa et al. (2006) data set to examine this. We inferred a sample of trees as before using the 1Q+4+I model, and the proportion of invariable sites (and the weight the model contributes to the likelihood) was estimated to be 0.36. We also inferred three more samples of phylogenies. Two used the 1Q+4+I model, with the weight afforded to the invariable sites model fixed at 0.2 in one and at 0.1 in the other. The third used a simple 1Q+4 model. We then compared the tree lengths. Figure A1 shows the mean inferred tree length from each model. As we expect, the trees get shorter as the weight given to the invariable sites model is reduced.

FIGURE A1. Trees get shorter as the weight attributed to the invariable sites model in phylogenetic inference is decreased.
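The reasoning in this appendix can be illustrated with a two-sequence toy calculation. The sketch below is not the analysis reported here: it assumes a Jukes-Cantor substitution matrix without among-site rate variation in place of the models used in the study, a single branch of length t in place of a tree, and a hypothetical alignment of 800 identical and 200 differing sites. The function names are likewise illustrative. The weights 0.36, 0.2, and 0.1 are those quoted above, and a weight of 0 stands in for a model with no invariable component.

```python
import math


def log_likelihood(t, w, n_same, n_diff):
    """Log-likelihood (up to a constant) of two aligned sequences under a
    mixture of an invariable component (weight w) and Jukes-Cantor (weight 1 - w).

    Sites are collapsed into two classes: identical or different.
    """
    e = math.exp(-4.0 * t / 3.0)
    # The invariable component only ever produces identical sites, and its
    # contribution (w) does not depend on the branch length t.
    p_same = w + (1.0 - w) * (0.25 + 0.75 * e)
    p_diff = (1.0 - w) * 0.75 * (1.0 - e)
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def ml_branch_length(w, n_same, n_diff):
    """Branch length that maximizes log_likelihood for a fixed weight w.

    Obtained by setting the model's probability of a differing site equal to
    the observed proportion p; valid while 4p / (3(1 - w)) < 1.
    """
    p = n_diff / (n_same + n_diff)
    return -0.75 * math.log(1.0 - 4.0 * p / (3.0 * (1.0 - w)))


# Hypothetical data: 800 identical sites and 200 differing sites.
N_SAME, N_DIFF = 800, 200
WEIGHTS = (0.36, 0.2, 0.1, 0.0)

lengths = {w: ml_branch_length(w, N_SAME, N_DIFF) for w in WEIGHTS}
for w in WEIGHTS:
    print(f"invariable weight {w:.2f}: estimated branch length {lengths[w]:.4f}")
```

In this toy setting the estimated branch length falls as the weight of the invariable component is lowered, because the substitution component must then account for more of the unchanged sites by implying less change per site.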