Lack of Resolution in the Animal Phylogeny: Closely Spaced Cladogeneses or Undetected Systematic Errors?

Denis Baurain,* Henner Brinkmann,* and Hervé Philippe*

*Canadian Institute for Advanced Research and Département de Biochimie, Université de Montréal, Montréal, Québec, Canada; and Département des Sciences de la Vie, Université de Liège, Liège, Belgium

A recent phylogenomic study reported that the animal phylogeny was unresolved despite the use of 50 genes. This lack of resolution was interpreted as “a positive signature of closely spaced cladogenetic events.” Here, we propose that this lack of resolution is rather due to the mutual cancellation of the phylogenetic signal (historical) and the nonphylogenetic signal (due to systematic errors) that results from inadequate taxon sampling and/or model of sequence evolution. Starting with a data set of comparable size, we use 3 different strategies to reduce the nonphylogenetic signal: 1) increasing the number of species; 2) replacing a fast-evolving species by a slowly evolving one; and 3) using a better model of sequence evolution. In all cases, the phylogenetic resolution is markedly improved, in agreement with our hypothesis that the originally reported lack of resolution was artifactual.

Recently, Rokas, Kruger, and Carroll (2005) (RKC) were unable to resolve most nodes of the phylogenetic tree of animals despite the use of 50 genes (12,060 amino acid positions) from 17 animal species. After dismissing potential alternative explanations (e.g., missing data, long-branch attraction, compositional bias, and taxon sampling), they concluded that “the lack of resolution is a positive signature of closely spaced cladogenetic events.” This conclusion is at odds with several similar studies reporting a good resolution for the animal tree (Philippe, Lartillot, et al. 2005; Delsuc et al. 2006; Marletaz et al. 2006; Matus et al. 2006).

Although phylogenomic approaches are effective at reducing the stochastic (sampling) error, they have been shown to be sensitive to systematic errors (Delsuc et al. 2005). Systematic errors stem from the inaccuracy of tree reconstruction methods, which, in a probabilistic framework, is directly related to model misspecifications (Felsenstein 2004; Philippe, Zhou, et al. 2005). In contrast to the genuine phylogenetic signal, the signal that results from systematic errors is called nonphylogenetic (Ho and Jermiin 2004; Philippe, Delsuc, et al. 2005). Interestingly, RKC trees have markedly lower statistical support with Parsimony than with maximum likelihood (ML). Because Parsimony is known to be more sensitive to systematic errors than ML (Felsenstein 2004), we propose that the lack of resolution observed by Rokas et al. (2005) is due to the presence of a similar amount of conflicting phylogenetic and nonphylogenetic signal.

The nonphylogenetic signal results from the incorrect interpretation, by the reconstruction method, of multiple substitutions occurring at a given position over the phylogeny (e.g., convergent acquisition of an A by 2 unrelated AT-rich species) (Olsen 1987). Consequently, it will be stronger for ancient and fast-evolving lineages. Given the age of the metazoan radiation, any alignment of animal sequences is expected to contain a sizable amount of nonphylogenetic signal. This is especially true for the RKC data set, which included several fast-evolving species (e.g., Caenorhabditis and platyhelminths). If our interpretation is correct, using strategies that reduce the nonphylogenetic signal without altering the genuine phylogenetic signal should increase resolution. As abundantly discussed, reduction of the nonphylogenetic signal can be achieved either by improving the detection of multiple substitutions or by minimizing their number. Accordingly, we use 3 complementary approaches to test our hypothesis: 1) increasing the number of species (improved detection; Hendy and Penny 1989; Hillis 1996); 2) replacing a fast-evolving species by a slowly evolving species (minimization; Aguinaldo et al. 1997; Graybeal 1998); and 3) using a better model of sequence evolution (improved detection; Olsen 1987; Whelan and Goldman 2001; Lartillot and Philippe 2004).

Key words: animal evolution, phylogenomics, phylogenetic resolution, systematic error, nonphylogenetic signal.
E-mail: [email protected].
Mol. Biol. Evol. 24(1):6–9. 2007. doi:10.1093/molbev/msl137. Advance Access publication September 29, 2006.
© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

The sparse taxon sampling of the RKC data set prevented its use in this work. Instead, we updated our previous alignments (Delsuc et al. 2006) and assembled a data set of 133 genes (31,089 amino acid positions) from 57 species. To reduce the computational burden, only the 12,942 positions determined for at least 75% of the species were retained, yielding a data set with 12% of missing positions. Representative groups for the 3 clades currently recognized within Bilateria (Halanych 2004) were chosen as follows: Deuterostomia (tunicates and vertebrates), Ecdysozoa (arthropods, nematodes, and tardigrades), and Lophotrochozoa (annelids, mollusks, and platyhelminths), the latter 2 groups forming the Protostomia. The multilayered outgroup was composed of fungi, choanoflagellates, sponges, and cnidarians. Four variants of this data set were then constructed to explore the effects of taxon sampling (see below). Because of the potential limitations of the current heuristic algorithms (Philippe, Lartillot, et al. 2005), trees were inferred using a WAG + Γ4 model (Yang 1993; Whelan and Goldman 2001) with 3 different heuristics (TREEFINDER [Jobb et al.
2004], PHYML [Guindon and Gascuel 2003], and PHYML with SPR moves [Hordijk and Gascuel 2005]) and an exhaustive analysis with constraints (see Supplementary Material online). All heuristics behaved similarly (fig. S1, Supplementary Material online) and were somewhat biased toward the artifactual attraction of nematodes and platyhelminths (see Supplementary Material online). Therefore, only PHYML–SPR bootstrap values are presented, except for the 4 internal nodes within protostomes, for which results from the exhaustive analysis are also provided. In the following, the average bootstrap support for these 4 nodes (AB4N) will be used to estimate the level of resolution. As usual, high bootstrap supports are indicative of robustness (data consistency) and not necessarily of accuracy (correct reconstruction of a clade).

FIG. 1.—Improving the phylogenetic resolution of the animal tree by reducing the nonphylogenetic signal. ML trees were computed using a WAG + Γ4 model from 23-taxa data sets, including either Caenorhabditis (A) or Xiphinema (C), and from 56-taxa data sets, including either Caenorhabditis (B) or Xiphinema (D). Nodes supported by 100% bootstrap values in PHYML–SPR analyses (Hordijk and Gascuel 2005) are denoted by black squares, whereas lower values are given in plain style. Resampling estimated log-likelihood bootstrap values from exhaustive analyses are shown in bold for the 4 unconstrained nodes used to compute the AB4N statistics. In (B), bootstrap values from the CAT model are given in italics for the same 4 nodes. All topologies (except A) are well resolved and in excellent agreement with the current consensus (Halanych 2004). As recently suggested (Goldstein and Blaxter 2002), nematodes strongly cluster with tardigrades, thus breaking the monophyly of Panarthropoda. Although a long-branch attraction artifact (Philippe, Lartillot, et al. 2005) is unlikely because Xiphinema evolves slowly, this unusual grouping should be tested by the inclusion of other ecdysozoan phyla (e.g., priapulids and onychophorans). Interestingly, the richer taxon sampling also enhanced the phylogenetic resolution in other parts of the tree, such as Cnidaria and Metazoa (to the exclusion of Suberites) (compare [A] with [B] and [C] with [D]). Ta: tardigrades; Ne: nematodes; Ar: arthropods; Pl: platyhelminths; An: annelids; Mo: mollusks; Ur: urochordates; Ve: vertebrates; Cn: cnidarians; Po: poriferans; Ch: choanoflagellates.

Trees inferred with our data set using a taxon sampling similar to RKC (23 species, fig. 1A) received marginally better support values than those obtained with the RKC alignment. In particular, we strongly recovered Bilateria and Protostomia. The low resolution reported by RKC for these nodes could be caused by an unfortunate selection of genes, including some of questionable orthology (i.e., cytosolic HSP70 [Martin and Burg 2002] and tubulins [Philippe, Lartillot, et al. 2005]). Alternatively, it could stem from a slightly different species sampling (especially hexactinellid and calcareous poriferans). We will thus focus on protostome relationships, for which the resolution of both data sets is similar. Inside Protostomia, the artifactual grouping of the fast-evolving nematode Caenorhabditis with platyhelminths (Philippe, Lartillot, et al. 2005) confirms the presence of a strong nonphylogenetic signal.

To test our hypothesis that the lack of resolution is due to the mutual cancellation of phylogenetic and nonphylogenetic signal, we 1) added 33 animal species to improve the detection of multiple substitutions by breaking long branches (Hendy and Penny 1989; Hillis 1996) and 2) replaced Caenorhabditis by a slowly evolving nematode to naturally reduce the number of multiple substitutions (Aguinaldo et al.
1997; Graybeal 1998). With 56 species (fig. 1B), the tree switched to a topology where nematodes cluster with tardigrades and arthropods within Ecdysozoa (66%). This shows that the artifactual grouping of nematodes and platyhelminths can be overcome by considering a large number of taxa. In addition, the statistical support increased throughout the tree and particularly within Protostomia (AB4N of 76% vs. 58%). When nematodes are represented by Xiphinema, which has evolved approximately 2 times more slowly than Caenorhabditis (fig. 1C), the same topological change occurs, but the phylogenetic resolution increases more noticeably (AB4N of 88%). Of the 2 approaches used to reduce the nonphylogenetic signal, the replacement of a fast-evolving by a slowly evolving species appears more efficient than the addition of numerous species. In fact, the major advantage of a rich taxon sampling is to allow the identification and the elimination of rogue taxa (e.g., Caenorhabditis), a strategy that reduces the amount of nonphylogenetic signal while preserving the phylogenetic signal (Aguinaldo et al. 1997). As expected, the combination of both approaches (fig. 1D) further improves the resolution (AB4N of 96%). In summary, through the generation of a large amount of nonphylogenetic signal, an inadequate taxon sampling (Rokas et al. 2005) can drastically affect the resolving power of phylogenomics (compare fig. 1A and D).

A third way to overcome the nonphylogenetic signal is to use better evolutionary models (Whelan et al. 2001; Felsenstein 2004; Philippe, Delsuc, et al. 2005). Hence, to further test our hypothesis, we used the category (CAT) + Γ model (Lartillot and Philippe 2004) in a Markov chain Monte Carlo framework, as implemented in PhyloBayes (http://www.lirmm.fr/mab/). The CAT model explicitly handles the heterogeneity of the substitution process across positions.
In particular, it ensures a better detection of multiple substitutions at positions displaying only 2 or 3 different amino acids (Lartillot et al. 2006). Instead of evaluating the resolution with posterior probabilities, we performed a bootstrap analysis to allow the comparison with the ML approach used so far. Despite the use of Caenorhabditis (fig. 1B and Supplementary Material online), the CAT model yielded better support values than the WAG model (for 23 species, AB4N of 66% vs. 58% and for 56 species, 91% vs. 76%). In addition, a more careful examination (fig. 1B) shows that 3 out of 4 nodes (Lophotrochozoa, Ecdysozoa, and nematodes + tardigrades) received 100% bootstrap support, whereas the last node, corresponding to the position of platyhelminths within Lophotrochozoa, received relatively low bootstrap support (65%). This raises an interesting point concerning the connection between phylogenetic/nonphylogenetic signal and the observed resolution. As proposed by our hypothesis, the lack of resolution can be due to the mutual cancellation of both signals, even when the genuine phylogenetic signal is strong (e.g., Ecdysozoa). In contrast, the apparent resolution of the basal position of platyhelminths within Lophotrochozoa is likely caused by a strong nonphylogenetic signal (not surprising given their long branch). When the CAT model is used, the support for this artifactual position decreases markedly, which suggests that the genuine phylogenetic signal for positioning platyhelminths is weak. Interestingly, the CAT model is less sensitive to taxon sampling than the WAG model and is therefore a valuable step toward the elaboration of an optimal model that would have the desirable property of drawing inferences insensitive to taxon sampling.
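
As a quick arithmetic check of the AB4N statistic for the CAT analysis discussed above, the following minimal Python sketch averages the 4 bootstrap values quoted in the text (100%, 100%, 100%, and 65%). It assumes that AB4N is a plain arithmetic mean of the 4 node supports; the function name `ab4n` and the dictionary labels are hypothetical, chosen for illustration only, and are not part of the original analysis.

```python
from statistics import mean


def ab4n(supports):
    """Average bootstrap support (in %) over the 4 protostome nodes.

    Hypothetical helper: assumes AB4N is a simple arithmetic mean.
    """
    if len(supports) != 4:
        raise ValueError("AB4N is defined over exactly 4 nodes")
    return mean(supports)


# Bootstrap supports quoted in the text for the CAT model (fig. 1B).
# The labels are illustrative; only the 4 numbers come from the text.
cat_supports = {
    "Lophotrochozoa": 100,
    "Ecdysozoa": 100,
    "nematodes + tardigrades": 100,
    "platyhelminths within Lophotrochozoa": 65,
}

value = ab4n(list(cat_supports.values()))
print(f"AB4N = {value:.2f}%")
```

Under this assumption the mean is 91.25%, which rounds to the AB4N of 91% reported for 56 species with the CAT model.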

In conclusion, the results from 3 independent approaches have confirmed our hypothesis that the lack of resolution in the animal phylogeny observed by RKC and ourselves was due to a strong nonphylogenetic signal and does not constitute “a positive signature of closely spaced cladogenetic events.” Our results are congruent with many previous studies underlining the prime importance of taxon sampling, either empirically or through simulation (e.g., Wheeler 1992; Lecointre et al. 1993; Adachi and Hasegawa 1995; Hillis 1996; Graybeal 1998; Hedtke et al. 2006). Although the emphasis is often put on the accuracy of phylogenetic inference, the present work demonstrates that its resolving power can also be drastically affected by taxon sampling.

More generally, our opinion is that it is no longer worthwhile to argue about the relative benefits of gene versus taxon sampling (Graybeal 1998; Rosenberg and Kumar 2001; Rokas and Carroll 2005; Hedtke et al. 2006) because progress in sequencing technology will lead to data sets rich in both genes and taxa (Philippe and Telford 2006). Instead, phylogeneticists should put most of their efforts into developing better models of sequence evolution (Lanave et al. 1984; Galtier and Gouy 1995; Tuffley and Steel 1998; Whelan and Goldman 2001; Robinson et al. 2003; Blanquart and Lartillot 2006), which improve both accuracy and resolution, as shown here with the CAT model.

Acknowledgments

This work was supported by operating funds from Genome Québec and Natural Sciences and Engineering Research Council. H.P. is a member of the Program in Evolutionary Biology of the Canadian Institute for Advanced Research, which we thank for interaction support, and is grateful to the Canada Research Chairs Program and the Canadian Foundation for Innovation for salary and equipment support. D.B.
is a postdoctoral researcher of the Fonds National de la Recherche Scientifique (FNRS) at the University of Liège (Belgium) and is gratefully indebted to the FNRS for the financial support of his stay at the University of Montréal. The authors want to thank Nicolas Lartillot, Didier Casane, Béatrice Roure, and Naiara Rodríguez-Ezpeleta for their comments on a previous version of the manuscript, as well as 2 anonymous reviewers for useful suggestions. Nicolas Lartillot is also gratefully acknowledged for his help with the PhyloBayes analyses.

Literature Cited

Adachi J, Hasegawa M. 1995. Phylogeny of whales: dependence of the inference on species sampling. Mol Biol Evol. 12:177–179.
Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA. 1997. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature. 387:489–493.
Blanquart S, Lartillot N. 2006. A Bayesian compound stochastic process for modelling non-stationary and non-homogeneous sequence evolution. Mol Biol Evol. 23:2058–2071.
Delsuc F, Brinkmann H, Chourrout D, Philippe H. 2006. Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature. 439:965–968.
Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361–375.
Felsenstein J. 2004. Inferring phylogenies. Sunderland (MA): Sinauer Associates, Inc.
Galtier N, Gouy M. 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc Natl Acad Sci USA. 92:11317–11321.
Goldstein B, Blaxter M. 2002. Tardigrades. Curr Biol. 12:R475.
Graybeal A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst Biol. 47:9–17.
Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 52:696–704.
Halanych KM. 2004. The new view of animal phylogeny. Annu Rev Ecol Evol Syst. 35:229–256.
Hedtke SM, Townsend TM, Hillis DM. 2006. Resolution of phylogenetic conflict in large data sets by increased taxon sampling. Syst Biol. 55:522–529.
Hendy MD, Penny D. 1989. A framework for the quantitative study of evolutionary trees. Syst Zool. 38:297–309.
Hillis DM. 1996. Inferring complex phylogenies. Nature. 383:130–131.
Ho SY, Jermiin L. 2004. Tracing the decay of the historical signal in biological sequence data. Syst Biol. 53:623–637.
Hordijk W, Gascuel O. 2005. Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood. Bioinformatics. 21:4338–4347.
Jobb G, von Haeseler A, Strimmer K. 2004. TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. BMC Evol Biol. 4:18.
Lanave C, Preparata G, Saccone C, Serio G. 1984. A new method for calculating evolutionary substitution rates. J Mol Evol. 20:86–93.
Lartillot N, Brinkmann H, Philippe H. Forthcoming 2006. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol.
Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 21:1095–1109.
Lecointre G, Philippe H, Van Le HL, Le Guyader H. 1993. Species sampling has a major impact on phylogenetic inference. Mol Phylogenet Evol. 2:205–224.
Marletaz F, Martin E, Perez Y, et al. (12 co-authors). 2006. Chaetognath phylogenomics: a protostome with deuterostome-like development. Curr Biol. 16:R577–R578.
Martin AP, Burg TM. 2002. Perils of paralogy: using HSP70 genes for inferring organismal phylogenies. Syst Biol. 51:570–587.
Matus DQ, Copley RR, Dunn CW, Hejnol A, Eccleston H, Halanych KM, Martindale MQ, Telford MJ. 2006. Broad taxon and gene sampling indicate that chaetognaths are protostomes. Curr Biol. 16:R575–R576.
Olsen G. 1987. Earliest phylogenetic branching: comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harb Symp Quant Biol. LII:825–837.
Philippe H, Delsuc F, Brinkmann H, Lartillot N. 2005. Phylogenomics. Annu Rev Ecol Evol Syst. 36:541–562.
Philippe H, Lartillot N, Brinkmann H. 2005. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol. 22:1246–1253.
Philippe H, Telford MJ. 2006. Large-scale sequencing and the new animal phylogeny. Trends Ecol Evol. 21:614–620.
Philippe H, Zhou Y, Brinkmann H, Rodrigue N, Delsuc F. 2005. Heterotachy and long-branch attraction in phylogenetics. BMC Evol Biol. 5:50.
Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 20:1692–1704.
Rokas A, Carroll SB. 2005. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol Biol Evol. 22:1337–1344.
Rokas A, Kruger D, Carroll SB. 2005. Animal evolution and the molecular signature of radiations compressed in time. Science. 310:1933–1938.
Rosenberg MS, Kumar S. 2001. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc Natl Acad Sci USA. 98:10751–10756.
Tuffley C, Steel M. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math Biosci. 147:63–91.
Wheeler WC. 1992. Extinction, sampling, and molecular phylogenetics. In: Novacek MJ, Wheeler QD, editors. Extinction and phylogeny. New York: Columbia University Press. p. 205–215.
Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18:691–699.
Whelan S, Lio P, Goldman N. 2001. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet. 17:262–272.
Yang Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol. 10:1396–1401.

Scott Edwards, Associate Editor

Accepted September 25, 2006