BIOINFORMATICS Vol. 20 Suppl. 1 2004, pages i116–i121
DOI: 10.1093/bioinformatics/bth902

Phylogenomics and the number of characters required for obtaining an accurate phylogeny of eukaryote model species

Hernán Dopazo, Javier Santoyo and Joaquín Dopazo∗

Bioinformatics Unit. Biotechnology Programme, Centro Nacional de Investigaciones Oncológicas (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain

Received on January 15, 2004; accepted on March 1, 2004

ABSTRACT
Motivation: Through the most extensive phylogenomic analysis carried out to date, the complete genomes of 11 eukaryotic species have been examined in order to find the homologues of more than 25 000 amino acid sequences. These sequences correspond to the exons of more than 3000 genes and were used as presence/absence characters to test one of the most controversial hypotheses concerning animal evolution, namely the Ecdysozoa hypothesis. Distance, maximum parsimony and Bayesian methods of phylogenetic reconstruction were used to test the hypothesis.
Results: The ecdysozoa clade, which groups arthropods and nematodes, was unequivocally rejected in all the consensus trees. The Coelomata clade, grouping arthropods and chordates, was supported with the highest statistical confidence in all the reconstructions. The study of the dependence of the genome tree accuracy on the number of exons used demonstrated that an unexpectedly large number of characters is necessary to obtain robust phylogenies. Previous studies supporting ecdysozoa could not guarantee an accurate phylogeny because the number of characters used was clearly below the minimum required.
Contact: [email protected]

1 INTRODUCTION
Recently completed sequences of several eukaryotic genomes provide an enormous amount of genetic data to address questions surrounding functional genomics and evolutionary biology. A major area of interest focuses on phylogenetics.
Molecular systematics based on gene or protein sequences need not rely on the sampling of a few sequences, or even a few dozen or hundreds of phylogenetic markers. The availability of different complete genomes permits the analysis of genetic data on a whole-genome scale. Phylogenomic studies based on the analysis of complete genomes of eukaryotic species may be of special interest for solving controversies in evolution.

∗To whom correspondence should be addressed.

One of the major controversies surrounding metazoan evolution is the status of the Ecdysozoa clade grouping moulting animals, such as Drosophila melanogaster and Caenorhabditis elegans (Aguinaldo et al., 1997). During the last eight years, two alternative phylogenetic hypotheses concerning metazoan phylogeny have been under discussion (Adoutte et al., 1999). The traditional picture, although there has never been complete agreement on phylogeny and classification, is based on the gradual increase in the complexity of animal body plans. Thus, the evolutionary tree found in the most influential systematics textbooks (Hyman, 1940) displays the basal position of the simplest animals (poriferans: sponges), the subsequent radiation of diploblastic animals (coelenterates: sea anemones, corals, hydras) and the later origin of the triploblastic bilateral animals (acoelomates: platyhelminthes; pseudocoelomates: nematodes; protostomates: arthropods, annelids, molluscs; lophophorates: brachiopods; and deuterostomates: echinoderms, hemichordates, chordates). Given this scenario, animal model organisms like the worm C.elegans belong to the pseudocoelomates; the flies D.melanogaster and Anopheles gambiae belong to the protostomates; and the chordates Ciona intestinalis, Fugu rubripes, Mus musculus and Homo sapiens belong to the deuterostomes.
Indeed, deuterostomes (chordates) and protostomates (arthropods) cluster in a single monophyletic group, coelomata, and the pseudocoelomates (nematodes) are its sister taxon. Consequently, under this traditional evolutionary picture flies are genetically more closely related to humans than to worms (Fig. 1A).

Alternatively, a more recent analysis based on small subunit ribosomal RNA (18S rRNA) sequences has challenged this traditional evolutionary scenario (Aguinaldo et al., 1997). Under this view, coelenterates, acoelomates, pseudocoelomates and lophophorates seem to be artificial systematic groups. Coelomata still remains, though containing different descendants. One of its branches leads to deuterostomates and the other to protostomates. This last group contains two new clades, the Lophotrochozoa and the Ecdysozoa. Although the branching order basically remains unsolved, ecdysozoa groups nematodes and arthropods in a single clade. Under this evolutionary scenario, flies are genetically closer to worms than to humans (Fig. 1B).

Bioinformatics 20(Suppl. 1) © Oxford University Press 2004; all rights reserved.

Fig. 1. Alternative phylogenetic hypotheses concerning the 11 eukaryote model species studied. The traditional phylogenetic hypothesis considers the worm C.elegans as the sister taxon of the coelomates (A). Alternatively, the ecdysozoa hypothesis clusters worms and arthropods in a single moulting clade (B). Systematic conflicts involve the ancestral–descendant relationships of species derived from node 5. Nodes: 1, mammals; 2, vertebrates; 3, chordates; 4, arthropods; 5, coelomates (A tree), ecdysozoa (B tree); 6, bilateral animals; 7, opisthokonts; 8, plantae; and 9, superior eukarya. By default, node 9 has 100% bootstrap support in all the phylogenetic analyses.

Subsequent molecular and morphological studies have been carried out, though the controversy remains unsolved. While the use of different single gene sequences supports the ecdysozoa hypothesis (Aguinaldo et al., 1997; de Rosa et al., 1999; Mallatt and Winchell, 2002; Manuel et al., 2001; Peterson and Eernisse, 2001; Ruiz-Trillo et al., 2002), the analysis of dozens to hundreds of aligned sequences supports the coelomata clade (Blair et al., 2002; Wang et al., 1999; Wolf et al., 2004). The acceptance of the new animal phylogeny with the Ecdysozoa hypothesis imposes a new scheme for understanding the Cambrian explosion (Balavoine and Adoutte, 1998; Morris, 2000) and the origin of metazoan body plans (de Rosa et al., 1999; Carrol et al., 2001), and consequently sets a new phylogenetic framework for the advancement of comparative genomics.

Here, we present strong phylogenomic evidence against the ecdysozoa hypothesis based on the presence/absence of homologous sequences corresponding to the exons of genes of 7 human chromosomes found in 11 genomes of eukaryotic model organisms. To our knowledge, this is the most extensive phylogenomic analysis carried out to date in terms of the number of characters involved. In order to avoid problems derived from the inaccuracy of some of the most recently released proteomes, we have favoured the direct analysis of genomic sequences to check the presence/absence of the genes in the data set. Since exons are scattered across long genomic distances, and direct homology search of the protein in the genomes is difficult with methods such as BLAST (Altschul et al., 1997), we chose to search for individual exons instead of whole proteins. The total number of exons used was 25 676, corresponding to 3042 annotated human genes.

2 SYSTEMS AND METHODS
2.1 Datasets
Complete genome sequences of 11 eukaryotic species, including Plasmodium falciparum (Gardner et al., 2002), Arabidopsis thaliana (Arabidopsis Genome Initiative, 2000), Oryza sativa (Yu et al., 2002), Saccharomyces cerevisiae (Goffeau et al., 1996), C.elegans (C.elegans Sequencing Consortium, 1998), A.gambiae (Holt et al., 2002), D.melanogaster (Adams et al., 2000), C.intestinalis (Dehal et al., 2002), F.rubripes (Aparicio et al., 2002), M.musculus (Mouse Genome Sequencing Consortium, 2002) and H.sapiens (International Human Genome Sequencing Consortium, 2001) were downloaded (Table 1) and formatted to run local BLAST (Altschul et al., 1997). Amino acid sequences corresponding to the exons of all the genes in a randomly chosen sample of human chromosomes (Y, 13, 14, 18, 20, 21 and 22) were obtained from the Ensembl database project (Hubbard et al., 2002). We used the tblastn engine (Altschul et al., 1997), which searches a query amino acid sequence against the six translation frames of the target sequence (here, the complete genomes). The exon sequences of the genes of these seven human chromosomes were used for a similarity search on the complete genome sequences of the eukaryote species listed above. Exons shorter than 22 amino acids in length were discarded from the analysis in order to avoid false positives due to the small size of the query sequence.

Each best hit was filtered by means of a combination of three values (v1, v2, v3): v1, a BLAST E-value threshold; v2, the proportion of the sequence length aligned; and v3, a chromosome context reference, namely the minimum number of exons of the gene matching on the same chromosome. When v3 = 1, all exons pass through the filter; if v3 = 2, at least two exons of the same gene must match on the same chromosome to be included in the data set. A matrix of presence/absence of putative homologous exons was built using the following filtering scheme: v1 ≤ 1e−05, v2 ≥ 75%, v3 = 2.
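The three-value filter can be illustrated with a short sketch. This is not the authors' pipeline: the `Hit` record, its field names and the function `passing_exons` are hypothetical, and the sketch assumes that each exon contributes at most one best tblastn hit per genome and that v2 is measured as aligned length divided by the length of the query exon.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_EXON_LENGTH = 22  # exons shorter than 22 amino acids are discarded


@dataclass(frozen=True)
class Hit:
    """Best tblastn hit of one human exon in one target genome (hypothetical record)."""
    gene: str            # human gene identifier
    exon: str            # exon identifier within the gene
    chromosome: str      # target chromosome or scaffold carrying the hit
    evalue: float        # BLAST E-value of the hit
    aligned_length: int  # number of query residues aligned
    query_length: int    # full length of the query exon (amino acids)


def passing_exons(hits, v1=1e-5, v2=0.75, v3=2):
    """Return the set of (gene, exon) pairs scored as present in one genome."""
    # v1 and v2: keep hits that are significant and cover enough of the exon.
    kept = [
        h for h in hits
        if h.query_length >= MIN_EXON_LENGTH
        and h.evalue <= v1
        and h.aligned_length / h.query_length >= v2
    ]
    # v3: group the surviving exons by (gene, target chromosome) ...
    by_context = defaultdict(set)
    for h in kept:
        by_context[(h.gene, h.chromosome)].add(h.exon)
    # ... and keep only groups holding at least v3 exons of the same gene.
    present = set()
    for (gene, _chromosome), exons in by_context.items():
        if len(exons) >= v3:
            present.update((gene, exon) for exon in exons)
    return present


# Made-up hits, for illustration only.
hits = [
    Hit("G1", "e1", "chr2", 1e-10, 38, 40),
    Hit("G1", "e2", "chr2", 1e-8, 30, 35),
    Hit("G1", "e3", "chrX", 1e-9, 50, 50),   # alone on its chromosome
    Hit("G2", "e1", "chr1", 1e-3, 30, 40),   # too weak for v1 = 1e-05
    Hit("G3", "e1", "chr5", 1e-20, 20, 20),  # query shorter than 22 aa
]
strict = passing_exons(hits)                          # v1 <= 1e-05, v2 >= 75%, v3 = 2
relaxed = passing_exons(hits, v1=1e-3, v2=0.5, v3=1)  # a more permissive setting
```

With these toy hits, the strict scheme keeps only the two G1 exons that co-occur on chr2, while the permissive setting also admits the isolated G1 exon and the weaker G2 hit. Each (gene, exon) pair returned would become a 1 in the presence/absence matrix for that genome.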
The use of two alternative filtering schemes (v1 ≤ 1e−03, v2 ≥ 50%, v3 = 1 and v1 ≤ 1e−03, v2 ≥ 75%, v3 = 2) did not change the results (data not shown). The resulting matrix has a total of 25 676 characters (exons), corresponding to 3042 annotated human genes. These sequences were derived from a genomic sample of 570 Mb (which covers ∼20% of the human genome).

Table 1. Sources of the genomes used in the study

Species          Version/date  Location
H.sapiens        31            http://genome.cse.ucsc.edu/goldenPath/14nov2002/bigZips/
M.musculus       16/10/03      http://genome.cse.ucsc.edu/goldenPath/mmFeb2002/bigZips/
F.rubripes       9.1.1         ftp.ensembl.org/pub/current_fugu/data/fasta/dna/Fugu_rubripes.latestgp.fa
C.intestinalis   1.0           ftp.jgi-psf.org/pub/JGI_data/Ciona/v1.0/ciona.fasta.gz
D.melanogaster   3.0           ftp.fruitfly.org/pub/download/compressed/
A.gambiae        9.1a.1        ftp.ensembl.org/pub/current_mosquito/data/golden_path/
C.elegans        19/12/02      ftp.sanger.ac.uk/pub/C.elegans_sequences/CHROMOSOMES/CURRENT_RELEASE
S.cerevisiae     28/01/03      http://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/genomic_sequence/chromosomes/fasta
A.thaliana       28/01/03      ftp.ncbi.nih.gov/genomes/A_thaliana/
O.sativa         27/05/02      http://btn.genomics.org.cn/rice/download.php?pre=contig_sequence&name=all_sequence
P.falciparum     4.0           http://www.plasmodb.org/restricted/data/P_falciparum/WG/nuc/

These characters are uniformly distributed across the chromosomes of most of the eukaryotic species analysed and, consequently, can be considered a homogeneous sample of their genomes. From a theoretical point of view, this feature has been shown to be critical in supporting bootstrap values (Felsenstein, 1985). Indeed, a previous genome-wide study demonstrated that sampling characters scattered along whole genomes gives better results than sampling collinear blocks when the whole-genome phylogeny is assayed (Cummings et al., 1995).
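Drawing characters scattered across such a matrix, rather than contiguous blocks, amounts to choosing column indices at random. The following is a minimal sketch using only the Python standard library; the function names and the toy matrix are illustrative assumptions, not part of the study. Sampling without replacement gives a smaller matrix of distinct characters, while sampling with replacement gives a bootstrap pseudo-replicate in the sense of Felsenstein (1985).

```python
import random


def draw_columns(n_total, n_chars, rng, replace=False):
    """Pick character (column) indices, with or without replacement."""
    if replace:
        return [rng.randrange(n_total) for _ in range(n_chars)]
    return rng.sample(range(n_total), n_chars)


def take_columns(matrix, columns):
    """Apply the same column choice to every species so characters stay aligned."""
    return {species: [row[c] for c in columns] for species, row in matrix.items()}


rng = random.Random(2004)  # fixed seed so the sketch is repeatable

# Toy presence/absence matrix: 3 species x 10 characters (made-up values).
toy = {
    "H.sapiens":      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "D.melanogaster": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "C.elegans":      [0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
}

subset_cols = draw_columns(10, 4, rng, replace=False)  # scattered subsample
subset = take_columns(toy, subset_cols)
boot_cols = draw_columns(10, 10, rng, replace=True)    # one bootstrap pseudo-replicate
bootstrap = take_columns(toy, boot_cols)
```

The column indices are drawn once and applied to every species, which keeps each sampled character a homologous column across all genomes.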
A reduced matrix was obtained by randomly sampling 19 000 characters without replacement from the whole matrix. This matrix was necessary to run the MrBayes program (Ronquist and Huelsenbeck, 2003) because of its limitations on input data size.

2.2 Phylogenetic methods
Maximum parsimony (MP) trees were constructed for all the datasets using the PAUP∗ (Swofford, 2003) default option of the branch and bound search. Unweighted parsimony was used for all the analyses. Cladograms were evaluated using 5000 and 10 000 bootstrap replicates, obtained using the R package (Ihaka and Gentleman, 1996). Distance trees were estimated by neighbour-joining and least-squares methods using PHYLIP (Felsenstein, 2002). Maximum parsimony and distance trees were summarized using the 50% majority-consensus rule.

Bayesian analysis computing the posterior probability of trees was carried out in MrBayes (Ronquist and Huelsenbeck, 2003). A total of 10 000 phylogenetic trees were sampled, one every 100 generations, using Markov chain Monte Carlo (MCMC) with 4 chains. The first 1000 trees in the sample were removed to avoid including trees sampled before convergence of the Markov chains. Posterior probabilities of trees were averaged over the remaining 9000 samples under an unordered standard binary model. Branch lengths were saved and represented on the 50% majority-consensus tree.

For each of 8 different sizes, ranging from 100 to 20 000 characters, 100 random matrices were constructed by sampling without replacement from the original matrix. Each of these matrices was in turn resampled 100 times with replacement in order to obtain at least 10 000 cladograms whose topological distances to the whole-genome tree could be compared individually. Tree distances and MP trees were calculated using the treedist and pars programs of PHYLIP (Felsenstein, 2002), respectively.

3 RESULTS
3.1 Genomes’ phylogeny strongly supports the coelomata tree
Distance, MP and Bayesian analyses conclusively support the coelomata tree.
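For orientation, the coelomata tree and the competing ecdysozoa tree (Fig. 1) differ only around node 5. The sketch below counts how far apart the two topologies are under the symmetric-difference (Robinson-Foulds) metric. Two things are assumptions here: the nested-tuple encoding of the trees, which is read off the node list in the Fig. 1 caption, and the choice of metric, since the text names PHYLIP's treedist program but not the distance option used. The species abbreviations and function names are illustrative.

```python
def leaves_and_clades(tree):
    """Return (leaves under this node, set of all clades in the subtree)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, clades = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = leaves_and_clades(child)
        leaves |= child_leaves
        clades |= child_clades
    clades.add(leaves)
    return leaves, clades


def splits(tree, outgroup):
    """Non-trivial bipartitions, each written as the side lacking the outgroup."""
    all_leaves, clades = leaves_and_clades(tree)
    result = set()
    for clade in clades:
        side = clade if outgroup not in clade else all_leaves - clade
        if 1 < len(side) < len(all_leaves) - 1:  # drop trivial splits
            result.add(frozenset(side))
    return result


def symmetric_difference_distance(tree_a, tree_b, outgroup="Pf"):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(tree_a, outgroup) ^ splits(tree_b, outgroup))


# Shared subtrees (Hs human, Mm mouse, Fr Fugu, Ci Ciona, Dm fly, Ag mosquito).
chordates = ((("Hs", "Mm"), "Fr"), "Ci")
arthropods = ("Dm", "Ag")
plants = ("At", "Os")

# Ce C.elegans, Sc yeast, Pf Plasmodium (outgroup).
coelomata_tree = (((((chordates, arthropods), "Ce"), "Sc"), plants), "Pf")
ecdysozoa_tree = ((((chordates, (arthropods, "Ce")), "Sc"), plants), "Pf")

distance = symmetric_difference_distance(coelomata_tree, ecdysozoa_tree)
print(distance)  # prints 2
```

Under these assumptions each tree has eight non-trivial internal branches, and the two trees disagree on exactly one branch each (chordates plus arthropods versus arthropods plus nematode), giving a distance of 2. This matches the two tree distance units quoted in Section 3.2.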
Figure 2 shows the phylogenetic reconstruction of the genome tree and the statistical confidence of the nodes using the different phylogenetic methods. All the nodes are supported with the highest statistical confidence, independent of the phylogenetic technique. Those supporting the ecdysozoa hypothesis have invoked the long branch attraction effect to explain the position of C.elegans as the sister taxon of the coelomates. The branch length of C.elegans, however, was consistently shorter than those of the arthropods or the basal chordate C.intestinalis (data not shown). The basal position of C.elegans in all the genome trees therefore does not derive from a topological artefact caused by a highly divergent nematode.

Fig. 2. Phylogenetic reconstruction of the genome trees using the homology matrices. Bootstrap proportions were obtained from the 50% majority-rule consensus tree using 1000 replicates for distance methods (NJ, LS) and 10 000 using MP. Posterior probabilities were averaged on 9000 samples in the Bayesian analysis (BI).

3.2 Effect of the number of characters on genome tree accuracy
In order to assess the relationship between the number of characters and the probability of obtaining the whole-genome tree, random samples of different sizes derived from the homology matrix were drawn. Figure 3 shows the dependence of the mean tree distance to the whole-genome tree on the number of characters used in its reconstruction. Confident topologies with mean tree distance units below one can be obtained when more than 10 000 characters are used. Since the ecdysozoa topology differs from the coelomata topology by two tree distance units, we could expect to find topologies compatible with ecdysozoa in datasets including <5000 characters. Although we exhaustively searched for the ecdysozoa group on trees derived from matrices built from 1000 to 5000 characters, this clade was never obtained as a topological solution from the randomly sampled smaller datasets. This result demonstrates that a large number of characters is required to ensure with sufficient accuracy that the tree is identical to the whole-genome tree. Moreover, the figure points out the strong phylogenetic signal of the matrix. Indeed, we could still recover the coelomata tree with sufficient accuracy even after removing more than 5000 characters.

Fig. 3. Dependence of the average distance to the true tree on the number of sampled characters used in the reconstruction. Each point represents the mean tree distance of at least 10 000 cladograms to the highest supported topology of the genome trees found. Bars correspond to standard error deviation. Insert shows the mean standard error deviation versus characters. Samples below 5000 correspond to 100, 500, 1000 and 2500 characters.

4 DISCUSSION
It is widely accepted that phylogenetic reconstruction rests on the premise that the sampled data are representative of the whole genome from which they are drawn. Sequences are generally chosen on the basis of factors such as the functional characteristics of a region, their historical use in systematic studies and other considerations generally unrelated to their ability to reconstruct whole-genome relationships (Cummings et al., 1995). The monophyly of the ecdysozoa group was actually proposed on the basis of the analysis of different single gene sequences (Aguinaldo et al., 1997; de Rosa et al., 1999; Mallatt and Winchell, 2002; McHugh, 1997; Manuel et al., 2001; Ruiz-Trillo et al., 2002; Peterson and Eernisse, 2001). Indeed, the total number of characters favouring the hypothesis ranges from only one rare genomic event (the triplication of the β-thymosin scanned on uncompleted genomes) to ∼4000 sites when using SSU+LSU rRNA sequences (Mallatt and Winchell, 2002).
Considering this last exhaustive study favouring the hypothesis, only the 18S rRNA gene (∼1500 sites) showed phylogenetic signal supporting the clade, with 81% bootstrap confidence (Mallatt and Winchell, 2002). This result is comparable with the original study of the same gene, which used ∼1100 characters and gave at best 77% confidence (Aguinaldo et al., 1997). The later re-examination of the original data set (Aguinaldo et al., 1997) revealed important weaknesses in the phylogenetic analysis, for instance, the strict support for the clade rested on only two binary characters (Wägele et al., 1999). Moreover, the same study criticized the reliability of the original morphological synapomorphies. For instance, the moulting processes are not homologous: while arthropods shed cuticles made of chitin, nematode ecdysis involves collagen proteins.

Apart from these criticisms, the accepted reliability of 18S rRNA as a phylogenetic marker has been questioned. Some authors have emphasized the extreme nucleotide composition bias among taxa (Abouheif et al., 1998; Hasegawa and Hashimoto, 1993), the differences in substitution rates causing long branch attraction (Abouheif et al., 1998; Carmean and Crespi, 1995), and the absence of enough informative positions for a robust reconstruction of the metazoan phylogeny (Phylippe et al., 1994). Indeed, although secondary structure-based alignments can reduce uncertainties in rRNA positional homology, the multiple alignment problem of these sequences cannot easily be solved without ambiguities (Notredame et al., 1997). DNA evolution models that treat base-paired sites as independent characters are unsuitable for modelling RNA-coding genes; consequently, corrections for compensatory substitutions occurring in different parts of the molecule must be taken into account (Dixon and Hillis, 1993; Jow et al., 2002).
The methodology used in this paper circumvents several difficulties generally associated with phylogenetics and genome tree methods (Wolf et al., 2002). For instance, as we search for homologues of exon sequences in whole genome sequences, problems associated with the incompleteness of the derived proteomes, or of the protein homologue databases (Tatusov et al., 2001) normally used in other genome tree reconstructions, are avoided. The same holds for the multiple alignment problems that lead to positional homology errors and for the selection of an evolutionary model to correct for multiple hits. Our approach does not depend on these procedures. The implicit evolutionary assumption used in the analysis is the equal probability of exon gain or loss on the branches. Since the data are based on the presence or absence of exons, long branch attraction effects associated with different rates of sequence evolution among species are avoided.

Since paralogous sequences of the human genome were not specifically filtered, we chose the worst scenario to test the coelomata hypothesis, considering that all the human paralogous sequences [5% of the human genome, see Eichler (2001)] correspond to coding sequences. If this were the case, approximately 1000 paralogous exons would be included in the datasets. We have shown that, even after removing 5000 characters, we could still recover the coelomata tree with sufficient accuracy. Thus, considering the worst scenario of human duplications affecting the dataset, our results are robust to the inclusion of paralogous errors.

In the approach presented here, there is a potential problem of bias that could favour the Coelomata group. The data used are exons found in different species as homologous to human exons. Nevertheless, the Ecdysozoa grouping would be favoured by the absence of such characters. Whether this bias outweighs the benefit of avoiding the long branch attraction effect is a matter for further analysis.
We have presented results based on the largest set of sequences published to date, which support the coelomata clade with 100% statistical confidence. Previous phylogenetic studies in which extensive regions of eukaryotic genomes were sampled drew similar conclusions. In particular, the analyses of several dozen (Hausdorf, 2000; Wang et al., 1999) to more than 100 concatenated nuclear proteins (Blair et al., 2002; Wolf et al., 2004) support the chordate plus arthropod clade. A recent paper emphasizes the benefits of selecting a high number of concatenated characters in order to resolve conflicting systematic hypotheses supported by single gene methods (Rokas et al., 2003). Its authors pointed out that bootstrap values of 100% on five internal branches had never previously been shown in systematic studies. We have demonstrated that, using the most stringent condition of homology, all the phylogenetic reconstruction methods yield 100% support on all eight internal branches of the coelomata tree.

Comparative biology relies on the accuracy of its phylogenetic hypotheses. We have shown how phylogenetic reconstruction based on whole genome sequences has the potential to resolve one of the most controversial hypotheses in animal evolution: the reliability of the ecdysozoa clade. If sampling a huge number of characters is the way to provide the highest statistical confidence on conflicting evolutionary hypotheses, the selection of organisms for future genome projects must be systematically and phylogenetically considered.

ACKNOWLEDGEMENTS
We are indebted to all the members of the Bioinformatics unit (CNIO). Special thanks to A. Cucchi for useful comments on the manuscript and Amanda Wren for the revision of the English. This work was partially supported by grant PI020919 from the FIS. H.D. is supported by a fellowship from Fundación Carolina.

REFERENCES
Abouheif,E., Zardoya,R. and Meyer,A. (1998) Limitations of metazoan 18S rRNA sequence data: implications for reconstructing a phylogeny of the animal kingdom and inferring the reality of the cambrian explosion. J. Mol. Evol., 47, 394–405.
Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F. et al. (2000) The genome sequence of Drosophila melanogaster. Science, 287, 2185–2195.
Adoutte,A., Balavoine,G., Lartillot,N. and de Rosa,R. (1999) Animal evolution: the end of the intermediate taxa? Trends Genet., 15, 104–108.
Aguinaldo,A.M., Turbeville,J.M., Linford,L.S., Rivera,M.C., Garey,J.R., Raff,R.A. and Lake,J.A. (1997) Evidence for a clade of nematodes, arthropods and other moulting animals. Nature, 387, 489–493.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Aparicio,S., Chapman,J., Stupka,E., Putnam,N., Chia,J.M., Dehal,P., Christoffels,A., Rash,S., Hoon,S., Smit,A. et al. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310.
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Balavoine,G. and Adoutte,A. (1998) One or three cambrian radiations? Science, 280, 397–398.
Blair,J.E., Ikeo,K., Gojobori,T. and Hedges,B. (2002) The evolutionary position of nematodes. BMC Evol. Biol., 2, 7.
C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C.elegans: a platform for investigating biology. Science, 282, 2012–2018.
Carmean,D. and Crespi,B.J. (1995) Do long branches attract flies? Nature, 373, 666.
Carrol,S.B., Grenier,J.K. and Weatherbee,S.D. (2001) From DNA to Diversity. Molecular Genetics and the Evolution of Animal Design. Blackwell Science, MA.
Cummings,M.P., Otto,S. and Wakeley,J. (1995) Sampling properties of DNA sequences data in phylogenetic analysis. Mol. Biol. Evol., 12, 814–822.
de Rosa,R., Grenier,J.K., Andreeva,T., Cook,C.E., Adoutte,A., Akam,M., Carroll,S.B. and Balavoine,G. (1999) Hox genes in brachiopods and priapulids and protostome evolution. Nature, 399, 772–776.
Dehal,P., Satou,Y., Campbell,R.K., Chapman,J., Degnan,B., De Tomaso,A., Davidson,B., Di Gregorio,A., Gelpke,M., Goodstein,D.M. et al. (2002) The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science, 298, 2157–2167.
Dixon,M. and Hillis,D. (1993) Ribosomal RNA secondary structure: compensatory mutations and implications for phylogenetic analysis. Mol. Biol. Evol., 10, 256–267.
Eichler,E.E. (2001) Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet., 17, 661–669.
Gardner,M.J., Hall,N., Fung,E., White,O., Berriman,M., Hyman,R.W., Carlton,J.M., Pain,A., Nelson,K.E., Bowman,S. et al. (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature, 419, 498–511.
Felsenstein,J. (1985) Confidence limits on phylogenies: an approach using bootstrap. Evolution, 39, 783–791.
Felsenstein,J. (2002) PHYLIP: Phylogeny Inference Package (Version 3.6a3). Distributed by the author. Department of Genome Sciences, University of Washington, Seattle, WA.
Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B., Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M. et al. (1996) Life with 6000 genes. Science, 274, 563–567.
Hasegawa,M. and Hashimoto,T. (1993) Ribosomal RNA trees misleading? Nature, 361, 23.
Holt,R.A., Subramanian,G.M., Halpern,A., Sutton,G.G., Charlab,R., Nusskern,D.R., Wincker,P., Clark,A.G., Ribeiro,J.M., Wides,R. et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science, 298, 129–149.
Hyman,L.H. (1940) The Invertebrates. Protozoa through Ctenophora, Vol. 1. McGraw-Hill, NY.
Hausdorf,B. (2000) Early evolution of the bilateralia. Syst. Biol., 49, 130–142.
Ihaka,R. and Gentleman,R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299–314.
International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30, 38–41.
Jow,H., Hudelot,C., Rattray,M. and Higgs,P.G. (2002) Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol., 19, 1591–1601.
Mallatt,J. and Winchell,C.J. (2002) Testing the new animal phylogeny: first use of large-subunit and small-subunit rRNA gene sequences to classify the protostomes. Mol. Biol. Evol., 19, 289–301.
Manuel,M., Kruse,M., Müller,W. and Le Parco,Y. (2000) The comparison of beta-thymosin homologues among metazoa supports an arthropod–nematode clade. J. Mol. Evol., 51, 378–381.
McHugh,D. (1997) Molecular evidence that echiurans and pogonophorans are derived annelids. Proc. Natl Acad. Sci. USA, 94, 8006–8009.
Morris,C. (2000) The Cambrian “explosion”: slow-fuse or megatonnage. Proc. Natl Acad. Sci. USA, 97, 4426–4429.
Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.
Notredame,C., O’Brien,E.A. and Higgins,D.G. (1997) RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res., 25, 4570–4580.
Peterson,K.J. and Eernisse,D.J. (2001) Animal phylogeny and the ancestry of bilaterians: inferences from morphology and 18S rDNA gene sequences. Evol. Dev., 3, 170–205.
Phylippe,H., Chenuill,A. and Adoutte,A. (1994) Can the cambrian explosion be inferred through molecular phylogeny? Development (Suppl.) 15–25.
Rokas,A., Williams,B., King,N. and Carroll,S.B. (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature, 425, 798–804.
Ronquist,F. and Huelsenbeck,J. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572–1574.
Ruiz-Trillo,I., Paps,J., Loukota,M., Ribera,C., Jondelius,U., Baguna,J. and Riutort,M. (2002) A phylogenetic analysis of myosin heavy chain type II sequences corroborates the Acoela and Nemertodermatida are basal bilaterians. Proc. Natl Acad. Sci. USA, 99, 11246–11251.
Swofford,D.L. (2003) PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). (Version 4.0b10). Sinauer Associates, Sunderland, MA.
Tatusov,R.L., Natale,D.A., Garkavtsev,I.V., Tatusova,T.A., Shankavaram,U.T., Rao,B.S., Kiryletin,B., Galperin,M.Y., Federova,N.D. and Koonin,E.V. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29, 22–28.
Wägele,J., Lockhart,P. and Misof,B. (1999) The Ecdysozoa: artifact or monophylum? J. Zool. Syst. Evol. Res., 37, 211–223.
Wang,D.Y.C., Kumar,S. and Hedges,B. (1999) Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. R. Soc. Lond., B 266, 163–171.
Wolf,Y.I., Rogozin,I.B., Grishin,N.V. and Koonin,E.V. (2002) Genome trees and the tree of life. Trends Genet., 18, 472–479.
Wolf,Y.I., Rogozin,I.B. and Koonin,E.V. (2004) Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res., 14, 29–36.
Yu,J., Hu,S., Wang,J., Wong,G.K., Li,S., Liu,B., Deng,Y., Dai,L., Zhou,Y., Zhang,X. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science, 296, 79–92.