Vol. 20 Suppl. 1 2004, pages i116–i121
DOI: 10.1093/bioinformatics/bth902
BIOINFORMATICS
Phylogenomics and the number of characters
required for obtaining an accurate phylogeny of
eukaryote model species
Hernán Dopazo, Javier Santoyo and Joaquín Dopazo∗
Bioinformatics Unit. Biotechnology Programme, Centro Nacional de Investigaciones
Oncológicas (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
Received on January 15, 2004; accepted on March 1, 2004
ABSTRACT
Motivation: Through the most extensive phylogenomic analysis carried out to date, complete genomes of 11 eukaryotic
species have been examined in order to find the homologues
of more than 25 000 amino acid sequences. These sequences
correspond to the exons of more than 3000 genes and were
used as presence/absence characters to test one of the most
controversial hypotheses concerning animal evolution, namely
the Ecdysozoa hypothesis. Distance, maximum parsimony
and Bayesian methods of phylogenetic reconstruction were
used to test the hypothesis.
Results: The reliability of the ecdysozoa, grouping arthropods
and nematodes in a single clade, was unequivocally rejected
in all the consensus trees. The Coelomata clade, grouping
arthropods and chordates, was supported with the highest statistical confidence in all the reconstructions. The study of the
dependence of the genome tree accuracy on the number of
exons used demonstrated that an unexpectedly large number
of characters is necessary to obtain robust phylogenies. Previous studies supporting ecdysozoa could not guarantee an
accurate phylogeny because the number of characters used
was clearly below the minimum required.
Contact: [email protected]
1 INTRODUCTION
Recently completed sequences of several eukaryotic genomes
provide an enormous amount of genetic data to address
questions surrounding functional genomics and evolutionary
biology. A major area of interest focuses on phylogenetics.
Molecular systematics based on gene or protein sequences no
longer needs to rely on the sampling of a few sequences, or even
a few dozen or hundreds of phylogenetic markers. The availability of different complete genomes permits the analysis of
genetic data on a whole-genome scale. Phylogenomic studies
based on the analysis of complete genomes of eukaryotic species may be of special interest for resolving controversies
in evolution. One of the major controversies surrounding
metazoan evolution is the status of the Ecdysozoa clade grouping moulting animals, such as Drosophila melanogaster and
Caenorhabditis elegans (Aguinaldo et al., 1997).
∗ To whom correspondence should be addressed.
During the last eight years, two alternative phylogenetic
hypotheses concerning metazoan phylogeny have been under
discussion (Adoutte et al., 1999). The traditional picture,
although there has never been complete agreement on phylogeny and classification, is based on the gradual increase in
the complexity of animal body plans. Thus, the evolutionary tree found in the most influential systematics
textbooks (Hyman, 1940) displays the basal position of the
simplest animals (poriferans: sponges), the subsequent radiation of diploblastic animals (coelenterates: sea anemones,
corals, hydras) and the later origin of the triploblastic
bilateral animals (acoelomates: platyhelminthes; pseudocoelomates: nematodes; protostomes: arthropods, annelids,
molluscs; lophophorates: brachiopods; and deuterostomes:
echinoderms, hemichordates, chordates). Given this scenario,
animal model organisms such as the worm C.elegans belong to
the pseudocoelomates; the flies D.melanogaster and Anopheles
gambiae belong to the protostomes; and the chordates Ciona
intestinalis, Fugu rubripes, Mus musculus and Homo sapiens
belong to the deuterostomes. Indeed, deuterostomes (chordates) and
protostomes (arthropods) cluster in a single monophyletic
group, the coelomata, and the pseudocoelomates (nematodes) are its
sister taxon. Consequently, under this traditional evolutionary
picture flies are genetically more closely related to humans
than to worms (Fig. 1A).
Alternatively, a more recent analysis based on small
subunit ribosomal RNA (18S rRNA) sequences has challenged this traditional evolutionary scenario (Aguinaldo et al.,
1997). Under this view, coelenterates, acoelomates, pseudocoelomates
and lophophorates seem to be artificial systematic groups.
Coelomata still remains, though it contains different descendants. One of its branches leads to the deuterostomes and the
other to the protostomes. This last group contains two new
clades, the Lophotrochozoa and the Ecdysozoa. Although
the branching order basically remains unresolved, ecdysozoa
groups nematodes and arthropods in a single clade. Under this
evolutionary scenario, flies are genetically closer to worms
than to humans (Fig. 1B).
Fig. 1. Alternative phylogenetic hypotheses concerning the
11 eukaryote model species studied. The traditional phylogenetic hypothesis considers the worm C.elegans to be the sister taxon of the coelomates (arthropods and chordates)
(A). Alternatively, the ecdysozoa hypothesis clusters worms and arthropods in a single moulting clade (B). Systematic conflicts involve
the ancestral–descendant relationships of species derived from node
5. Nodes: 1, mammals; 2, vertebrates; 3, chordates; 4, arthropods;
5, coelomates (A tree), ecdysozoa (B tree); 6, bilateral animals; 7,
opisthokonts; 8, plantae; and 9, superior eukarya. By default, node
9 has 100% bootstrap support in all the phylogenetic analyses.
Bioinformatics 20(Suppl. 1) © Oxford University Press 2004; all rights reserved.
Subsequent molecular and morphological studies have been
carried out, but the controversy remains unresolved. While
the use of different single gene sequences supports the ecdysozoa hypothesis (Aguinaldo et al., 1997; de Rosa et al., 1999;
Mallatt and Winchell, 2002; Manuel et al., 2000; Peterson
and Eernisse, 2001; Ruiz-Trillo et al., 2002), the analysis
of dozens to hundreds of aligned sequences supports the
coelomata clade (Blair et al., 2002; Wang et al., 1999; Wolf
et al., 2004).
The acceptance of the new animal phylogeny with the
Ecdysozoa hypothesis imposes a new scheme for understanding the Cambrian explosion (Balavoine and Adoutte, 1998;
Morris, 2000) and the origin of metazoan body plans (de Rosa
et al., 1999; Carroll et al., 2001), and consequently sets a
new phylogenetic framework for advances in comparative genomics. Here, we present strong phylogenomic
evidence against the ecdysozoa hypothesis based on the
presence/absence of homologous sequences corresponding to
the exons of genes of 7 human chromosomes found in 11 genomes of eukaryotic model organisms. To our knowledge, this
is the most extensive phylogenomic analysis carried out to
date in terms of the number of characters involved.
2 SYSTEMS AND METHODS
2.1 Datasets
In order to avoid problems derived from the inaccuracy of
some of the most recently released proteomes, we have
favoured the direct analysis of genomic sequences to check the
presence/absence of the genes in the data set. Since exons are
scattered across long genomic distances, direct homology searching of a whole protein against the genomes is difficult with
methods such as BLAST (Altschul et al., 1997); we therefore chose to
search for individual exons instead of whole proteins. The
total number of exons used was 25 676 corresponding to
3042 annotated human genes. Complete genome sequences
of 11 eukaryotic species, including Plasmodium falciparum
(Gardner et al., 2002), Arabidopsis thaliana (Arabidopsis
Genome Initiative, 2000), Oryza sativa (Yu et al., 2002),
Saccharomyces cerevisiae (Goffeau et al., 1996), C.elegans
(C.elegans Sequencing Consortium, 1998), A.gambiae (Holt
et al., 2002), D.melanogaster (Adams et al., 2000),
C.intestinalis (Dehal et al., 2002), F.rubripes (Aparicio et al.,
2002), M.musculus (Mouse Genome Sequencing Consortium, 2002) and H.sapiens (International Human Genome
Sequencing Consortium, 2001) were downloaded (Table 1)
and formatted to run local BLAST (Altschul et al., 1997).
Amino acid sequences corresponding to all the exons of the genes
in a randomly chosen sample of human chromosomes (comprising Y, 13, 14, 18, 20, 21 and 22) were obtained from
the Ensembl database project (Hubbard et al., 2002). We used
the tblastn engine (Altschul et al., 1997), which searches a query
amino acid sequence against the six translation frames of the target
sequence (the complete genomes). Exon sequences of the genes of
these seven human chromosomes were used for a similarity search
on the complete genome sequences of the eukaryote
species mentioned above. Exons shorter than 22 amino acids in length were
discarded from the analysis in order to avoid false positives
due to the small size of the query sequence. Each best hit
was filtered by means of a combination of three different
values (v1, v2, v3), where v1 is a BLAST E-value threshold; v2 is the minimum
proportion of the sequence length that is aligned; and v3 is a chromosome
context reference: the minimum number of exons of the gene
matching on the same chromosome. When v3 = 1, all exons
pass through the filter; if v3 = 2, at least two exons of the same
gene must match on the same chromosome to be included in
the data set.
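The three-value filter described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the `Hit` record, its field names and the function `filter_hits` are hypothetical, and the sketch assumes that the best tblastn hit of each exon has already been parsed into such records for one target genome.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Best hit of one human exon in one target genome (hypothetical record)."""
    gene: str                # human gene the exon belongs to
    exon: str                # exon identifier
    chromosome: str          # target chromosome (or scaffold) carrying the hit
    evalue: float            # BLAST E-value of the hit
    aligned_fraction: float  # aligned length / exon length, in [0, 1]


def filter_hits(hits, v1=1e-5, v2=0.75, v3=2):
    """Return the exon ids that pass the (v1, v2, v3) filter for one genome."""
    # v1 and v2: E-value and aligned-length thresholds.
    kept = [h for h in hits if h.evalue <= v1 and h.aligned_fraction >= v2]
    # v3: count, per gene, how many distinct exons hit each target chromosome.
    exons_per_context = defaultdict(set)
    for h in kept:
        exons_per_context[(h.gene, h.chromosome)].add(h.exon)
    return {
        h.exon
        for h in kept
        if len(exons_per_context[(h.gene, h.chromosome)]) >= v3
    }
```

With v3 = 1 the chromosome-context test is always satisfied, so only the E-value and aligned-length thresholds apply, which matches the behaviour described in the text.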
A matrix of presence/absence of putative homologous exons
was built using the following filtering scheme: v1 ≤
1e−05, v2 ≥ 75%, v3 = 2. The use of two other alternative filtering schemes (v1 ≤ 1e−03, v2 ≥ 50%, v3 = 1
and v1 ≤ 1e−03, v2 ≥ 75%, v3 = 2) did not change the
results (data not shown). The resulting matrix has a total
of 25 676 characters (exons), corresponding to 3042
annotated human genes. These sequences were derived from
a genomic sample of 570 Mb (which covers ∼20% of the
human genome). These characters are uniformly distributed
across the chromosomes of most of the eukaryotic species
analysed and, consequently, can be considered a homogeneous sample of their genomes. From a theoretical point of
view, this feature has been shown to be critical in supporting bootstrap values (Felsenstein, 1985). Indeed, a previous
genome-wide study demonstrated that sampling characters scattered along whole genomes gives better results
than sampling collinear blocks when the whole-genome
phylogeny is sought (Cummings et al., 1995).
Table 1. Sources of the genomes used in the study
Species          Version/date   Location
H.sapiens        31             http://genome.cse.ucsc.edu/goldenPath/14nov2002/bigZips/
M.musculus       16/10/03       http://genome.cse.ucsc.edu/goldenPath/mmFeb2002/bigZips/
F.rubripes       9.1.1          ftp.ensembl.org/pub/current_fugu/data/fasta/dna/Fugu_rubripes.latestgp.fa
C.intestinalis   1.0            ftp.jgi-psf.org/pub/JGI_data/Ciona/v1.0/ciona.fasta.gz
D.melanogaster   3.0            ftp.fruitfly.org/pub/download/compressed/
A.gambiae        9.1a.1         ftp.ensembl.org/pub/current_mosquito/data/golden_path/
C.elegans        19/12/02       ftp.sanger.ac.uk/pub/C.elegans_sequences/CHROMOSOMES/CURRENT_RELEASE
S.cerevisiae     28/01/03       http://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/genomic_sequence/chromosomes/fasta
A.thaliana       28/01/03       ftp.ncbi.nih.gov/genomes/A_thaliana/
O.sativa         27/05/02       http://btn.genomics.org.cn/rice/download.php?pre=contig_sequence&name=all_sequence
P.falciparum     4.0            http://www.plasmodb.org/restricted/data/P_falciparum/WG/nuc/
A reduced matrix was obtained by randomly sampling
19 000 characters without replacement from the whole matrix.
This matrix was necessary to run the MrBayes program (Ronquist and
Huelsenbeck, 2003), because of its limitations in
the input data size.
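As an illustration of how a presence/absence matrix and its reduced version might be assembled, the following Python sketch encodes each genome as a string of 0/1 characters and draws columns without replacement. The function names and the dictionary layout are assumptions made for this example and do not come from the original analysis.

```python
import random


def build_matrix(exon_ids, exons_found):
    """Encode each genome as a 0/1 string: 1 if a homologue of the exon was found."""
    return {
        genome: "".join("1" if exon in found else "0" for exon in exon_ids)
        for genome, found in exons_found.items()
    }


def sample_characters(matrix, n_chars, seed=None):
    """Draw n_chars characters (columns) without replacement, keeping column order."""
    n_total = len(next(iter(matrix.values())))
    rng = random.Random(seed)
    columns = sorted(rng.sample(range(n_total), n_chars))
    return {
        genome: "".join(row[c] for c in columns)
        for genome, row in matrix.items()
    }
```

Under these assumptions, the reduced matrix in the text would correspond to a call such as `sample_characters(matrix, 19000)` on the full 25 676-character matrix.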
2.2 Phylogenetic methods
Maximum parsimony (MP) trees were constructed for all the
datasets using the PAUP∗ (Swofford, 2003) default options
of the branch-and-bound search. Unweighted parsimony
was used for all the analyses. Cladograms were evaluated using 5000 and 10 000 bootstrap replicates, obtained
using the R package (Ihaka and Gentleman, 1996). Distance
trees were estimated by neighbour-joining and least-squares
methods using PHYLIP (Felsenstein, 2002). Maximum parsimony and distance trees were summarized using the 50%
majority-consensus rule.
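The two generic steps named above, bootstrap resampling of characters and the 50% majority-consensus rule, can be sketched in Python as follows. This is only an illustration of the standard techniques, not the R or PAUP∗ procedure actually used; trees are represented simply as collections of clades (frozensets of taxon names), and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_replicate(matrix, rng):
    """Resample characters (columns) with replacement, keeping the matrix width."""
    n = len(next(iter(matrix.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {taxon: "".join(row[c] for c in cols) for taxon, row in matrix.items()}


def majority_rule_support(trees, threshold=0.5):
    """Return the clades present in more than `threshold` of the trees.

    Each tree is a collection of clades (frozensets of taxon names); the
    returned dict maps each retained clade to its frequency among the trees.
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    n = len(trees)
    return {clade: k / n for clade, k in counts.items() if k / n > threshold}
```

In a full analysis, each bootstrap replicate would be passed to a tree-building program, and the clade frequencies returned by `majority_rule_support` would play the role of bootstrap proportions.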
Bayesian analysis computing the posterior probability of
trees was carried out in MrBayes (Ronquist and Huelsenbeck,
2003). In total, 10 000 phylogenetic trees were sampled, one
every 100 generations, using Markov chain Monte Carlo
(MCMC) with 4 chains. The first 1000 trees in the sample
were removed to avoid including trees sampled before convergence of the Markov chains. Posterior probabilities of trees
were averaged over the remaining 9000 samples under an unordered standard
binary model. Branch lengths were saved and represented on
the 50% majority-consensus tree.
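The sampling scheme above implies a chain of roughly one million generations, of which 9000 sampled trees are retained. The following Python sketch makes this bookkeeping explicit and shows how a clade's posterior probability can be estimated as its frequency among the retained trees. It does not reproduce MrBayes itself; the function and constant names are illustrative assumptions.

```python
def posterior_clade_probability(sampled_trees, clade, burn_in):
    """Fraction of post-burn-in sampled trees that contain `clade`.

    Each sampled tree is a collection of clades (frozensets of taxon names).
    """
    kept = sampled_trees[burn_in:]
    return sum(clade in tree for tree in kept) / len(kept)


SAMPLE_EVERY = 100   # generations between saved trees
N_SAMPLED = 10_000   # trees saved in total
BURN_IN = 1_000      # initial trees discarded

generations = SAMPLE_EVERY * N_SAMPLED  # approximate chain length implied by the text
retained = N_SAMPLED - BURN_IN          # trees used for posterior probabilities
```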
For each of 8 different sizes ranging
from 100 to 20 000 characters, 100 random matrices were constructed by sampling without
replacement from the original matrix. Each of these matrices
was in turn resampled 100 times with replacement in order to obtain
at least 10 000 cladograms per size, whose topological distances to
the whole-genome tree were computed individually. Tree distances and
MP trees were calculated using the treedist and pars programs
of PHYLIP (Felsenstein, 2002), respectively.
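To make the notion of tree distance concrete, the sketch below computes the symmetric-difference (Robinson-Foulds) distance between two trees, which we assume here to be the measure reported by treedist. The nested-tuple topologies are transcribed from the node list in the caption of Fig. 1 and are therefore an assumption about the figure; the function names are hypothetical.

```python
def clades(tree):
    """Return (leaf set, set of internal clades) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def splits(tree):
    """Non-trivial bipartitions of the unrooted tree, stored as the side
    that does not contain a fixed reference taxon."""
    all_leaves, found = clades(tree)
    ref = min(all_leaves)
    result = set()
    for clade in found:
        side = all_leaves - clade if ref in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def symmetric_difference(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


ARTHROPODS = ("D.melanogaster", "A.gambiae")
CHORDATES = ("C.intestinalis", ("F.rubripes", ("M.musculus", "H.sapiens")))
PLANTS = ("A.thaliana", "O.sativa")


def with_bilateria(bilateria):
    """Attach a bilaterian subtree to the shared backbone of both hypotheses."""
    return ("P.falciparum", (PLANTS, ("S.cerevisiae", bilateria)))


coelomata = with_bilateria(("C.elegans", (ARTHROPODS, CHORDATES)))
ecdysozoa = with_bilateria((CHORDATES, ("C.elegans", ARTHROPODS)))
distance = symmetric_difference(coelomata, ecdysozoa)
```

Under this encoding the two hypotheses differ in a single internal branch each (arthropods plus chordates versus arthropods plus nematode), so `distance` evaluates to 2, consistent with the two tree distance units mentioned in the Results.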
3 RESULTS
3.1 Genome phylogeny strongly supports the coelomata tree
Distance, MP and Bayesian analyses conclusively support the
coelomata tree. Figure 2 shows the phylogenetic reconstruction of the genome tree and the statistical confidence of nodes
using the distinct phylogenetic methods. All the nodes are supported with the highest statistical confidence, independent of
the phylogenetic technique.
Those supporting the ecdysozoa clade hypothesis have
invoked the long branch attraction effect to explain
the position of C.elegans as sister taxon of the coelomates. The
branch length of C.elegans, however, was consistently shorter
than those of the arthropods or the basal chordate C.intestinalis
(data not shown). The basal position of C.elegans in all the
genome trees therefore does not derive from a topological effect caused
by a highly divergent nematode.
3.2 Effect of the number of characters on genome tree accuracy
In order to assess the relationship between the number of characters and the probability of obtaining the whole-genome tree,
random samples of different sizes derived from the homology
matrix were drawn. Figure 3 shows the dependence of the
mean tree distance to the whole-genome tree on the number
of characters used in its reconstruction. Confident topologies
with mean tree distance units below one can be obtained when
more than 10 000 characters are used.
Since the ecdysozoa topology differs from the coelomata
by two tree distance units, we could expect to find topologies
compatible with ecdysozoa in datasets including <5000 characters. Although we exhaustively searched for the ecdysozoa
group on trees derived from matrices built from 1000 to 5000
characters, this clade was never obtained as a topological solution from the randomly sampled smaller datasets.
This result demonstrates that a large number of characters
is required to ensure with sufficient accuracy that the tree
is identical to the whole-genome tree. Moreover, the figure
points out the strong phylogenetic signal of the matrix. Indeed,
we could still recover the coelomata tree with sufficient
accuracy even after removing more than 5000 characters.
Fig. 2. Phylogenetic reconstruction of the genome trees using the
homology matrices. Bootstrap proportions were obtained from the
50% majority-rule consensus tree using 1000 replicates for distance
methods (NJ, LS) and 10 000 using MP. Posterior probabilities were
averaged on 9000 samples in the Bayesian analysis (BI).
Fig. 3. Dependence of the average distance to the true tree on the
number of sampled characters used in the reconstruction. Each point
represents the mean tree distance of at least 10 000 cladograms to
the highest supported topology of the genome trees found. Bars correspond to standard error deviation. Inset shows the mean standard
error deviation versus characters. Samples below 5000 correspond
to 100, 500, 1000 and 2500 characters.
4 DISCUSSION
It is widely accepted that phylogenetic reconstruction rests
on the premise that sampled data are representative of the
whole genome from which they are drawn. Sequences are
generally chosen on the basis of factors such as the functional
characteristics of a region, their historical use in systematic
studies and other considerations generally not related to their
ability to reconstruct whole-genome relationships (Cummings
et al., 1995).
The monophyly of the ecdysozoa group was actually proposed on the basis of the analysis of different single gene
sequences (Aguinaldo et al., 1997; de Rosa et al., 1999;
Mallatt and Winchell, 2002; McHugh, 1997; Manuel et al.,
2000; Ruiz-Trillo et al., 2002; Peterson and Eernisse, 2001).
Indeed, the total number of characters favouring the hypothesis ranges from only one rare genomic event (the triplication of the β-thymosin, scanned on incomplete genomes) to
∼4000 sites when using SSU+LSU rRNA sequences (Mallatt
and Winchell, 2002). Even in this last exhaustive study
favouring the hypothesis, only the 18S rRNA gene (∼1500
sites) showed phylogenetic signal supporting the clade, with
81% bootstrap confidence (Mallatt and Winchell, 2002).
This result is comparable with the original study of the same
gene using ∼1100 characters, which gave at best 77% confidence
(Aguinaldo et al., 1997).
The subsequent re-examination of the original data set
(Aguinaldo et al., 1997) revealed important weaknesses
in the phylogenetic analysis, for instance, the strict support
for the clade with only two binary characters (Wägele et al.,
1999). Moreover, the same study criticized the reliability
of the original morphological synapomorphies. For instance,
the moulting processes are not homologous: while arthropods shed cuticles made up of chitin, the nematodes’
ecdysis involves collagen proteins. Apart from these criticisms, the accepted reliability of 18S rRNA as a phylogenetic
marker has been questioned. Some authors emphasized the
extreme nucleotide composition bias among taxa (Abouheif
et al., 1998; Hasegawa and Hashimoto, 1993), the differences
in substitution rates causing long branch attraction (Abouheif
et al., 1998; Carmean and Crespi, 1995), and the absence of
enough informative positions for a robust reconstruction of the
metazoan phylogeny (Philippe et al., 1994). Indeed, although
secondary structure-based alignments can reduce uncertainties in rRNA positional homology, the multiple alignment
problem of these sequences cannot easily be solved without
ambiguities (Notredame et al., 1997). DNA evolution models
that treat base-paired sites as independent characters are
unsuitable for modelling RNA-coding genes; consequently,
corrections for compensatory substitutions occurring in different parts of the molecule must be taken into account (Dixon
and Hillis, 1993; Jow et al., 2002).
The methodology used in this paper circumvents several difficulties generally associated with phylogenetics and
genome tree methods (Wolf et al., 2002). For instance, as
we search for homologues of exon sequences in whole-genome sequences, problems associated with the incompleteness
of the derived proteomes, or of protein homology databases (Tatusov et al., 2001), normally used in other genome
tree reconstructions, are avoided. The same occurs with the
multiple alignment problems leading to positional homology
errors and with the selection of the evolutionary model to correct
for multiple hits. Our approach does not depend on these procedures. The implicit evolutionary assumption used in the
analysis is the equal probability of exon gain or loss on the
branches. Since the data are based on the presence or absence of
exons, long branch attraction effects associated with different rates of sequence evolution among species are avoided.
Since paralogous sequences of the human genome were not
specifically filtered, we chose the worst-case scenario to test the
coelomata hypothesis, considering that all the human paralogous sequences [5% of the human genome, see Eichler
(2001)] correspond to coding sequences. If this were the case,
approximately 1000 paralogous exons would be included in
the datasets. We have shown that even after removing 5000
characters we could still recover the coelomata tree with sufficient accuracy. Thus, even considering the worst-case scenario of human
duplications affecting the dataset, our results are robust to
the inclusion of paralogous errors. In the approach presented
here, there is a potential problem of bias that could favour the
Coelomata group. The data used are exons found in different species as homologues of human exons. Nevertheless, the
Ecdysozoa grouping would be favoured by the absence of such
characters. Whether this bias outweighs the benefit of avoiding
the long branch attraction effect is a matter for further analysis.
We have presented the results based on the largest set of
sequences published to date that support the coelomata clade
with 100% statistical confidence. Previous phylogenetic
studies in which extensive regions of eukaryotic genomes
were sampled drew similar conclusions. In particular, the analyses of several dozen (Hausdorf, 2000; Wang et al., 1999)
to more than 100 concatenated nuclear proteins (Blair et al.,
2002; Wolf et al., 2004) support the chordate plus arthropod
clade. A recent paper emphasizes the benefits of selecting a
high number of concatenated characters in order to resolve
conflicting systematic hypotheses supported by single gene
methods (Rokas et al., 2003). The authors pointed out that bootstrap values of 100% on five internal branches
had never previously been shown in systematic studies. We
have demonstrated that, using the most stringent condition of
homology, all the phylogenetic reconstruction methods yield
100% support on all eight internal branches of the
coelomata tree.
Comparative biology relies on the accuracy of its phylogenetic
hypotheses. We have shown how phylogenetic reconstruction based on whole-genome sequences has the potential
to resolve one of the most controversial hypotheses in animal
evolution: the reliability of the ecdysozoa clade. If sampling
a huge number of characters is the way to provide
the highest statistical confidence on conflicting evolutionary
hypotheses, the careful selection of organisms for future
genome projects must be systematically and phylogenetically
considered.
ACKNOWLEDGEMENTS
We are indebted to all the members of the Bioinformatics
unit (CNIO). Special thanks to A. Cucchi for useful comments on the manuscript and Amanda Wren for the revision
of the English. This work was partially supported by grant
PI020919 from the FIS. H.D. is supported by a fellowship
from Fundación Carolina.
REFERENCES
Abouheif,E., Zardoya,R. and Meyer,A. (1998) Limitations of metazoan 18S rRNA sequence data: implications for reconstructing a
phylogeny of the animal kingdom and inferring the reality of the
cambrian explosion. J. Mol. Evol., 47, 394–405.
Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D.,
Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F.
et al. (2000) The genome sequence of Drosophila melanogaster.
Science, 287, 2185–2195.
Adoutte,A., Balavoine,G., Lartillot,N. and de Rosa,R. (1999) Animal
evolution: the end of the intermediate taxa? Trends Genet., 15,
104–108.
Aguinaldo,A.M., Turbeville,J.M., Linford,L.S., Rivera,M.C.,
Garey,J.R., Raff,R.A., and Lake,J.A. (1997) Evidence for a clade
of nematodes, arthropods and other moulting animals. Nature,
387, 489–493.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z.,
Miller,W., and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res., 25, 3389–3402.
Aparicio,S., Chapman,J., Stupka,E., Putnam,N., Chia,J.M.,
Dehal,P., Christoffels,A., Rash,S., Hoon,S., Smit,A., et al. (2002)
Whole-genome shotgun assembly and analysis of the genome of
Fugu rubripes. Science, 297, 1301–1310.
Arabidopsis Genome Initiative (2000) Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. Nature, 408,
796–815.
Balavoine,G. and Adoutte,A. (1998) One or three cambrian radiations? Science, 280, 397–398.
Blair,J.E., Ikeo,K., Gojobori,T. and Hedges,B. (2002) The evolutionary position of nematodes. BMC Evol. Biol., 2, 7.
C.elegans Sequencing Consortium (1998) Genome sequence of the
nematode C.elegans: a platform for investigating biology. Science,
282, 2012–2018.
Carmean,D. and Crespi,B.J. (1995) Do long branches attract flies?
Nature, 373, 666.
Carroll,S.B., Grenier,J.K. and Weatherbee,S.D. (2001) From DNA
to Diversity. Molecular Genetics and the Evolution of Animal
Design. Blackwell Science, MA.
Cummings,M.P., Otto,S. and Wakeley,J. (1995) Sampling properties
of DNA sequence data in phylogenetic analysis. Mol. Biol. Evol.,
12, 814–822.
de Rosa,R., Grenier,J.K., Andreeva,T., Cook,C.E., Adoutte,A.,
Akam,M., Carroll,S.B. and Balavoine,G. (1999) Hox genes in brachiopods and priapulids and protostome evolution. Nature, 399,
772–776.
Dehal,P., Satou,Y., Campbell,R.K., Chapman,J., Degnan,B., De
Tomaso,A., Davidson,B., Di Gregorio,A., Gelpke,M., Goodstein,D.M. et al. (2002) The draft genome of Ciona intestinalis:
insights into chordate and vertebrate origins. Science, 298,
2157–2167.
Dixon,M. and Hillis,D. (1993) Ribosomal RNA secondary structure: compensatory mutations and implications for phylogenetic
analysis. Mol. Biol. Evol., 10, 256–267.
Eichler,E.E. (2001) Recent duplication, domain accretion and the
dynamic mutation of the human genome. Trends Genet., 17,
661–669.
Gardner,M.J., Hall,N., Fung,E., White,O., Berriman,M.,
Hyman,R.W., Carlton,J.M., Pain,A., Nelson,K.E., Bowman,S.
et al. (2002) Genome sequence of the human malaria parasite
Plasmodium falciparum. Nature, 419, 498–511.
Felsenstein,J. (1985) Confidence limits on phylogenies: an approach
using the bootstrap. Evolution, 39, 783–791.
Felsenstein,J. (2002) PHYLIP: Phylogeny Inference Package
(Version 3.6a3). Distributed by the author. Department of
Genome Sciences, University of Washington, Seattle, WA.
Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.
et al. (1996) Life with 6000 genes. Science, 274, 563–567.
Hasegawa,M. and Hashimoto,T. (1993) Ribosomal RNA trees
misleading? Nature, 361, 23.
Holt,R.A., Subramanian,G.M., Halpern,A., Sutton,G.G.,
Charlab,R., Nusskern,D.R., Wincker,P., Clark,A.G.,
Ribeiro,J.M., Wides,R. et al. (2002) The genome sequence of the
malaria mosquito Anopheles gambiae. Science, 298, 129–149.
Hyman,L.H. (1940) The Invertebrates. Protozoa through
Ctenophora, Vol. 1. McGraw-Hill, NY.
Hausdorf,B. (2000) Early evolution of the Bilateria. Syst. Biol.,
49, 130–142.
Ihaka,R. and Gentleman,R. (1996) R: a language for data analysis
and graphics. J. Comput. Graph. Stat., 5, 299–314.
International Human Genome Sequencing Consortium (2001) Initial
sequencing and analysis of the human genome. Nature, 409, 860–921.
Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L.,
Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl
genome database project. Nucleic Acids Res., 30, 38–41.
Jow,H., Hudelot,C., Rattray,M. and Higgs,P.G. (2002) Bayesian
phylogenetics using an RNA substitution model applied
to early mammalian evolution. Mol. Biol. Evol., 19,
1591–1601.
Mallatt,J. and Winchell,C.J. (2002) Testing the new animal phylogeny: first use of large-subunit and small-subunit rRNA gene
sequences to classify the protostomes. Mol. Biol. Evol., 19,
289–301.
Manuel,M., Kruse,M., Müller,W. and Le Parco,Y. (2000) The comparison of beta-thymosin homologues among metazoa supports
an arthropod–nematode clade. J. Mol. Evol., 51, 378–381.
McHugh,D. (1997) Molecular evidence that echiurans and pogonophorans are derived annelids. Proc. Natl Acad. Sci. USA, 94,
8006–8009.
Morris,C. (2000) The Cambrian “explosion”: slow-fuse or
megatonnage. Proc. Natl Acad. Sci. USA, 97, 4426–4429.
Mouse Genome Sequencing Consortium (2002) Initial sequencing
and comparative analysis of the mouse genome. Nature, 420,
520–562.
Notredame,C., O’Brien,E.A. and Higgins,D.G. (1997) RAGA:
RNA sequence alignment by genetic algorithm. Nucleic Acids
Res., 25, 4570–4580.
Peterson,K.J. and Eernisse,D.J. (2001) Animal phylogeny and the
ancestry of bilaterians: inferences from morphology and 18S
rDNA gene sequences. Evol. Dev., 3, 170–205.
Philippe,H., Chenuil,A. and Adoutte,A. (1994) Can the Cambrian explosion be inferred through molecular phylogeny?
Development (Suppl.) 15–25.
Rokas,A., Williams,B., King,N. and Carroll,S.B. (2003) Genome-scale approaches to resolving incongruence in molecular
phylogenies. Nature, 425, 798–804.
Ronquist,F. and Huelsenbeck,J. (2003) MrBayes 3: Bayesian
phylogenetic inference under mixed models. Bioinformatics, 19,
1572–1574.
Ruiz-Trillo,I., Paps,J., Loukota,M., Ribera,C., Jondelius,U.,
Baguna,J. and Riutort,M. (2002) A phylogenetic analysis of
myosin heavy chain type II sequences corroborates that Acoela
and Nemertodermatida are basal bilaterians. Proc. Natl. Acad.
Sci. USA, 99, 11246–11251.
Swofford,D.L. (2003) PAUP*: Phylogenetic Analysis Using
Parsimony (*and Other Methods). (Version 4.0b10). Sinauer
Associates, Sunderland, MA.
Tatusov,R.L., Natale,D.A., Garkavtsev,I.V., Tatusova,T.A.,
Shankavaram,U.T., Rao,B.S., Kiryletin,B., Galperin,M.Y.,
Federova,N.D. and Koonin,E.V. (2001) The COG database: new
developments in phylogenetic classification of proteins from
complete genomes. Nucleic Acids Res., 29, 22–28.
Wägele,J., Lockhart,P. and Misof,B. (1999) The Ecdysozoa: artifact
or monophylum? J. Zool. Syst. Evol. Res., 37, 211–223.
Wang,D.Y.C., Kumar,S. and Hedges,B. (1999) Divergence time
estimates for the early history of animal phyla and the origin
of plants, animals and fungi. Proc. R. Soc. Lond., B 266,
163–171.
Wolf,Y.I., Rogozin,I.B., Grishin,N.V. and Koonin,E.V. (2002)
Genome trees and the tree of life. Trends Genet., 18, 472–479.
Wolf,Y.I., Rogozin,I.B. and Koonin,E.V. (2004) Coelomata and not
Ecdysozoa: evidence from genome-wide phylogenetic analysis.
Genome Res., 14, 29–36.
Yu,J., Hu,S., Wang,J., Wong,G.K., Li,S., Liu,B., Deng,Y., Dai,L.,
Zhou,Y., Zhang,X. et al. (2002) A draft sequence of the rice
genome (Oryza sativa L. ssp. indica). Science, 296, 79–92.