Syst. Biol. 55(3):426-440, 2006
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI:10.1080/10635150500541722
Supertree Bootstrapping Methods for Assessing Phylogenetic Variation among Genes
in Genome-Scale Data Sets
J. GORDON BURLEIGH, AMY C. DRISKELL, AND MICHAEL J. SANDERSON
Section of Evolution and Ecology, University of California, Davis, CA 95616, USA;
E-mail: [email protected] (J.G.B.)
Abstract.—Nonparametric bootstrapping methods may be useful for assessing confidence in a supertree inference. We examined the performance of two supertree bootstrapping methods on four published data sets that each include sequence
data from more than 100 genes. In "input tree bootstrapping," input gene trees are sampled with replacement and then
combined in replicate supertree analyses; in "stratified bootstrapping," trees from each gene's separate (conventional) bootstrap tree set are sampled randomly with replacement and then combined. Generally, support values from both supertree
bootstrap methods were similar or slightly lower than corresponding bootstrap values from a total evidence, or supermatrix, analysis. Yet, supertree bootstrap support also exceeded supermatrix bootstrap support for a number of clades. There
was little overall difference in support scores between the input tree and stratified bootstrapping methods. Results from
supertree bootstrapping methods, when compared to results from corresponding supermatrix bootstrapping, may provide
insights into patterns of variation among genes in genome-scale data sets. [Nonparametric bootstrapping; phylogenetics;
supermatrix; supertree; supertree bootstrapping.]
Large data sets derived from whole genomes or from
sequence databases are becoming more commonplace
in phylogenetic studies. Numerous phylogenetic analyses have used data sets that include sequences from
more than 100 loci (e.g., Daubin et al., 2001; Bapteste
et al., 2002; Blair et al., 2002, 2005; Lee, 2002; Lerat et al.,
2003; Rokas et al., 2003; Driskell et al., 2004; Philippe
et al., 2004, 2005; Wolf et al., 2004; Dopazo and Dopazo,
2005; Philip et al., 2005). Perhaps the greatest challenge
in phylogenetic analysis of data sets this large is heterogeneity among loci. Questions regarding the treatment of heterogeneous loci are not new (see Bull et al.,
1993; de Queiroz et al., 1995; Huelsenbeck et al., 1996;
Cunningham, 1997), but these questions are especially
relevant when analyzing genome-scale data sets that
often exhibit extensive gene-specific phylogenetic variation (e.g., Rokas et al., 2003; Driskell et al., 2004). Interestingly, incongruence among genes appears rampant
in genome-scale data sets whether the total evidence
bootstrap results are uniformly high (e.g., Rokas et al.,
2003) or reveal mixed levels of support (e.g., Bapteste
et al., 2002; Driskell et al., 2004). The presence of such
variation among genes may greatly affect or even mislead results of a total evidence phylogenetic analysis. For example, total evidence phylogenetic inference
may be particularly influenced by loci with longer sequences or faster rates of evolution (Seo et al., 2005).
Thus, when analyzing genome-scale data sets it is critical
to understand not only the variation among characters
but also the patterns of variation among genes—and
to assess how this variation may affect phylogenetic
inference.
Supertree methods (Bininda-Emonds, 2004) may be
useful in genome-scale phylogenetic analyses. Supertree
methods combine input trees with partially overlapping sets of taxa to make comprehensive phylogenetic
hypotheses incorporating all taxa present in the input. They are increasingly popular for combining data
from disparate sources and building large phylogenies (e.g., Sanderson et al., 1998; Bininda-Emonds et al.,
2002; Bininda-Emonds, 2004). Recently, supertree methods also have been used to infer phylogenies from large,
multigene data sets (Daubin et al., 2001; Cotton and Page,
2002; Philip et al., 2005). Whereas a total evidence, or supermatrix, approach, which concatenates sequences from
all genes, uses nucleotide or amino acid sites as the basic unit of data, supertree approaches treat each input
tree as the basic unit of data. If each input tree is built
from sequences of a single gene in a genome-scale data
set, supertree methods may be more sensitive to variation in the phylogenetic inference among genes than
supermatrix methods. Similarly, supertree methods also
may minimize the effects in phylogenetic inference of
anomalous loci, such as loci with histories of horizontal transfer (Escobar-Paramo et al., 2004). Furthermore,
in genome-scale data sets, gene sequences may be missing due to gene gains or losses across taxa (e.g., Daubin et al.,
2001) or simply because they have not been sampled (e.g.,
Driskell et al., 2004; Philippe et al., 2004; Yan et al., 2005).
Because supertree methods explicitly build phylogenetic
hypotheses from data sets without complete taxonomic
overlap, genome-scale phylogenetics is a natural application of supertree approaches.
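The contrast between the two data models can be made concrete with a minimal Python sketch. It assumes toy per-gene alignments stored as dictionaries from taxon name to aligned sequence; the function names `concatenate` and `fraction_missing` are hypothetical and are not taken from any software used in this study. A supermatrix pads each unsampled gene with missing-data symbols, whereas a supertree analysis would simply build each gene tree on the taxa that gene actually contains.

```python
def concatenate(alignments, taxa, missing="?"):
    """Build a toy supermatrix by concatenating per-gene alignments.

    alignments: list of dicts mapping taxon -> aligned sequence
                (all sequences within one gene share the same length).
    taxa:       list of all taxon names in the combined data set.
    A taxon absent from a gene is padded with the missing-data symbol.
    """
    rows = {t: [] for t in taxa}
    for gene in alignments:
        length = len(next(iter(gene.values())))
        for t in taxa:
            rows[t].append(gene.get(t, missing * length))
    return {t: "".join(parts) for t, parts in rows.items()}


def fraction_missing(alignments, taxa):
    """Fraction of taxon-by-gene cells with no sequence at all."""
    total = len(alignments) * len(taxa)
    present = sum(1 for gene in alignments for t in taxa if t in gene)
    return 1 - present / total


# Illustrative data only: two genes, the second lacking taxon "B".
genes = [
    {"A": "ACGT", "B": "ACGA", "C": "ACGG"},
    {"A": "TT", "C": "TA"},
]
supermatrix = concatenate(genes, ["A", "B", "C"])
```

The same cell-counting logic gives the proportion of missing gene sequences in a sparse supermatrix: if each taxon has on average k of G genes, the missing fraction is 1 - k/G.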
This study explores the utility of two methods of
supertree bootstrapping for assessing clade support in
supertrees and the sampling variance associated with
individual genes in genome-scale data sets. Though numerous supertree methods have been developed (see
Bininda-Emonds, 2004), methods for generating support
values for supertree inferences have received less attention (but, see Purvis, 1995; Cotton and Page, 2002;
Bininda-Emonds, 2003; Creevey et al., 2004; Ronquist
et al., 2004; Philip et al., 2005). We used four published genome-scale data sets to examine two previously proposed nonparametric bootstrapping methods
for assessing clade support in supertrees. We also
compare the differences in clade support measures
from supertree bootstrapping with those from conventional supermatrix bootstrapping in these same data
sets.
METHODS
Data Sets
We analyzed four published data sets, each containing sequences from more than 100 genes and varying widely in the amount of missing data. Two of the supermatrices have extreme levels of missing data (Driskell et al., 2004). Each taxon in these two supermatrices has amino acid sequences from at least 10 genes, and each gene in the supermatrix has sequences from at least four taxa (Driskell et al., 2004). The metazoan supermatrix is composed of sequences from 1131 genes for 70 taxa. Each taxon has sequences from an average of 95 of the 1131 genes. Therefore, approximately 92% of the total gene sequences are missing from the supermatrix (Driskell et al., 2004). The green plant supermatrix contains sequences from 254 genes for 69 green plant taxa. Each taxon has sequences from an average of 40 genes, and thus, approximately 84% of the gene sequences are missing (Driskell et al., 2004). The bilaterian supermatrix of Philippe et al. (2005) is missing a smaller proportion of the data (~35%) and is comprised of amino acid sequences from 49 taxonomic units and 146 genes. Many of the taxon labels in the bilaterian supermatrix as published by Philippe et al. (2005) represent chimeric taxa, in which sequences from a larger clade have been combined into a single taxonomic unit. In some cases we used taxon labels that differ from those of Philippe et al. (2005). Generally, the new taxon names represent higher taxonomic units that incorporate all taxa that compose each chimeric sequence. However, the sequences were not modified. A table that translates between the taxon labels used by Philippe et al. (2005) and those used in this study is available at http://systematicbiology.org. The fourth data set, the yeast supermatrix of Rokas et al. (2003), contains DNA sequences from 106 genes in eight yeast taxa with no missing gene sequences. All data sets used in this study are available at http://systematicbiology.org.
Gene Tree Construction
Phylogenetic trees were obtained using maximum parsimony (MP) for each gene in each supermatrix, and these gene trees were used as input trees for supertree analyses. All phylogenetic searches were implemented in PAUP* (Swofford, 2003). The metazoan, green plant, and bilaterian MP gene trees were constructed using a heuristic search strategy: each search used TBR branch swapping on four random addition sequence replicates and saved a maximum of 2500 trees per replicate. Branch and bound MP searches (Hendy and Penny, 1982) were used to identify MP gene topologies from the yeast data set.
Supertree Construction and Bootstrapping
All supertree analyses used the matrix representation with parsimony (MRP) supertree method (Baum, 1992; Ragan, 1992; Baum and Ragan, 2004), the most widely used supertree method (Bininda-Emonds, 2004). For each data set, the input trees consisted of a single unrooted MP tree from each gene. The supertrees were later rooted using the outgroup taxa specified in the original supermatrix studies (see Rokas et al., 2003; Driskell et al., 2004; Philippe et al., 2005). If a gene tree had multiple optimal MP trees, the first MP tree in the tree file was used as the input tree. We also ran the analyses using a strict consensus of all MP trees, and the results were very similar (not shown). Input trees were transformed into their binary matrix representation (e.g., Farris et al., 1970; Baum, 1992; Ragan, 1992), called the MRP matrix, using the r8s program (Sanderson, 2003).
Two variations of supertree bootstrapping were performed, which we term input tree bootstrapping and stratified bootstrapping. Input tree bootstrapping samples with replacement from among the original MP input trees to create new replicate data sets containing the same number of input trees as in the original data set (e.g., Daubin et al., 2001; Creevey and McInerney, 2004; Creevey et al., 2004; Philip et al., 2005). This version of input tree bootstrapping is analogous to the conventional method of bootstrapping used in phylogenetics (Felsenstein, 1985). However, the sampled characters are input trees, not single columns in a character matrix, and the resulting replicate matrices likely will not contain all input trees. The nonparametric bootstrap assumes that the resampled characters, in this case input gene trees, are independent and identically distributed (Felsenstein, 1985). Whereas this assumption seems at least as valid for different genes as for neighboring nucleotides, it is more problematic if input trees are built from shared data sets. For the green plant, metazoan, and bilaterian data sets, 100 replicates of input tree bootstrapping were performed, each consisting of four random addition sequence replicates with a maximum of 6 hours of TBR branch swapping and saving up to 10,000 trees. We chose this search heuristic because it was computationally feasible and identical to that used in the original MP supermatrix bootstrapping of the green plant and metazoan supermatrices by Driskell et al. (2004). We performed an identical MP supermatrix bootstrap analysis on the bilaterian data set of Philippe et al. (2005). Thus, the comparison of supertree and supermatrix bootstrap scores should not be affected by different heuristics. Longer tree searches may have yielded higher bootstrap scores (e.g., DeBry and Olmstead, 2000; Mort et al., 2000; Sanderson and Wojciechowski, 2000); however, they would likely not affect the overall comparison of supertree and supermatrix bootstrap scores. For the yeast data set, 100 replicates of input tree bootstrapping were performed, each using a branch and bound MP search. An MP supermatrix bootstrap analysis also was performed using PAUP* with branch and bound searches on 100 replicates.
Stratified bootstrapping was previously proposed as a method to incorporate uncertainty within individual input trees into a supertree analysis (Cotton and Page, 2002; Page, 2004). First, each gene comprising the original supermatrix is bootstrapped. Confidence in each gene tree also was assessed based on 100 nonparametric bootstrap replicates (Felsenstein, 1985). For each gene sequence in the metazoan, green plant, and bilaterian supermatrices,
each bootstrap replicate used a maximum of 1 hour of
TBR branch swapping from a simple addition sequence
starting tree and saved up to 1000 trees. For the yeast supermatrix, 100 bootstrap MP searches were completed
using branch and bound searches. For each of the stratified bootstrap replicates, a tree from a single bootstrap
replicate for each gene is selected randomly and used as
an input tree in a subsequent supertree search. Thus, in
every replicate of the stratified bootstrap, each gene is
represented with a tree selected from one of its bootstrap
replicates. We performed 100 replicates of stratified bootstrapping on all four data sets, using the same MRP tree
heuristic search strategies as used in the input tree bootstrap analysis described above. We note that a similar
approach to stratified bootstrapping would be to weight
individual data sets based on their bootstrap support
(e.g., Ronquist, 1996; Bininda-Emonds and Sanderson,
2001) and resample these weighted data sets. Although
these approaches both account for clade support in input
trees, the stratified bootstrapping approach may provide
a more complete picture of the tree support because it incorporates all clades that receive any support even if they
are not represented in a majority rule bootstrap tree. Alternatively, bootstrapping the weighted trees would be
more feasible when the original data sets are unavailable.
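The two resampling schemes described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: trees are treated as opaque objects (for example, Newick strings), the function names `input_tree_bootstrap`, `stratified_bootstrap`, and `make_replicates` are hypothetical, and the subsequent MRP coding and parsimony search of each replicate, which were run in r8s and PAUP* in this study, are not shown.

```python
import random


def input_tree_bootstrap(input_trees, rng):
    """Sample the gene trees themselves with replacement.

    The replicate has as many trees as the original set, so some genes
    appear more than once and others not at all.
    """
    n = len(input_trees)
    return [input_trees[rng.randrange(n)] for _ in range(n)]


def stratified_bootstrap(gene_bootstrap_trees, rng):
    """Draw one tree per gene from that gene's own bootstrap tree set.

    gene_bootstrap_trees: list with one entry per gene; each entry is the
    list of trees from that gene's conventional bootstrap replicates.
    Every gene is represented exactly once in the replicate.
    """
    return [rng.choice(trees) for trees in gene_bootstrap_trees]


def make_replicates(sampler, data, n_reps=100, seed=0):
    """Generate n_reps replicate sets of input trees with a fixed seed."""
    rng = random.Random(seed)
    return [sampler(data, rng) for _ in range(n_reps)]


# Illustrative placeholders standing in for real gene trees.
mp_trees = ["geneA_tree", "geneB_tree", "geneC_tree"]
reps = make_replicates(input_tree_bootstrap, mp_trees, n_reps=100)
```

The design difference is visible in the return values: input tree bootstrapping varies which genes are present in a replicate, whereas stratified bootstrapping keeps the gene set fixed and varies the topology contributed by each gene.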
RESULTS
Supertree Bootstrap
In the metazoan, green plant, and bilaterian data sets,
support values from both methods of supertree bootstrapping are often low. Although in each analysis some
clades have high support, the majority of clades are
poorly supported. Overall, the scores from input tree
bootstrapping are similar to those from stratified bootstrapping in these three data sets. In the yeast data set,
the supertree bootstrap support from both methods is at
or near 100% for all clades in the optimal MP tree.
In the metazoan data set, Mammalia and Amniota
have at least 91% support for both supertree bootstrapping methods (Fig. 1a, b). Vertebrata and Metazoa have
83% and 84% support, respectively, in input tree bootstrap and 97% and 98% support with stratified bootstrapping. Support for Tetrapoda is never higher than
57% (Fig. 1a, b). The primates and numerous lower level
clades are supported with near 100% bootstrap values
in the supertree bootstraps, but many other clades show
very low supertree bootstrap support (Fig. 1a, b).
In the green plant data set, the land plants, seed plants,
and flowering plants all have at least 92% support from
supertree bootstrapping (Fig. 2a, b). Vascular plants have
82% and 51% support from input tree bootstrapping and
stratified bootstrapping, respectively (Fig. 2a, b). Many
angiosperm clades, including monocots and eudicots,
have very low support from supertree bootstrapping
(Fig. 2a, b).
Though many clades representing major taxonomic
groups are strongly supported in the bilaterian supertree bootstrap analyses, the relationships among
these large clades generally have low bootstrap values
(Fig. 3a, b). For example, Deuterostomia, Nematoda,
Platyhelminthes, and Arthropoda are supported by at
least 86% of the replicates in the input tree bootstrap
(Fig. 3a) and 96% in stratified bootstrap (Fig. 3b). The
fungi have 92% and 90% values for input tree bootstrapping and stratified bootstrapping, respectively, and
other fungi clades all have at least 89% bootstrap scores
(Fig. 3a, b).
Nearly all bootstrap values from both supertree bootstrapping methods are 100% for all clades in the MP yeast
topology (Fig. 4). The only exception is the single clade
of Saccharomyces kudriavzevii, S. mikatae, S. paradoxus, and
S. cerevisiae, which has a stratified bootstrap value of 96%
(Fig. 4).
Comparing Supertree and Supermatrix Bootstrap Scores
We compared bootstrap support from both supertree
bootstrapping methods to supermatrix bootstrapping for
all clades that received at least 5% support from at
least one of the bootstrapping methods being compared.
The bootstrap values from the supermatrix analyses generally exceed those from the supertree analyses in the
green plant data set, though there also are a number of
clades in which the supertree bootstrap values exceed the
corresponding supermatrix bootstrap values (Fig. 5a, b).
There is less difference in support values resulting from
the supermatrix and both methods of supertree bootstrapping for the metazoan, bilaterian, and yeast data
sets (Figs. 4a, b).
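The kind of clade-by-clade comparison summarized in this section can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes each analysis is summarized as a dictionary from clade (a frozenset of taxon names) to bootstrap percentage, that a clade absent from one analysis counts as 0% there, and that differences are signed (first method minus second). The function name `compare_support` and the example values are hypothetical.

```python
from statistics import mean, median


def compare_support(support_a, support_b, threshold=5.0):
    """Summarize differences in clade support between two bootstrap analyses.

    support_a, support_b: dicts mapping clade -> bootstrap percentage.
    Only clades reaching `threshold` in at least one analysis are compared.
    """
    diffs = []
    for clade in set(support_a) | set(support_b):
        a = support_a.get(clade, 0.0)  # assumption: unobserved clade = 0%
        b = support_b.get(clade, 0.0)
        if max(a, b) >= threshold:
            diffs.append(a - b)
    if not diffs:
        raise ValueError("no clade reaches the support threshold")
    return {
        "n_clades": len(diffs),
        "mean_diff": mean(diffs),
        "median_diff": median(diffs),
        "a_higher": sum(d > 0 for d in diffs),
        "b_higher": sum(d < 0 for d in diffs),
        "max_a_over_b": max(diffs),
        "max_b_over_a": -min(diffs),
    }


# Illustrative values only, not data from this study.
supermatrix = {frozenset("AB"): 100.0, frozenset("ABC"): 60.0, frozenset("DE"): 3.0}
supertree = {frozenset("AB"): 90.0, frozenset("ABC"): 80.0, frozenset("CD"): 2.0}
summary = compare_support(supermatrix, supertree)
```

Reporting both the mean and the counts of clades favoring each method is useful because a small average difference can conceal large differences of opposite sign for individual clades.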
Supermatrix versus input tree bootstrap scores.—On average, the supermatrix bootstrap support exceeded the input tree bootstrap support by 5.5% (median = 4.5%), 1.3%
(median = 0.7%), and 0.6% (median = 0%) in the green
plant, metazoan, and bilaterian data sets, respectively.
For the green plant data, the supermatrix bootstrap support exceeded the input tree bootstrap support in 151
clades, and the input tree bootstrap values were higher in
97 clades. For the metazoan data, supermatrix bootstrap
values were higher in 104 clades and input tree bootstrap
support was higher in 84 clades. In the bilaterian data, supermatrix support was higher in 28 clades and input tree
bootstrap support was higher in 29 clades. The supermatrix bootstrap scores exceeded input tree bootstrap scores
by a maximum of 80.7%, 78.2%, and 56.0% in the metazoan, green plant, and bilaterian data sets, respectively,
and the input tree bootstrap scores exceeded the supermatrix bootstrap scores by a maximum of 59.5%, 51.5%,
and 45.5% in the three data sets.
Supermatrix versus stratified bootstrap.—On average, the
supermatrix bootstrap support exceeded the stratified
bootstrap support by 5.2% (median = 3.4%) and 0.8%
(median = 0%) in the green plant and metazoan data
sets, respectively. For the green plant data, supermatrix
bootstrap support was higher in 142 clades and stratified
bootstrap support was higher in 90 clades. In metazoans,
supermatrix support was higher in 78 clades and stratified bootstrap support was higher in 86 clades. However,
on average in the bilaterian data set the stratified bootstrap values exceed the supermatrix bootstrap values by
0.1% (median = 0%). In 35 clades from the bilaterian
data set, the stratified bootstrap support is higher than
the supermatrix support, and the supermatrix bootstrap
support is higher in 28 clades. Supermatrix bootstrap
scores exceeded stratified bootstrap scores by a maximum of 55.0%, 61.1%, and 40.0% in the metazoan, green
plant, and bilaterian data sets, respectively, and the stratified bootstrap scores exceeded the supermatrix bootstrap
scores by a maximum of 36.8%, 47.5%, and 25.5% in the
three data sets.
FIGURE 1. Supertree bootstrap consensus trees from the metazoan data set (Driskell et al., 2004). The taxon labels include the GenBank taxon ID numbers and the taxonomic name associated with that number. Bootstrap percentages are above each branch. (a) Results from input tree bootstrapping; (b) results from stratified bootstrapping.
Input tree versus stratified bootstrap.—On the whole, the
input tree bootstrap scores are slightly lower than the
stratified bootstrap scores in the metazoan and green
plant data sets (Fig. 5c). On average among all observed
clades, the stratified bootstrap score exceeded the input tree bootstrap score by 0.6% (median = 0.8%) for
the metazoan data set and by 0.4% (median = 1.1%)
for the green plant data set. In the metazoan data set,
the stratified bootstrap support exceeded the input tree
bootstrap support in 109 clades, whereas the input tree
bootstrap support was higher in 87 clades. In the green
plant data set, the stratified bootstrap support exceeded
the input tree bootstrap score in 128 clades, while the input tree bootstrap support was higher in 113 clades. The
difference between supertree bootstrap methods was
even smaller in the bilaterian data set. On average, the
input tree bootstrap support exceeded that of stratified bootstrapping by only 0.1% (median = 0%). In
the bilaterian data set, the support from stratified bootstrapping exceeded that of input tree bootstrapping in
28 clades, and support from input tree bootstrapping
was higher in 35 clades. Again, in each data
set there were several cases in which clade support was
very different among supertree bootstrapping methods.
Input tree bootstrap support exceeded stratified bootstrap support by a maximum of 51.2%, 49.2%, and 25.5%
in the metazoan, green plant, and bilaterian data sets, respectively, and stratified bootstrapping exceeded input
tree bootstrapping scores by a maximum of 37.2%, 48.6%, and 40.0% in the three data sets.
FIGURE 2. Supertree bootstrap consensus trees from the green plant data set (Driskell et al., 2004). The taxon labels include the GenBank taxon ID numbers and the taxonomic name associated with that number. Bootstrap percentages are above each branch. (a) Results from input tree bootstrapping; (b) results from stratified bootstrapping.
DISCUSSION
Despite the growing interest in supertrees, often supertree studies do not explicitly discuss or use methods to assess confidence in supertree topologies (but, see Bininda-Emonds, 2003; Creevey et al., 2004; Ronquist et al., 2004; Philip et al., 2005). Yet, in many cases it is important to understand not only the optimal supertree topology but also its confidence limits. For example, supertrees are useful for studies in comparative biology, and bootstrapping supertrees allows one to incorporate
phylogenetic uncertainty in these comparative analyses.
This study demonstrates implementations of two simple
supertree bootstrapping methods. The results provide a
comparison of supertree and supermatrix phylogenetic
methods and are the first to compare support values from
supertree and supermatrix methods.
FIGURE 3. Supertree bootstrap consensus trees from the bilaterian data set (Philippe et al., 2005). Bootstrap percentages are above each branch. (a) Results from input tree bootstrapping; (b) results from stratified bootstrapping.
There has been much interest in comparing the performance of supertree and supermatrix techniques (Bininda-Emonds and Sanderson, 2001; Kennedy and Page, 2002; Gatesy et al., 2004; Hughes and Vogler, 2004). Although supertree methods sometimes perform as well or nearly as well as total evidence methods at recovering the true tree in simulation (Bininda-Emonds and Sanderson, 2001), in empirical data sets supertree methods generally find a greater number of equally optimal topologies and unambiguously resolve fewer phylogenetic relationships than corresponding total evidence approaches (Kennedy and Page, 2002; Gatesy et al., 2004; Hughes and Vogler, 2004). This result may be due to MRP matrices containing far fewer characters than the corresponding total evidence supermatrices. Furthermore, conventional MRP analyses
contain no information about the number of characters
supporting phylogenetic hypotheses represented by input trees, though this may be incorporated into a supertree analysis for example by weighting characters
in an MRP matrix in proportion to support for a clade
(Ronquist, 1996; Bininda-Emonds and Sanderson, 2001).
If the supertree analyses do not resolve clades as well
as the supermatrix analyses, one also might predict that
the bootstrap values generally would be lower in su-
^ " ^
99 1
Actinopterygii
Mammalia
Urochordata
Deuterostomia
Echinodermata
Cephalochordata
Annelida
Mollusca
Anopheles
Drosophila
Glossina
Bombyx
Apis
Arthropoda
Coleoptera
Siphonaptera
Hemiptera
Crustacea
Chelicerata
Ancylostoma
Caenorhabditis b
Caenorhabditis e
Pristionchus
Brugia
Nematoda
Ascaris
Meloidogyne
Heteroderidae
Strongyloides
Trichocephalida
Hypsibius
Echinococcus
Schistosomaj
Schistosoma m Platyhelminthes
Fasdola
Dugesia
Hydra
Mnemiopsis
Monosiga b
Choanoflagellata
Monosiga o
Candida
Saccharomyces
Eurotiomycetes
Gibberella
Neurospora
Magnaporthe
Schizosaccharomyces Fungi
Cryptococcus
Homobasidiomycetes
Ustilago
Glomales
Chytridiomycota
(Continued)
pertree analyses. Indeed, in the majority of clades, support from input tree or stratified bootstrapping is lower
than support from supermatrix analyses in the metazoan and green plant data sets, but overall the supertree
and supermatrix support is similar in the bilaterian and
yeast data sets (Fig. 5a, b). Yet, this characterization
of the relationship of supertree and supermatrix bootstrap support is incomplete. There are numerous cases in
which the supertree bootstrap scores for clades are higher
than supermatrix bootstrap scores (Fig. 5a, b). Therefore,
it is worthwhile to examine specific reasons for large
2006
435
BURLEIGH ET AL.—SUPERTREE BOOTSTRAPPING
S. cerevisiae
100/100
104/2
S. paradoxus
100/100
85/21
100/96
S. mikatae
69/37
100/100
106/0
100/100
72/34
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
Candida albicans
FIGURE 4. Supertree bootstrap consensus tree from the yeast data set (Rokas et al., 2003). The supertree bootstrap percentages are above each
branch. The input tree bootstrap score is on the left separated by a slash from the stratified bootstrap score on the right. Below each branch are
the number of genes that support/conflict with the quartet implied by each branch (as in Driskell et al., 2004).
differences in support scores from supertree and supermatrix analyses.
Both supertree bootstrapping methods implicitly impose a different weighting scheme on the data than supermatrix analyses. The supertree bootstrapping on the
genome-scale data sets treats all genes, no matter how
large or how variable, as single characters (e.g., Doyle,
1992). Thus, in this case, supertree bootstrapping downweights data from genes with more informative sites and
up-weights genes with fewer informative sites. In this respect, the supertree bootstrapping may resemble a supermatrix analysis that corrects the optimality score of each
gene based on its length (Seo et al., 2005). Though the
overall bootstrap values from supertree bootstrapping
may be lower than those from supermatrix bootstrapping, this trend may not persist or may be diminished
when the proportion of genes that support a phylogenetic hypothesis differs from the proportion of sites that
support the same hypothesis. In these cases, the sampling
variance among sites differs from the variance among
genes. We note that in many cases down-weighting
longer genes and up-weighting shorter genes may not
benefit a phylogenetic analysis. However, it still may
be informative to compare the effects from the different
approaches to utilizing data from different genes in a
phylogenetic inference. Furthermore, supertree analyses
need not weight trees equally; for example, the input
trees could be weighted based on length of genes or number of variable characters.
FIGURE 5. Relationships between bootstrap scores on clades. Each point on the graph represents a single clade. Closed squares represent
clades from the metazoan data set, open circles represent clades from the green plant data set, and triangles represent clades from the bilaterian data
set. The line represents equal values for both bootstrapping methods. Comparison of (a) input tree bootstrapping and supermatrix bootstrapping;
(b) stratified bootstrapping and supermatrix bootstrapping; and (c) input tree bootstrapping and stratified bootstrapping. These comparisons only include clades that
have at least 5% bootstrap support from at least one of the bootstrapping methods.
Such situations can be demonstrated with an imaginary data set of 100 genes of equal length in which each
gene alone is free of homoplasy but different genes support two competing clades, Clade A and Clade B. In the
first example, 50 genes support Clade A over Clade B
by a difference in tree length of 3 steps contributed by
each gene and 50 genes support Clade B over Clade A by
30 steps each. With three or more uncontradicted characters, bootstrap support for the clade will be at least 95%
for each gene (Felsenstein, 1985). In this case, both input tree and stratified supertree bootstrap scores would
support both trees equally because an equal number
of genes support each hypothesis. However, the supermatrix bootstrap would strongly support Clade B over
Clade A, because 10 times as many characters support
Clade B over Clade A.
Conversely, suppose 99 genes support Clade A over
Clade B by 3 steps and 1 gene supports Clade B over
Clade A by 297 steps. Here the input tree and stratified supertree bootstrap scores would overwhelmingly
support Clade A over Clade B because many more gene
trees support Clade A over Clade B. However, because
an equal number of character steps support
Clade A and Clade B (297), the supermatrix bootstrap
would support both trees equally. In both examples, the
supertree bootstrap more accurately represents the variation among genes within the data set, and the supermatrix bootstrap represents the variation among characters
in the data sets. Cases in which supertree and supermatrix bootstrap scores greatly diverge could indicate
clades that are influenced by a few anomalous loci or
clades in which there are strongly competing hypotheses
among genes. We suggest it may be informative to use supertree and supermatrix bootstrapping as complementary methods for assessing different levels of variance in
large, multilocus data sets.
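The two hypothetical data sets above can be explored numerically. The following Python sketch is a deliberately crude stand-in for the actual analyses: it assumes that a bootstrap replicate "supports" whichever clade is backed by more resampled units (genes for the supertree bootstrap, character steps for the supermatrix bootstrap), with ties split evenly. The function name `bootstrap_support` and this majority rule are illustrative assumptions, not the MRP or parsimony procedures used in this study.

```python
import random

def bootstrap_support(votes_a, votes_b, replicates=1000, seed=1):
    """Fraction of replicates in which Clade A outnumbers Clade B.

    votes_a and votes_b count the units (genes or character steps) that
    favor each clade. Each replicate resamples all units with replacement;
    a tie counts as half a win for Clade A.
    """
    rng = random.Random(seed)
    n = votes_a + votes_b
    p_a = votes_a / n
    wins_a = 0.0
    for _ in range(replicates):
        a = sum(1 for _ in range(n) if rng.random() < p_a)
        b = n - a
        if a > b:
            wins_a += 1.0
        elif a == b:
            wins_a += 0.5
    return wins_a / replicates

# Example 1: 50 genes favor A by 3 steps each, 50 genes favor B by 30 steps each.
s1_genes = bootstrap_support(50, 50)           # resampling genes
s1_chars = bootstrap_support(50 * 3, 50 * 30)  # resampling character steps

# Example 2: 99 genes favor A by 3 steps each, 1 gene favors B by 297 steps.
s2_genes = bootstrap_support(99, 1)
s2_chars = bootstrap_support(99 * 3, 1 * 297)

print("example 1: genes", s1_genes, "characters", s1_chars)
print("example 2: genes", s2_genes, "characters", s2_chars)
```

Under these assumptions, resampling genes splits support about evenly in the first example and strongly favors Clade A in the second, whereas resampling character steps strongly favors Clade B in the first example and splits support about evenly in the second.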
One unexpected result from supertree bootstrapping
is that even though there appears to be much incongruence among the genes that compose the yeast data set
(Rokas et al., 2003), the supertree bootstrap values are
100% for nearly all clades (Fig. 4). To examine this result
further, we determined the number of the yeast gene
trees that support and do not support the quartet implied by each inner branch of the tree (see Driskell et al.,
2004). The branch with the most apparent conflict has
69 genes supporting the quartet and 37 genes failing to
support it (Fig. 4). Although this may appear to be a large
number of conflicting genes, a clade that is supported
by 69 characters with only 37 conflicting characters will
have a high parsimony bootstrap score. Thus, the 100%
supertree bootstrap scores are consistent with the observed level of incongruence. Yet this demonstrates that
the supertree bootstrap percentage may differ greatly
from the actual percentage of genes that conflict with
a specific node.
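The argument about the 69/37 branch can be checked with a small calculation. The sketch below assumes, as a simplification, that each gene tree acts as one independent binary character that either supports or conflicts with the branch, and that a bootstrap replicate recovers the branch whenever supporting characters form a strict majority of the resample. The function name `majority_probability` is hypothetical, and real MRP characters are not independent in this way.

```python
from math import comb

def majority_probability(support, conflict):
    """Probability that supporting characters are a strict majority in one
    bootstrap resample of support + conflict characters (binomial model)."""
    n = support + conflict
    p = support / n
    need = n // 2 + 1  # smallest count that is a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Branch with the most apparent conflict in the yeast data set (Fig. 4).
prob_69_37 = majority_probability(69, 37)
print(round(prob_69_37, 4))
```

Under this model a 65% share of supporting genes already translates into a bootstrap proportion near 100%, which is why the bootstrap percentage should not be read as the percentage of genes that agree with a node.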
Though input tree bootstrapping and stratified bootstrapping examine different aspects of variation among
gene trees, they provide similar results in the four data
sets from this study (Fig. 5c). Input tree bootstrapping
represents the sampling variance among genes and can
be interpreted as reflecting the robustness of supertree inference to the choice of loci. On the other hand, stratified
bootstrapping measures how uncertainty in the phylogenetic inference for the input genes affects supertree
inference. In this respect, stratified bootstrapping incorporates the strength of the phylogenetic hypotheses
from each gene into supertree analysis (Page, 2004). Furthermore, by incorporating information about secondary
phylogenetic signals for each gene, stratified bootstrapping may help reveal or enhance hidden support (e.g.,
Barrett et al., 1991; Gatesy and Baker, 2005). Although
there is no apparent a priori reason to expect that values
from the two supertree bootstrapping methods will be
similar, they might be similar in cases when the variation in the phylogenetic signal among genes is a manifestation of uncertainty in the phylogenetic signal within
genes. For example, if the apparent variation in the phylogenetic signal among genes is due to sampling error within genes, then the two supertree bootstrapping
values should be similar. If the genes strongly support
highly divergent topologies, then we expect low values
from input tree bootstrapping and higher values for stratified bootstrapping. Conversely, if the genes all weakly
support the same topology, we expect higher values
for input tree bootstrapping than stratified bootstrapping. One obvious additional supertree bootstrapping
method would be to combine both supertree bootstrapping methods, resampling genes and then resampling
bootstrap trees from the resampled genes. This combined
supertree bootstrap would be conceptually similar to the
two-tiered supermatrix bootstrapping proposed by Seo
et al. (2005) in which one resamples genes and then resamples characters within the resampled genes. However, it may be informative to compare the two supertree
bootstrapping methods individually as described here to
examine differences in sampling variance among characters and among genes.
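The resampling step of each scheme discussed above can be sketched in a few lines of Python. This is only an outline of how one replicate might be drawn: trees are represented as Newick strings, the function names are hypothetical, and the supertree construction that would follow each replicate (for example, an MRP analysis) is not shown.

```python
import random

def input_tree_replicate(gene_trees, rng):
    """Input tree bootstrapping: resample the gene trees with replacement."""
    return [rng.choice(gene_trees) for _ in gene_trees]

def stratified_replicate(bootstrap_sets, rng):
    """Stratified bootstrapping: draw one tree at random from each gene's
    own (conventional) bootstrap tree set."""
    return [rng.choice(trees) for trees in bootstrap_sets]

def combined_replicate(bootstrap_sets, rng):
    """Combined scheme: resample genes with replacement, then draw one
    bootstrap tree for each resampled gene."""
    genes = [rng.choice(bootstrap_sets) for _ in bootstrap_sets]
    return [rng.choice(trees) for trees in genes]

# Toy input: three genes, each with an optimal tree and a bootstrap tree set.
gene_trees = ["((A,B),C,D);", "((A,C),B,D);", "((A,B),C,D);"]
bootstrap_sets = [
    ["((A,B),C,D);", "((A,D),B,C);"],
    ["((A,C),B,D);"],
    ["((A,B),C,D);", "((A,B),C,D);"],
]

rng = random.Random(0)
rep_input = input_tree_replicate(gene_trees, rng)
rep_strat = stratified_replicate(bootstrap_sets, rng)
rep_comb = combined_replicate(bootstrap_sets, rng)
```

In the first scheme a gene can appear several times or not at all in a replicate; in the second every gene contributes exactly one tree, so only within-gene uncertainty varies between replicates.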
Supertree methods have been criticized for a variety
of reasons (e.g., Purvis, 1995; Novacek, 2001; Springer
and de Jong, 2001; Wilkinson et al., 2001, 2005; Gatesy
et al., 2002, 2004; Pisani and Wilkinson, 2002; Gatesy and
Springer, 2004), and it is worth examining the implications of some of these criticisms on supertree bootstrapping in genome-scale data sets. First, the MRP method
has been particularly criticized, perhaps because it is
the most commonly used supertree method (e.g., Purvis,
1995; Pisani and Wilkinson, 2002; Gatesy and Springer,
2004). Supertree bootstrapping methods do not solve the
problems associated with MRP, but input tree bootstrapping can be applied to any modification of MRP or any
other supertree method, as can stratified bootstrapping
if the source data are amenable to bootstrapping. Thus,
supertree bootstrapping need not be limited by biases
associated with MRP. Second, supertree studies have
been criticized for using nonindependent input trees,
or input trees constructed from overlapping data sets
(Springer and de Jong, 2001; Gatesy et al., 2002; Gatesy
and Springer, 2004). This criticism is particularly leveled
at input tree sampling from previous supertree studies
and is not necessarily an inherent limitation of supertree
methods (Bininda-Emonds et al., 2004). The supertree
methods proposed here utilize gene trees from genome-scale data sets and have input trees built from nonoverlapping sequence data sets. The input trees in supertree
bootstrapping of genome-scale data sets may be considered independent (excepting the perennial caveat regarding recombinational distance between loci). A third
criticism of supertree methods is that they do not utilize
the underlying data from the gene trees, and therefore
they fail to account for the strength of support within
genes. In total-evidence analyses that analyze the primary phylogenetic data, a strong phylogenetic signal
that is not observed in analyses of individual data sets
may emerge from analyses of the combined data sets
(e.g., Barrett et al., 1991; Gatesy et al., 1999; Gatesy and
Baker, 2005). Yet, stratified bootstrapping incorporates
some measure of the strength of support in the underlying data into the supertree analysis by sampling from the
bootstrap distributions of the input trees (Page, 2004).
Supertree bootstrapping has several limitations. First,
like any bootstrapping method, supertree bootstrapping
is computationally intensive. Stratified bootstrapping requires time both to bootstrap the underlying
data for the input trees and to run the actual supertree bootstrapping replicates. Thus, supertree bootstrapping may not be a fast alternative to supermatrix
bootstrapping. Also, bootstrap methods in general may not accurately estimate variance when data sets are small.
Though many supertree analyses include hundreds of
input trees (e.g., Daubin et al., 2001; Liu et al., 2001), the
supertree bootstrapping scores may not be reliable when
there are few input trees. Furthermore, because input
trees from genome-scale data sets likely contain different sets of taxa (e.g., Driskell et al., 2004), it is possible
that some replicates in input tree bootstrapping will not
contain all taxa represented in the total set of input trees.
In addition, some replicates of input tree bootstrapping
may not contain the necessary taxonomic overlap to construct a resolved supertree. These sampling problems
will be most common if some taxa are present in only
one or a few input trees. If all taxa are represented in
numerous input trees, as is the case in each of the four
data sets from this study, then there likely will be few if
any sampling problems associated with input tree bootstrapping. All replicates of input tree bootstrapping in
this study contained all taxa. If some replicates do not
include all taxa, then the representation of the bootstrap
trees becomes a supertree rather than a consensus problem, and one cannot summarize the set of bootstrap trees
with a majority rule consensus.
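Whether incomplete replicates are likely for a given collection of input trees can be estimated before running any supertree analysis. The Python sketch below assumes each input tree is summarized only by its set of taxon names; the function name `fraction_complete` is hypothetical. It checks taxon coverage only, not whether a replicate has enough taxonomic overlap to yield a resolved supertree.

```python
import random

def fraction_complete(taxon_sets, replicates=1000, seed=0):
    """Estimate the fraction of input tree bootstrap replicates that
    contain every taxon present in the full set of input trees."""
    rng = random.Random(seed)
    all_taxa = set().union(*taxon_sets)
    complete = 0
    for _ in range(replicates):
        # One replicate: resample the input trees with replacement.
        sample = [rng.choice(taxon_sets) for _ in taxon_sets]
        if set().union(*sample) == all_taxa:
            complete += 1
    return complete / replicates

# Toy case 1: every taxon occurs in every input tree.
dense = [{"A", "B", "C", "D"}] * 5
# Toy case 2: taxon "E" occurs in only one of three input trees.
sparse = [{"A", "B", "C", "E"}, {"A", "B", "C", "D"}, {"A", "B", "C", "D"}]

frac_dense = fraction_complete(dense)
frac_sparse = fraction_complete(sparse)
print(frac_dense, frac_sparse)
```

In the second toy case a replicate is complete only if the single tree containing "E" is drawn at least once, which happens with probability 1 - (2/3)^3, or about 0.70.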
Supertree and supermatrix data sets may contain
much missing data, and the conventional nonparametric bootstrapping does not explicitly account for this
missing data. Efron (1994) describes several alternate
bootstrap procedures for assessing confidence in the face
of missing data. Two variants are particularly relevant
here. In the first, a protocol must exist for estimating the
missing values in the data matrix. This might be undertaken by fitting the missing cells to a probability distribution based on the cells with data present. Then the actual
data matrix is resampled and each sample has its missing data estimated based on this protocol, followed by
construction of the tree for each of these matrices. In another variant, a model for the concealment mechanism,
which is the process by which data "go missing," must
be postulated. In supertree analyses this might have to
do with the distribution of taxon sampling among trees.
In this bootstrap method, an estimate of the data matrix
is constructed such that no data are missing (perhaps using the fitting method of the first variant), and then this
matrix is resampled, each time subsequently subjected
to the concealment process, reestimation of the full data
matrix, and, finally, tree construction. In each variant of
bootstrapping with missing data, variation in trees could
be summarized in the conventional manner by majority
rule consensus. Note that the structure of the supertree
method used here (cells in an MRP matrix) rules out
many methods that Efron (1994) proposed for estimating missing data or modeling concealment. However, in
principle, either of these missing data bootstrapping procedures could be incorporated into supertree bootstrapping. This study examines only bootstrapping methods
for assessing support in supermatrices and supertrees.
However, there are numerous other methods, including
jackknifing (e.g., Farris et al., 1996), Bremer support or
decay indices (Bremer, 1988), partition Bremer support
(Baker and DeSalle, 1997), and ILD tests (Farris et al.,
1994), for assessing support and incongruence in large
data sets that may be affected differently by missing
data.
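The order of operations in the two missing-data bootstrap variants described above can be made concrete with a toy numerical example. The Python sketch below follows only the steps as summarized in this section and is not drawn from Efron (1994) directly. Every component is an illustrative assumption: missing cells are imputed with column means, concealment hides cells independently at a fixed rate, and a simple column mean stands in for tree construction.

```python
import random

def impute_column_means(rows):
    """Fill each missing cell (None) with the mean of the observed cells in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Falling back to 0.0 for an all-missing column is an arbitrary toy choice.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def conceal_at_random(rows, rng, rate=0.25):
    """Toy concealment model: hide each cell independently with probability `rate`."""
    return [[None if rng.random() < rate else x for x in r] for r in rows]

def first_column_mean(rows):
    """Stand-in for tree construction: any statistic of a completed matrix."""
    return sum(r[0] for r in rows) / len(rows)

def bootstrap_impute(rows, replicates, rng):
    """Variant 1: resample the observed matrix, impute each sample, then estimate."""
    out = []
    for _ in range(replicates):
        sample = [rng.choice(rows) for _ in rows]
        out.append(first_column_mean(impute_column_means(sample)))
    return out

def bootstrap_conceal(rows, replicates, rng):
    """Variant 2: impute once, resample the completed matrix, conceal,
    re-impute, then estimate."""
    full = impute_column_means(rows)
    out = []
    for _ in range(replicates):
        sample = [rng.choice(full) for _ in full]
        concealed = conceal_at_random(sample, rng)
        out.append(first_column_mean(impute_column_means(concealed)))
    return out

data = [[1.0, None], [2.0, 3.0], [None, 4.0], [4.0, 5.0]]
rng = random.Random(0)
v1 = bootstrap_impute(data, 200, rng)
v2 = bootstrap_conceal(data, 200, rng)
```

In a supertree setting the imputation and concealment steps would have to operate on cells of an MRP matrix or on taxon sampling among trees, which is where the restrictions noted above would apply.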
The supertree bootstrapping methods described in
this article are relatively easy to implement with existing software, and they provide useful information
in the context of supertree phylogenetic inference. The
increasing prevalence of genome-scale data sets necessitates methods for understanding the patterns of conflicting phylogenetic variation within the genome, and
supertree bootstrapping can be useful for assessing variation among gene trees that may be masked by total evidence bootstrapping. Our study used a relatively simple
MP analysis on each of the input gene sets, but gene-specific heterogeneity also could be addressed in a supertree bootstrap by an explicit model-based approach,
such as by assigning different models or substitution
parameters for each gene. In the future it will be interesting to compare these supertree approaches to new
total-evidence methods for incorporating heterogeneity
among genes and assessing gene-specific variation from
genome-scale data sets. Finally, although this article emphasizes the utility of supertree bootstrapping in phylogenetic analyses of genome-scale data sets, the methods
may be extended for estimating uncertainty in other supertree analyses.
A Perl script implementing input tree bootstrapping
is available at http://ginger.ucdavis.edu. It also is implemented in Clann (Creevey and McInerney, 2004)
and Treeboot by Brian Moore. All data from this study
are available at both http://systematicbiology.org and
http://ginger.ucdavis.edu.
ACKNOWLEDGEMENTS
We thank Antonis Rokas for providing the yeast supermatrix and
Hervé Philippe for providing the bilaterian supermatrix. Olaf Bininda-Emonds, John Gatesy, Brian Moore, and Rod Page provided helpful
comments on this manuscript. This work was funded by NSF grants
0431154 and 03346963.
REFERENCES
Baker, R. H., and R. DeSalle. 1997. Multiple sources of character information and the phylogeny of Hawaiian Drosophilids. Syst. Biol. 46:654-673.
Bapteste, E., H. Brinkmann, J. A. Lee, D. V. Moore, C. W. Sensen, P. Gordon, L. Duruflé, T. Gaasterland, P. Lopez, M. Müller, and H. Philippe. 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 99:1414-1419.
Barrett, M., M. J. Donoghue, and E. Sober. 1991. Against consensus. Syst. Zool. 40:486-493.
Baum, B. R. 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 42:637-640.
Baum, B. R., and M. A. Ragan. 2004. The MRP method. Pages 17-34 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Bininda-Emonds, O. R. P. 2003. Novel versus unsupported clades: Assessing the qualitative support for clades in MRP supertrees. Syst. Biol. 52:839-848.
Bininda-Emonds, O. R. P. 2004. The evolution of supertrees. Trends Ecol. Evol. 19:315-322.
Bininda-Emonds, O. R. P., J. L. Gittleman, and M. A. Steel. 2002. The (super)tree of life: Procedures, problems, and prospects. Annu. Rev. Ecol. Syst. 33:265-289.
Bininda-Emonds, O. R. P., K. E. Jones, S. A. Price, M. Cardillo, R. Grenyer, and A. Purvis. 2004. Garbage in, garbage out: Data issues in supertree construction. Pages 267-280 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Bininda-Emonds, O. R. P., and M. J. Sanderson. 2001. Assessment of the accuracy of matrix representation with parsimony supertree construction. Syst. Biol. 50:565-579.
Blair, J. E., K. Ikeo, T. Gojobori, and S. B. Hedges. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2:7.
Blair, J. E., P. Shah, and S. B. Hedges. 2005. Evolutionary sequence analysis of complete eukaryote genomes. BMC Bioinformatics 6:53.
Bremer, K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42:795-803.
Bull, J. J., J. P. Huelsenbeck, C. W. Cunningham, D. L. Swofford, and P. J. Waddell. 1993. Partitioning and combining data in phylogenetic analysis. Syst. Biol. 42:384-397.
Cotton, J. A., and R. D. M. Page. 2002. Going nuclear: Vertebrate phylogeny and gene family evolution reconciled. P. Roy. Soc. Lond. B Bio. 269:1555-1561.
Creevey, C. J., D. A. Fitzpatrick, G. K. Philip, R. J. Kinsella, M. J. O'Connell, M. M. Pentony, S. A. Travers, M. Wilkinson, and J. O. McInerney. 2004. Does a tree-like phylogeny only exist at the tips in the prokaryotes? P. Roy. Soc. Lond. B Bio. 271:2551-2558.
Creevey, C. J., and J. O. McInerney. 2004. Clann: Investigating phylogenetic information through supertree analyses. Bioinformatics 21:390-392.
Cunningham, C. W. 1997. Is congruence between data partitions a reliable predictor of phylogenetic accuracy? Empirically testing an iterative procedure for choosing among phylogenetic methods. Syst. Biol. 46:464-478.
Daubin, V., M. Gouy, and G. Perrière. 2001. Bacterial molecular phylogeny using supertree approach. Genome Informatics 12:155-164.
DeBry, R. W., and R. G. Olmstead. 2000. A simulation study of reduced tree-search effort in bootstrap resampling analysis. Syst. Biol. 49:171-179.
de Queiroz, A., M. J. Donoghue, and J. Kim. 1995. Separate versus combined analysis of phylogenetic evidence. Annu. Rev. Ecol. Syst. 26:657-681.
Dopazo, H., and J. Dopazo. 2005. Genome-scale evidence of the nematode-arthropod clade. Genome Biol. 6:R41.
Doyle, J. J. 1992. Gene trees and species trees: Molecular systematics as one-character taxonomy. Syst. Bot. 17:144-163.
Driskell, A. C., C. Ané, J. G. Burleigh, M. M. McMahon, B. C. O'Meara, and M. J. Sanderson. 2004. Prospects for building the tree of life from large sequence databases. Science 306:1172-1174.
Efron, B. 1994. Missing data, imputation, and the bootstrap. J. Am. Stat. Assoc. 89:463-475.
Escobar-Páramo, P., A. Sabbagh, P. Darlu, O. Pradillon, C. Vaury, E. Denamur, and G. Lecointre. 2004. Decreasing the effects of horizontal gene transfer on bacterial phylogeny: The Escherichia coli case study. Mol. Phylogenet. Evol. 30:243-250.
Farris, J. S., V. A. Albert, M. Källersjö, D. Lipscomb, and A. G. Kluge. 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12:99-124.
Farris, J. S., M. Källersjö, A. G. Kluge, and C. Bult. 1994. Testing significance of incongruence. Cladistics 10:315-319.
Farris, J. S., A. G. Kluge, and M. J. Eckhardt. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19:172-191.
Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783-791.
Gatesy, J., and R. H. Baker. 2005. Hidden likelihood support in genomic data: Can forty-five wrongs make a right? Syst. Biol. 54:483-492.
Gatesy, J., R. H. Baker, and C. Hayashi. 2004. Inconsistencies in arguments for the supertree approach: Supermatrices versus supertrees of Crocodylia. Syst. Biol. 53:342-355.
Gatesy, J., C. Matthee, R. DeSalle, and C. Hayashi. 2002. Resolution of a supertree/supermatrix paradox. Syst. Biol. 51:652-664.
Gatesy, J., P. O'Grady, and R. H. Baker. 1999. Corroboration among data sets in simultaneous analysis: Hidden support for phylogenetic relationships among higher level artiodactyl taxa. Cladistics 15:271-313.
Gatesy, J., and M. S. Springer. 2004. A critique of matrix representation with parsimony supertrees. Pages 369-388 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms for determining minimal evolutionary trees. Math. Biosci. 59:277-290.
Huelsenbeck, J. P., J. J. Bull, and C. W. Cunningham. 1996. Combining data in phylogenetic analyses. Trends Ecol. Evol. 11:152-158.
Hughes, J., and A. P. Vogler. 2004. The phylogeny of acorn weevils (genus Curculio) from mitochondrial and nuclear DNA sequences: The problem of incomplete data. Mol. Phylogenet. Evol. 32:601-615.
Kennedy, M., and R. D. M. Page. 2002. Seabird supertrees: Combining partial estimates of procellariiform phylogeny. Auk 119:88-108.
Lee, Y., R. Sultana, G. Pertea, J. Cho, S. Karamycheva, J. Tsai, B. Parvizi, F. Cheung, V. Antonescu, J. White, I. Holt, F. Liang, and J. Quackenbush. 2002. Cross-referencing eukaryotic genomes: TIGR orthologous gene alignments (TOGA). Genome Res. 12:493-502.
Lerat, E., V. Daubin, and N. A. Moran. 2003. From gene trees to organismal phylogeny in prokaryotes: The case of the γ-proteobacteria. PLoS Biol. 1:101-109.
Liu, F.-G. R., M. M. Miyamoto, N. P. Freire, P. Q. Ong, M. R. Tennant, T. S. Young, and K. F. Gugel. 2001. Molecular and morphological supertrees for eutherian (placental) mammals. Science 291:1786-1789.
Mort, M. E., P. S. Soltis, D. E. Soltis, and M. L. Mabry. 2000. Comparison of three methods for estimating internal support on phylogenetic trees. Syst. Biol. 49:160-171.
Novacek, M. J. 2001. Mammalian phylogeny: Genes and supertrees. Current Biology 11:R573-R575.
Page, R. D. M. 2004. Taxonomy, supertrees, and the tree of life. Pages 247-266 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Philip, G. K., C. J. Creevey, and J. O. McInerney. 2005. The Opisthokonta and Ecdysozoa may not be clades: Stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. 22:1175-1184.
Philippe, H., N. Lartillot, and H. Brinkmann. 2005. Multigene analysis of bilaterians corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol. Biol. Evol. 22:1246-1253.
Philippe, H., E. A. Snell, E. Bapteste, P. Lopez, P. W. H. Holland, and D. Casane. 2004. Phylogenomics of eukaryotes: Impact of missing data on large alignments. Mol. Biol. Evol. 21:1740-1752.
Pisani, D., and M. Wilkinson. 2002. MRP, taxonomic congruence and total evidence. Syst. Biol. 51:151-155.
Purvis, A. 1995. A modification to Baum and Ragan's method for combining phylogenetic trees. Syst. Biol. 44:251-255.
Ragan, M. A. 1992. Phylogenetic inference based on matrix representation of trees. Mol. Phylogenet. Evol. 1:53-58.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798-804.
Ronquist, F. 1996. Matrix representation of trees, redundancy, and weighting. Syst. Biol. 45:247-253.
Ronquist, F., J. P. Huelsenbeck, and T. Britton. 2004. Bayesian supertrees. Pages 193-224 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Sanderson, M. J. 2003. r8s: Inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19:301-302.
Sanderson, M. J., A. Purvis, and C. Henze. 1998. Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. 13:105-109.
Sanderson, M. J., and M. F. Wojciechowski. 2000. Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Syst. Biol. 49:671-685.
Seo, T.-K., H. Kishino, and J. L. Thorne. 2005. Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data. Proc. Natl. Acad. Sci. USA 102:4436-4441.
Springer, M. S., and W. W. de Jong. 2001. Which mammalian supertree to bark up? Science 291:1709-1711.
Swofford, D. L. 2003. PAUP*: Phylogenetic analysis using parsimony (*and other methods), version 4.0b10. Sinauer Associates, Sunderland, Massachusetts.
Wilkinson, M., J. A. Cotton, C. Creevey, O. Eulenstein, S. R. Harris, F.-J. Lapointe, C. Levasseur, J. O. McInerney, D. Pisani, and J. L. Thorley. 2005. The shape of supertrees to come: Tree shape related properties of fourteen supertree methods. Syst. Biol. 54:419-431.
Wilkinson, M., J. L. Thorley, D. T. J. Littlewood, and R. Bray. 2001. Towards a phylogenetic supertree of Platyhelminthes? Pages 292-301 in Interrelationships of the Platyhelminthes (D. Littlewood and R. Bray, eds.). Chapman Hall, London.
Wolf, Y. I., I. B. Rogozin, and E. V. Koonin. 2004. Coelomata and not Ecdysozoa: Evidence from genome-wide phylogenetic analysis. Genome Res. 14:29-36.
Yan, C., J. G. Burleigh, and O. Eulenstein. 2005. Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol. Phylogenet. Evol. 35:528-535.

First submitted 11 July 2005; reviews returned 30 August 2005; final acceptance 4 November 2005
Associate Editor: Olaf Bininda-Emonds