Taxon-Rich Phylogenomic Analyses Resolve the

Syst. Biol. 64(3):406–415, 2015
© The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved.
For Permissions, please email: [email protected]
DOI:10.1093/sysbio/syu126
Advance Access publication December 23, 2014
Taxon-Rich Phylogenomic Analyses Resolve the Eukaryotic Tree of Life and
Reveal the Power of Subsampling by Sites
LAURA A. KATZ1,2∗ AND JESSICA R. GRANT1
1 Department
of Biological Sciences, Smith College, Northampton, MA 01063, USA and 2 Program in Organismic and Evolutionary Biology,
UMass-Amherst, Amherst MA 01003, USA
∗ Correspondence to be sent to: Department of Biological Sciences, Smith College, Northampton, MA 01063, USA; E-mail: [email protected]
Received 9 March 2014; reviews returned 17 July 2014; accepted 15 December 2014
Associate Editor: Thomas Buckley
Abstract.—Most eukaryotic lineages are microbial, and many have only recently been sampled for phylogenetic studies or
remain in the “dark area” of the tree of life where there are no molecular data. To assess relationships among eukaryotic
lineages, we perform a taxon-rich phylogenomic analysis including 232 eukaryotes selected to maximize taxonomic diversity
and up to 1554 genes chosen as vertically inherited based on their broad distribution among eukaryotes. We also include
sequences from 486 bacteria and 84 archaea to assess the impact of endosymbiotic gene transfer (EGT) from plastids and to
detect contamination. Overall, our analyses are consistent with other less taxon-rich estimates of the eukaryotic tree of life,
and we recover strong support for five major clades: Amoebozoa, Excavata (without the genus Malawimonas), Opisthokonta,
Archaeplastida, and SAR (Stramenopila, Alveolata, and Rhizaria). Our analyses also highlight the existence of “orphan”
lineages, lineages that lack robust placement in the eukaryotic tree of life, and indicate the possibility of as yet undiscovered
diversity. In analyses including bacteria and archaea, we find that approximately 10% of the 1554 genes, which we choose
because they are found in four or five of the five major eukaryotic clades and hence may be more likely to be inherited
vertically, appear to have been acquired from cyanobacteria through EGT in photosynthetic lineages. Removing these EGT
genes places the green algae as sister to the glaucophytes instead of the red algae, suggesting that unknowingly including
genes of plastid origin, and combining them with genes of nuclear origin, may mislead phylogenetic estimates. Finally,
the large size of our data set allows comparative analyses of subsets of data; alignments built from randomly sampled
sites provide greater support, particularly for deep relationships, than do equivalent-sized data sets built from randomly
sampled genes. [Endosymbiotic gene transfer; eukaryotic phylogeny; microbial diversity; phylogenetics; sampling strategy;
tree of life.]
The power of phylogenomics for resolving
relationships is well-documented for groups shallower
than eukaryotes (Delsuc et al. 2005; Philippe et al. 2005;
Giribet and Edgecombe 2012; Kumar et al. 2012), and
has been applied previously to questions of eukaryotic
phylogeny (Hampl et al. 2009; Burki et al. 2010; Burki
et al. 2012; Torruella et al. 2012; Sierra et al. 2013)
and eukaryotic origins (Cotton and McInerney 2010;
Thiergart et al. 2012; Williams et al. 2012). However, given
the constraints on large-scale phylogenomic analyses
and data availability, the bulk of these studies focused
on relatively small numbers of lineages, often in the
range of 50–70 species. Yet, the past few years have seen
an explosion in genome scale data (e.g., whole genomes
and transcriptomes) and high-powered computer
resources allowing for more data rich analyses.
Reconstructing eukaryotic phylogeny is particularly
challenging given the approximately 1.8 billion year
age of this clade (Knoll et al. 2006; Parfrey et al.
2011), the tremendous diversity among eukaryotes
(Adl et al. 2012), and the impact of both lateral and
endosymbiotic gene transfer (LGT and EGT; Andersson
2005; Keeling and Palmer 2008; Archibald 2009; CavalierSmith 2010; Hug et al. 2010; Nowack and Melkonian 2010;
Szklarczyk and Huynen 2010). Eukaryotes are marked
by tremendous heterogeneity in rates and patterns of
molecular evolution, including the rapid evolution of
proteins in many lineages (Dacks et al. 2002; Zufall
et al. 2006; Yoon et al. 2008; Cuomo et al. 2012).
Despite these difficulties, five major clades of eukaryotes
have emerged: Amoebozoa, Excavata, Opisthokonta,
Archaeplastida, and SAR (stramenopile, alveolate, and
rhizaria; e.g., Adl et al. 2012; Katz et al. 2012).
The evolutionary history of eukaryotic genomes has
also been impacted by the transfer of genes from
endosymbionts (i.e., EGT; Martin et al. 1993; 2002).
Eukaryotes acquired plastids by endosymbiosis of a
cyanobacterium or through secondary endosymbiosis in
which a heterotrophic eukaryote engulfed a red or green
alga (Delwiche 1999; Cavalier-Smith 2000; Archibald
2009). The major clade Archaeplastida includes green
algae, red algae, and glaucophytes, and is argued to
have diverged following a single primary acquisition
of chloroplasts (Adl et al. 2005). The remaining
lineages of photosynthetic eukaryotes (e.g., diatoms,
brown algae, euglenoids, cryptomonads, haptophytes,
and dinoflagellates) are argued to have resulted from
secondary, or in the case of some dinoflagellates, tertiary,
or perhaps even quaternary endosymbiosis (Delwiche
1999; Archibald 2009).
Given the importance of taxon sampling for
phylogenetic reconstruction (Heath et al. 2008), we
aimed to capture as broad a diversity of eukaryotes
as possible. For example, in studies of eukaryotes
(Parfrey et al. 2010), Metazoa (Dunn et al. 2008; Hejnol
et al. 2009), and vertebrates (Wiens 2005; Wiens and
Tiu 2012), inclusion of taxa in key positions has been
critical in stabilizing phylogenies even if these taxa
introduced extensive missing data. With well-sampled
phylogenomic data, it is also possible to test the idea that
robust phylogenies are better obtained from analyses
of randomly chosen sites when compared with data
406
2015
KATZ AND GRANT—TAXON-RICH PHYLOGENOMICS OF EUKARYOTES
sets of the same size constructed from unlinked genes
(Cummings et al. 1995, 1999; Cummings and Meyer
2005).
Here we combine a taxon and gene rich approach, with
up to 802 species (232 eukaryotes, 486 bacteria, and 84
archaea) and up to 1554 genes (∼500,000 characters), to
assess the evolutionary relationships among eukaryotes
and the position of eukaryotes on the tree of life.
Inclusion of bacterial lineages also enabled us to identify
eukaryotic genes that fall sister to cyanobacteria and are
hence likely the result of EGTs following acquisition of
plastids. The large size of our data set allows us to assess
the stability of clades using a variety of subsampling
approaches.
MATERIALS AND METHODS
Taxon and Gene Selection
Given our goal of creating a taxon rich phylogenomic
study, we collected data from diverse eukaryotes
containing at least 100 proteins available on public
databases, and curated these sequences in a custombuilt pipeline as described in Grant and Katz (2014). To
this end, we began with the taxa included in OrthoMCL
(Li et al. 2003; Chen et al. 2006), a database of over
124,000 orthologous groups from whole genomes of 98
eukaryotes, 44 bacteria, and 16 archaea. We then added
sequence data from all eukaryotic taxa in GenBank with
at least 100 proteins either in the protein database or as
EST projects and data from transcriptome projects from
407
our laboratory (Chilodonella uncinata, Subulatomonas
tetraspora, Corallomyxa tenera) or from the Marine
Microbial Eukaryote Transcriptome Sequencing project
(http://marinemicroeukaryotes.org/; Trichosphaerium
sp. ATCC40318, Filamoeba nolandi ATCC50430, Mayorella
sp., Stereomyxa ramosa, Pessonella sp. PRA-29, Rhodomonas
lens, Tetraselmis chuii, Mesodinium pulex, Strombidinopsis
sp., and Tiarina fusus).
To maintain some taxonomic evenness, we
subsampled from groups that are well represented
in GenBank (i.e., plants, animals, fungi, and some
parasitic microbial lineages such as Plasmodium and
Trypanosoma). We also removed taxa from the analyses
if the data appeared to be highly contaminated;
for example, the EST data from the rhizarian plant
parasites Plasmodiophora brassicae and Spongospora
subterranea included many sequences that fell within
Archaeplastida in preliminary analyses. Other taxa
(e.g., Welwitschia mirabilis and Ginkgo biloba) were
removed because they had few data that remained in
our matrices after data refinement was complete. In
the final analyses, we also removed the microsporidian
fungi as they consistently fell on very long branches and
their position was unstable across analyses.
To capture putatively vertically inherited genes, we
identified 1554 orthologous groups from OrthoMCL
(Chen et al. 2006) that were present in at least four
of five major eukaryotic clades (i.e., Opisthokonta,
Archaeplastida, Amoebozoa, Excavata and SAR)
as represented in OrthoMCL (http://orthomcl.
org/orthomcl/; Fig. 1 and Supplementary Table S2;
FIGURE 1.
Depiction of output from phylogenomic pipeline reveals heterogeneity in data availability for the 1554 genes sampled. Each
column represents a gene and the warmth (darkness) of the color represents the number of sequences captured for each gene and clade. The
genes are ranked by evenness as measured by representation across both major clades and subclades, and the most evenly sampled 150 genes
are marked by the bar at the bottom of figure. The last row indicates genes identified as impacted by EGT as orthologs in photosynthetic
eukaryotes nest among cyanobacteria. Am = Amoebozoa: ac, Acanthamoebidae; ar, Archamoebae; da, Dactylopodida; di, Dictyosteliida; fi,
Filamoeba; is, incertae sedis; my, Mycetozoa; va, Vannellidae; EE = “everything else”: ap, Apusomonadidae; br, Breviatea; cr, Cryptophyta; ha,
Haptophyta; ka, Katablepharidophyta; Ex = Excavata; eu, Euglenozoa; fo, Fornicata; he, Heterolobosea; ja, Jakobida; ma, Malawimonadidae; ox,
Oxymonadida; pa, Parabasalia; Op = Opisthokonta; ch, Choanoflagellida; fu, Fungi; ic, Ichthyosporea; me, Metazoa; Pl = Archaeplastida/Plantae:
gl, Glaucocystophyceae; gr, Viridiplantae; rh, Rhodophyta; Sr = SAR: ap, Apicomplexa; ch, Chromerida; ci, Ciliophora; di, Dinophyceae; pe,
Perkinsida; rh, Rhizaria; st, Stramenopiles.
408
SYSTEMATIC BIOLOGY
available from http://dx.doi.org/10.5061/dryad.db78g).
We generated alignments using a pipeline that is
described in detail in Grant and Katz (2014). In brief,
genes are identified first by BLAST, using data from
OrthoMCL as query and then pairwise comparisons
of sequences are used to remove too similar (>98%
identical; e.g., allelic variation) or too divergent (< 70%
identical; e.g., non-orthologous) data. Alignments are
then refined using Guidance (Penn et al. 2010) and
RaxML (Stamatakis 2006) is used to build single gene
trees. The resulting data set had 802 taxa (232 eukaryotes,
486 bacteria, and 84 archaea) and up to 493,050 amino
acids. The well-sampled marker SSU-rDNA gene
was also concatenated to some gene matrices. To
maximize the power of our tree reconstructions, we put
considerable effort into analysis of the 150 genes that had
the most evenly distributed taxa across the eukaryotic
clades (e.g., Excavata, Amoebozoa, Alveolata, Rhizaria,
Stramenopile, Archaeplastida, and Opisthokonta;
indicated in Fig. 1).
Phylogenetic Analyses
For all analyses, the best fitting model for each gene
was determined using ProtTest (Darriba et al. 2011). The
model LG was determined to be the best fitting model
for all but 8 of the 1554 genes. Of the eight, six had
WAG as the best fit, one had JTT, and one had BLOSUM,
but of these genes all had LG as the second-best fit
with a deltaBIC < 10. Because estimates from ProtTest
were very similar for +G and +G+I, and because of the
computational cost of partitioning these data, we ran all
amino acid matrices under the PROTGAMMALG model
using default parameters as implemented in RAxML.
Where SSU-rDNA nucleotide data were included, we
partitioned the data to run with GTRGAMMA for
nucleotide data and PROTGAMMALG for amino acids.
Phylogenies were reconstructed in one of several
ways: 1) for smaller and more tractable data sets we used
RAxML (Stamatakis et al. 2005; Stamatakis 2006; Berger
et al. 2011) with rapid bootstrapping of 100 replicates
(Stamatakis et al. 2008) followed by full maximum
likelihood reconstruction; 2) for larger data sets, and for
subsampling, we used ExaML (Stamatakis and Aberer
2013), a faster implementation of the RAxML algorithm
that provides the topology of the most likely tree without
bootstrap support (BS); 3) we also ran an analysis in
PhyloBayes (PhyloBayesMPI 1.4f; Lartillot et al. 2009) as
implemented at CIPRES (Miller et al. 2010) using the
GTR CAT model, to assess the stability of our results to
changing models. Although the PhyloBayes run did not
converge after approximately 52 h on a 192 core computer
(up to 2750 cycles), likely due to instability among closely
related taxa, deeper nodes were consistent with our other
analyses.
Analyses are named in part based on their inclusion
of either the 150 most evenly sampled genes or the full
set of 1554 genes (“150” vs. “1554”), whether they were
run without genes impacted by (“EGT” vs “-EGT” and
varying taxon inclusion to focus only on eukaryotes
VOL. 64
(“E”), or to include also Bacteria and Archaea (“A, B, E”;
Table 1). We explored the impact of missing data using
a range of levels (40–70%) and found that topologies
were generally stable to varying levels of missing data
(i.e., supported clades retained across analyses). Based
on these preliminary analyses, we choose an arbitrary
cut-off of more than 50% missing data (per column) for
removal in all analyses reported here. As detailed in
the legend for Figures 1 and 2; we also used a standard
naming system with the first two letters referring to
major clade (e.g., Op for Opisthokonta), the second two
letters referring to subclade (e.g., me for Metazoa), and
the next four letters referring to species (e.g., hsap for
Homo sapiens).
Comparing Sampling by Sites versus Genes
Data sets of the size generated by our pipeline are
too large to be analyzed by most phylogenetic software
programs. This allowed us to ask whether researchers
analyzing phylogenomic data are best off subsampling
by genes or by randomly chosen sites. Subsampling by
genes at random might enable capture of diverse genes
with varying levels of functional constraint, whereas
subsampling by sites may better fit the assumption of
site independence inherent in most phylogenetic models.
We subsampled the full data set of 1554 genes and
493,050 amino acids (again masking columns with ≥50%
missing data), choosing random genes or random sites
to make new matrices of varying lengths from 1000
to 45,000 characters. For each sampling strategy, we
made 20 alignments of lengths 1000, 5000, 10,000, 15,000,
20,000, 25,000, 30,000, 35,000, 40,000, and 45,000. These
were analyzed with EXaML and the proportion of the
20 topologies that supported monophyly of a particular
clade is reported.
We assessed the impact of fast evolving sites in our
analyses using TIGER (Cummins and McInerney 2011)
with default parameters. TIGER infers the evolutionary
rate of homologous sites, and groups sites with similar
rates into bins. We removed sites from the two and
three highest bins (i.e. fastest 14,800 and 26,017 sites,
respectively) from the alignments “150” and “150-EGT”
with eukaryotes only, respectively, and analyzed the
resulting matrices in RAxML as described above.
Endosymbiotic Gene Transfer
To assess EGTs and possible contamination, we also
included whole, genome data from 486 bacteria and
84 archaea, limiting ourselves to two species per genus
for well-sampled genera. Prior to concatenation, we
removed possible contaminating sequences identified as
single eukaryotes (i.e., without eukaryotic sisters) within
bacterial and/or archaeal clades. To search for genes
with putative signal of EGT, we use python scripts that
employed the tree-walking methods in p4 (Foster 2004)
and identified genes with eukaryotic sequences sister to
or nested within cyanobacteria. We identified 122 genes
2015
409
KATZ AND GRANT—TAXON-RICH PHYLOGENOMICS OF EUKARYOTES
TABLE 1.
Monophyly of major clades is consistent across most analyses, though with varying levels of support
Analysis
Domains
Number of taxa
Number of characters
Software
150
E
232
36,346
RAx
150-EGT
E
232
35,302
RAx
150
E
232
36,346
EXa
150-EGT
E
232
35,302
EXa
1554
E
232
64,860
EXa
1554-EGT
E
232
54,939
EXa
150
A, B, E
800
36,346
EXa
150-EGT
A, B, E
800
35,302
EXa
1554
A, B, E
800
64,860
EXa
1554-EGT
A, B, E
800
54,939
EXa
150
E
235
34,991
PB
Amoebozoa
Excavata
Opisthokonts
Plantae
SAR
85
81
100
82
66
93
85
100
40
70
o
o
x
x
x
x
x
x
x
x
o
o
x
x
x
o
o
x
x
x
o
o
x
x
x
o
o
x
x
x
o
o
x
o
x
o
o
x
x
x
1
o
1
0.78
0.91
Alveolates
Ap, Di, Ch, Pe
Discoba
Euglenozoa
Green + Red
Green + Glauc
Hacrobia
Rhizaria
(Sp + Ct) + Me
100
100
98
100
67
o
57
86
99
100
100
92
100
o
75
45
89
99
x
x
x
x
x
o
x
x
x
x
x
x
x
o
x
x
x
x
x
x
x
x
x
o
x
x
x
x
x
x
x
o
x
x
x
x
x
x
x
x
x
o
x
x
x
x
x
x
x
o
x
x
x
x
x
x
x
x
o
o
x
x
x
x
x
x
x
o
x
x
x
x
0.73
0.91
0.95
0.95
0.79
o
0.69
o
o
Notes: Analyses are named first based on the number of genes using in initial data set (i.e., “150” for 150 most evenly sampled genes; 1554 for
full data set), and then by noting whether genes impacted by EGT are removed (i.e., -EGT). Analyses vary based on the domains included and
the resulting number of taxa. x = monophyletic; o = nonmonophyletic; for RAxML (RAx), numbers represent BS for monophyletic clades; for
PhyloBayes (PB), numbers represent posterior probabilities of monophyletic clades. The large size of the data sets constrained the approaches
that could be used. For example, ExaML (EXa) was used for larger data sets even though it appears to produce topologies that are consistently
different in some areas when compared with RAxML (RAx); for example, neither Amoebozoa or Excavata are recovered consistently in EXaML
analyses as the long branched Entamoeba tend to fall in Excavata (online supplementary Data). The PB analyses were done excluding SSU-rDNA
and including the microsporidians, which were removed from other analyses as these taxa are long branched and unstable across analyses.
Also, in the PhyloBayes analyses, Picobiliphyte sp. MS584-22 is sister to Rhizaria whereas in ML analyses, this taxon falls in Orphan clade
1 (Fig. 2) along with the cryptophytes, haptophytes, Roombia truncata, and Telonema subtile. Abbreviations are: Ap, Di, Ch, Pe = apicomplexa,
dinoflagellates, chromerids, and perkinsida and (Sp + Ct) + Me = sponges and ctenophores outside of the rest of the Metazoa.
from the 1554 (12 of which were from the 150 most even
genes) that are putative examples of EGT. We removed
these genes from some analyses to assess the impact of
EGT (e.g., 150-EGT and 1554-EGT).
Placement of Eukaryotes on the Tree of Life
To address the placement of eukaryotes on the tree
of cellular life while minimizing the impact of lateral
gene transfer (LGT), we concatenated the 207 genes from
the 1554 gene set that included bacteria and/or archaea,
and that yielded monophyletic eukaryotes. We removed
taxa with data in fewer than 20 genes and columns with
more than 50% missing data. The resulting alignment
had 442 taxa and 17,220 characters. This alignment was
analyzed under the PROTGAMMALG model of RAxML
with rapid bootstrapping of 100 replicates, followed by
full maximum likelihood reconstruction.
RESULTS AND DISCUSSION
Phylogenomic Data Set Generated for this Study
Our pipeline yielded an alignment of 1554 genes
from 232 eukaryotes, and up to 486 bacteria and 84
archaea (Figs. 1 and 2; supplementary Tables S1 and S2).
Among the eukaryotes, the focus of this study, available
data are not evenly distributed among lineages, with
overrepresentation of genes from macroscopic clades
(e.g., animals (Op_me), fungi (Op_fu), green algae,
including land plants (Pl_gr)) and microbial clades with
many whole-genome sequences (e.g., Apicomplexa such
as Plasmodium (Sr_ap), Euglenozoa like Trypanosoma and
Leishmania (Ex_eu)). Other clades are underrepresented
in our data set, either because there are few molecular
data for members (e.g., Amoebozoa:Dactylopodida
(Am_da) and Amoebozoa:Filamoeba (Am_fi)) and/or
because these clades are relatively species poor (e.g.,
Excavata:Oxymonadida (Ex_ox) and SAR:Perkinsida
(Sr_pe)).
Analyses of 150 Most Evenly Sampled Genes Yield Robust
Estimate of the Eukaryotic Tree of Life
We put the bulk of our effort into analyses of the
150 genes that were most evenly sampled across major
clades, representing a data set of 36,346 characters
(Fig. 1 and Table 1 analysis “150”). The resulting
topology supports five major clades of eukaryotes
(Fig. 2)—Amoebozoa, Excavata (without the genus
Malawimonas), Opisthokonta, Archaeplastida, and
SAR—and is consistent with a subset of previous
analyses that used fewer taxa (Hampl et al. 2009) or
fewer genes (Yoon et al. 2008; Parfrey et al. 2010). The
placement of Malawimonas has also been unstable in
other analyses (Simpson et al. 2006; Hampl et al. 2009;
410
SYSTEMATIC BIOLOGY
VOL. 64
FIGURE 2. Most likely tree from 150 most even genes, including 232 eukaryotes and 36,346 characters, reveals five major clades of eukaryotes
and several “orphan” lineages (lineages lacking clear sister taxa at this time). The tree is drawn rooted between Opisthokonta and the remaining
eukaryotic clades, though we recognize that the root of the eukaryotic tree of life is under debate. Subclades indicated in bold are monophyletic
and include all subclades except the diatoms as the single stramenopile taxon, Vaucheria litorea, interrupts the monophyly of diatoms in this
analysis. Notations on branches indicate full BS (asterisk) or ≥80% BS (filled circles). This analysis is entitled “150” in Table 1, and abbreviations
are as in Figure 1.
2015
KATZ AND GRANT—TAXON-RICH PHYLOGENOMICS OF EUKARYOTES
Derelle and Lang 2011; Zhao et al. 2012), suggesting that
this genus may not belong to the major clade Excavata
but may instead represent an orphan lineage (i.e., one
without clear position given current taxon sampling).
In a RAxML analysis, BS for these major clades varied
from 100% for Opisthokonta, to more than 80% for
Amoebozoa, Archaeplastida, and Excavata (without
Malawimonas), to 66% for SAR (Fig. 2 and Table 1 analysis
“150”).
Nested relationships within major eukaryotic clades
are generally concordant with previous phylogenies,
with monophyly indicated by bold text in Figure 2.
Monophyletic clades include animals, fungi, green algae,
embryophytes, red algae, brown algae, and apicomplexa
(Table 1; online Supplementary Data). The one exception
is that the diatoms are not monophyletic as the single
representative of the golden algae, Vaucheria litorea, falls
on a long branch within the diatoms (Fig. 2). Analyses
with varying taxon inclusion and software (i.e., RaxML,
ExaML, and PhyloBayes) generate similar topologies
(Table 1). Removing rapidly evolving sites from these
alignments using TIGER (Cummins and McInerney
2011) yields consistent topology with lower support for
most clades. We depict the topology rooted between
Opisthokonta and the remaining eukaryotes (consistent
with some previous studies (Stechmann and CavalierSmith 2002; Derelle and Lang 2011; Katz et al. 2012),
though we recognize the debate around the placement
of the root of the eukaryotic tree of life.
As further evidence of the robustness of analyses
of the 150 most even genes, several hypotheses of
relationships within major clades are supported. For
example, we recover the monophyly of ascomycetes
and basidiomycetes within the fungi (Hibbett et al.
2007) and the Discoba within Excavata, which contains
Heterolobosea, Euglenozoa, and Jakobida (Fig. 2 and
Supplementary Table S3; Hampl et al. 2009). We also
assessed the controversial topology for early diverging
animal lineages and found that our analyses yield a
topology for Metazoa with a clade containing a sponge
(Oscarella) and ctenophore (Mnemiopsis) as sister to
all remaining animals (Fig. 2 and Table 1), a finding
consistent with some recent phylogenomic analyses
(Dunn et al. 2008; Hejnol et al. 2009).
Our taxon rich analyses also highlight the presence of
several clades of “orphan” lineages, lineages that lack
clear sister taxa and/or are unstable in position across
our trees. The first of these clades, “Orphan clade 1”
(Fig. 2), contains several photosynthetic lineages such
as the small (∼6m) picobiliphytes, the katablepharid
taxon Roombia truncata, and the incertae sedis taxon
Telonema subtile. These lineages fall sister to the
haptophytes and cryptophytes, two lineages previously
considered to be part of the now falsified taxon
“Chromalveolata” (Parfrey et al. 2010). The position of
these taxa is controversial, in part because some have
been placed in the “Hacrobia”, which in turn is a taxon
with changing definition (Okamoto et al. 2009; Burki
et al. 2012). A second group of orphans, “Orphan clade 2”
(Fig. 2), includes the controversial genus Malawimonas,
411
that has been argued to belong to Excavata, though
it is often found outside of this major clade (Simpson
et al. 2006; Hampl et al. 2009; Derelle and Lang 2011;
Zhao et al. 2012). In many of our analyses, Malawimonas
falls sister to Collodictyon triciliatum (Zhao et al. 2012),
though the placement of these two taxa is unstable (see
online Supplementary Data). The final orphan clade
includes two taxa previously identified as a potential
novel clade of eukaryotes, S. tetraspora and Breviata
anathema (Walker et al. 2006; Katz et al. 2011; Grant
et al. 2012), and Thecamonas (Amastigomonas) trahens, a
member of the Apusozoa (Cavalier-Smith 1997; CavalierSmith and Chao 2003). The position of these lineages
varies by analyses (online supplementary Data).
Analyses of 1554 Eukaryotic Genes
We also assessed relationships using the full data set
of 1554 genes, though the large size here (232 eukaryotes
× 493,050 characters) made analyses challenging. For
example, full RaxML analyses with bootstraps were not
possible even with the high-powered computing that
we accessed through CIPRES (Miller et al. 2010) and
University Florida Research Computing Center. Instead,
we used ExaML, despite differences in reconstructions
based on these analyses as compared to RaxML;
ExaML generally fails to recover monophyly of Excavata
(even without Malawimonas) or Amoebozoa because
Archamoeba go within Excavata, sister to a clade
containing Fornicata and parabasalids (Table 1 analysis
“1554”, online Supplementary Data). Given the growing
body of evidence supporting these clades (e.g., Hampl
et al. 2009; Parfrey et al. 2010; Adl et al. 2012), we suspect
that lack of monophyly of Amoebozoa and Excavata
(with or without the controversial Malawimonas) may
be due to differences in implementation of ML analysis
in ExaML when compared with RaxML. Beyond
differences that might be attributed to software package,
the 1554 analyses are generally concordant with analyses
of 150 most evenly sampled genes in that we recover
Opisthokonta, Archaeplastida, SAR, and many of the
nested clades (Table 1).
Analyses Including Bacterial and/or Archaeal Orthologs
Inclusion of bacterial and archaeal orthologs enabled
us to look at the impact of genes of endosymbiotic
origin and the placement of eukaryotes on the tree of
cellular life. We identified 122 genes where homologs
in photosynthetic eukaryotes appear to be descended
from the cyanobacterial ancestor of plastids (see dashes
at bottom of Fig. 1, and Supplementary Table S2). These
genes were identified by the placement of photosynthetic
eukaryotes sister to at least 1 of the 10 cyanobacteria
included in our pipeline. Removal of 12 potential EGT
genes from the 150 analyses had little impact on the
structure of the tree except for in a few places (compare
“150 euks” to “150-EGT” in Table 1). Notably, the support
for the monophyly of Archaeplastida dropped from 82%
412
SYSTEMATIC BIOLOGY
VOL. 64
BS to 40% BS and within Archaeplastida green algae
are sister to glaucophytes (75% BS) instead of sister to
red algae (67% BS). Removal of 122 potential EGT genes
from the 1554 had the same topological effect, though
the size of this data set constrained us to use ExaML
(Table 1). These data indicate that inclusion of just a few
genes impacted by EGT can have a substantial impact
on interpretation of evolutionary relationships among
photosynthetic taxa.
To evaluate the placement of eukaryotes on the tree of
life and to explore root of eukaryotes, we concatenated
genes that met two criteria: they included bacteria
and/or archaea and generated single gene trees with
monophyletic eukaryotes. We restricted ourselves to
the trees with monophyletic eukaryotes as a means of
lessening the impact of both EGT and LGT on our
analyses. Phylogenetic reconstructions of the resulting
207 genes revealed that eukaryotes emerge as a lineage
of archaea, sister to species in the archaeal clade
Thaumarchaeota (Fig. 3). This result is consistent with
multiple analyses that challenge the validity of the three
domain hypotheses (e.g., Lake et al. 1984; Tourasse and
Gouy 1999; Brown et al. 2001; Williams et al. 2013),
including analyses that find eukaryotes emerging from
among the “TACK” (Thaumarchaeota, Aigarchaeota,
Crenarchaeota, and Korarchaeota) clade of archaea
(Williams et al. 2012). Within eukaryotes, the root
appears between Fornicata (Excavata such as Giardia
lamblia), Archamoeba, and the remaining eukaryotes.
However, this root may be due to systematic error such
as: 1) neither Amoebozoa nor Excavata are monophyletic
in these analyses; and 2) ‘long branch’ lineages such as
Trichomonas vaginalis and Entamoeba spp. fall toward root
of tree (Fig. 3).
More Genes or More Sites?
Subsampling phylogenomic data for sites chosen
at random yields more robust phylogenies than
subsampling by gene (Fig. 4). The large size of our data
set enabled us to assess clade stability by subsampling
characters to build matrices of varying sizes. We took
two approaches here, assessing the consistency of nodes
from 20 data sets of a given size created using a
subsampling of either genes or randomly chosen sites
from the supermatrix. In nearly all cases, nodes were
more stable in analyses of randomly chosen sites than in
matrices of the same size that consisted of gene segments
(Fig. 4). For example, 10,000 characters were sufficient
to yield monophyletic Opisthonkonta in 19 of 20 trees
when subsampling by site but only half of the trees
generated with same-sized subsamples of genes (Fig. 4
and Supplementary Table S4). Similarly, the monophyly
of SAR is more stable in subsamples by site; in data
sets of 30,000 characters, SAR is recovered in 19 of 20
trees sampled by site and only 12 of 20 trees by gene
(Supplementary Table S4).
Our findings at a phylogenomic scale are consistent
with analyses of smaller data sets. For example, the same
FIGURE 3.
Phylogenetic analyses of 207 genes that included
bacteria/archaea and had a monophyletic clade of eukaryotes place
eukaryotes as a lineage of archaea. Topology generated in RaxML based
on analysis of 17,220 sites. Only monophyletic groups are labeled, with
the notable exception of the paraphyletic Archaea. Bootstrap values
are shown for the monophyly of eukaryotes (100%) and the sister
relationship between archaea and Thaumarchaeota (79%). A full tree
with species names and all bootstrap values can be found as a Newick
string in the online Supplementary Data.
inference on the greater power of random sites when
compared with gene segments was found analyzing loci
within fully linked mitochondrial genomes (Cummings
et al. 1995, 1999). One explanation is that the nonindependence of sites within a gene can lead to
2015
KATZ AND GRANT—TAXON-RICH PHYLOGENOMICS OF EUKARYOTES
413
tree of life, including identifying major clades plus
several clusters of “orphan” lineages (lineages without
clear sister taxa). Inclusion of bacterial and archaeal
sequences also enabled us to investigate the impact
of lateral gene transfer from plastids to host nuclei
(EGT) on estimates of evolutionary relationships among
photosynthetic eukaryotes. Finally, we demonstrate that
random sampling sites provides better estimates of
phylogeny when compared with data sets of the same
size made from gene segments.
SUPPLEMENTARY MATERIAL
Supplementary material related to this manuscript
has been deposited on Treebase (eukaryotes only
concatenated alignments and trees, due to size
constraints; TB2:S16260) and Dryad (supplementary
data tables, plus all concatenated alignments used for
analyses and their associated trees; http://dx.doi.org/
10.5061/dryad.db78g).
FUNDING
This work was support by the National Science
Foundation AVAToL grant [DEB-1208741 to L.A.K.].
ACKNOWLEDGMENTS
FIGURE 4.
Subsampling by sites chosen at random yields
more robust phylogenetic estimates than subsampling in gene pieces,
based on assessment of monophyly of clades in 20 trees produced for
each sampling strategy and data set size. Analyses are divided into
major clades (top panel), nested hypotheses (mid panel), and groups
with morphological synapomorphies (lower panel). Warmer (darker)
colors represent greater numbers of trees supporting monophyly, and
numerical values are in Supplementary Table S4. Clades recovered
significantly more often when sampling by site than by gene are
marked with asterisk (P < 0.05 as assessed by a two-tailed t-test
implemented in R (http://www.r-project.org/)). Abbreviations are:
Ap, Di, Ch, Pe = apicomplexa, dinoflagellates, chromerids, and
perkinsida and (Sp + Ct) + Me = sponges and ctenophores together,
outside of the rest of the metazoan.
inaccurate phylogenetic estimates (Cummings et al. 1995,
1999), which contributed to the suggestion that sampling
multiple, unlinked genes from various parts of the
genome will help overcome this effect (Cummings and
Meyer 2005). However, our sampling of genes chosen at
random with respect to their position in genomes did not
perform as well as sampling random sites from within
these unlinked genes (Fig. 4). Intriguingly, the impact
of the sampling effect was greatest on deeper nodes
(estimated at ∼1 billion years or more), suggesting the
need to rethink sampling approaches for studies of early
evolutionary events.
SYNTHESIS
Using a data set of up to 802 species and 1554 genes, we
are able to construct a robust estimate for the eukaryotic
The authors are grateful to Gordon Burleigh
(University of Florida), T. Martin Embley (Newcastle
University), Tom Williams (Newcastle University)
for help with computer resources and conversations
about the work presented here, and to Jean-David
Grattepanche (Smith College) for advice on statistics.
They also thank to Mark Miller and Wayne Pfeiffer at
CIPRES and Charles Taylor and Oleksandr Moskalenko
at the University of Florida for support in using
computer clusters. The authors also thank William F.
Martin (University of Düsseldorf) and two anonymous
reviewers for comments on earlier versions of this
manuscript. The Gordon and Betty Moore Foundation
provided sequence data and assembly through the
Marine Microbial Eukaryote Transcriptome project
(http://marinemicroeukaryotes.org/; last accessed
September 1, 2015).
REFERENCES
Adl S.M., Simpson A.G.B., Farmer M.A., Andersen R.A., Anderson
O.R., Barta J.R., Bowser S.S., Brugerolle G., Fensome R.A., Fredericq
S., James T.Y., Karpov S., Kugrens P., Krug J., Lane C.E., Lewis
L.A., Lodge J., Lynn D.H., Mann D.G., McCourt R.M., Mendoza
L., Moestrup O., Mozley-Standridge S.E., Nerad T.A., Shearer C.A.,
Smirnov A.V., Spiegel F.W., Taylor M. 2005. The new higher level
classification of eukaryotes with emphasis on the taxonomy of
protists. J. Eukaryot. Microbiol. 52:399–451.
Adl S.M., Simpson A.G.B., Lane C.E., Lukes J., Bass D., Bowser
S.S., Brown M.W., Burki F., Dunthorn M., Hampl V., Heiss A.,
Hoppenrath M., Lara E., le Gall L., Lynn D.H., McManus H.,
Mitchell E.A.D., Mozley-Stanridge S.E., Parfrey L.W., Pawlowski J.,
Rueckert S., Shadwick L., Schoch C.L., Smirnov A., Spiegel F.W.
414
SYSTEMATIC BIOLOGY
2012. The revised classification of eukaryotes. J. Eukaryot. Microbiol.
59:429–493.
Andersson J.O. 2005. Lateral gene transfer in eukaryotes. Cell. Mol.
Life Sci. 62:1182–1197.
Archibald J.M. 2009. The puzzle of plastid evolution. Curr. Biol. 19:
R81–R88.
Berger S.A., Krompass D., Stamatakis A. 2011. Performance, accuracy,
and web server for evolutionary placement of short sequence reads
under maximum likelihood. Syst. Biol. 60:291–302.
Brown J.R., Douady C.J., Italia M.J., Marshall W.E., Stanhope M.J. 2001.
Universal trees based on large combined protein sequence data sets.
Nat. Genet. 28:281–285.
Burki F., Kudryavtsev A., Matz M.V., Aglyamova G.V., Bulman S.,
Fiers M., Keeling P.J., Pawlowski J. 2010. Evolution of Rhizaria: new
insights from phylogenomic analysis of uncultivated protists. BMC
Evol. Biol. 10:18.
Burki F., Okamoto N., Pombert J.F., Keeling P.J. 2012. The evolutionary
history of haptophytes and cryptophytes: phylogenomic evidence
for separate origins. Proc. Roy. Soc. B Biol. Sci. 279:2246–2254.
Cavalier-Smith T. 1997. Amoeboflagellates and mitochondrial
cristae in eukaryote evolution: megasystematics of the new
protozoan subkingdoms Eozoa and Neozoa. Arch. Protistenkd. 147:
237–258.
Cavalier-Smith T. 2000. Membrane heredity and early chloroplast
evolution. Trends Plant Sci. 5:173–182.
Cavalier-Smith T. 2010. Origin of the cell nucleus, mitosis and sex: roles
of intracellular coevolution. Biol. Direct 5:7.
Cavalier-Smith T., Chao E.E.Y. 2003. Phylogeny of choanozoa,
apusozoa, and other protozoa and early eukaryote megaevolution.
J. Mol. Evol. 56:540–563.
Chen F., Mackey A.J., Stoeckert C.J., Roos D.S. 2006. OrthoMCLDB: querying a comprehensive multi-species collection of ortholog
groups. Nucleic Acids Res. 34:D363–D368.
Cotton J.A., McInerney J.O. 2010. Eukaryotic genes of archaebacterial
origin are more important than the more numerous eubacterial
genes, irrespective of function. Proc. Natl Acad. Sci. U. S. A.
107:17252–17255.
Cummings M.P., Meyer A. 2005. Magic bullets and golden rules: data
sampling in molecular phylogenetics. Zoology 108:329–336.
Cummings M.P., Otto S.P., Wakeley J. 1995. Sampling properties of
DNA sequence data in phylogenetic analysis. Mol. Biol. Evol. 12:
814–822.
Cummings M.P., Otto S.P., Wakeley J. 1999. Genes and other samples
of DNA sequence data for phylogenetic inference. Biol. Bull. 196:
345–347.
Cummins C.A., McInerney J.O. 2011. A method for inferring the
rate of evolution of homologous characters that can potentially
improve phylogenetic inference, resolve deep divergence and
correct systematic biases. Syst. Biol. 60:833–844.
Cuomo C.A., Desjardins C.A., Bakowski M.A., Goldberg J., Ma A.T.,
Becnel J.J., Didier E.S., Fan L., Heiman D.I., Levin J.Z., Young S.,
Zeng Q., Troemel E.R. 2012. Microsporidian genome analysis reveals
evolutionary strategies for obligate intracellular growth. Genome
Res. 22:2478–2488.
Dacks J.B., Marinets A., Doolittle W.F., Cavalier-Smith T., Logsdon J.M.
2002. Analyses of RNA polymerase II genes from free-living protists:
phylogeny, long branch attraction, and the eukaryotic big bang. Mol.
Biol. Evol. 19:830–840.
Darriba D., Taboada G.L., Doallo R., Posada D. 2011. ProtTest 3: fast
selection of best-fit models of protein evolution. Bioinformatics
27:1164–1165.
Delsuc F., Brinkmann H., Philippe H. 2005. Phylogenomics and the
reconstruction of the tree of life. Nat. Rev. Genet. 6:361–375.
Delwiche C.F. 1999. Tracing the tread of plastid diversity through the
tapestry of life. Am. Nat. 154:S164–S177.
Derelle R., Lang F.B. 2011. Rooting the eukaryotic tree with
mitochondrial and bacterial proteins. Mol. Biol. Evol. 29:1277–1289.
Dunn C.W., Hejnol A., Matus D.Q., Pang K., Browne W.E., Smith
S.A., Seaver E., Rouse G.W., Obst M., Edgecombe G.D., Sorensen
M.V., Haddock S.H.D., Schmidt-Rhaesa A., Okusu A., Kristensen
R.M., Wheeler W.C., Martindale M.Q., Giribet G. 2008. Broad
phylogenomic sampling improves resolution of the animal tree of
life. Nature 452:745–U745.
VOL. 64
Foster P.G. 2004. Modeling compositional heterogeneity. Syst. Biol.
53:485–495.
Giribet G., Edgecombe G.D. 2012. Reevaluating the arthropod tree of
life. Annu. Rev. Entomol. 57:167–186.
Grant J.R., Lahr D.J.G., Rey F.E., Burleigh J.G., Gordon J.I.,
Knight R., Molestina R.E., Katz L.A. 2012. Gene discovery
from a pilot study of the transcriptomes from three diverse
microbial eukaryotes: Corallomyxa tenera, Chilodonella uncinata, and
Subulatomonas tetraspora. Protist Genomics. 1:3–18.
Grant J.R., Katz L.A. 2014. Building a phylogenomic pipeline for the
eukaryotic tree of life—addressing deep phylogenies with genomescale data. PLoS Curr. 6.
Hampl V., Hug L., Leigh J.W., Dacks J.B., Lang B.F., Simpson
A.G.B., Roger A.J. 2009. Phylogenomic analyses support the
monophyly of Excavata and resolve relationships among eukaryotic
“supergroups”. Proc. Natl. Acad. Sci. U. S. A. 106:3859–3864.
Heath T.A., Hedtke S.M., Hillis D.M. 2008. Taxon sampling and the
accuracy of phylogenetic analyses. J. Syst. Evol. 46:239–257.
Hejnol A., Obst M., Stamatakis A., Ott M., Rouse G.W., Edgecombe
G.D., Martinez P., Baguna J., Bailly X., Jondelius U., Wiens M., Muller
W.E.G., Seaver E., Wheeler W.C., Martindale M.Q., Giribet G., Dunn
C.W. 2009. Assessing the root of bilaterian animals with scalable
phylogenomic methods. Proc. Roy. Soc. B Biol. Sci. 276:4261–4270.
Hibbett D.S., Binder M., Bischoff J.F., Blackwell M., Cannon P.F.,
Eriksson O.E., Huhndorf S., James T., Kirk P.M., Lücking R.,
Thorsten Lumbsch H., Lutzoni F., Matheny P.B., McLaughlin D.J.,
Powell M.J., Redhead S., Schoch C.L., Spatafora J.W., Stalpers J.A.,
Vilgalys R., Aime M.C., Aptroot A., Bauer R., Begerow D., Benny
G.L., Castlebury L.A., Crous P.W., Dai Y.-C., Gams W., Geiser D.M.,
Griffith G.W., Gueidan C., Hawksworth D.L., Hestmark G., Hosaka
K., Humber R.A., Hyde K.D., Ironside J.E., Kõljalg U., Kurtzman
C.P., Larsson K.-H., Lichtwardt R., Longcore J., Miadlikowska J.,
Miller A., Moncalvo J.-M., Mozley-Standridge S., Oberwinkler F.,
Parmasto E., Reeb V., Rogers J.D., Roux C., Ryvarden L., Sampaio
J.P., Schüssler A., Sugiyama J., Thorn R.G., Tibell L., Untereiner W.A.,
Walker C., Wang Z., Weir A., Weiss M., White M.M., Winka K., Yao
Y.-J., Zhang N. 2007. A higher-level phylogenetic classification of the
Fungi. Mycol. Res. 111:509–547.
Hug L.A., Stechmann A., Roger A.J. 2010. Phylogenetic distributions
and histories of proteins involved in anaerobic pyruvate metabolism
in eukaryotes. Mol. Biol. Evol. 27:311–324.
Katz L.A. 2012. Origin and diversification of eukaryotes. Annu. Rev.
Microbiol. 66:411–427.
Katz L.A., Grant J.R., Parfrey L.W., Burleigh J.G. 2012. Turning the
crown upside down: gene tree parsimony roots the eukaryotic tree
of life. Syst. Biol. 61:653–660.
Katz L.A., Grant J.R., Parfrey L.W., Gant A., O’Kelly C.J., Anderson
O.R., Molestina R.E., Nerad T. 2011. Subulatomonas tetraspora nov.
gen. nov. sp. is a member of a previously unrecognized major clade
of eukaryotes. Protist 162:762–773.
Keeling P.J., Palmer J.D. 2008. Horizontal gene transfer in eukaryotic
evolution. Nat. Rev. Genet. 9:605–618.
Knoll A.H., Javaux E.J., Hewitt D., Cohen P. 2006. Eukaryotic organisms
in Proterozoic oceans. Philos. Trans. R. Soc. B Biol. Sci. 361:1023–1038.
Kumar S., Filipski A.J., Battistuzzi F.U., Pond S.L.K., Tamura K. 2012.
Statistics and truth in phylogenomics. Mol. Biol. Evol. 29:457–472.
Lake J.A., Henderson E., Oakes M., Clark M.W. 1984. Eocytes: a new
ribosome structure indicates a kingdom with a close relationship to
eukaryotes. Proc. Natl. Acad. Sci. U. S. A. 81:3786–3790.
Lartillot N., Lepage T., Blanquart S. 2009. PhyloBayes 3: a Bayesian
software package for phylogenetic reconstruction and molecular
dating. Bioinformatics 25:2286–2288.
Li L., Stoeckert C.J. Jr., Roos D.S. 2003. OrthoMCL: identification of
ortholog groups for eukaryotic genomes. Genome Res. 13:2178–2189.
Martin W., Brinkmann, H., Savonna, C., Cerff R. 1993. Evidence
for a chimeric nature of nuclear genomes: eubacterial origin
of eukaryotic glyceraldehyde-3-phosphate dehydrogenase genes.
Proc. Natl. Acad. Sci. U. S. A. 90:8692–8698.
Martin W., Rujan T., Richly E., Hansen A., Cornelsen S., Lins T., Leister
D., Stoebe B., Hasegawa M., Penny D. 2002. Evolutionary analysis
of Arabidopsis, cyanobacterial, and chloroplast genomes reveals
plastid phylogeny and thousands of cyanobacterial genes in the
nucleus. Proc. Natl Acad. Sci. U. S. A. 99:12246–12251.
2015
KATZ AND GRANT—TAXON-RICH PHYLOGENOMICS OF EUKARYOTES
Miller M.A., Pfeiffer W., Schwartz T. 2010. Creating the CIPRES science
gateway for inference of large phylogenetic trees. Proceedings of
the Gateway Computing Environments Workshop (GCE). New
Orleans, LA. p. 1–8.
Nowack E.C.M., Melkonian M. 2010. Endosymbiotic associations
within protists. Philos. Trans. R. Soc. B Biol. Sci. 365:699–712.
Okamoto N., Chantangsi C., Horak A., Leander B.S., Keeling P.J. 2009.
Molecular phylogeny and description of the novel katablepharid
Roombia truncata gen. et sp nov., and establishment of the Hacrobia
taxon nov. PLoS One 4:e7080.
Parfrey L.W., Grant J., Tekle Y.I., Lasek-Nesselquist E., Morrison
H.G., Sogin M.L., Patterson D.J., Katz L.A. 2010. Broadly sampled
multigene analyses yield a well-resolved eukaryotic tree of life. Syst.
Biol. 59:518–533.
Parfrey L.W., Lahr D.J.G., Knoll A.H., Katz L.A. 2011. Estimating the
timing of early eukaryotic diversification with multigene molecular
clocks. Proc. Natl Acad. Sci. U. S. A. 108:13624–13629.
Penn O., Privman E., Ashkenazy H., Landan G., Graur D., Pupko T.
2010. GUIDANCE: a web server for assessing alignment confidence
scores. Nucleic Acids Res. 38:W23–W28.
Philippe H., Delsuc F., Brinkmann H., Lartillot N. 2005. Phylogenomics.
Annu. Rev. Ecol. Evol. Sci. 36:541–562.
Sierra R., Matz M.V., Aglyamova G., Pillet L.Ø., Decelle J., Not F.,
de Vargas C., Pawlowski J. 2013. Deep relationships of Rhizaria
revealed by phylogenomics: a farewell to Haeckel’s Radiolaria. Mol.
Phylogenet. Evol. 67:53–59.
Simpson A.G.B., Inagaki Y., Roger A.J. 2006. Comprehensive multigene
phylogenies of excavate protists reveal the evolutionary positions of
“primitive” eukaryotes. Mol. Biol. Evol. 23:615–625.
Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based
phylogenetic analyses with thousands of taxa and mixed models.
Bioinformatics 22:2688–2690.
Stamatakis A., Aberer A.J. 2013. Novel parallelization schemes for
large-scale likelihood-based phylogenetic inference. Proceedings
of the 2013 IEEE 27th International Symposium on Parallel and
Distributed Processing, IEEE Computer Society. p. 1195–1204.
Stamatakis A., Hoover P., Rougemont J. 2008. A rapid bootstrap
algorithm for the RAxML web-servers. Syst. Biol. 57:758–771.
Stamatakis A., Ott M., Ludwig T. 2005. RAxML-OMP: an efficient
program for phylogenetic inference on SMPs. Parallel Comput.
Technol. 3606:288–302.
415
Stechmann A., Cavalier-Smith T. 2002. Rooting the eukaryote tree by
using a derived gene fusion. Science 297:89–91.
Szklarczyk R., Huynen M.A. 2010. Mosaic origin of the mitochondrial
proteome. Proteomics 10:4012–4024.
Thiergart T., Landan G., Schenk M., Dagan T., Martin W.F. 2012. An
evolutionary network of genes present in the eukaryote common
ancestor polls genomes on eukaryotic and mitochondrial origin.
Genome Biol. Evol. 4:466–485.
Torruella G., Derelle R., Paps J., Lang B.F., Roger A.J., ShalchianTabrizi K., Ruiz-Trillo I. 2012. Phylogenetic relationships within
the Opisthokonta based on phylogenomic analyses of conserved
single-copy protein domains. Mol. Biol. Evol. 29:531–544.
Tourasse N.J., Gouy M. 1999. Accounting for evolutionary rate variation
among sequence sites consistently changes universal phylogenies
deduced from rRNA and protein-coding genes. Mol. Phylogenet.
Evol. 13:159–168.
Walker G., Dacks J.B., Embley T.M. 2006. Ultrastructural description
of Breviata anathema, n. Gen., n. Sp., the organism previously
studied as “Mastigamoeba invertens”. J. Eukaryot. Microbiol. 53:
65–78.
Wiens J.J. 2005. Can incomplete taxa rescue phylogenetic analyses from
long-branch attraction? Syst. Biol. 54:731–742.
Wiens J.J., Tiu J. 2012. Highly incomplete taxa can rescue phylogenetic
analyses from the negative impacts of limited taxon sampling. PLoS
One 7:e42925
Williams T.A., Foster P.G., Cox C.J., Embley T.M. 2013. An archaeal
origin of eukaryotes supports only two primary domains of life.
Nature 504:231–236.
Williams T.A., Foster P.G., Nye T.M., Cox C.J., Embley T.M. 2012.
A congruent phylogenomic signal places eukaryotes within the
Archaea. Proc. Roy. Soc. B Biol. Sci. 279:4870–4879.
Yoon H.S., Grant J., Tekle Y.I., Wu M., Chaon B.C., Cole J.C.,
Logsdon J.M., Patterson D.J., Bhattacharya D., Katz L.A. 2008.
Broadly sampled multigene trees of eukaryotes. BMC Evol. Biol.
8:14.
Zhao S., Burki F., Brate J., Keeling P.J., Klaveness D., Shalchian-Tabrizi
K. 2012. Collodictyon—an ancient lineage in the tree of eukaryotes.
Mol. Biol. Evol. 29:1557–1568.
Zufall R.A., McGrath C., Muse S.V., Katz L.A. 2006. Genome
architecture drives protein evolution in ciliates. Mol. Biol. Evol.
23:1681–1687.