
Integrating Markov Clustering and Molecular Phylogenetics to Reconstruct the
Cyanobacterial Species Tree from Conserved Protein Families
Wesley D. Swingley,* Robert E. Blankenship,† and Jason Raymond‡
*Institute of Low Temperature Science, Hokkaido University, Sapporo, Japan; †Departments of Biology and Chemistry, Washington
University, St Louis, MO; and ‡School of Natural Sciences, University of California, Merced
Attempts to classify living organisms by their physical characteristics are as old as biology itself. The advent of protein
and DNA sequencing—most notably the use of 16S ribosomal RNA—defined a new level of classification that now
forms our basic understanding of the history of life on earth. High-throughput sequencing currently provides DNA
sequences at an unprecedented rate, not only providing a wealth of information but also posing considerable analytical
challenges. Here we present comparative genomics–based methods useful for automating evolutionary analysis between
any number of species. As a practical example, we applied our method to the well-studied cyanobacterial lineage. The 24
cyanobacterial genomes compared here occupy a wide variety of environmental niches and play major roles in global
carbon and nitrogen cycles. By integrating phylogenetic data inferred for upward of 1,000 protein-coding genes common
to all or most cyanobacteria, we have reconstructed an evolutionary history of the phylum, establishing a framework for
resolving key issues regarding the evolution of their metabolic and phenotypic diversity. Greater resolution on individual
branches can be attained by telescoping inward to the larger set of conserved proteins between fewer taxa. The
construction of all individual protein phylogenies allows for quantitative tree scoring, providing insight into the
evolutionary history of each protein family as well as probing the limits of phylogenetic resolution. The tools
incorporated here are fast, computationally tractable, and easily extendable to other phyla and provide a scalable
framework for contrasting and integrating the information present in thousands of protein-coding genes within related
genomes.
Introduction
Although the 16S ribosomal RNA (rRNA) paradigm
continues to provide a strong framework for understanding
evolution, it represents only one small piece of an organism’s history. The exponentially increasing number of genome sequencing projects is pushing our understanding of
diversity well beyond the limitations of the single-gene
proxy. Integrating the enormous wealth of genetic information—hundreds to tens of thousands of genes per genome—stands as one of the central challenges to biology
in the 21st century. Ultimately, an evolutionary tree will
be available for every (nonnovel) gene from every sequenced genome, providing a temporal and cross-species
blueprint of how Darwinian evolution has brought these
genes together into an organism able to thrive in its particular niche.
The goal of phylogenomics has recently been the subject of a number of novel and provocative approaches
(Eisen 1998; Lerat et al. 2003; Rivera and Lake 2004;
Delsuc et al. 2005; Snel et al. 2005). Although insightful,
their results are often quite controversial; for example, some
strongly support the canonical tree of life as deduced by
16S rRNA analysis, whereas others suggest striking rearrangements to this orthodoxy (Wolf et al. 2002; Charlebois
et al. 2003; Doolittle 2005; Ciccarelli et al. 2006). Perhaps
the best developed and most rigorously tested of these
methods, molecular phylogeny, has been difficult to implement, primarily because of the computational challenges of constructing gene trees from very large data sets. Furthermore,
single-gene phylogenies are complex by default, often reflecting nonvertical evolution due to horizontal gene transfer, gene duplication (paralogy), and loss (Gogarten and Townsend 2005). Deep phylogenies are especially prone
to poor resolution due to sequence divergence. In particular,
although 1 set of homologous genes or proteins may be
quite useful in resolving species- or genus-level relationships, it might be quite poor at resolving phylum-level relationships due to poor conservation or short sequence
length.
Key words: genomics, cyanobacteria, evolution, Markov clustering,
phylogenomics.
E-mail: [email protected].
Mol. Biol. Evol. 25(4):643–654. 2008
doi:10.1093/molbev/msn034
Advance Access publication February 22, 2008
© The Author 2008. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: [email protected]
In this work, we take a new approach integrating clustering and sequence analysis toward resolving an integrated
phylogeny spanning multiple taxonomic levels within a single phylum. Using all genomes available from a single
phylum, our approach combines the rigorous (maximum
likelihood) analysis of large numbers of orthologs, as well
as of concatenated sets of up to several hundred proteins
representing a large fraction of some genomes, and of consensus phylogenies based on single-protein trees. The ultimate goal is to determine, given the known role of
horizontal gene transfer particularly in prokaryote evolution
as well as the difficulty in resolving deep phylogenies,
whether a plurality phylogenetic signal exists that is both
consistent with, and potentially explanatory toward, systematic and taxonomic information about a group of organisms.
This phylum-first approach is well suited to the >10^3
ongoing genome projects for several reasons. First, most
phyla appear to be robustly defined based both on molecular methods, especially 16S, and on traditional systematics. Organisms within a phylum typically share unique
phenotypic traits that are variable enough to be both interesting and informative of the evolutionary process. Second,
by focusing first on resolving the distribution and phylogeny of single proteins, it is possible to select for subsequent
analysis those that are potentially most useful in resolving
relationships at different taxonomic levels. For example,
many proteins are not common to all organisms within
a clade and would be excluded from analyses of completely
conserved, or “core,” proteins, whereas they might be useful for determining relationships between subsets of organisms. Additionally, depending on factors such as length and
degree of conservation, some proteins give well-resolved
trees for only some taxonomic levels. Ribosomal proteins
often share 100% amino acid identity—and are thereby
phylogenetically uninformative—between members of
the same genus or species.
Working with a single phylum (as opposed to, say, all
3 domains of life) also prevents data sets from becoming
computationally intractable, especially when employing
maximum likelihood–based approaches. This methodology
can also be naturally extended into different taxonomic levels. Whereas some subset of proteins may be useful for resolving relationships within phyla, when needed, additional
proteins can be incorporated for reconstructing family-,
class-, or genus-level relationships by selecting only those
proteins conserved at these taxonomic levels. Understanding which proteins are adequate at resolving different taxonomic levels enables selection of proteins that are useful in
determining relationships between phyla—an ultimate goal
(and persistent shortcoming) in reconstructions of the tree
of life.
As an introductory example, we focus on the phylum
cyanobacteria, which is notable for sequencing projects
covering a wide swath of their enormous diversity as well
as for their evolutionary importance and time constraints on
their early evolution. The most ancient diagnostic markers
for any organism come in the form of chemical biomarkers
argued to have been left by cyanobacterial ancestors some
2.7 billion years ago, and the global-scale effects resulting
from the oxygen produced during cyanobacterial photosynthesis are seen in rocks ~2.43 billion years old and younger
(Summons et al. 1999; Farquhar et al. 2000; Knoll 2003;
Kopp et al. 2005). Ongoing and completed sequencing
projects include cyanobacteria from marine and freshwater
environments, thermophiles, nitrogen fixers, and symbionts. In addition to illustrating the robust evolutionary
resolution acquired using our method, we also seek to build
an extensible phylogenetic framework within which the evolution of this phenotypically diverse group of organisms can be
interpreted.
The long history of cyanobacterial systematics has
been confounded by morphology-based botanical classifications as well as difficulties in resolving closely related
species using 16S rRNA (Rippka et al. 1979; Fox et al.
1992; Castenholz 2001; Casamatta et al. 2005). Individual
genes and proteins conserved across all organisms or specifically in all cyanobacteria have been used to build phylogenies (Woese 1987; Giovannoni et al. 1988; Honda et al.
1999; Hess et al. 2001; Seo and Yokota 2003; Henson et al.
2004). Some subsets of cyanobacteria have also been compared extensively, particularly within the (genomically)
well-sampled Prochlorophyte clade (Hess 2004; Dufresne
et al. 2005). However, only a few studies thus far have assembled cyanobacterial phylogenies based on a larger set
of proteins conserved across all cyanobacteria. Martin
et al. (2002) examined several thousand genes from 3
then-available cyanobacteria to determine the evolutionary
history of nuclear genes from Arabidopsis thaliana, establishing the widescale impact that imported cyanobacterial genes
have had on the evolution of photosynthetic eukaryotes,
as well as plausible gene complements of chloroplast/
cyanobacterial ancestors. A Blast-based comparison of
8 cyanobacterial genomes (Martin et al.
2003) revealed 181 signature genes that do not have homologs in other organisms, roughly 3/4 of which had no ascribable function yet are clearly important in some aspect of
cyanobacterial lifestyle. Sanchez-Baracaldo et al. (2005)
more recently developed a method based on multigene concatenation combined with morphological character analysis
to construct and map traits onto a cyanobacterial species tree.
Additionally, a cyanobacterial phylogeny based on 31 proteins conserved across the entire tree of life was constructed
as part of a large-scale tree construction (Ciccarelli et al.
2006), but this study used only 8 cyanobacterial taxa and
the ribosomal proteins used for tree construction did not resolve terminal branches. A cluster of orthologous groups
(COG)–based analysis was used to determine the distribution of proteins in 15 complete cyanobacterial genomes, with
a particular focus on understanding the origin of photosynthesis (Mulkidjanian et al. 2006). However, the analysis did
not undertake phylogenetic analysis, either of individual protein families or in an attempt to resolve the evolution of the
phylum as a whole. Zhaxybayeva et al. (2006) have conducted the most extensive sampling of the phylum to date,
reconstructing histories of 1,128 protein-coding genes from
11 cyanobacterial genomes in order to build a plurality
tree based on quartet analysis.
In addition to constructing maximum likelihood trees
for a large number of orthologs from completed cyanobacterial genomes, we assembled concatenated alignments as
a further test of phylogenetic robustness. Importantly, variations in the concatenated alignment used resulted in 2 distinct but very highly supported phylogenies, suggesting that
even large, statistically well-supported concatenations can
converge on very different trees. To further test phylogenetic robustness, we used a tree consensus method to build
a single tree that best captures all single-protein phylogenies. Recent work (Gadagkar et al. 2005) has compared
the effectiveness of concatenated versus consensus methods for phylogenetic inference in the face of incongruent
signals (e.g., due to horizontal gene transfer, poor resolution, invalid model assumptions, or use of the same model
for all data sets). They found that concatenated phylogenies
outperform consensus phylogenies, though importantly
both methods can converge on incorrect trees when systematic biases are present in individual trees—for example,
when the evolutionary model used is a poor match to the
data. However, our consensus tree agrees exactly with
one of the trees inferred from concatenated alignments
and is compatible with modern cyanobacterial classification schemes that integrate both systematic and molecular information; our analysis also compares the results of multiple evolutionary models.
To further test, and potentially increase, resolution of
individual nodes on our concatenated/consensus genome
tree, we used a telescoping method whereby protein families that are conserved among a smaller number of very
closely related taxa can be taken into account. This proved
useful particularly in resolving relationships between the
very closely related marine Synechococcus and Prochlorococcus clades, which were clarified with exceptional
support by analyzing conserved protein families between
just these 2 groups. In cases such as these, the inverse
relationship between the number of conserved protein families and the number of taxa tends to yield a uniform total
number of phylogenetically informative characters.
Table 1
Genomes Analyzed in This Study
Organism                           Description                                National Center for Biotechnology Information Accession/in Progress (IP)
Gloeobacter violaceus              Terrestrial, lacks photosystem components  NC_005125
Thermosynechococcus elongatus      55 °C optimal growth (hot springs)         NC_004113
Anabaena variabilis                Heterocyst-forming diazotroph              NC_007413
Nostoc punctiforme                 Heterocyst-forming diazotroph              IP; NZ_AAAY00000000^a
Nostoc sp. PCC7120                 Heterocyst-forming diazotroph              NC_003272
Synechococcus elongatus PCC7942    Freshwater                                 NC_007604
S. elongatus PCC6301               Freshwater                                 NC_006576
Synechocystis sp. PCC6803          Freshwater                                 NC_000911
Synechococcus sp. OS A             Hot spring, diazotroph                     NC_007775
Synechococcus sp. OS B’            Hot spring, diazotroph                     NC_007776
Acaryochloris marina               Chlorophyll d-containing symbiont          NC_009925^b
Prochlorococcus marinus CCMP1375   Marine, unicellular                        NC_005042
P. marinus MED4                    Marine, unicellular                        NC_005072
P. marinus MIT9312                 Marine, unicellular                        NC_007577
P. marinus MIT9313                 Marine, unicellular                        NC_005071
P. marinus NATL2A                  Marine, unicellular                        NC_007335
Synechococcus sp. CC9605           Marine, unicellular                        NC_007516
Synechococcus sp. CC9902           Marine, unicellular                        NC_007513
Synechococcus sp. RS9917           Marine, unicellular                        IP; NZ_AANP00000000^b
Synechococcus sp. WH5701           Marine, unicellular                        IP; NZ_AANO00000000^b
Synechococcus sp. WH7805           Marine, unicellular                        IP; NZ_AAOK00000000^b
Synechococcus sp. WH8102           Marine, unicellular                        NC_005070
Trichodesmium erythraeum           Marine, filamentous diazotroph             NC_008312
Crocosphaera watsonii              Marine, unicellular diazotroph             IP; NZ_AADV00000000^a
Clostridium acetobutylicum         Gram positive (outgroup)                   NC_003030
Rhodopseudomonas palustris CGA009  Proteobacterium (outgroup)                 NC_005296
^a JGI microbial sequencing portal: http://genome.jgi-psf.org/mic_home.html.
^b J. Craig Venter Institute/University of Warwick collaborative sequencing project (preliminary sequence available via GenBank).
Finally, using these methods to model cyanobacterial
speciation provides a framework for understanding and explaining the distribution of cyanobacterial protein families.
Generating a robust “background” tree is crucial for framing key evolutionary events, including the origin and evolution of capabilities such as pigment biosynthesis, carbon
and nitrogen fixation, and provides insight into fundamental
evolutionary mechanisms such as niche adaptation, genome
reduction, and horizontal gene transfer. This approach can
be similarly extended to other phyla to provide a high-resolution framework, based on the totality of evolutionary
information from many protein families, which can be
linked together to assemble the tree of life.
Methods
All data are publicly available in the form of completed
or nearly complete genome sequences (table 1). The pipeline of methods used is diagrammed in figure 1. BlastP
comparisons (10^-4 cutoff, BLOSUM62, standard settings
for word size, gap opening/extension, and filtering) were
made between all protein sequences from the genomes
of 24 cyanobacteria and 2 non-cyanobacterial outgroups
(see table 1), representing all complete plus a diverse set
of nearly complete cyanobacterial genomes, and outgroups from
well-sampled bacterial phyla (proteobacteria and Gram
positive bacteria). To generate first-pass protein families,
Markov clustering (Enright et al. 2002) was performed iteratively on a matrix generated from Blast e values. To optimize clustering results, inflation parameters ranging from
1.2 to 20.0 were used, with resultant protein family/cluster
size distributions given in supplementary table S2 (Supplementary Material online). An inflation parameter of 2.8
yielded the highest number of protein families with single
orthologs from all or most (>21) of the 24 cyanobacterial
genomes (445 total), as well as families with no more than 2
paralogs (178 total). Note that the smallest cyanobacterial
genome (Prochlorococcus sp. MED4) analyzed contains
1,809 proteins; this level of filtering captures nearly 34%
of that genome for further phylogenetic analysis. Despite
the large number of clusters involved, the Markov clustering method is quite fast (<10 min on a 32-bit/2 GHz AMD
desktop PC) and has been argued to have advantages over,
for example, COG-based protein family assignment (Harlow
et al. 2004). Ultimately, the end goal of these and other
clustering methods is identical—to assemble proteins from
complete genomes into groups of evolutionarily related
orthologs—and no matter what heuristic is used, curation
is a necessary part of the process. Following clustering, all
families were then multiply aligned using ClustalW (Gonnet
protein weight matrix, default gap opening/extension penalties) on an MPI-enabled 18 processor AMD Athlon cluster
(all pre- and postcurated protein family alignments, as well as
scripts for using cluster results for translating complete genomes into protein families, are freely available on request).
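The clustering step can be illustrated with a minimal sketch of the Markov clustering idea (alternating expansion and inflation of a column-stochastic matrix). This is a toy NumPy version written for illustration, not the MCL program of Enright et al. (2002); the function name `markov_cluster`, the self-loop weighting, and the convergence tolerance are assumptions, and a real run would start from a matrix derived from Blast e values rather than the small hand-made matrix shown in the usage note.

```python
import numpy as np

def markov_cluster(similarity, inflation=2.8, max_iter=100, tol=1e-6):
    """Toy Markov clustering of a symmetric, non-negative similarity matrix."""
    m = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(m, m.max())             # add self-loops (an assumption of this sketch)
    m /= m.sum(axis=0, keepdims=True)        # make every column sum to 1
    for _ in range(max_iter):
        previous = m
        m = m @ m                            # expansion: two-step random walk
        m = m ** inflation                   # inflation: strengthen strong flows
        m /= m.sum(axis=0, keepdims=True)    # renormalise columns
        if np.abs(m - previous).max() < tol:
            break
    # Rows that still hold flow act as attractors; the columns they reach form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(np.flatnonzero(row > 1e-9).tolist())
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

As a usage example, a 5-protein matrix with strong similarity within proteins 0-2 and within proteins 3-4, plus one weak link between proteins 0 and 3, separates into two families. Higher inflation values generally give smaller, tighter clusters, which is why a range of values was scanned in the text.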
FIG. 2.—Consensus cyanobacterial phylogeny based on maximum likelihood trees for each of 438 orthologous protein families. The numbers at
each bifurcation indicate the total number of trees where that exact bifurcation/branching order is observed; for example, all 438 trees have
Synechococcus sp. A and B’ as closest neighbors and also cleanly distinguish marine Synechococcus and Prochlorococcus from all other cyanobacteria.
Note that as this is a consensus tree, topology is meaningful but distances are not (as opposed to fig. 3, where tree distances are still meaningful).
Using the multiple alignments and corresponding
Neighbor-Joining trees generated by ClustalW as a guide,
protein families were then manually checked for poor alignments and/or long branch lengths, with poorly aligned sequences and/or poorly assembled protein families either
corrected or removed. Most frequently, these problems
involved inclusion of a paralog in a protein family, which
can be easily detected based on the number of homologs per
organism or, often, the presence of long branches in the
phylogeny. As depicted in table 2, these curated protein
families were then parsed using various filters, for example,
selecting protein families present in all or most cyanobacteria, any imaginable subset of organisms, or by selecting
protein families that all share a common function or annotation. The full protein family spreadsheet is available as
supplementary table S2 (Supplementary Material online).
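The kind of filter described above (families present in all or most genomes, with a bounded number of paralogs) can be sketched as follows. The data layout, the function name `select_families`, and the parameter names are hypothetical; the defaults mirror the thresholds quoted in the text (more than 21 of 24 genomes) but the original scripts may have been organized quite differently.

```python
from collections import Counter

def select_families(families, genomes, min_genomes=22, max_paralogs=0):
    """Return names of families that pass presence and paralog filters.

    families: dict mapping family name -> list of (genome, protein_id) pairs.
    genomes:  set of genome names to consider (e.g., the 24 cyanobacteria).
    """
    selected = []
    for name, members in families.items():
        copies = Counter(genome for genome, _ in members if genome in genomes)
        present = len(copies)                           # genomes with at least one copy
        extra = sum(n - 1 for n in copies.values())     # copies beyond one per genome
        if present >= min_genomes and extra <= max_paralogs:
            selected.append(name)
    return selected
```

Calling the function twice, once with `max_paralogs=0` and once with `max_paralogs=2`, would separate strictly single-copy families from those with a small number of paralogs.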
In addition to the distance-based trees generated during multiple alignment, phylogenies based on single-protein families were generated for every aligned protein
family using 2 different maximum likelihood methods.
The first approach used PHYLIP’s ProML package with
the following parameters: JTT probability model, one category of sites with constant rate, and randomized input
order (Felsenstein 1989). Additionally, a second, quartet-based maximum likelihood approach was used with the
parallelized version of iqpnni, here using the Whelan and
Goldman substitution model and estimating a gamma parameter with 4 rate categories (Minh et al. 2005). PHYLIP’s
CONSENSE package was used to generate extended majority rule consensus phylogenies for each separate set of
phylogenies (distance and both ML runs; iqpnni run results
shown in fig. 2).
Concatenated multiple alignments were generated by
end-to-end attachment of individual protein families, using
gaps as placeholders for species missing a particular ortholog.
As an additional test of robustness, variable/uninformative
positions were filtered out of these concatenated alignments
using progressively more stringent Shannon information entropy cutoffs (SIE 1.0–3.0) and filtering out positions with
>50% gaps. The resulting concatenated alignments from
all 26 genomes ranged from 28,281 (SIE 1.0) to 230,415
(full/unfiltered concatenation) aligned amino acid positions
and contained up to 300,000 aligned positions in the case
of the Prochlorococcus/Synechococcus-conserved protein
families (fig. 4). PHYLIP ProML and Neighbor-Joining
phylogenies were then constructed for each of these filtered
concatenated alignments to determine the effect of removing gaps and progressively more variable sites from alignments (see, e.g., the discussion of differing support
for the Prochlorococcus/Synechococcus clade in the main
text).
Table 2
Parsing Protein Families Based on Different Criteria*
*See supplementary table S2 (Supplementary Material online) for full spreadsheet.
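The concatenation procedure described in this section (end-to-end attachment with gaps as placeholders for missing orthologs, followed by removal of positions with >50% gaps) can be sketched in a few lines. The function names `concatenate` and `drop_gappy_columns` and the dictionary-based alignment format are assumptions made for illustration only.

```python
def concatenate(alignments, taxa):
    """Join per-family alignments end to end, padding missing taxa with gaps.

    alignments: list of dicts mapping taxon -> aligned sequence
                (all sequences within one dict have equal length).
    taxa:       ordered list of all taxon names.
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        width = len(next(iter(alignment.values())))
        for taxon in taxa:
            # A taxon lacking this ortholog contributes a run of gap characters.
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}

def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction."""
    taxa = list(alignment)
    columns = zip(*(alignment[taxon] for taxon in taxa))
    kept = [col for col in columns if col.count("-") / len(col) <= max_gap_fraction]
    return {taxon: "".join(col[i] for col in kept) for i, taxon in enumerate(taxa)}
```

For example, concatenating a two-column family present in taxa A, B, and C with a two-column family missing from B pads B with `--`; the final column, which is a gap in two of three taxa, is then dropped by the 50% filter.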
Our final goal was to test the effect of correcting for
site heterogeneity in concatenated alignments by incorporating a gamma parameter, rather than strictly filtering out
variable regions of alignments. The size and associated
memory requirements of inferring gamma corrected phylogenies for these concatenated alignments required they
be analyzed using MrBayes (Huelsenbeck and Ronquist
2001). MrBayes was run using the VT evolutionary model,
incorporating a gamma parameter sampled from 4 rate categories, with the substitution model analyzed over 20,000–
30,000 generations in 4 separate runs and a 1,000 generation
burn-in. Because of the large data set and number of free parameters in the model, MrBayes required a 64-bit dual CPU
system with 8 GB RAM. Although this limited the number of
generations and discrete chains, in all cases topological convergence to the consensus phylogeny was achieved within
5,000–8,000 generations and was maintained throughout
all the remaining runs. In addition, this topology was also
observed in phylogenetic inference using the Neighbor-Joining algorithm as implemented in MEGA v3.0, using multiple models and incorporating a gamma parameter (Kumar
et al. 1994), and (topologically) agreed with the trees
obtained using PHYLIP’s ProML on entropy-filtered concatenated alignments, as discussed in the main text. All concatenated alignments and phylogenies are available upon request.
Tree comparisons used PHYLIP’s consense, using the
extended majority rule method and both the symmetric
(Robinson–Foulds) and branch score distance metrics.
Comparisons also included 50 trees composed of the same
cyanobacterial taxa arranged in randomized topologies. As
illustrated in figure 5, core- and pan-genome numbers are
determined for a specific rooted phylogeny by 1) counting
the number of protein families conserved within all descendants of a particular node in the tree (core) and 2)
counting the total number of protein families present in
the descendants of a particular node in the tree (pan).
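The core- and pan-genome counts defined above amount to an intersection and a union of protein family sets over the descendants of each node. A minimal sketch follows, assuming a rooted tree written as nested tuples of leaf names and a hypothetical dictionary `families_by_genome` mapping each genome to its set of family identifiers.

```python
def core_and_pan(tree, families_by_genome):
    """Return {frozenset(leaves): (core_size, pan_size)} for every node.

    tree: a leaf name (str) or a tuple of subtrees, e.g. (("A", "B"), "C").
    families_by_genome: dict mapping leaf name -> set of protein family ids.
    """
    sizes = {}

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
            core = pan = set(families_by_genome[node])
        else:
            results = [visit(child) for child in node]
            leaves = frozenset().union(*(r[0] for r in results))
            core = set.intersection(*(r[1] for r in results))  # families in all descendants
            pan = set.union(*(r[2] for r in results))          # families in any descendant
        sizes[leaves] = (len(core), len(pan))
        return leaves, core, pan

    visit(tree)
    return sizes
```

At a leaf the two numbers are equal, and moving toward the root the core can only shrink while the pan can only grow.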
Results and Discussion
The 24 genomes analyzed here represent all cyanobacteria with either complete or very nearly complete sequencing projects and encompass nearly 94,000 protein-coding
genes. Homology-based Markov clustering resulted in
7,378 families of proteins present in more than 1 cyanobacterium (an additional 12,955 protein families were found
only in a single cyanobacterium). Many of these families
include multiple, often closely related paralogs. For example, the D1 and D2 proteins of the photosystem II reaction
center complex are members of the same family, and ABC
transporter and serine/threonine kinase paralogs are quite
extensive even in the smallest cyanobacterial genomes.
To avoid problems associated with inclusion of paralogs
in phylogenies, initial analysis focused on families with
few or no paralogs present in most or all cyanobacteria,
which includes housekeeping proteins common to most
organisms as well as cyanobacterial-specific proteins
that have been important during their evolution and early
diversification.
Following the initial clustering, 613 protein families fit
the criterion of being absent in not more than 2 cyanobacteria and having not more than 2 paralogs in total for all
organisms. Alignments and Neighbor-Joining phylogenies
for all families were manually checked, and poorly aligned
proteins (as well as those with disproportionately long branch lengths; for details, see Methods) were removed
from alignments or else the family was removed from
the analysis. A total of 583 protein families remained after
this manual curation. Here we focus on a substantial number of relatively easily obtained families of orthologs, selected by a fast clustering approach that minimizes the
number of paralogs while maximizing the total number of
genomes represented in a given protein family (see supplementary table S1, Supplementary Material online).
Phylogenies for each of the 583 families were constructed using 2 different implementations of the maximum
likelihood method (PHYLIP and quartet-based iqpnni; see
Methods). A total of 438 of these families—those comprised strictly of orthologs—were then used to generate
a consensus phylogeny that portrays the bifurcations that
occur most frequently across all trees (fig. 2). For example,
both the marine Synechococcus/Prochlorococcus (11 organisms) and the Synechococcus sp. A and B’ clusters
are conserved in every tree generated, and the Nostocales
clade is observed in 421 of 438 trees. Importantly, only minority support is observed for several nodes on the tree, especially among the cyanobacteria often argued as among
the earliest branching (Gloeobacter)—which may indeed
reflect asymmetric rates of evolution—as well as for some
members of the Prochlorococcus lineages, which recent
studies suggest may result from horizontal gene transfer
(Beiko et al. 2005). The ability to detect this phylogenetic
incoherence is a crucial step in being able to segregate both
protein families and organisms that are responsible. An attractive, iterative approach would take these into account by
fine-tuning parameters of ascribed evolutionary models or
progressively removing “difficult” protein families from
tree-building methods that rely on combined data sets.
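The per-node support values discussed above (e.g., a clade observed in 421 of 438 trees) are counts of how many single-protein trees contain a given grouping. The sketch below counts rooted clades across trees written as nested tuples; it is a simplified illustration and not the extended majority rule algorithm of PHYLIP's CONSENSE, which was used for the actual analysis. The names `clades` and `clade_support` are hypothetical.

```python
from collections import Counter

def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found

def clade_support(trees):
    """Count, for every clade, the number of trees in which it appears."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    return counts
```

Clades with low counts flag the nodes, and by extension the protein families and organisms, where the individual trees disagree.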
This consensus phylogeny gives a straightforward
method for finding putative horizontal gene transfer events
and indicates that gene transfer “across” the tree, that is,
between Prochlorococcus/marine Synechococcus and cyanophytes, is very rare among this particular subset of proteins. Note that as these proteins are common to almost
all cyanobacteria, a very specific type of horizontal gene
transfer—orthologous gene replacement—must occur,
whereby a newly transferred gene displaces a functional
wild-type gene. Importantly, though recent evidence indeed
supports an important role for horizontal gene transfer
among cyanobacteria (Zhaxybayeva et al. 2006), simulations suggest that these phylogenetic signals are not self-reinforcing and, even when corrections are not made for
variations in evolutionary rate or composition, convergence
to the true tree is frequently observed (Gadagkar et al.
2005). Indeed, Zhaxybayeva et al. (2006) obtained a plurality tree based on quartet reconstruction with which the consensus and concatenated trees presented here are consistent.
In addition to individual and consensus phylogenies,
all alignments without paralogs were concatenated into
a single large alignment containing 230,415 positions encompassing 26 organisms. Smaller alignments were generated from this full alignment using a Shannon information
entropy–based filter (Reche and Reinherz 2003) to remove
phylogenetically uninformative (too variable or too conserved) sites from the alignment. Shannon entropy can
be calculated for each position in an alignment and provides
a more robust method for parsing informative positions
from alignment than simply culling positions that fall below
a given percentage identity or similarity. For example, a position in a protein sequence alignment might have 1 amino
acid in half of the sequences and a different amino acid in
the other half. If a percentage-based cutoff were used, this
position would contain the same informative value as one
where half the positions were 1 amino acid and the other
half were all different amino acids. However, the Shannon
entropy score of these 2 examples is quite different and,
furthermore, is conceptually similar to maximum likelihood
calculations. Phylogenies for all concatenated alignments
were generated as discussed in Methods and showed
overall agreement with one another, with one notable exception—differing levels of filtering (Shannon entropy cutoff values ranging from 1 to 4, where 0 is an invariant site
and 4.322 is a site where all 20 amino acids are equally represented) resulted in 2 distinct trees differing by monophyly
of the Prochlorococcus/Synechococcus clades. One of the
trees—shown in figure 3—was converged upon from multiple MrBayes runs using the full/unfiltered data set. This
tree is characterized by separate/monophyletic Prochlorales
(the order containing Prochlorococcus species) and marine
Synechococcus clades, with Synechococcus sp. strain WH
5701 basal to both groups, a topology supported in previous
single-gene trees (Rocap et al. 2002; Scanlan 2003). Notably, this tree was in almost exact agreement with the consensus phylogeny generated from 438 trees (with the
exception of the poorly supported Acaryochloris marina/
Thermosynechococcus elongatus clade, resolved as 2 distinct lineages in the concatenated tree).
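The Shannon entropy comparison made above can be checked numerically. The sketch below computes the entropy, in bits, of a single alignment column; the function name `shannon_entropy` and the choice to ignore gap characters are assumptions, and the code illustrates the scoring only, not the specific filter of Reche and Reinherz (2003).

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]   # gaps are ignored in this sketch
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Two columns that a simple percent-identity cutoff would treat alike
# (the most common residue occurs in half of 20 sequences in both):
two_state = "A" * 10 + "G" * 10          # one residue in each half
scattered = "A" * 10 + "CDEFGHIKLM"      # second half all different
```

Under these assumptions `shannon_entropy(two_state)` is 1 bit, whereas `shannon_entropy(scattered)` is about 2.66 bits; an invariant column scores 0 and a column with all 20 amino acids equally represented scores log2(20), approximately 4.322, matching the range quoted in the text.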
Although the observed convergence to a single tree
from 2 different approaches lends support to this as the true
tree, the fact that a different tree was inferred from some
filtered concatenated alignments underscores the importance of using multiple methods of analysis to infer phylogenies. Shannon entropy presents a metric for pruning highly
variable (less phylogenetically informative) positions from
long alignments, making phylogenetic analysis more tractable. However, care must be taken that evolutionary models are compared each time a data set is filtered as it is
feasible that the best model can change as positions are
pruned from an alignment. Even character-rich data sets
can be prone to error, in particular when they contain multiple phylogenetic signals or include highly divergent or
deeply branching organisms (Mossel and Steel 2006).
As is evident in figures 3 and 4, order Prochlorales
shows anomalously long branch lengths, evident both in individual and in concatenated phylogenies, that may account for the alternative topology seen in some filtered
concatenated phylogenies (this alternate topology is illustrated by the dashed line in fig. 4). However, one of the trees
is converged to in both concatenated and consensus phylogenies, lending support to this as the true tree.
As a further test, we demonstrate one of the advantages
of our approach by incorporating additional information
from protein families excluded from the initial analysis because they were not present in most or all cyanobacteria.
Specifically, 1,108 protein families are found in all Prochlorococcus and marine Synechococcus species (including
WH 5701). A total of 848 of these families have no paralogs
within either of these clades, and so individual and consensus/concatenated phylogenies can be generated for this Prochlorococcus/Synechococcus-specific subset of families.
As shown in figure 4, phylogeny based on 848 concatenated
protein families (287,466 aligned positions in 11 Prochlorococcus/Synechococcus genomes) supports a branching
order in agreement with both the consensus and fully concatenated data sets. Moreover, the resulting phylogeny also
retains the relatively long branch lengths characteristic of
several members of the Prochlorales clade, suggesting that
an accelerated substitution rate across many proteins has
accompanied genome reduction. Prochlorococcus genome
analyses have observed this long-branch effect, which is
likely due to loss of several DNA repair capabilities during
genome reduction (Dufresne et al. 2005).
FIG. 3. Bayesian maximum likelihood tree for the full concatenated data set, based on 230,415 aligned positions in 26 genomes. Note the strong
agreement with the consensus tree from figure 2, as well as the presence of non-cyanobacterial outgroups that support Gloeobacter violaceus as an early-branching cyanobacterium. The scale bar indicates the number of substitutions per site. Shown at each bifurcation are the predicted core-genome (upper
number) and pan-genome (lower number) sizes of an ancestor at that point. The core-genome represents the intersection of all protein families in all progeny
of an ancestor, whereas the pan-genome represents the union of all protein families in those progeny (the 2 numbers converge at the tips of the tree).
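The selection of paralog-free families shared by a chosen subset of genomes, as used for the Prochlorococcus/Synechococcus analysis, can be sketched as follows. The data layout (a dictionary of family members tagged by genome) and the function name are hypothetical and serve only to illustrate the filtering rule: a family is retained when every genome in the subset contributes exactly one member.

```python
def single_copy_families(families, genomes):
    """Return IDs of families present exactly once in every listed genome.

    families: dict family_id -> list of (genome, protein_id) members.
    genomes:  iterable of genome names that must all be represented.
    Members from genomes outside the subset are ignored, so the same
    clustering can be reused while telescoping to fewer taxa.
    """
    required = set(genomes)
    selected = []
    for fam_id, members in families.items():
        counts = {}
        for genome, _protein in members:
            if genome in required:
                counts[genome] = counts.get(genome, 0) + 1
        # Keep the family only if all genomes are present and none has paralogs.
        if set(counts) == required and all(n == 1 for n in counts.values()):
            selected.append(fam_id)
    return sorted(selected)
```

Narrowing the `genomes` argument to a smaller clade generally enlarges the set of usable families, which is the basis of the telescoping strategy described in the text.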
The single phylogeny converged upon by multiple
methods used herein also provides a framework for understanding the distribution of protein families at each ancestral node on the tree (Martin et al. 2002; Eisen and Fraser
2003; Lerat et al. 2003). As shown in figure 3, the common
ancestor of all cyanobacteria is inferred to have had a conserved core of 361 protein families as these are present in
the full set of 26 genomes analyzed. A total of 675 proteins
(within which the 361 are nested) are common to all 24
cyanobacterial genomes analyzed, though as mentioned,
many of these families contain paralogs and so were excluded from this analysis. These families represent
a widely conserved core of housekeeping proteins common not only across known cyanobacterial diversity but
also present to some extent in non-cyanobacterial genomes. Furthermore, the total diversity of modern cyanobacterial protein families—the union of all protein families
in all progeny of an ancestor—is inferred to be just over
20,000 proteins for the cyanobacterial common ancestor
and 25,292 when including the non-cyanobacterial outgroups. This is referred to as the cyanobacterial pan-genome (which, it must be emphasized, never actually existed
but simply captures the extent of protein family variability
across the phylum), illustrated along with the core-genome
concept in figure 5. These pan- and core-genome numbers
provide upper and lower bounds on protein family distributions at each node in a given phylogeny and are not
parsimony-based estimates of the true genetic content of
ancient organisms.
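The core- and pan-genome bounds at each ancestral node follow directly from set intersection and union over the descendants of that node. The sketch below assumes a tree encoded as nested tuples with genome names at the leaves and a dictionary of protein family sets per genome; both representations are illustrative assumptions and not the data structures used in this study.

```python
def core_and_pan(tree, families_by_genome, results=None):
    """Post-order traversal returning (core, pan, leaves) for `tree`.

    tree: a leaf name (str) or a non-empty tuple of child subtrees.
    families_by_genome: dict genome name -> set of protein family IDs.
    results: optional list; one (leaves, core size, pan size) entry is
             appended for every internal node.
    """
    if isinstance(tree, str):
        fams = set(families_by_genome[tree])
        return fams, set(fams), (tree,)
    core, pan, leaves = None, set(), ()
    for child in tree:
        c_core, c_pan, c_leaves = core_and_pan(child, families_by_genome, results)
        core = c_core if core is None else core & c_core  # intersection
        pan |= c_pan                                       # union
        leaves += c_leaves
    if results is not None:
        results.append((leaves, len(core), len(pan)))
    return core, pan, leaves
```

At a leaf the two sets are identical, which corresponds to the convergence of the core and pan numbers at the tips of the tree.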
FIG. 5. Using phylogeny and the distribution of protein families in
different genomes to infer ancestral characteristics. As illustrated in the
diagram, each bifurcation represents an ancestor whose core-genome
contains the protein families found in every one of its descendants (the
intersection of descendant genomes), whereas the pan-genome contains
all protein families found in any of its descendants (the union of descendant
genomes).
FIG. 4. Maximum likelihood phylogeny of 848 concatenated
orthologous protein families (~290,000 aligned amino acid positions)
common to 11 Prochlorococcus and marine Synechococcus genomes. By
incorporating a larger number of protein families shared in a smaller
number of closely related organisms, we find strong support for 1 of 2
topologies found in 26-genome trees, effectively improving the resolution
of the consensus tree. The alternative topological position of Synechococcus sp. WH5701, observed in some filtered concatenated trees as
discussed in the text, is illustrated by the dashed line.
The core-genome at the base of the cyanobacterial
phylum encompasses most of the major proteins of the
photosynthetic apparatus, suggesting that oxygenic photosynthesis evolved prior to or early in the cyanobacterial radiation. This is in stark contrast with the ability to fix
nitrogen, which is found paraphyletically throughout the
cyanobacterial tree (illustrated in fig. 6a; N2-fixing lineages are denoted by “+”). The nodes where nitrogen fixation
is inferred (that is, those whose descendant lineages all fix
nitrogen) occur at multiple points across the tree (gray
squares on fig. 6a), so that gene loss, horizontal gene transfer, or some combination of these processes must be
invoked to explain the distribution of nitrogen fixation.
The strength of having both combined and individual phylogenies comes from the capability to contrast the background tree of cyanobacterial speciation (figs. 2 and 3)
with the evolutionary tree for nitrogenase. For example,
based on the species tree, one plausible scenario is that nitrogenase was acquired on independent occasions within
cyanobacterial lineages (e.g., horizontal gene transfers would be required at the gray +’s in fig. 6a), followed
by largely vertical evolution to result in the observed distribution in the phylum. Alternatively (and arguably less
parsimoniously), one could posit that the ancestor of all cyanobacteria had the capability to fix nitrogen but that the
nitrogenase evolutionary history has since been dominated
by gene loss. This scenario begins with nitrogen fixation in
the hypothetical pan-genome and is followed by multiple
independent losses, shown as x’s on figure 6a.
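The two extreme scenarios can be compared by counting events on the species tree. The following sketch, which again assumes a nested-tuple tree and a set of trait-bearing genomes (both hypothetical representations), counts the maximal clades in which every descendant carries the trait (independent acquisitions if the root lacked it) and the maximal clades in which no descendant carries it (independent losses if the root had it). It is a simple tally of these two extremes and not a full parsimony reconstruction.

```python
def count_events(tree, positive):
    """Return (state, gains, losses) for `tree`.

    tree:     a leaf name (str) or a tuple of child subtrees.
    positive: set of leaf names that carry the trait.
    state:    "all", "none", or "mixed", describing the leaves below.
    gains:    number of maximal all-positive clades.
    losses:   number of maximal all-negative clades.
    """
    if isinstance(tree, str):
        return ("all", 1, 0) if tree in positive else ("none", 0, 1)
    states, gains, losses = set(), 0, 0
    for child in tree:
        state, g, l = count_events(child, positive)
        states.add(state)
        gains += g
        losses += l
    if states == {"all"}:
        return "all", 1, 0    # children merge into one all-positive clade
    if states == {"none"}:
        return "none", 0, 1   # children merge into one all-negative clade
    return "mixed", gains, losses
```

Nodes returned with state "all" correspond to ancestors at which the trait is inferred under the core-genome criterion.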
By examining phylogenies of individual protein families, for example, that of the NifD (nitrogen fixation catalytic subunit) protein family shown in figure 6b, we can
explore whether one of these scenarios is indeed more parsimonious than the other or if some combination of the 2 is
more likely. The NifD tree (fig. 6b) shows some congruence
with the cyanobacterial species tree (fig. 6a) but provides an
important example of the complex history of protein families, often overlooked or not accurately captured in species
trees. As well as supporting numerous gene losses, the NifD
tree shows evidence for several gene duplications and plausible horizontal gene transfer, as suggested by the position
of Trichodesmium erythraeum, comprising the earliest cyanobacterial branch among NifD proteins (though note poor
bootstrap support makes it difficult to resolve this from the
Synechococcus sp. A/B’ divergence). At face value, this indeed suggests a combination of vertical evolution and gene
loss accounts for the distribution of nitrogen fixation in cyanobacteria, with evidence for horizontal gene transfer as
well as duplication in several lineages.
As with this truth-is-in-between example, the cyanobacterial ancestor would have had a genome content somewhere between the core- and pan-ancestral extremes, with
functions and capabilities that, as demonstrated above, can
be understood through examination of individual phylogenies. In a broader sense, the range established by ancestral
core- and pan-genomes gives insight into the relative importance of genome reduction versus the evolution or
acquisition of new genes and helps constrain the appearance of phenotypes specific to individual organisms or
clades. This approach is extended to several other pathways
of key importance to cyanobacterial evolution, such as carbon fixation and pigment biosynthesis, in Swingley et al.
(2007). As shown in figure 7, the size of the
core-genome shared by any 2 organisms shows a strong inverse correlation with their phylogenetic distance, whereas
the pan-genome size shows only weak correlation. This
results mainly because of the presence of novel/orphan
genes that distinguish even closely related genomes, such
as the 2 Synechococcus elongatus strains with 2,219 shared
protein families.
FIG. 6. (a) Possible scenarios for the distribution of nitrogen fixation in cyanobacteria, contrasting convergence versus gene loss as suggested by
the protein family composition of core- and pan-ancestral genomes. The nitrogen fixation pathway is found in 7 genomes (black + next to species
name). Pan-ancestral genome data posit that nitrogen fixation was already present in the cyanobacterial common ancestor and many of its descendants (black dots) but
was subsequently lost in many lineages (black x’s). Core composition of ancestral genomes suggests that the ability to fix nitrogen appeared 3
independent times (gray +’s; gray boxes indicate ancestral nodes where N2-fixation was present). (b) The phylogenetic tree from protein family
1574, the catalytic molybdenum–iron subunit of the nitrogenase complex (see e.g., table 2). Though many lineages are missing, the species present
have a similar phylogeny as observed in the species tree, suggesting largely vertical evolution with multiple gene losses. The exception is the distinct
position of Trichodesmium erythraeum, which is not closely related to the Nostocales as it is in 6a, suggesting that horizontal gene transfer may have been
important early in the evolution of cyanobacterial nitrogen fixation.
As mentioned above, the major elements of the cyanobacterial species tree find strong support in other analyses, coming both from systematics and molecular analyses.
These include: monophyly of heterocystous diazotrophs
with the nonheterocystous diazotroph Trichodesmium erythraeum as an outgroup (Sanchez-Baracaldo et al. 2005);
the sister relationship and monophyly of marine Synechococcus and Prochlorales (with Synechococcus sp. WH5701
basally branching) (Scanlan 2003) and a more deeply
branching group of freshwater Synechococcus (PCC6301
and 7942) (Giovannoni et al. 1988; Honda et al. 1999);
the cluster of Synechocystis sp. PCC6803 and Crocosphaera watsonii (Sanchez-Baracaldo et al. 2005); and evidence for Gloeobacter violaceus as an early-branching
cyanobacterium (Nelissen et al. 1995), though intriguingly
2 thermophilic, N2-fixing Synechococcus strains also branch
very deeply (Ferris et al. 1996). Note that this approach, like
any, is subject to biases in ongoing sequencing projects
and is therefore missing several important cyanobacterial
taxonomic groups; however, it establishes a framework
for incorporating further genomic data as well as expanding
individual protein families with sequence data from public
databases. This also provides a straightforward approach
with which to target sequencing strategies toward organisms that will most improve phylogenetic resolution.
FIG. 7. Change in core- or pan-genome size at increasing
evolutionary distances for the cyanobacterial tree in figure 3. The black
dots (left axis) indicate the core-genome size versus evolutionary distance
between all pairwise combinations of the 26 genomes analyzed. The black
line shows a single exponential fit (r² = 0.802). The gray dots (right axis)
give the same information for the pan-genome size.
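The pairwise comparison of core- and pan-genome size against evolutionary distance summarized in figure 7 can be sketched as below. The exact form of the exponential fit used for the figure is not stated in the text, so this example assumes the model size = a·exp(b·distance), fitted by ordinary least squares on log-transformed sizes, with r² reported in log space; the function names and input layout are likewise hypothetical.

```python
import math
from itertools import combinations

def pairwise_core_pan(families_by_genome):
    """Core (intersection) and pan (union) sizes for every genome pair.

    families_by_genome: dict genome name -> set of protein family IDs.
    Returns a list of (genome1, genome2, core size, pan size).
    """
    rows = []
    for g1, g2 in combinations(sorted(families_by_genome), 2):
        f1, f2 = families_by_genome[g1], families_by_genome[g2]
        rows.append((g1, g2, len(f1 & f2), len(f1 | f2)))
    return rows

def fit_exponential(distances, sizes):
    """Fit size = a * exp(b * distance) by linear regression on log(size).

    Returns (a, b, r_squared); r_squared is computed in log space.
    """
    n = len(distances)
    if n < 2 or n != len(sizes):
        raise ValueError("need at least two paired observations")
    logs = [math.log(s) for s in sizes]
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    syy = sum((y - mean_y) ** 2 for y in logs)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return a, b, r_squared
```

A negative fitted `b` corresponds to the inverse relationship between shared core size and phylogenetic distance described in the text; the pairwise distances themselves would come from the species tree and are not computed here.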
The phylogenies presented here integrate a large
amount of genomic data from all completed, as well as
a few nearly complete, cyanobacterial genomes. The fact
that concatenated and consensus phylogenies from as many
as 583 proteins converge on nearly identical topologies that
agree with earlier systematic and molecular approaches
suggests that this tree represents an accurate, though averaged, history of cyanobacterial speciation. Moreover, phylogenies from individual protein families are retained and
can be selected and contrasted based on overall resolution,
taxonomic distribution, degree of orthology versus paralogy, or various functional or pathway-associated criteria
(e.g., table 2). Though attempting to resolve organismal
evolution as a single phylogenetic tree invariably ignores
the rich histories of single genes, here we have emphasized
how organismal history can be understood at one level by
integrating the information present in diverse genes and on
additional levels by contrasting that integrated tree with individual phylogenies.
This telescoping approach to phylogenetic reconstruction—incorporating data from protein sequences at multiple taxonomic levels of conservation—can be used to refine
evolutionary trees at different levels of phylogenetic resolution. Furthermore, inference of robust phylogenies stands
as a primary technique by which horizontal gene transfer
can be detected (and then be subtracted from consensus data
sets). As genome data continue to fill out the branches of the
tree of life, this approach will become increasingly useful as
it provides a way to incorporate, compare, and contrast entire genomes’ worth of sequence data, without ignoring information from individual genes or proteins.
Accession Numbers
Accession numbers for genomes used in this study are
given in table 1.
FIG. 1. Diagram of the steps involved in going from complete genomes to phylogenetic analysis, as detailed in the Methods.
Supplementary Material
Supplementary figure S1 and tables S1 and S2 are
available at Molecular Biology and Evolution online
(http://www.mbe.oxfordjournals.org/).
Acknowledgments
The authors wish to thank Jeff Touchman and the
DNA sequencing team at the Translational Genomics Institute for making available sequence data for Acaryochloris
marina. The authors also acknowledge very helpful discussions and suggestions from Carrine Blank and Elbert
Branscomb. The A. marina genome project is funded by
grant 0412824 from the National Science Foundation
Microbial Genome Sequencing Program (http://genomes.tgen.org/). R.B. acknowledges additional support from
grant NNG04GK59G from the Exobiology Program at
the National Aeronautics and Space Administration. J.R.
acknowledges support through a Lawrence Postdoctoral
Fellowship at Lawrence Livermore National Laboratory.
Literature Cited
Beiko RG, Harlow TJ, Ragan MA. 2005. Highways of gene
sharing in prokaryotes. Proc Natl Acad Sci USA.
102:14332–14337.
Casamatta DA, Johansen JR, Vis ML, Broadwater ST. 2005.
Molecular and morphological characterization of ten polar
and near-polar strains within the Oscillatoriales (cyanobacteria). J Phycol. 41:421–438.
Castenholz RW. 2001. Phylum BX. Cyanobacteria. Oxygenic
photosynthetic bacteria. In: Boone DR, Castenholz RW,
editors. Bergey’s manual of systematic bacteriology. Volume
1: the Archaea and deeply branching and phototrophic
Bacteria. New York: Springer-Verlag. p. 413–439.
Charlebois RL, Beiko RG, Ragan MA. 2003. Microbial
phylogenomics: branching out. Nature. 421:217.
Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B,
Bork P. 2006. Toward automatic reconstruction of a highly
resolved tree of life. Science. 311:1283–1287.
Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the
reconstruction of the tree of life. Nat Rev Genet. 6:361–375.
Doolittle RF. 2005. Evolutionary aspects of whole-genome
biology. Curr Opin Struct Biol. 15:248–253.
Dufresne A, Garczarek L, Partensky F. 2005. Accelerated
evolution associated with genome reduction in a free-living
prokaryote. Genome Biol. 6:R14.
Eisen JA. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis.
Genome Res. 8:163–167.
Eisen JA, Fraser CM. 2003. Phylogenomics: intersection of
evolution and genomics. Science. 300:1706–1707.
Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient
algorithm for large-scale detection of protein families. Nucleic
Acids Res. 30:1575–1584.
Farquhar J, Bao H, Thiemens M. 2000. Atmospheric influence of
earth’s earliest sulfur cycle. Science. 289:756–759.
Felsenstein J. 1989. PHYLIP—Phylogeny inference package
(Version 3.2). Cladistics. 5:164–166.
Ferris MJ, Ruff-Roberts AL, Kopczynski ED, Bateson MM,
Ward DM. 1996. Enrichment culture and microscopy conceal
diverse thermophilic Synechococcus populations in a single
hot spring microbial mat habitat. Appl Environ Microbiol.
62:1045–1050.
Fox GE, Wisotzkey JD, Jurtshuk P Jr. 1992. How close is close:
16S rRNA sequence identity may not be sufficient to
guarantee species identity. Int J Syst Bacteriol. 42:166–170.
Gadagkar SR, Rosenberg MS, Kumar S. 2005. Inferring species
phylogenies from multiple genes: concatenated sequence tree
versus consensus gene tree. J Exp Zoolog B Mol Dev Evol.
304:64–74.
Giovannoni SJ, Turner S, Olsen GJ, Barns S, Lane DJ, Pace NR.
1988. Evolutionary relationships among cyanobacteria and
green chloroplasts. J Bacteriol. 170:3584–3592.
Gogarten JP, Townsend JP. 2005. Horizontal gene transfer, genome
innovation and evolution. Nat Rev Microbiol. 3:679–687.
Harlow TJ, Gogarten JP, Ragan MA. 2004. A hybrid clustering
approach to recognition of protein families in 114 microbial
genomes. BMC Bioinformatics. 5:45.
Henson BJ, Hesselbrock SM, Watson LE, Barnum SR. 2004.
Molecular phylogeny of the heterocystous cyanobacteria
(subsections IV and V) based on nifD. Int J Syst Evol
Microbiol. 54:493–497.
Hess WR. 2004. Genome analysis of marine photosynthetic microbes
and their global role. Curr Opin Biotechnol. 15:191–198.
Hess WR, Rocap G, Ting CS, Larimer F, Stilwagen S,
Lamerdin J, Chisholm SW. 2001. The photosynthetic
apparatus of Prochlorococcus: insights through comparative
genomics. Photosynth Res. 70:53–71.
Honda D, Yokota A, Sugiyama J. 1999. Detection of seven major
evolutionary lineages in cyanobacteria based on the 16S
rRNA gene sequence analysis with new sequences of five
marine Synechococcus strains. J Mol Evol. 48:723–739.
Huelsenbeck JP, Ronquist F. 2001. MrBayes: Bayesian inference
of phylogenetic trees. Bioinformatics. 17:754–755.
Knoll AH. 2003. The geological consequences of evolution.
Geobiology. 3–14.
Kopp RE, Kirschvink JL, Hilburn IA, Nash CZ. 2005. The
paleoproterozoic snowball earth: a climate disaster triggered
by the evolution of oxygenic photosynthesis. Proc Natl Acad
Sci USA. 102:11131–11136.
Kumar S, Tamura K, Nei M. 1994. MEGA: molecular
evolutionary genetics analysis software for microcomputers.
Comput Appl Biosci. 10:189–191.
Lerat E, Daubin V, Moran NA. 2003. From gene trees to
organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol. 1:E19.
Martin KA, Siefert JL, Yerrapragada S, Lu Y, McNeill TZ,
Moreno PA, Weinstock GM, Widger WR, Fox GE. 2003.
Cyanobacterial signature genes. Photosynth Res. 75:211–221.
Martin W, Rujan T, Richly E, Hansen A, Cornelsen S, Lins T,
Leister D, Stoebe B, Hasegawa M, Penny D. 2002.
Evolutionary analysis of Arabidopsis, cyanobacterial, and
chloroplast genomes reveals plastid phylogeny and thousands
of cyanobacterial genes in the nucleus. Proc Natl Acad Sci
USA. 99:12246–12251.
Minh BQ, Vinh le S, von Haeseler A, Schmidt HA. 2005.
pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics. 21:3794–3796.
Mossel E, Steel M. 2006. How much can evolved characters tell
us about the tree that generated them? In: Gascuel O, editor.
Mathematics of evolution and phylogeny. Oxford: Oxford
University Press. p. 384–412.
Mulkidjanian AY, Koonin EV, Makarova KS, et al. (12 co-authors).
2006. The cyanobacterial genome core and the origin of
photosynthesis. Proc Natl Acad Sci USA. 103:13126–13131.
Nelissen B, Van de Peer Y, Wilmotte A, De Wachter R. 1995.
An early origin of plastids within the cyanobacterial divergence is suggested by evolutionary trees based on
complete 16S rRNA sequences. Mol Biol Evol.
12:1166–1173.
Reche PA, Reinherz EL. 2003. Sequence variability analysis of
human class I and class II MHC molecules: functional and
structural correlates of amino acid polymorphisms. J Mol
Biol. 331:623–641.
Rippka R, Deruelles J, Waterbury JB, Herdman M, Stanier RY.
1979. Generic assignments, strain histories and properties of
pure cultures of cyanobacteria. J Gen Microbiol. 111:1–61.
Rivera MC, Lake JA. 2004. The ring of life provides evidence
for a genome fusion origin of eukaryotes. Nature. 431:
152–155.
Rocap G, Distel DL, Waterbury JB, Chisholm SW. 2002.
Resolution of Prochlorococcus and Synechococcus ecotypes
by using 16S-23S ribosomal DNA internal transcribed spacer
sequences. Appl Environ Microbiol. 68:1180–1191.
Sanchez-Baracaldo P, Hayes PK, Blank CE. 2005. Morphological and habitat evolution in the cyanobacteria using
a compartmentalization approach. Geobiology. 3:145–165.
Scanlan DJ. 2003. Physiological diversity and niche adaptation in
marine Synechococcus. Adv Microb Physiol. 47:1–64.
Seo PS, Yokota A. 2003. The phylogenetic relationships of
cyanobacteria inferred from 16S rRNA, gyrB, rpoC1
and rpoD1 gene sequences. J Gen Appl Microbiol. 49:
191–203.
Snel B, Huynen MA, Dutilh BE. 2005. Genome trees and the
nature of genome evolution. Annu Rev Microbiol. 59:191–209.
Summons RE, Jahnke LL, Hope JM, Logan GA. 1999.
2-Methylhopanoids as biomarkers for cyanobacterial oxygenic photosynthesis. Nature. 400:554–557.
Swingley WD, Blankenship RE, Raymond J. 2007. Insights into
cyanobacterial evolution from comparative genomics. In:
Herrero A, Flores E, editors. Genomics and molecular biology
of cyanobacteria. Norwich (UK): Horizon Scientific Press.
p. 22–43.
Woese CR. 1987. Bacterial evolution. Microbiol Rev.
51:221–271.
Wolf YI, Rogozin IB, Grishin NV, Koonin EV. 2002. Genome
trees and the tree of life. Trends Genet. 18:472–479.
Zhaxybayeva O, Gogarten JP, Charlebois RL, Doolittle WF,
Papke RT. 2006. Phylogenetic analyses of cyanobacterial
genomes: quantification of horizontal gene transfer events.
Genome Res. 16:1099–1108.
Takashi Gojobori, Associate Editor
Accepted December 26, 2007