Integrating Markov Clustering and Molecular Phylogenetics to Reconstruct the Cyanobacterial Species Tree from Conserved Protein Families

Wesley D. Swingley,* Robert E. Blankenship,† and Jason Raymond‡

*Institute of Low Temperature Science, Hokkaido University, Sapporo, Japan; †Departments of Biology and Chemistry, Washington University, St Louis, MO; and ‡School of Natural Sciences, University of California, Merced

Attempts to classify living organisms by their physical characteristics are as old as biology itself. The advent of protein and DNA sequencing, most notably the use of 16S ribosomal RNA, defined a new level of classification that now forms our basic understanding of the history of life on Earth. High-throughput sequencing currently provides DNA sequences at an unprecedented rate, not only providing a wealth of information but also posing considerable analytical challenges. Here we present comparative genomics–based methods useful for automating evolutionary analysis between any number of species. As a practical example, we applied our method to the well-studied cyanobacterial lineage. The 24 cyanobacterial genomes compared here occupy a wide variety of environmental niches and play major roles in global carbon and nitrogen cycles. By integrating phylogenetic data inferred for upward of 1,000 protein-coding genes common to all or most cyanobacteria, we have reconstructed an evolutionary history of the phylum, establishing a framework for resolving key issues regarding the evolution of their metabolic and phenotypic diversity. Greater resolution on individual branches can be attained by telescoping inward to the larger set of conserved proteins between fewer taxa. The construction of all individual protein phylogenies allows for quantitative tree scoring, providing insight into the evolutionary history of each protein family as well as probing the limits of phylogenetic resolution.
The tools incorporated here are fast, computationally tractable, and easily extendable to other phyla, and they provide a scalable framework for contrasting and integrating the information present in thousands of protein-coding genes within related genomes.

Introduction

Although the 16S ribosomal RNA (rRNA) paradigm continues to provide a strong framework for understanding evolution, it represents only one small piece of an organism’s history. The exponentially increasing number of genome sequencing projects is pushing our understanding of diversity well beyond the limitations of the single-gene proxy. Integrating the enormous wealth of genetic information (hundreds to tens of thousands of genes per genome) stands as one of the central challenges to biology in the 21st century. Ultimately, an evolutionary tree will be available for every (nonnovel) gene from every sequenced genome, providing a temporal and cross-species blueprint of how Darwinian evolution has brought these genes together into an organism able to thrive in its particular niche. The goal of phylogenomics has recently been the subject of a number of novel and provocative approaches (Eisen 1998; Lerat et al. 2003; Rivera and Lake 2004; Delsuc et al. 2005; Snel et al. 2005). Although insightful, their results are often quite controversial; for example, some strongly support the canonical tree of life as deduced by 16S rRNA analysis, whereas others suggest striking rearrangements to this orthodoxy (Wolf et al. 2002; Charlebois et al. 2003; Doolittle 2005; Ciccarelli et al. 2006). Molecular phylogeny, perhaps the best developed and most rigorously tested of these methods, has been difficult to implement, primarily because of the computational challenges of constructing gene trees with very large data sets.
Furthermore, single-gene phylogenies are complex by default, often reflecting nonvertical evolution due to horizontal gene transfer, gene duplication (paralogy), and loss (Gogarten and Townsend 2005). Deep phylogenies are especially prone to poor resolution due to sequence divergence. In particular, although one set of homologous genes or proteins may be quite useful in resolving species- or genus-level relationships, it might be quite poor at resolving phylum-level relationships due to poor conservation or short sequence length.

Key words: genomics, cyanobacteria, evolution, Markov clustering, phylogenomics.
E-mail: [email protected].
Mol. Biol. Evol. 25(4):643–654. 2008. doi:10.1093/molbev/msn034. Advance Access publication February 22, 2008.
© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

In this work, we take a new approach integrating clustering and sequence analysis toward resolving an integrated phylogeny spanning multiple taxonomic levels within a single phylum. Using all genomes available from a single phylum, our approach combines the rigorous (maximum likelihood) analysis of large numbers of orthologs, as well as of concatenated sets of up to several hundred proteins representing a large fraction of some genomes, and of consensus phylogenies based on single-protein trees. The ultimate goal is to determine, given the known role of horizontal gene transfer particularly in prokaryote evolution as well as the difficulty in resolving deep phylogenies, whether a plurality phylogenetic signal exists that is both consistent with, and potentially explanatory toward, systematic and taxonomic information about a group of organisms. This phylum-first approach is well suited to the >1,000 ongoing genome projects, for several reasons.
First, most phyla appear to be robustly defined based both on molecular methods, especially 16S, and on traditional systematics. Organisms within a phylum typically share unique phenotypic traits that are variable enough to be both interesting and informative of the evolutionary process. Second, by focusing first on resolving the distribution and phylogeny of single proteins, it is possible to select for subsequent analysis those that are potentially most useful in resolving relationships at different taxonomic levels. For example, many proteins are not common to all organisms within a clade and would be excluded from analyses of completely conserved, or “core,” proteins, whereas they might be useful for determining relationships between subsets of organisms. Additionally, depending on factors such as length and degree of conservation, some proteins give well-resolved trees for only some taxonomic levels. Ribosomal proteins often share 100% amino acid identity (and are thereby phylogenetically uninformative) between members of the same genus or species. Working with a single phylum (as opposed to, say, all 3 domains of life) also prevents data sets from becoming computationally intractable, especially when employing maximum likelihood–based approaches. This methodology can also be naturally extended into different taxonomic levels. Whereas some subset of proteins may be useful for resolving relationships within phyla, when needed, additional proteins can be incorporated for reconstructing family-, class-, or genus-level relationships by selecting only those proteins conserved at these taxonomic levels. Understanding which proteins are adequate at resolving different taxonomic levels enables selection of proteins that are useful in determining relationships between phyla, an ultimate goal (and persistent shortcoming) in reconstructions of the tree of life.
As an introductory example, we focus on the phylum cyanobacteria, which is notable for sequencing projects covering a wide swath of their enormous diversity as well as for their evolutionary importance and time constraints on their early evolution. The most ancient diagnostic markers for any organism come in the form of chemical biomarkers argued to have been left by cyanobacterial ancestors some 2.7 billion years ago, and the global-scale effects resulting from the oxygen produced during cyanobacterial photosynthesis are seen in rocks ~2.43 billion years old and younger (Summons et al. 1999; Farquhar et al. 2000; Knoll 2003; Kopp et al. 2005). Ongoing and completed sequencing projects include cyanobacteria from marine and freshwater environments, thermophiles, nitrogen fixers, and symbionts. In addition to illustrating the robust evolutionary resolution acquired using our method, we also seek to build a growing phylogenetic framework upon which the evolution of this phenotypically diverse group of organisms is based. The long history of cyanobacterial systematics has been confounded by morphology-based botanical classifications as well as difficulties in resolving closely related species using 16S rRNA (Rippka et al. 1979; Fox et al. 1992; Castenholz 2001; Casamatta et al. 2005). Individual genes and proteins conserved across all organisms or specifically in all cyanobacteria have been used to build phylogenies (Woese 1987; Giovannoni et al. 1988; Honda et al. 1999; Hess et al. 2001; Seo and Yokota 2003; Henson et al. 2004). Some subsets of cyanobacteria have also been compared extensively, particularly within the (genomically) well-sampled Prochlorophyte clade (Hess 2004; Dufresne et al. 2005). However, only a few studies thus far have assembled cyanobacterial phylogenies based on a larger set of proteins conserved across all cyanobacteria. Martin et al.
(2002) examined several thousand genes from 3 then-available cyanobacteria to determine the evolutionary history of nuclear genes from Arabidopsis thaliana, establishing the widescale impact that imported cyanobacterial genes have had on the evolution of photosynthetic eukaryotes, as well as plausible gene complements of chloroplast/cyanobacterial ancestors. A Blast-based comparison of 8 cyanobacterial genomes by Martin et al. (2003) revealed 181 signature genes that do not have homologs in other organisms, roughly 3/4 of which had no ascribable function yet are clearly important in some aspect of cyanobacterial lifestyle. Sanchez-Baracaldo et al. (2005) more recently developed a method based on multigene concatenation combined with morphological character analysis to construct and map traits onto a cyanobacterial species tree. Additionally, a cyanobacterial phylogeny based on 31 proteins conserved across the entire tree of life was constructed as part of a large-scale tree construction (Ciccarelli et al. 2006), but this study used only 8 cyanobacterial taxa and the ribosomal proteins used for tree construction did not resolve terminal branches. A cluster of orthologous groups (COG)–based analysis was used to determine the distribution of proteins in 15 complete cyanobacterial genomes, with a particular focus on understanding the origin of photosynthesis (Mulkidjanian et al. 2006). However, that study did not undertake phylogenetic analysis, either of individual protein families or in an attempt to resolve the evolution of the phylum as a whole. Zhaxybayeva et al. (2006) have conducted the most extensive sampling of the phylum to date, reconstructing histories of 1,128 protein-coding genes from 11 cyanobacterial genomes in order to reconstruct a plurality tree based on quartet analysis.
In addition to constructing maximum likelihood trees for a large number of orthologs from completed cyanobacterial genomes, we assembled concatenated alignments as a further test of phylogenetic robustness. Importantly, variations in the concatenated alignment used resulted in 2 distinct but very highly supported phylogenies, suggesting that even large, statistically well-supported concatenations can converge on very different trees. To further test phylogenetic robustness, we used a tree consensus method to build a single tree that best captures all single-protein phylogenies. Recent work (Gadagkar et al. 2005) has compared the effectiveness of concatenated versus consensus methods for phylogenetic inference in the face of incongruent signals (e.g., due to horizontal gene transfer, poor resolution, invalid model assumptions, or use of the same model for all data sets). They found that concatenated phylogenies outperform consensus phylogenies, though importantly both methods can converge on incorrect trees when systematic biases are present in individual trees—for example, when the evolutionary model used is a poor match to the data. However, our consensus tree agrees exactly with one of the trees inferred from concatenated alignments, compares the results of multiple evolutionary models, and also is compatible with modern cyanobacterial classification schemes that integrate both systematic and molecular information. To further test, and potentially increase, resolution of individual nodes on our concatenated/consensus genome tree, we used a telescoping method whereby protein families that are conserved among a smaller number of very closely related taxa can be taken into account. This proved useful particularly in resolving relationships between the very closely related marine Synechococcus and Prochlorococcus clades, which were clarified with exceptional support by analyzing conserved protein families between just these 2 groups. 
Table 1
Genomes Analyzed in This Study

Organism                            Description                                 National Center for Biotechnology Information Accession/in Progress
Gloeobacter violaceus               Terrestrial, lacks photosystem components   NC_005125
Thermosynechococcus elongatus       55 °C optimal growth (hot springs)          NC_004113
Anabaena variabilis                 Heterocyst-forming diazotroph               NC_007413
Nostoc punctiforme                  Heterocyst-forming diazotroph               IP; NZ_AAAY00000000 (a)
Nostoc sp. PCC7120                  Heterocyst-forming diazotroph               NC_003272
Synechococcus elongatus PCC7942     Freshwater                                  NC_007604
S. elongatus PCC6301                Freshwater                                  NC_006576
Synechocystis sp. PCC6803           Freshwater                                  NC_000911
Synechococcus sp. OS A              Hot spring, diazotroph                      NC_007775
Synechococcus sp. OS B'             Hot spring, diazotroph                      NC_007776
Acaryochloris marina                Chlorophyll d-containing symbiont           NC_009925 (b)
Prochlorococcus marinus CCMP1375    Marine, unicellular                         NC_005042
P. marinus MED4                     Marine, unicellular                         NC_005072
P. marinus MIT9312                  Marine, unicellular                         NC_007577
P. marinus MIT9313                  Marine, unicellular                         NC_005071
P. marinus NATL2A                   Marine, unicellular                         NC_007335
Synechococcus sp. CC9605            Marine, unicellular                         NC_007516
Synechococcus sp. CC9902            Marine, unicellular                         NC_007513
Synechococcus sp. RS9917            Marine, unicellular                         IP; NZ_AANP00000000 (b)
Synechococcus sp. WH5701            Marine, unicellular                         IP; NZ_AANO00000000 (b)
Synechococcus sp. WH7805            Marine, unicellular                         IP; NZ_AAOK00000000 (b)
Synechococcus sp. WH8102            Marine, unicellular                         NC_005070
Trichodesmium erythraeum            Marine, filamentous diazotroph              NC_008312
Crocosphaera watsonii               Marine, unicellular diazotroph              IP; NZ_AADV00000000 (a)
Clostridium acetobutylicum          Gram positive (outgroup)                    NC_003030
Rhodopseudomonas palustris CGA009   Proteobacterium (outgroup)                  NC_005296

(a) JGI microbial sequencing portal: http://genome.jgi-psf.org/mic_home.html.
(b) J. Craig Venter Institute/University of Warwick collaborative sequencing project (preliminary sequence available via GenBank).

In cases such as these, the inverse
relationship between the number of conserved protein families and the number of taxa tends to yield a uniform total number of phylogenetically informative characters. Finally, using these methods to model cyanobacterial speciation provides a framework for understanding and explaining the distribution of cyanobacterial protein families. Generating a robust “background” tree is crucial for framing key evolutionary events, such as the origin and evolution of capabilities including pigment biosynthesis and carbon and nitrogen fixation, and it provides insight into fundamental evolutionary mechanisms such as niche adaptation, genome reduction, and horizontal gene transfer. This approach can be similarly extended to other phyla to provide a high-resolution framework, based on the totality of evolutionary information from many protein families, which can be linked together to assemble the tree of life.

Methods

All data are publicly available in the form of completed or nearly complete genome sequences (table 1). The pipeline of methods used is diagrammed in figure 2. BlastP comparisons (10^-4 cutoff, BLOSUM62, standard settings for word size, gap opening/extension, and filtering) were made between all protein sequences from the genomes of 24 cyanobacteria and 2 non-cyanobacterial outgroups (see table 1), representing all complete plus a diverse set of nearly complete cyanobacterial genomes, and outgroups from well-sampled bacterial phyla (proteobacteria and Gram positive bacteria). To generate first-pass protein families, Markov clustering (Enright et al. 2002) was performed iteratively on a matrix generated from Blast e values. To optimize clustering results, inflation parameters ranging from 1.2 to 20.0 were used, with resultant protein family/cluster size distributions given in supplementary table S2 (Supplementary Material online).
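The clustering step can be illustrated with a minimal sketch of the Markov clustering iteration (alternating expansion and inflation of a column-stochastic matrix). This is not the MCL program used in the study: the function names, the conversion of e values to edge weights by -log10, the weight cap, and the convergence tolerance are all assumptions made for illustration.

```python
import math


def evalue_to_weight(evalue, cap=200.0):
    """Hypothetical edge weight: -log10 of a Blast e value, capped (assumption)."""
    if evalue <= 0.0:
        return cap
    return min(cap, max(0.0, -math.log10(evalue)))


def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(weights, inflation=2.8, max_iter=100, tol=1e-9):
    """Cluster a symmetric similarity matrix; returns sorted lists of node indices."""
    n = len(weights)
    m = [[float(v) for v in row] for row in weights]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops stabilize the iteration
    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: one matrix squaring spreads flow along longer paths.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then rescaling, favors strong flow.
        inflated = normalize_columns([[v ** inflation for v in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that retains flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Toy graph: proteins 0-2 are mutually similar, as are proteins 3-4.
toy = [[0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0]]
families = markov_cluster(toy, inflation=2.8)
```

Higher inflation values generally split the graph into more, smaller clusters, which is why a range of values is worth scanning before settling on one.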
An inflation parameter of 2.8 yielded the highest number of protein families with single orthologs from all or most (>21) of the 24 cyanobacterial genomes (445 total), as well as families with no more than 2 paralogs (178 total). Note that the smallest cyanobacterial genome analyzed (Prochlorococcus sp. MED4) contains 1,809 proteins; this level of filtering captures nearly 34% of that genome for further phylogenetic analysis. Despite the large number of clusters involved, the Markov clustering method is quite fast (<10 min on a 32-bit/2 GHz AMD desktop PC) and has been argued to have advantages over, for example, COG-based protein family assignment (Harlow et al. 2004). Ultimately, the end goal of these and other clustering methods is identical: to assemble proteins from complete genomes into groups of evolutionarily related orthologs. No matter what heuristic is used, curation is a necessary part of the process. Following clustering, all families were then multiply aligned using ClustalW (Gonnet protein weight matrix, default gap opening/extension penalties) on an MPI-enabled 18 processor AMD Athlon cluster (all pre- and postcurated protein family alignments, as well as scripts for using cluster results for translating complete genomes into protein families, are freely available on request).

FIG. 2. Consensus cyanobacterial phylogeny based on maximum likelihood trees for each of 438 orthologous protein families. The numbers at each bifurcation indicate the total number of trees where that exact bifurcation/branching order is observed; for example, all 438 trees have Synechococcus sp. A and B’ as closest neighbors and also cleanly distinguish marine Synechococcus and Prochlorococcus from all other cyanobacteria. Note that as this is a consensus tree, topology is meaningful but distances are not (as opposed to fig. 3, where tree distances are still meaningful).
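The family selection rule described above (single orthologs from more than 21 of the 24 genomes, or families with no more than 2 paralogs) can be sketched as a small classifier. The data layout (one genome label per protein in a family), the function name, and the category labels are hypothetical, and counting paralogs as copies beyond the first per genome is an assumption.

```python
from collections import Counter


def classify_family(member_genomes, n_genomes=24, max_missing=2, max_paralogs=2):
    """Classify a protein family from the genome label of each member protein."""
    counts = Counter(member_genomes)
    present = len(counts)                      # genomes with at least one copy
    extra = sum(counts.values()) - present     # copies beyond one per genome
    if n_genomes - present > max_missing:
        return "rejected"                      # missing from too many genomes
    if extra == 0:
        return "single_copy"
    if extra <= max_paralogs:
        return "few_paralogs"
    return "rejected"                          # too many paralogs


genomes = [f"g{i}" for i in range(24)]
label = classify_family(genomes[:22])          # present once in 22 of 24 genomes
```

Families labeled `single_copy` or `few_paralogs` would then pass on to alignment and manual curation.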
Using the multiple alignments and corresponding Neighbor-Joining trees generated by ClustalW as a guide, protein families were then manually checked for poor alignments and/or long branch lengths, with poorly aligned sequences and/or poorly assembled protein families either corrected or removed. Most frequently, these problems involved inclusion of a paralog in a protein family, which can be easily detected based on the number of homologs per organism or, often, the presence of long branches in the phylogeny. As depicted in table 2, these curated protein families were then parsed using various filters, for example, selecting protein families present in all or most cyanobacteria, any imaginable subset of organisms, or by selecting protein families that all share a common function or annotation. The full protein family spreadsheet is available as supplementary table S2 (Supplementary Material online). In addition to the distance-based trees generated during multiple alignment, phylogenies based on single-protein families were generated for every aligned protein family using 2 different maximum likelihood methods. The first approach used PHYLIP’s ProML package with the following parameters: JTT probability model, one category of sites with constant rate, and with randomized input order (Felsenstein 1989). Additionally, a second, quartet-based maximum likelihood approach was used with the parallelized version of iqpnni, here using the Whelan and Goldman substitution model and estimating a gamma parameter with 4 rate categories (Minh et al. 2005). PHYLIP’s CONSENSE package was used to generate extended majority rule consensus phylogenies for each separate set of phylogenies (distance and both ML runs; iqpnni run results shown in fig. 2). Concatenated multiple alignments were generated by end-to-end attachment of individual protein families, using gaps as placeholders for species missing a particular ortholog.
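The concatenation step can be sketched as follows. This is an illustrative reimplementation, not the authors' script; the function name and the representation of each alignment as a dictionary from taxon name to aligned sequence are assumptions.

```python
def concatenate_alignments(alignments, taxa, gap="-"):
    """Join per-family alignments end to end, padding absent taxa with gaps."""
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = widths.pop()
        for taxon in taxa:
            # A taxon lacking this ortholog contributes a run of gap characters.
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


family_1 = {"A": "MK-", "B": "MKL"}            # taxon C lacks this ortholog
family_2 = {"A": "QR", "C": "QK"}              # taxon B lacks this ortholog
supermatrix = concatenate_alignments([family_1, family_2], ["A", "B", "C"])
```

Every taxon ends up with a sequence of identical length, which is what downstream tree-building programs require.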
As an additional test of robustness, variable/uninformative positions were filtered out of these concatenated alignments using progressively more stringent Shannon information entropy cutoffs (SIE 1.0–3.0) and filtering out positions with >50% gaps. The resulting concatenated alignments from all 26 genomes ranged from 28,281 (SIE 1.0) to 230,415 (full/unfiltered concatenation) aligned amino acid positions and contained up to 300,000 aligned positions in the case of the Prochlorococcus/Synechococcus-conserved protein families (fig. 4). PHYLIP ProML and Neighbor-Joining phylogenies were then constructed for each of these filtered concatenated alignments to determine the effect of removing gaps and progressively more variable sites from alignments (see, e.g., discussion of difference in support for the Prochlorococcus/Synechococcus clade in the main text).

Table 2
Parsing Protein Families Based on Different Criteria*
*See supplementary table S2 (Supplementary Material online) for full spreadsheet.

Our final goal was to test the effect of correcting for site heterogeneity in concatenated alignments by incorporating a gamma parameter, rather than strictly filtering out variable regions of alignments. The size and associated memory requirements of inferring gamma-corrected phylogenies for these concatenated alignments required they be analyzed using MrBayes (Huelsenbeck and Ronquist 2001). MrBayes was run using the VT evolutionary model, incorporating a gamma parameter sampled from 4 rate categories, with the substitution model analyzed over 20,000–30,000 generations in 4 separate runs and a 1,000-generation burn-in. Because of the large data set and number of free parameters in the model, MrBayes required a 64-bit dual CPU system with 8 GB RAM.
Although this limited the number of generations and discrete chains, in all cases topological convergence to the consensus phylogeny was achieved within 5,000–8,000 generations and was maintained throughout all the remaining runs. In addition, this topology was also observed in phylogenetic inference using the Neighbor-Joining algorithm as implemented in MEGA v3.0, using multiple models and incorporating a gamma parameter (Kumar et al. 1994), and (topologically) agreed with the trees obtained using PHYLIP’s ProML on entropy-filtered concatenated alignments, as discussed in the main text. All concatenated alignments and phylogenies are available upon request. Tree comparisons used PHYLIP’s consense, using the extended majority rule method and both the symmetric (Robinson–Foulds) and branch score distance metrics. Comparisons also included 50 trees composed of the same cyanobacterial taxa arranged in randomized topologies. As illustrated in figure 5, core- and pan-genome numbers are determined for a specific rooted phylogeny by 1) counting the number of protein families conserved within all descendants of a particular node in the tree (core) and 2) counting the total number of protein families present in the descendants of a particular node in the tree (pan).

Results and Discussion

The 24 genomes analyzed here represent all cyanobacteria with either complete or very nearly complete sequencing projects and encompass nearly 94,000 protein-coding genes. Homology-based Markov clustering resulted in 7,378 families of proteins present in more than 1 cyanobacterium (an additional 12,955 protein families were found only in a single cyanobacterium). Many of these families include multiple, often closely related paralogs. For example, the D1 and D2 proteins of the photosystem II reaction center complex are members of the same family, and ABC transporter and serine/threonine kinase paralogs are quite extensive even in the smallest cyanobacterial genomes.
To avoid problems associated with inclusion of paralogs in phylogenies, initial analysis focused on families with few or no paralogs present in most or all cyanobacteria, which includes housekeeping proteins common to most organisms as well as cyanobacterial-specific proteins that have been important during their evolution and early diversification. Following the initial clustering, 613 protein families fit the criterion of being absent in not more than 2 cyanobacteria and having not more than 2 paralogs in total for all organisms. Alignments and Neighbor-Joining phylogenies for all families were manually checked, and poorly aligned proteins (as well as those with disproportionately long branch lengths; for details, see Methods) were removed from alignments or else the family was removed from the analysis. A total of 583 protein families remained after this manual curation. Here we focus on a substantial number of relatively easily obtained families of orthologs, selected by a fast clustering approach that minimizes the number of paralogs while maximizing the total number of genomes represented in a given protein family (see supplementary table S1, Supplementary Material online). Phylogenies for each of the 583 families were constructed using 2 different implementations of the maximum likelihood method (PHYLIP and quartet-based iqpnni; see Methods). A total of 438 of these families (those composed strictly of orthologs) were then used to generate a consensus phylogeny that portrays the bifurcations that occur most frequently across all trees (fig. 2). For example, both the marine Synechococcus/Prochlorococcus (11 organisms) and the Synechococcus sp. A and B’ clusters are conserved in every tree generated, and the Nostocales clade is observed in 421 of 438 trees.
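The per-node support values reported for the consensus tree (for example, 421 of 438 trees) amount to counting how often each grouping of taxa recurs across the single-protein trees. The sketch below counts clades in rooted trees written as nested tuples. It is a simplification under stated assumptions: PHYLIP's CONSENSE has its own input format and can treat trees as unrooted, and the function names and tree encoding here are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):              # a leaf is a taxon name
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)                     # record every internal node
        return members

    leaves(tree)
    return found


def clade_support(trees):
    """Count the number of trees in which each clade appears."""
    support = Counter()
    for tree in trees:
        support.update(clades(tree))
    return support


gene_trees = [
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
]
support = clade_support(gene_trees)
```

A clade whose count equals the number of trees is recovered by every protein family, whereas a low count flags a node with only minority support.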
Importantly, only minority support is observed for several nodes on the tree, especially among the cyanobacteria often argued as among the earliest branching (Gloeobacter), which may indeed reflect asymmetric rates of evolution, as well as for some members of the Prochlorococcus lineages, which recent studies suggest may result from horizontal gene transfer (Beiko et al. 2005). The ability to detect this phylogenetic incoherence is a crucial step in being able to segregate both protein families and organisms that are responsible. An attractive, iterative approach would take these into account by fine-tuning parameters of ascribed evolutionary models or progressively removing “difficult” protein families from tree-building methods that rely on combined data sets. This consensus phylogeny gives a straightforward method for finding putative horizontal gene transfer events and indicates that gene transfer “across” the tree, that is, between Prochlorococcus/marine Synechococcus and cyanophytes, is very rare among this particular subset of proteins. Note that as these proteins are common to almost all cyanobacteria, a very specific type of horizontal gene transfer (orthologous gene replacement) must occur, whereby a newly transferred gene displaces a functional wild-type gene. Importantly, though recent evidence indeed supports an important role for horizontal gene transfer among cyanobacteria (Zhaxybayeva et al. 2006), simulations suggest that these phylogenetic signals are not self-reinforcing and, even when corrections are not made for variations in evolutionary rate or composition, convergence to the true tree is frequently observed (Gadagkar et al. 2005). Indeed, Zhaxybayeva et al. (2006) obtained a plurality tree based on quartet reconstruction with which the consensus and concatenated trees presented here are consistent.
In addition to individual and consensus phylogenies, all alignments without paralogs were concatenated into a single large alignment containing 230,415 positions encompassing 26 organisms. Smaller alignments were generated from this full alignment using a Shannon information entropy–based filter (Reche and Reinherz 2003) to remove phylogenetically uninformative (too variable or too conserved) sites from the alignment. Shannon entropy can be calculated for each position in an alignment and provides a more robust method for parsing informative positions from alignment than simply culling positions that fall below a given percentage identity or similarity. For example, a position in a protein sequence alignment might have 1 amino acid in half of the sequences and a different amino acid in the other half. If a percentage-based cutoff were used, this position would contain the same informative value as one where half the positions were 1 amino acid and the other half were all different amino acids. However, the Shannon entropy score of these 2 examples is quite different and, furthermore, is conceptually similar to maximum likelihood calculations. Phylogenies for all concatenated alignments were generated as discussed in the methods and showed overall agreement with one another, with one notable exception—differing levels of filtering (Shannon entropy cutoff values ranging from 1 to 4, where 0 is an invariant site and 4.322 is a site where all 20 amino acids are equally represented) resulted in 2 distinct trees differing by monophyly of the Prochlorococcus/Synechococcus clades. One of the trees—shown in figure 3—was converged upon from multiple MrBayes runs using the full/unfiltered data set. This tree is characterized by separate/monophyletic Prochlorales (the order containing Prochlorococcus species) and marine Synechococcus clades, with Synechococcus sp. strain WH 5701 basal to both groups, a topology supported in previous single-gene trees (Rocap et al. 
2002; Scanlan 2003). Notably, this tree was in almost exact agreement with the consensus phylogeny generated from 438 trees (with the exception of the poorly supported Acaryochloris marina/ Thermosynechococcus elongatus clade, resolved as 2 distinct lineages in the concatenated tree). Although the observed convergence to a single tree from 2 different approaches lends support to this as the true tree, the fact that a different tree was inferred from some filtered concatenated alignments underscores the importance of using multiple methods of analysis to infer phylogenies. Shannon entropy presents a metric for pruning highly variable (less phylogenetically informative) positions from long alignments, making phylogenetic analysis more tractable. However, care must be taken that evolutionary models are compared each time a data set is filtered as it is feasible that the best model can change as positions are pruned from an alignment. Even character-rich data sets can be prone to error, in particular when they contain multiple phylogenetic signals or include highly divergent or deeply branching organisms (Mossel and Steel 2006). As is evident in figures 3 and 4, order Prochlorales shows anomalously long-branch lengths, evident both in individual as well as concatenated phylogenies, that may account for the alternative topology seen in some filtered concatenated phylogenies (this alternate topology is illustrated by the dashed line in fig. 4). However, one of the trees is converged to in both concatenated and consensus phylogenies, lending support to this as the true tree. As a further test, we demonstrate one of the advantages of our approach by incorporating additional information from protein families excluded from the initial analysis because they were not present in most or all cyanobacteria. Specifically, 1,108 protein families are found in all Prochlorococcus and marine Synechococcus species (including WH 5701). 
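The Shannon entropy filter discussed above can be made concrete with a short sketch. It computes the entropy (in bits) of each alignment column and keeps columns at or below a cutoff that also contain no more than 50% gaps. The function names, the exclusion of gaps from the entropy calculation, and the choice to keep columns with entropy less than or equal to the cutoff are assumptions, not details confirmed by the text.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy in bits of the residues in one column (gaps ignored)."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())


def filter_alignment(seqs, max_entropy, max_gap_fraction=0.5, gap="-"):
    """Keep columns with entropy <= max_entropy and few enough gaps."""
    names = list(seqs)
    length = len(seqs[names[0]])
    keep = []
    for i in range(length):
        column = [seqs[name][i] for name in names]
        gap_fraction = column.count(gap) / len(column)
        if gap_fraction <= max_gap_fraction and column_entropy(column, gap) <= max_entropy:
            keep.append(i)
    return {name: "".join(seqs[name][i] for i in keep) for name in names}


# The two cases contrasted in the text: half/half versus half plus all-different.
two_state = column_entropy("AAAABBBB")
half_mixed = column_entropy("AAAACDEF")
filtered = filter_alignment(
    {"s1": "AA-A", "s2": "AC-A", "s3": "AD-C", "s4": "AE-C"}, max_entropy=1.0)
```

The first column type scores 1 bit and the second 2 bits, even though both have the same majority residue frequency, which is the distinction a percentage-identity cutoff would miss. A column with all 20 amino acids equally represented scores log2(20), about 4.322 bits, matching the upper bound quoted in the text.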
A total of 848 of these families have no paralogs within either of these clades, and so individual and consensus/concatenated phylogenies can be generated for this Prochlorococcus/Synechococcus-specific subset of families. As shown in figure 4, phylogeny based on 848 concatenated protein families (287,466 aligned positions in 11 Prochlorococcus/Synechococcus genomes) supports a branching order in agreement with both the consensus and fully concatenated data sets. Moreover, the resulting phylogeny also retains the relatively long branch lengths characteristic of several members of the Prochlorales clade, suggesting that an accelerated substitution rate across many proteins has accompanied genome reduction. Prochlorococcus genome analyses have observed this long-branch effect, which is likely due to loss of several DNA repair capabilities during genome reduction (Dufresne et al. 2005). The single phylogeny converged upon by multiple methods used herein also provides a framework for understanding the distribution of protein families at each ancestral node on the tree (Martin et al. 2002; Eisen and Fraser 2003; Lerat et al. 2003).

FIG. 2—Bayesian maximum likelihood tree for the full concatenated data set, based on 230,415 aligned positions in 26 genomes. Note the strong agreement with the consensus tree from figure 2, as well as the presence of non-cyanobacterial outgroups that support Gloeobacter violaceus as an early-branching cyanobacterium. The scale bar indicates the number of substitutions per site. Shown at each bifurcation are the predicted core-genome (upper number) and pan-genome (lower number) sizes of an ancestor at that point. The core-genome represents the intersection of all protein families in all progeny of an ancestor, whereas the pan-genome represents the union of all protein families in those progeny (the 2 numbers converge at the tips of the tree).

As shown in figure 3, the common ancestor of all cyanobacteria is inferred to have had a conserved core of 361 protein families as these are present in the full set of 26 genomes analyzed. A total of 675 protein families (within which the 361 are nested) are common to all 24 cyanobacterial genomes analyzed, though as mentioned, many of these families contain paralogs and so were excluded from this analysis. These families represent a widely conserved core of housekeeping proteins common not only across known cyanobacterial diversity but also present to some extent in non-cyanobacterial genomes. Furthermore, the total diversity of modern cyanobacterial protein families—the union of all protein families in all progeny of an ancestor—is inferred to be just over 20,000 proteins for the cyanobacterial common ancestor and 25,292 when including the non-cyanobacterial outgroups. This is referred to as the cyanobacterial pan-genome (which, it must be emphasized, never actually existed but simply captures the extent of protein family variability across the phylum), illustrated along with the core-genome concept in figure 5. These pan- and core-genome numbers provide upper and lower bounds on protein family distributions at each node in a given phylogeny and are not parsimony-based estimates of the true genetic content of ancient organisms.

FIG. 4—Using phylogeny and the distribution of protein families in different genomes to infer ancestral characteristics. As illustrated in the diagram, each bifurcation represents an ancestor whose core-genome contains the protein families found in every one of its descendants (the intersection of descendant genomes), whereas the pan-genome contains all protein families found in all descendants (the union of descendant genomes).

FIG. 3—Maximum likelihood phylogeny of 848 concatenated orthologous protein families (~290,000 aligned amino acid positions) common to 11 Prochlorococcus and marine Synechococcus genomes. By incorporating a larger number of protein families shared in a smaller number of closely related organisms, we find strong support for 1 of 2 topologies found in 26 genome trees, effectively improving the resolution of the consensus tree. The alternative topological position of Synechococcus sp. WH5701, observed in some filtered concatenated trees as discussed in the text, is illustrated by the dashed line.

The core-genome at the base of the cyanobacterial phylum encompasses most of the major proteins of the photosynthetic apparatus, suggesting that oxygenic photosynthesis evolved prior to or early in the cyanobacterial radiation. This is in stark contrast with the ability to fix nitrogen, which is found paraphyletically throughout the cyanobacterial tree (illustrated in fig. 6a—N2-fixing lineages denoted by “+”). The nodes where nitrogen fixation is inferred—that is, whose descendent lineages all fix nitrogen—occur at multiple points across the tree (gray squares on fig. 6a) so that gene loss, horizontal gene transfer, or some combination of these processes must be invoked to explain the distribution of nitrogen fixation. The strength of having both combined and individual phylogenies comes from the capability to contrast the background tree of cyanobacterial speciation (figs. 2 and 3) with the evolutionary tree for nitrogenase. For example, based on the species tree, one plausible scenario is that nitrogenase was acquired on independent occasions within cyanobacterial lineages (e.g., horizontal gene transfers would be required at the gray +’s in fig. 6a), followed by largely vertical evolution to result in the observed distribution in the phylum. Alternatively (and arguably less parsimoniously), one could posit that the ancestor of all cyanobacteria had the capability to fix nitrogen but that the nitrogenase evolutionary history has since been dominated by gene loss.
This scenario begins with nitrogen fixation in the hypothetical pan-genome and is followed by multiple independent losses, shown as x’s on figure 6a. By examining phylogenies of individual protein families, for example, that of the NifD (nitrogen fixation catalytic subunit) protein family shown in figure 6b, we can explore whether one of these scenarios is indeed more parsimonious than the other or if some combination of the 2 is more likely. The NifD tree (fig. 6b) shows some congruence with the cyanobacterial species tree (fig. 6a) but provides an important example of the complex history of protein families, often overlooked or not accurately captured in species trees. As well as supporting numerous gene losses, the NifD tree shows evidence for several gene duplications and plausible horizontal gene transfer, as suggested by the position of Trichodesmium erythraeum, which forms the earliest cyanobacterial branch among NifD proteins (though note that poor bootstrap support makes it difficult to resolve this from the Synechococcus sp. A/B’ divergence). At face value, this indeed suggests that a combination of vertical evolution and gene loss accounts for the distribution of nitrogen fixation in cyanobacteria, with evidence for horizontal gene transfer as well as duplication in several lineages. As with this truth-is-in-between example, the cyanobacterial ancestor would have had a genome content somewhere between the core- and pan-ancestral extremes, with functions and capabilities that, as demonstrated above, can be understood by examining individual phylogenies. In a broader sense, the range established by ancestral core- and pan-genomes gives insight into the relative importance of genome reduction versus the evolution or acquisition of new genes and helps constrain the appearance of phenotypes specific to individual organisms or clades.
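The core- and pan-genome bounds at each ancestral node amount to set intersections and unions over the descendants of that node. The following Python sketch shows the idea on a toy tree. The genome names, family labels, and tree are invented for illustration and are not data from this study, and the nested-tuple tree encoding is simply a convenient assumption.

```python
def core_and_pan(node, genomes, results):
    """Recursively compute (core, pan, leaves) for a nested-tuple tree.

    Leaves are genome names (str); internal nodes are tuples of children.
    `results` maps each internal node's sorted leaf names to (core, pan).
    """
    if isinstance(node, str):
        families = frozenset(genomes[node])
        return families, families, (node,)
    cores, pans, leaves = [], [], ()
    for child in node:
        core, pan, child_leaves = core_and_pan(child, genomes, results)
        cores.append(core)
        pans.append(pan)
        leaves += child_leaves
    core = frozenset.intersection(*cores)  # families found in every descendant
    pan = frozenset.union(*pans)           # families found in any descendant
    results[tuple(sorted(leaves))] = (core, pan)
    return core, pan, leaves

# Toy, made-up input: three hypothetical genomes and their protein families.
genomes = {
    "G1": {"psbA", "rbcL", "nifD"},
    "G2": {"psbA", "rbcL", "nifD", "orphan1"},
    "G3": {"psbA", "rbcL"},
}
tree = (("G1", "G2"), "G3")

results = {}
core_and_pan(tree, genomes, results)
for leaves, (core, pan) in results.items():
    print(leaves, "core:", len(core), "pan:", len(pan),
          "nifD in core:", "nifD" in core)
```

In this toy case the nifD label sits in the core of the (G1, G2) ancestor but only in the pan-genome of the root, which is the situation that leaves both the gain and the loss scenario open.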
FIG. 5—(a) Possible scenarios for the distribution of nitrogen fixation in cyanobacteria, contrasting convergence versus gene loss as suggested by the protein family composition of core- and pan-ancestral genomes. The nitrogen fixation pathway is found in 7 genomes (black + next to species name). Pan-ancestral genome data posit that nitrogen fixation arose before the cyanobacterial common ancestor and many of its descendants (black dots) but was subsequently lost in many lineages (black x’s). Core composition of ancestral genomes suggests that the ability to fix nitrogen appeared independently 3 times (gray +’s; gray boxes indicate ancestral nodes where N2-fixation was present). (b) The phylogenetic tree from protein family 1574—the catalytic molybdenum–iron subunit of the nitrogenase complex (see e.g., table 2). Though many lineages are missing, the species present have a phylogeny similar to that observed in the species tree, suggesting largely vertical evolution with multiple gene losses. The exception is the distinct position of Trichodesmium erythraeum—not closely related to the Nostocales as in 5a, suggesting that horizontal gene transfer may have been important early in the evolution of cyanobacterial nitrogen fixation.

This approach is extended to several other pathways of key importance to cyanobacterial evolution, such as carbon fixation and pigment biosynthesis, in Swingley et al. (2007). As shown in figure 7, the size of the core-genome shared by any 2 organisms shows a strong inverse correlation with their phylogenetic distance, whereas the pan-genome size shows only weak correlation. This results mainly from the presence of novel/orphan genes that distinguish even closely related genomes, such as the 2 Synechococcus elongatus strains with 2,219 shared protein families.
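The relationship between pairwise core-genome size and evolutionary distance can be explored with a simple single-exponential fit. The Python sketch below makes several assumptions: it fits a straight line to the logarithm of core size by ordinary least squares, so its r² is computed in log space and need not match the value of a nonlinear fit; the function names are hypothetical; and the numbers in the usage lines are synthetic rather than measurements from this study.

```python
import math

def pairwise_sizes(families_a, families_b):
    """Core (intersection) and pan (union) sizes for two genomes."""
    return len(families_a & families_b), len(families_a | families_b)

def fit_exponential(distances, sizes):
    """Fit size = a * exp(-b * distance) by least squares on ln(size).

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the straight-line fit in log space.
    """
    logs = [math.log(s) for s in sizes]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(distances, logs))
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    return math.exp(intercept), -slope, 1.0 - ss_res / ss_tot

# Synthetic example: core sizes that decay exactly exponentially.
distances = [0.0, 0.5, 1.0, 1.5]
core_sizes = [2000 * math.exp(-1.2 * d) for d in distances]
a, b, r2 = fit_exponential(distances, core_sizes)
print(round(a), round(b, 3), round(r2, 3))
```

Applying `pairwise_sizes` to every pair of genomes and pairing each result with the tree distance between the two genomes gives the inputs for the fit.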
FIG. 6.—Change in core- or pan-genome size at increasing evolutionary distances for the cyanobacterial tree in figure 3. The black dots (left axis) indicate the core-genome size versus evolutionary distance between all pairwise combinations of the 26 genomes analyzed. The black line shows a single exponential fit (r² = 0.802). The gray dots (right axis) give the same information for the pan-genome size.

As mentioned above, the major elements of the cyanobacterial species tree find strong support in other analyses, from both systematic and molecular approaches. These include monophyly of heterocystous diazotrophs with the nonheterocystous diazotroph Trichodesmium erythraeum as an outgroup (Sanchez-Baracaldo et al. 2005); the sister relationship and monophyly of marine Synechococcus and Prochlorales (with Synechococcus sp. WH5701 basally branching) (Scanlan 2003) and a more deeply branching group of freshwater Synechococcus (PCC6301 and 7942) (Giovannoni et al. 1988; Honda et al. 1999); the cluster of Synechocystis sp. PCC6803 and Crocosphaera watsonii (Sanchez-Baracaldo et al. 2005); and evidence for Gloeobacter violaceus as an early-branching cyanobacterium (Nelissen et al. 1995), though intriguingly 2 thermophilic, N2-fixing Synechococcus strains also branch very deeply (Ferris et al. 1996). Note that this approach, like any, is subject to biases in ongoing sequencing projects and is therefore missing several important cyanobacterial taxonomic groups; however, it establishes a framework for incorporating further genomic data as well as expanding individual protein families with sequence data from public databases. This also provides a straightforward approach with which to target sequencing strategies toward organisms that will most improve phylogenetic resolution. The phylogenies presented here integrate a large amount of genomic data from all completed, as well as a few nearly complete, cyanobacterial genomes.
The fact that concatenated and consensus phylogenies from as many as 583 proteins converge on nearly identical topologies that agree with earlier systematic and molecular approaches suggests that this tree represents an accurate, though averaged, history of cyanobacterial speciation. Moreover, phylogenies from individual protein families are retained and can be selected and contrasted based on overall resolution, taxonomic distribution, degree of orthology versus paralogy, or various functional or pathway-associated criteria (e.g., table 2). Though attempting to resolve organismal evolution as a single phylogenetic tree invariably ignores the rich histories of single genes, here we have emphasized how organismal history can be understood at one level by integrating the information present in diverse genes and on additional levels by contrasting that integrated tree with individual phylogenies. This telescoping approach to phylogenetic reconstruction—incorporating data from protein sequences at multiple taxonomic levels of conservation—can be used to refine evolutionary trees at different levels of phylogenetic resolution. Furthermore, inference of robust phylogenies stands as a primary technique by which horizontal gene transfer can be detected (and then be subtracted from consensus data sets). As genome data continue to fill out the branches of the tree of life, this approach will become increasingly useful as it provides a way to incorporate, compare, and contrast entire genomes’ worth of sequence data, without ignoring information from individual genes or proteins. Accession Numbers Accession numbers for genomes used in this study are given in table 1. FIG. 7.—Diagram of the steps involved in going from complete genomes to phylogenetic analysis, as detailed in the Methods. 
Supplementary Material

Supplementary figure S1 and tables S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

The authors wish to thank Jeff Touchman and the DNA sequencing team at the Translational Genomics Institute for making available sequence data for Acaryochloris marina. The authors also acknowledge very helpful discussions and suggestions from Carrine Blank and Elbert Branscomb. The A. marina genome project is funded by grant 0412824 from the National Science Foundation Microbial Genome Sequencing Program (http://genomes.tgen.org/). R.B. acknowledges additional support from grant NNG04GK59G from the Exobiology Program at the National Aeronautics and Space Administration. J.R. acknowledges support through a Lawrence Postdoctoral Fellowship at Lawrence Livermore National Laboratory.

Literature Cited

Beiko RG, Harlow TJ, Ragan MA. 2005. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA. 102:14332–14337.
Casamatta DA, Johansen JR, Vis ML, Broadwater ST. 2005. Molecular and morphological characterization of ten polar and near-polar strains within the Oscillatoriales (cyanobacteria). J Phycol. 41:421–438.
Castenholz RW. 2001. Phylum BX. Cyanobacteria. Oxygenic photosynthetic bacteria. In: Boone DR, Castenholz RW, editors. Bergey’s manual of systematic bacteriology. Volume 1: the Archaea and deeply branching and phototrophic Bacteria. New York: Springer-Verlag. p. 413–439.
Charlebois RL, Beiko RG, Ragan MA. 2003. Microbial phylogenomics: branching out. Nature. 421:217.
Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. 2006. Toward automatic reconstruction of a highly resolved tree of life. Science. 311:1283–1287.
Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361–375.
Doolittle RF. 2005. Evolutionary aspects of whole-genome biology. Curr Opin Struct Biol. 15:248–253.
Dufresne A, Garczarek L, Partensky F. 2005. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol. 6:R14.
Eisen JA. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8:163–167.
Eisen JA, Fraser CM. 2003. Phylogenomics: intersection of evolution and genomics. Science. 300:1706–1707.
Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30:1575–1584.
Farquhar J, Bao H, Thiemens M. 2000. Atmospheric influence of earth’s earliest sulfur cycle. Science. 289:756–759.
Felsenstein J. 1989. PHYLIP—Phylogeny inference package (Version 3.2). Cladistics. 5:164–166.
Ferris MJ, Ruff-Roberts AL, Kopczynski ED, Bateson MM, Ward DM. 1996. Enrichment culture and microscopy conceal diverse thermophilic Synechococcus populations in a single hot spring microbial mat habitat. Appl Environ Microbiol. 62:1045–1050.
Fox GE, Wisotzkey JD, Jurtshuk P Jr. 1992. How close is close: 16S rRNA sequence identity may not be sufficient to guarantee species identity. Int J Syst Bacteriol. 42:166–170.
Gadagkar SR, Rosenberg MS, Kumar S. 2005. Inferring species phylogenies from multiple genes: concatenated sequence tree versus consensus gene tree. J Exp Zoolog B Mol Dev Evol. 304:64–74.
Giovannoni SJ, Turner S, Olsen GJ, Barns S, Lane DJ, Pace NR. 1988. Evolutionary relationships among cyanobacteria and green chloroplasts. J Bacteriol. 170:3584–3592.
Gogarten JP, Townsend JP. 2005. Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol. 3:679–687.
Harlow TJ, Gogarten JP, Ragan MA. 2004. A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics. 5:45.
Henson BJ, Hesselbrock SM, Watson LE, Barnum SR. 2004. Molecular phylogeny of the heterocystous cyanobacteria (subsections IV and V) based on nifD. Int J Syst Evol Microbiol. 54:493–497.
Hess WR. 2004. Genome analysis of marine photosynthetic microbes and their global role. Curr Opin Biotechnol. 15:191–198.
Hess WR, Rocap G, Ting CS, Larimer F, Stilwagen S, Lamerdin J, Chisholm SW. 2001. The photosynthetic apparatus of Prochlorococcus: insights through comparative genomics. Photosynth Res. 70:53–71.
Honda D, Yokota A, Sugiyama J. 1999. Detection of seven major evolutionary lineages in cyanobacteria based on the 16S rRNA gene sequence analysis with new sequences of five marine Synechococcus strains. J Mol Evol. 48:723–739.
Huelsenbeck JP, Ronquist F. 2001. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics. 17:754–755.
Knoll AH. 2003. The geological consequences of evolution. Geobiology. 3–14.
Kopp RE, Kirschvink JL, Hilburn IA, Nash CZ. 2005. The paleoproterozoic snowball earth: a climate disaster triggered by the evolution of oxygenic photosynthesis. Proc Natl Acad Sci USA. 102:11131–11136.
Kumar S, Tamura K, Nei M. 1994. MEGA: molecular evolutionary genetics analysis software for microcomputers. Comput Appl Biosci. 10:189–191.
Lerat E, Daubin V, Moran NA. 2003. From gene trees to organismal phylogeny in prokaryotes: the case of the γ-Proteobacteria. PLoS Biol. 1:E19.
Martin KA, Siefert JL, Yerrapragada S, Lu Y, McNeill TZ, Moreno PA, Weinstock GM, Widger WR, Fox GE. 2003. Cyanobacterial signature genes. Photosynth Res. 75:211–221.
Martin W, Rujan T, Richly E, Hansen A, Cornelsen S, Lins T, Leister D, Stoebe B, Hasegawa M, Penny D. 2002. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc Natl Acad Sci USA. 99:12246–12251.
Minh BQ, Vinh le S, von Haeseler A, Schmidt HA. 2005. pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics. 21:3794–3796.
Mossel E, Steel M. 2006. How much can evolved characters tell us about the tree that generated them? In: Gascuel O, editor. Mathematics of evolution and phylogeny. Oxford: Oxford University Press. p. 384–412.
Mulkidjanian AY, Koonin EV, Makarova KS, et al. (12 co-authors). 2006. The cyanobacterial genome core and the origin of photosynthesis. Proc Natl Acad Sci USA. 103:13126–13131.
Nelissen B, Van de Peer Y, Wilmotte A, De Wachter R. 1995. An early origin of plastids within the cyanobacterial divergence is suggested by evolutionary trees based on complete 16S rRNA sequences. Mol Biol Evol. 12:1166–1173.
Reche PA, Reinherz EL. 2003. Sequence variability analysis of human class I and class II MHC molecules: functional and structural correlates of amino acid polymorphisms. J Mol Biol. 331:623–641.
Rippka R, Deruelles J, Waterbury JB, Herdman M, Stanier RY. 1979. Generic assignments, strain histories and properties of pure cultures of cyanobacteria. J Gen Microbiol. 111:1–61.
Rivera MC, Lake JA. 2004. The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature. 431:152–155.
Rocap G, Distel DL, Waterbury JB, Chisholm SW. 2002. Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S-23S ribosomal DNA internal transcribed spacer sequences. Appl Environ Microbiol. 68:1180–1191.
Sanchez-Baracaldo P, Hayes PK, Blank CE. 2005. Morphological and habitat evolution in the cyanobacteria using a compartmentalization approach. Geobiology. 3:145–165.
Scanlan DJ. 2003. Physiological diversity and niche adaptation in marine Synechococcus. Adv Microb Physiol. 47:1–64.
Seo PS, Yokota A. 2003. The phylogenetic relationships of cyanobacteria inferred from 16S rRNA, gyrB, rpoC1 and rpoD1 gene sequences. J Gen Appl Microbiol. 49:191–203.
Snel B, Huynen MA, Dutilh BE. 2005. Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59:191–209.
Summons RE, Jahnke LL, Hope JM, Logan GA. 1999. 2-Methylhopanoids as biomarkers for cyanobacterial oxygenic photosynthesis. Nature. 400:554–557.
Swingley WD, Blankenship RE, Raymond J. 2007. Insights into cyanobacterial evolution from comparative genomics. In: Herrero A, Flores E, editors. Genomics and molecular biology of cyanobacteria. Norwich (UK): Horizon Scientific Press. p. 22–43.
Woese CR. 1987. Bacterial evolution. Microbiol Rev. 51:221–271.
Wolf YI, Rogozin IB, Grishin NV, Koonin EV. 2002. Genome trees and the tree of life. Trends Genet. 18:472–479.
Zhaxybayeva O, Gogarten JP, Charlebois RL, Doolittle WF, Papke RT. 2006. Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res. 16:1099–1108.

Takashi Gojobori, Associate Editor
Accepted December 26, 2007