International Journal of Systematic and Evolutionary Microbiology (2012), 62, 716–721 DOI 10.1099/ijs.0.038075-0 Introducing EzTaxon-e: a prokaryotic 16S rRNA gene sequence database with phylotypes that represent uncultured species Ok-Sun Kim,134 Yong-Joon Cho,13 Kihyun Lee,1 Seok-Hwan Yoon,1 Mincheol Kim,1 Hyunsoo Na,1 Sang-Cheol Park,2 Yoon Seong Jeon,2 Jae-Hak Lee,2 Hana Yi,1 Sungho Won3 and Jongsik Chun1,2 Correspondence 1 Jongsik Chun 2 School of Biological Sciences, Seoul National University, Seoul, Republic of Korea Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea [email protected] 3 Department of Statistics, Chung-Ang University, Seoul, Republic of Korea Despite recent advances in commercially optimized identification systems, bacterial identification remains a challenging task in many routine microbiological laboratories, especially in situations where taxonomically novel isolates are involved. The 16S rRNA gene has been used extensively for this task when coupled with a well-curated database, such as EzTaxon, containing sequences of type strains of prokaryotic species with validly published names. Although the EzTaxon database has been widely used for routine identification of prokaryotic isolates, sequences from uncultured prokaryotes have not been considered. Here, the next generation database, named EzTaxon-e, is formally introduced. This new database covers not only species within the formal nomenclatural system but also phylotypes that may represent species in nature. In addition to an identification function based on Basic Local Alignment Search Tool (BLAST) searches and pairwise global sequence alignments, a new objective method of assessing the degree of completeness in sequencing is proposed. All sequences that are held in the EzTaxon-e database have been subjected to phylogenetic analysis and this has resulted in a complete hierarchical classification system. It is concluded that the EzTaxon-e database provides a useful taxonomic backbone for the identification of cultured and uncultured prokaryotes and offers a valuable means of communication among microbiologists who routinely encounter taxonomically novel isolates. The database and its analytical functions can be found at http://eztaxon-e.ezbiocloud.net/. INTRODUCTION Despite recent advances in molecular biology and in the development of commercially available phenotype-based identification kits, identifying bacterial strains remains a difficult task for many routine microbiological laboratories. Among the several thousand genes within a bacterial genome, the 16S rRNA gene has served as the primary key for phylogeny-based identification when compared against well-curated 16S rRNA gene sequence databases (RossellóMora & Amann, 2001; Tindall et al., 2010). In a previous study, we introduced a manually curated database, called EzTaxon (http://www.eztaxon.org/), which contains 16S rRNA gene sequences of the type strains of species with validly published names (Chun et al., 2007). This database has been widely used for bacterial identification worldwide, and more than 1 000 000 identification jobs have been performed by microbiologists from more than 50 countries. Here, we formally introduce the next generation database of EzTaxon, named EzTaxon-e. This database covers not only prokaryotic species with validly published names but also species with non-validly published names and Candidatus taxa (Murray & Stackebrandt, 1995), as well as representatives of yet uncultured phylotypes, using new analytical functions. In addition, all species and phylotypes have been assigned to a complete hierarchical system (from species to phylum) by using comprehensive phylogenetic analyses based on 16S rRNA gene sequences. METHODS 3These authors contributed equally to this work. 4Present address: Division of Polar Biological Sciences, Korea Polar Research Institute, Incheon 406-840, Republic of Korea. Abbreviation: 716 BLAST, basic local alignment search tool. Collection of 16S rRNA gene sequences. In addition to the 16S rRNA gene sequences that were previously maintained at the EzTaxon database (http://www.eztaxon.org/), type or representative 16S rRNA gene sequences were collected from Candidatus taxa and species with Downloaded from www.microbiologyresearch.org by 038075 G 2012 IUMS IP: 88.99.165.207 On: Fri, 16 Jun 2017 09:25:51 Printed in Great Britain EzTaxon-e: a new 16S rRNA sequence database non-validly published names from public domain databases, including PubMed and GenBank. Moreover, the 16S rRNA gene sequences that can be found on assembled contigs of genome and metagenome sequencing projects were extracted using the rRNASelector program (http://sw.ezbiocloud.net/rrnaselector) (Lee et al., 2011). In principle, only a single best sequence was selected for each species/phylotype according to the following priority rules: sequences from genome projects, long sequences (¢1300 bp) and sequences with few ambiguous bases. When multiple sequences were available for the same type strain, sequences were aligned and inspected with a secondary structure model and were subjected to phylogenetic analyses using the jPHYDIT program (Jeon et al., 2005). If there was no evidence for selecting only a single sequence, multiple sequences were included in the database. To expand the scope of the new database to the domain of uncultured prokaryotes, the ‘phylotype’ concept that has been used for many years was applied. Here, a phylotype is defined as a so far uncultured putative species that exists in nature and is represented by a single 16S rRNA gene sequence that serves as a centroid. The 16S rRNA gene sequences of these reference phylotypes were manually selected from a pool of 16S rRNA gene sequences available in GenBank and 97 % gene sequence similarity was considered as a species boundary (Stackebrandt & Goebel, 1994; Tindall et al., 2010). Similarities were determined using the algorithm of Myers & Miller (1988) as previously described (Chun et al., 2007). Sequences representing different phylotypes were checked for PCRdriven chimera formation using the algorithm summarized in Fig. 1. The identification results of two non-overlapping 500 bp-long fragments [(a) and (b) in Fig. 1] were manually compared for chimera formation. The two fragments were then subjected to independent identification processes using a chimera-free database that only contained sequences from cultured strains. If the two fragments showed different hits with .95 % sequence similarity and the two hits appeared to belong to different orders (taxonomic rank), the query sequence was regarded as a chimera. In the case of bacteria, sequences were trimmed when required, to include only positions between 8 and 1507 [Escherichia coli numbering system (Brosius et al., 1981)]. Taxonomic assignment of sequences. The collected 16S rRNA gene sequences were manually aligned on the basis of a known secondary structure model using the jPHYDIT program (Jeon et al., 2005). Numerous phylogenetic trees were generated for different combinations of subsets of 16S rRNA gene sequences corresponding to various taxonomic ranks. Phylogenetic analyses were implemented using the neighbour-joining method (Saitou & Nei, 1987), with the evolutionary model of Jukes & Cantor (1969), in the MEGA program (Tamura et al., 2007) and using the maximum-likelihood method (Felsenstein, 1981) in the RAxML program (Stamatakis, 2006). The topologies of the resulting trees were evaluated using bootstrap analysis (Felsenstein, 1985), based on 100–1000 replicates depending on the size of the dataset. In addition to the recognized taxonomic names corresponding to species within class ranks (the phylum category is not covered by the Rules of the Bacteriological Code), tentative names were given by adding a suffix to the GenBank accession numbers (see Table 1 for examples). Names without formal standing in nomenclature were always labelled using a suffix with an underscore. An attempt was made to conserve phylotype names that are already widely used in the fields of microbial ecology and clinical microbiology, such as SAR11 (Field et al., 1997) and TM7 (Marcy et al., 2007). Species that had previously been misclassified on the basis of 16S rRNA gene phylogenies were correctly placed within the new hierarchical system. For example, Clostridium cellulosi (He et al., 1991; GenBank accession no. L09177) was placed under the genus Fig. 1. The chimera-detection algorithm used in this study. All the results were also checked manually. http://ijs.sgmjournals.org Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 16 Jun 2017 09:25:51 717 O.-S. Kim and others Table 1. Examples of tentative names for phylotypes Typical suffixes are _s (for species), _g (genus), _f (family), _o (order), _c (class) and _p (phylum). GenBank accession no. Phylum Class Order Family Genus AF234118 CP000084 AF234118_p Proteobacteria AF234118_c Alphaproteobacteria AF234118_o SAR11 AF234118_f SAR11-1_f AF234118_g Pelagibacter ABBV01000356 L09177 TM7 Firmicutes TM7_c Clostridia TM7_o Clostridiales TM7_f Ruminococcaceae TM7_g Ethanoligenens Ethanoligenens in accordance with the phylogenetic analysis based on 16S rRNA gene sequences (Table 1). Ordination analysis of sequences included in the EzTaxon-e database. Complete sequence similarity matrices were generated from multiple sequence alignments (.1200 bp) and were converted to Jukes and Cantor distance matrices (Jukes & Cantor, 1969). Principal coordinate analyses were performed as previously described (Legendre, 1998). To reduce the computational cost of obtaining complete series of eigenvalues and eigenvectors, the power method algorithm was employed for approximating eigenvalues (Golub & van Loan, 1996). Three-dimensional ordination plots were generated from the resultant principal components using the R package (http:// www.r-project.org) with the rgl library. Functionalities of the website. An individual 16S rRNA gene sequence (query sequence) is identified using a combination of two sequence similarity search engines (BLAST and MEGABLAST) (Altschul et al., 1997), followed by rigorous pairwise global sequence alignment (Myers & Miller, 1988) as previously described (Chun et al., 2007). The ‘CompGen’ function was designed to provide a calculation of pairwise sequence similarity values for the query sequence against all sequences belonging to a user-defined genus. Species AF234118_s ‘Candidatus Pelagibacter ubique’ TM7_s Clostridium cellulosi ‘Completeness’ was designed as an objective measurement of the degree of completeness of a 16S rRNA gene sequence with respect to full-length sequences. A completeness value is defined as: Completeness (%)5 L6100 C where L is the length of a query sequence and C is the length of the most similar sequence that is regarded as complete. Here, we define ‘complete sequence’ as the sequences containing regions between PCR primers 27F and 1492R for Bacteria (Lane, 1991) and A25F and U1492R for Archaea (Dojka et al., 1998), respectively, as these primers were the most widely used. The most similar sequence in the database of complete sequences was identified by a BLASTN search. The completeness values of all sequences contained in the EzTaxon-e database were calculated prior to the release of the database. The overall workflow of the identification and completeness calculation is illustrated in Fig. 2. Operating system and programming languages. All databases and computer programs were generated using MySQL, JSP and JAVA under the Linux operating system. Availability. The EzTaxon-e server is accessible over the internet at http://eztaxon-e.ezbiocloud.net/. Fig. 2. Overall workflow of the identification and completeness calculation implemented in the EzTaxon-e database. 718 Downloaded from www.microbiologyresearch.org by International Journal of Systematic and Evolutionary Microbiology 62 IP: 88.99.165.207 On: Fri, 16 Jun 2017 09:25:51 EzTaxon-e: a new 16S rRNA sequence database RESULTS AND DISCUSSION In this study, the aim was to generate a new database with extensive manual curation for facilitating the accurate identification of new isolates in various settings such as clinical and environmental microbiological laboratories. Through manual curation with the aid of various in-house computer programs, as of 1 September 2011, the EzTaxon-e database contained 34 607 representative prokaryotic 16S rRNA gene sequences (Table 2). All sequences were manually checked for chimera formation and sequencing quality and were aligned with a known secondary structure model. All sequences included in the database were taxonomically assigned to generate a complete hierarchical system (from species to phylum) on the basis of the criteria of 16S rRNA gene sequence similarity and phylogenetic trees. Using underscore-based suffixes, over 24 000 tentative names were created to build a traceable nomenclature system, resulting in 87 phyla, 303 classes, 1027 orders, 3735 families, 12 457 genera and 34 244 species. We suggest that the term ‘completeness’ can be used as an objective measure that depicts the degree of completeness of sequencing with respect to full-length sequences. The positions for defining ‘full-length’ sequences were set by widely used forward and reverse primer sets [primers 27F and 1492R for Bacteria (Lane, 1991) and A25F and U1492R for Archaea (Dojka et al., 1998)], but not the actual complete sequence of the 16S rRNA molecule. The numbers of fulllength sequences serving as reference templates in this study were 18 102 for Bacteria (mean length of 1449 bp) and 550 for Archaea (mean length of 1408 bp). The mean completeness values of sequences held in the EzTaxon-e database were 95.6 % and 88.6 % for the Bacteria and Archaea, respectively. It is suggested that 16S rRNA gene sequences with completeness values of .95 % should be used for formal taxonomic descriptions, as incomplete/partial sequences will hamper various taxonomic analyses, including BLAST searches and phylogenetic tree analyses. The overall genetic relationships among the 16S rRNA gene sequences held in the database were explored using a threedimensional ordination method based on principal coordinate analysis (Fig. 3). In general, sequences belonging to major phyla of the Bacteria and Archaea formed clusters supporting our taxonomic assignment, which was largely dependent on phylogenetic analyses (Fig. 3a). Due to the development of cost-effective genome sequencing methods (MacLean et al., 2009), the availability of genome sequences of type strains is growing rapidly. At present, out of 9787 validly published prokaryotic names, the type strains of 1076 species have been examined using whole genome sequencing, indicating that only 11 % of species with validly published names have been examined at the genome level so far. The distribution of genome-based 16S rRNA gene sequences among all sequences (Fig. 3b) indicates that there is a need for further genome sequencing projects, focusing on representatives of different phyla, to elucidate prokaryotic genomic diversity (Wu et al., 2009). For the taxa for which cultured strains or enriched communities are unavailable, single cell-based genomics may be a solution (Kalisky & Quake, 2011), although the application of this technique at the high-throughput level is yet to be demonstrated. The distribution of sequences from cultured strains showed a strong bias that demonstrated the difficulties in cultivating bacterial strains from certain phyla (Fig. 3c). In addition to species with validly published names, representative sequences were selected that may serve as points of reference for as yet uncultured phylotypes. The cut-off similarity value that was employed for defining phylotypes was 97 %, which is the most widely accepted value for defining prokaryotic species (Tindall et al., 2010). Over 24 000 phylotypes were newly defined, uniquely named and taxonomically assigned within our complete hierarchical system. The tentative names proposed in this study are always labelled with an underscore followed by specific suffixes (e.g. X_s for phylotypes, X_g for tentative genus, X_f for tentative family and so on, where X is the GenBank accession number or an arbitrary name; see Table 1 for examples). This system could be useful for comparing 16S rRNA gene sequences obtained from metagenomic libraries in which a substantial portion of the sequences may not match recognized species with validly published names. The phylotype names proposed here will be maintained in the same manner as that used for handling a validly published prokaryotic name. For instance, if a phylotype (named ABC_s in EzTaxon-e) is later isolated and described as ‘Newgenus newname’, our database will record this change and related information will be traceable for future Table 2. General statistics of the EzTaxon-e database Type of sequence Species with a validly published name Species with non-validly published name Candidatus taxa Phylotypes Genome shotgun assembly No. of sequences (Bacteria/Archaea) 9787 (9412/375)* 212 (199/13) 222 (215/7) 24 386 (22 376/2010) 1076 (984/92) *398 sequences were from names that were ‘in press’ with the International Journal of Systematic and Evolutionary Microbiology (http://ijs.sgmjournals.org/). http://ijs.sgmjournals.org Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 16 Jun 2017 09:25:51 719 O.-S. Kim and others Fig. 3. Three-dimensional ordination diagrams showing distributions of (a) major Bacterial phyla, (b) genome-based 16S rRNA gene sequences and (c) sequences from cultured and uncultured strains. reference. This function should provide a valuable means of communication among microbiologists who have routinely come across taxonomically novel isolates. Finally, we would like to emphasize that the names of phylotypes and the hierarchical system presented here have no standing in formal prokaryotic nomenclature and are only intended to aid comparative analyses when handling mainly uncultured prokaryotes. The database and the analytical functions can be found at http://eztaxon-e.ezbiocloud.net/. ACKNOWLEDGEMENTS This study was supported by the Interdisciplinary Research Program of Seoul National University. Brosius, J., Dull, T. J., Sleeter, D. D. & Noller, H. F. (1981). Gene organization and primary structure of a ribosomal RNA operon from Escherichia coli. J Mol Biol 148, 107–127. Chun, J., Lee, J. H., Jung, Y., Kim, M., Kim, S., Kim, B. K. & Lim, Y. W. (2007). EzTaxon: a web-based tool for the identification of prokaryotes based on 16S ribosomal RNA gene sequences. Int J Syst Evol Microbiol 57, 2259–2261. Dojka, M. A., Hugenholtz, P., Haack, S. K. & Pace, N. R. (1998). Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64, 3869–3877. Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368–376. Felsenstein, J. (1985). Confidence limits on phylogenies with a molecular clock. Syst Zool 34, 152–161. REFERENCES Field, K. G., Gordon, D., Wright, T., Rappé, M., Urback, E., Vergin, K. & Giovannoni, S. J. (1997). Diversity and depth-specific distribution of Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new genera- Golub, G. H. & van Loan, C. F. (1996). Matrix Computations, 3rd edn. tion of protein database search programs. Nucleic Acids Res 25, 3389–3402. Baltimore, London: The Johns Hopkins University Press. 720 SAR11 cluster rRNA genes from marine planktonic bacteria. Appl Environ Microbiol 63, 63–70. Downloaded from www.microbiologyresearch.org by International Journal of Systematic and Evolutionary Microbiology 62 IP: 88.99.165.207 On: Fri, 16 Jun 2017 09:25:51 EzTaxon-e: a new 16S rRNA sequence database He, Y. L., Ding, Y. F. & Long, Y. Q. (1991). Two cellulolytic Clostridium Murray, R. G. & Stackebrandt, E. (1995). Taxonomic note: imple- species: Clostridium cellulosi sp. nov. and Clostridium cellulofermentans sp. nov. Int J Syst Bacteriol 41, 306–309. mentation of the provisional status Candidatus for incompletely described procaryotes. Int J Syst Bacteriol 45, 186–187. Jeon, Y. S., Chung, H., Park, S., Hur, I., Lee, J. H. & Chun, J. (2005). Myers, E. W. & Miller, W. (1988). Optimal alignments in linear space. jPHYDIT: a JAVA-based integrated environment for molecular phylogeny of ribosomal RNA sequences. Bioinformatics 21, 3171–3173. Rosselló-Mora, R. & Amann, R. (2001). The species concept for Jukes, T. H. & Cantor, C. R. (1969). Evolution of protein molecules. In prokaryotes. FEMS Microbiol Rev 25, 39–67. Mammalian Protein Metabolism, pp. 21–132. Edited by H. N. Munro. New York: Academic Press. Kalisky, T. & Quake, S. R. (2011). Single-cell genomics. Nat Methods Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406– 425. 8, 311–314. Stackebrandt, E. & Goebel, B. M. (1994). A place for DNA- Lane, D. J. (1991). 16S/23S rRNA sequencing. In Nucleic Acid DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44, 846–849. Techniques in Bacterial Systematics, pp. 115–175. Edited by E. Stackebrandt & M. Goodfellow. Chichester: Wiley. Lee, J. H., Yi, H. & Chun, J. (2011). rRNASelector: a computer program for selecting ribosomal RNA encoding sequences from metagenomic and metatranscriptomic shotgun libraries. J Microbiol 49, 689–691. Legendre, P. L. L. (1998). Numerical Ecology, 2nd edn. Amsterdam: Elsevier. MacLean, D., Jones, J. D. & Studholme, D. J. (2009). Application of ‘next-generation’ sequencing technologies to microbial genetics. Nat Rev Microbiol 7, 287–296. Marcy, Y., Ouverney, C., Bik, E. M., Lösekann, T., Ivanova, N., Martin, H. G., Szeto, E., Platt, D., Hugenholtz, P. & other authors (2007). Dissecting biological ‘‘dark matter’’ with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc Natl Acad Sci U S A 104, 11889–11894. http://ijs.sgmjournals.org Comput Appl Biosci 4, 11–17. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690. Tamura, K., Dudley, J., Nei, M. & Kumar, S. (2007). MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol 24, 1596–1599. Tindall, B. J., Rosselló-Móra, R., Busse, H. J., Ludwig, W. & Kämpfer, P. (2010). Notes on the characterization of prokaryote strains for taxonomic purposes. Int J Syst Evol Microbiol 60, 249– 266. Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N. N., Kunin, V., Goodwin, L., Wu, M. & other authors (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060. Downloaded from www.microbiologyresearch.org by IP: 88.99.165.207 On: Fri, 16 Jun 2017 09:25:51 721
© Copyright 2026 Paperzz