Introducing EzTaxon-e: a prokaryotic 16S rRNA gene sequence database with phylotypes that represent uncultured species

International Journal of Systematic and Evolutionary Microbiology (2012), 62, 716–721
DOI 10.1099/ijs.0.038075-0
Ok-Sun Kim,1†‡ Yong-Joon Cho,1† Kihyun Lee,1 Seok-Hwan Yoon,1
Mincheol Kim,1 Hyunsoo Na,1 Sang-Cheol Park,2 Yoon Seong Jeon,2
Jae-Hak Lee,2 Hana Yi,1 Sungho Won3 and Jongsik Chun1,2
1 School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
2 Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
3 Department of Statistics, Chung-Ang University, Seoul, Republic of Korea
Correspondence: Jongsik Chun ([email protected])
Despite recent advances in commercially optimized identification systems, bacterial identification
remains a challenging task in many routine microbiological laboratories, especially in situations
where taxonomically novel isolates are involved. The 16S rRNA gene has been used extensively
for this task when coupled with a well-curated database, such as EzTaxon, containing sequences
of type strains of prokaryotic species with validly published names. Although the EzTaxon
database has been widely used for routine identification of prokaryotic isolates, sequences from
uncultured prokaryotes have not been considered. Here, the next generation database, named
EzTaxon-e, is formally introduced. This new database covers not only species within the formal
nomenclatural system but also phylotypes that may represent species in nature. In addition to an
identification function based on Basic Local Alignment Search Tool (BLAST) searches and pairwise
global sequence alignments, a new objective method of assessing the degree of completeness in
sequencing is proposed. All sequences that are held in the EzTaxon-e database have been
subjected to phylogenetic analysis and this has resulted in a complete hierarchical classification
system. It is concluded that the EzTaxon-e database provides a useful taxonomic backbone for the
identification of cultured and uncultured prokaryotes and offers a valuable means of
communication among microbiologists who routinely encounter taxonomically novel isolates. The
database and its analytical functions can be found at http://eztaxon-e.ezbiocloud.net/.
INTRODUCTION
Despite recent advances in molecular biology and in the
development of commercially available phenotype-based
identification kits, identifying bacterial strains remains a
difficult task for many routine microbiological laboratories. Among the several thousand genes within a bacterial
genome, the 16S rRNA gene has served as the primary key
for phylogeny-based identification when compared against
well-curated 16S rRNA gene sequence databases (Rosselló-Mora & Amann, 2001; Tindall et al., 2010). In a previous
study, we introduced a manually curated database, called
EzTaxon (http://www.eztaxon.org/), which contains 16S
rRNA gene sequences of the type strains of species with
validly published names (Chun et al., 2007). This database
has been widely used for bacterial identification worldwide,
and more than 1 000 000 identification jobs have been
performed by microbiologists from more than 50 countries. Here, we formally introduce the next generation
database of EzTaxon, named EzTaxon-e. This database
covers not only prokaryotic species with validly published
names but also species with non-validly published names
and Candidatus taxa (Murray & Stackebrandt, 1995), as
well as representatives of yet uncultured phylotypes, using
new analytical functions. In addition, all species and phylotypes have been assigned to a complete hierarchical
system (from species to phylum) by using comprehensive
phylogenetic analyses based on 16S rRNA gene sequences.
METHODS
†These authors (Ok-Sun Kim and Yong-Joon Cho) contributed equally to this work.
‡Present address (Ok-Sun Kim): Division of Polar Biological Sciences, Korea Polar
Research Institute, Incheon 406-840, Republic of Korea.
Abbreviation: BLAST, basic local alignment search tool.
Collection of 16S rRNA gene sequences. In addition to the 16S
rRNA gene sequences that were previously maintained at the EzTaxon
database (http://www.eztaxon.org/), type or representative 16S rRNA
gene sequences were collected from Candidatus taxa and species with
non-validly published names from public domain databases,
including PubMed and GenBank. Moreover, the 16S rRNA gene
sequences that can be found on assembled contigs of genome
and metagenome sequencing projects were extracted using the
rRNASelector program (http://sw.ezbiocloud.net/rrnaselector) (Lee
et al., 2011). In principle, only a single best sequence was selected for
each species/phylotype according to the following priority rules:
sequences from genome projects, long sequences (≥1300 bp) and
sequences with few ambiguous bases. When multiple sequences
were available for the same type strain, sequences were aligned and
inspected with a secondary structure model and were subjected to
phylogenetic analyses using the jPHYDIT program (Jeon et al., 2005). If
there was no evidence for selecting only a single sequence, multiple
sequences were included in the database.
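The priority rules above can be illustrated with a short Python sketch. It is not the curation pipeline used for EzTaxon-e: it assumes that the three rules are applied in the order listed (genome origin first, then a length of at least 1300 bp, then the number of ambiguous bases), and the `Candidate` record and the function names are hypothetical.

```python
from dataclasses import dataclass

# IUPAC codes other than A, C, G, T/U are counted as ambiguous (assumption).
AMBIGUOUS = set("NRYKMSWBDHV")


@dataclass
class Candidate:
    accession: str      # hypothetical identifier, e.g. a GenBank accession
    sequence: str       # 16S rRNA gene sequence
    from_genome: bool   # True if extracted from a genome project


def ambiguous_count(seq):
    """Number of ambiguous base calls in a sequence."""
    return sum(1 for base in seq.upper() if base in AMBIGUOUS)


def priority_key(candidate, min_length=1300):
    """Sort key: smaller tuples mean higher priority."""
    return (
        not candidate.from_genome,               # genome-derived first
        len(candidate.sequence) < min_length,    # then long sequences
        ambiguous_count(candidate.sequence),     # then fewest ambiguous bases
    )


def select_best(candidates):
    """Return the single best candidate under the assumed rule order."""
    return min(candidates, key=priority_key)
```

Cases in which no single sequence can be preferred, which the text resolves by keeping multiple sequences, would still require manual inspection and are not handled by this sketch.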
To expand the scope of the new database to the domain of uncultured
prokaryotes, the ‘phylotype’ concept that has been used for many
years was applied. Here, a phylotype is defined as a so far uncultured
putative species that exists in nature and is represented by a single 16S
rRNA gene sequence that serves as a centroid. The 16S rRNA gene
sequences of these reference phylotypes were manually selected from
a pool of 16S rRNA gene sequences available in GenBank and 97 %
gene sequence similarity was considered as a species boundary
(Stackebrandt & Goebel, 1994; Tindall et al., 2010). Similarities were
determined using the algorithm of Myers & Miller (1988) as
previously described (Chun et al., 2007).
Sequences representing different phylotypes were checked for PCR-driven
chimera formation using the algorithm summarized in Fig. 1.
Two non-overlapping 500 bp-long fragments [(a) and (b) in Fig. 1] of
each sequence were subjected to independent identification processes
using a chimera-free database that only contained sequences from
cultured strains, and the identification results of the two fragments
were manually compared. If the two
fragments showed different hits with >95 % sequence similarity and
the two hits appeared to belong to different orders (taxonomic rank),
the query sequence was regarded as a chimera.
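The decision rule can be expressed as a small Python sketch. The text does not state where the two fragments are taken from, so taking them from the two ends of the sequence is an assumption here, and the `Hit` record and function names are hypothetical. The identification step itself (search against the chimera-free database) is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str          # best-matching species for one fragment
    order: str         # taxonomic order of that species
    similarity: float  # sequence similarity to the fragment, in percent


def split_fragments(seq, length=500):
    """Return the first and last `length` bases, or None if they would overlap.

    Taking the fragments from the two ends is an assumption of this sketch.
    """
    if len(seq) < 2 * length:
        return None
    return seq[:length], seq[-length:]


def is_chimera(hit_a, hit_b, min_similarity=95.0):
    """Apply the rule: different hits, both >95 % similar, in different orders."""
    return (
        hit_a.name != hit_b.name
        and hit_a.similarity > min_similarity
        and hit_b.similarity > min_similarity
        and hit_a.order != hit_b.order
    )
```

As in the study, a flag raised by such a rule would only be a starting point; all results were also checked manually.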
In the case of bacteria, sequences were trimmed when required, to
include only positions between 8 and 1507 [Escherichia coli numbering
system (Brosius et al., 1981)].
Taxonomic assignment of sequences. The collected 16S rRNA
gene sequences were manually aligned on the basis of a known
secondary structure model using the jPHYDIT program (Jeon et al.,
2005). Numerous phylogenetic trees were generated for different
combinations of subsets of 16S rRNA gene sequences corresponding
to various taxonomic ranks. Phylogenetic analyses were implemented
using the neighbour-joining method (Saitou & Nei, 1987), with the
evolutionary model of Jukes & Cantor (1969), in the MEGA program
(Tamura et al., 2007) and using the maximum-likelihood method
(Felsenstein, 1981) in the RAxML program (Stamatakis, 2006). The
topologies of the resulting trees were evaluated using bootstrap
analysis (Felsenstein, 1985), based on 100–1000 replicates depending
on the size of the dataset. In addition to the recognized taxonomic
names corresponding to species within class ranks (the phylum
category is not covered by the Rules of the Bacteriological Code),
tentative names were given by adding a suffix to the GenBank
accession numbers (see Table 1 for examples). Names without formal
standing in nomenclature were always labelled using a suffix with an
underscore. An attempt was made to conserve phylotype names that
are already widely used in the fields of microbial ecology and clinical
microbiology, such as SAR11 (Field et al., 1997) and TM7 (Marcy
et al., 2007). Species that had previously been misclassified on the
basis of 16S rRNA gene phylogenies were correctly placed within the
new hierarchical system. For example, Clostridium cellulosi (He et al.,
1991; GenBank accession no. L09177) was placed under the genus
Ethanoligenens in accordance with the phylogenetic analysis based on
16S rRNA gene sequences (Table 1).

Fig. 1. The chimera-detection algorithm used in this study. All the results were also checked manually.

Table 1. Examples of tentative names for phylotypes
Typical suffixes are _s (for species), _g (genus), _f (family), _o (order), _c (class) and _p (phylum).

GenBank accession no.  Phylum          Class                Order          Family           Genus
AF234118               AF234118_p      AF234118_c           AF234118_o     AF234118_f       AF234118_g
CP000084               Proteobacteria  Alphaproteobacteria  SAR11          SAR11-1_f        Pelagibacter
ABBV01000356           TM7             TM7_c                TM7_o          TM7_f            TM7_g
L09177                 Firmicutes      Clostridia           Clostridiales  Ruminococcaceae  Ethanoligenens
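The suffix convention for tentative names can be sketched as follows. This is a minimal illustration in Python, assuming that a tentative name is simply a base name (a GenBank accession number or an arbitrary name) followed by the rank suffix; the function names are hypothetical.

```python
# Rank suffixes as listed for Table 1.
RANK_SUFFIX = {
    "phylum": "_p",
    "class": "_c",
    "order": "_o",
    "family": "_f",
    "genus": "_g",
    "species": "_s",
}


def tentative_name(base, rank):
    """Build a tentative name such as 'AF234118_s' from a base name and a rank."""
    return base + RANK_SUFFIX[rank]


def tentative_lineage(accession):
    """Tentative names at every rank for a phylotype with no named relatives."""
    return {rank: tentative_name(accession, rank) for rank in RANK_SUFFIX}


def is_tentative(name):
    """True if a name carries one of the underscore-based suffixes."""
    return name.endswith(tuple(RANK_SUFFIX.values()))
```

Note that `is_tentative` only recognizes the suffix; conserved phylotype names that appear without a suffix in Table 1, such as SAR11, would not be detected by it.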
Ordination analysis of sequences included in the EzTaxon-e
database. Complete sequence similarity matrices were generated
from multiple sequence alignments (>1200 bp) and were converted
to Jukes and Cantor distance matrices (Jukes & Cantor, 1969).
Principal coordinate analyses were performed as previously described
(Legendre, 1998). To reduce the computational cost of obtaining
complete series of eigenvalues and eigenvectors, the power method
algorithm was employed for approximating eigenvalues (Golub & van
Loan, 1996). Three-dimensional ordination plots were generated
from the resultant principal components using the R package (http://
www.r-project.org) with the rgl library.
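The ordination steps can be sketched in pure Python. The sketch converts fractional similarities to Jukes and Cantor distances, double-centres the squared distances and extracts the leading axes by power iteration. The use of deflation to obtain successive axes, the starting vector and all function names are assumptions of this illustration rather than details given in the text, and plotting with R is not covered.

```python
import math


def jukes_cantor(similarity):
    """Convert a fractional similarity (0-1) to a Jukes-Cantor distance."""
    p = 1.0 - similarity
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def gower_center(dist):
    """Double-centre the matrix of -0.5 * d^2 (dist must be symmetric)."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in a]
    grand = sum(row) / n
    return [[a[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def power_iteration(m, iterations=1000, tol=1e-12):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    n = len(m)
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        w = [x / norm for x in w]
        # Rayleigh quotient of the normalized vector.
        new_value = sum(w[i] * sum(m[i][j] * w[j] for j in range(n))
                        for i in range(n))
        converged = abs(new_value - eigenvalue) < tol
        v, eigenvalue = w, new_value
        if converged:
            break
    return eigenvalue, v


def pcoa(dist, axes=3):
    """Principal coordinates for the first `axes` axes, one row per sequence."""
    b = gower_center(dist)
    n = len(b)
    coords = [[] for _ in range(n)]
    for _ in range(axes):
        value, vec = power_iteration(b)
        scale = math.sqrt(max(value, 0.0))  # negative eigenvalues give no axis
        for i in range(n):
            coords[i].append(vec[i] * scale)
        # Deflation: remove the component just found before the next axis.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords
```

Power iteration converges to the eigenvalue of largest magnitude, so for distance matrices that are far from Euclidean a large negative eigenvalue could be picked up first; the sketch does not guard against this.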
Functionalities of the website. An individual 16S rRNA gene
sequence (query sequence) is identified using a combination of two
sequence similarity search engines (BLAST and MEGABLAST) (Altschul
et al., 1997), followed by rigorous pairwise global sequence alignment
(Myers & Miller, 1988) as previously described (Chun et al., 2007).
The ‘CompGen’ function was designed to provide a calculation of
pairwise sequence similarity values for the query sequence against all
sequences belonging to a user-defined genus.
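The pairwise comparison step and the 'CompGen' calculation can be illustrated with the following Python sketch. It substitutes a plain Needleman-Wunsch global alignment with a linear gap penalty for the linear-space algorithm of Myers & Miller (1988), and it defines similarity as identical positions divided by alignment columns; the scoring values, the treatment of gaps and the function names are all assumptions, and the preceding BLAST search is not included.

```python
def global_similarity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity along one optimal global alignment of a and b.

    Each cell stores (score, matches, columns) for the best alignment of the
    corresponding prefixes, so only two rows are kept in memory.
    """
    prev = [(j * gap, 0, j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [(i * gap, 0, i)]
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diag, up, left = prev[j - 1], prev[j], curr[j - 1]
            options = [
                (diag[0] + (match if same else mismatch),
                 diag[1] + int(same), diag[2] + 1),
                (up[0] + gap, up[1], up[2] + 1),      # gap in b
                (left[0] + gap, left[1], left[2] + 1),  # gap in a
            ]
            # Highest score wins; ties are broken in favour of more matches.
            curr.append(max(options))
        prev = curr
    _, matches, columns = prev[-1]
    return 100.0 * matches / columns if columns else 100.0


def compgen(query, genus_sequences):
    """Similarity of the query to every sequence of one genus, best first.

    `genus_sequences` maps a species name to its 16S rRNA gene sequence.
    """
    results = [(name, global_similarity(query, seq))
               for name, seq in genus_sequences.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The quadratic running time of this sketch is acceptable for a few sequences of about 1500 bp, which is why a fast search is used first to narrow down the candidates before global alignment.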
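Table 1 (continued). Species-level names

GenBank accession no.  Species
AF234118               AF234118_s
CP000084               ‘Candidatus Pelagibacter ubique’
ABBV01000356           TM7_s
L09177                 Clostridium cellulosi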
‘Completeness’ was designed as an objective measurement of the
degree of completeness of a 16S rRNA gene sequence with respect to
full-length sequences. A completeness value is defined as:
$$\mathrm{Completeness}\ (\%) = \frac{L \times 100}{C}$$
where L is the length of a query sequence and C is the length of the
most similar sequence that is regarded as complete. Here, a
‘complete sequence’ is defined as one containing the region between PCR
primers 27F and 1492R for Bacteria (Lane, 1991) or between A25F and
U1492R for Archaea (Dojka et al., 1998), as these primers
are the most widely used. The most similar sequence in the database
of complete sequences was identified by a BLASTN search. The
completeness values of all sequences contained in the EzTaxon-e
database were calculated prior to the release of the database.
The overall workflow of the identification and completeness
calculation is illustrated in Fig. 2.
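As a worked illustration of the completeness value, the following Python sketch applies the definition directly. The function names are hypothetical, and the example lengths are chosen for illustration only (1449 bp is the mean length reported for the bacterial full-length reference sequences).

```python
def completeness(query_length, reference_length):
    """Completeness (%) = L x 100 / C.

    L is the length of the query sequence and C is the length of the most
    similar sequence that is regarded as complete.
    """
    if reference_length <= 0:
        raise ValueError("reference length must be positive")
    return 100.0 * query_length / reference_length


def exceeds_threshold(value, threshold=95.0):
    """True if a completeness value is above the suggested 95 % level."""
    return value > threshold


# A 1380 bp query compared against a 1449 bp complete reference.
example = completeness(1380, 1449)
```

Here `example` is approximately 95.2 %, so this hypothetical query would just exceed the 95 % level.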
Operating system and programming languages. All databases
and computer programs were generated using MySQL, JSP and JAVA
under the Linux operating system.
Availability. The EzTaxon-e server is accessible over the internet at
http://eztaxon-e.ezbiocloud.net/.
Fig. 2. Overall workflow of the identification and completeness calculation implemented in the EzTaxon-e database.
RESULTS AND DISCUSSION
In this study, the aim was to generate a new database with
extensive manual curation for facilitating the accurate
identification of new isolates in various settings such as
clinical and environmental microbiological laboratories.
Through manual curation with the aid of various in-house
computer programs, as of 1 September 2011, the EzTaxon-e
database contained 34 607 representative prokaryotic 16S
rRNA gene sequences (Table 2). All sequences were manually
checked for chimera formation and sequencing quality and
were aligned with a known secondary structure model.
All sequences included in the database were taxonomically
assigned to generate a complete hierarchical system (from
species to phylum) on the basis of the criteria of 16S rRNA
gene sequence similarity and phylogenetic trees. Using
underscore-based suffixes, over 24 000 tentative names were
created to build a traceable nomenclature system, resulting
in 87 phyla, 303 classes, 1027 orders, 3735 families, 12 457
genera and 34 244 species.
We suggest that the term ‘completeness’ can be used as an
objective measure that depicts the degree of completeness
of sequencing with respect to full-length sequences. Here,
‘full-length’ refers to the region delimited by the
widely used forward and reverse primer sets [primers 27F and
1492R for Bacteria (Lane, 1991) and A25F and U1492R for
Archaea (Dojka et al., 1998)], not to the actual complete
sequence of the 16S rRNA molecule. The numbers of full-length sequences serving as reference templates in this study
were 18 102 for Bacteria (mean length of 1449 bp) and 550
for Archaea (mean length of 1408 bp). The mean completeness values of sequences held in the EzTaxon-e database
were 95.6 % and 88.6 % for the Bacteria and Archaea, respectively. It is suggested that 16S rRNA gene sequences with
completeness values of >95 % should be used for formal
taxonomic descriptions, as incomplete/partial sequences will
hamper various taxonomic analyses, including BLAST searches
and phylogenetic tree analyses.
The overall genetic relationships among the 16S rRNA gene
sequences held in the database were explored using a three-dimensional ordination method based on principal coordinate analysis (Fig. 3). In general, sequences belonging to
major phyla of the Bacteria and Archaea formed clusters
supporting our taxonomic assignment, which was largely
dependent on phylogenetic analyses (Fig. 3a). Due to the
development of cost-effective genome sequencing methods
(MacLean et al., 2009), the availability of genome sequences
of type strains is growing rapidly. At present, out of 9787
validly published prokaryotic names, the type strains of 1076
species have been examined using whole genome sequencing,
indicating that only 11 % of species with validly published
names have been examined at the genome level so far. The
distribution of genome-based 16S rRNA gene sequences
among all sequences (Fig. 3b) indicates that there is a
need for further genome sequencing projects, focusing on
representatives of different phyla, to elucidate prokaryotic
genomic diversity (Wu et al., 2009). For the taxa for which
cultured strains or enriched communities are unavailable,
single cell-based genomics may be a solution (Kalisky &
Quake, 2011), although the application of this technique at
the high-throughput level is yet to be demonstrated. The
distribution of sequences from cultured strains showed a
strong bias that demonstrated the difficulties in cultivating
bacterial strains from certain phyla (Fig. 3c).
In addition to species with validly published names,
representative sequences were selected that may serve as
points of reference for as yet uncultured phylotypes. The
cut-off similarity value that was employed for defining
phylotypes was 97 %, which is the most widely accepted
value for defining prokaryotic species (Tindall et al., 2010).
Over 24 000 phylotypes were newly defined, uniquely named
and taxonomically assigned within our complete hierarchical system. The tentative names proposed in this study are
always labelled with an underscore followed by specific
suffixes (e.g. X_s for phylotypes, X_g for tentative genus, X_f
for tentative family and so on, where X is the GenBank
accession number or an arbitrary name; see Table 1 for
examples). This system could be useful for comparing 16S
rRNA gene sequences obtained from metagenomic libraries
in which a substantial portion of the sequences may not
match recognized species with validly published names. The
phylotype names proposed here will be maintained in the
same manner as that used for handling a validly published
prokaryotic name. For instance, if a phylotype (named
ABC_s in EzTaxon-e) is later isolated and described as
‘Newgenus newname’, our database will record this change
and related information will be traceable for future
reference. This function should provide a valuable means of
communication among microbiologists who have routinely
come across taxonomically novel isolates. Finally, we would
like to emphasize that the names of phylotypes and the
hierarchical system presented here have no standing in
formal prokaryotic nomenclature and are only intended to
aid comparative analyses when handling mainly uncultured
prokaryotes. The database and the analytical functions can
be found at http://eztaxon-e.ezbiocloud.net/.

Table 2. General statistics of the EzTaxon-e database

Type of sequence                          No. of sequences (Bacteria/Archaea)
Species with a validly published name     9787 (9412/375)*
Species with non-validly published name   212 (199/13)
Candidatus taxa                           222 (215/7)
Phylotypes                                24 386 (22 376/2010)
Genome shotgun assembly                   1076 (984/92)

*398 sequences were from names that were ‘in press’ with the International Journal of Systematic and
Evolutionary Microbiology (http://ijs.sgmjournals.org/).

Fig. 3. Three-dimensional ordination diagrams showing distributions of (a) major Bacterial phyla,
(b) genome-based 16S rRNA gene sequences and (c) sequences from cultured and uncultured strains.
ACKNOWLEDGEMENTS
This study was supported by the Interdisciplinary Research Program
of Seoul National University.
REFERENCES
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
Brosius, J., Dull, T. J., Sleeter, D. D. & Noller, H. F. (1981). Gene organization and primary structure of a ribosomal RNA operon from Escherichia coli. J Mol Biol 148, 107–127.
Chun, J., Lee, J. H., Jung, Y., Kim, M., Kim, S., Kim, B. K. & Lim, Y. W. (2007). EzTaxon: a web-based tool for the identification of prokaryotes based on 16S ribosomal RNA gene sequences. Int J Syst Evol Microbiol 57, 2259–2261.
Dojka, M. A., Hugenholtz, P., Haack, S. K. & Pace, N. R. (1998). Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol 64, 3869–3877.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368–376.
Felsenstein, J. (1985). Confidence limits on phylogenies with a molecular clock. Syst Zool 34, 152–161.
Field, K. G., Gordon, D., Wright, T., Rappé, M., Urback, E., Vergin, K. & Giovannoni, S. J. (1997). Diversity and depth-specific distribution of SAR11 cluster rRNA genes from marine planktonic bacteria. Appl Environ Microbiol 63, 63–70.
Golub, G. H. & van Loan, C. F. (1996). Matrix Computations, 3rd edn. Baltimore, London: The Johns Hopkins University Press.
He, Y. L., Ding, Y. F. & Long, Y. Q. (1991). Two cellulolytic Clostridium species: Clostridium cellulosi sp. nov. and Clostridium cellulofermentans sp. nov. Int J Syst Bacteriol 41, 306–309.
Jeon, Y. S., Chung, H., Park, S., Hur, I., Lee, J. H. & Chun, J. (2005). jPHYDIT: a JAVA-based integrated environment for molecular phylogeny of ribosomal RNA sequences. Bioinformatics 21, 3171–3173.
Jukes, T. H. & Cantor, C. R. (1969). Evolution of protein molecules. In Mammalian Protein Metabolism, pp. 21–132. Edited by H. N. Munro. New York: Academic Press.
Kalisky, T. & Quake, S. R. (2011). Single-cell genomics. Nat Methods 8, 311–314.
Lane, D. J. (1991). 16S/23S rRNA sequencing. In Nucleic Acid Techniques in Bacterial Systematics, pp. 115–175. Edited by E. Stackebrandt & M. Goodfellow. Chichester: Wiley.
Lee, J. H., Yi, H. & Chun, J. (2011). rRNASelector: a computer program for selecting ribosomal RNA encoding sequences from metagenomic and metatranscriptomic shotgun libraries. J Microbiol 49, 689–691.
Legendre, P. L. L. (1998). Numerical Ecology, 2nd edn. Amsterdam: Elsevier.
MacLean, D., Jones, J. D. & Studholme, D. J. (2009). Application of ‘next-generation’ sequencing technologies to microbial genetics. Nat Rev Microbiol 7, 287–296.
Marcy, Y., Ouverney, C., Bik, E. M., Lösekann, T., Ivanova, N., Martin, H. G., Szeto, E., Platt, D., Hugenholtz, P. & other authors (2007). Dissecting biological “dark matter” with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc Natl Acad Sci U S A 104, 11889–11894.
Murray, R. G. & Stackebrandt, E. (1995). Taxonomic note: implementation of the provisional status Candidatus for incompletely described procaryotes. Int J Syst Bacteriol 45, 186–187.
Myers, E. W. & Miller, W. (1988). Optimal alignments in linear space. Comput Appl Biosci 4, 11–17.
Rosselló-Mora, R. & Amann, R. (2001). The species concept for prokaryotes. FEMS Microbiol Rev 25, 39–67.
Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406–425.
Stackebrandt, E. & Goebel, B. M. (1994). A place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44, 846–849.
Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690.
Tamura, K., Dudley, J., Nei, M. & Kumar, S. (2007). MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol 24, 1596–1599.
Tindall, B. J., Rosselló-Móra, R., Busse, H. J., Ludwig, W. & Kämpfer, P. (2010). Notes on the characterization of prokaryote strains for taxonomic purposes. Int J Syst Evol Microbiol 60, 249–266.
Wu, D., Hugenholtz, P., Mavromatis, K., Pukall, R., Dalin, E., Ivanova, N. N., Kunin, V., Goodwin, L., Wu, M. & other authors (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060.