Genome sequence of a model prokaryote

R656
Dispatch
Genome sequences: Genome sequence of a model prokaryote
Eugene V. Koonin
Address: National Center for Biotechnology Information, National
Library of Medicine, National Institutes of Health, Bethesda, Maryland
20894, USA.
Current Biology 1997, 7:R656–R659
http://biomednet.com/elecref/09609822007R0656
© Current Biology Ltd ISSN 0960-9822
The complete genome sequence of Escherichia coli is now
out [1]. This was the first of the prokaryotic genomesequencing projects to be initiated [2,3], but not the first to
be completed — in the meantime, complete genome
sequences have become available for Haemophilus influenzae
[4], Mycoplasma genitalium [5], Mycoplasma pneumoniae [6],
Synechocystis sp [7] and Methanococcus jannaschii [8]. The completion of the E. coli genome sequence is still, however, an
unprecedented and momentous event for every molecular
biologist. And this is not just because, with 4,639,221 base
pairs and 4288 reported genes, E. coli’s is the largest and
most complex of the sequenced prokaryotic genomes,
though the relatively large number of genes in the E. coli
genome is a factor in the importance of its sequence.
The unique importance of the E. coli genome, however,
lies in the added value that comes from the last 50 years of
research in molecular biology and biochemistry, for which
E. coli has been the primary model system. This knowledge has been distilled in the comprehensive monograph
“Escherichia coli and Salmonella”, and the difference in size
and quality between the first edition of this monograph in
1987 [9], and the second in 1996 [10], is a testimony to the
continued vigor of E. coli research. The wealth of molecular and genetic data that are available on E. coli, which are
now supported by the complete sequences of all of its
genes, makes this prokaryote an indispensable reference
for future genome research.
Exactly how complete and how useful is all this information on a single prokaryotic species? In order to evaluate
this, we have to address two questions. First, how well is E.
coli itself understood — that is, at the simplest level, how
much is actually known about functions and regulation of
its genes and their products? And second, how much of the
information learnt from E. coli can be transferred to other
species? This second issue depends on how many
homologs of E. coli genes there are in other species’
genomes, and how well the homologs are conserved.
According to the paper [1] reporting the completion of the
E. coli genome sequence, about 40% of the E. coli genes
have no known function. In the genome sequence submitted to GenBank, the proportion of open reading frames
that do not have a specific gene name, which would be
indicative of prior genetic or biochemical identification
[11], is actually even greater — 2465 out of 4288, or 57%.
These numbers alone show that an enormous amount of
work remains to be done before our knowledge of the best
studied model organism is complete even at the simplest
level of knowing the function of each gene.
How much help are the sequence and structure databases
in predicting the functions of uncharacterized E. coli genes?
A comprehensive analysis of the complete set of E. coli
Figure 1
5000
4500
Number of E. coli proteins
The complete Escherichia coli genome sequence is now
known; it should greatly facilitate the analysis of other
genomes, but a lot remains to be learnt about E. coli
itself. About half the genes were previously
uncharacterized, but expanding databases and improving
analysis methods will help predict their functions.
4000
3500
3000
2500
2000
1500
1
2
1000
1
3
500
2
3
0
All gene products
Uncharacterized
gene products
© 1997 Current Biology
The availability of the complete E. coli genome sequence [1] allows the
analysis of conserved sequences. The graph shows the number of
homologs identified for all the E. coli gene products (left), and just those
that are as yet uncharacterized (right). The blue bars indicate the number
of E. coli gene products for which homologs have been identifed in: 1,
any species; 2, distantly related bacteria (outside the Proteobacteria
domain); 3, eukaryotes. The white extensions indicate the total number of
E. coli gene products, in the case of the bars on the left, and the total
number of uncharacterized gene products, in the case of the bars on the
right. The cut-off for significant similarity was a score of 105, computed
using the BLAST2 program [14], which approximately corresponds to
the probability of a chance match of 10–3 when the complete nonredundant protein sequence database at the NCBI is searched with a
sequence of an average size bacterial protein (300 amino acids).
Dispatch
Number of proteins
Figure 2
6500
6000
5500
5000
4500
4000
3500
3000
2500
2000
1500
100
500
0
R657
themselves, render the uncharacterized genes even more
amenable to functional interpretation. Indeed, the
sequences of E. coli gene products of no known function
show nearly the same level of evolutionary conservation as
the entire gene set; most of them contain ‘ancient conserved regions’ shared with homologs from distantly
related bacteria and/or eukaryotes (Figure 1).
MG
HI
Ssp
MJ
Comparison species
SC
© 1997 Current Biology
Homologs of E. coli proteins in other completely sequenced genomes.
The blue bars indicate the number of gene products with significant
similarity — defined as in the Figure 1 legend — to E. coli proteins
identified in the genomes of: MG, Mycoplasma genitalium; HI,
Haemophilus influenzae; Ssp, Synechocystis sp; MJ, Methanococcus
jannaschii; SC, Saccharomyces cerevisiae (the white extensions
indicate the total number of proteins predicted from the complete
sequence of each respective genome).
protein sequences remains to be performed. A pilot study
with 60% of the gene products indicated that, with varying
degrees of precision, a functional prediction could be
made for about two-thirds of the uncharacterized genes
[12]. Recent experience in comparative analysis of complete bacterial and archaeal genomes [13] suggests that
current improvements in database search methods,
combining pairwise similarity analysis with profile search
[14,15], as well as the rapid growth of the databases
In another attempt to evaluate the amount of information
that can be extracted from E. coli protein sequences by
computer methods — and that will require experimental
corroboration — I searched the E. coli protein database for
several widely conserved protein motifs that are considered
diagnostic of specific functions [12,16]. It is striking that,
for most of the conserved motifs, 50% or more of the proteins containing them have never been studied (Table 1).
These observations demonstrate both the impressive level
of our ignorance and the ability of relatively simple but
sensitive computer methods rapidly to generate functional
predictions for a large number of proteins.
The detection of significant sequence similarities for
uncharacterized gene products (Figure 1) and the identification of known conserved motifs (Table 1) together
make the greatest contribution to the prediction of functions for E. coli genes. The repertoire of predictions is,
however, by no means limited to such findings. There is a
lot to be learnt even about those proteins that have been
studied for a long time, and careful computer analysis
sometimes uncovers distinctly non-trivial connections and
suggests unexpected functions in seemingly well understood proteins. A good example is the recent demonstration that bacterial DNA ligases — including the E. coli lig
gene product —contain the ‘BRCT’ domain found in the
hereditary breast cancer gene product BRCA1 and in a
variety of eukaryotic cell-cycle checkpoint proteins [17].
Table 1
Selected conserved motifs in E. coli proteins in their relation to functional characterization.
Number of E. coli proteins
Conserved motif*
Total
Function known,
including biochemical
activity
Functional information
available but biochemical
activity unknown
P-loop (Walker-type ATP-binding motif)
218
112 (51%)
14( 6%)
S-adenosylmethionine-binding motif
35
10 (29%)
5 (14.3%)
Acetyltransferase motif
23
6 (26%)
1 (4%)
Dehalogenase-like hydrolase motif
22
4 (18%)
2 (9%)
α–β hydrolase motif
18
3 (17%)
–
Pyrophosphate-binding (PP) motif
10
5 (50%)
2 (20%)
*The motifs were detected with the MoST program [16] using the previously characterized conserved blocks [12] to initiate the search. Additional
occurrences of the motifs were detected using the PSIBLAST program [15].
R658
Current Biology, Vol 7 No 10
Table 2
Conserved gene families represented in the H. influenzae, but not the E. coli, genome.
Complete genomes*
Gene product function
HI
MG/MP
Ssp
MJ
SC
Ser/Thr protein kinase
HIN1394†
MG109/
MP585
Sll1575
Slr1697
Slr0152
Slr0599
Sll0776
Slr1225
Sll005
MJ0444
MJ1073
MJ1130
At least 120
Ser/Thr protein
kinase genes
Virulence-associated
protein (vapC)‡
HI0947
HI0322
Sll0690
MJ0914
MJ1121
Carboxymucono-lactone
hydrolase subunit
homolog
HI1053
Slr1853
MJ0742
MJ1511
Homoserine
N-acetyltransferase
HI1263
Putative histidine/purine
biosynthesis enzyme
(snzA)
HI1647
MJ0677
YMR096w
YFL059w
YNL333w
HIN1648
MJ1661
YMR095c
YFL060c
YNL334c
Glutamine
amidotransferase
possibly involved in
histidine/purine
biosynthesis (snzB)
YNL277w
*Abbreviations: HI, Haemophilus influenzae; MG, Mycoplasma
genitalium; MP, Mycoplasma pneumoniae; Ssp, Synechocystis sp.;
MJ, Methanococcus jannaschii; SC, Saccharomyces cerevisiae. The
gene names are from the original reports on genome sequences [2–8].
†A gene not annotated in the original H. influenzae genome report [2]
but detected in subsequent analysis and given a name [20]. ‡Also
detected in a variety of other bacteria including pathogenic ones
Shigella flexneri and Mycobacterium tuberculosis.
Another is the finding that the DNA repair protein MutL,
Hsp90 family molecular chaperones, signal-transducing
histidine kinases, and DNA gyrases all contain a conserved ATP-binding domain [18,19].
The value of E. coli as a reference for the functional
annotation of other genomes will clearly depend on the
number of genes in other genomes of interest that have
homologs among the E. coli genes. This is a moving target,
as it depends on the sensitivity of the methods for
sequence comparison and the criteria used to define significant sequence similarity. In order to get a census of the
E. coli homologs for the available complete genomes, I
compared the set of protein sequences from each of them
to the E. coli protein collection, using a relatively conservative significance cut-off (see Figure 1 legend).
The latter example illustrates another hallmark of the
present stage in sequence analysis — that the expansion
of protein families by sensitive sequence comparison
methods leads, more and more often, to a protein of
known three-dimensional structure (in the example
above, DNA gyrase). This makes structural modeling possible, and allows much more confident, and detailed, functional predictions to be made for uncharacterized domains.
In the accompanying correspondence [20], I present
another example of such a finding for a highly conserved,
but functionally uncharacterized, E. coli protein. All in all,
surprising as this may seem after decades of intense
research, it seems fair to say that more remains to be found
out about E. coli gene functions than is already known.
Now that the complete genome sequence is known, the
combination of in-depth computer searches and experimental studies provides researchers with the means to
acquire this much-needed knowledge.
The results of this analysis, shown in Figure 2, are at the
same time encouraging and sobering. E. coli does indeed
seem to be an ideal reference for the characterization of
relatively close bacterial genomes, such as H. influenzae, in
which a good majority of the genes are represented by
homologs in E. coli ([21] and Figure 2). However, the
fraction of genes that have such homologs drops to about
50% in distantly related bacteria, such as the cyanobacterium Synechocystis sp., about 30% in archaea, and less
than 20% in eukaryotes (Figure 2). These numbers are
much greater than those in the original report [1], which
Dispatch
used a deliberately high cut-off, but they still indicate that
functional annotation of distantly related genomes using
the information on E. coli genes may not be straightforward.
In order to get the best out of the E. coli genome, further
development of sensitive comparison methods is critical.
One of the unique benefits of complete genome analysis is
our ability to show, within the limits of the available
methods, which genes are missing in a given genome [22].
This is exemplified by the ‘differential display’ technique,
in which the homologs of E. coli genes are subtracted from
the genome of interest, leaving those genes that may define
unique features of the given species, such as its pathogenicity. For the H. influenzae genome, this approach revealed
119 genes that are missing in E. coli, but for which either
experimental data or informative sequence similarities or
both are available [23]. Only a few of these genes, however,
belong to highly conserved families represented in distantly
related complete genomes, and, remarkably, only one
apparently ancient metabolic pathway present in H. influenzae — the probable purine metabolism shunt catalyzed by
the SnzA and SnzB proteins [24] — is missing in E. coli
(Table 2). These observations put the use of genomes of
model organisms, such as E. coli, in the proper perspective:
they are instrumental in predicting functions of conserved,
house-keeping genes and help focus our attention on those
genes that define species uniqueness.
I have only touched on the most straightforward aspect of
the genome analysis — the phylogenetic conservation of
protein sequences and its use in functional prediction.
Regulatory interactions, which are much better understood in E. coli than in any other species, add a whole new
dimension. The complete sequencing of the E. coli
genome is certainly a brilliant climax to half a century of
research on this bacterium. More importantly, however, it
is the starting point for a new era, which hopefully will be
marked by mutually enriching interactions between computational and experimental approaches. The old favorite
of molecular biology will continue to challenge, fascinate
and reward researchers well into the twenty-first century.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
Acknowledgements
I am grateful to Frederick Blattner, Stephen Altschul, Peer Bork and Martijn
Huynen for providing me with preprints of their papers.
20.
21.
References
1. Blattner FR, Plunkett G III, Bloch CA, Perna NT, Burland V, Riley M,
Collado-Vides J, Glasner JD, Rode CK, Mayhew G, et al.: The
complete genome sequence of Escherichia coli K-12. Science
1997, in press.
2. Daniels DL, Plunkett G 3d, Burland V, Blattner FR: Analysis of the
Escherichia coli genome: DNA sequence of the region from 84.5
to 86.5 minutes. Science 1992, 257: 771-778.
3. Yura T, Mori H, Nagai H, Nagata T, Ishihama A, Fujita N, Isono K,
Mizobuchi K, Nakata A: Systematic sequencing of the Escherichia
coli genome: analysis of the 0-2.4 min region. Nucleic Acids Res
1992, 20: 3305-3308.
4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF,
Kerlavage AR, Bult CJ, Tomb J-F, Dougherty BA, Merrick JM, et al.:
22.
23.
24.
R659
Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd. Science 1995, 269:496-512.
Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA,
Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, et al.:
The minimal gene complement of Mycoplasma genitalium.
Science 1995, 270:397-403.
Himmelreich R, Hilbert H, Plagens H, Pirkl E, Li B-C. Herrmann R:
Complete sequence analysis of the genome of the bacterium
Mycoplasma pneumoniae. Nucleic Acids Res 1996, 24:4420-4449.
Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y,
Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, et al.: Sequence
analysis of the genome of the unicellular cyanobacterium
Synechocystis sp. strain PCC6803. II. Sequence determination of
the entire genome and assignment of potential protein-coding
regions. DNA Res 1996, 3:109-136.
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG,
Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, et al.: Complete
genome sequence of the methanogenic archaeon,
Methanococcus jannaschii. Science 1996, 273:1058-1072.
Neidhardt FC, Ingraham JL, Lin ECC, Low KB, Magasanik B,
Schaechter M, Umbarger HE (Eds): Escherichia coli and Salmonella
typhimurium: Cellular and Molecular Biology. Washington: American
Society for Microbiology; 1987.
Neidhardt FC, Curtiss R III, Ingraham JL, Lin ECC, Low KB, Magasanik
B, Reznikoff WS, Riley M, Schaechter M, Umbarger HE (Eds):
Escherichia coli and Salmonella: Cellular and Molecular Biology
(2nd edn). Washington: American Society for Microbiology; 1996.
Berlyn MKB, Low KB, Rudd KE: Linkage map of Escherichia coli K12, Edition 9. In Escherichia coli and Salmonella: Cellular and
Molecular Biology (2nd edn). Neidhardt FC, Curtiss R III, Ingraham
JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M,
Schaechter M, Umbarger HE (Eds): Washington: American Society
for Microbiology; 1996:1715–1902.
Koonin EV, Tatusov RL, Rudd KE: Sequence similarity analysis of
Escherichia coli proteins: functional and evolutionary implications.
Proc Natl Acad Sci USA 1995, 92:11921-11925.
Koonin EV, Mushegian AR, Galperin MY, Walker DR: Comparison of
archaeal and bacterial genomes: computer analysis of protein
sequences predicts novel functions and suggests a chimeric
origin for the archaea. Mol Microbiol 1997, in press.
Altschul SF, Gish W: Local alignment statistics. Meth Enzymol
1996, 266:460-481.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res 1997,
in press.
Tatusov RL, Altschul SF, Koonin EV: Detection of conserved
segments in proteins: iterative scanning of sequence databases
with alignment blocks. Proc Natl Acad Sci USA 1996,
91:12091-12095.
Bork P, Hofmann K, Bucher P, Neuwald AF, Altschul SF, Koonin EV: A
superfamily of conserved domains in DNA damage-responsive
cell cycle checkpoint proteins. FASEB J 1997, 11:68-76.
Bergerat A, de Massy B, Gadelle D, Varoutas PC, Nicolas A, Forterre
P: An atypical topoisomerase II from Archaea with implications
for meiotic recombination. Nature 1997, 386:414- 417.
Mushegian AR, Bassett DE Jr, Boguski MS, Bork P, Koonin EV:
Positionally cloned human disease genes: patterns of
evolutionary conservation and functional motifs. Proc Natl Acad
Sci USA 1997, 94:5831-5836.
Koonin EV: A conserved ancient domain joins the growing
superfamily of 3¢–5¢ exonucleases. Curr Biol 1997, 7:R604-R606.
Tatusov RL, Mushegian AR, Bork P, Brown NP, Borodovsky M, Hayes
WS, Rudd KE, Koonin EV: Metabolism and evolution of
Haemophilus influenzae deduced from a whole genome
comparison to Escherichia coli. Curr Biol 1996, 6:279-291.
Koonin EV, Mushegian AR, Rudd KE: Sequencing and analysis of
bacterial genomes. Curr Biol 1996, 6:404-416.
Huynen MA, Diaz-Lazcock Y, Bork P: Differential genome display.
Trends Genet 1997, in press.
Galperin MY, Koonin EV: Sequence analysis of an exceptionally
conserved operon suggests enzymes for a new link between
histidine and purine biosynthesis. Mol Microbiol 1997,
24:443- 445.