R656 Dispatch Genome sequences: Genome sequence of a model prokaryote Eugene V. Koonin Address: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA. Current Biology 1997, 7:R656–R659 http://biomednet.com/elecref/09609822007R0656 © Current Biology Ltd ISSN 0960-9822 The complete genome sequence of Escherichia coli is now out [1]. This was the first of the prokaryotic genomesequencing projects to be initiated [2,3], but not the first to be completed — in the meantime, complete genome sequences have become available for Haemophilus influenzae [4], Mycoplasma genitalium [5], Mycoplasma pneumoniae [6], Synechocystis sp [7] and Methanococcus jannaschii [8]. The completion of the E. coli genome sequence is still, however, an unprecedented and momentous event for every molecular biologist. And this is not just because, with 4,639,221 base pairs and 4288 reported genes, E. coli’s is the largest and most complex of the sequenced prokaryotic genomes, though the relatively large number of genes in the E. coli genome is a factor in the importance of its sequence. The unique importance of the E. coli genome, however, lies in the added value that comes from the last 50 years of research in molecular biology and biochemistry, for which E. coli has been the primary model system. This knowledge has been distilled in the comprehensive monograph “Escherichia coli and Salmonella”, and the difference in size and quality between the first edition of this monograph in 1987 [9], and the second in 1996 [10], is a testimony to the continued vigor of E. coli research. The wealth of molecular and genetic data that are available on E. coli, which are now supported by the complete sequences of all of its genes, makes this prokaryote an indispensable reference for future genome research. Exactly how complete and how useful is all this information on a single prokaryotic species? In order to evaluate this, we have to address two questions. First, how well is E. coli itself understood — that is, at the simplest level, how much is actually known about functions and regulation of its genes and their products? And second, how much of the information learnt from E. coli can be transferred to other species? This second issue depends on how many homologs of E. coli genes there are in other species’ genomes, and how well the homologs are conserved. According to the paper [1] reporting the completion of the E. coli genome sequence, about 40% of the E. coli genes have no known function. In the genome sequence submitted to GenBank, the proportion of open reading frames that do not have a specific gene name, which would be indicative of prior genetic or biochemical identification [11], is actually even greater — 2465 out of 4288, or 57%. These numbers alone show that an enormous amount of work remains to be done before our knowledge of the best studied model organism is complete even at the simplest level of knowing the function of each gene. How much help are the sequence and structure databases in predicting the functions of uncharacterized E. coli genes? A comprehensive analysis of the complete set of E. coli Figure 1 5000 4500 Number of E. coli proteins The complete Escherichia coli genome sequence is now known; it should greatly facilitate the analysis of other genomes, but a lot remains to be learnt about E. coli itself. About half the genes were previously uncharacterized, but expanding databases and improving analysis methods will help predict their functions. 4000 3500 3000 2500 2000 1500 1 2 1000 1 3 500 2 3 0 All gene products Uncharacterized gene products © 1997 Current Biology The availability of the complete E. coli genome sequence [1] allows the analysis of conserved sequences. The graph shows the number of homologs identified for all the E. coli gene products (left), and just those that are as yet uncharacterized (right). The blue bars indicate the number of E. coli gene products for which homologs have been identifed in: 1, any species; 2, distantly related bacteria (outside the Proteobacteria domain); 3, eukaryotes. The white extensions indicate the total number of E. coli gene products, in the case of the bars on the left, and the total number of uncharacterized gene products, in the case of the bars on the right. The cut-off for significant similarity was a score of 105, computed using the BLAST2 program [14], which approximately corresponds to the probability of a chance match of 10–3 when the complete nonredundant protein sequence database at the NCBI is searched with a sequence of an average size bacterial protein (300 amino acids). Dispatch Number of proteins Figure 2 6500 6000 5500 5000 4500 4000 3500 3000 2500 2000 1500 100 500 0 R657 themselves, render the uncharacterized genes even more amenable to functional interpretation. Indeed, the sequences of E. coli gene products of no known function show nearly the same level of evolutionary conservation as the entire gene set; most of them contain ‘ancient conserved regions’ shared with homologs from distantly related bacteria and/or eukaryotes (Figure 1). MG HI Ssp MJ Comparison species SC © 1997 Current Biology Homologs of E. coli proteins in other completely sequenced genomes. The blue bars indicate the number of gene products with significant similarity — defined as in the Figure 1 legend — to E. coli proteins identified in the genomes of: MG, Mycoplasma genitalium; HI, Haemophilus influenzae; Ssp, Synechocystis sp; MJ, Methanococcus jannaschii; SC, Saccharomyces cerevisiae (the white extensions indicate the total number of proteins predicted from the complete sequence of each respective genome). protein sequences remains to be performed. A pilot study with 60% of the gene products indicated that, with varying degrees of precision, a functional prediction could be made for about two-thirds of the uncharacterized genes [12]. Recent experience in comparative analysis of complete bacterial and archaeal genomes [13] suggests that current improvements in database search methods, combining pairwise similarity analysis with profile search [14,15], as well as the rapid growth of the databases In another attempt to evaluate the amount of information that can be extracted from E. coli protein sequences by computer methods — and that will require experimental corroboration — I searched the E. coli protein database for several widely conserved protein motifs that are considered diagnostic of specific functions [12,16]. It is striking that, for most of the conserved motifs, 50% or more of the proteins containing them have never been studied (Table 1). These observations demonstrate both the impressive level of our ignorance and the ability of relatively simple but sensitive computer methods rapidly to generate functional predictions for a large number of proteins. The detection of significant sequence similarities for uncharacterized gene products (Figure 1) and the identification of known conserved motifs (Table 1) together make the greatest contribution to the prediction of functions for E. coli genes. The repertoire of predictions is, however, by no means limited to such findings. There is a lot to be learnt even about those proteins that have been studied for a long time, and careful computer analysis sometimes uncovers distinctly non-trivial connections and suggests unexpected functions in seemingly well understood proteins. A good example is the recent demonstration that bacterial DNA ligases — including the E. coli lig gene product —contain the ‘BRCT’ domain found in the hereditary breast cancer gene product BRCA1 and in a variety of eukaryotic cell-cycle checkpoint proteins [17]. Table 1 Selected conserved motifs in E. coli proteins in their relation to functional characterization. Number of E. coli proteins Conserved motif* Total Function known, including biochemical activity Functional information available but biochemical activity unknown P-loop (Walker-type ATP-binding motif) 218 112 (51%) 14( 6%) S-adenosylmethionine-binding motif 35 10 (29%) 5 (14.3%) Acetyltransferase motif 23 6 (26%) 1 (4%) Dehalogenase-like hydrolase motif 22 4 (18%) 2 (9%) α–β hydrolase motif 18 3 (17%) – Pyrophosphate-binding (PP) motif 10 5 (50%) 2 (20%) *The motifs were detected with the MoST program [16] using the previously characterized conserved blocks [12] to initiate the search. Additional occurrences of the motifs were detected using the PSIBLAST program [15]. R658 Current Biology, Vol 7 No 10 Table 2 Conserved gene families represented in the H. influenzae, but not the E. coli, genome. Complete genomes* Gene product function HI MG/MP Ssp MJ SC Ser/Thr protein kinase HIN1394† MG109/ MP585 Sll1575 Slr1697 Slr0152 Slr0599 Sll0776 Slr1225 Sll005 MJ0444 MJ1073 MJ1130 At least 120 Ser/Thr protein kinase genes Virulence-associated protein (vapC)‡ HI0947 HI0322 Sll0690 MJ0914 MJ1121 Carboxymucono-lactone hydrolase subunit homolog HI1053 Slr1853 MJ0742 MJ1511 Homoserine N-acetyltransferase HI1263 Putative histidine/purine biosynthesis enzyme (snzA) HI1647 MJ0677 YMR096w YFL059w YNL333w HIN1648 MJ1661 YMR095c YFL060c YNL334c Glutamine amidotransferase possibly involved in histidine/purine biosynthesis (snzB) YNL277w *Abbreviations: HI, Haemophilus influenzae; MG, Mycoplasma genitalium; MP, Mycoplasma pneumoniae; Ssp, Synechocystis sp.; MJ, Methanococcus jannaschii; SC, Saccharomyces cerevisiae. The gene names are from the original reports on genome sequences [2–8]. †A gene not annotated in the original H. influenzae genome report [2] but detected in subsequent analysis and given a name [20]. ‡Also detected in a variety of other bacteria including pathogenic ones Shigella flexneri and Mycobacterium tuberculosis. Another is the finding that the DNA repair protein MutL, Hsp90 family molecular chaperones, signal-transducing histidine kinases, and DNA gyrases all contain a conserved ATP-binding domain [18,19]. The value of E. coli as a reference for the functional annotation of other genomes will clearly depend on the number of genes in other genomes of interest that have homologs among the E. coli genes. This is a moving target, as it depends on the sensitivity of the methods for sequence comparison and the criteria used to define significant sequence similarity. In order to get a census of the E. coli homologs for the available complete genomes, I compared the set of protein sequences from each of them to the E. coli protein collection, using a relatively conservative significance cut-off (see Figure 1 legend). The latter example illustrates another hallmark of the present stage in sequence analysis — that the expansion of protein families by sensitive sequence comparison methods leads, more and more often, to a protein of known three-dimensional structure (in the example above, DNA gyrase). This makes structural modeling possible, and allows much more confident, and detailed, functional predictions to be made for uncharacterized domains. In the accompanying correspondence [20], I present another example of such a finding for a highly conserved, but functionally uncharacterized, E. coli protein. All in all, surprising as this may seem after decades of intense research, it seems fair to say that more remains to be found out about E. coli gene functions than is already known. Now that the complete genome sequence is known, the combination of in-depth computer searches and experimental studies provides researchers with the means to acquire this much-needed knowledge. The results of this analysis, shown in Figure 2, are at the same time encouraging and sobering. E. coli does indeed seem to be an ideal reference for the characterization of relatively close bacterial genomes, such as H. influenzae, in which a good majority of the genes are represented by homologs in E. coli ([21] and Figure 2). However, the fraction of genes that have such homologs drops to about 50% in distantly related bacteria, such as the cyanobacterium Synechocystis sp., about 30% in archaea, and less than 20% in eukaryotes (Figure 2). These numbers are much greater than those in the original report [1], which Dispatch used a deliberately high cut-off, but they still indicate that functional annotation of distantly related genomes using the information on E. coli genes may not be straightforward. In order to get the best out of the E. coli genome, further development of sensitive comparison methods is critical. One of the unique benefits of complete genome analysis is our ability to show, within the limits of the available methods, which genes are missing in a given genome [22]. This is exemplified by the ‘differential display’ technique, in which the homologs of E. coli genes are subtracted from the genome of interest, leaving those genes that may define unique features of the given species, such as its pathogenicity. For the H. influenzae genome, this approach revealed 119 genes that are missing in E. coli, but for which either experimental data or informative sequence similarities or both are available [23]. Only a few of these genes, however, belong to highly conserved families represented in distantly related complete genomes, and, remarkably, only one apparently ancient metabolic pathway present in H. influenzae — the probable purine metabolism shunt catalyzed by the SnzA and SnzB proteins [24] — is missing in E. coli (Table 2). These observations put the use of genomes of model organisms, such as E. coli, in the proper perspective: they are instrumental in predicting functions of conserved, house-keeping genes and help focus our attention on those genes that define species uniqueness. I have only touched on the most straightforward aspect of the genome analysis — the phylogenetic conservation of protein sequences and its use in functional prediction. Regulatory interactions, which are much better understood in E. coli than in any other species, add a whole new dimension. The complete sequencing of the E. coli genome is certainly a brilliant climax to half a century of research on this bacterium. More importantly, however, it is the starting point for a new era, which hopefully will be marked by mutually enriching interactions between computational and experimental approaches. The old favorite of molecular biology will continue to challenge, fascinate and reward researchers well into the twenty-first century. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. Acknowledgements I am grateful to Frederick Blattner, Stephen Altschul, Peer Bork and Martijn Huynen for providing me with preprints of their papers. 20. 21. References 1. Blattner FR, Plunkett G III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew G, et al.: The complete genome sequence of Escherichia coli K-12. Science 1997, in press. 2. Daniels DL, Plunkett G 3d, Burland V, Blattner FR: Analysis of the Escherichia coli genome: DNA sequence of the region from 84.5 to 86.5 minutes. Science 1992, 257: 771-778. 3. Yura T, Mori H, Nagai H, Nagata T, Ishihama A, Fujita N, Isono K, Mizobuchi K, Nakata A: Systematic sequencing of the Escherichia coli genome: analysis of the 0-2.4 min region. Nucleic Acids Res 1992, 20: 3305-3308. 4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb J-F, Dougherty BA, Merrick JM, et al.: 22. 23. 24. R659 Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269:496-512. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, et al.: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270:397-403. Himmelreich R, Hilbert H, Plagens H, Pirkl E, Li B-C. Herrmann R: Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res 1996, 24:4420-4449. Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, et al.: Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 1996, 3:109-136. Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, et al.: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 1996, 273:1058-1072. Neidhardt FC, Ingraham JL, Lin ECC, Low KB, Magasanik B, Schaechter M, Umbarger HE (Eds): Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology. Washington: American Society for Microbiology; 1987. Neidhardt FC, Curtiss R III, Ingraham JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M, Schaechter M, Umbarger HE (Eds): Escherichia coli and Salmonella: Cellular and Molecular Biology (2nd edn). Washington: American Society for Microbiology; 1996. Berlyn MKB, Low KB, Rudd KE: Linkage map of Escherichia coli K12, Edition 9. In Escherichia coli and Salmonella: Cellular and Molecular Biology (2nd edn). Neidhardt FC, Curtiss R III, Ingraham JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M, Schaechter M, Umbarger HE (Eds): Washington: American Society for Microbiology; 1996:1715–1902. Koonin EV, Tatusov RL, Rudd KE: Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications. Proc Natl Acad Sci USA 1995, 92:11921-11925. Koonin EV, Mushegian AR, Galperin MY, Walker DR: Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol Microbiol 1997, in press. Altschul SF, Gish W: Local alignment statistics. Meth Enzymol 1996, 266:460-481. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, in press. Tatusov RL, Altschul SF, Koonin EV: Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci USA 1996, 91:12091-12095. Bork P, Hofmann K, Bucher P, Neuwald AF, Altschul SF, Koonin EV: A superfamily of conserved domains in DNA damage-responsive cell cycle checkpoint proteins. FASEB J 1997, 11:68-76. Bergerat A, de Massy B, Gadelle D, Varoutas PC, Nicolas A, Forterre P: An atypical topoisomerase II from Archaea with implications for meiotic recombination. Nature 1997, 386:414- 417. Mushegian AR, Bassett DE Jr, Boguski MS, Bork P, Koonin EV: Positionally cloned human disease genes: patterns of evolutionary conservation and functional motifs. Proc Natl Acad Sci USA 1997, 94:5831-5836. Koonin EV: A conserved ancient domain joins the growing superfamily of 3¢–5¢ exonucleases. Curr Biol 1997, 7:R604-R606. Tatusov RL, Mushegian AR, Bork P, Brown NP, Borodovsky M, Hayes WS, Rudd KE, Koonin EV: Metabolism and evolution of Haemophilus influenzae deduced from a whole genome comparison to Escherichia coli. Curr Biol 1996, 6:279-291. Koonin EV, Mushegian AR, Rudd KE: Sequencing and analysis of bacterial genomes. Curr Biol 1996, 6:404-416. Huynen MA, Diaz-Lazcock Y, Bork P: Differential genome display. Trends Genet 1997, in press. Galperin MY, Koonin EV: Sequence analysis of an exceptionally conserved operon suggests enzymes for a new link between histidine and purine biosynthesis. Mol Microbiol 1997, 24:443- 445.
© Copyright 2026 Paperzz