articles DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae John F. Heidelberg*, Jonathan A. Eisen*, William C. Nelson*, Rebecca A. Clayton, Michelle L. Gwinn*, Robert J. Dodson*, Daniel H. Haft*, Erin K. Hickey*, Jeremy D. Peterson*, Lowell Umayam*, Steven R. Gill*, Karen E. Nelson*, Timothy D. Read*, Herve Tettelin*, Delwood Richardson*, Maria D. Ermolaeva*, Jessica Vamathevan*, Steven Bass*, Haiying Qin*, Ioana Dragoi*, Patrick Sellers*, Lisa McDonald*, Teresa Utterback*, Robert D. Fleishmann*, William C. Nierman*, Owen White *, Steven L. Salzberg*, Hamilton O. Smith*², Rita R. Colwell³, John J. Mekalanos§, J. Craig Venter*² & Claire M. Fraser* * The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA ³ Center of Marine Biotechnology, University of Maryland Biotechnology Institute, 701 East Pratt Street, Baltimore, Maryland 21202, USA, and Department of Cell and Molecular Biology, University of Maryland, College Park, Maryland 20742, USA § Harvard Medical School, Department of Microbiology and Molecular Genetics, 200 Longwood Avenue, Boston, Massachusetts 02115, USA ............................................................................................................................................................................................................................................................................ Here we determine the complete genomic sequence of the Gram negative, g-Proteobacterium Vibrio cholerae El Tor N16961 to be 4,033,460 base pairs (bp). The genome consists of two circular chromosomes of 2,961,146 bp and 1,072,314 bp that together encode 3,885 open reading frames. The vast majority of recognizable genes for essential cell functions (such as DNA replication, transcription, translation and cell-wall biosynthesis) and pathogenicity (for example, toxins, surface antigens and adhesins) are located on the large chromosome. In contrast, the small chromosome contains a larger fraction (59%) of hypothetical genes compared with the large chromosome (42%), and also contains many more genes that appear to have origins other than the g-Proteobacteria. The small chromosome also carries a gene capture system (the integron island) and host `addiction' genes that are typically found on plasmids; thus, the small chromosome may have originally been a megaplasmid that was captured by an ancestral Vibrio species. The V. cholerae genomic sequence provides a starting point for understanding how a free-living, environmental organism emerged to become a signi®cant human bacterial pathogen. Vibrio cholerae is the aetiological agent of cholera, a severe diarrhoeal disease that occurs most frequently in epidemic form1. Cholera has been epidemic in southern Asia for at least 1,000 years, but also spread worldwide to cause seven pandemics since 1817 (ref. 1). When untreated, cholera is a disease of extraordinarily rapid onset and potentially high lethality. Although clinical management of cholera has advanced over the past 40 years, cholera remains a serious threat in developing countries where sanitation is poor, health care limited, and drinking water unsafe. Vibrio cholerae as a species includes both pathogenic and nonpathogenic strains that vary in their virulence gene content2. This bacterium contains a wide variety of strains and biotypes, receiving and transferring genes for toxins3, colonization factors4,5, antibiotic resistance6, capsular polysaccharides that provide resistance to chlorine7 and new surface antigens, such as the 0139 lipopolysaccharide and O antigen capsule8,9. The lateral or horizontal transfer of these virulence genes by phage3, pathogenicity islands10,11 and other accessory genetic elements12 provides insights into how bacterial pathogens emerge and evolve to become new strains. Vibrio species represent a signi®cant portion of the culturable heterotrophic bacteria of oceans, coastal waters and estuaries13,14. Environmental studies show that these bacteria strongly in¯uence nutrient cycling in the marine environment. Various species of this genus are also devastating pathogens for ®n®sh, shell®sh and mammals. There is still much to be learned about the aquatic ecology and natural history of V. cholerae including its autochthonous (native) existence in endemic locales during cholera-free, interepidemic periods, which environmental factors, such as climate13,15, aided its re-emergence in Latin America, and which environmental factors are associated with its habitat in cholera endemic regions. For example, V. cholerae, during interepidemic periods, is an inhabitant ² Present address: Celera Genomics, 45 West Gude Drive, Rockville, Maryland 20850, USA NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com of brackish and estuarine waters, and in these environments is associated with zooplankton and other aquatic ¯ora and fauna16. The organism also enters a ``viable but nonculturable''17 state under certain conditions. Roles for these environmental interactions and this dormant physiological state in the emergence and persistence of pathogenic V. cholerae have been proposed14,17. Here we report the determination and analysis of the Vibrio cholerae genome sequence. This analysis represents an important step toward the complete molecular description of how this freeliving environmental organism emerged to become a human pathogen by horizontal gene transfer. Genome analysis The genome of V. cholerae was sequenced by the whole genome random sequencing method18. The genome consists of two circular chromosomes19,20 of 2,961,146 (chromosome 1) and 1,072,314 Table 1 General features of the Vibrio cholerae genome Size (bp) Total number of sequences G+C percentage Total number of ORFs ORF size (bp) Percentage coding Number of rRNA operons (16S-23S-5S) Number of tRNA Number similar to known proteins Number similar to proteins of unknown function* Number of conserved hypothetical proteins² Number of hypothetical proteins³ Number of Rho-independent terminators Chromosome 1 Chromosome 2 2,961,151 36,797 47.7 2,770 952 88.6 8 94 1,614 (58%) 163 (6%) 478 (17%) 515 (19%) 599 1,072,914 14,367 46.9 1,115 918 86.3 0 4 465 (42%) 66 (6%) 165 (15%) 419 (38%) 193 ............................................................................................................................................................................. bp, base pairs. ORFs, open reading frames. * Proteins of unknown function, signi®cant sequence similarity (homology) to a named protein for which there is currently no known function. ² Conserved hypothetical protein, sequence similarity to a translation of another ORF, but there is currently no experimental evidence a protein is expressed. ³ Hypothetical protein, no signi®cant sequence similarity to another protein. © 2000 Macmillan Magazines Ltd 477 articles (chromosome 2) base pairs, with an average G+C content of 46.9% and 47.7%, respectively. There are a total of 3,885 predicted open reading frames (ORFs) and 792 predicted Rho-independent terminators; with 2,770 and 1,115 ORFs and 599 and 193 Rho-independent terminators on the individual chromosomes (Table 1, Figs 1 and 2). Most genes required for growth and viability are located on chromosome 1, although some genes found only on chromosome 2 are also thought to be essential for normal cell function (for example, dsdA, thrS and the genes encoding ribosomal proteins L20 and L35). Additionally, many intermediaries of metabolic pathways are encoded only on chromosome 2 (Fig. 3). The replicative origin in chromosome 1 was identi®ed by similarity to the Vibrio harveyi and Escherichia coli origins, co-localization of genes (dnaA, dnaN, recF and gyrA) often found near the origin in prokaryotic genomes, and GC nucleotide skew (G-C/ G+C) analysis21. Based on these, we designated base-pair 1 in an intergenic region that is located in the putative origin of replication. Only the GC skew analysis was useful in identifying a putative origin on chromosome 2. This genomic sequence of V. cholerae con®rmed the presence of a large integron island (a gene capture system) located on chromosome 2 (125.3 kbp)22,23. The V. cholerae integron island contains all copies of the V. cholerae repeat (VCR) sequence and 216 ORFs (Fig. 1). However, most of these ORFs have no homology to other sequences. Among the recognizable integron island genes are three that encode gene products that may be involved in drug resistance (chloramphenicol acetyltransferase, fosfomycin resistance protein and glutathione transferase), several DNA metabolism enzymes (MutT, transposase, and an integrase), potential virulence genes (haemagglutinin and lipoproteins) and three genes which encode gene products similar to the `host addiction' proteins (higA, higB and doc), which are used by plasmids to select for their maintenance by host cells. V. cholerae biology, notably its ability to inhabit diverse environments. These environments, in turn, may have selected the duplication and divergence of genes useful for specialized functions. Additionally, whereas El Tor strain N16961 carries only a single copy of the cholera toxin prophage, other V. cholerae strains carry several copies of this element24,25, and strains of the classical biotype have a second copy of the prophage that is localized on chromosome 2 (ref. 20). Thus, virulence genes are presumed to be subject to selective pressure, affecting copy number and chromosomal location. Several ORFs with apparently identical functions exist on both chromosomes which were probably acquired by lateral gene transfer. For example, glyA (encoding serine hydroxymethyl transferase) is found once on each chromosome but the phylogenetic analysis suggest the glyA copy on chromosome 1 branches with the a-Proteobacteria, whereas the copy on chromosome 2 branches with the g-Proteobacteria (see Supplementary Information). The chromosome 2 glyA is ¯anked by genes encoding transposases, suggesting that this gene was acquired through a transposition event. Figure 1 Linear representation of the V. cholerae chromosomes. The location of Q the predicted coding regions, colour-coded by biological role, RNA genes, tRNAs, other RNAs, Rho-independent terminators and Vibrio cholerae repeats (VCRs) are indicated. Arrows represent the direction of transcription for each predicted coding region. Numbers next to the tRNAs represent the number of tRNAs at a locus. Numbers next to GES represent the number of membrane-spanning domains predicted by the Goldman, Engleman and Steitz scale calculated by TopPred for that protein. Gene names are available at the TIGR web site (www.tigr.org) and as Supplementary Information. 1,000,000 100,000 900,000 Comparative genomics The two-chromosome structure of V. cholerae allows for comparisons, both between the two chromosomes of this organism and between either of the V. cholerae chromosomes and the chromosomes of other microbial species. There is pronounced asymmetry in the distribution of genes known to be essential for growth and virulence between the two chromosomes. Signi®cantly more genes encoding DNA replication and repair, transcription, translation, cell-wall biosynthesis and a variety of central catabolic and biosynthetic pathways are encoded by chromosome 1. Similarly, most genes known to be essential in bacterial pathogenicity (that is, those encoding the toxin co-regulated pilus, cholera toxin, lipopolysaccharide and the extracellular protein secretion machinery) are also located on chromosome 1. In contrast, chromosome 2 contains a larger fraction (59%) of hypothetical genes and genes of unknown function, compared with chromosome 1 (42%) (Fig. 4). This partitioning of hypothetical proteins on chromosome 2 is highly localized in the integron island (Fig. 2). Chromosome 2 also carries the 3-hydroxy-3-methylglutaryl CoA reductase, a gene apparently acquired from an archaea (Y. Boucher and W. F. Doolittle, personal communication). The majority of the V. cholerae genes were very similar to E. coli genes (1,454 ORFs), but 499 (12.8%) of the V. cholerae ORFs showed highest similarity to other V. cholerae genes, suggesting recent duplications (Figs 5 and 6). Most of the duplicated ORFs encode products involved in regulatory functions (59), chemotaxis (50), transport and binding (42), transposition (18), pathogenicity (13) or unknown functions encoded by conserved hypothetical (62) and hypothetical proteins (113). There are 105 duplications with at least one of each ORF on each chromosome indicating there have been recent crossovers between chromosomes. The extensive duplication of genes involved in scavenging behaviour (chemotaxis and solute transport) suggests the importance of these gene products in 478 1 200,000 800,000 300,000 700,000 400,000 600,000 2,900,000 2,800,000 2,700,000 2,600,000 2,500,000 500,000 1 100,000 200,000 300,000 400,000 500,000 2,400,000 600,000 2,300,000 700,000 2,200,000 800,000 2,100,000 900,000 2,000,000 1,000,000 1,900,000 1,100,000 1,800,000 1,200,000 1,700,000 1,300,000 1,600,000 1,500,000 1,400,000 Figure 2 Circular representation of the V. cholerae genome. The two chromosomes, large and small, are depicted. From the outside inward: the ®rst and second circles show predicted protein-coding regions on the plus and minus strand, by role, according to the colour code in Fig. 1 (unknown and hypothetical proteins are in black). The third circle shows recently duplicated genes on the same chromosome (black) and on different chromosomes (green). The fourth circle shows transposon-related (black), phage-related (blue), VCRs (pink) and pathogenesis genes (red). The ®fth circle shows regions with signi®cant x2 values for trinucleotide composition in a 2,000-bp window. The sixth circle shows percentage G+C in relation to mean G+C for the chromosome.The seventh and eighth circles are tRNAs and rRNAs, respectively. © 2000 Macmillan Magazines Ltd NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com articles Origin and function of the small chromosome of V. cholerae was originally a megaplasmid (Fig. 4). The megaplasmid presumably acquired genes from diverse bacterial species before its capture by the ancestral Vibrio. The relocation of several essential genes from chromosome 1 to the megaplasmid completed the stable capture of this smaller replicon. Apparently this capture of the megaplasmid occurred long enough ago that the trinucleotide composition and percentage G+C content between the two chromosomes has become similar (except for laterally moving elements such as the integron island, bacteriophage genomes, transposons, and so on). The two chromosome structure is found in other Vibrio species19 suggesting that the gene content of the megaplasmid continues to provide Vibrio with an evolutionary advantage, perhaps within the aquatic ecosystem where Vibrio species are frequently the dominant microorganisms14,16. It is unclear why chromosome 2 has not been integrated into chromosome 1. Perhaps chromosome 2 plays an important specialized function that provides the evolutionary selective pressure to Several lines of evidence suggest that chromosome 2 was originally a megaplasmid captured by an ancestral Vibrio species. The phylogenetic analysis of the ParA homologues located near the putative origin of replication of each chromosome shows chromosome 1 ParA tending to group with other chromosomal ParAs, and the ParA from chromosome 2 tending to group with plasmid, phage and megaplasmid ParAs (see Supplementary Information). In general, genes on chromosome 2, with an apparently identical functioning copy on chromosome 1, appear less similar to orthologues present in other g-Proteobacteria species (see Supplementary Information). Also, chromosome 1 contains all the ribosomal RNA operons and at least one copy of all the transfer RNAs (four tRNAs are found on chromosome 2, but there are duplicates on chromosome 1). In addition, chromosome 2 carries the integron region, an element often found on plasmids26. Finally, the bias in the functional gene content is more easily explained, if chromosome 2 uracil NupC NMN, pnuC uraA Family xanthine/ (2,1) uracil family ELECTRON TRANSPORT CHAIN H+ H+ cellobiose* fructose mannitol sucrose? glucose ATP H+ synthase H+ sbp sulphate CHITIN sugar-Pi LACTO SE FRUCTO SE Histidine Degradation Pathway sulphate, cysZ GLYCOLYSIS phosphate, nptA ? GALACTOSE GLYCEROL ** HISTIDINE fatty acids, fadL(2,1) MANNITOL MANNO SE urocanate MALATE arginine PENTOSE PATHWAY Na +/alanine (3) chorismate D-LACTATE ribose ttyrosine Na +/proline,putP PP formate, H 2+CO 2 mal E/F/G/ K galactoside cadaverine ASP ARAGINE >40 flagellar and mglA/B/C oxalate/formate ? METHIONINE L-ASPA RTATE Na +/citrate, citS Na +/dicarboxylate(1,1) formate ? (1,1) benzoate, benE C4- dicarboxylate fumarate Fatty Acid Biosynthes is and Degradation propionate O-succinyl- SERINE cis-aconitate CO 2 bcaa, brnQ potassium, trkA/H ARGININE bypass isocitrate succinate NH 4+ ? GLUTA MINE succinyl-CoA Mg2+, mgt,(2,1) PUTRESCINE 2-oxoglutarate L- cysteine dctP/Q/M, dcuA/B/C tyrosine, tyrP MCPs (23,20) ORNITHINE Glyoxylate homoserine glycine serine,sda C (2) tryptophan,mtr glyoxylate fumarate AzlC family V/W/Y/Z TCA Cycle L-iso-leucine THREONINE BCCT family CheA/B/D?/R/ oxaloacetic citrate acid malate arginine/ ornithine putrescine/ ornithine, potE motor genes ethanol L- leucine ASP ARTATE diaminopimelic acid FLAGELLUM cadaverine/ lysine? propionate L-lysine maltose Na +/glutamate,gltS melanin D-alanine L-alanine leucine valine 2-keto-isovalerate Acetyl-CoA acetyl-P (1,1) proton/peptide family L- tryptophan SERINE Pyruvate L-LACTATE ACETA TE proton/glutamate, gltP L- phenylalanine serine formiminoL-glutamate rbsA/B/C/D artI/M/P/Q PRP P PHOSPH ATE PEP L-glutamate oppA/B/C/D/F amino acids (3,1) RIBOSE OXIDATIV E L-TRYPTOP HAN L-lactate ? oligopeptides GLUCONATE NON - L-ALANINE imidazolepropanoate Pi ptsH PATHWAY GLUCONEOG ENESIS sulphate(2,2) peptides (3,1) ptsI DOUDO ROFF SUCROSE Pho4 family protein potA/B/C/D ENTNER GLUCOSE spermidine/ putrescine Pi IIA NAG trehalose? TREHALOSE N-ACETYLGLUCOSAMINE modA/B/C gluconate ? IIBC GLYCOGEN STARCH pstA/B/C/S molybdenum ompS + Pi e- phosphate (1,1) maltoporin ADP ATP cy sA/P/T /W PTS system Pi sugar family iron (II), feoA/B Na +/H +(4,2) PROLINE L-GLUTAMATE potassium, kup potassium ke fB (2?) ExbB(1,1) porin ? lipolysaccharide/ O-antigen, rfbH/I NadC family iron (III) G3P MsbA? cations AcrB/D/F (2,2) glyc G3P colicins thiamine? B12? btuB/C/D toxins drugs heme glpF glpT tolA/B/Q/R ugpA/B/C/E (4,2) (14,3) ccmA/B/C/D Figure 3 Overview of metabolism and transport in V. cholerae. Pathways for energy production and the metabolism of organic compounds, acids and aldehydes are shown. Transporters are grouped by substrate speci®city: cations (green), anions (red), carbohydrates (yellow), nucleosides, purines and pyrimidines (purple), amino acids/ peptides/amines (dark blue) and other (light blue). Question marks associated with transporters indicate a putative gene, uncertainty in substrate speci®city, or direction of transport. Permeases are represented as ovals; ABC transporters are shown as composite ®gures of ovals, diamonds and circles; porins are represented as three ovals; the largeconductance mechanosensitive channel is shown as a gated cylinder; other cylinders represent outer membrane transporters or receptors; and all other transporters are drawn as rectangles. Export or import of solutes is designated by the direction of the arrow through the transporter. If a precise substrate could not be determined for a transporter, no gene name was assigned and a more general common name re¯ecting the type of substrate being transported was used. Gene location on the two chromosomes, for both NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com E1-E2 family (2,1) vibriobactin receptor viuA hemin zinc vibriobactin hutB/C/D znuA/B/C fepB/C/D/G TonB (1,1) mscL ExbD(1,1) + MCP heme K ? hutA irgA TonB system receptor(1,2) chemotactic signals family (3) transporters and metabolic steps, is indicated by arrow colour: all genes located on the large chromosome (black); all genes located on the small chromosome (blue); all genes needed for the complete pathway on one chromosome, but a duplicate copy of one or more genes on the other chromosome (purple); required genes on both chromosomes (red); complete pathway on both chromosomes (green). (Complete pathways, except for glycerol, are found on the large chromosome.) Gene numbers on the two chromosomes are in parentheses and follow the colour scheme for gene location. Substrates underlined and capitalized can be used as energy sources. PRPP, phosphoribosyl-pyrophosphate; PEP, phosphoenolpyruvate; PTS, phosphoenolpyruvate-dependant phosphotransferase system; ATP, adenosine triphosphate; ADP, adenosine diphosphate; MCP, methylaccepting chaemotaxis protein; NAG, N-acetylglucosamine; G3P, glycerol-3-phosphate; glyc, glycerol; NMN, nicotinamide mononucleotide. Asterisk, because V. cholerae does not use cellobiose, we expect this PTS system to be involved in chitobiose transport. © 2000 Macmillan Magazines Ltd 479 articles suppress integration events when they do occur. For example, if under some environmental condition there is a difference in copy number between the chromosomes, then chromosome 2 may have accumulated genes that are better expressed at higher or lower copy number than genes on chromosome 1. A second possibility is that, in response to environmental cues, one chromosome may partition to daughter cells in the absence of the other chromosome (aberrant segregation). Such single-chromosome-containing cells would be replication-defective but still maintain metabolic activity (`drone' cells), and, therefore, be a potential source of ``viable, but nonculturable (VBNC)'' cells observed to occur in V. cholerae17. Such `drone' cells may also play a role in V. cholerae bio®lms7,27,28 by, for example, producing extracellular chitinase, protease and other degradative enzymes that enhance survival of cells in a bio®lm, retaining two chromosomes without directly competing with these cells for nutrients. Transport and energy metabolism Vibrio cholerae has a diverse natural habitat that includes association with zooplankton in a sessile stage, a planktonic state in the water column, and the capacity to act as a pathogen within the human gastrointestinal tract. It is, therefore, no surprise that this organism maintains a large repertoire of transport proteins with broad substrate speci®city and the corresponding catabolic pathways to enable it to respond ef®ciently to these different and constantly changing ecosystems (Fig. 4). Many of the sugar transporter systems and their corresponding catabolic pathway enzymes are localized on a single chromosome (that is, ribose and lactate transport and degradation enzymes are contained on chromosome 2, whereas the trehalose systems reside on chromosome 1). However, many of the other energy metabolism pathways are split (that is, chitin, glycolysis, and so on) between the chromosomes (Fig. 3). In aquatic environments, chitin often represents a source of both carbon and nitrogen. This energy source is important for V. cholerae as it is associated with zooplankton, which have a chitinous exoskeleton13,15,29. Vibrio cholerae degrades chitin by a pathway that is very similar to that of Vibrio furnissii30. Sequence analysis suggests a phosphenolpyruvate phosphotransferase system (PTS) Interchromosomal regulation Several of the regulatory pathways, both for regulation in response to environmental and pathogenic signals, are divided between the two chromosomes. These included pathways for starvation survival, `quorum sensing' and expression of the entertoxigenic haemolysin, HlyA. During periods of nutrient starvation, V. cholerae, and other Gram-negative bacteria, enter the stationary phase and, later, the viable but nonculturable (VBNC) state14,17. The alternative sigma factor j38 (rpoS) is required for survival of V. cholerae in the environment but not for pathogenicity31, and therefore probably plays an important role in the initiation of the VBNC state. There is one copy of rpoS, located on chromosome 1, near the oriC. The RpoS regulates expression of several other proteins, including catalase, cyclopropane-fatty-acyl-phospholipid synthase and HA/protease, which are found on both chromosomes31. Genes involved in `quorum sensing', or cell-density-dependent regulation, also exist on both chromosomes of V. cholerae. In bioluminescent Vibrio species (notably Vibrio ®scheri and V. harveyi), quorum sensing is used to control light production. Although this strain of V. cholerae lacks the genes for bioluminescence, it does have the genes required for the autoinducer-2 (AI-2) quorumsensing mechanism32 (luxOPQSU) but this pathway is split between the chromosomes with luxOSU on chromosome 1 and luxPQ on chromosome 2. Similarly, another transcriptional regulatory gene, 9 in Relative proportion of most similar oc A o qu open reading frames M M ccu ife x yc y s ob co B rad aeo ac pla ac io lic te sm illu du us Borium a g s s ran e ub s r C Tre reli tub nit tili hl p a e al s am on b rc iu C y e ur u m hl d m g lo am ia a do si y p p rf s H ae dia neuallid eri m Es tra m um op c c on hi he ho iae N ei Vlus rich mat ss ib in ia is Ri er ri flu c ck ia o e ol c n i e H tt me ho za C eli sia nin ler e am co p g a py ba row itid e M et Th Sy lob cte aze is ha n r e no rm ec act py kii ba M A o ho er lo ct e rch Ae tog cy jej ri er th ae ro a st un iu an o py m is i m o g ru a sp th co lob m riti . m e c Py rm cu us pe a ro oa s ja fulg rni co ut n id x Sa Ca Py cc otr nas us cc eno roc us op ch ha rh oc ho hic ii ro ab cu rik um m d s os yc itis a h es e by ii ce leg ssi re an vis s ia e 0.35 10 0.3 8 0.25 7 0.2 6 0.15 5 0.1 4 3 0.05 2 0 1 Fa tty Am in ac nuc o id le a an osi P cid d de ur bi p s pr ho , aine osy C osth B sp nd s, p nth en e io h n y e tra tic sy olip uc rim sis l i g nth id leo idi * nt ro e m t n er up sis e ide es m s o ta s , ed , a f bo Tr an i n c l sp E ary d cofa ism or ne m ar cto t b rg eta rie rs in y m bo rs* , di ng et lism a DN and bol * pr ism A ot m ei e T tab ns Pr ran oli ot sc sm ei n ript * sy io Re n n gu Pr the * la ote sis to in * ry fu fat C Ce nc e* el ll lu en tion la s v O r pr elo th o p er ce e* ca ss H teg es* yp ot orie he s tic al *1 0 De Open reading frames (%) for cellobiose transport, but as V. cholerae does not use cellobiose, it is more likely that this PTS is involved in transport of the structurally similar compound, chitobiose, analogous to the situation proposed for Bourrelia burgdorferi18. The three anions that are transported by ABC transport systems in V. cholerae are molybdenum, phosphate and sulphate. Molybdenum transport genes (modA/B/C) are all located on chromosome 2, and most of the sulphate transport genes are on the large molecule. However, copies of the genes for phosphate transporters are found in both chromosomes. The genes in these two phosphate transport operons are different from each other and do not represent a recent duplication; instead, this suggests that one may be an acquired operon. Figure 4 Percentage of total Vibrio cholerae open reading frames (ORFs) in biological roles compared with other g-Proteobacteria. These were V. cholerae, chromosome 1 (blue); V. cholerae, chromosome 2 (red); Escherichia coli (yellow); Haemophilus in¯uenzae (pale blue). Signi®cant partitioning (P , 0.01) of biological roles between V. cholerae chromosomes is indicated with an asterisk, as determined with a x2 analysis. 1, Hypothetical contains both conserved hypothetical proteins and hypothetical proteins, and is at 1/10 scale compared with other roles. 480 Figure 5 Comparison of the V. cholerae ORFs with those of other completely sequenced genomes. The sequence of all proteins from each completed genome were retrieved from NCBI, TIGR and the Caenorhabditis elegans (wormpep16) databases. All V. cholerae ORFs (large chromosome, blue; small chromosome, red) were searched against all other genomes with FASTA3. The number of V. cholerae ORFs with greatest similarity (E # 10-5) are shown in proportion to the total number of ORFs in that genome. There were no ORFs that were most similar to a Mycoplasma pneumoniae ORF. © 2000 Macmillan Magazines Ltd NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com articles ** ** V.cholerae VC0512 V.cholerae VCA1034 V.cholerae VCA0974 V.cholerae VCA0068 ** V.cholerae VC0825 * V.cholerae VC0282 V.cholerae VCA0906 V.cholerae VCA0979 V.cholerae VCA1056 V.cholerae VC1643 V.cholerae VC2161 V.cholerae VCA0923 ** V.cholerae VC0514 ** V.cholerae VC1868 V.cholerae VCA0773 V.cholerae VC1313 V.cholerae VC1859 V.cholerae VC1413 V.cholerae VCA0268 V.cholerae VCA0658 ** V.cholerae VC1405 V.cholerae VC1298 * V.cholerae VC1248 V.cholerae VCA0864 V.cholerae VCA0176 V.cholerae VCA0220 ** V.cholerae VC1289 V.cholerae A1069 ** V.cholerae VC2439 V.cholerae 1967 V.cholerae A0031 V.cholerae VC1898 V.cholerae VCA0663 V.cholerae VCA0988 V.cholerae VC0216 V.cholerae VC0449 * V.cholerae VCA0008 V.cholerae VC1406 V.cholerae VC1535 V.cholerae VC0840 B.subtilis gi2633766 Synechocystis sp. gi1001299 Synechocystis sp. gi1001300 * Synechocystis sp. gi1652276 * Synechocystis sp. gi1652103 * H.pylori gi2313716 ** H.pylori 99 gi4155097 ** C.jejuni Cj1190c C.jejuni Cj1110c A.fulgidus gi2649560 ** A.fulgidus gi2649548 B.subtilis gi2634254 B.subtilis gi2632630 B.subtilis gi2635607 B.subtilis gi2635608 B.subtilis gi2635609 ** ** ** B.subtilis gi2635610 B.subtilis gi2635882 E.coli gi1788195 E.coli gi2367378 ** E.coli gi1788194 * E.coli gi1787690 V.cholerae VCA1092 V.cholerae VC0098 E.coli gi1789453 H.pylori gi2313186 99 gi4154603 ** H.pylori C.jejuni Cj0144 C.jejuni Cj1564 ** C.jejuni Cj0262c ** C.jejuni Cj1506c H.pylori gi2313163 * ** H.pylori 99 gi4154575 H.pylori gi2313179 ** ** H.pylori 99 gi4154599 C.jejuni Cj0019c C.jejuni Cj0951c C.jejuni Cj0246c B.subtilis gi2633374 T.maritima TM0014 V.cholerae VC1403 V.cholerae VCA1088 T.pallidum gi3322777 T.pallidum gi3322939 T.pallidum gi3322938 ** ** B.burgdorferi gi2688522 T.pallidum gi3322296 B.burgdorferi gi2688521 * ** T.maritima TM0429 T.maritima TM0918 ** T.maritima TM0023 T.maritima TM1428 * T.maritima TM1143 T.maritima TM1146 P.abyssi PAB1308 P.horikoshii gi3256846 ** P.abyssi PAB1336 ** ** P.horikoshii gi3256896 ** P.abyssi PAB2066 P.horikoshii gi3258290 ** * P.abyssi PAB1026 P.horikoshii gi3256884 ** D.radiodurans DRA00354 D.radiodurans DRA0353 ** ** D.radiodurans DRA0352 V.cholerae VC1394 P.abyssi PAB1189 P.horikoshii gi3258414 ** B.burgdorferigi 2688621 M.tuberculosis gi1666149 V.cholerae VC0622 Figure 6 Phylogenetic tree of methyl-accepting chemotactic proteins (MCP) homologues in completed genomes. Homologues of MCP were identi®ed by FASTA3 searches of all available complete genomes. Amino-acid sequences of the proteins were aligned using CLUSTALW, and a neighbour-joining phylogenetic tree was generated from the alignment using the PAUP* program (using a PAM-based distance calculation). Hypervariable regions of the alignment and positions with gaps in many of the sequences were excluded from the analysis. Nodes with signi®cant bootstrap values are indicated: two asterisks, .70%; asterisk, 40±70%. NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com hlyU (ref. 33), is located on chromosome 1, while the gene it regulates, hlyA, is located on chromosome 2. DNA repair Vibrio cholerae has genes encoding several DNA-repair and DNAdamage-response pathways, including nucleotide-excision repair, mismatch-excision repair, base-excision repair, AP endonuclease, alkylation transfer, photoreactivation, DNA ligation, and all the major components of recombination and recombinational repair, including initiation, recombination and resolution34. In addition, homologues of many of the genes involved in the SOS response in E. coli are found. The presence of three photolyase homologues, more than have been found in other bacterial species, probably allows for the ability to photoreactivate the two major forms of ultraviolet-induced DNA damage (cyclobutane pyrimidine dimers and 6-4 photoproducts), and may also allow use of a range of wavelengths of light used for the energy required for photoreactivation. It is also of interest that many of the repair genes are on chromosome 2 (alkA, ada1, ada2, phr3, mutK, sbcCD, dcm, mutT3), indicating that this chromosome is probably required for full DNA repair capability. Pathogenicity Toxins. The genome sequence of V. cholerae El Tor N16961 revealed a single copy of the cholera toxin (CT) genes, ctxAB, located on chromosome 1 within the integrated genome of CTXf, a temperate ®lamentous phage3. The receptor for entry of CTXf into the cell is the toxin-coregulated pilus (TCP)3, and the TCP gene cluster (see below) also resides on chromosome 1. Like the structural genes for CT and TCP, the regulatory gene, toxR, which controls their expression in vivo35, is also located on chromosome 1. On the other side of CTXf prophage is a region encoding an RTX toxin (rtxA), and its activator (rtxC) and transporters (rtxBD)36. A third transporter gene has been identi®ed that is a paralogue of rtxB, and is transcribed in the same direction as rtxBD. Downstream of this gene are two genes encoding a sensor histidine kinase and response regulator. Trinucleotide composition analysis suggests that the RTX region was horizontally acquired along with the sensor histidine kinase/response regulator, suggesting these regulators effect expression of the closely linked RTX transcriptional units. Also present are genes encoding numerous potential toxins, including several haemolysins, proteases and lipases. These include hap, the haemagluttinin protease, a secreted metalloprotease that seems to attack proteins involved in maintaining the integrity of epithelial cell tight junctions37, and hlyA, encoding a secreted haemolysin that displays enterotoxic activity38. In contrast to CTX, RTX and all known intestinal colonization factors, the hap and hlyA genes virulence factors reside on chromosome 2. Vibrio cholerae has been reported to produce shiga-like toxins39; however, the sequence did not reveal genes encoding speci®c homologues of the A or B subunits of shiga toxin. Also not detected were genes encoding homologues of E. coli heat stable toxin (ST), which have been detected in other pathogenic strains of V. cholerae40. Colonization factors. The critical intestinal colonization factor of V. cholerae is the TCP, a type IV pilus5,41. The genome sequence con®rmed that the genes involved in TCP assembly (tcpABCDEFGHIJNQRST) reside on chromosome 1 (ref. 20) as part of a proposed `pathogenicity island' (also referred to as VPI) composed of recently acquired DNA that encodes not only TCP, but also other genes associated with the ToxR regulatory cascade, such as acfABCD, toxT, aldA and tagAB10,11. Trinucleotide composition analysis suggest that this 45.3-kb segment begins at a 20-bp site upstream of aldA, and encompasses a helicase-related protein and a transcriptional activator which both share homology with bacteriophage proteins. At the other end of the segment of atypical trinucleotide composition is a phage family integrase and the © 2000 Macmillan Magazines Ltd 481 articles other copy of the 20-bp site, which is presumably the target for integration of the island onto the chromosome10,11. It has been proposed that the TCP/ACF island corresponds to the genome of a ®lamentous phage that uses TCP pilin as a coat protein4. However, other than the three genes encoding phage-related proteins (that is, the helicase, transcriptional activator and integrase) we could ®nd no other genes on the island that encoded products with signi®cant homology to the conserved gene products of other ®lamentous phages or the structural proteins of non®lamentous phages. The maltose-sensitive haemagglutinin (MSHA) is unique to the El Tor biotype of V. cholerae. Initially characterized as a haemagglutinin, it was later found to be a type IV pilus42,43. The MSHA biogenesis (MshHIJKLMNEGF) and structural (MshBACD) proteins are all clustered on chromosome 1. There are no apparent integrases or transposases that might de®ne this region as a pathogenicity island or suggest an origin for it other than V. cholerae. In support of this conclusion, trinucleotide composition analysis shows that this region has similar composition to the rest of the chromosome, suggesting that if these genes were acquired it was very early in the Vibrio phylogenetic history. Recently, several investigators have reported that MSHA is not required for intestinal colonization, nor does it seem to appreciably affect the ef®ciency of colonization44±46, but instead plays a role in bio®lm formation27,28. Accordingly, this pilus may be important for the environmental ®tness of Vibrio species rather than for pathogenic potential. The pilA region of V. cholerae genome apparently encodes a third type IV pilus, although it has not been visualized47. This gene cluster includes a gene encoding a prepilin peptidase (PilD) that is required for the ef®cient processing of protein complexes with type IV prepilin-like signal sequences including TCP, MSHA and EPS47,48. The EPS system of V. cholerae encodes a type II secretion system involved in extracellular export of CT and other proteins. The EPS system is encoded by chromosome 1 but, like the MSHA genes, trinucleotide analysis suggests that the EPS genes of V. cholerae have not been recently acquired. In contrast, trinucleotide composition analysis suggests that the pilA gene cluster was acquired by horizontal transfer. Thus, analysis of the V. cholerae genome sequence provides some evidence that older gene clusters, like MSHA and EPS, have become dependent on newly acquired genes such as PilD. Conclusions The Vibrio cholerae genome sequence provides a new starting point for the study of this organism's environmental and pathobiological characteristics. It will be interesting to determine the gene expression patterns that are unique to its survival and replication during human infection35 as well as in the environment13,14,16. Additionally, the genomic sequence of V. cholerae should facilitate the study of this model multi-chromosomal prokaryotic organism. Comparative genomics between several species in the genus Vibrio will provide a better understanding of the origin of the new small chromosome and the role that it plays in Vibrio biology. The genome sequence may also provide important clues to understanding the metabolic and regulatory networks that link genes on the two chromosomes. Finally, V. cholerae clearly represents a promising genetic system for studying how several horizontally acquired loci located on separate chromosomes can still ef®ciently interact at the regulatory, cell biology and biochemical levels. M Methods Whole-genome random sequencing procedure. Vibrio cholerae N16961 was grown from a single isolated colony. Cloning, sequencing and assembly were as described for genomes sequenced by TIGR18. One small-insert plasmid library (2±3 kb) was generated by random mechanical shearing of genomic DNA. One large insert library was ligated into l-DASHII/EcoRI vector (Stratagene). In the initial sequence phase, approximately sevenfold sequence coverage was achieved with 49,633 sequences from plasmid clones. Sequences from both ends of 383 l-clones served as a genome scaffold, verifying the orientation, order and integrity of the contigs. The plasmid and l sequences were jointly assembled using TIGR Assembler. Sequence gaps were closed 482 by editing the end sequences and/or primer walking on plasmid clones. Physical gaps were closed by direct sequencing of genomic DNA, or combinatorial polymerase chain reaction (PCR) followed by sequencing the PCR product. The ®nal genome sequence is based on 51,164 sequences. ORF prediction and gene family identi®cation. An initial set of ORFs, likely to encode proteins, was identi®ed with GLIMMER49, and those shorter than 30 codons were eliminated. ORFs that overlapped were visually inspected, and in some cases removed. ORFs were searched against a non-redundant protein database18. Frameshifts and point mutations were detected and corrected where appropriate. Remaining frameshifts and point mutations are considered to be authentic and were annotated as `authentic frameshift' or `authentic point mutation'. ORFs were also analysed with two sets of hidden Markov models (HMMs) constructed for a number of conserved protein families (1,313 from Pfam v3.1 (ref. 50) and 476 from the TIGRFAM) by use of the HMMER package. TopPred was used to identify membrane-spanning domains in proteins. Paralogous gene families were constructed by searching the ORFs against themselves using BLASTX, identifying matches with E # 10-5 over 60% of the query search length, and subsequently clustering these matches into multigene families. Multiple alignments for these protein families were generated with the CLUSTALW program and the alignments scrutinized. Distribution of all 64 trinucleotides (3-mers) for each chromosome was determined, and the 3-mer distribution in 2,000-bp windows that overlapped by half their length (1,000 bp) across the genome was computed. For each window, we computed the x2 statistic on the difference between its 3-mer content and that of the whole chromosome. A large value of this statistic indicates that the 3-mer composition in this window is different from the rest of the chromosome. Probability values for this analysis are based on the assumption that the DNA composition is relatively uniform throughout the genome. Because this assumption may be incorrect, we prefer to interpret high x2 values merely as indicators of regions on the chromosome that appear unusual and demand further scrutiny. Homologues of the genes of interest were identi®ed using the BLASTP and FASTA3 search programs. All homologues were then aligned to each other using the CLUSTALW program with default settings. Phylogenetic trees were generated from the alignments using the neighbour-joining algorithm as implemented by the PAUP* program (with a PAM matrix based distance calculation). Regions of the alignment that were hypervariable or were of low con®dence were excluded from the phylogenetic analysis. All alignments are available upon request. Received 3 April; accepted 18 May 2000. 1. Wachsmuth, K., Olsvik, é., Evins, G. M. & Popovic, T. in Vibrio Cholerae And Cholera: Molecular To Global Perspective (eds Wachsmuth, I. K., Blake, P. A. & Olsvik, é.) 357±370 (ASM Press, Washington DC, 1994). 2. Faruque, S. M., Albert, M. J. & Mekalanos, J. J. Epidemiology, genetics, and ecology of toxigenic Vibrio cholerae. Microbiol. Mol. Biol. Rev. 62, 1301±1314 (1998). 3. Waldor, M. K. & Mekalanos, J. J. Lysogenic conversion by a ®lamentous phage encoding cholera toxin. Science 272, 1910±1914 (1996). 4. Karaolis, D. K., Somara, S., Maneval, D. R. Jr, Johnson, J. A. & Kaper, J. B. A bacteriophage encoding a pathogenicity island, a type-IV pilus and a phage receptor in cholera bacteria. Nature 399, 375±379 (1999). 5. Brown, R. C. & Taylor, R. K. Organization of tcp, acf, and toxT genes within a ToxT-dependent operon. Mol. Microbiol. 16, 425±439 (1995). 6. Hochhut, B. & Waldor, M. K. Site-speci®c integration of the conjugal Vibrio cholerae SXTelement into prfC. Mol. Microbiol. 32, 99±110 (1999). 7. Yildiz, F. H. & Schoolnik, G. K. Vibrio cholerae O1 El Tor: identi®cation of a gene cluster required for the rugose colony type, exopolysaccharide production, chlorine resistance, and bio®lm formation. Proc. Natl Acad. Sci. USA 96, 4028±4033 (1999). 8. Bik, E. M., Bunschoten, A. E., Gouw, R. D. & Mooi, F. R. Genesis of the novel epidemic Vibrio cholerae O139 strain: evidence for horizontal transfer of genes involved in polysaccharide synthesis. EMBO J 14, 209±216 (1995). 9. Waldor, M. K., Colwell, R. & Mekalanos, J. J. The Vibrio cholerae O139 serogroup antigen includes an O-antigen capsule and lipopolysaccharide virulence determinants. Proc. Natl Acad. Sci. USA 91, 11388±11392 (1994). 10. Kovach, M. E., Shaffer, M. D. & Peterson, K. M. A putative integrase gene de®nes the distal end of a large cluster of ToxR-regulated colonization genes in Vibrio cholerae. Microbiology 142, 2165±2174 (1996). 11. Karaolis, D. K. et al. A Vibrio cholerae pathogenicity island associated with epidemic and pandemic strains. Proc. Natl Acad. Sci. USA 95, 3134±3139 (1998). 12. Mekalanos, J. J., Rubin, E. J. & Waldor, M. K. Cholera: molecular basis for emergence and pathogenesis. FEMS Immunol. Med. Microbiol. 18, 241±248 (1997). 13. Colwell, R. R. Global climate and infectious disease: the cholera paradigm. Science 274, 2025±2031 (1996). 14. Colwell, R. R. & Spira, W. M. in Cholera (eds Barua, D. & Greenough, W. B. III) 107±127 (Plenum Medical, New York, 1992). 15. Lobitz, B. et al. Climate and infectious disease: use of remote sensing for detection of Vibrio cholerae by indirect measurement. Proc. Natl Acad. Sci. USA 97, 1438±1443 (2000). 16. Colwell, R. R. & Huq, A. Environmental reservoir of Vibrio cholerae. The causative agent of cholera. Ann. NY Acad. Sci. 740, 44±54 (1994). 17. Roszak, D. B. & Colwell, R. R. Survival strategies of bacteria in the natural environment. Microbiol. Rev. 51, 365±379 (1987). 18. Fraser, C. M. et al. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580±586 (1997). 19. Yamaichi, Y., Iida, T., Park, K. S., Yamamoto, K. & Honda, T. Physical and genetic map of the genome of Vibrio parahaemolyticus: presence of two chromosomes in Vibrio species. Mol. Microbiol. 31, 1513± 1521 (1999). 20. Trucksis, M., Michalski, J., Deng, Y. K. & Kaper, J. B. The Vibrio cholerae genome contains two unique circular chromosomes. Proc. Natl Acad. Sci. USA 95, 14464±14469 (1998). © 2000 Macmillan Magazines Ltd NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com articles 21. Lobry, J. R. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660±665 (1996). 22. Rowe-Magnus, D. A., Guerout, A. M. & Mazel, D. Super-integrons. Res. Microbiol. 150, 641±651 (1999). 23. Hall, R. M., Brookes, D. E. & Stokes, H. W. Site-speci®c insertion of genes into integrons: role of the 59-base element and determination of the recombination cross-over point. Mol. Microbiol. 5, 1941± 1959 (1991). 24. Davis, B. M., Kimsey, H. H., Chang, W. & Waldor, M. K. The Vibrio cholerae O139 Calcutta bacteriophage CTXphi is infectious and encodes a novel repressor. J. Bacteriol. 181, 6779±6787 (1999). 25. Mekalanos, J. J. Duplication and ampli®cation of toxin genes in Vibrio cholerae. Cell 35, 253±263 (1983). 26. Mazel, D., Dychinco, B., Webb, V. A. & Davies, J. A distinctive class of integron in the Vibrio cholerae genome. Science 280, 605±608 (1998). 27. Watnick, P. I. & Kolter, R. Steps in the development of a Vibrio cholerae El Tor bio®lm. Mol. Microbiol. 34, 586±595 (1999). 28. Watnick, P. I., Fullner, K. J. & Kolter, R. A role for the mannose-sensitive hemagglutinin in bio®lm formation by Vibrio cholerae El Tor. J. Bacteriol. 181, 3606±3609 (1999). 29. Huq, A. et al. Ecological relationships between Vibrio cholerae and planktonic crustacean copepods. Appl. Environ. Microbiol. 45, 275±283 (1983). 30. Bassler, B. L., Yu, C., Lee, Y. C. & Roseman, S. Chitin utilization by marine bacteria. Degradation and catabolism of chitin oligosaccharides by Vibrio furnissii. J. Biol. Chem. 266, 24276±24286 (1991). 31. Yildiz, F. H. & Schoolnik, G. K. Role of rpoS in stress survival and virulence of Vibrio cholerae. J. Bacteriol. 180, 773±784 (1998). 32. Bassler, B. L., Greenberg, E. P. & Stevens, A. M. Cross-species induction of luminescence in the quorum-sensing bacterium Vibrio harveyi. J. Bacteriol. 179, 4043±4045 (1997). 33. Williams, S. G., Attridge, S. R. & Manning, P. A. The transcriptional activator HlyU of Vibrio cholerae: nucleotide sequence and role in virulence gene expression. Mol. Microbiol. 9, 751±760 (1993). 34. Eisen, J. A. & Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes. Mutat. Res. 435, 171±213 (1999). 35. Lee, S. H., Hava, D. L., Waldor, M. K. & Camilli, A. Regulation and temporal expression patterns of Vibrio cholerae virulence genes during infection. Cell 99, 625±634 (1999). 36. Lin, W. et al. Identi®cation of a Vibrio cholerae RTX toxin gene cluster that is tightly linked to the cholera toxin prophage. Proc. Natl Acad. Sci. USA 96, 1071±1076 (1999). 37. Wu, Z., Milton, D., Nybom, P., Sjo, A. & Magnusson, K. E. Vibrio cholerae hemagglutinin/protease (HA/protease) causes morphological changes in cultured epithelial cells and perturbs their paracellular barrier function. Microb. Pathog. 21, 111±123 (1996). 38. Alm, R. A., Stroeher, U. H. & Manning, P. A. Extracellular proteins of Vibrio cholerae: nucleotide sequence of the structural gene (hlyA) for the haemolysin of the haemolytic El Tor strain 017 and characterization of the hlyA mutation in the non- haemolytic classical strain 569B. Mol. Microbiol. 2, 481±488 (1988). 39. O'Brien, A. D., Chen, M. E., Holmes, R. K., Kaper, J. & Levine, M. M. Environmental and human NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. isolates of Vibrio cholerae and Vibrio parahaemolyticus produce a Shigella dysenteriae 1 (Shiga)-like cytotoxin. Lancet 1, 77±78 (1984). Ogawa, A., Kato, J., Watanabe, H., Nair, B. G. & Takeda, T. Cloning and nucleotide sequence of a heatstable enterotoxin gene from Vibrio cholerae non-O1 isolated from a patient with traveler's diarrhea. Infect. Immun. 58, 3325±3329 (1990). Manning, P. A. The tcp gene cluster of Vibrio cholerae. Gene 192, 63±70 (1997). Jonson, G., Holmgren, J. & Svennerholm, A. M. Identi®cation of a mannose-binding pilus on Vibrio cholerae El Tor. Microb. Pathog. 11, 433±441 (1991). Jonson, G., Lebens, M. & Holmgren, J. Cloning and sequencing of Vibrio cholerae mannose-sensitive haemagglutinin pilin gene: localization of mshA within a cluster of type 4 pilin genes. Mol. Microbiol. 13, 109±118 (1994). Attridge, S. R., Manning, P. A., Holmgren, J. & Jonson, G. Relative signi®cance of mannose-sensitive hemagglutinin and toxin-coregulated pili in colonization of infant mice by Vibrio cholerae El Tor. Infect. Immun. 64, 3369±3373 (1996). Tacket, C. O. et al. Investigation of the roles of toxin-coregulated pili and mannose- sensitive hemagglutinin pili in the pathogenesis of Vibrio cholerae O139 infection. Infect. Immun. 66, 692±695 (1998). Thelin, K. H. & Taylor, R. K. Toxin-coregulated pilus, but not mannose-sensitive hemagglutinin, is required for colonization by Vibrio cholerae O1 El Tor biotype and O139 strains. Infect. Immun. 64, 2853±2856 (1996). Fullner, K. J. & Mekalanos, J. J. Genetic characterization of a new type IV-A pilus gene cluster found in both classical and El Tor biotypes of Vibrio cholerae. Infect. Immun. 67, 1393±1404 (1999). Marsh, J. W. & Taylor, R. K. Identi®cation of the Vibrio cholerae type 4 prepilin peptidase required for cholera toxin secretion and pilus formation. Mol. Microbiol. 29, 1481±1492 (1998). Salzberg, S. L., Delcher, A. L., Kasif, S. & White, O. Microbial gene identi®cation using interpolated Markov models. Nucleic Acids Res. 26, 544±548 (1998). Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and pro®le HMMs match the majority of proteins. Nucleic Acids Res. 27, 260±262 (1999). Supplementary information is available on Nature's World-Wide Web site (http:// www.nature.com) or as paper copy from the London editorial of®ce of Nature. Acknowledgements This work was supported by the National Institutes of Health, National Institute of Allergy and Infectious Disease. We thank M. Heaney, V. Sapiro, B. Lee, M. Holmes and B. Vincent for database and software support. Correspondence and requests for materials should be addressed to C.M.F. (e-mail: [email protected]). The annotated genome sequence and the gene family alignments are available at hhttp://www.tigr.org/tbd/mdbi. The sequences have been deposited in GenBank with accession number AE003852 (chromosome 1) and AE003853 (chromosome 2). © 2000 Macmillan Magazines Ltd 483 484 © 2000 Macmillan Magazines Ltd 91 1049 5 281 448 2 757 451 954 955 459 572 574 12 8 964 1063 863 772 12 575 5 965 679 468 6 576 470 773 680 300 106 14 2742 969 1067 Amino acid biosynthesis 774 472 681 16 474 1068 971 865 776 578 972 777 1069 10 9 581 778 202 2175 974 2539 481 1071 310 13 208 20 5s rRNA 16s rRNA 687 1073 781 872 976 23s rRNA 9 587 977 588 8 873 10 3 978 22 6 12 590 979 495 981 876 785 691 591 328 212 Repeat region Three tRNAs 1075 980 875 24 117 9 Membrane protein 1077 982 496 1078 983 786 593 12 119 498 14 877 692 213 25 4 2278 26 1700 5 5 984 1080 883 790 1081 506 695 5 985 507 10 884 791 121 338 986 219 885 795 510 12 340 31 2672 886 987 2192 798 609 1085 700 887 799 603 344 988 888 800 511 1086 889 801 512 13 347 128 605 513 989 890 349 222 35 6 2462 2379 1088 891 990 6 2290 1089 892 5 704 607 223 130 2381 1090 991 893 894 354 803 608 38 131 516 37 6 1091 992 705 610 11 1535 40 993 804 612 895 518 6 994 706 614 134 805 5 9 519 996 707 615 230 137 2765 2571 997 897 807 708 2208 231 139 998 898 521 365 808 617 10 44 2766 2472 2572 2684 2092 809 1098 620 5 5 900 1000 709 619 232 140 1001 1099 1002 1100 5 5 235 49 1004 903 372 237 142 6 144 238 374 10 1102 13 1008 813 716 624 524 904 714 51 2691 2480 2391 905 526 375 2775 711 529 11 1104 1010 9 814 717 10 376 906 527 1105 1011 625 379 53 2694 1012 5 815 8 380 2595 6 243 1109 13 908 817 12 1015 909 2213 720 1111 2599 2699 1018 911 820 912 723 10 530 388 10 629 59 12 2485 1019 1112 1458 7 1020 914 9 5 531 1113 2490 2216 1023 917 824 1293 5 64 1115 534 920 637 828 1025 921 535 397 156 638 254 922 1026 728 399 157 65 2702 2021 1027 923 12 1835 1740 2121 639 257 540 405 924 832 2608 6 542 258 1029 833 732 406 10 68 2026 2224 2407 7 2705 160 11 23Sf 2324 2223 2123 2024 1926 1566 10 2126 2609 925 834 6 734 644 543 6 735 409 10 2226 926 835 645 646 736 927 928 836 2710 413 264 72 5 2332 2412 647 1 838 930 5 649 417 267 75 167 1034 931 840 738 650 546 7 2616 2499 6 824 76 651 421 1036 935 741 550 5 170 2618 7 1753 171 11 1037 846 10 2620 552 847 1755 2136 939 6 1040 848 655 428 271 272 173 6 274 432 1041 747 656 11 940 554 431 175 2621 2504 2237 2137 1311 82 941 556 433 2717 2505 2418 276 943 6 436 557 1043 748 657 176 83 2719 11 2622 2718 1855 749 9 11 8 751 1045 849 85 558 439 178 2625 2422 2342 945 2626 2507 2423 752 1046 661 559 441 179 469 946 1047 662 560 442 8 2 561 280 1048 947 2 181 90 2726 2628 5 9 2510 2425 2344 2144 663 754 443 180 9 2725 12 1588 1949 2508 2627 279 88 1760 1859 654 1131 1667 2245 2424 2343 2143 753 12 1587 1492 1406 1666 837 745 566 471 934 385 1315 1033 565 194 289 470 100 1220 2041 836 6 933 653 744 6 2724 6 1947 2723 278 659 440 9 86 1759 1857 99 192 564 1314 1665 1585 2244 2142 2039 2722 277 2624 2242 11 1856 13 1129 1405 1491 1758 6 931 10 384 563 468 6 288 191 1031 1219 835 743 652 562 467 1945 2141 2721 658 944 1 84 2720 2623 1490 1664 1 98 287 1029 5 1128 1217 1313 2421 2038 1944 2506 2420 2341 2240 2037 1942 1663 1403 1216 1028 834 742 651 466 381 13 1583 1584 833 1127 2140 1312 4 286 560 930 741 832 97 190 465 379 650 8 1488 1757 95 285 1025 1026 9 6 1126 2419 11 1 1854 2238 12 10 1582 1024 831 739 559 464 649 1215 1660 12 94 377 378 1402 2340 275 2036 1941 1853 1756 2417 80 2716 929 1310 1486 1581 1852 2339 9 174 1309 1214 1401 13 5 1023 830 737 648 558 463 284 188 1123 1124 1125 829 1659 2035 1940 1851 2236 745 1039 938 9 654 79 172 1400 1308 1213 928 736 647 376 93 557 462 283 187 1122 1021 828 556 461 374 1580 1939 1658 2503 2416 2715 270 425 1038 937 11 744 78 2714 2619 2338 2235 2135 2234 2033 1485 1399 11 1754 1938 460 282 927 92 186 1121 1850 1579 1849 1937 827 734 926 1307 10 1656 423 424 936 653 269 2233 2502 2415 1936 2713 77 2501 2232 2133 8 1120 1212 1483 373 91 459 826 645 1018 925 1397 1848 1655 554 457 644 1306 1577 1482 2032 8 1 732 456 11 185 372 90 281 1118 1119 1935 2337 652 843 740 420 169 549 419 268 168 825 924 1847 1575 2712 10 11 6 2414 2617 2500 2335 2132 6 1211 1117 1846 455 731 1017 184 371 89 553 643 1481 1751 1394 1304 10 1016 923 552 730 2031 2231 2131 13 280 370 13 1934 1845 1750 1573 1210 1115 9 6 88 454 922 369 823 728 551 2334 1933 9 1653 1303 1209 2711 12 2498 737 648 416 74 2130 2413 2230 2333 1033 415 266 165 545 929 837 414 265 73 2614 2497 1749 1478 1392 1015 1114 921 727 9 183 86 87 453 642 366 278 11 1844 9 2615 2129 2030 1931 2229 641 1302 1748 8 365 452 1113 6 85 181 182 550 726 1207 1571 1476 1843 2227 2128 2613 164 1031 6 9 544 263 71 7 16Sf 410 2708 2610 261 262 70 2 2409 2330 RNasePRNA 2127 8 1746 1652 1391 1301 14 11 2 822 1014 920 640 1112 1206 1013 919 725 549 277 84 180 362 450 83 1570 5 6 1929 1475 1745 1568 1928 2028 11 2225 1927 408 1111 1205 1390 1300 8 639 4 276 1012 918 1838 1839 1840 1651 1567 1473 2329 161 821 638 361 179 82 1010 1011 1204 1388 6 724 917 547 1110 2027 2706 275 449 178 1009 8 81 1743 1837 1203 1298 1741 1836 640 641 67 2607 13 256 1028 830 730 10 2406 2323 7 2023 1925 916 637 360 274 1008 820 5 359 1386 1650 1470 1006 723 636 1565 1384 447 545 80 1108 177 356 357 271 79 1107 1202 1297 1106 159 12 1 539 255 66 402 538 22 2703 2606 2494 1 914 1005 11 78 355 634 2222 1834 1563 2022 2120 401 729 829 537 400 2493 819 722 544 76 1469 270 446 1201 1649 2405 2220 2119 10 2604 2404 2322 2118 2492 536 2603 2701 827 12 253 396 155 9 2403 2217 2019 1923 1833 1562 1467 1382 9 175 354 1105 12 75 1296 1004 1200 5 1738 1832 1647 1561 818 913 633 543 353 269 74 721 174 1104 1465 1295 1199 1922 1736 1103 1003 817 4 1002 542 445 73 632 268 720 350 173 6 541 444 267 2116 2491 9 2602 2402 2018 1921 1831 1735 1645 1560 1463 1379 719 631 1001 1102 9 911 815 2115 825 727 635 532 395 250 154 2601 1114 634 152 726 915 823 725 63 249 391 151 2600 2489 2401 2113 2016 2017 1732 1644 1377 1101 349 540 5 72 172 6 266 443 1198 718 1000 814 630 539 5 442 265 348 1828 1920 2320 2700 2111 2215 2400 1827 1460 1376 71 171 441 70 910 999 1291 9 1558 2015 632 633 248 822 724 631 5 6 2488 10 2110 2014 1730 538 717 813 1197 1099 1643 1290 1196 909 715 439 440 69 264 12 346 170 537 629 1557 1826 150 913 630 247 61 5 438 1375 1098 2487 2399 2319 2214 10 1918 10 1457 2013 1641 1555 1374 812 68 263 714 908 1 1195 437 345 262 1289 1194 5 11 998 907 2398 2109 1825 1554 387 246 910 1017 385 148 58 2698 2598 2397 2316 1110 818 719 628 245 57 529 244 2697 1727 1916 5 1097 906 67 535 628 168 811 627 23Se 1824 2012 2108 2484 2212 1915 10 1726 1639 1454 260 344 1553 905 810 1288 13 1552 1373 1190 1096 1638 2396 2596 1 626 434 66 167 534 343 259 904 6 2107 2011 1822 1551 56 382 1108 1013 907 718 627 528 242 147 2695 2696 55 903 997 2395 2312 2211 533 809 9 65 433 342 12 1095 2106 1725 2009 2483 2593 2311 241 2394 2482 240 1821 64 166 258 624 16Se 1287 1550 5 1372 6 1636 1912 2103 2310 146 2693 5 1094 1188 63 532 902 6 257 165 807 531 432 8 256 1723 10 2008 2210 2590 239 17 52 2481 2309 8 2100 2101 6 1911 1820 1722 1549 1286 1371 62 14 623 995 1093 5 10 901 806 530 431 341 254 1635 1186 1092 6 1548 1910 2692 2308 2099 61 164 1285 1370 1721 1819 2007 1909 2307 2774 5 2584 1005 1006 812 712 623 1101 711 371 50 2773 1633 804 994 1634 1284 1185 1547 1369 2097 2479 2690 2305 2096 2006 803 430 340 528 252 622 19 900 1091 710 993 1720 1546 1907 1817 2390 523 2772 2689 2581 2476 2303 2209 2095 2004 1632 1282 1719 1544 1906 1718 10 1451 11 9 1184 801 709 621 60 339 526 7 2 802 429 13 251 524 525 338 162 899 1089 992 1183 800 708 523 1367 369 370 902 811 710 621 522 141 48 5 368 2770 233 2579 2475 2302 2688 2577 901 47 367 6 2301 2389 1815 2003 1905 1717 1631 9 23Sa 428 898 1543 1088 1280 2094 10 2474 2576 2300 2685 11 2002 1904 799 897 1366 620 522 337 250 161 1 426 706 1182 991 1814 1630 1542 1716 2093 5 10 6 896 705 6 5Sb 425 1 521 336 249 1365 2767 2768 366 999 618 2388 45 2001 1811 1715 1541 798 619 424 1087 1181 990 7 519 1279 1364 12 2298 2299 1903 1540 5 1629 1086 1363 2000 12 2387 2297 2470 363 616 1095 806 520 229 9 43 896 361 136 42 2764 2683 2569 2469 2386 2205 2206 2207 228 9 360 1093 1094 358 227 41 2385 2295 2089 1180 16Sa 23Sb 704 618 13 894 247 335 796 423 518 988 617 5 246 1277 1278 1808 1998 1902 1628 1362 1714 1807 795 1 58 334 12 57 703 893 1085 987 616 422 333 245 56 16Sb 1276 1539 11 8 1179 1997 1627 1901 2568 135 2467 8 2203 11 2681 2762 1361 1806 12 2088 1995 1805 2567 2466 2384 2566 8 1450 8 794 892 702 517 332 421 1084 1178 55 244 986 701 615 1 793 1275 13 54 159 1536 1537 1538 2294 2202 1092 225 133 6 8 2565 2465 2383 2293 2087 1994 1083 890 331 243 1177 1360 1274 1713 1900 1803 1712 5 1359 1273 1625 356 517 132 355 39 2761 12 1081 985 9 700 52 158 516 614 330 51 792 420 242 889 1174 1448 7 2201 2464 2382 6 5 1899 2292 2564 2678 2760 2563 2463 2198 2086 1993 1800 1711 1534 1358 1272 1173 1079 2291 1533 1447 1624 791 698 613 157 50 515 984 49 241 888 983 329 48 1271 1172 981 1270 2380 8 790 697 1077 1898 1799 2759 351 514 802 703 8 606 350 36 2677 5 1532 1269 156 514 419 886 612 240 47 1354 1076 1171 2197 2562 129 8 2461 696 788 980 2085 2289 2378 2676 2758 702 2675 2561 2460 2377 2196 2084 1992 1623 1446 418 884 1075 1710 1798 1991 1896 2287 2195 34 1622 1529 1268 1170 8 787 695 5 513 46 154 417 239 979 416 611 1353 1797 2083 8 221 346 604 127 2757 345 11 2674 2560 2459 2286 2194 2082 1990 1621 45 153 1074 881 786 1169 977 1445 1267 1527 11 1894 2458 33 8 220 124 2756 2673 2559 2193 2081 1989 1893 1794 1709 1526 5 1266 1168 880 5 152 238 415 9 44 694 610 512 328 237 151 12 1073 976 1350 1444 2285 693 43 784 236 975 1525 1987 2755 602 1167 1892 2456 32 11 876 2376 2558 341 123 2671 2557 11 1084 601 699 2670 16Sh 2454 12 1986 7 11 150 692 974 1265 1524 1791 2283 2080 2191 1985 1890 1523 1442 12 973 783 1708 1166 1349 1264 1071 1706 8 875 608 414 327 235 42 509 510 326 149 691 234 41 508 325 148 972 690 607 1789 2555 3 8 1889 2282 793 30 1083 792 600 337 29 697 1082 2554 2668 5 2374 2281 2190 1788 324 413 506 1441 1522 1263 1165 7 40 233 781 1070 873 606 1620 1439 1984 508 218 599 28 2553 2453 2280 8 1888 1786 1704 1438 1348 1521 7 1069 780 689 605 505 412 323 147 39 232 322 411 1164 971 870 779 1261 12 2666 23Sh 334 217 688 12 503 410 146 37 231 321 6 409 1163 2188 2552 2279 596 11 882 789 27 120 1347 2077 8 144 SRPRNA 13 604 408 4 778 9 1983 1887 1784 1703 1437 1346 1260 1162 36 230 970 687 1067 7 505 694 504 880 881 788 12 331 595 214 1079 693 501 330 2664 5Sh 2550 2452 2187 2373 2549 12 2075 1701 1520 1259 5 869 969 407 502 320 229 142 777 9 406 1980 1981 2662 16Sg 2451 16 2074 2548 2661 1783 1886 1979 2277 1782 1435 6 1345 1066 5 1161 685 35 140 501 319 228 139 34 776 8 968 1619 1519 1434 1344 1698 1618 11 1697 592 118 2660 13 2276 2073 2450 2371 2185 Rho-ind. terminator 784 325 690 6 493 23 9 2657 2547 2448 23Sg 2184 2274 2370 2183 1433 1258 1160 1064 1517 1518 1978 1781 1885 1977 5 1779 1884 6 1343 1617 1432 1695 1615 1431 1516 1257 1159 775 684 603 500 405 318 227 33 137 683 499 317 225 1063 866 498 965 5 136 32 404 774 1062 682 602 315 224 135 496 403 1158 1061 964 864 773 495 223 5Sc 1 134 1342 2072 1976 1778 1614 1515 1430 211 689 491 322 116 1157 1060 2656 5Sg 1 1074 783 589 688 782 210 318 21 494 402 7 31 222 963 863 1883 1777 2273 2447 2751 490 2369 2272 2182 2070 681 1341 1693 1429 1256 1882 1975 2181 7 1776 2545 2655 2446 115 317 2750 485 486 487 Other RNA 1072 975 871 686 2544 2654 316 114 2653 2368 2271 2180 2069 1974 1880 12 601 772 221 23Sc 1156 1612 1340 1513 1611 1692 10 1775 1610 6 30 132 493 1059 962 1155 1058 1428 1154 400 861 862 771 29 220 680 131 11 492 399 1255 1427 1339 2270 1973 2445 207 586 113 5 313 11 585 483 16 780 584 311 205 112 19 2749 2542 2651 2444 2541 2649 2443 2365 2366 2269 6 1879 1774 2179 1690 1972 5 2068 2178 2267 2268 2067 1773 1689 1878 1512 1609 7 6 1426 1253 1153 1056 961 770 6 600 491 2 860 28 217 678 859 16Sc 130 960 858 1152 959 857 769 677 597 490 1511 1425 1608 6 1338 1252 8 216 398 27 1054 7 9 596 1151 856 958 1971 1772 685 583 870 1607 1509 1424 1250 2748 309 204 111 1053 312 129 676 595 397 215 1150 2176 2647 18 779 582 479 2747 12 2442 11 2364 26 128 768 1688 2266 2066 1970 12 2265 684 308 2538 110 867 580 476 2064 2441 17 2746 307 201 683 475 8 2537 2646 2174 2363 109 2440 2262 2536 200 108 1 Kb 303 2744 2645 2362 2172 2063 1968 1877 9 1508 1423 1337 675 396 957 594 1249 1606 1687 309 489 855 1149 1507 1050 767 674 6 395 213 127 214 23 24 126 593 1875 1771 1686 10 1336 1506 1422 1685 1874 1967 1148 956 1248 854 6 22 673 7 308 592 488 212 125 1049 21 1335 1505 1605 1684 1504 1421 1246 766 853 672 591 394 307 211 1048 1147 954 682 1683 1604 1503 2535 15 107 199 864 301 2743 Conserved hypothetical 1062 963 862 769 678 465 299 198 2741 2644 2534 2171 2439 No database match 962 768 676 677 197 13 2740 2643 2360 2261 12 2062 1966 Protein fate/protein synthesis 1060 767 294 196 463 464 195 104 2532 2642 5 2260 2168 2061 11 1770 1682 5 1603 1419 1334 1873 1245 6 487 590 306 210 20 1420 953 852 764 671 589 12 122 1047 1333 11 486 Other categories 1059 961 860 766 675 573 460 293 194 103 2739 2531 2359 2438 2358 2259 2166 2167 1681 1872 1965 1502 12 1244 1602 7 1418 2060 1769 1964 2058 2059 2258 1601 1242 9 1145 951 305 209 393 851 763 668 208 1332 849 587 1046 950 848 1680 1871 13 667 304 485 392 19 Transport and binding proteins 1058 765 674 193 291 9 12 9 2641 2529 6 1501 1962 1963 2257 2164 2057 2437 2356 2256 7 1679 1869 1768 1330 1417 1600 1870 1329 1599 5 1144 1239 1240 1044 9 762 tmRNA 586 1045 761 949 847 585 483 207 120 Regulatory functions 1057 859 764 960 458 571 673 290 11 102 2528 7 2 2056 1960 1678 1500 5 1238 948 760 665 583 391 11 18 Transcription 457 288 192 101 2738 2638 6 7 2255 2162 2055 1959 2436 2524 2525 2527 2637 1767 9 1416 1598 1868 2355 14 1328 7 1237 1043 303 482 206 119 16 Nucleotides 958 856 763 671 191 9 10 2254 15 2053 1675 2353 2434 2435 2736 1499 1597 1956 5 1327 1236 1141 1042 846 759 5 204 118 Fatty acid and phospholipid metabolism 1056 855 568 455 287 100 2636 2523 2433 2352 9 2161 2052 2253 1955 1867 1766 1498 1674 10 1415 1325 1235 1041 481 582 947 664 480 14 203 15 117 DNA metabolism 854 957 669 762 12 567 454 190 99 8 2734 2522 10 2432 2159 2160 2051 9 1866 11 1596 1497 1414 1140 1040 581 390 302 758 9 19 116 14 Energy metabolism 1055 853 1 14 189 98 7 2733 2520 2350 2252 2635 2157 2431 2349 286 2519 285 956 6 566 284 186 97 2634 2156 2251 2049 1953 1765 1673 1595 1323 1139 944 663 478 202 115 580 943 757 579 845 662 942 756 11 12 477 13 114 201 300 113 578 389 1234 1496 1138 1039 941 661 576 577 299 1 12 Central intermediary metabolism 1054 2348 2250 1865 1413 1594 1321 1495 2048 1137 1232 1038 1136 112 200 476 755 5Sd 111 Cellular processes 1053 5 453 667 4 2047 1952 12 1672 2517 2518 2732 760 565 852 283 96 3 1231 1593 1411 2153 2430 1135 940 5 575 660 298 844 753 939 574 475 5 10 Biosynthesis of cofactors, prosthetic groups, and carriers 1052 851 8 666 758 564 450 183 759 2731 95 2347 2249 2631 2632 2633 2514 2346 2248 843 9 1037 9 1764 1671 2046 2152 2045 1494 12 1229 1592 1864 1410 1951 1863 1763 1670 1591 1320 199 9 Cell envelope 1051 9 449 282 94 952 850 1050 951 756 13 665 182 92 563 447 5 949 663 181 1 2428 2729 2730 2630 2727 2513 2427 2510 2511 2426 2345 1036 8 23Sd 938 752 659 573 842 474 296 297 198 108 5 6 7 1134 841 937 1228 1319 1133 2247 1862 1762 1409 1226 936 2 5 570 571 295 4 749 750 751 658 473 840 568 2628 2629 2425 6 2344 2246 2146 2043 1950 196 16Sd 106 1590 1318 1132 1035 748 293 9 1861 1761 1669 1492 1408 1224 935 839 2145 2245 2042 1860 1760 567 195 7 747 1317 1589 3 105 656 657 1034 1407 12 934 838 746 655 386 472 566 385 194 291 1 2 103 articles NATURE | VOL 406 | 3 AUGUST 2000 | www.nature.com
© Copyright 2025 Paperzz