On surrogate methods for detecting lateral gene transfer

FEMS Microbiology Letters 201 (2001) 187–191
www.fems-microbiology.org
Mark A. Ragan *
Institute for Molecular Bioscience, The University of Queensland, Brisbane, Qld 4072, Australia
Received 17 May 2001; accepted 29 May 2001
First published online 25 June 2001
Abstract
Surrogate methods for detecting lateral gene transfer are those that do not require inference of phylogenetic trees. Herein I apply four
such methods to identify open reading frames (ORFs) in the genome of Escherichia coli K12 that may have arisen by lateral gene transfer.
Only two of these methods detect the same ORFs more frequently than expected by chance, whereas several intersections contain many
fewer ORFs than expected. Each of the four methods detects a different non-random set of ORFs. The methods may detect lateral ORFs of
different relative ages; testing this hypothesis will require rigorous inference of trees. © 2001 Federation of European Microbiological
Societies. Published by Elsevier Science B.V. All rights reserved.
Keywords: Microbial evolution; Lateral gene transfer; Horizontal gene transfer; Nucleotide composition; Markov model; GC content
1. Introduction
Darwin's explanation of organismal diversity (as the consequence of genealogical descent, with genetic
modification, as on a bifurcating tree) stands as a landmark of biology, indeed of modern scientific enquiry.
This paradigm unifies diverse observations, and connects them to underlying genetic processes now understood
at the molecular level. However, Darwin's paradigm has been, and continues to be, violated in certain
circumstances. Plastids and mitochondria originated not by treelike 'vertical' descent within the eukaryotic
lineage, but rather by lateral transfer of genetic information from bacteria; and subsequent plastid
diversification has involved further lateral events [1]. Plasmids spread resistance to antibiotics laterally
among bacterial populations [2].
Organelles and plasmids might be considered special cases of limited extent in time and space: complications
of, not threats to, Darwin's paradigm. The same cannot be said of new analyses, arising in part from the wealth
of newly available prokaryotic genome sequences, that appear to reveal rampant lateral gene transfer (LGT; also
known as horizontal gene transfer) among microbes [3–13]. The most convincing among these are based on
phylogenetic inference. Well-supported topological disagreement between the tree inferred for one gene family
and that inferred for another can often be parsimoniously explained only by invoking LGT [14–17]. However, it
is difficult to extend these studies to all known genes and genomes. Many gene families are intrinsically of
restricted phyletic distribution; and some genes accept changes so rapidly that orthologs cannot confidently be
identified or aligned, unavoidably yielding sparse trees with weakly supported topological features. Large data
sets, on the other hand, pose computational challenges both for inferring trees and for assessing confidence
intervals of subtrees. Unknown support for subtrees of unknown optimality is not a promising foundation from
which to assess the prevalence of LGT.
* Tel.: +61 (7) 3365-1160; Fax: +61 (7) 3365-4388.
Thus there has been considerable interest in developing methods by which laterally transferred genes can be
identified without the need to infer gene or protein trees. At least seven such surrogate methods have been put
forward [3–13], some applicable only to regions (e.g. 50-kb windows) of genomic sequence, others to individual
open reading frames (ORFs). Each genomic region or ORF is assessed to determine how typical it is of other
regions or ORFs in the genome under investigation. Depending on how the assessment criteria are structured,
atypicality may indicate that the genomic region or ORF has an origin different from that of the rest of the
genome, i.e. has arisen by LGT. Here I focus on surrogate methods that can be applied to individual ORFs.
Lawrence and Ochman [3–5] identified ORFs in the genome of Escherichia coli K12 that have nucleotide contents
[3] or codon usage patterns [4] atypical of other E. coli ORFs. They demonstrated that nucleotide composition
at each codon position is, at equilibrium, linearly related to nucleotide composition of the entire genome,
and described a process of 'amelioration' [3] during which introgressed ORFs become progressively more like
those of the host. They identified a set of E. coli ORFs, some 17.5% of the total, that by these criteria are
candidates for having arisen by LGT. This approach was recently criticized for high rates of both false
positives and false negatives [18].
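The idea of scoring each ORF against the compositional norm of its genome can be illustrated with a minimal sketch. It is not the procedure of Lawrence and Ochman [3,4]: the use of third-codon-position GC content alone, the z-score test, the cutoff of 2.0 and the function names are all assumptions made here for illustration.

```python
from statistics import mean, pstdev


def gc_by_codon_position(orf: str) -> tuple:
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame ORF."""
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]             # every third base, starting at pos
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)


def flag_atypical(orfs: dict, z_cutoff: float = 2.0) -> set:
    """Flag ORFs whose third-position GC content lies more than z_cutoff
    standard deviations from the mean over all ORFs (hypothetical criterion)."""
    gc3 = {name: gc_by_codon_position(seq)[2] for name, seq in orfs.items()}
    mu = mean(gc3.values())
    sd = pstdev(gc3.values())
    if sd == 0:
        return set()                          # no variation, nothing is atypical
    return {name for name, value in gc3.items() if abs(value - mu) / sd > z_cutoff}
```

A real analysis would also need to allow for amelioration and for ORF length, since short ORFs give noisy composition estimates.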
Coding regions differ from non-coding regions in statistically definable ways [19–21] that can be formally
expressed as Markov models. These models can be used to identify genes in microbial [6,22,23] and other
genomes. The models are parameterized on training sets from specific genomes, and thus tend to be
genome-specific. Nonetheless, even when optimized in this way and run to convergence, they usually fail to
detect all genes known to be present in that genome. Hayes and Borodovsky [6] showed that a model trained on
ORFs identified (by various criteria) as atypical for a given genome could efficiently recognize many ORFs not
found by a model trained on 'typical' ORFs. In the case of E. coli, these additional ORFs show a functional
profile distinct, in part, from that of typical ORFs. These authors proposed that many ORFs detected by the
atypical model had been laterally transferred into the E. coli genome [6].
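The contrast between a 'typical' and an 'atypical' model can be sketched with two first-order Markov chains, each trained on its own set of sequences; an ORF is then assigned to whichever model gives it the higher log-likelihood. This is a deliberate simplification and not the method of Hayes and Borodovsky [6]: the model order, the pseudocount and all names below are assumptions.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | prev),
    smoothed with a pseudocount so that no transition has probability zero."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in seqs:
        seq = seq.upper()
        for prev, nxt in zip(seq, seq[1:]):
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {nxt: c / total for nxt, c in row.items()}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities along seq (ambiguous bases are skipped)."""
    seq = seq.upper()
    return sum(
        math.log(model[prev][nxt])
        for prev, nxt in zip(seq, seq[1:])
        if prev in model and nxt in model[prev]
    )


def prefers_atypical(seq, typical, atypical):
    """True if the atypical model explains seq better than the typical model."""
    return log_likelihood(seq, atypical) > log_likelihood(seq, typical)
```

For example, with `typical = train_markov(typical_orfs)` and `atypical = train_markov(atypical_orfs)`, where both training lists are supplied by the user, `prefers_atypical(orf, typical, atypical)` gives a crude per-ORF call.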
A third and conceptually different approach has been introduced by Clarke et al. (submitted). These authors
sorted GenBank by species to create a target database for BLASTP analysis, and identified ORFs that show
patterns of BLAST matches significantly different from the median pattern shown by ORFs in that same genome.
These ORFs were termed 'phylogenetically discordant' because if a distance matrix were constituted from the
numerical values of the BLASTP analysis and a tree generated, its topology would presumably be anomalous for
ORFs in that genome. Indeed, removal of discordant sequences strengthened bootstrap support for key topological
features in whole-genome distance trees calculated from mean pairwise BLASTP expectation scores. LGT is the
most obvious means by which an ORF could have a genealogical history different from that of its host genome.
Finally, LGT may generate unusual patterns of gene distribution among organisms. Catastrophic gene loss or
lineage-specific rates of sequence change can also generate unusual distributional patterns, especially if they
occur in multiple lineages (i.e. on multiple occasions). Occasionally, independent evidence (e.g. from gene
co-localization, or remnants of transposons) may implicate LGT as the cause of a particular distribution [12];
but in the absence of such evidence, how are patterns caused by LGT to be distinguished from those explicable
within the framework of vertical transmission? Ragan and Charlebois (submitted) sorted GenBank into
hierarchical taxa, and implemented a dual-threshold criterion to distinguish LGT from non-LGT patterns. A
BLASTP match with expectation value better than a rigorous inclusion threshold was taken to indicate the
presence of a putative homolog in the target taxon, while lack of a BLASTP match with expectation value better
than a more permissive exclusion threshold was taken to imply that homologs are absent. As the inclusion
threshold is made more stringent and the exclusion threshold less stringent, lineage-specific rate effects are
progressively filtered out, with only the clearest patterns remaining. Furthermore, the more sparse the phyletic
distribution, the less parsimoniously can the pattern be attributed to multiple catastrophic losses. By this
approach, these authors identified, for each of 23 bacterial genomes, a set of ORFs with homologs in only one
non-self bacterial phylum. Most of these homologs occur among Firmicutes and Proteobacteria, which in a
resolved bifurcating tree cannot be sister lineages to all 23 of the bacteria in question. Thus overall, these
sets must be enriched in laterally transferred ORFs.
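The dual-threshold logic described above can be sketched as follows. The default thresholds are the values 10^-10 and 10^-5 quoted in Materials and methods; the treatment of matches that fall between the two thresholds as 'ambiguous' (and as disqualifying), the function names and the example phylum labels are assumptions of this sketch, not details taken from Ragan and Charlebois.

```python
INCLUSION = 1e-10   # rigorous: a best match with e <= INCLUSION means "homolog present"
EXCLUSION = 1e-5    # permissive: no match with e < EXCLUSION means "homolog absent"


def classify_taxon(best_evalue, inclusion=INCLUSION, exclusion=EXCLUSION):
    """Classify one target taxon from the best (lowest) BLASTP expectation
    value found in it; pass None if there was no hit at all."""
    if best_evalue is not None and best_evalue <= inclusion:
        return "present"
    if best_evalue is None or best_evalue >= exclusion:
        return "absent"
    return "ambiguous"          # between the thresholds: neither clearly in nor out


def has_anomalous_distribution(best_by_phylum, self_phylum):
    """True if exactly one non-self phylum is 'present' and all other
    non-self phyla are clearly 'absent' (assumed reading of the criterion)."""
    calls = [
        classify_taxon(evalue)
        for phylum, evalue in best_by_phylum.items()
        if phylum != self_phylum
    ]
    return calls.count("present") == 1 and calls.count("ambiguous") == 0
```

Tightening `inclusion` and loosening `exclusion` widens the ambiguous zone, which is how the text's progressive filtering of lineage-specific rate effects would show up in such a scheme.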
Herein I examine whether these four surrogate methods detect the same ORFs in the genome of E. coli K12 [24].
Any ORF with an atypical base composition, detected by an atypical Markov model, presenting an atypical pattern
of BLAST matches, and with its only sure homolog in a non-adjacent lineage, would bear the weight of
circumstantial evidence against sharing a common phylogenetic history with its more typical counterparts in the
same genome. Alternatively, should these approaches fail to identify a common set of ORFs, we might call into
question their validity or generality as surrogate methods for detecting LGT.
2. Materials and methods
Datasets were generated by analysis of the genome sequence of E. coli K12 MG1655 (GenBank accession U00096). An
updated list of ORFs having anomalous base compositions [3] was generously provided by Jeffrey Lawrence. An
updated list of ORFs found by an atypical Markov model [6] was kindly supplied by Mark Borodovsky. The list of
phylogenetically discordant ORFs combines results from six separate PDS analyses, i.e. two BLASTP expectation
thresholds (e ≤ 10^-10 and e ≤ 10^-5) and three levels of GenBank target sets (delineation at NCBI level 0, all
sequences; level 1, Bacteria; and level 2, Proteobacteria), and was graciously furnished by Robert Charlebois.
ORFs with anomalous phyletic distributions are here defined as those finding a BLASTP match in exactly one
non-self bacterial phylum (second-level NCBI category) at inclusion thresholds e ≤ 10^-10 or e ≤ 10^-20, and
exclusion threshold e ≥ 10^-5. The four ORF lists were combined in a single spreadsheet and indexed by NCBI gi
number; where necessary, ORFs were re-indexed to gi number by BLAST comparison against GenBank U00096. ORF
spacings were based on the order in U00096, ignoring tRNA and other non-protein-coding genes. Intersection
sizes and ORF spacings expected under a stochastic model (see below) were determined by Monte Carlo simulations
(n = 10 000 replicates) assuming the location of each ORF to be independent of every other ORF.
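A Monte Carlo simulation of intersection sizes of this kind can be sketched as below. The sketch assumes that each method's ORFs are drawn uniformly at random, without replacement, from the genome, and that the 95% interval is obtained by trimming 2.5% of replicates from each end; the function names, the seed and the total number of ORFs passed as `n_orfs` are assumptions, since the text does not give the program or the genome-wide ORF count that was used.

```python
import random
from collections import Counter


def simulate_intersections(set_sizes, n_orfs, n_replicates=10_000, seed=0):
    """For each replicate, draw one random ORF set per method and count how
    many ORFs fall in each exact combination of methods."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_replicates):
        membership = {}                       # ORF index -> methods that drew it
        for method, size in set_sizes.items():
            for orf in rng.sample(range(n_orfs), size):
                membership.setdefault(orf, []).append(method)
        results.append(Counter(frozenset(m) for m in membership.values()))
    return results


def summarize(results, combo):
    """Mean, full range and central 95% interval of the count for one combination."""
    counts = sorted(r[frozenset(combo)] for r in results)
    n = len(counts)
    trim = n * 25 // 1000                     # 2.5% of replicates from each end
    return {
        "mean": sum(counts) / n,
        "range": (counts[0], counts[-1]),
        "interval95": (counts[trim], counts[n - 1 - trim]),
    }
```

With the four observed set sizes supplied as `set_sizes` (for example keyed by `"GC"`, `"MM"`, `"PD"` and `"DP"`), `summarize(results, {"GC", "MM"})` would give the simulated distribution for ORFs found by exactly those two methods.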
3. Results
Numbers of ORFs in the genome of E. coli K12 MG1655 identified by the four surrogate methods, individually and
in combination, are shown in Table 1. It is not surprising that different numbers of ORFs are identified, as
each method utilizes (explicitly or implicitly) one or more thresholds which were not here selected in any
standardized or coordinated way (indeed, no conceptual framework exists for doing so). The extent to which the
four methods identify (insofar as possible) the same atypical ORFs is the more relevant and interesting
question.
The null model posits that the methods are independent of one another, and that each ORF is assessed
independently of every other ORF. Consider genome G containing $N_{GEN}$ ORFs. If by application of method A to
genome G we identify $N_A$ atypical ORFs, and by method B identify $N_B$ atypical ORFs, then we expect

$N_{AB} = (N_A / N_{GEN}) \times (N_B / N_{GEN}) \times N_{GEN}$

in the intersection (identified by both methods), and

$N_{Ab} = (N_A / N_{GEN}) \times [1 - (N_B / N_{GEN})] \times N_{GEN}$

to be found by A but not by B. This model is easily extended to all intersections among multiple independent
methods. If we observe significantly more than the expected number of ORFs in an intersection, those methods
preferentially target the same ORFs. Table 1 shows observed and expected numbers of E. coli ORFs in all
intersections among these four surrogate methods. Because little is known of the statistical distribution of
ORF subclasses, confidence was assessed relative to full and 95% ranges of counts found in 10 000 simulations.
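The extension of the null model to several methods can be sketched as a product over methods: for each combination, multiply the fraction N/N_GEN for every method in the combination by one minus that fraction for every method outside it. The observed counts in Table 1 sum to the union row, which suggests that each cell counts ORFs found by exactly the listed methods; the sketch follows that reading. The function name and the genome-wide ORF total passed as `n_orfs` are assumptions.

```python
from itertools import combinations


def expected_cell_counts(set_sizes, n_orfs):
    """Expected number of ORFs flagged by exactly each combination of
    methods, assuming the methods are mutually independent."""
    methods = sorted(set_sizes)
    expected = {}
    for k in range(1, len(methods) + 1):
        for combo in combinations(methods, k):
            p = 1.0
            for method in methods:
                frac = set_sizes[method] / n_orfs
                # inside the combination: flagged; outside it: not flagged
                p *= frac if method in combo else (1.0 - frac)
            expected[combo] = p * n_orfs
    return expected
```

For two methods this reduces to the expressions for N_AB and N_Ab given in the text.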
Several results stand out as very different from expectations under the null model. In particular, more than
twice as many ORFs as expected are both compositionally atypical and identified by an atypical Markov model;
some of these are also phylogenetically discordant, or have an atypical distributional profile. Hayes and
Borodovsky [6] observed that lengthy ORFs found by their atypical model tend to have atypical oligonucleotide
compositions. Except where both base composition and Markov model methods are involved, however, an E. coli
ORF judged atypical by any one of these methods is less likely to be found atypical by another method than
expected by chance. For example, ORFs that are compositionally atypical and phylogenetically discordant are 69%
fewer than expected. ORFs that have both atypical distributional profiles and atypical base compositions, or
atypical profiles and are phylogenetically discordant, are 40% fewer than expected (Table 1).
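The percentages quoted above follow directly from the observed and expected counts in Table 1. A short worked example, in which only the function name is an addition:

```python
def percent_deviation(observed, expected):
    """Signed difference between observed and expected counts, as a percentage of expected."""
    return 100.0 * (observed - expected) / expected


# Observed and expected counts as listed in Table 1
gc_and_mm = round(percent_deviation(201, 95.5))   # base composition and Markov model
gc_and_pd = round(percent_deviation(18, 57.4))    # base composition and discordance
gc_and_dp = round(percent_deviation(24, 39.8))    # base composition and distribution
print(gc_and_mm, gc_and_pd, gc_and_dp)            # prints: 110 -69 -40
```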
Table 1
Numbers of E. coli ORFs identified by four surrogate methods, individually and in combinations

Criterion      # ORFs exp   Sim range    Sim 95%      # ORFs obs   % +/-
All GC         –            –            –            752          –
All MM         –            –            –            650          –
All PD         –            –            –            416          –
All DP         –            –            –            297          –
GC only        535.1        492–575      514–559      462**        -14
MM only        449.6        407–490      430–472      332**        -26
PD only        270.4        230–304      253–289      322**        +19
DP only        187.3        157–216      171–203      200          +7
GC∩MM          95.5         67–129       79–112       201**        +110
GC∩PD          57.4         33–85        45–71        18**         -69
GC∩DP          39.8         20–60        29–52        24*          -40
MM∩PD          48.3         25–73        36–61        39           -19
MM∩DP          33.4         15–54        23–44        32           -4
PD∩DP          20.1         7–37         12–29        12           -40
GC∩MM∩PD       10.2         1–23         5–17         20*          +96
GC∩MM∩DP       7.1          0–20         3–13         24**         +238
GC∩PD∩DP       4.3          0–13         1–9          3
MM∩PD∩DP       3.6          0–14         1–8          2
GC∩MM∩PD∩DP    0.8          0–5          0–3          0
GC∪MM∪PD∪DP    1763         1716–1823    1740–1794    1691**       -4.1

Expectations are calculated as described in the text. Simulations range is total range over 10 000
simulations; simulations 95% interval was calculated by removing 2.5% of simulations from each end of the
distribution. Asterisks mark observed values outside the full range (**) or 95% interval (*) of simulations.
GC, base composition method [3]; MM, Markov model method [6]; PD, phylogenetic discordance method (Clarke et
al., unpublished); DP, distributional profile method (Ragan and Charlebois, submitted). Boolean symbols: ∩,
intersection of the data sets; ∪, union of the data sets.

Table 2
Spatial distribution of ORFs identified by each of the four surrogate methods

                    ORF-to-ORF spacing, in numbers of ORFs
Method  Statistic   1          2          3–5        6–10       11–20      21–30      31–50      51+
GC      observed    69.9**     1.9**      6.8**      6.0**      7.9**      3.1**      3.5**      1.5**
GC      sim mean    17.5       14.5       11.9       24.6       19.5       10.3       1.5        0.2
GC      sim range   12.8–21.9  15.0–20.1  7.6–16.8   18.8–30.1  14.6–24.6  6.9–13.8   0.3–3.1    0.0–0.9
GC      sim 95%     15.0–20.1  12.1–16.9  9.7–14.2   21.5–27.7  16.9–22.3  8.6–12.0   0.8–2.3    0.0–0.7
MM      observed    28.3**     13.7       21.5**     16.2**     14.5**     3.2**      2.5        0.2
MM      sim mean    15.1       12.8       10.9       23.8       20.9       13.2       2.6        0.6
MM      sim range   10.2–20.3  8.6–18.0   6.6–15.4   18.2–30.0  15.4–25.5  8.9–17.7   0.6–5.2    0.0–1.8
MM      sim 95%     10.8–17.7  10.5–15.2  8.6–13.2   20.6–27.1  18.0–24.0  11.1–15.4  1.5–3.7    0.2–1.2
PD      observed    11.3       10.1       22.1**     19.5       23.3       9.4**      3.4*       1.0*
PD      sim mean    9.7        8.7        7.9        19.4       21.7       20.9       7.5        3.7
PD      sim range   4.6–15.1   4.3–13.9   3.6–13.2   12.7–27.4  15.1–28.1  13.9–28.4  2.9–12.5   0.7–7.2
PD      sim 95%     7.0–12.5   6.3–11.5   5.5–10.3   15.9–23.1  17.8–25.7  17.1–24.8  5.3–9.9    2.2–5.3
DP      observed    12.8**     9.4*       15.5**     14.1       24.6*      12.5**     7.4*       3.4**
DP      sim mean    6.9        6.4        6.0        15.6       19.6       23.3       11.4       8.3
DP      sim range   2.7–12.5   1.7–12.5   1.7–11.4   8.8–23.6   11.1–29.0  14.8–33.0  5.4–18.9   3.7–13.8
DP      sim 95%     4.4–9.8    3.7–9.1    3.4–8.8    11.8–19.5  15.2–23.9  18.5–28.3  8.1–15.2   5.7–10.8

Observed distance between ORFs, mean and full range of simulations, and interval containing 95% of simulated
values, each expressed as percent of ORFs in each category. Asterisks mark observed values outside the full
range (**) or 95% interval (*) of simulations. Rows may not add to 100% due to rounding. Abbreviations for the
methods are as in Table 1.

Two lines of evidence confirm that the four surrogate methods in fact detect different subsets of the E. coli
genome. First, the four atypical sets exhibit very different ORF-to-ORF spacings (Table 2), with 70% of
compositionally atypical ORFs, but only 11% of phylogenetically discordant ORFs and 13% of those with
unexpected distributions, adjacent to another such ORF. Conversely, 23% of the latter ORFs, but only 6% of
those found by an atypical Markov model and 8% of those atypical by base composition, are located 21 ORFs or
more from the next such ORF downstream. Second, the base composition method identifies 47 insertion sequences
and transposases, whereas an atypical Markov model finds 17, the anomalous distribution method three, and
phylogenetic discordance only one (results not shown).
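Spacing percentages of the kind reported in Table 2 can be computed from the genome-order positions of the ORFs flagged by one method. In the sketch below the bins are those of Table 2; treating the genome as linear (so that the last flagged ORF contributes no spacing) and the function name are assumptions, as the text does not say how the ends of the ORF list were handled.

```python
# Spacing bins of Table 2: (label, lowest spacing, highest spacing or None for open-ended)
BINS = [("1", 1, 1), ("2", 2, 2), ("3-5", 3, 5), ("6-10", 6, 10),
        ("11-20", 11, 20), ("21-30", 21, 30), ("31-50", 31, 50), ("51+", 51, None)]


def spacing_percentages(flagged_positions):
    """Percentage of flagged ORFs whose next flagged ORF downstream lies
    at each spacing, where positions are ORF ranks in genome order."""
    positions = sorted(set(flagged_positions))
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    counts = {label: 0 for label, _, _ in BINS}
    for gap in gaps:
        for label, low, high in BINS:
            if gap >= low and (high is None or gap <= high):
                counts[label] += 1
                break
    total = len(gaps)
    return {label: (100.0 * c / total if total else 0.0) for label, c in counts.items()}
```

Applying the same function to randomly drawn position sets of the same size would give a simulated null distribution for each bin.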
4. Discussion
These four surrogate methods fail almost completely to identify a common set of E. coli ORFs. The pair of
methods overlapping most in their predictions (the base composition approach of Lawrence and Ochman [3] and the
atypical Markov model of Hayes and Borodovsky [6]) identifies the same ORFs only about twice as frequently as
expected by chance. Several pairs of methods find common ORFs much less frequently than by chance. Indeed,
surprisingly few intersections fall within even the full range of simulated values. The four predicted sets of
ORFs are spaced differently across the E. coli genome, and include very different numbers of insertion
sequences and transposons. Thus each surrogate method does in fact find a non-random set of ORFs, whatever its
nature. If these are laterally transferred ORFs, there must be several distinct subsets.
The observed frequencies at which each method detects insertion sequences and transposases suggest an
explanation. Base composition differences are in many cases 'ameliorated' over some tens to a few hundred
million years [3,5]. Compositional difference may thus preferentially detect recent lateral transfers,
including transposable elements. The atypical Markov models detect some of these, plus ORFs whose base
compositions (but not oligonucleotide frequencies or codon usage) have equilibrated with their new genomic
background. The other two methods explicitly focus on cross-phylum and cross-domain patterns, hence might be
expected to detect more ancient events.
This study demonstrates the need for a systematic, comprehensive approach to the study of LGT based on first
principles, i.e. rigorous inference and statistically based comparison of molecular phylogenetic trees. As
more genomic sequences appear, a tree-based approach will become both more challenging and more rewarding.
Perhaps only by such an approach will we ultimately learn what these surrogate methods are actually detecting.
Acknowledgements
I thank Robert Charlebois, Jeff Lawrence and Mark Borodovsky for access to data; Robert Charlebois and
W. Ford Doolittle for discussions; Ian Bailey-Mortimer and Emily McGhie for expert programming; and the
Canadian Institute for Advanced Research for fellowship support.
References
[1] Gray, M.W. (1999) Evolution of organellar genomes. Curr. Opin. Genet. Dev. 9, 678–687.
[2] Spratt, B.G. and Maiden, M.C. (1999) Bacterial population genetics, evolution and epidemiology. Phil.
Trans. R. Soc. Lond. Biol. 354, 701–710.
[3] Lawrence, J.G. and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange.
J. Mol. Evol. 44, 383–397.
[4] Lawrence, J.G. and Ochman, H. (1998) Molecular archaeology of the Escherichia coli genome. Proc. Natl.
Acad. Sci. USA 95, 9413–9417.
[5] Ochman, H. and Lawrence, J.G. (1996) Phylogenetics and the amelioration of bacterial genomes. In:
Escherichia coli and Salmonella. Cellular and Molecular Biology, 2nd edn., Vol. 2 (Neidhardt, F.C., Curtis,
R., III, Ingraham, J.L., Lin, E.C.C., Low, K.B., Magasanik, B., Reznikoff, W.S., Riley, M., Schaechter, M. and
Umbarger, H.E., Eds.), pp. 2627–2637. American Society for Microbiology, Washington, DC.
[6] Hayes, W.S. and Borodovsky, M. (1998) How to interpret an anonymous bacterial genome: machine learning
approach to gene identification. Genome Res. 8, 1154–1171.
[7] Koonin, E.V., Mushegian, A.R., Galperin, M.Y. and Walker, D.R. (1997) Comparison of archaeal and bacterial
genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for
the archaea. Mol. Microbiol. 25, 619–637.
[8] Koonin, E.V. and Galperin, M.Y. (1997) Prokaryotic genomes: the emerging paradigm of genome-based
microbiology. Curr. Opin. Genet. Dev. 7, 757–763.
[9] Karlin, S., Mrázek, J. and Campbell, A.M. (1998) Codon usages in different gene classes of the Escherichia
coli genome. Mol. Microbiol. 29, 1341–1355.
[10] Karlin, S., Brocchieri, L., Mrázek, J., Campbell, A.M. and Spormann, A.M. (1999) A chimeric prokaryotic
ancestry of mitochondria and primitive eukaryotes. Proc. Natl. Acad. Sci. USA 96, 9190–9195.
[11] Makarova, K.S., Aravind, L., Galperin, M.Y., Grishin, N.V., Tatusov, R.L., Wolf, Y.I. and Koonin, E.V.
(1999) Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the
stable core, and the variable shell. Genome Res. 9, 608–628.
[12] Nelson, K.E., Clayton, R.A., Gill, S.R., Gwinn, M.L., Dodson, R.J., Haft, D.H., Hickey, E.K., Peterson,
J.D., Nelson, W.C., Ketchum, K.A., McDonald, L., Utterback, T.R., Malek, J.A., Linher, K.D., Garrett, M.M.,
Stewart, A.M., Cotton, M.D., Pratt, M.S., Phillips, C.A., Richardson, D., Heidelberg, J., Sutton, G.G.,
Fleischmann, R.D., Eisen, J.A., White, O., Salzberg, S.L., Smith, H.O., Venter, J.C. and Fraser, C.M. (1999)
Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima.
Nature 399, 323–329.
[13] Worning, P., Jensen, L.J., Nelson, K.E., Brunak, S. and Ussery, D.W. (2000) Structural analysis of DNA
sequence: evidence for lateral gene transfer in Thermotoga maritima. Nucleic Acids Res. 28, 706–709.
[14] Smith, M.W., Feng, D.-F. and Doolittle, R.F. (1992) Evolution by acquisition: the case for horizontal
gene transfers. Trends Biochem. Sci. 17, 489–493.
[15] Brown, J.R. and Doolittle, W.F. (1997) Archaea and the prokaryote-to-eukaryote transition. Microbiol.
Mol. Biol. Rev. 61, 456–502.
[16] Jain, R., Rivera, M.C. and Lake, J.A. (1999) Horizontal gene transfer among genomes: the complexity
hypothesis. Proc. Natl. Acad. Sci. USA 96, 3801–3806.
[17] Nesbø, C.L., Boucher, Y. and Doolittle, W.F. (2001) Comparative genomics of four archaea: is there a core
of euryarchaeal non-transferable proteins? J. Mol. Evol., in press.
[18] Koski, L.B., Morton, R.A. and Golding, G.B. (2001) Codon bias and base composition are poor indicators
of horizontally transferred genes. Mol. Biol. Evol. 18, 404–412.
[19] Fickett, J. (1982) Recognition of protein-coding regions in DNA sequences. Nucleic Acids Res. 10,
5303–5318.
[20] Gribskov, M., Devereux, J. and Burgess, R.R. (1984) The codon preference plot: graphic analysis of
protein coding sequences and prediction of gene expression. Nucleic Acids Res. 12, 539–549.
[21] Staden, R. (1984) Measurements of the effect that coding for a protein has on DNA sequence and their use
for finding genes. Nucleic Acids Res. 12, 551–567.
[22] Lukashin, A.V. and Borodovsky, M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids
Res. 26, 1107–1115.
[23] Salzberg, S.L., Delcher, A.L., Kasif, S. and White, O. (1998) Microbial gene identification using
interpolated Markov models. Nucleic Acids Res. 26, 544–548.
[24] Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J.,
Glasner, J.D., Rode, C.K., Mayhew, G.F., Gregor, J., Davis, N.W., Kirkpatrick, H.A., Goeden, M.A., Rose, D.J.,
Mau, B. and Shao, Y. (1997) The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474.