Inference and Analysis of the Relative Stability of Bacterial

Inference and Analysis of the Relative Stability of Bacterial Chromosomes
Eduardo P. C. Rocha
Unité Génétique des Génomes Bactériens, Institut Pasteur, Paris, France and Atelier de BioInformatique,
Université Pierre et Marie Curie (Paris VI), Paris, France
The stability of genomes is highly variable, both in terms of gene content and gene order. Here I calibrate the loss of gene
order conservation (GOC) through time by fitting a simple probabilistic model on pairwise comparisons involving 126
bacterial genomes. The model computes the probability of separation of pairs of contiguous genes per unit of time and fits
the data better than previous ones while allowing a mechanistic interpretation for the loss of GOC with time. Although the
information on operons is not used in the model, I observe, as expected, that most highly conserved pairs of genes are
indeed within operons. However, even the other pairs are much more conserved than expected given the observed experimental rearrangement rates. After 500 Myr, about 50% of the originally contiguous orthologues remain so in the
average genome. Hence, the large majority of rearrangements must be deleterious and random genome rearrangements
are unlikely to provide for positively selected structural changes. I then use the deviations from the model to define an
intrinsic measure of genome stability that allowed the comparison of distantly related genomes and the inference of
ancestral states. This shows that clades differ in genome stability, with cyanobacteria being the least stable and cproteobacteria the most stable. Without correction for phylogeny, free-living bacteria are the least stable group of genomes,
followed by pathogens, and then endomutualists. However, after correction for phylogenetic inertia (or the removal of
cyanobacteria from the analysis), there is no significant association between genome stability and lifestyle or genome size.
Hence, although this method has allowed uncovering some of mechanisms leading to rearrangements, we still ignore the
forces that differentially shape selection upon genome stability in different species.
Introduction
The stability of genomes results from a mutationselection balance. As a result of different selection pressures,
effective population sizes, and rearrangement rates, some
genomes are significantly more stable than others. Different
selective forces have been recently uncovered that shape
gene order both in prokaryotes and eukaryotes (Hurst,
Pal, and Lercher 2004; Rocha 2004b). Many of the selective
elements shaping the organization of bacterial chromosomes result from fundamental processes of life such as gene
expression and replication (reviewed in Lawrence 2003;
Bentley and Parkhill 2004; Rocha 2004a). Gene expression
shapes the chromosome organization by selecting the clustering of functionally related genes into operons (Jacob and
Monod 1961) and supraoperons (Lathe, Snel, and Bork
2000). This allows the optimization of gene expression regulation (Hershberg, Yeger-Lotem, and Margalit 2005). Replication asymmetries result in the preferential positioning
of genes, especially essential genes, in the leading strand
(Rocha and Danchin 2003), highly expressed genes near
the origin of replication in fast-growing bacteria (Chandler
and Pritchard 1975), and balanced horizontal gene transfer,
and possibly gene deletion, among the two bacterial replichores (Bergthorsson and Ochman 1998). These organizational features allow bacteria to be substantially fitter, and
their disruption is bound to be deleterious.
Inversions have experimentally been found to have
very different fitness effects depending on the exact rearrangements and the selective features that are disrupted.
Many cases have been reported of significantly lower growth
rates and even lethality resulting from inversions (Hill and
Gray 1988; Rebollo, Francxois, and Louarn 1988; Segall,
Mahan, and Roth 1988; Campo et al. 2004). Rearrangements
Key words: genome stability, recombination, rearrangements,
bacteria, evolution, gene order.
E-mail: [email protected].
Mol. Biol. Evol. 23(3):513–522. 2006
doi:10.1093/molbev/msj052
Advance Access publication November 9, 2005
Ó The Author 2005. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: [email protected]
are thought to occur mostly by two types of mechanisms
both involving intrachromosomal recombination (reviewed
in Hughes 1999; Rocha 2004b). Homologous recombination can target inverted repeated elements, such as rDNA
operons or insertion sequences (IS), and lead to large inversions (Hill and Harnish 1981; Nakagawa et al. 2003; Parkhill
et al. 2003). IS and other elements, for example, phages, may
additionally lead to chromosome rearrangements because of
the activity of their recombinases (Gray 2000). One should
note that rearrangements in Escherichia coli are very frequent in the laboratory (Hill and Harnish 1981), but few
are found to have been fixed since the divergence from Salmonella enterica, about ;100 MYA. This is usually thought
to be the result of natural selection acting toward eliminating
these deleterious events, which assumes that most inversions
are at least slightly deleterious. Hence, the frequency of rearrangement events and the subsequent purging of deleterious ones by natural selection will determine the long-term
conservation of gene order in bacterial genomes. The calibration of bacterial stability, understood as the inverse of the
frequency of gene order disruption, can thus shed some light
on both rearrangement mechanisms and selective processes
shaping the organization of the chromosome.
The analysis of complete bacterial genomes has
revealed that the obligatory intracellular mutualists such
as Buchnera are among the most stable genomes. In these
sexually isolated species with small effective population
sizes, horizontal transfer cannot compensate for gene loss.
Combined with relaxed selection this leads to small genomes with few repeated elements, no IS, and few recombination genes (Shigenobu et al. 2000; Lawrence, Hendrix,
and Casjens 2001; Dale et al. 2003; Silva, Latorre, and Moya
2003). These genomes are then extremely stable, with no
large rearrangement apparent for over 100 Myr (Tamas
et al. 2002), just some gene deletions, possibly by illegitimate recombination between small repeats (Mira, Ochman,
and Moran 2001; Silva, Latorre, and Moya 2001; Rocha
2003b). A more nuanced picture is given by the genome
514 Rocha
of pathogenic obligatory intracellular bacteria such as the
ones of chlamydia and rickettsia, which are nevertheless
quite stable and nearly devoid of very large repeats (Kalman
et al. 1999; Frank, Amiri, and Andersson 2002). Intermediate genome stability is found in E. coli and S. enterica. These
genomes show abundant horizontal gene transfer, and some
close strains differ by more than 25% of their genetic material (Bergthorsson and Ochman 1998; Ochman and Jones
2000). However, this has little influence on the overall genome structure because only one large rearrangement differentiates E. coli K12 from S. enterica typhimurium
(McClelland et al. 2001). Among the least stable genomes,
one finds pathogens such as Bordetella pertussis and Yersinia pestis that have very recently diverged from a more stable genome by losing some genes and by gaining a large
number of IS. These genomes show a large number of recent
rearrangements (Parkhill et al. 2003; Chain et al. 2004).
It must be emphasized that most characterizations of
genome stability have relied on the comparisons of a small
number of closely related genomes (Eisen et al. 2000; Tillier
and Collins 2000; Suyama and Bork 2001; Parkhill et al.
2003). This poses a problem if one wants to define a calibrated measure of stability, which would allow to compare
very distant genomes and test the association of stability with
ecological and population parameters. There are currently
two major methodological approaches to the study of gene
order. One approach aims at determining the rearrangement
distance between two genomes. This distance is an estimation of the number of rearrangement events that took place
since speciation and allows the analysis of evolutionary scenarios (Sankoff 2003). However, it is not yet adapted for the
analysis of large sets of bacterial genomes at large rearrangement distances. As a result, most large-scale analyses of gene
order evolution in bacteria have used empirical measures of
distance based on pairwise comparisons of gene order
(Huynen and Bork 1998; Tamames 2001; Rocha 2003a).
Here I try to calibrate bacterial genome stability by building
an index of genome stability that takes into account the average loss of gene order through time. A nonlinear regression
fits the available data, and this allows the assessment of the
relative stability of each genome. I then explore the potential
association of the stability of genomes with their phylogenetic grouping, genome size, and lifestyle.
Methods and Data
Methods
Data
A total of 126 genomes from six clades with more than
eight available genomes were analyzed: Actinobacteria, Cyanobacteria, Firmicutes, a-proteobacteria, b-proteobacteria,
and c-proteobacteria (Supplementary Table 1, Supplementary Material online). Genomes were taken from GenBank.
The classification of bacteria according to their lifestyle was
taken from the literature.
Assignment of Orthology and Phylogenetic Analysis
Orthologues were defined as unique reciprocal best
hits (using the score of a global alignment where the gaps
on the edges of the smallest sequence are not penalized)
with at least 40% similarity in amino acid sequence and less
than 20% of difference in protein length. The analysis of
orthology was made for every pair of genomes within each
phylogenetic group. Because of gene transfer, gene deletion, and sequence divergence, the number of orthologues
decreases with the evolutionary distance. I tested if the most
distant comparisons in each clade still included a significant
number of genes. Among c-proteobacteria, the comparison
between E. coli K12 (;4,000 genes) and Xyllela fastidiosa
Temecula1 (;2,000 genes) included over 1,000 orthologues. Among Firmicutes, the comparison of Bacillus subtilis (;4,100 genes) with Clostridium perfringens (;2,800
genes) included over 1,100 orthologues. Hence, the number
of orthologues involved in the comparisons is always very
large. The evolutionary distances between bacteria were
computed from the 16S rDNA subunit, with Tree-Puzzle
(Schmidt et al. 2002) using the Haseawa, Kishino, and
Yano (HKY)1C model. The assessment of phylogenetic
inertia was done using CONTINUOUS (Pagel 1999) and
the inference of ancestral states with COMPARE (Martins
and Hansen 1997), both using the 16S rDNA phylogenetic
tree as a reference.
Operons
Many recent methods have been developed to unravel
the operon structure of genomes, but most use gene order
conservation (GOC) to attain higher accuracy. Because I
am interested in gene conservation and on the constraints
imposed on it by operons, the use of such methods would
introduce some circularity in the analysis. I have thus identified operons in B. subtilis and E. coli based only on gene
transcription sense, rho-independent terminators identified
by TransTerm (Ermolaeva et al. 2000), and intergenic distance (Salgado et al. 2000).
GOC
We are interested in analyzing the changes in gene order when horizontal gene transfer is excluded (as in Huynen
and Bork 1998; Tamames 2001; Rocha 2003a). For this I
make pairwise comparisons starting from the list of orthologues between the genomes. In such pairwise comparisons, the GOC is defined as the relative frequency with
which two contiguous genes in a genome have their respective orthologues also contiguous in the ordered list of orthologues of the other genome. Hence, GOC can be estimated
between a pair of genomes as the number of pairs of orthologues that are contiguous in the two genomes divided by
the total number of orthologues:
GOC 5
Northologues; contiguous
:
Northologues
This condition of contiguity is somewhat stringent, but
simulations have shown that in bacteria the addition of
a neighborhood (i.e., a vicinity distance larger than one)
changes very little of the results and unnecessarily complicates the statistical analysis (Rocha 2003a). The analysis of
eukaryotes, because of gene loss following genome duplication, will certainly require the inclusion of a neighborhood. When the number of orthologues is large, as in
this work, the probability of a pair of genes being contiguous in the two genomes by chance is very small.
The Evolution of Bacterial Genome Stability 515
Models of GOC
Several empirical models have been proposed to fit the
loss of GOC with time (t), with a parameter a to be adjusted
by regression. Tamames proposed a sigmoid loss of GOC
with time (Tamames 2001):
GOC 5
2
at :
11e
ðmodel 0Þ
I previously proposed a square-root dependence
(Rocha 2003a):
GOC 5 1 pffiffiffiffiffi
at:
ðmodel 1Þ
Although this equation fitted best the data on cproteobacteria, it has the obvious problem that for large
enough t GOC will take negative values, which it cannot.
A reciprocal transformation has also been proposed (Suyama
and Bork 2001):
1
5 at 1 1:
ðmodel 2Þ
GOC
All these models were purely empirical, that is, no
mechanistic model was proposed to explain their fit. Because model 2 fits well the data, I tried to infer the dynamic
behind it. In fact, 1=GOC5at11 arises from solving the
following differential equation:
dGOC
2
5 aGOC :
dt
This is equivalent to consider a decrease of GOC with
time that is negatively proportional to the square of the GOC
at a given moment. A different formulation of the model is:
GOC 5
1
:
1 1 at
I also derived a probabilistic model, inspired from
Wolf et al. (2001), that allows a deeper insight on the mechanisms leading to genome rearrangements. A genome is an
ordered circular (or linear) sequence of genes that can
change by rearrangement (fig. 1). If the probability of
two contiguous genes staying together in consecutive generations is p, and is constant throughout the genome and
through time, then the likeliness of two genes staying contiguous after t generations is given by P 5 pt. Assuming that
the probability of splitting contiguous genes is the same for
all genes, GOC is an estimator of P. Hence model 3 is:
GOC 5 pt :
ðmodel 3Þ
Because some gene pairs are under stronger selection
than others, it would be better to consider a model where the
gene pairs are divided into two classes of n1 fast- and n2
slow-rearranging pairs (pf and ps, n1 1 n2 5 N) (fig. 1),
then GOC can be modeled by:
t
GOC 5 Pf 1 Ps 5
t
ðn1 pf 1 n2 ps Þ
:
N
ðmodel 4Þ
FIG. 1.—Schematic view of the rationale of the analysis. A rearrangement takes place between two elements in the genomes (e.g., between two
repeats). Such rearrangement leads to two gene order changes (A). Rearrangements can separate genes strongly linked (e.g., because of selective
constraints), which are separated with low probability ps, or weakly linked,
which can be separated with higher probability pf (B).
A nonlinear regression of GOC with time allows the
determination of pi, the probability of two consecutive
genes of class i not being separated in a given unit of time.
Results and Discussion
The Change of GOC Through Time
I defined GOC between two genomes as the frequency
of pairs of orthologous genes that are contiguous in both
genomes. Hence, GOC varies between one (complete conservation of gene order) and zero (no conservation). High
values of GOC can be obtained if the genomes are very stable or if they have diverged very recently. As a result, if one
is interested in understanding the patterns of genome stability, one must start by modeling the time dependence of
GOC. The loss of GOC with time, that is, the way GOC
decays when the comparisons are made between increasingly divergent genomes, has been modeled in several ways
(see Methods and Data for the precise description of the
models). An early analysis suggested that GOC decreased
with time following a sigmoid transformation (model 0),
which would reflect a cooperative loss of gene order with
time (Tamames 2001). There is little evidence supporting
that claim from this much larger data set of 126 complete
genomes from six clades and when the divergence of the
16S rRNA subunit is used as a proxy for time (fig. 2). I have
previously suggested that in c-proteobacteria GOC decreased with the square root of time (model 1) (Rocha
2003a). Although this empirical transformation fits well
516 Rocha
FIG. 2.—GOC between pairs of genomes in function of their evolutionary distance according to the 16S rRNA subunit.
pffi Four models are fitted
by nonlinear regression. Model 1: GOC51 a t; model 2: 1/GOC 5
at 1 1, model 3: GOC 5 pt, and model 4: GOC5ptf 1pts where a#s and
p#s are the estimated parameters and t is the evolutionary distance.
the data on close comparisons, it performs poorly when distant ones are included (fig. 2). Finally, it has also been proposed that 1/GOC varies linearly with time (model 2)
(Suyama and Bork 2001). This model is better than the previous models, especially for more distant comparisons
(akaike information criterion [AIC], P , 0.001).
A probabilistic-based model accounts for the probability of contiguous gene pairs being conserved per unit of time
(p), and leads to GOC5pt (model 3). This model assumes
that all pairs of contiguous genes have the same probability
of being separated per unit of time. This assumption is unreasonable because the linkage of pairs of contiguous genes
is most certainly under varying selective pressures. As a result of this limitation, the regression of GOC with time using
this model performs poorly for the most divergent comparisons (fig. 2). If, however, one allows for two equally sized
classes of contiguous pairs of genes, one obtains a model
(model 4) GOC5ðptf 1pts Þ=2; which fits very well the data,
explaining 73% of the variance (P , 0.001). One class of
pairs of contiguous genes corresponds to slow-separating
pairs (ps 5 0.1246) and the other to fast-separating pairs
(pf5 2.8 3 1010, significantly different from 0, P ,
0.001, F test) (fig. 2). The fit of this model is the best
(P 0.001, AIC tests) and immediately suggests that selection on the conservation of operons could be a major
force behind GOC. Operons are on average slightly more
than three genes long and include two-thirds of the genome
(Zheng et al. 2002). Hence, roughly half of the contiguous
genes belong to the same operon and half to different operons. This is consistent to the hypothesis that half of the pairs
separate fast and half separate slowly.
In spite of their intrinsic differences, both model 2 and
model 4 fit reasonably well the data. Model 4 assumes the
existence of two classes of pairs of contiguous orthologues,
one splitting apart faster than the other, whereas model 2
assumes a continuum of increasing selective pressures on
FIG. 3.—Loss of GOC when comparing Bacillus subtilis with other
Firmicutes and Escherichia coli with other c-proteobacteria. Two different
analysis are indicated in each plot, one representing GOC for pairs of genes
that are in the same operon in B. subtilis or E. coli (GOCop, points) other for
pairs of contiguous genes that are in contiguous operons (GOCbetop,
circles). The regression lines were computed within each of the groups
using as models GOC5ptop (points) and GOCbetop 5ptbetop (lines) and
are represented as full lines. For comparison, are plotted the lines for
the two groups when one regresses the bivariate model 4 on the GOC variable (dashed line) (as in fig. 3). The close fit between the two types
of regressions suggests that genes separating fast (corresponding to the
component pf in model 4) are mostly associated with contiguous genes
in contiguous operons, whereas genes separating slowly (corresponding
to the component ps in model 4) are mostly associated with contiguous
genes within the same operon.
GOC between pairs of contiguous genes (see Methods
and Data). Model 4 fits better the data and one would expect the two classes of pairs of orthologues to concern pairs
within operons (ps) versus pairs between contiguous operons (pf). To test this hypothesis, I identified operons in B.
subtilis and E. coli and computed separately the change of
GOC with time for contiguous genes in the same operon
and for contiguous genes in contiguous operons. Not only
GOC is much more conserved for genes in operons but the
results are also quantitatively close to the ones expected by
the previous fit of model 4 (fig. 3). Hence, the selection acting upon operons is the major force acting against the disruption of gene order and is responsible for the particularly
low level of rearrangements for pairs of orthologous genes
The Evolution of Bacterial Genome Stability 517
within an operon. If one calibrates the evolutionary distance
using 100 Myr for the Escherichia/Salmonella divergence
(Ochman and Wilson 1987), one finds that bacteria separated by over 500 Myr still have conserved the contiguity
of ;80% of the within-operon gene pairs.
The above mentioned pf estimation of 2.8 3 1010 may
seem very small and suggests a very high rate of rearrangements for these pairs. However, one should bear in mind that
experimental data suggest that in the absence of selection,
rearrangement rates can be on the order of 104/generation
among rDNA operons of E. coli K12 (Hill and Harnish
1981). This rate is an underestimation of the real rearrangement rate per generation per genome because rDNA elements are not the only repeated elements that are sources
of inversions and because strongly deleterious rearrangements are probably missed in the analysis (e.g., if they
are lethal). Assuming 100 Myr for the separation between
Escherichia and Salmonella and 200 generations/year for
these bacteria, then pf 5 2.8 3 1010 translates in a genomic
frequency of rearrangement per generation of 107, which is
still much less than the conservative frequency of 104 observed in the laboratory. This strongly suggests that even
among this class, the vast majority of rearrangements are
deleterious. Given the large effective sizes of bacterial populations, slightly deleterious effects are probably enough to
purge out such rearrangements from natural populations.
Relative Stability of Bacterial Genomes
Because model 4 is the one presenting the best fit to the
data, it was the one used throughout the rest of the work.
Each genome participates in a subset of the pairwise comparisons used to make the regression analysis. The stability
of each genome is then defined as the average of the residuals resulting from the nonlinear regression of model 4 for
the comparisons where the genome participates. Hence, the
stability of a genome, for example, B. subtilis, is the average
of the residuals for all pairwise comparisons where B. subtilis participates. The stability of each genome is then the
average deviation of the observed values of GOC in the pairwise comparisons from model 4 to the expected values given
a certain phylogenetic distance (Supplementary Table 1,
Supplementary Material online). Naturally, the estimates of
stability are more precise for the groups containing the larger
number of genomes (i.e., Firmicutes and c-proteobacteria).
The stability thus calculated matches previous reports for
simple pairwise comparisons between genomes. This is
the case for genomes that are less stable such as the ones
of Streptomyces (Volff and Altenbuchner 1998), Bacillus
cereus (Kolsto 1997), Wolbachia (Wu et al. 2004), or Neisseria meningitidis (Dempsey, Wallace, and Cannon 1995).
Some other particularly unstable genomes within each clade
include Streptococcus pneumoniae, Enterococcus faecalis,
Synechocystis, and Thermosynechocystis. On the other extreme, this analysis fits previous observations that the genomes of c-proteobacteria obligatory endomutualists such
as Buchnera and Candidatus are extremely stable. Also
stable, but less so, are the genomes of some obligatory
intracellular pathogens, for example, Tropheryma and
Mycobacterium leprae, although other intracellular obligatory parasites such as rickettsia are not particularly stable.
As expected, closely related strains or species tend to
have similar stability indices. This is because, contrary to
most analyses, this one takes into account a larger amount
of comparisons at deeper evolutionary distances. Hence,
very recent changes in stability, such as the high number
of rearrangements in Shigella strains within E. coli (Jin
et al. 2002), have little weight in the overall analysis. To
test if the more recent evolutionary history is associated
with a change in stability, I computed two separate measures of stability, one including only the 50% closer comparisons and another including only the 50% most distant
comparisons. I then tested the difference between the
two values and selected the genomes for which recent
and older comparisons are significantly different (P ,
0.05, t-tests with sequential Bonferroni correction for multiple tests within the clades). Naturally, the significance of
differences also depends on the number of genomes of each
group, and it is not surprising that the differences are not
significant among the less sampled clades. Yet, this analysis
points to significant differences in some genomes among
the most represented groups. Notably, it confirms that
obligatory endomutualists of c-proteobacteria have become
more stable recently (Buchnera, Wigglesworthia, and
Blochmania). Inversely, instability is recent among Actinetobacter, S. pneumoniae, Legionella pneumophila, Pseudomonas, Xanthomonas axonopodis, Wolbachia, Brucella
melitensis, and E. faecalis. It is remarkable that many of
the latter bacteria are thought to be recent pathogens. When
sufficient genomes become available, this method will allow following the evolutionary change of stability through
time, by comparing a genome with increasingly proximal
ones. Using the current data set, such differences are already visible among c-proteobacteria and Firmicutes.
Genomes may be related by common selective constraints acting upon them. However, genomes also share
an evolutionary history. Hence, closely related genomes
are expected to be similar both because they tend to have
more of such selective constraints in common and because
they shared a common history until a more recent period of
time. To distinguish selective explanations for the stability
of genomes from such phylogenetic inertia (see Association
of Stability with Lifestyle), one can use the comparative
method (Felsenstein 1985). For this I first tried to understand the evolution of the stability through time taking into
account the 16S rDNA phylogeny. Using CONTINUOUS
(Pagel 1999), I checked that this variable evolved according
to a Brownian model (k 5 0.99, not significantly different
from 1, P . 0.05). Hence, one can use standard methods to
infer the ancestral states for the stability of the genomes.
This was done using COMPARE (Martins and Hansen
1997) and allowed tracking the changes in stability along
the evolutionary history of these bacteria (fig. 4).
The results displayed in figure 4 suggest that some
clades are more stable than others. This could be confirmed
by pooling the stability of genomes for each clade (fig. 5,
Kruskal-Wallis test, P , 0.001). To check which pairs of
clades are different, a Tukey-Kramer HSD test was used
which revealed the existence of three groups (P , 0.01).
The most stable clade is composed of c-proteobacteria
partly, but not entirely, due to the presence of all obligatory endomutualists in this group. Then follows the group
518 Rocha
FIG. 4.—Unrooted phylogenetic tree of the species analyzed in this study. The numbers correspond to the relative stability of ancestral states (multiplied by 100), inferred with COMPARE (Martins and Hansen 1997), using the linear model (corresponding to Brownian evolution).
including the other proteobacteria, the Firmicutes, and the
Actinobacteria. Cyanobacteria is the group containing the
least stable genomes. There are a couple of observations
that reinforce this classification: cyanobacteria tend to have
little, if any, replication-associated biases (Rocha 2004a),
and site-specific rearrangement systems are known to operate in several of these genomes (Carrasco and Golden
1995). This result, together with the analysis of the ancestral
states, strongly suggests that rearrangements are fixed at
very different rates in different clades.
FIG. 5.—The variation of genome stability according to the phylogenetic classification of bacteria. The dashed lines separate the box plots of
the groups with significantly different means (P , 0.05, Tukey-Kramer
HSD test).
Association of Stability with Lifestyle
Lifestyle has often been associated with genome stability, with endomutualists being regarded as stable (Tamas
et al. 2002) and pathogens as unstable (Hughes 2000;
Hacker, Hentschel, and Dobrindt 2003). This includes extracellular recent pathogens such as Y. pestis and Bordetella
(Deng et al. 2002; Parkhill et al. 2003; Chain et al. 2004). I
tried to test these associations by dividing the genomes into
free-living bacteria (including facultative commensals),
facultative pathogens, obligatory pathogens, and obligatory
endomutualists (fig. 6). This analysis showed significant differences between the groups (P , 0.001, Kruskal-Wallis),
and multiple comparisons indicate that endomutualists
form a significantly more stable group, whereas free-living
bacteria are significantly less stable (P , 0.01, TukeyKramer HSD). Pathogenic bacteria have an average stability. Endomutualists and obligatory pathogens have small
genomes and often have few repeats, which suggest an
inverse correlation between genome size and stability.
Yet, this correlation is very close to zero (Spearman q 5
0.02, P . 0.5), indicating that genome stability is independent of genome size. Hence, the stability of these
genomes is more likely to result from the lack of the
agents that more often cause recombination such as IS,
repeats, and homologous recombination.
The phylogenetic correlation between the species
complicates the analysis of the association of lifestyle with
genome stability. For example, all cyanobacteria are freeliving and it is here found that cyanobacteria are less stable.
So, one should assign the instability of this group to their
The Evolution of Bacterial Genome Stability 519
FIG. 6.—Genome stability classed in different groups according to
lifestyle. The dashed lines separate the box plots of the groups with significantly different means (P , 0.05, Tukey-Kramer HSD test).
lifestyle or to their clade? More importantly, the difference of
stability between species is dependent on their common
evolutionary history. This means that the association between two variables (e.g., lifestyle and stability) may be a
fortuitous consequence of phylogenetic inertia (Felsenstein
1985). To separate the hypothesis that the two variables are
intrinsically correlated or that their correlation is due to phylogenetic inertia we used the comparative method (Pagel
1999). One of the classes, obligatory endomutualists, is a
monophyletic clade of c-proteobacteria. Hence, in this case,
the test cannot be performed and one cannot decide between
the two hypotheses. Because both classes of pathogens
showed similar average stability, we pooled them together
and tested if this group is significantly more stable than
the free-living bacteria. Because stability is a continuous variable and lifestyle a categorical variable with two states,
CONTINUOUS was used to evaluate the robustness of this
association (Pagel 1999). The contrast analysis indicated
a much smaller and nonsignificant correlation between stability and lifestyle (r 5 0.04) than in the original data (r 5
0.33). This suggests that phylogenetic inertia may explain
the largest part of the effect of lifestyle on genome stability.
Hence, although it seems clear that pathogens are not less
stable than free-living bacteria, one cannot at the moment
exclude phylogenetic inertia as the reason for the inverse
association.
Conclusion
The definition of an index of genome stability allows
testing evolutionary hypothesis concerning the occurrence
and fixation of structural changes in the bacterial chromosome. When more close eukaryotic genomes become available, one may also use it to understand the patterns of
eukaryotic rearrangements and to what extent is gene order
under selection in these genomes. However, the analysis
will have to be adapted to tackle the much more frequent
events of gene and even genome duplication. The method
will also allow, when more data become available, to study
the change in stability through time in a lineage by using
different thresholds of phylogenetic distance when choosing the genomes from which to compute stability. Here, I
showed that one could identify some cases of recent increased genome stability and instability. Interestingly, bacteria that most recently became more stable are obligatory
endomutualists, and bacteria that have recently become less
stable are for the majority recent pathogens. Several of these
observations match previous independent analysis. A major
hurdle that remains in this analysis concerns the use of
a given phylogeny. This can be circumvented if the use
of the comparative method includes phylogenetic uncertainty (Huelsenbeck, Rannala, and Masly 2000). However,
any such study will still assume that the substitution rates
are the same in the 16S of the different lineages. If this is not
true, then deviations from the model can be attributed to
differences in genome stability or to differences in the evolutionary rate of the phylogenetic markers. The data on the
analysis of closely related genomes (e.g., among enterobacteria) suggest that the stability of genomes evolves faster
than the evolutionary rate of sequences. However, the effect
of deviations to the molecular clock should be taken into
account in future studies.
The first comparisons between distant genomes
revealed little conservation of gene order (Mushegian
and Koonin 1996; Huynen and Bork 1998; Itoh et al.
1999). This was surprising because decades of research
on the regulation of gene expression suggested that this
was under strong selection. Further, several experimental
works measured the impact of large rearrangements on bacterial fitness (Louarn et al. 1985; Hill and Gray 1988;
Campo et al. 2004). However, detailed comparisons of operon composition have shown that a significant number of
operons are partially conserved between the distantly related
E. coli and B. subtilis (Overbeek et al. 1999; de Daruvar,
Collado-Vides, and Valencia 2002). This brings to the fore
the problem I tried to tackle in this paper: the stability of
genomes must be interpreted after the comparisons have
been calibrated by the time since divergence. When this
is done and one compares genomes that have diverged over
500 MYA, one still observes a very significant conservation
of gene order. This long-range conservation seems to stem
essentially from the existence of operons and explains why
no similar conservation was found in the comparison of
distantly related yeasts (Dujon et al. 2004). However, even
pairs of genes in contiguous operons seem to be separated
much slower than expected given the rearrangement rates
observed in the laboratory. This may partly be because of
breakpoint reuse in large repeated elements, such as rDNA
operons, or if laboratory conditions, by some unknown reason, overestimate the rate of chromosome rearrangements.
However, supraoperonic gene order has been observed in
bacterial genomes, possibly to optimize the use of regulatory
regions (Korbel et al. 2004; Warren and ten Wolde 2004;
Hershberg, Yeger-Lotem, and Margalit 2005), and it has
been suggested that contiguity between operons can be as
much under selection as contiguity between genes within
an operon. Hence, it seems reasonable to suggest that the
GOC data reflects such selective constraints.
Previous analysis, with fewer data, suggested that gene
order is randomized (except for gene clusters with functional constraints such as operons) if the 16S rRNA distance
exceeds 10% (Huynen and Bork 1998). The current work
confirms that this is true on average, but that for unstable
520 Rocha
genomes (such as B. subtilis) the threshold is lower,
whereas for stable genomes (such as E. coli) the threshold
is significantly higher. Overall, contiguity persists even for
genomes diverging by as much as 0.5 substitutions/site in
the 16S rDNA sequence. The understanding of these differences may highlight a differential impact of recombination
mechanisms in different genomes, but also differences in
selective pressures, associated with either particular selective features or simply with the effectiveness of selection in
purging slightly deleterious rearrangements.
Bacterial genomes evolve mostly by point mutation,
recombination, horizontal gene transfer, and rearrangements. Hence, it is important to compare the relative rates
of rearrangements with the ones of other mechanisms shaping the genome. Escherichia and Salmonella orthologues
are separated by an average of one substitution per synonymous position and 0.1 per nonsynonymous position. This
makes ;1,000,000 substitutions since divergence among
the ;3,000 orthologues that resulted from a genomic mutation rate of ;103/generation. Estimation of recombination rates in E. coli suggest that recombination is responsible
for more than 10 times more changes than point mutations,
for a genomic rate of recombination of ;109/generation
(Guttman and Dykhuizen 1994). Over 1,200 genes in E. coli
K12 do not have an orthologue in S. enterica typhymurium.
Hence, horizontal transfer and deletions account for over 1
Mb of the differences in the current genomes (Lawrence and
Ochman 1997). Compared to these numbers the analysis of
genome rearrangements is striking. A genomic rearrangement rate of at least 104/generation in E. coli leads to an
observed number of rearrangements on the order of a dozen
between E. coli and S. enterica, depending on the strains that
are analyzed. Hence, purifying selection pressure acting on
rearrangements is several orders of magnitude stronger than
the one operating on nucleotide substitutions, recombination, or horizontal gene transfer. This indicates that in spite
of the massive horizontal gene transfer, the backbone of orthologous pairs in bacteria is highly conserved, especially
for pairs of genes in the same operon. Hence, the fraction
of rearrangement events that does not lead to a significant
loss of fitness must be exceedingly small.
The results of this work indicate that there is no simple
clear-cut association between genome stability and lifestyle
as both very unstable and very stable genomes are included
in each class. Although it has often been suggested that
pathogens are less stable because of adaptation to the host
(Hughes 1999; Hacker, Hentschel, and Dobrindt 2003), I
found that they have an intermediate stability. The exception may be some recent pathogens that were found to have
become less stable recently. Hence, instability may be more
related with a recent change in ecological niche than with
pathogenicity itself. Free-living bacteria are less stable and
endomutualists are more stable. If stability is the by-product
of the need to adapt and this is the reflex of the number of
different challenges imposed on the species, it is not surprising that free-living bacteria are the least stable. However, given the strong purifying selection acting on
rearrangements, the possibility of a positive effect of a rearrangement is small, at best. In fact, stability is likely to be
the result of conflicting forces. On one hand, rearrangements are expected to be deleterious, otherwise they would
be fixed more frequently. Hence, more efficient selection,
for example, due to larger effective population sizes, will
tend to increase the stability because most rearrangements
will be purged from the population. This may explain why
the bottleneck-prone recent pathogens have recently become less stable. On the other hand, the occurrence of
rearrangements is strongly dependent on substrates for
recombination events, such as repeats. This is why highly
repeated genomes are less stable (Rocha 2003a). The number of rearrangements is expected to be higher in genomes
that are larger, because they have more duplicated elements
(Achaz et al. 2002), or less sexually isolated, because these
genomes will be more frequently invaded by IS, phages,
and restriction and modification systems. However, larger
bacterial genomes are probably also associated with higher
effective population sizes because they correspond to
commensals and free-living bacteria. These compensating
forces might explain why there is no general correlation between genome size and stability or between lifestyle and
stability when phylogenetic inertia is taken into account.
Obligatory symbionts, either pathogens or mutualists, have
lower effective population sizes, but also fewer repeats, few
horizontal gene transfers, and they often miss genes associated with homologous recombination. They are thus more
stable not because of more efficient selection, but for lack of
elements capable of disrupting the genome.
Supplementary Material
Supplementary Table 1 is available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.
org/).
Acknowledgments
I am most grateful to the people with whom I have
discussed the issues presented in this article and especially
to Guillaume Achaz and Isabelle Goncxalves for comments
and criticisms on earlier versions of this text.
Literature Cited
Achaz, G., E. P. C. Rocha, P. Netter, and E. Coissac. 2002.
Origin and fate of repeats in bacteria. Nucleic Acids Res.
30:2987–2994.
Bentley, S. D., and J. Parkhill. 2004. Comparative genomic structure of prokaryotes. Annu. Rev. Genet. 38:771–791.
Bergthorsson, U., and H. Ochman. 1998. Distribution of chromosome length variation in natural isolates of Escherichia coli.
Mol. Biol. Evol. 15:6–16.
Campo, N., M. J. Dias, M. L. Daveran-Mingot, P. Ritzenthaler,
and P. Le Bourgeois. 2004. Chromosomal constraints in
Gram-positive bacteria revealed by artificial inversions. Mol.
Microbiol. 52:511–522.
Carrasco, C. D., and J. W. Golden. 1995. Two heterocyst-specific
DNA rearrangements of nif operons in Anabaena cylindrica
and Nostoc sp. strain Mac. Microbiology 141:2479–2487.
Chain, P. S., E. Carniel, F. W. Larimer et al. (23 co-authors). 2004.
Insights into the evolution of Yersinia pestis through wholegenome comparison with Yersinia pseudotuberculosis. Proc.
Natl. Acad. Sci USA 101:13826–13831.
Chandler, M. G., and R. H. Pritchard. 1975. The effect of gene
concentration and relative gene dosage on gene output in
Escherichia coli. Mol. Gen. Genet. 138:127–141.
The Evolution of Bacterial Genome Stability 521
Dale, C., B. Wang, N. Moran, and H. Ochman. 2003. Loss of
DNA recombinational repair enzymes in the initial stages of
genome degeneration. Mol. Biol. Evol. 20:1188–1194.
de Daruvar, A., J. Collado-Vides, and A. Valencia. 2002. Analysis
of the cellular functions of Escherichia coli operons and their
conservation in Bacillus subtilis. J. Mol. Evol. 55:211–221.
Dempsey, J. A., A. B. Wallace, and J. G. Cannon. 1995.
The physical map of the chromosome of a serogroup A strain
of Neisseria meningitidis shows complex rearrangements
relative to the chromosomes of the two mapped strains of
the closely related species N. gonorrhoeae. J. Bacteriol.
177:6390–6400.
Deng, W., V. Burland, G. Plunkett III et al. (20 co-authors). 2002.
Genome sequence of Yersinia pestis KIM. J. Bacteriol.
184:4601–4611.
Dujon, B., D. Sherman, G. Fischer et al. (107 co-authors). 2004.
Genome evolution in yeasts. Nature 430:35–44.
Eisen, J. A., J. F. Heidelberg, O. White, and S. L. Salzberg. 2000.
Evidence for symmetric chromosomal inversions around the
replication origin in bacteria. Genome Biol. 1:11.11–11.19.
Ermolaeva, M. D., H. G. Khalak, O. White, H. O. Smith, and
S. L. Salzberg. 2000. Prediction of transcription terminators
in bacterial genomes. J. Mol. Biol. 301:27–33.
Felsenstein, J. 1985. Phylogenies and the comparative method.
Am. Nat. 125:1–15.
Frank, A. C., H. Amiri, and S. G. Andersson. 2002. Genome deterioration: loss of repeated sequences and accumulation of
junk DNA. Genetica 115:1–12.
Gray, Y. H. M. 2000. It takes two transposons to tango. Trends
Genet. 16:461–468.
Guttman, D. S., and D. E. Dykhuizen. 1994. Clonal divergence in
Escherichia coli as a result of recombination, not mutation.
Science 266:1380–1383.
Hacker, J., U. Hentschel, and U. Dobrindt. 2003. Prokaryotic
chromosomes and disease. Science 301:790–793.
Hershberg, R., E. Yeger-Lotem, and H. Margalit. 2005. Chromosomal organization is shaped by the transcription regulatory
network. Trends Genet. 21:138–142.
Hill, C. W., and J. A. Gray. 1988. Effects of chromosomal inversion on cell fitness in Escherichia coli K-12. Genetics
119:771–778.
Hill, C. W., and B. Harnish. 1981. Inversions between ribosomal RNA genes of E. coli. Proc. Natl. Acad. Sci. USA
78:7069–7072.
Huelsenbeck, J. P., B. Rannala, and J. P. Masly. 2000. Accommodating phylogenetic uncertainty in evolutionary studies.
Science 288:2349–2350.
Hughes, D. 1999. Impact of homologous recombination on
genome organization and stability. Pp. 109–128 in R. L.
Charlebois, ed. Organization of the prokaryotic genome.
ASM Press, Wash.
———. 2000. Evaluating genome dynamics: the constraints on
rearrangementswithinbacterialgenomes.GenomeBiol.1:6.1–6.8.
Hurst, L. D., C. Pal, and M. J. Lercher. 2004. The evolutionary dynamics of eukaryotic gene order. Nat. Rev. Genet. 5:299–310.
Huynen, M. A., and P. Bork. 1998. Measuring genome evolution.
Proc. Natl. Acad. Sci. USA 95:5849–5856.
Itoh, T., K. Takemoto, H. Mori, and T. Gojobori. 1999. Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes. Mol. Biol. Evol.
16:332–346.
Jacob, F., and J. Monod. 1961. Genetic regulatory mechanisms in
the synthesis of proteins. J. Mol. Biol. 3:318–356.
Jin, Q., Z. Yuan, J. Xu et al. (34 co-authors). 2002. Genome sequence of Shigella flexneri 2a: insights into pathogenicity
through comparison with genomes of Escherichia coli K12
and O157. Nucleic Acids Res. 30:4432–4441.
Kalman, S., W. Mitchell, R. Marathe, C. Lammel, J. Fan, R. W.
Hyman, L. Olinger, J. Grimwood, R. W. Davis, and R. S.
Stephens. 1999. Comparative genomes of Chlamydia pneumoniae and C. trachomatis. Nat. Genet. 21:385–389.
Kolsto, A.-B. 1997. Dynamic bacterial genome organization. Mol.
Microbiol. 24:241–248.
Korbel, J. O., L. J. Jensen, C. von Mering, and P. Bork. 2004.
Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs.
Nat. Biotechnol. 22:911–917.
Lathe, W. C., B. Snel, and P. Bork. 2000. Gene context conservation of a higher order than operons. Trends Biochem. Sci.
25:474–479.
Lawrence, J. G. 2003. Gene organization: selection, selfishness,
and serendipity. Annu. Rev. Microbiol. 57:419–440.
Lawrence, J. G., R. W. Hendrix, and S. Casjens. 2001. Where are
the pseudogenes in bacterial genomes? Trends Microbiol.
9:535–540.
Lawrence, J. G., and H. Ochman. 1997. Amelioration of bacterial
genomes: rates of change and exchange. J. Mol. Evol. 44:
383–397.
Louarn, J. M., J. P. Bouche, F. Legendre, J. Louarn, and J. Patte.
1985. Characterization and properties of very large inversions
of the E. coli chromosome along the origin-to-terminus axis.
Mol. Gen. Genet. 201:467–476.
Martins, E. P., and T. F. Hansen. 1997. Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. Am.
Nat. 149:656–667.
McClelland, M., K. Sanderson, J. Spieth et al. (26 co-authors).
2001. Complete genome sequence of Salmonella enterica
serovar Typhimurium LT2. Nature 413:852–656.
Mira, A., H. Ochman, and N. A. Moran. 2001. Deletional bias and
the evolution of bacterial genomes. Trends Genet. 17:589–596.
Mushegian, A. R., and E. V. Koonin. 1996. Gene order is not conserved in bacterial evolution. Trends Genet. 12:289–290.
Nakagawa, I., K. Kurokawa, A. Yamashita et al. (13 co-authors).
2003. Genome sequence of an M3 strain of Streptococcus pyogenes reveals a large-scale genomic rearrangement in invasive
strains and new insights into phage evolution. Genome Res.
13:1042–1055.
Ochman, H., and I. B. Jones. 2000. Evolutionary dynamics of
full genome content in Escherichia coli. EMBO J. 19:6637–6643.
Ochman, H., and A. C. Wilson. 1987. Evolution in bacteria: evidence for a universal substitution rate in cellular genomes.
J. Mol. Evol. 26:74–86.
Overbeek, R., M. Fonstein, M. D’Souza, G. D. Pusch, and
N. Maltsev. 1999. The use of gene clusters to infer functional
coupling. Proc. Natl. Acad. Sci. USA 96:2896–2901.
Pagel, M. 1999. Inferring the historical patterns of biological evolution. Nature 401:877–884.
Parkhill, J., M. Sebaihia, A. Preston et al. (54 co-authors). 2003.
Comparative analysis of the genome sequences of Bordetella
pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nat. Genet. 35:32–40.
Rebollo, J. E., V. Franc
xois, and J. M. Louarn. 1988. Detection
and possible role of two large nondivisible zones on the
Escherichia coli chromosome. Proc. Natl. Acad. Sci. USA
85:9391–9395.
Rocha, E. P. C. 2003a. DNA repeats lead to the accelerated loss of
gene order in bacteria. Trends Genet. 19:600–604.
———. 2003b. An appraisal of the potential for illegitimate
recombination in bacterial genomes and its consequences:
from duplications to genome reduction. Genome Res.
13:1123–1132.
———. 2004a. The replication-related organisation of the bacterial chromosome. Microbiology 150:1609–1627.
522 Rocha
———. 2004b. Order and disorder in bacterial genomes. Curr.
Opin. Microbiol. 7:519–527.
Rocha, E. P. C., and A. Danchin. 2003. Essentiality, not expressiveness, drives gene strand bias in bacteria. Nat. Genet.
34:377–378.
Salgado, H., G. Moreno-Hagelsieb, T. F. Smith, and J. ColladoVides. 2000. Operons in Escherichia coli: genomic analyses
and predictions. Proc. Natl. Acad. Sci. USA 97:6652–6657.
Sankoff, D. 2003. Rearrangements and chromosomal evolution.
Curr. Opin. Genet. Dev. 13:583–587.
Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler.
2002. Tree-puzzle: maximum likelihood phylogenetic analysis
using quartets and parallel computing. Bioinformatics 18:
502–504.
Segall, A., M. J. Mahan, and J. R. Roth. 1988. Rearrangement of
the bacterial chromosome: forbidden inversions. Science
241:1314–1318.
Shigenobu, S., H. Watanabe, M. Hattori, Y. Sakaki, and
H. Ishikawa. 2000. Genome sequence of the endocellular
bacterial symbiont of aphids Buchnera sp. APS. Nature
407:81–86.
Silva, F. J., A. Latorre, and A. Moya. 2001. Genome size reduction
through multiple events of gene disintegration in Buchnera
APS. Trends Genet. 17:615–618.
———. 2003. Why are the genomes of endosymbiotic bacteria so
stable? Trends Genet. 19:176–180.
Suyama, M., and P. Bork. 2001. Evolution of prokaryotic gene
order: genome rearrangements in closely related species.
Trends Genet. 17:10–13.
Tamames, J. 2001. Evolution of gene order conservation in prokaryotes. Genome Biol. 2:0020.0021–0020.0011.
Tamas, I., L. Klasson, B. Canback, A. K. Naslund, A. S. Eriksson,
J. J. Wernegreen, J. P. Sandstrom, N. A. Moran, and S. G.
Andersson. 2002. 50 million years of genomic stasis in endosymbiotic bacteria. Science 296:2376–2379.
Tillier, E. R., and R. A. Collins. 2000. Genome rearrangement by
replication-directed translocation. Nat. Genet. 26:195–197.
Volff, J.-N., and J. Altenbuchner. 1998. Genetic instability of the
Streptomyces chromosome. Mol. Microbiol. 27:239–246.
Warren, P. B., and P. R. ten Wolde. 2004. Statistical analysis of the
spatial distribution of operons in the transcriptional regulation
network of Escherichia coli. J. Mol. Biol. 342:1379–1390.
Wolf, Y. I., I. B. Rogozin, N. V. Grishin, R. L. Tatusov, and E. V.
Koonin. 2001. Genome trees constructed using five different
approaches suggest new major bacterial clades. BMC Evol.
Biol. 1:8.
Wu, M., L. V. Sun, J. Vamathevan et al. (30 co-authors) 2004.
Phylogenomics of the reproductive parasite Wolbachia pipientis
wMel: a streamlined genome overrun by mobile genetic elements. PLoS Biol. 2:E69.
Zheng, Y., J. D. Szustakowski, L. Fortnow, R. J. Roberts, and S.
Kasif. 2002. Computational identification of operons in microbial genomes. Genome Res. 12:1221–1230.
Takashi Gojobori, Associate Editor
Accepted November 2, 2005