The Evolutionary Relationships Between the Two Bacteria Escherichia coli
and Haemophilus influenzae and their Putative Last Common Ancestor
Renaud de Rosa and Bernard Labedan
Institut de Génétique et Microbiologie, Université Paris-Sud, France
We have tried to approach the nature of the last common ancestor to Haemophilus influenzae and Escherichia coli
and to determine how each bacterium could have diverged from this putative organism. The approach used was
exhaustive analysis of the homologous proteins coded by genes present in these bacteria, using as criteria for
sequence relatedness an alignment of at least 80 amino acid residues and a PAM distance (number of accepted
point mutations per 100 residues separating two sequences) below 250. Evolutionarily significant similarities were
found between 1,345 H. influenzae proteins (85% of the total genome) and 3,058 E. coli proteins (75% of the total
genome), many of them belonging to families of various sizes (from 666 doublets to 35 large groups of more than
10 members). Nearly all the genes found by this approach to be duplicated in both bacteria were already duplicated
in their last common ancestor. This was deduced from (1) the comparison of the respective distributions of evolutionary distances between orthologs (genes separated only by speciation events) and paralogs (genes duplicated in
the same genome) and (2) the analysis of the phylogenetic trees reconstructed for each family of paralogs containing
at least two members belonging to each bacterium. The distributions of the different categories of homologs show
a significant loss of paralogous genes in H. influenzae (reduction proportional to the genome size), of many sequences which are still present in one copy in E. coli, and of some entire gene families. Phylogenetic trees also
confirmed this recent loss of paralogous genes in H. influenzae. Thus, the genome size of the last common ancestor
of these two bacteria would have been close to that of present-day E. coli, and the evolution of H. influenzae toward
a parasitic life led to a substantial decrease in its genome size by some mechanism of streamlining. During this
recent evolution, the memory of the gene order present in the last common ancestor has been blurred, but a few
short conserved chromosomal fragments can still be detected in present-day E. coli and H. influenzae.
Introduction
According to the evolutionary relationships established in the 16S rRNA tree (for a recent version, see
Olsen, Woese, and Overbeek 1994), Escherichia coli
and Haemophilus influenzae belong to the gamma subdivision of purple bacteria. Inside this subdivision, they
appear to be closely related, with H. influenzae being
the closest sister group to the enterobacteria cluster.
However, these two bacteria differ in many important
features. For example, their genome sizes vary from 4.7
Mb for E. coli to only 1.8 Mb for H. influenzae. Moreover, their ways of life appear quite different: although
it can behave as a pathogen, E. coli is a versatile free-living bacterium, able to adapt to many environmental
conditions (for a general overview, see Neidhardt et al.
1996), while H. influenzae, an obligate parasite of the human respiratory tract which can invade the blood and
the central nervous system, is dependent on very specialized growing conditions (Musser et al. 1990; Moxon
1992). This suggests that these two bacteria followed
different paths in their recent evolutionary history after
having diverged from their last common ancestor. (In this paper, “last common ancestor” will always refer to
the last common ancestor of E. coli and H. influenzae.)
Abbreviations: DARWIN, Data Analysis and Retrieval With Indexed Nucleotide/Peptide Sequences; Eco, Escherichia coli; Hin, Haemophilus influenzae; ORF, open reading frame; PAM, number of accepted point mutations per 100 residues separating two protein sequences.
Key words: comparative genomics, bacterial evolution, paralogous proteins, gene families, rearrangements of bacterial chromosomes,
gene duplication, Escherichia coli, Haemophilus influenzae.
Address for correspondence and reprints: Bernard Labedan, Institut de Génétique et Microbiologie, Université Paris-Sud, Bâtiment 409,
91405 Orsay Cedex, France. E-mail: [email protected].
Mol. Biol. Evol. 15(1):17–27. 1998
© 1998 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
To solve some aspects of this recent history, we
have tried to reconstruct the putative nature of this common ancestor and to determine how each bacterium
could have diverged from it. This has been attempted
by analyzing the different classes of genes encoded by
each genome using the following rationale. It has already been shown that E. coli contains an important proportion of paralogous genes and that many of these paralogous genes group into families of various sizes (Labedan and Riley 1995a, 1995b; Riley and Labedan
1997). Paralogous genes (Fitch 1970) have been defined
as copies issued from a duplication of an ancestral gene,
each copy having diverged before any speciation event.
A preliminary study (Brenner et al. 1995) has suggested
that a significant proportion of H. influenzae genes probably descend from duplications. Many open reading
frames (ORFs) found by systematic sequencing
(Fleischman et al. 1995) of the H. influenzae genome
code for proteins which are similar in sequence to E.
coli proteins, and the corresponding functions were assigned on the basis of this similarity (Fleischman et al.
1995; Tatusov et al. 1996). These studies suggested that
many H. influenzae genes are homologous to E. coli
ones. However, many assignments of homology have
been based on similarities limited to small motifs or
signatures. Such an approach is very helpful when trying to maximize the number of functional assignments (Tatusov et al. 1996) but is poorly suited to a consistent evolutionary study. Since the evolution of complete genomes involves large-scale (that is, at least the size of
a gene) chromosomal rearrangements, comparison of
long segments of homology is especially crucial to determine the timing of duplication events versus speciation and thus to reconstruct the nature of the putative
last common ancestor of E. coli and H. influenzae.
Therefore, we have undertaken an exhaustive analysis,
at the level of protein sequence, of the whole sets of H.
influenzae and E. coli genes. Taking into account only
long segments of homology allowed us to look at the
evolutionary behavior of whole genes. Accordingly, we
counted the respective proportions of each class of proteins (paralogs, orthologs, and unique), and compared
the families of proteins and their members inside each
genome. Here, we define a protein family as any group
of proteins encoded by genes which derive from the
same ancestral gene. Such a definition extends from a
pair of orthologs (doublets) to large clusters of paralogs.
Finally, we compared the genomic maps of these evolutionarily related bacteria in order to detect the vestiges
of an ancestral gene order by looking at conserved chromosomal fragments larger than a gene. The data reported in this paper support the hypothesis that E. coli and
H. influenzae descend from a common ancestor which
had a genome size and composition close to that of present-day E. coli.
Materials and Methods
Sequences
The whole sets of E. coli (version sent to GenBank
in January 1997, accession number U00096, further
completed with a few unpublished missing sequences
later obtained from the Blattner group through Monica
Riley) and H. influenzae (Fleischman et al. 1995) proteins have been harvested with the exception of the nonchromosomal genes (e.g., insertion sequences). To study
only significantly long segments of homology, we further discarded all of the proteins displaying a length
shorter than 80 amino acids. After these two steps, we
had a data set made of 1,574 H. influenzae (Hin) and
4,061 E. coli (Eco) protein sequences.
Finding Extended Similarities Between Protein
Sequences of H. influenzae and E. coli
The so-called DARWIN (Data Analysis and Retrieval With Indexed Nucleotide/Peptide Sequences)
program (Gonnet, Cohen and Benner 1992) is an interactive and programmable system which allows one to
search for all matches between proteins of a database
using a complete Smith-Waterman-type dynamic programming algorithm (Smith and Waterman 1981). In
this DARWIN system, the sequences are organized as
sets of evolutionarily connected components which are
characterized by an evolutionary distance measured in
PAM units (Dayhoff, Schwartz, and Orcutt 1978). This
PAM distance—the number of accepted point mutations
per 100 residues separating two sequences—is based on
(1) a mutation data matrix normalized to a distance of
250 PAM units and recomputed for each new set of
sequences and (2) a gap penalty which is itself dependent on the PAM distance intrinsic to the set of sequences studied (see Gonnet, Cohen, and Benner 1992; Benner, Cohen, and Gonnet 1993 for additional details of
the method). This program, which has been judged to
be one of the best performers among sequence comparison programs (Johnson and Overington 1993; Vogt, Etzold, and Argos 1995) was obtained from the Institut
für Wissenschaftliches Rechnen (ETH Zürich, Switzerland) and implemented on a DEC Alpha station. After
slight adaptation of several existing DARWIN procedures and addition of several new ones created for our
purpose (available on request by electronic mail), we
were able to detect, in one step, all the matches between
the 1,574 Hin and 4,061 Eco proteins. To limit our
search to the evolutionarily significant matches, we imposed the two following cutoffs: only the pairs corresponding to an alignment of at least 80 residues and
separated by less than 250 PAM units were kept for
further analysis. The rationale for these two cutoffs is
essentially based on the finding that a PAM250 substitution matrix is the most efficient scoring matrix when
applied to distantly related protein pairs for a minimum
significant length of 83 residues (Altschul 1991).
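The two cutoffs described above amount to a simple filter over pairwise matches. The following is a minimal sketch in Python, not the DARWIN procedures used in this work; the `Match` record and its field names are hypothetical, and only the two thresholds (at least 80 aligned residues, fewer than 250 PAM units) come from the text.

```python
from dataclasses import dataclass

MIN_ALIGNED_RESIDUES = 80   # length cutoff stated in the text
MAX_PAM_DISTANCE = 250.0    # PAM cutoff stated in the text (strictly below)


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one pairwise alignment."""
    seq_a: str
    seq_b: str
    aligned_length: int   # number of aligned residues
    pam: float            # estimated PAM distance


def is_significant(match: Match) -> bool:
    """Apply both cutoffs: long enough alignment and close enough in PAM units."""
    return (match.aligned_length >= MIN_ALIGNED_RESIDUES
            and match.pam < MAX_PAM_DISTANCE)


def keep_significant(matches):
    """Return only the matches retained for further analysis."""
    return [m for m in matches if is_significant(m)]
```

A match of exactly 80 residues is kept, whereas a match at exactly 250 PAM units is rejected, following the wording "at least 80 residues" and "less than 250 PAM units".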
Analysis of the Protein Pairs and Families
To analyze all of the Eco/Eco, Hin/Hin, and Eco/
Hin protein pairs extracted from the DARWIN outputs,
a database was built using the relational database application Claris FilemakerPro 3.03 for Macintosh. A program written in Caml light language was used to automatically gather into one family all sequences that were
related by a chain of similarities, collecting all relatives
of both members of each pair until no further pairwise
relationships were found (unpublished work of R.D.R.,
program available on request). (Information about the
family of Caml languages is available at the Internet
address http://pauillac.inria.fr/caml/index-eng.html.)
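Gathering sequences related by a chain of similarities is equivalent to finding the connected components of a graph whose edges are the significant pairs. The sketch below uses a union-find structure in Python; it is an illustration under that assumption and not the Caml light program mentioned above, and the function name `build_families` is hypothetical.

```python
from collections import defaultdict


def build_families(pairs):
    """Group sequence identifiers into families by single linkage.

    pairs: iterable of (id_a, id_b) tuples, one per significant match.
    Returns a list of sorted member lists, largest family first.
    """
    parent = {}

    def find(x):
        # Locate the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two families

    families = defaultdict(set)
    for x in list(parent):
        families[find(x)].add(x)
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))
```

With this definition, two sequences end up in the same family even if they do not match each other directly, as long as a chain of significant matches connects them.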
The respective distributions, as a function of their
PAM distance, of the paralogs (protein pairs Eco/Eco
and Hin/Hin) and of the orthologs (pairs Eco/Hin) have
been weighted by the size of their respective families as
follows. Since the number of pairs N(N − 1)/2 increases
faster than the number of sequences N, a weighting factor of 1/(N − 1) was used for each family. This weighting, which allows each family to be represented proportionally to its N, was computed by applying a program written in Pascal language (unpublished data).
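The effect of the weighting can be checked directly: a family of N members contributes N(N − 1)/2 pairs, each weighted by 1/(N − 1), for a total of N/2. The Python sketch below illustrates this; it is not the Pascal program mentioned above, and the bin width of 10 PAM units and all function names are assumptions made for the example.

```python
from collections import Counter


def family_weight(n_members: int) -> float:
    """Weight 1/(N - 1) applied to every pair of a family of N members."""
    if n_members < 2:
        raise ValueError("a family needs at least two members")
    return 1.0 / (n_members - 1)


def weighted_histogram(pairs, family_size, bin_width=10):
    """Accumulate weighted pair counts per PAM class.

    pairs: iterable of (family_id, pam_distance).
    family_size: dict mapping family_id to its number of members N.
    bin_width: class width in PAM units (assumed value, not from the text).
    Returns a dict mapping the lower bound of each class to its weighted count.
    """
    hist = Counter()
    for family_id, pam in pairs:
        lower_bound = int(pam // bin_width) * bin_width
        hist[lower_bound] += family_weight(family_size[family_id])
    return dict(hist)
```

For a doublet the weight is 1, so its single pair counts fully; for a family of 11 members each of the 55 pairs counts for 0.1.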
Making Phylogenetic Trees for the Families of
Proteins
For each family containing at least two members
belonging to each bacterium, a multiple alignment and
an unrooted tree were established using two functions
(MulAlignment and PhyloTree) which are also part of
the DARWIN program (available on the CBRG WWW
server of the ETH at Zürich). The trees obtained using
PhyloTree are based on the estimated PAM distances
between each pair of sequences, and the deduced evolutionary distance between each node is weighted by
computing the variance of the respective distance.
Therefore, these distance trees appear to be approximations of maximum-likelihood trees (see Gonnet, Cohen, and Benner [1992] for additional details of the
method and the booklet available at the Internet address
http://cbrg.inf.ethz.ch/ServerBooklet, especially the subsection 2_3_5_1).
Analysis of Gene Positions
Eight hundred twenty-nine pairs of orthologs with
gene positions in base pairs available in the GenBank
(accession number U00096) and TIGR databases
(Fleischman et al. 1995), respectively, were used for this
analysis. We considered as orthologs pairs of genes from
the two species which fit into either one of the following
categories: genes that had been given the same name in
the two species (assuming that the functional assignments in the databases were correct), genes from the two
species belonging to two-member families, or genes belonging to larger families and being closer (in PAM distance) to each other than to any other member of the
family. Positions taken as value 0 correspond to gene
thrA in E. coli (Berlyn, Brooks Low, and Rudd 1996),
and to that of the unique Not I restriction site in H.
influenzae (Fleischman et al. 1995). The distance between genes was computed by taking into account the
fact that the chromosomes are circular, that is, the top
and the bottom and the left and the right of this rectangle
join to make a torus. The theoretical distribution was
computed as follows: assuming the 829 genes had the
same position on the E. coli chromosome, we computed
the distribution of the angles as if each had 10 orthologs
uniformly scattered on the H. influenzae chromosome,
located at 1/10, 2/10, . . . , 10/10 of the length of the
chromosome. This distribution was divided by 10 for
scaling reasons.
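Because both chromosomes are circular, the distance between two positions is the shorter of the two arcs joining them. A minimal Python sketch follows; the function name is hypothetical, and the genome lengths used in the usage note are the rounded sizes quoted in the Introduction (4.7 Mb and 1.8 Mb), not exact sequence lengths.

```python
def circular_distance(pos_a: int, pos_b: int, genome_length: int) -> int:
    """Shortest distance in base pairs between two positions on a circular map."""
    direct = abs(pos_a - pos_b) % genome_length
    return min(direct, genome_length - direct)
```

For example, on a 4,700,000-bp circle, positions 100 and 4,699,900 are 200 bp apart, because the path through position 0 is shorter than the direct one.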
To further detect conserved gene clusters, we
sought consecutive pairs of orthologs, i.e., two genes
present in the same order and at the same distance in
the two genomes. We only considered the pairs which
had no more than two ORFs between each pair of orthologs. Starting from these, we looked at the genes upstream and downstream in order to find longer conserved clusters.
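The first step of this search can be sketched as follows in Python. This is a simplified illustration, not the procedure actually used: it works on gene ranks rather than base-pair positions, ignores strand and the wrap-around of the circular map, and all names (`conserved_neighbor_pairs`, the gene identifiers in the usage note) are hypothetical.

```python
def conserved_neighbor_pairs(eco_order, hin_order, orthologs, max_intervening=2):
    """Find consecutive ortholog pairs that are close together in both genomes.

    eco_order, hin_order: gene identifiers in chromosomal order.
    orthologs: dict mapping an E. coli gene to its H. influenzae ortholog.
    max_intervening: maximum number of ORFs allowed between the two genes
        of a pair (two, as stated in the text).
    Returns a list of (first_eco_gene, second_eco_gene) tuples.
    """
    eco_index = {gene: i for i, gene in enumerate(eco_order)}
    hin_index = {gene: i for i, gene in enumerate(hin_order)}

    # E. coli genes with a mapped ortholog, sorted along the E. coli map.
    eco_genes = sorted(
        (g for g in orthologs if g in eco_index and orthologs[g] in hin_index),
        key=eco_index.__getitem__,
    )

    max_gap = max_intervening + 1   # rank difference allowed between partners
    found = []
    for first, second in zip(eco_genes, eco_genes[1:]):
        eco_gap = eco_index[second] - eco_index[first]
        hin_gap = abs(hin_index[orthologs[second]] - hin_index[orthologs[first]])
        if eco_gap <= max_gap and 1 <= hin_gap <= max_gap:
            found.append((first, second))
    return found
```

Longer conserved clusters can then be assembled by chaining pairs that share a gene, which corresponds to looking upstream and downstream of each initial pair.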
Results
Finding Significant Similarities Between the Proteins
Encoded by H. influenzae and E. coli
We adapted the DARWIN program (Gonnet, Cohen, and Benner 1992) to build a mixed set of procedures, some already existing and others we created, in
order to mimic the AllAllDB program. This set allowed
us to collect, in one step, all the significant matches
between a defined set of Hin and Eco proteins. Significant matches are defined as matches displaying an
alignment length greater than 80 residues and a PAM
distance below 250. Indeed, a length cutoff seemed essential to get evolutionarily consistent data: by taking
into account only similarities extending along a large
part of each protein, if not the whole protein, we would
be able to trace with confidence the corresponding gene
duplication events. Accordingly, the defined set corresponds to a total of 5,635 proteins which are longer than
80 amino acids, 1,574 belonging to H. influenzae and
4,061 to E. coli.
The exhaustive search for all the matches separated
by less than 250 PAM units and at least 80 residues
long gave 19,896 matches corresponding to 3,058 E.
coli and 1,345 H. influenzae protein sequences. This implied that many sequences belong to protein families. A
program (see Materials and Methods) was applied to
assemble the proteins belonging to the same family. To
maintain a high level of consistency in our analysis of
gene evolution, we further considered only families displaying matches where the alignment length was longer
than 50% of the sequence length of each protein. This
corresponded to 9,599 matches made of 1,219 Hin and
2,537 Eco sequences. These 3,756 sequences assembled
in 1,015 families of various sizes: 666 doublets, 139
triplets, 175 small families from 4 to 10 members, and
35 large groups of more than 10 members.
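The size classes used here, and the internal consistency of the reported counts, can be expressed in a few lines of Python. This is only a bookkeeping sketch; the class labels and names are chosen for the example, while the numbers are those given in the text.

```python
from collections import Counter


def size_class(n_members: int) -> str:
    """Assign a family to one of the four size classes used in the text."""
    if n_members < 2:
        raise ValueError("a family needs at least two members")
    if n_members == 2:
        return "doublet"
    if n_members == 3:
        return "triplet"
    if n_members <= 10:
        return "small family (4-10)"
    return "large group (>10)"


def summarize(family_sizes):
    """Count families per size class from an iterable of family sizes."""
    return Counter(size_class(n) for n in family_sizes)


# Counts reported in the text.
REPORTED_FAMILIES = {
    "doublet": 666,
    "triplet": 139,
    "small family (4-10)": 175,
    "large group (>10)": 35,
}
REPORTED_HIN_SEQUENCES = 1219
REPORTED_ECO_SEQUENCES = 2537

total_families = sum(REPORTED_FAMILIES.values())                    # 1,015 families
total_sequences = REPORTED_HIN_SEQUENCES + REPORTED_ECO_SEQUENCES   # 3,756 sequences
```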
Correlation Between the Distribution of Evolutionary
Distances and the Nature of the Homologous Proteins
We first made a comparison of the respective distributions of evolutionary distances separating matching
proteins which are coded either by the same genome
(paralogs) or by the two genomes. The latter may correspond to either orthologs (homologous genes separated by speciation) or descendants of paralogs which have
further been separated by speciation. Members of this
last category have been called metalogs (from the Greek
meta = change) by several authors (Solignac et al.
1995). To make these distributions, we used the total set
of 19,896 matches, but each match was weighted according to the size of the family to which the matching
proteins belong (see Materials and Methods). Figure 1
shows that the category of paralogs sensu stricto made
of all the Eco/Eco and Hin/Hin pairs displays a very
broad peak centered at 160 PAM units, whereas the Eco/
Hin pairs are separated into two different classes, one
having a distribution similar to that of the paralogs and
the other one showing a narrower peak centered at 40
PAM units. We propose that the first class corresponds
to the so-called metalogs and the other class is made of
orthologs. The small plateau appearing at the intermediate PAM distance between 90 and 110 could correspond to the addition of the decreasing number of pairs
of orthologs and the increasing number of pairs of metalogs. Taken together, these data strongly suggest that
the large majority of gene duplications (paralogs sensu
lato, including metalogs) found in both bacteria must
have occurred before the speciation event leading to the
immediate ancestors of E. coli and H. influenzae.
The Loss of Proteins in H. influenzae Appears to be
Selective
Next, we compared the whole data set of 4,061 E.
coli and 1,574 H. influenzae sequences to determine the
distributions of each category of proteins in both organisms. The proteins encoded by each genome may be
separated into the following four categories: Sequences
either can be found in both genomes (Eco and Hin) or
may belong only to one genome (Eco or Hin). In this
last case, sequences either can be unique to this genome
or may be found as duplicated copies inside this genome. Table 1 summarizes the number of sequences corresponding to each category: (1) By definition, the same
number of unique orthologs, 546, is present in both bacteria. Consequently, their relative proportion is larger in
H. influenzae (34.7%) than it is in E. coli (13.4%). (2)
The number of paralogs present in families having at
least two representatives in one genome and one in the
other genome decreases from 1,950 (421 + 1,511 + 18)
in E. coli to 774 (141 + 591 + 42) in H. influenzae.
Thus, there has been a reduction of this class of paralogs
which is remarkably proportional to that of the genome
size in H. influenzae, since the relative percentages (49%
in H. influenzae and 48% in E. coli) remain very close.
(3) There are far more sequences, 1,565 (1,003 + 562),
which are unique to E. coli than are peculiar to H. influenzae (254 = 229 + 25). Therefore, it appears that
many E. coli sequences have no (more) homologs in H.
influenzae.
FIG. 1.—Weighted distribution of PAM distances separating pairs of proteins belonging to the same family. The number of pairs corresponding to each family was weighted by 1/(N − 1), where N is the number of sequences belonging to the family, in order for each family to
be represented proportionally to its N. White bars: PAM distance separating sequences from two different organisms (Eco/Hin) (orthologs or
metalogs). Black bars: PAM distance separating sequences from the same organism (Eco/Eco or Hin/Hin) (paralogs sensu stricto).
When looking at the respective functional categories of the sequences which are unique to H. influenzae,
we found that 69.7% (177) are proteins with unknown
functions, some of them making homogeneous families
(up to five members). Among the proteins whose functions have been attributed by sequence similarity, we
found a significant proportion of proteins which play a
role similar to E. coli proteins, such as 18 enzymes involved in metabolism, 6 ribosomal proteins, and 2 lipoproteins, but whose sequence similarity to E. coli proteins was too low to be recognized in our search of
significant similarities. This emphasizes that our strict
criteria of evolutionary consistency may lead to some
underestimation of the actual total number of homologous proteins, and we could not exclude the possibility
that these analogous proteins are actually very distant
orthologs. Therefore, it appears that there would be very
few proteins really specific to H. influenzae, such as two
competence factors (Williams, Bannister, and Redfield
1994; Zulty and Barcak 1995) or three restriction enzymes.
Table 1
Distribution of Proteins in Different Classes of Families
E. COLI (ECO)       H. INFLUENZAE (HIN) SEQUENCES IN EACH FAMILY
SEQUENCES IN
EACH FAMILY         None          One                   Several
None. . . . .       —             229 Hin               25 Hin
One. . . . . .      1,003 Eco     546 Eco, 546 Hin      18 Eco, 42 Hin
Several. . .        562 Eco       421 Eco, 141 Hin      1,511 Eco, 591 Hin
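The counts of Table 1 can be cross-checked against the totals and percentages quoted in the text. The Python sketch below only re-derives figures already given above; the dictionary layout and variable names are chosen for the example.

```python
# (E. coli class, H. influenzae class) -> (Eco sequences, Hin sequences),
# transcribed from Table 1.
TABLE1 = {
    ("none", "one"): (0, 229),
    ("none", "several"): (0, 25),
    ("one", "none"): (1003, 0),
    ("one", "one"): (546, 546),
    ("one", "several"): (18, 42),
    ("several", "none"): (562, 0),
    ("several", "one"): (421, 141),
    ("several", "several"): (1511, 591),
}

eco_total = sum(eco for eco, _ in TABLE1.values())
hin_total = sum(hin for _, hin in TABLE1.values())

# Sequences found in only one genome.
eco_only = sum(eco for (_, hin_cls), (eco, _) in TABLE1.items() if hin_cls == "none")
hin_only = sum(hin for (eco_cls, _), (_, hin) in TABLE1.items() if eco_cls == "none")

# Paralogs in families shared by both genomes with several members in at least one.
shared_cells = [("several", "one"), ("several", "several"), ("one", "several")]
eco_shared_paralogs = sum(TABLE1[cell][0] for cell in shared_cells)
hin_shared_paralogs = sum(TABLE1[cell][1] for cell in shared_cells)


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)
```

The totals recover the 4,061 E. coli and 1,574 H. influenzae sequences of the data set, and the shared paralogs represent about 48% and 49% of each genome, respectively.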
Analysis of Evolutionary Trees: Losses of Paralogs in
Large Families
To further determine where the main losses of
genes have occurred during the recent evolution of H.
influenzae, we reconstructed a phylogenetic tree for each
family of paralogs containing at least two members from
each genome.
An exhaustive analysis of the 137 corresponding
trees showed two main features which are illustrated in
the few examples shown in figures 2–4. Schematically,
there are two main classes of trees: (1) The small families generally display the same number of paralogs in
each organism. Figures 2 and 3 (family 932) show several examples of such families. When the number of
members was odd, this generally corresponded (93%) to
the presence of a supplementary paralog in E. coli, as
shown in figure 3 (family 889), where an unknown E. coli ORF belongs to this family of penicillin-binding
proteins. These small families frequently correspond to
enzymes (such as acyltransferases or periplasmic hydrolases in fig. 2 or lyases in fig. 3) essential to the metabolism of each bacterium, or to proteins important for
cell structure (such as proteins necessary to the integrity
of the cell wall; figs. 2 and 3). (2) On the contrary, in
large families, the number of paralogs present in H. influenzae was always strongly reduced when compared
to that of E. coli. For example, in the families shown in
figure 4, there are only three sensor proteins belonging
to the two-component signal transduction system (Hoch
and Silhavy 1995) in H. influenzae compared to 19 such
sensors present in E. coli, and there is only 1 transcriptional regulator in H. influenzae compared to 12 in E.
coli. Figure 5 further shows the distribution profile displaying the number of paralogs in E. coli versus that in
H. influenzae for each gene family containing at least
seven members. This distribution confirms that a significant part of the paralogous proteins specifically lost in
H. influenzae belong to large E. coli families. Note that
the loss is less massive in the largest displayed family
(ABC proteins).
FIG. 2.—A few examples of family trees containing two pairs of paralogs. The tree reconstruction method is described in the text. Branch
lengths are displayed in PAM units. For each family, the names of the proteins are given as their SwissProt (Bairoch and Apweiler 1996)
mnemonics. The corresponding SwissProt accession numbers are: PBP2_ECOLI: P08150; PBP2_HAEIN: P44469; PBP3_ECOLI: P04286;
PBP3_HAEIN: P45059; RODA_ECOLI: P15035; RODA_HAEIN: P44468; FTSW_ECOLI: P16457; FTSW_HAEIN: P45064; ODO2_ECOLI:
P07016; ODO2_HAEIN: P45302; ODP2_ECOLI: P06959; ODP2_HAEIN: P45118; CN16_ECOLI: P08331; CN16_HAEIN: P44764;
USHA_ECOLI: P07024; 5NTD_HAEIN: P44569.
Analysis of Evolutionary Trees: Dating of the
Duplications
The distribution of evolutionary distances (fig. 1)
already suggested that the duplication events which gave
rise to the paralogous genes took place before the separation of the two bacteria. This appears to be confirmed
by the tree topologies. Indeed, the copies present in each
species are always separated in the distal parts of the
trees (figs. 2 and 3). For example, in figure 3 (family
932), successive gene duplications which separated the
ancestors of argininosuccinate lyase from those of fumarate hydratase and aspartate ammonia-lyase occurred before the further separation of each pair of orthologs due
to the speciation.
We further used these trees as a tool to attempt a
relative dating of these duplications. Indeed, a systematic survey of tree topologies disclosed a significant
number of cases where the branch lengths were of unequal size. A limited sampling is shown in figures 2 and
3. For example, in the case of family 817 (fig. 2), divergence of the UDP-sugar hydrolase appears to have
been more important than that of the 2′,3′-cyclic-nucleotide 2′-phosphodiesterase. To check if the distribution
observed in figure 1 could be due to some acceleration
of the divergence between paralogous sequences, we
tried to estimate the magnitude of this process using the
39 families of four members containing two paralogous
proteins from each organism. Since each of these families has two copies in each species which separated in
the distal branches, the highest ratio of the sums of the
distal branch lengths would give us an estimation of the
acceleration of the divergence between paralogous sequences. For example, in the case of figure 2, this ratio
varied from (32 + 32)/(14 + 41) = 1.16 for family 807
to (106 + 84)/(27 + 20) = 4.04 for family 817. The
mean value of these ratios, calculated for this subset of
39 trees, was found to be 1.36. This value is clearly too
low to explain the difference of about 4 which has been
found between the peak of orthologs (around 40 PAM
units) and that of paralogs/metalogs (around 160 PAM
units) in figure 1.
FIG. 3.—Other examples of small families. The tree reconstruction method is described in the text. Branch lengths are displayed in PAM
units. For each family, the names of the proteins are given as their SwissProt mnemonics except for the new ORF in family 889, which is
indicated by its GenBank identification number. The corresponding SwissProt accession numbers are: PBPA_ECOLI: P02918; PBPA_HAEIN:
P31776; PBPB_ECOLI: P02919; PBPB_HAEIN: P45345; ARLY_ECOLI: P11447; ARLY_HAEIN: P44314; ASPA_ECOLI: P04422;
ASPA_HAEIN: P44324; FUMC_ECOLI: P05042.
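The ratio used in this subsection can be reproduced from the branch lengths quoted for families 807 and 817. The Python sketch below assumes that each four-member family is summarized by the two distal branch lengths of each ortholog pair; the function name is hypothetical.

```python
def distal_branch_ratio(pair_one, pair_two):
    """Highest ratio between the summed distal branch lengths of two ortholog pairs.

    pair_one, pair_two: (branch_length_eco, branch_length_hin) in PAM units
        for each of the two ortholog pairs of a four-member family.
    """
    low, high = sorted((sum(pair_one), sum(pair_two)))
    if low <= 0:
        raise ValueError("branch length sums must be positive")
    return high / low


# Values quoted in the text for the trees of figure 2.
ratio_807 = distal_branch_ratio((32, 32), (14, 41))    # 64 / 55
ratio_817 = distal_branch_ratio((106, 84), (27, 20))   # 190 / 47

# Ratio between the paralog/metalog peak and the ortholog peak of figure 1.
peak_ratio = 160 / 40
```

The mean ratio of 1.36 reported for the 39 families is well below the peak ratio of 4, which is the basis of the argument made above.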
Comparing Gene Positions
Up to now, we have studied gene duplications, but
some events could have involved chromosomal fragments larger than a gene. To detect possible conservation of chromosomal fragments between both genomes
we plotted on the same graph the relative positions of
orthologs. Figure 6 shows that there is no evidence for
alignment of dots representing similar genes present on
the same strand along a line forming a 45° angle with
the horizontal line. Likewise, there is no such alignment
along a line forming a −45° angle for similar genes
present on opposite strands. This negative result was
still true when the scale was increased until all dots became distinctly discernible. Therefore, no long homologous fragment could be detected in present-day chromosomes of these two bacteria. A similar observation
has been independently made by Tatusov et al. (1996)
using a similar approach on a partial set of E. coli proteins.
FIG. 4.—Two examples of large families. The tree reconstruction method is described in the text. Branch lengths are displayed in PAM
units. For each family, the names of the proteins are given either as their SwissProt mnemonics or as their GenBank identification numbers.
The corresponding SwissProt accession numbers are: ATOC_ECOLI: Q06065; PSPF_ECOLI: P37344; YFHA_ECOLI: P21712; TYRR_HAEIN:
P44694; TYRR_ECOLI: P07604; NTRC_ECOLI: P06713; YHGB_ECOLI: P38035; HYDG_ECOLI: P14375; FHLA_ECOLI: P19323;
UHPB_ECOLI: P09835; BAES_ECOLI: P30847; NTRB_ECOLI: P06712; RTSB_ECOLI: P18392; YGIY_HAEIN: P45336; BASS_ECOLI:
P30844; ATOS_ECOLI: Q06067; YJDH_ECOLI: P39272; ENVZ_ECOLI: P02933; ARCB_HAEIN: P44578; ARCB_ECOLI: P22763;
PHOQ_ECOLI: P23837; CREC_ECOLI: P08401; CPXA_ECOLI: P08336; BARA_ECOLI: P26607; PHOR_ECOLI: P08400.
To look for possible smaller chromosomal fragments of homology, we further computed all the angles
each segment joining two consecutive points would
make with the horizontal line, and we analyzed their
distribution. Figure 7 shows two peaks at 90° and −90°
corresponding to the theoretical curve of a random gene
arrangement without any fragment conservation. Note
that the peaks at −90° . . . −85° are more important than
those at 85° . . . 90°, because the classes used to build
our histogram include their lower but not their upper
bounds, implying that the 90° angles are counted as
−90°. Besides this, two peaks of significant size appear
at positions around −45° and 45°. This angle distribution confirms that extensive chromosomal rearrangement events have shuffled the order of many genes, but
that there are still short conserved fragments in chromosomes of present-day E. coli and H. influenzae. We
found 35 such conserved fragments corresponding to
cotranscribed genes or association of neighboring cotranscribed genes. The respective number of genes goes
from 2 (nine cases) to as many as 28 (corresponding to
the three operons of ribosomal proteins, rpsM to rplG,
rplN to rpmJ, and rpsJ to rpsQ, amounting in total to
13.3 kb). Another long fragment, amounting in total to
16.9 kb, corresponds to the three operons ftsLI, murE-ftsW, and murG-ddlB plus the neighboring genes ftsQ,
ftsA, ftsZ, and lpxA and the two ORFs yabB and yabC.
Notice also the eight genes of the histidine operon (hisG
to hisI), or the two divergent operons glpQT and glpABC
for transport and utilization of glycerol-3-phosphate.
The complete list is available on request.
FIG. 5.—Distribution of the numbers of paralogs in the large families. The number of paralogs in E. coli versus that in H. influenzae was
calculated for each gene family containing at least seven members. The largest family, containing 353 proteins (78 Hin, 275 Eco), is not
displayed for scaling reasons. White bars: E. coli paralogs. Black bars: H. influenzae paralogs.
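The angle analysis described in this subsection can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program actually used: both axes are taken in base pairs (so that 45° means equal spacing in the two genomes), displacements wrap around the circular maps, and the class width of 5° is inferred from the class boundaries quoted above. All function names are hypothetical.

```python
import math


def signed_circular_delta(a: float, b: float, length: float) -> float:
    """Shortest signed displacement from a to b on a circle, in [-length/2, length/2)."""
    return (b - a + length / 2) % length - length / 2


def segment_angle(p, q, eco_length, hin_length) -> float:
    """Angle in degrees between the segment p-q and the horizontal (E. coli) axis.

    p, q: (eco_position, hin_position) of two consecutive dots.
    The result lies in (-90, 90]; a vertical segment gives 90.
    """
    dx = signed_circular_delta(p[0], q[0], eco_length)
    dy = signed_circular_delta(p[1], q[1], hin_length)
    if dx == 0 and dy == 0:
        raise ValueError("the two points coincide")
    if dx == 0:
        return 90.0
    return math.degrees(math.atan(dy / dx))


def angle_class(angle: float, width: int = 5) -> int:
    """Lower bound of the histogram class containing the angle.

    Classes include their lower but not their upper bound, so an angle
    of exactly 90 degrees is folded into the class starting at -90.
    """
    if angle >= 90.0:
        angle -= 180.0
    return math.floor(angle / width) * width
```

The folding of 90° into the −90° class reproduces the asymmetry between the two extreme peaks noted above.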
Discussion
Comparing whole genomes has already been shown
to be a powerful approach to determining biological features of poorly known living beings and will be more
and more useful in the future as new complete genomes
become accessible. This is apparent in pioneering studies (Tatusov et al. 1996; Karp, Ouzounis, and Paley
1996) which tried to determine what could be the metabolism of H. influenzae by comparing the whole set
of putative proteins from this bacterium to the known
(sometimes well-known) enzymes from E. coli (Riley
1993; Neidhardt et al. 1996). We used this new approach
to better understand how two bacteria may have diverged from a recent common ancestor.
Our attempt aims at deducing how a limited number of changes in the repertoire of genes could have such
irreversible effects on the ways of life of prokaryotic
organisms. Indeed, taken together, all of our results
strongly suggest that the last common ancestor to E. coli
and H. influenzae was an organism having a genome
size and a way of life similar to present-day E. coli.
FIG. 6.—Plot of the positions of orthologous genes on the chromosomal maps of E. coli and H. influenzae. Position 0 is set to the Thr
locus (Eco) and the unique Not I restriction site (Hin). The gene positions are as given in GenBank (accession number U00096) for Eco and in the TIGR
database (Internet address: http://www.tigr.org) for Hin. Open squares: genes on the same strand (+/+ or −/−). Closed squares:
genes on opposite strands (+/− or −/+).
FIG. 7.—Distribution of the directions of the segments joining consecutive dots on the figure 6 plot. White bars: genes on the same strand
(+/+ or −/−). Black bars: genes on opposite strands (+/− or −/+). The distributions expected if the genes were uniformly scattered on the
chromosomal maps are shown by lines: genes on the same strand, dotted line; genes on opposite strands, continuous line.
This working hypothesis is supported by the comparison—at the qualitative as well as at the quantitative level—of the respective sets of proteins present in E. coli
and H. influenzae.
Timing of Duplication Events Versus Speciation
Comparison of the different sets of proteins present
in both bacteria led to the following conclusions. Nearly
all of the genes found to be duplicated in both bacteria
were already duplicated in their last common ancestor.
This is shown by comparing the respective distributions
of evolutionary distances of orthologs and paralogs (fig.
1) and by looking at the phylogenetic trees reconstructed
for each family of paralogs (e.g., figs. 2–4). The topology of the different trees strongly suggests that the duplication events occurred before the emergence of the
last common ancestor of both bacteria. However, we
could not exclude a priori the alternative hypothesis that the apparently ancient divergence between copies of
paralogous proteins was actually due to an acceleration of divergence caused by relaxed selection pressure. Such
a difference in evolutionary rate appeared in a few trees
(e.g., family 817 in fig. 2). To settle this point, we analyzed a subset of trees containing two members from each
bacterium. An acceleration (if any) of the rate of divergence may be directly estimated by measuring the highest ratio of the sums of the distal branch lengths. The
mean of these highest ratios for the subset of 39 families
was only 1.36, a value too low for unequal
rates of divergence to explain the roughly fourfold difference observed
in figure 1 between paralogs (peak centered at 160 PAM
units) and orthologs (peak centered at 40 PAM units).
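The arithmetic behind this argument can be illustrated with a minimal Python sketch. It assumes one possible reading of "the highest ratio of the sums of the distal branch lengths": for a tree with two paralogous lineages, each holding one Eco and one Hin member, the terminal branch lengths are summed per lineage and the larger sum is divided by the smaller. The function name, the toy families, and their branch lengths are hypothetical and are not data from this study; only the values 1.36, 160 PAM, and 40 PAM come from the text.

```python
from statistics import mean


def highest_distal_ratio(lineage_a, lineage_b):
    """Return the ratio (>= 1) of the larger to the smaller summed
    terminal branch length of two paralogous lineages.

    Each argument is a sequence of terminal branch lengths (PAM units),
    e.g. (Eco branch, Hin branch). This definition is an assumption.
    """
    sum_a, sum_b = sum(lineage_a), sum(lineage_b)
    longer, shorter = max(sum_a, sum_b), min(sum_a, sum_b)
    if shorter <= 0:
        raise ValueError("branch-length sums must be positive")
    return longer / shorter


# Hypothetical branch lengths for two made-up families (not real data).
toy_families = {
    "family_x": ((20.0, 25.0), (30.0, 30.0)),
    "family_y": ((18.0, 22.0), (20.0, 20.0)),
}
ratios = {name: highest_distal_ratio(a, b) for name, (a, b) in toy_families.items()}
toy_mean_ratio = mean(ratios.values())

# Values quoted in the text.
REPORTED_MEAN_RATIO = 1.36   # mean highest ratio over the 39 families
PARALOG_PEAK_PAM = 160       # peak of the paralog distance distribution
ORTHOLOG_PEAK_PAM = 40       # peak of the ortholog distance distribution
peak_ratio = PARALOG_PEAK_PAM / ORTHOLOG_PEAK_PAM

print(f"toy mean of highest ratios: {toy_mean_ratio:.2f}")
print(f"paralog/ortholog peak ratio: {peak_ratio:.1f}")
print(f"rate inequality alone accounts for the gap: {REPORTED_MEAN_RATIO >= peak_ratio}")
```

Under this reading, a rate asymmetry of 1.36 falls well short of the fourfold gap between the two peaks, which is the comparison the paragraph relies on.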
Thus, nearly all of the duplications observed in
present-day organisms were already present in the last
common ancestor of E. coli and H. influenzae. We conclude
that this putative bacterium had a genome size
similar to (or perhaps larger than) that of E. coli.
The Reduction of Gene Number in H. influenzae: A Directed Process?
After emergence from the last common ancestor,
H. influenzae went through an intense streamlining
which reduced its genome size by a factor of about 2.6.
Comparison of the different classes of genes present in
both bacteria suggests that this streamlining process has
been directed. Table 1 shows the two following features:
1. Haemophilus influenzae lost many sequences which
are still present in one copy in E. coli, as well as
some entire gene families. An example is the unexpected loss of the genes coding for some important
enzymes of the central metabolism (Fleischmann et al.
1995; Karp, Ouzounis, and Paley 1996; Tatusov et
al. 1996). Examples of families lost in H. influenzae
and present in E. coli are genes coding for proteins
necessary for chemotaxis or fimbriae synthesis, or for enzymes such as a dehydrogenase family. The progressive adaptation to parasitic life may have made these
genes dispensable. Alternatively, the accidental loss
of these genes could have been a stimulus for adopting such a way of life. Conversely, H. influenzae harbors a small set of genes which are not present in
E. coli. Many belong to families consisting solely of
ORFs without any function detectable by sequence
similarity. There are also some genes specific to H.
influenzae, such as two genes coding for competence
factors (Williams, Bannister, and Redfield 1994; Zulty and Barcak 1995) used in the very sophisticated
transformation process of this bacterium (Tomb et al.
1989; Smith et al. 1995).
2. The dramatic reduction in the number of genes observed in the case of H. influenzae is also due to a
loss in the families of paralogs which seems to have
been parallel to the global diminution in genome size
(see, for example, fig. 4). This loss appears to have
been selective according to family size: it occurred
rarely in small families but has often been massive
in large ones. Indeed, families having fewer than 10
members contain an average of 36.6% of H. influenzae proteins, compared with only 22.6% in families having more than 10 members. Again, this could be partially explained in terms of adaptation. Many small
families correspond to either enzymes which may be
indispensable to the H. influenzae metabolism or
structural proteins essential for cell integrity and cell
division (figs. 2 and 3). Many large families correspond to membrane transporters or transcriptional
regulators (fig. 4). Escherichia coli—and also the last
common ancestor—should have a great selective advantage in maintaining a large panel of transporters
and regulators adapted to many different environmental conditions, while such a large repertoire could
be superfluous (and may even be a burden) to a strict
parasite such as H. influenzae.
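The family-size comparison in item 2 can be made concrete with a short Python sketch. It computes the H. influenzae share of a family from its member counts and compares it with a naive baseline, introduced here purely as an illustrative assumption: if E. coli had kept every ancestral copy and H. influenzae had lost copies strictly in proportion to its genome reduction (factor of about 2.6), its share of any family would be 1 / (1 + 2.6). The function and variable names are hypothetical; the counts for the largest family (78 Hin, 275 Eco) and the factor 2.6 are taken from the text.

```python
def hin_share(n_hin: int, n_eco: int) -> float:
    """Fraction of a family's members that belong to H. influenzae."""
    total = n_hin + n_eco
    if total == 0:
        raise ValueError("family must contain at least one member")
    return n_hin / total


# Largest family reported in the text: 353 proteins (78 Hin, 275 Eco).
largest_family_share = hin_share(78, 275)

# Naive baseline (an assumption made for illustration only): losses in
# H. influenzae exactly proportional to its genome reduction.
GENOME_REDUCTION_FACTOR = 2.6
proportional_baseline = 1 / (1 + GENOME_REDUCTION_FACTOR)

print(f"Hin share of the largest family: {largest_family_share:.1%}")  # 22.1%
print(f"proportional-loss baseline:      {proportional_baseline:.1%}")  # 27.8%
```

On this simplified view, the 36.6% average for small families lies above the baseline and the 22.6% average for large families lies below it, which is in line with losses concentrated in large families.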
Therefore, these data strongly support the already
proposed (Fleischmann et al. 1995; Tatusov et al. 1996)
hypothesis that the streamlining process of the H. influenzae genome must have corresponded to the progressive adaptation toward a parasitic life.
“Recent” Evolution of Bacterial Chromosomes: The Role of Map Rearrangements
Recombinational events are found to be very frequent in bacteria and lead to large-scale chromosomal
rearrangements. This is the case when comparing closely related bacteria such as E. coli and Salmonella typhimurium (for an overview, see Roth et al. 1996), but
this phenomenon may also occur inside the same species
(Dykhuizen and Green 1991; Milkman 1996), or even
inside a large collection of E. coli K12 substrains (Perkins et al. 1993). Therefore, it was not surprising to find
so few long chromosomal fragments which are still conserved between E. coli and H. influenzae. The memory
of an ancient gene order must have been rapidly blurred
when many genes underwent apparently erratic, and
probably mostly neutral, displacements at any position
on the chromosome after the separation of the lines of
descent leading to these two bacteria. Thus, it may be
significant to find some synteny cases remaining between these two bacteria despite such strong divergence
of their chromosomal maps. Interestingly, while this paper was in the revision stage, a paper (Tamames et al.
1997) appeared that described the occurrence of conserved clusters of functionally related genes by comparing a subset of proteins from E. coli and the whole
set of H. influenzae proteins.
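One simple way to look for such remnants of synteny is sketched below in Python. This is not the procedure used in this study or by Tamames et al. (1997); it is a hypothetical illustration that assumes each ortholog pair is described by the rank of the gene along each chromosome, and it reports runs of orthologs that are immediate neighbors in both genomes (in either orientation). The function name, the `min_len` parameter, and the toy ranks are all invented for the example.

```python
def conserved_runs(pairs, min_len=3):
    """Group ortholog pairs into runs that are adjacent in both genomes.

    pairs   : iterable of (eco_rank, hin_rank) tuples, one per ortholog pair
    min_len : minimum number of genes for a run to be reported
    """
    runs, current = [], []
    for pair in sorted(pairs):  # walk along the E. coli map
        if current:
            prev = current[-1]
            adjacent = (pair[0] - prev[0] == 1) and (abs(pair[1] - prev[1]) == 1)
            if not adjacent:
                # Close the current run and keep it only if long enough.
                if len(current) >= min_len:
                    runs.append(current)
                current = []
        current.append(pair)
    if len(current) >= min_len:
        runs.append(current)
    return runs


# Toy ranks (not real gene positions): one collinear run, one inverted run,
# and scattered orthologs in between.
toy_pairs = [(1, 50), (2, 51), (3, 52), (10, 7), (11, 300),
             (20, 90), (21, 89), (22, 88)]
runs = conserved_runs(toy_pairs)
for run in runs:
    print(run)
```

Strict adjacency is a deliberately conservative choice; allowing small gaps would tolerate local insertions or deletions at the cost of more chance matches.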
It would be interesting to see if such conservation
of gene order has occurred in even more distant species.
A remarkable case of considerable synteny has recently
been shown between the yeast Saccharomyces cerevisiae
and another fungus, Ashbya gossypii (Altmann-Jöhl
and Philipsen 1996), although no apparent synteny occurs between S. cerevisiae and Schizosaccharomyces
pombe (Goffeau et al. 1996). Therefore, the comparison
of entire genomes of various species, both closely and
distantly related, will help to progressively define the
evolutionary forces at work in the shaping of a genome.
Acknowledgments
We thank Hervé Philippe for helpful suggestions
and Purificacion Lopez-Garcia for critical reading of this
manuscript.
LITERATURE CITED
ALTMANN-JÖHL, R., and P. PHILIPSEN. 1996. AgTHR4, a new
selection marker for transformation of the filamentous fungus Ashbya gossypii, maps in a four-gene cluster that is
conserved between A. gossypii and Saccharomyces cerevisiae. Mol. Gen. Genet. 250:69–80.
ALTSCHUL, S. F. 1991. Amino acid substitution matrices from
an information theoretic perspective. J. Mol. Biol. 219:555–
565.
BAIROCH, A., and R. APWEILER. 1996. The SWISS-PROT protein sequence data bank and its new supplement TREMBL.
Nucleic Acids Res. 24:21–25.
BENNER, S. A., M. A. COHEN, and G. H. GONNET. 1993. Empirical and structural models for insertions and deletions in
the divergent evolution of proteins. J. Mol. Biol. 229:1065–
1082.
BERLYN, K. B., K. BROOKS LOW, and K. E. RUDD. 1996. Linkage map of Escherichia coli K12, edition 9. Pp. 1175–1209
in F. C. NEIDHARDT, R. CURTISS III, E. C. C. LIN, J. INGRAHAM, K. BROOKS LOW, B. MAGASANIK, W. REZNIKOFF,
M. RILEY, M. SCHAECHTER, and H. E. UMBARGER, eds.
Escherichia coli and Salmonella, cellular and molecular biology. 2nd edition. ASM Press, Washington, D.C.
BRENNER, S. E., T. HUBBARD, A. MURZIN, and C. CHOTHIA.
1995. Gene duplications in H. influenzae. Nature 378:140.
DAYHOFF, M. O., R. M. SCHWARTZ, and B. C. ORCUTT. 1978.
A model of evolutionary change in proteins. Pp. 345–352
in M. O. DAYHOFF, ed. Atlas of protein sequence and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
DYKHUIZEN, D. E., and L. GREEN. 1991. Recombination in
Escherichia coli and the definition of biological species. J.
Bacteriol. 173:7257–7268.
FITCH, W. M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113.
FLEISCHMANN, R. D., M. D. ADAMS, O. WHITE et al. (40 coauthors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–
512.
GOFFEAU, A., B. G. BARRELL, H. BUSSEY et al. (16 co-authors).
1996. Life with 6000 genes. Science 274:546–567.
GONNET, G. H., M. A. COHEN, and S. A. BENNER. 1992. Exhaustive matching of the entire protein sequence database.
Science 256:1443–1445.
HOCH, J. A., and T. J. SILHAVY. 1995. Two-component signal
transduction. ASM Press, Washington, D.C.
JOHNSON, M. S., and J. P. OVERINGTON. 1993. A structural
basis for sequence comparisons: an evaluation of scoring
methodologies. J. Mol. Biol. 233:716–738.
KARP, P. D., C. OUZOUNIS, and S. M. PALEY. 1996. HinCyc:
a knowledge base of the complete genome and metabolic
pathways of H. influenzae. Pp. 116–124 in Proceedings of
the Fourth International Conference on Intelligent Systems
for Molecular Biology, Menlo Park, CA. AAAI Press, St.
Louis, Mo.
LABEDAN, B., and M. RILEY. 1995a. Widespread protein sequence similarities: origins of E. coli genes. J. Bacteriol.
177:1585–1588.
LABEDAN, B., and M. RILEY. 1995b. Gene products of Escherichia coli: sequence
comparisons and common ancestries. Mol. Biol. Evol. 12:
980–987.
MILKMAN, R. 1996. Recombinational exchange among clonal
populations. Pp. 2663–2684 in F. C. NEIDHARDT, R. CURTISS III, E. C. C. LIN, J. INGRAHAM, K. BROOKS LOW, B.
MAGASANIK, W. REZNIKOFF, M. RILEY, M. SCHAECHTER,
and H. E. UMBARGER, eds. Escherichia coli and Salmonella,
cellular and molecular biology. 2nd edition. ASM Press,
Washington, D.C.
MOXON, E. R. 1992. Molecular basis of invasive Haemophilus
influenzae type b disease. J. Infect. Dis. 165(Suppl. 1):S77–
S81.
MUSSER, J. M., J. S. KROLL, D. M. GRANOFF, E. R. MOXON,
B. R. BRODEUR, J. CAMPOS, H. DABERNAT, W. FREDERIKSEN, J. HAMEL, and G. HAMMOND. 1990. Global genetic
structure and molecular epidemiology of encapsulated Haemophilus influenzae. Rev. Infect. Dis. 12:75–111.
NEIDHARDT, F. C., R. CURTISS III, E. C. C. LIN, J. INGRAHAM,
K. BROOKS LOW, B. MAGASANIK, W. REZNIKOFF, M. RILEY,
M. SCHAECHTER, and H. E. UMBARGER, eds. 1996. Escherichia coli and Salmonella, cellular and molecular biology.
2nd edition. ASM Press, Washington, D.C.
OLSEN, G. J., C. R. WOESE, and R. OVERBEEK. 1994. The
winds of evolutionary change: breathing new life into microbiology. J. Bacteriol. 176:1–6.
PERKINS, J. D., J. D. HEATH, B. R. SHARMA, and G. M. WEINSTOCK. 1993. XbaI and BlnI genomic cleavage maps of
Escherichia coli K-12 strain MG1655 and comparative
analysis of other strains. J. Mol. Biol. 232:419–445.
RILEY, M. 1993. Functions of the gene products of Escherichia
coli. Microbiol. Rev. 57:862–952.
RILEY, M., and B. LABEDAN. 1997. Protein evolution viewed
through Escherichia coli protein sequences: introducing the
notion of structural segment of homology, the module. J.
Mol. Biol. 269:1–12.
ROTH, J. R., N. BENSON, T. GALITSKI, K. HAACK, J. G. LAWRENCE, and L. MIESEL. 1996. Rearrangements of the bacterial chromosome: formation and applications. Pp. 2256–
2276 in F. C. NEIDHARDT, R. CURTISS III, E. C. C. LIN, J.
INGRAHAM, K. BROOKS LOW, B. MAGASANIK, W. REZNIKOFF, M. RILEY, M. SCHAECHTER, and H. E. UMBARGER,
eds. Escherichia coli and Salmonella, cellular and molecular biology. 2nd edition. ASM Press, Washington, D.C.
SMITH, H. O., J.-F. TOMB, B. A. DOUGHERTY, R. D. FLEISCHMANN, and J. C. VENTER. 1995. Frequency and distribution
of DNA uptake signal sequences in the Haemophilus influenzae Rd genome. Science 269:538–540.
SMITH, T. F., and M. S. WATERMAN. 1981. Identification of
common molecular subsequences. J. Mol. Biol. 147:195–
197.
SOLIGNAC, M., C. PERIQUET, D. ANXOLABEHERE, and C. PETIT.
1995. Génétique et évolution. Hermann, Paris.
TAMAMES, J., G. CASARI, C. OUZOUNIS, and A. VALENCIA.
1997. Conserved clusters of functionally related genes in
two bacterial genomes. J. Mol. Evol. 44:66–73.
TATUSOV, R. L., A. R. MUSHEGIAN, P. BORK, N. P. BROWN, W.
S. HAYES, M. BORODOVSKY, K. E. RUDD, and E. V. KOONIN. 1996. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with
Escherichia coli. Curr. Biol. 6:279–291.
TOMB, J. F., G. J. BARCAK, M. S. CHANDLER, R. J. REDFIELD,
and H. O. SMITH. 1989. Transposon mutagenesis, characterization and cloning of transformation genes of Haemophilus influenzae Rd. J. Bacteriol. 171:3796–3802.
VOGT, G., T. ETZOLD, and P. ARGOS. 1995. An assessment of
amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 249:816–831.
WILLIAMS, P. M., L. A. BANNISTER, and R. J. REDFIELD. 1994.
The Haemophilus influenzae sxy-1 mutation is in a newly
identified gene essential for competence. J. Bacteriol. 176:
6789–6794.
ZULTY, J. J., and G. J. BARCAK. 1995. Identification of a DNA
transformation gene required for com101A1 expression and
supertransformer phenotype in Haemophilus influenzae.
Proc. Natl. Acad. Sci. USA 92:3616–3620.
MANOLO GOUY, reviewing editor
Accepted October 8, 1997