The Evolutionary Relationships Between the Two Bacteria Escherichia coli and Haemophilus influenzae and their Putative Last Common Ancestor

Renaud de Rosa and Bernard Labedan
Institut de Génétique et Microbiologie, Université Paris-Sud, France

We have tried to approach the nature of the last common ancestor to Haemophilus influenzae and Escherichia coli and to determine how each bacterium could have diverged from this putative organism. The approach used was exhaustive analysis of the homologous proteins coded by genes present in these bacteria, using as criteria for sequence relatedness an alignment of at least 80 amino acid residues and a PAM distance (number of accepted point mutations per 100 residues separating two sequences) below 250. Evolutionarily significant similarities were found between 1,345 H. influenzae proteins (85% of the total genome) and 3,058 E. coli proteins (75% of the total genome), many of them belonging to families of various sizes (from 666 doublets to 35 large groups of more than 10 members). Nearly all the genes found by this approach to be duplicated in both bacteria were already duplicated in their last common ancestor. This was deduced from (1) the comparison of the respective distributions of evolutionary distances between orthologs (genes separated only by speciation events) and paralogs (genes duplicated in the same genome) and (2) the analysis of the phylogenetic trees reconstructed for each family of paralogs containing at least two members belonging to each bacterium. The distributions of the different categories of homologs show a significant loss of paralogous genes in H. influenzae (reduction proportional to the genome size), of many sequences which are still present in one copy in E. coli, and of some entire gene families. Phylogenetic trees also confirmed this recent loss of paralogous genes in H. influenzae.
Thus, the genome size of the last common ancestor of these two bacteria would have been close to that of present-day E. coli, and the evolution of H. influenzae toward a parasitic life led to a substantial decrease in its genome size by some mechanism of streamlining. During this recent evolution, the memory of the gene order present in the last common ancestor has been blurred, but a few short conserved chromosomal fragments can still be detected in present-day E. coli and H. influenzae.

Abbreviations: DARWIN, Data Analysis and Retrieval With Indexed Nucleotide/Peptide Sequences; Eco, Escherichia coli; Hin, Haemophilus influenzae; ORF, open reading frame; PAM, number of accepted point mutations per 100 residues separating two protein sequences.

Key words: comparative genomics, bacterial evolution, paralogous proteins, gene families, rearrangements of bacterial chromosomes, gene duplication, Escherichia coli, Haemophilus influenzae.

Address for correspondence and reprints: Bernard Labedan, Institut de Génétique et Microbiologie, Université Paris-Sud, Bâtiment 409, 91405 Orsay Cedex, France. E-mail: [email protected].

Mol. Biol. Evol. 15(1):17–27. 1998
© 1998 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Introduction

According to the evolutionary relationships established in the 16S rRNA tree (for a recent version, see Olsen, Woese, and Overbeek 1994), Escherichia coli and Haemophilus influenzae belong to the gamma subdivision of purple bacteria. Inside this subdivision, they appear to be closely related, with H. influenzae being the closest sister group to the enterobacteria cluster. However, these two bacteria differ in many important features. For example, their genome sizes vary from 4.7 Mb for E. coli to only 1.8 Mb for H. influenzae. Moreover, their ways of life appear quite different: although it can behave as a pathogen, E. coli is a versatile free-living bacterium, able to adapt to many environmental conditions (for a general overview, see Neidhardt et al. 1996), while H. influenzae, an obligate parasite of the human respiratory tract which can invade the blood and the central nervous system, is dependent on very specialized growing conditions (Musser et al. 1990; Moxon 1992). This suggests that these two bacteria followed different paths in their recent evolutionary history after having diverged from their last common ancestor. (In this paper, "last common ancestor" will always refer to the last common ancestor of E. coli and H. influenzae.)

To solve some aspects of this recent history, we have tried to reconstruct the putative nature of this common ancestor and to determine how each bacterium could have diverged from it. This has been attempted by analyzing the different classes of genes encoded by each genome using the following rationale. It has already been shown that E. coli contains a large proportion of paralogous genes and that many of these paralogous genes group into families of various sizes (Labedan and Riley 1995a, 1995b; Riley and Labedan 1997). Paralogous genes (Fitch 1970) have been defined as copies issued from a duplication of an ancestral gene, each copy having diverged before any speciation event. A preliminary study (Brenner et al. 1995) has suggested that a significant proportion of H. influenzae genes probably descend from duplications. Many open reading frames (ORFs) found by systematic sequencing (Fleischman et al. 1995) of the H. influenzae genome code for proteins which are similar in sequence to E. coli proteins, and the corresponding functions were assigned on the basis of this similarity (Fleischman et al. 1995; Tatusov et al. 1996). These studies suggested that many H. influenzae genes are homologous to E. coli ones. However, many assignments of homology have been based on similarities limited to small motifs or signatures.
Such an approach is very helpful when trying to maximize the number of functional assignments (Tatusov et al. 1996) but is poorly suited to a consistent evolutionary study. Since the evolution of complete genomes involves large-scale (that is, at least the size of a gene) chromosomal rearrangements, comparison of long segments of homology is especially crucial to determine the timing of duplication events versus speciation and thus to reconstruct the nature of the putative last common ancestor of E. coli and H. influenzae.

Therefore, we have undertaken an exhaustive analysis, at the level of protein sequence, of the whole sets of H. influenzae and E. coli genes. Taking into account only long segments of homology allowed us to look at the evolutionary behavior of whole genes. Accordingly, we counted the respective proportions of each class of proteins (paralogs, orthologs, and unique), and compared the families of proteins and their members inside each genome. Here, we define a protein family as any group of proteins encoded by genes which derive from the same ancestral gene. Such a definition extends from a pair of orthologs (doublets) to large clusters of paralogs. Finally, we compared the genomic maps of these evolutionarily related bacteria in order to detect the vestiges of an ancestral gene order by looking at conserved chromosomal fragments larger than a gene. The data reported in this paper support the hypothesis that E. coli and H. influenzae descend from a common ancestor which had a genome size and composition close to that of present-day E. coli.

Materials and Methods

Sequences

The whole sets of E. coli (version sent to GenBank in January 1997, accession number U00096, further completed with a few unpublished missing sequences later obtained from the Blattner group through Monica Riley) and H. influenzae (Fleischman et al. 1995) proteins were collected, with the exception of the nonchromosomal genes (e.g., insertion sequences).
To study only significantly long segments of homology, we further discarded all of the proteins displaying a length shorter than 80 amino acids. After these two steps, we had a data set made of 1,574 H. influenzae (Hin) and 4,061 E. coli (Eco) protein sequences.

Finding Extended Similarities Between Protein Sequences of H. influenzae and E. coli

The so-called DARWIN (Data Analysis and Retrieval With Indexed Nucleotide/Peptide Sequences) program (Gonnet, Cohen, and Benner 1992) is an interactive and programmable system which allows one to search for all matches between proteins of a database using a complete Smith-Waterman-type dynamic programming algorithm (Smith and Waterman 1981). In this DARWIN system, the sequences are organized as sets of evolutionarily connected components which are characterized by an evolutionary distance measured in PAM units (Dayhoff, Schwartz, and Orcutt 1978). This PAM distance (the number of accepted point mutations per 100 residues separating two sequences) is based on (1) a mutation data matrix normalized to a distance of 250 PAM units and recomputed for each new set of sequences and (2) a gap penalty which is itself dependent on the PAM distance intrinsic to the set of sequences studied (see Gonnet, Cohen, and Benner 1992; Benner, Cohen, and Gonnet 1993 for additional details of the method). This program, which has been judged to be one of the best performers among sequence comparison programs (Johnson and Overington 1993; Vogt, Etzold, and Argos 1995), was obtained from the Institut für Wissenschaftliches Rechnen (ETH Zürich, Switzerland) and implemented on a DEC Alpha station. After slight adaptation of several existing DARWIN procedures and addition of several new ones created for our purpose (available on request by electronic mail), we were able to detect, in one step, all the matches between the 1,574 Hin and 4,061 Eco proteins.
To limit our search to the evolutionarily significant matches, we imposed the two following cutoffs: only the pairs corresponding to an alignment of at least 80 residues and separated by less than 250 PAM units were kept for further analysis. The rationale for these two cutoffs is essentially based on the finding that a PAM250 substitution matrix is the most efficient scoring matrix when applied to distantly related protein pairs for a minimum significant length of 83 residues (Altschul 1991).

Analysis of the Protein Pairs and Families

To analyze all of the Eco/Eco, Hin/Hin, and Eco/Hin protein pairs extracted from the DARWIN outputs, a database was built using the relational database application Claris FilemakerPro 3.03 for Macintosh. A program written in Caml light language was used to automatically gather into one family all sequences that were related by a chain of similarities, collecting all relatives of both members of each pair until no further pairwise relationships were found (unpublished work of R.D.R., program available on request). (Information about the family of Caml languages is available at the Internet address http://pauillac.inria.fr/caml/index-eng.html.)

The respective distributions, as a function of their PAM distance, of the paralogs (protein pairs Eco/Eco and Hin/Hin) and of the orthologs (pairs Eco/Hin) have been weighted by the size of their respective families as follows. Since the number of pairs N(N − 1)/2 increases faster than the number of sequences N, a weighting factor of 1/(N − 1) was used for each family. This weighting, which allows each family to be represented proportionally to its N, was computed by applying a program written in Pascal language (unpublished data).
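The cutoffs, the chaining of pairs into families, and the family-size weighting described above can be sketched in a few lines. This is only an illustration in Python, not the Caml light or Pascal programs used in the study; the function names (keep_match, build_families, pair_weight) and the sequence labels in the usage note are hypothetical, and the grouping is implemented here with a union-find structure, which is one common way to collect sequences related by a chain of similarities.

```python
from collections import defaultdict


def keep_match(aligned_length, pam_distance):
    """Cutoffs stated in the text: at least 80 aligned residues, below 250 PAM."""
    return aligned_length >= 80 and pam_distance < 250


def build_families(pairs):
    """Group sequences connected by a chain of pairwise matches.

    `pairs` is an iterable of (name_a, name_b) tuples that passed the cutoffs.
    Returns a list of sets, one set of sequence names per family.
    """
    parent = {}

    def find(x):
        # Find the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    families = defaultdict(set)
    for name in list(parent):
        families[find(name)].add(name)
    return list(families.values())


def pair_weight(family_size):
    """Weight 1/(N - 1) applied to every pair of a family with N members."""
    return 1.0 / (family_size - 1)
```

As a usage example with made-up labels, the pairs ("ecoA", "hinA"), ("hinA", "ecoB") and ("ecoC", "hinC") yield one family of three members and one doublet. For the three-member family, the 3 possible pairs each weighted by 1/2 contribute 1.5 in total, that is N/2, which is the proportionality to N that the weighting is meant to achieve.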
Making Phylogenetic Trees for the Families of Proteins

For each family containing at least two members belonging to each bacterium, a multiple alignment and an unrooted tree were established using two functions (MulAlignment and PhyloTree) which are also part of the DARWIN program (available on the CBRG WWW server of the ETH at Zürich). The trees obtained using PhyloTree are based on the estimated PAM distances between each pair of sequences, and the deduced evolutionary distance between each node is weighted by computing the variance of the respective distance. Therefore, these distance trees appear to be approximations of maximum-likelihood trees (see Gonnet, Cohen, and Benner [1992] for additional details of the method and the booklet available at the Internet address http://cbrg.inf.ethz.ch/ServerBooklet, especially the subsection 2_3_5_1).

Analysis of Gene Positions

Eight hundred twenty-nine pairs of orthologs with gene positions in base pairs available in the GenBank (accession number U00096) and TIGR databases (Fleischman et al. 1995), respectively, were used for this analysis. We considered as orthologs pairs of genes from the two species which fit into any one of the following categories: genes that had been given the same name in the two species (assuming that the functional assignments in the databases were correct), genes from the two species belonging to two-member families, or genes belonging to larger families and being closer (in PAM distance) to each other than to any other member of the family. Positions taken as value 0 correspond to gene thrA in E. coli (Berlyn, Brooks Low, and Rudd 1996), and to that of the unique Not I restriction site in H. influenzae (Fleischman et al. 1995). The distance between genes was computed by taking into account the fact that the chromosomes are circular, that is, the top and bottom edges and the left and right edges of the rectangular plot of gene positions join to make a torus.
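The effect of circularity on distances can be illustrated with a short Python sketch. The function names are hypothetical, and the genome lengths used in the usage note are only the approximate sizes quoted in the Introduction (4.7 Mb and 1.8 Mb), not exact chromosome lengths.

```python
def circular_distance(pos_a, pos_b, genome_length):
    """Shortest distance in base pairs between two positions on a circle."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d)


def wrapped_delta(pos_from, pos_to, genome_length):
    """Signed shortest displacement from pos_from to pos_to on a circle.

    The result lies in the half-open interval [-L/2, L/2), so two genes that
    straddle position 0 are treated as neighbours rather than as far apart.
    """
    half = genome_length / 2
    return (pos_to - pos_from + half) % genome_length - half
```

For instance, on a circle of 4,700,000 bp, positions 100 and 4,600,000 are 100,100 bp apart, not 4,599,900 bp. Applying wrapped_delta separately to the E. coli coordinate and to the H. influenzae coordinate of two dots gives their displacement on the torus.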
The theoretical distribution was computed as follows: assuming the 829 genes had the same position on the E. coli chromosome, we computed the distribution of the angles as if each had 10 orthologs uniformly scattered on the H. influenzae chromosome, located at 1/10, 2/10, ..., 10/10 of the length of the chromosome. This distribution was divided by 10 for scaling reasons. To further detect conserved gene clusters, we sought consecutive pairs of orthologs, i.e., two genes present in the same order and at the same distance in the two genomes. We considered only those cases in which no more than two ORFs lay between the two orthologs. Starting from these, we looked at the genes upstream and downstream in order to find longer conserved clusters.

Results

Finding Significant Similarities Between the Proteins Encoded by H. influenzae and E. coli

We adapted the DARWIN program (Gonnet, Cohen, and Benner 1992) to build a mixed set of procedures, some already existing and others we created, in order to mimic the AllAllDB program. This set allowed us to collect, in one step, all the significant matches between a defined set of Hin and Eco proteins. Significant matches are defined as matches displaying an alignment length of at least 80 residues and a PAM distance below 250. Indeed, a length cutoff seemed essential to get evolutionarily consistent data: by taking into account only similarities extending along a large part of each protein, if not the whole protein, we would be able to trace with confidence the corresponding gene duplication events. Accordingly, the defined set corresponds to a total of 5,635 proteins which are at least 80 amino acids long, 1,574 belonging to H. influenzae and 4,061 to E. coli.

The exhaustive search for all the matches separated by less than 250 PAM units and at least 80 residues long gave 19,896 matches corresponding to 3,058 E. coli and 1,345 H. influenzae protein sequences. This implied that many sequences belong to protein families.
A program (see Materials and Methods) was applied to assemble the proteins belonging to the same family. To maintain a high level of consistency in our analysis of gene evolution, we further considered only families displaying matches where the alignment length was longer than 50% of the sequence length of each protein. This corresponded to 9,599 matches made of 1,219 Hin and 2,537 Eco sequences. These 3,756 sequences assembled in 1,015 families of various sizes: 666 doublets, 139 triplets, 175 small families from 4 to 10 members, and 35 large groups of more than 10 members.

Correlation Between the Distribution of Evolutionary Distances and the Nature of the Homologous Proteins

We first made a comparison of the respective distributions of evolutionary distances separating matching proteins which are coded either by the same genome (paralogs) or by the two genomes. The latter may correspond to either orthologs (homologous genes separated by speciation) or descendants of paralogs which have further been separated by speciation. Members of this last category have been called metalogs (from the Greek meta = change) by several authors (Solignac et al. 1995). To make these distributions, we used the total set of 19,896 matches, but each match was weighted according to the size of the family to which the matching proteins belong (see Materials and Methods). Figure 1 shows that the category of paralogs sensu stricto made of all the Eco/Eco and Hin/Hin pairs displays a very broad peak centered at 160 PAM units, whereas the Eco/Hin pairs are separated into two different classes, one having a distribution similar to that of the paralogs and the other one showing a narrower peak centered at 40 PAM units. We propose that the first class corresponds to the so-called metalogs and the other class is made of orthologs.
The small plateau appearing at the intermediate PAM distance between 90 and 110 could correspond to the sum of the decreasing number of pairs of orthologs and the increasing number of pairs of metalogs. Taken together, these data strongly suggest that the large majority of gene duplications (paralogs sensu lato, including metalogs) found in both bacteria must have occurred before the speciation event leading to the immediate ancestors of E. coli and H. influenzae.

FIG. 1.—Weighted distribution of PAM distances separating pairs of proteins belonging to the same family. The number of pairs corresponding to each family was weighted by 1/(N − 1), where N is the number of sequences belonging to the family, in order for each family to be represented proportionally to its N. White bars: PAM distance separating sequences from two different organisms (Eco/Hin) (orthologs or metalogs). Black bars: PAM distance separating sequences from the same organism (Eco/Eco or Hin/Hin) (paralogs sensu stricto).

The Loss of Proteins in H. influenzae Appears to be Selective

Next, we compared the whole data set of 4,061 E. coli and 1,574 H. influenzae sequences to determine the distributions of each category of proteins in both organisms. The proteins encoded by each genome may be separated into the following four categories: Sequences either can be found in both genomes (Eco and Hin) or may belong only to one genome (Eco or Hin). In this last case, sequences either can be unique to this genome or may be found as duplicated copies inside this genome. Table 1 summarizes the number of sequences corresponding to each category:

(1) By definition, the same number of unique orthologs, 546, is present in both bacteria. Consequently, their relative proportion is larger in H. influenzae (34.7%) than it is in E. coli (13.4%).
(2) The number of paralogs present in families having at least two representatives in one genome and one in the other genome decreases from 1,950 (421 + 1,511 + 18) in E. coli to 774 (141 + 591 + 42) in H. influenzae. Thus, there has been a reduction of this class of paralogs which is remarkably proportional to that of the genome size in H. influenzae, since the relative percentages (49% in H. influenzae and 48% in E. coli) remain very close.

(3) There are far more sequences, 1,565 (1,003 + 562), which are unique to E. coli than are peculiar to H. influenzae (254 = 229 + 25). Therefore, it appears that many E. coli sequences have no (or no longer any) homologs in H. influenzae. When looking at the respective functional categories of the sequences which are unique to H. influenzae, we found that 69.7% (177) are proteins with unknown functions, some of them making homogeneous families (up to five members). Among the proteins whose functions have been attributed by sequence similarity, we found a significant proportion of proteins which play a role similar to E. coli proteins, such as 18 enzymes involved in metabolism, 6 ribosomal proteins, and 2 lipoproteins, but whose sequence similarity to E. coli proteins was too low to be recognized in our search of significant similarities. This emphasizes that our strict criteria of evolutionary consistency may lead to some underestimation of the actual total number of homologous proteins, and we could not exclude the possibility that these analogous proteins are actually very distant orthologs. Therefore, it appears that there would be very few proteins really specific to H. influenzae, such as two competence factors (Williams, Bannister, and Redfield 1994; Zulty and Barcak 1995) or three restriction enzymes.

Table 1
Distribution of Proteins in Different Classes of Families

E. coli (Eco) sequences     H. influenzae (Hin) sequences in each family
in each family              None          One                  Several
None                        —             229 Hin              25 Hin
One                         1,003 Eco     546 Eco, 546 Hin     18 Eco, 42 Hin
Several                     562 Eco       421 Eco, 141 Hin     1,511 Eco, 591 Hin

Analysis of Evolutionary Trees: Losses of Paralogs in Large Families

To further determine where the main losses of genes have occurred during the recent evolution of H. influenzae, we reconstructed a phylogenetic tree for each family of paralogs containing at least two members from each genome. An exhaustive analysis of the 137 corresponding trees showed two main features which are illustrated in the few examples shown in figures 2–4.

FIG. 2.—A few examples of family trees containing two pairs of paralogs. The tree reconstruction method is described in the text. Branch lengths are displayed in PAM units. For each family, the names of the proteins are given as their SwissProt (Bairoch and Apweiler 1996) mnemonics. The corresponding SwissProt accession numbers are: PBP2_ECOLI: P08150; PBP2_HAEIN: P44469; PBP3_ECOLI: P04286; PBP3_HAEIN: P45059; RODA_ECOLI: P15035; RODA_HAEIN: P44468; FTSW_ECOLI: P16457; FTSW_HAEIN: P45064; ODO2_ECOLI: P07016; ODO2_HAEIN: P45302; ODP2_ECOLI: P06959; ODP2_HAEIN: P45118; CN16_ECOLI: P08331; CN16_HAEIN: P44764; USHA_ECOLI: P07024; 5NTD_HAEIN: P44569.

Schematically, there are two main classes of trees:

(1) The small families generally display the same number of paralogs in each organism. Figures 2 and 3 (family 932) show several examples of such families. When the number of members was odd, this generally corresponded (93%) to the presence of an additional paralog in E. coli, as shown in figure 3 (family 889), where an unknown E. coli ORF belongs to this family of penicillin-binding proteins. These small families frequently correspond to enzymes (such as acyltransferases or periplasmic hydrolases in fig. 2 or lyases in fig.
3) essential to the metabolism of each bacterium, or to proteins important for cell structure (such as proteins necessary to the integrity of the cell wall; figs. 2 and 3).

(2) In contrast, in large families, the number of paralogs present in H. influenzae was always strongly reduced when compared to that of E. coli. For example, in the families shown in figure 4, there are only three sensor proteins belonging to the two-component signal transduction system (Hoch and Silhavy 1995) in H. influenzae compared to 19 such sensors present in E. coli, and there is only 1 transcriptional regulator in H. influenzae compared to 12 in E. coli. Figure 5 further shows the distribution profile displaying the number of paralogs in E. coli versus that in H. influenzae for each gene family containing at least seven members. This distribution confirms that a significant part of the paralogous proteins specifically lost in H. influenzae belong to large E. coli families. Note that the loss is less massive in the largest displayed family (ABC proteins).

FIG. 3.—Other examples of small families. The tree reconstruction method is described in the text. Branch lengths are displayed in PAM units. For each family, the names of the proteins are given as their SwissProt mnemonics except for the new ORF in family 889, which is indicated by its GenBank identification number. The corresponding SwissProt accession numbers are: PBPA_ECOLI: P02918; PBPA_HAEIN: P31776; PBPB_ECOLI: P02919; PBPB_HAEIN: P45345; ARLY_ECOLI: P11447; ARLY_HAEIN: P44314; ASPA_ECOLI: P04422; ASPA_HAEIN: P44324; FUMC_ECOLI: P05042.

Analysis of Evolutionary Trees: Dating of the Duplications

The distribution of evolutionary distances (fig. 1) already suggested that the duplication events which gave rise to the paralogous genes took place before the separation of the two bacteria. This appears to be confirmed by the tree topologies. Indeed, the copies present in each species are always separated in the distal parts of the trees (figs. 2 and 3). For example, in figure 3 (family 932), successive gene duplications which separated the ancestors of argininosuccinate lyase from those of fumarate hydratase and aspartate ammonia-lyase occurred before the further separation of each pair of orthologs due to the speciation.

We further used these trees as a tool to attempt a relative dating of these duplications. Indeed, a systematic survey of tree topologies disclosed a significant number of cases where the branch lengths were of unequal size. A limited sampling is shown in figures 2 and 3. For example, in the case of family 817 (fig. 2), divergence of the UDP-sugar hydrolase appears to have been more important than that of the 2′,3′-cyclic-nucleotide 2′-phosphodiesterase. To check if the distribution observed in figure 1 could be due to some acceleration of the divergence between paralogous sequences, we tried to estimate the magnitude of this process using the 39 families of four members containing two paralogous proteins from each organism. Since each of these families has two copies in each species which separated in the distal branches, the highest ratio of the sums of the distal branch lengths would give us an estimation of the acceleration of the divergence between paralogous sequences. For example, in the case of figure 2, this ratio varied from (32 + 32)/(14 + 41) = 1.16 for family 807 to (106 + 84)/(27 + 20) = 4.04 for family 817. The mean value of these ratios, calculated for this subset of 39 trees, was found to be 1.36. This value is clearly too low to explain the roughly fourfold difference which has been found between the peak of orthologs (around 40 PAM units) and that of paralogs/metalogs (around 160 PAM units) in figure 1.

Comparing Gene Positions

Up to now, we have studied gene duplications, but some events could have involved chromosomal fragments larger than a gene.
To detect possible conservation of chromosomal fragments between both genomes we plotted on the same graph the relative positions of orthologs. Figure 6 shows that there is no evidence for alignment of dots representing similar genes present on the same strand along a line forming a 45° angle with the horizontal line. Likewise, there is no such alignment along a line forming a −45° angle for similar genes present on opposite strands. This negative result was still true when the scale was increased until all dots became distinctly discernible. Therefore, no long homologous fragment could be detected in present-day chromosomes of these two bacteria. A similar observation has been independently made by Tatusov et al. (1996) using a similar approach on a partial set of E. coli proteins.

FIG. 4.—Two examples of large families. The tree reconstruction method is described in the text. Branch lengths are displayed in PAM units. For each family, the names of the proteins are given either as their SwissProt mnemonics or as their GenBank identification numbers. The corresponding SwissProt accession numbers are: ATOC_ECOLI: Q06065; PSPF_ECOLI: P37344; YFHA_ECOLI: P21712; TYRR_HAEIN: P44694; TYRR_ECOLI: P07604; NTRC_ECOLI: P06713; YHGB_ECOLI: P38035; HYDG_ECOLI: P14375; FHLA_ECOLI: P19323; UHPB_ECOLI: P09835; BAES_ECOLI: P30847; NTRB_ECOLI: P06712; RTSB_ECOLI: P18392; YGIY_HAEIN: P45336; BASS_ECOLI: P30844; ATOS_ECOLI: Q06067; YJDH_ECOLI: P39272; ENVZ_ECOLI: P02933; ARCB_HAEIN: P44578; ARCB_ECOLI: P22763; PHOQ_ECOLI: P23837; CREC_ECOLI: P08401; CPXA_ECOLI: P08336; BARA_ECOLI: P26607; PHOR_ECOLI: P08400.

To look for possible smaller chromosomal fragments of homology, we further computed all the angles each segment joining two consecutive points would make with the horizontal line, and we analyzed their distribution.
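A minimal Python sketch of this angle computation is given below. It rests on several assumptions that are not stated in the text: "consecutive points" is taken to mean dots sorted by their E. coli position, both coordinates are in base pairs, wraparound at the origin is ignored, and the histogram classes are 5° wide. The function names are hypothetical.

```python
import math


def segment_angle(p, q):
    """Angle in degrees, within [-90, 90], between segment p-q and the horizontal."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    if dx == 0:
        return 90.0  # vertical segment
    return math.degrees(math.atan(dy / dx))


def angle_class(angle, width=5):
    """Lower bound of the half-open class [lower, lower + width) holding `angle`.

    Classes include their lower but not their upper bound, so an angle of
    exactly 90 degrees has no class of its own and is folded into -90.
    """
    if angle >= 90.0:
        angle -= 180.0
    return -90 + width * math.floor((angle + 90) / width)


def angle_histogram(points, width=5):
    """Count segment directions between consecutive dots (sorted by x)."""
    counts = {}
    ordered = sorted(points)
    for p, q in zip(ordered, ordered[1:]):
        key = angle_class(segment_angle(p, q), width)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

With this convention, two neighbouring orthologs that keep the same order and spacing in both genomes give a segment near 45° (or near −45° when the order is inverted), whereas dots scattered at random along the second chromosome give mostly steep segments near ±90°.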
Figure 7 shows two peaks at 90° and −90° corresponding to the theoretical curve of a random gene arrangement without any fragment conservation. Note that the peaks at −90° ... −85° are more important than those at 85° ... 90°, because the classes used to build our histogram include their lower but not their upper bounds, implying that the 90° angles are counted as −90°. Besides this, two peaks of significant size appear at positions around −45° and 45°. This angle distribution confirms that extensive chromosomal rearrangement events have shuffled the order of many genes, but that there are still short conserved fragments in chromosomes of present-day E. coli and H. influenzae.

FIG. 5.—Distribution of the numbers of paralogs in the large families. The number of paralogs in E. coli versus that in H. influenzae was calculated for each gene family containing at least seven members. The largest family, containing 353 proteins (78 Hin, 275 Eco), is not displayed for scaling reasons. White bars: E. coli paralogs. Black bars: H. influenzae paralogs.

We found 35 short conserved fragments corresponding to cotranscribed genes or association of neighboring cotranscribed genes. The respective number of genes goes from 2 (nine cases) to as many as 28 (corresponding to the three operons of ribosomal proteins, rpsM to rplG, rplN to rpmJ, and rpsJ to rpsQ, amounting in total to 13.3 kb). Another long fragment, amounting in total to 16.9 kb, corresponds to the three operons ftsLI, murE-ftsW, and murG-ddlB plus the neighboring genes ftsQ, ftsA, ftsZ, and lpxA and the two ORFs yabB and yabC. Notice also the eight genes of the histidine operon (hisG to hisI), or the two divergent operons glpQT and glpABC for transport and utilization of glycerol-3-phosphate. The complete list is available on request.
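The search for conserved clusters described in Materials and Methods (consecutive orthologs with no more than two ORFs between them) can be approximated by the following Python sketch. It is an illustration under simplifying assumptions and not the procedure actually used: genes are identified by their rank along each chromosome rather than by base-pair position, chromosome circularity is ignored, and the function name conserved_runs and its max_gap parameter are hypothetical.

```python
def conserved_runs(orthologs, max_gap=2):
    """Find runs of orthologs that stay adjacent in both genomes.

    `orthologs` is a list of (eco_rank, hin_rank) pairs, where each rank is the
    index of the gene along its chromosome. Two successive orthologs extend a
    run when at most `max_gap` ORFs separate them in each genome and the run
    keeps a single direction (same or inverted order) in H. influenzae.
    Returns a list of runs, each a list of at least two pairs.
    """
    runs, current, direction = [], [], 0
    for pair in sorted(orthologs):
        if current:
            d_eco = pair[0] - current[-1][0]
            d_hin = pair[1] - current[-1][1]
            sign = (d_hin > 0) - (d_hin < 0)
            close = (1 <= d_eco <= max_gap + 1) and (1 <= abs(d_hin) <= max_gap + 1)
            if close and direction in (0, sign):
                current.append(pair)
                direction = sign
                continue
            # The run is broken: keep it if it holds at least two orthologs.
            if len(current) >= 2:
                runs.append(current)
            current, direction = [], 0
        current.append(pair)
    if len(current) >= 2:
        runs.append(current)
    return runs
```

Called on the made-up ranks [(10, 200), (11, 201), (13, 203), (40, 90), (41, 89), (42, 88), (70, 5)], the function reports two runs of three genes each, one in the same order in both genomes and one inverted, and discards the isolated last pair.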
FIG. 6.—Plotting of the positions of orthologous genes on the chromosomal maps of E. coli and H. influenzae. Position 0 is set to the Thr locus (Eco) and the unique Not I restriction site (Hin). The gene positions are as given in the GenBank (accession number U00096) and TIGR databases (Internet address: http://www.tigr.org) for Hin, respectively. Open squares: genes on the same strand (+/+ or −/−). Closed squares: genes on opposite strands (+/− or −/+).

FIG. 7.—Distribution of the directions of the segments joining consecutive dots on the figure 6 plot. White bars: genes on the same strand (+/+ or −/−). Black bars: genes on opposite strands (+/− or −/+). The distributions expected if the genes were uniformly scattered on the chromosomal maps are shown by lines: genes on the same strand, dotted line; genes on opposite strands, continuous line.

Discussion

Comparing whole genomes has already been shown to be a powerful approach to determining biological features of poorly characterized organisms and will be more and more useful in the future as new complete genomes become accessible. This is apparent in pioneering studies (Tatusov et al. 1996; Karp, Ouzounis, and Paley 1996) which tried to determine what could be the metabolism of H. influenzae by comparing the whole set of putative proteins from this bacterium to the known (sometimes well-known) enzymes from E. coli (Riley 1993; Neidhardt et al. 1996). We used this new approach to better understand how two bacteria may have diverged from a recent common ancestor. Our attempt aims at deducing how a limited number of changes in the repertoire of genes could have such irreversible effects on the ways of life of prokaryotic organisms. Indeed, taken together, all of our results strongly suggest that the last common ancestor to E. coli and H. influenzae was an organism having a genome size and a way of life similar to present-day E. coli.
This working hypothesis is supported by the comparison, at the qualitative as well as the quantitative level, of the respective sets of proteins present in E. coli and H. influenzae.

Timing of Duplication Events Versus Speciation

Comparison of the different sets of proteins present in both bacteria led to the following conclusions. Nearly all of the genes found to be duplicated in both bacteria were already duplicated in their last common ancestor. This is shown by comparing the respective distributions of evolutionary distances of orthologs and paralogs (fig. 1) and by looking at the phylogenetic trees reconstructed for each family of paralogs (e.g., figs. 2–4). The topology of the different trees strongly suggests that the duplication events occurred before the emergence of the last common ancestor of both bacteria. However, we could not exclude the alternative hypothesis that the apparently ancient divergence found between copies of paralogous proteins was actually due to an acceleration of divergence caused by a lowering of the selection pressure. Such a difference in evolutionary rate appeared in a few trees (e.g., family 817 in fig. 2). To settle this point, we analyzed a subset of trees containing two members of each bacterium. An acceleration (if any) of the rate of divergence may be directly estimated by measuring the highest ratio of the sums of the distal branch lengths. The mean of these highest ratios for the subset of 39 families was only 1.36, a value too low for unequal rates of divergence to explain the roughly fourfold difference observed in figure 1 between paralogs (peak centered at 160 PAM units) and orthologs (peak centered at 40 PAM units). Thus, nearly all of the duplications observed in present-day organisms were already present in the last common ancestor of E. coli and H. influenzae. We conclude that this putative bacterium had a genome size similar to (or perhaps larger than) that of E. coli.

The Reduction of Gene Number in H. influenzae: A Directed Process?

After emergence from the last common ancestor, H. influenzae went through an intense streamlining which reduced its genome size by a factor of about 2.6. Comparison of the different classes of genes present in both bacteria suggests that this streamlining process has been directed. Table 1 shows the two following features:

1. Haemophilus influenzae lost many sequences which are still present in one copy in E. coli, as well as some entire gene families. An example is the unexpected loss of the genes coding for some important enzymes of the central metabolism (Fleischmann et al. 1995; Karp, Ouzounis, and Paley 1996; Tatusov et al. 1996). Examples of families lost in H. influenzae and present in E. coli are genes coding for proteins necessary for chemotaxis or fimbriae synthesis, or for enzymes such as a dehydrogenase family. The progressive adaptation to parasitic life may have made these genes dispensable. Alternatively, the accidental loss of these genes could have been a stimulus for adopting such a way of life. However, H. influenzae harbors a minor set of genes which are not present in E. coli. Many belong to families made up solely of ORFs without any function detectable by sequence similarity. There are also some genes specific to H. influenzae, such as two genes coding for competence factors (Williams, Bannister, and Redfield 1994; Zulty and Barcak 1995) used in the very sophisticated transformation process of this bacterium (Tomb et al. 1989; Smith et al. 1995).

2. The dramatic reduction in the number of genes observed in the case of H. influenzae is also due to a loss in the families of paralogs which seems to have paralleled the global diminution in genome size (see, for example, fig. 4). This loss appears to have been selective according to family size: it occurred rarely in small families but has often been massive in large ones.
Indeed, families having fewer than 10 members contain an average of 36.6% of H. influenzae proteins, compared with only 22.6% in families having more than 10 members. Again, this could be partially explained in terms of adaptation. Many small families correspond to either enzymes which may be indispensable to the H. influenzae metabolism or structural proteins essential for cell integrity and cell division (figs. 2 and 3). Many large families correspond to membrane transporters or transcriptional regulators (fig. 4). Escherichia coli (and also the last common ancestor) should have a great selective advantage in maintaining a large panel of transporters and regulators adapted to many different environmental conditions, while such a large repertoire could be superfluous (and may even be a burden) to a strict parasite such as H. influenzae.

Therefore, these data strongly support the already proposed hypothesis (Fleischmann et al. 1995; Tatusov et al. 1996) that the streamlining of the H. influenzae genome corresponded to the progressive adaptation toward a parasitic life.

“Recent” Evolution of Bacterial Chromosomes: The Role of Map Rearrangements

Recombinational events are found to be very frequent in bacteria and lead to large-scale chromosomal rearrangements. This is the case when comparing closely related bacteria such as E. coli and Salmonella typhimurium (for an overview, see Roth et al. 1996), but this phenomenon may also occur inside the same species (Dykhuizen and Green 1991; Milkman 1996), or even inside a large collection of E. coli K12 substrains (Perkins et al. 1993). Therefore, it was not surprising to find so few long chromosomal fragments which are still conserved between E. coli and H. influenzae.
The memory of an ancient gene order must have been rapidly blurred when many genes underwent apparently erratic, and probably mostly neutral, displacements to any position on the chromosome after the separation of the lines of descent leading to these two bacteria. Thus, it may be significant that some cases of synteny remain between these two bacteria despite such strong divergence of their chromosomal maps. Interestingly, while this paper was in the revision stage, a paper appeared (Tamames et al. 1997) that described the occurrence of conserved clusters of functionally related genes by comparing a subset of proteins from E. coli and the whole set of H. influenzae proteins. It would be interesting to see if such conservation of gene order has occurred in even more distant species. A remarkable case of considerable synteny has recently been shown between the yeast Saccharomyces cerevisiae and another fungus, Ashbya gossypii (Altmann-Jöhl and Philippsen 1996), although no apparent synteny occurs between S. cerevisiae and Schizosaccharomyces pombe (Goffeau et al. 1996). Therefore, the comparison of entire genomes of various species, both closely and distantly related, will help to progressively define the evolutionary forces at work in the shaping of a genome.

Acknowledgments

We thank Hervé Philippe for helpful suggestions and Purificacion Lopez-Garcia for critical reading of this manuscript.

LITERATURE CITED

ALTMANN-JÖHL, R., and P. PHILIPPSEN. 1996. AgTHR4, a new selection marker for transformation of the filamentous fungus Ashbya gossypii, maps in a four-gene cluster that is conserved between A. gossypii and Saccharomyces cerevisiae. Mol. Gen. Genet. 250:69–80.
ALTSCHUL, S. F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555–565.
BAIROCH, A., and R. APWEILER. 1996. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Res. 24:21–25.
BENNER, S. A., M. A. COHEN, and G. H. GONNET. 1993. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 229:1065–1082.
BERLYN, K. B., K. BROOKS LOW, and K. E. RUDD. 1996. Linkage map of Escherichia coli K12, edition 9. Pp. 1175–1209 in F. C. NEIDHARDT, R. CURTISS III, E. C. C. LIN, J. INGRAHAM, K. BROOKS LOW, B. MAGASANIK, W. REZNIKOFF, M. RILEY, M. SCHAECHTER, and H. E. UMBARGER, eds. Escherichia coli and Salmonella, cellular and molecular biology. 2nd edition. ASM Press, Washington, D.C.
BRENNER, S. E., T. HUBBARD, A. MURZIN, and C. CHOTHIA. 1995. Gene duplications in H. influenzae. Nature 378:140.
DAYHOFF, M. O., R. M. SCHWARTZ, and B. C. ORCUTT. 1978. A model of evolutionary change in proteins. Pp. 345–352 in M. O. DAYHOFF, ed. Atlas of protein sequence and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
DYKHUIZEN, D. E., and L. GREEN. 1991. Recombination in Escherichia coli and the definition of biological species. J. Bacteriol. 173:7257–7268.
FITCH, W. M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113.
FLEISCHMANN, R. D., M. D. ADAMS, O. WHITE et al. (40 co-authors). 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512.
GOFFEAU, A., B. G. BARRELL, H. BUSSEY et al. (16 co-authors). 1996. Life with 6000 genes. Science 274:546–567.
GONNET, G. H., M. A. COHEN, and S. A. BENNER. 1992. Exhaustive matching of the entire protein sequence database. Science 256:1443–1445.
HOCH, J. A., and T. J. SILHAVY. 1995. Two-component signal transduction. ASM Press, Washington, D.C.
JOHNSON, M. S., and J. P. OVERINGTON. 1993. A structural basis for sequence comparisons: an evaluation of scoring methodologies. J. Mol. Biol. 233:716–738.
KARP, P. D., C. OUZOUNIS, and S. M. PALEY. 1996. HinCyc: a knowledge base of the complete genome and metabolic pathways of H. influenzae. Pp. 116–124 in Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, Menlo Park, CA. AAAI Press, St. Louis, Mo.
LABEDAN, B., and M. RILEY. 1995a. Widespread protein sequence similarities: origins of E. coli genes. J. Bacteriol. 177:1585–1588.
LABEDAN, B., and M. RILEY. 1995b. Gene products of Escherichia coli: sequence comparisons and common ancestries. Mol. Biol. Evol. 12:980–987.
MILKMAN, R. 1996. Recombinational exchange among clonal populations. Pp. 2663–2684 in F. C. NEIDHARDT, R. CURTISS III, E. C. C. LIN, J. INGRAHAM, K. BROOKS LOW, B. MAGASANIK, W. REZNIKOFF, M. RILEY, M. SCHAECHTER, and H. E. UMBARGER, eds. Escherichia coli and Salmonella, cellular and molecular biology. 2nd edition. ASM Press, Washington, D.C.
MOXON, E. R. 1992. Molecular basis of invasive Haemophilus influenzae type b disease. J. Infect. Dis. 165(Suppl. 1):S77–S81.
MUSSER, J. M., J. S. KROLL, D. M. GRANOFF, E. R. MOXON, B. R. BRODEUR, J. CAMPOS, H. DABERNAT, W. FREDERIKSEN, J. HAMEL, and G. HAMMOND. 1990. Global genetic structure and molecular epidemiology of encapsulated Haemophilus influenzae. Rev. Infect. Dis. 12:75–111.
NEIDHARDT, F. C., R. CURTISS III, E. C. C. LIN, J. INGRAHAM, K. BROOKS LOW, B. MAGASANIK, W. REZNIKOFF, M. RILEY, M. SCHAECHTER, and H. E. UMBARGER, eds. 1996. Escherichia coli and Salmonella, cellular and molecular biology. 2nd edition. ASM Press, Washington, D.C.
OLSEN, G. J., C. R. WOESE, and R. OVERBEEK. 1994. The winds of evolutionary change: breathing new life into microbiology. J. Bacteriol. 176:1–6.
PERKINS, J. D., J. D. HEATH, B. R. SHARMA, and G. M. WEINSTOCK. 1993. XbaI and BlnI genomic cleavage maps of Escherichia coli K-12 strain MG1655 and comparative analysis of other strains. J. Mol. Biol. 232:419–445.
RILEY, M. 1993. Functions of the gene products of Escherichia coli. Microbiol. Rev. 57:862–952.
RILEY, M., and B. LABEDAN. 1997. Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of structural segment of homology, the module. J. Mol. Biol. 269:1–12.
ROTH, J. R., N. BENSON, T. GALITSKI, K. HAACK, J. G. LAWRENCE, and L. MIESEL. 1996. Rearrangements of the bacterial chromosome: formation and applications. Pp. 2256–2276 in F. C. NEIDHARDT, R. CURTISS III, E. C. C. LIN, J. INGRAHAM, K. BROOKS LOW, B. MAGASANIK, W. REZNIKOFF, M. RILEY, M. SCHAECHTER, and H. E. UMBARGER, eds. Escherichia coli and Salmonella, cellular and molecular biology. 2nd edition. ASM Press, Washington, D.C.
SMITH, H. O., J.-F. TOMB, B. A. DOUGHERTY, R. D. FLEISCHMANN, and J. C. VENTER. 1995. Frequency and distribution of DNA uptake signal sequences in the Haemophilus influenzae Rd genome. Science 269:538–540.
SMITH, T. F., and M. S. WATERMAN. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195–197.
SOLIGNAC, M., C. PERIQUET, D. ANXOLABEHERE, and C. PETIT. 1995. Génétique et évolution. Hermann, Paris.
TAMAMES, J., G. CASARI, C. OUZOUNIS, and A. VALENCIA. 1997. Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol. 44:66–73.
TATUSOV, R. L., A. R. MUSHEGIAN, P. BORK, N. P. BROWN, W. S. HAYES, M. BORODOVSKY, K. E. RUDD, and E. V. KOONIN. 1996. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Curr. Biol. 6:279–291.
TOMB, J. F., G. J. BARCAK, M. S. CHANDLER, R. J. REDFIELD, and H. O. SMITH. 1989. Transposon mutagenesis, characterization and cloning of transformation genes of Haemophilus influenzae Rd. J. Bacteriol. 171:3796–3802.
VOGT, G., T. ETZOLD, and P. ARGOS. 1995. An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 249:816–831.
WILLIAMS, P. M., L. A. BANNISTER, and R. J. REDFIELD. 1994. The Haemophilus influenzae sxy-1 mutation is in a newly identified gene essential for competence. J. Bacteriol. 176:6789–6794.
ZULTY, J. J., and G. J. BARCAK. 1995. Identification of a DNA transformation gene required for com101A1 expression and supertransformer phenotype in Haemophilus influenzae. Proc. Natl. Acad. Sci. USA 92:3616–3620.

MANOLO GOUY, reviewing editor

Accepted October 8, 1997