Coevolution of DNA-Interacting Proteins and Genome “Dialect”

A. Paz, V. Kirzhner, E. Nevo, and A. Korol

Institute of Evolution, University of Haifa, Mount Carmel, Haifa, Israel

Several species-specific characteristics of genome organization that are superimposed on its coding aspects were proposed earlier, including genome signature (GS), genome accent, and compositional spectrum (CS). These notions could be considered as representatives of genome dialect (GD). Within the Proteobacteria, we measured several GD representatives: the relative abundance of dinucleotides (GS), the profiles of occurrence of 10-nucleotide words (CS), and the profiles of occurrence of 20-nucleotide words in a degenerate two-letter alphabet (purine-pyrimidine compositional spectra [PPCS]). Here, we show that the evolutionary distances between DNA repair and recombination orthologous enzymes (especially those of the nucleotide excision repair system) are highly correlated with PPCS and GS distances. Orthologous proteins involved in structural or metabolic processes (control group) have significantly lower correlations of their evolutionary distances with the PPCS and GS distances. We hypothesize that the high correlation of the evolutionary distances of the DNA repair orthologous enzymes with their GD is a result of the coevolution of the DNA repair enzymes’ structures and GDs. Species GDs could be substantially influenced by the function of DNA polymerase I (the major bacterial DNA repair polymerase). This might cause the differentiation of species GDs to correlate with evolutionary changes in species DNA polymerase I. Simultaneously, the structures of DNA repair-recombination enzymes might be evolutionarily sensitive and responsive to changes in the structure of their substrate, the DNA (including those that are represented by GD differentiation). We further discuss the rationale and mechanisms of the hypothesized coevolution.
We suggest that stress might be an important cause of changes in the repair-recombination genes and the GD and the trigger of the aforementioned coevolution process. Other triggers might be massive horizontal gene transfer and ecological selection.

Introduction

Since the 1980s, the method of choice for estimating evolutionary distances between species has been the sequence comparison of their 16S or 18S rRNAs (Fox et al. 1980; Woese 1987). The advantage of rRNA for phylogenetic reconstructions is its slow change and putative resistance to lateral gene transfer (Gogarten, Hilario, and Olendzenski 1996; Jain, Rivera, and Lake 1999). Phylogenetic reconstructions can also be based on sequences of slowly evolving orthologous proteins. However, many protein-based tree topologies were incongruent with those based on rRNA (reviewed by Brocchieri 2001). These and other difficulties in using various sequence families for phylogenetic reconstructions call for methods of species comparison at the entire-genome level. In the last decades, new approaches for measuring species distances were suggested (Nussinov 1984; Brendel, Beckmann, and Trifonov 1986; Burge, Campbell, and Karlin 1992; Kirzhner et al. 2002). In particular, species-specific “genome signature” (GS) analysis based on di- and/or trinucleotide relative abundances was widely used (Karlin, Mrázek, and Campbell 1997; Campbell, Mrazek, and Karlin 1999; Gentles and Karlin 2001; Coenye and Vandamme 2004). Similarly, “compositional spectrum” (CS) analysis, which characterizes genomes in terms of the distribution of frequencies of imperfectly matching words (length ~10–20 letters), was developed in our lab (Kirzhner et al. 2002, 2003). These species-specific characteristics of genome structure are superimposed on its protein-coding aspects and exhibit short-range patterns of DNA organization that are dispersed throughout the genome.
The intergenome differences are, in a sense, analogous to variations in English pronunciation among people of different nationalities and can in total be referred to as a “genome dialect” (GD) (Forsdyke and Mortimer 2000). When dealing with large genomes of eukaryotic species, the two methods of characterizing GD, CS and GS, can use randomly chosen samples of genomic “texts” rather than whole genomes (Gentles and Karlin 2001; Kirzhner et al. 2003). A small genome can easily be sampled as a whole. In this study, we used dinucleotide GS and two versions of CS. In the first version of CS, 10mer “words” in the four-letter alphabet were employed. In the second version, 20-nucleotide words were used in the “degenerate” two-letter alphabet (R = purine and Y = pyrimidine), resulting in purine-pyrimidine compositional spectra (PPCS).

Our major hypothesis is that the evolution of GD, as defined above, is strongly affected by some DNA template–dependent processing proteins (DTDPs), their structure, and their “preferences” (see also Forsdyke 1995). Moreover, we suggest that the structure of the DTDPs that “read” and “write” the DNA text may, in turn, be influenced indirectly (via evolutionary tuning) by the abundant DNA structures of the species. Namely, after a significant change in species GD caused by some trigger (see below), there might also be a need for a structural change of its DNA repair enzymes. It is possible that some mutations that occur in these genes after a change in the GD will be evolutionarily advantageous.

Key words: genome dialect, genome signature, compositional spectrum, DNA repair, stress.
E-mail: [email protected].
Mol. Biol. Evol. 23(1):56–64. 2006. doi:10.1093/molbev/msj007. Advance Access publication September 8, 2005.
© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]
We assumed that the major bacterial replicative DNA polymerase, DNA polymerase III, known for its high fidelity in replication, and the major DNA repair polymerase, DNA polymerase I (also a high-fidelity polymerase), are better adapted to the predominant compositional organization of their genome than to rarely occurring sequence patterns. As an illustration of this idea, one can consider the known results on the DNA polymerase of the T4-related phage RB69 (Bebenek et al. 2001). Base pair substitution hot spots produced by the intact phage DNA polymerase in vivo tended to occur at certain specific GC-rich 6mers and, especially, at GG/CC dimers in the T4 rI gene, a particularly striking tendency because of the low guanine plus cytosine (GC%) content of the T4 genome (35.3%). Earlier, Karlin and coworkers hypothesized that species-specific GS (peculiarities of oligonucleotide relative abundances) may be substantially affected by the DNA replication and repair systems (Burge, Campbell, and Karlin 1992; Karlin, Mrázek, and Campbell 1997). In particular, it was suggested that species-specific over- and underrepresentation of short oligonucleotides is related to the deoxycytosine methylase/very short patch repair system (Burge, Campbell, and Karlin 1992; Lieb and Bhagwat 1996). An experimental test of the effect of bacterial DNA polymerases I, II, and III on the species GS was also proposed (Karlin, Mrázek, and Campbell 1997), although such a test has not been conducted. The objective of the current in silico investigation is to test whether species differentiation with respect to (1) GDs and (2) the primary structure of their DNA template–dependent enzymes is interconnected.

FIG. 1. Factors presumably involved in the coevolution of GD and repair-recombination enzymes.
If this is indeed the case, several explanations of such a correlation can be considered: (1) DNA polymerase may play a substantial role in the evolution of its product (the DNA sequence); (2) the structure of repair-recombination enzymes might be evolutionarily more sensitive and “responsive” to changes in the GD (i.e., abundant DNA structures) than proteins involved in structural and metabolic processes (SMPs); (3) mutations that are preserved in the DNA and change the GD are those that improve the DNA sequence organization as a target of repair-recombination (Forsdyke 1995). It is also possible that changes in the DTDP genes and changes in DNA across the whole genome (changes to the GD) happen at the same time, e.g., due to genomic stress (see below). Hypothetical triggers of dialect-DTDP coevolution are shown in figure 1. Closely related species from contrasting environmental conditions may differ in codon usage and/or preferred amino acids. An example is the high preference for purines and purine tracts in mRNA displayed by thermophilic prokaryotes compared to mesophilic species of the same life domains (Paz et al. 2004), resulting in GD differences between the thermophiles and mesophiles. Coevolution of DTDP sequences and GD might also happen after genomic cataclysms caused by massive horizontal transfer. Acute genomic stress might be a third possible (and, presumably, most plausible) trigger for the postulated scenario.

FIG. 2. The roles of selected DTDPs in the cells of Proteobacteria species: transcription, replication, repair, and recombination and antirecombination. Some of these enzymes are involved in more than one of these four processes, as shown with the grouping lines.

In order to test the proposed hypothesis, we estimated the correlation between distance matrices built using GD representatives and distance matrices based on the alignment of orthologous proteins of a certain group of DTDPs (figs. 2 and 3).
As a control to the DTDPs, a group of SMPs was employed. The expectation was that the DTDP distances would display a higher correlation with the distances measured for GD than the control group distances would. The DTDPs selected for our comparisons have different roles in transcription, replication, repair, and recombination (fig. 2). Some DTDPs are known to be involved in more than one of these four processes: DNA polymerase III beta chain (DnaN) also participates in repair processes in addition to its role in replication (López de Saro and O’Donnell 2001; Becherel, Fuchs, and Wagner 2002); the mismatch repair (MMR) enzyme MutS, in antirecombination (Rayssiguier, Thaler, and Radman 1989; Schofield and Hsieh 2003); and the recombination enzymes RecQ and RecG, in antirecombination and repair (Courcelle and Hanawalt 2001; Gregg et al. 2002; Robu, Inman, and Cox 2004; N. Tuteja and R. Tuteja 2004). The error-prone DNA polymerases II, IV, and V were not selected because they are missing from the genomes of several species included in this study. We also did not include proteins involved in apurinic and apyrimidinic repair, such as the DNA glycosylases Ung and AlkA, or Phr, which is involved in direct repair of pyrimidine dimers.

FIG. 3. Evaluation of the interdependence of the species distances for GD and DTDPs. As a control to the DTDP group, SMPs were taken.

FIG. 4. Examples of CS of genomic DNA of Proteobacteria species calculated based on the {R, Y} alphabet. As a reference, a stretch of Escherichia coli DNA (two million nucleotides) was employed; thus, the words of all other examples are ordered using the PPCS of this reference ranking (the second stretch of the E. coli genome shown in the example is the rest of the genome). The abscissa represents the words of the set W placed in some order (in accordance with the word’s frequency of appearance in the reference sequence), whereas the ordinate shows the observed frequencies F(W, S) of the words in the compared sequence S. The table under the PPCS graphs represents the pairwise distances between the compared species. Abbreviations: E.c 1, E. coli K-12 half-genome 1; E.c 2, E. coli K-12 half-genome 2; N.m, Neisseria meningitidis; S.t, Salmonella typhimurium; R.c, Rickettsia conorii.

Methods

Targeted Species Employed in the Study

Twenty Proteobacteria of the α, β, γ, and ε groups were investigated:
α: Agrobacterium tumefaciens, Caulobacter crescentus, Rickettsia conorii, Rickettsia prowazekii;
β: Neisseria meningitidis;
γ: Buchnera aphidicola, Escherichia coli K-12, Haemophilus influenzae, Pseudomonas aeruginosa, Pasteurella multocida, Salmonella enterica, Shigella flexneri, Shewanella oneidensis, Salmonella typhimurium, Vibrio cholerae, Xanthomonas campestris, Xylella fastidiosa, Yersinia pestis; and
ε: Campylobacter jejuni, Helicobacter pylori.

Targeted Proteins Employed in the Study

Altogether, sequences of 40 proteins were selected in such a way that most of them have orthologues in the vast majority of the 20 Proteobacteria species. As shown in figure 2 and table 3, the list included RNA polymerase subunits, replication apparatus enzymes, and DNA repair and recombination enzymes. Dual or multiple functions of some of these enzymes are shown in figure 2, and the references that showed or suggested additional roles of the enzymes are cited in the text. The remaining 20 SMPs served as a control group.
The SMPs were drawn from the following functional categories: protein synthesis, RpLA, RpLB, RpSB, RpSC, TufA, AlaS; amino acid biosynthesis, AroK, DapA; biosynthesis of cofactors, prosthetic groups, and carriers, FolA, LipA, RibE, BioB; energy metabolism, Pgk, Eno, GapA, AtpA; pyrimidine biosynthesis, arginine biosynthesis, and urea cycle, CarA, CarB; protein fate, Ffh; and cell envelope, MurA. Protein sequences and 16S rRNA sequences were obtained from publicly available databases (http://www.ncbi.nim.gov/entrez/query.fegi?db=Protein and http://www.ncbi.nim.gov/entrez/query.fegi?db=Nucleotide). The accession numbers and IDs of the 40 orthologous proteins included in this study are given in the supplemental data (Table 4, Supplementary Material online).

Calculating Species Distance for Orthologous Proteins and 16S rRNA

The Blast 2 sequences comparison tool was used to calculate the distances for orthologous proteins with standard alignment parameters (matrix: BLOSUM62; gap open: 11; gap extension: 1; expect: 10; word size: 3). Distances between orthologous proteins were determined as follows: distance = 100 − (the percentage of identical amino acids after Blast 2 sequences alignment).

Calculating CS of DNA Sequences

Definition

Let us take a set W of n different oligonucleotides (words) w_i of length L in any alphabet, e.g., the standard {A, T, G, C} or {R, Y} (purine, pyrimidine). For each word w_i of the set W and any chosen large sequence S, one can calculate the observed number of matches m_i = m(w_i), allowing for a preset number r of mismatches (e.g., r = 2). This approximate matching can be denoted as “r-mismatching.” Now let M = Σ_i m_i. The frequency distribution F(W, S) of f_i = m_i/M will be referred to as the CS of the sequence S relative to the set W (Kirzhner et al. 2002, 2003). The word sets can be produced using a random generator.
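The CS definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`to_purine_pyrimidine`, `count_matches`, `compositional_spectrum`, `random_words`) are hypothetical, only one DNA strand is scanned, and the brute-force search is meant for short demonstration sequences rather than whole genomes.

```python
import random

PURINES = {"A", "G"}


def to_purine_pyrimidine(seq):
    """Recode A/G as R (purine) and every other letter as Y (pyrimidine)."""
    return "".join("R" if base in PURINES else "Y" for base in seq.upper())


def count_matches(word, seq, r):
    """Count the positions where `word` matches `seq` with at most r mismatches."""
    length = len(word)
    count = 0
    for start in range(len(seq) - length + 1):
        mismatches = 0
        for a, b in zip(word, seq[start:start + length]):
            if a != b:
                mismatches += 1
                if mismatches > r:
                    break
        else:
            # The inner loop finished without exceeding r mismatches.
            count += 1
    return count


def compositional_spectrum(words, seq, r):
    """Return the frequencies f_i = m_i / M for the word set, in word order."""
    counts = [count_matches(w, seq, r) for w in words]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(words)
    return [m / total for m in counts]


def random_words(n, length, alphabet, seed=0):
    """Draw n distinct random words; the seed is an illustrative assumption."""
    rng = random.Random(seed)
    words = set()
    while len(words) < n:
        words.add("".join(rng.choice(alphabet) for _ in range(length)))
    return sorted(words)
```

For a PPCS-style spectrum, the sequence would first be recoded with `to_purine_pyrimidine` and the words drawn from the alphabet "RY", for example `compositional_spectrum(random_words(200, 20, "RY"), to_purine_pyrimidine(genome), 2)`, where `genome` stands for any DNA string supplied by the reader.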
In the current study, we employed word lengths L = 10 and L = 20 for CS analysis based on four-letter and two-letter alphabets, respectively (resulting in equal sizes of the corresponding “vocabularies,” because 4^10 = 2^20 = 1,048,576 possible words).

Visualizing CS and Measuring Genome Similarities

Our previous studies (Kirzhner et al. 2002, 2003) indicate that a random set of 200 words is sufficient to compare the CS F(W, S) of different parts of the same genome or of different genomes. Figure 4 illustrates the same idea using examples from the list of Proteobacteria employed in this study, with CS calculated for the degenerate (purine-pyrimidine) alphabet (PPCS). In order to display spectra, we ranked the words according to the frequency of their appearance in one of the sequences S (reference ranking). The intuitive impressions about inter- or intragenomic similarities and dissimilarities can be supported by distance metrics based on the Spearman rank correlation ρ. The quantity d = (1 − ρ)/2 (0 ≤ d ≤ 1) can be considered as the distance between two spectra. The maximal distance d = 1 corresponds to strictly reversed CS, whereas the minimal distance of 0 corresponds to identically ordered spectra. It should be stressed that the CS distances between species are independent of the randomly chosen sets of words (although this does not exclude sampling variation).

Calculating Species GS

Definition

Relative abundance of dinucleotides: Let f_X denote the frequency of the nucleotide X in a DNA sequence S and f_XY the frequency of the dinucleotide XY in S. Then, XY is considered of high (low) relative abundance, compared with a random association of the component mononucleotides X and Y in S, if the odds ratio f_XY/(f_X f_Y) is sufficiently greater (lower) than 1. Based on the above indices, some distance measures were proposed (see Karlin and Cardon 1994) to estimate the closeness of compared sequences.
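The two distance notions described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, tied frequencies receive average ranks, the odds ratios are computed on a single strand, and `signature_distance` uses the mean absolute difference over the 16 dinucleotides, which is assumed here as one plausible choice among the measures the text attributes to Karlin and Cardon (1994).

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("a constant spectrum has no defined rank correlation")
    return cov / (vx * vy) ** 0.5


def spectrum_distance(f1, f2):
    """Distance d = (1 - rho) / 2 between two spectra over the same word set."""
    return (1 - spearman_rho(f1, f2)) / 2


def dinucleotide_odds_ratios(seq):
    """Odds ratio f_XY / (f_X * f_Y) for each of the 16 dinucleotides."""
    seq = seq.upper()
    mono = {b: seq.count(b) / len(seq) for b in "ACGT"}
    pairs = len(seq) - 1
    counts = {}
    for i in range(pairs):
        counts[seq[i:i + 2]] = counts.get(seq[i:i + 2], 0) + 1
    ratios = {}
    for x, y in product("ACGT", repeat=2):
        expected = mono[x] * mono[y]
        observed = counts.get(x + y, 0) / pairs
        # Undefined when a nucleotide is absent from the sequence.
        ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios


def signature_distance(r1, r2):
    """Mean absolute difference of odds ratios (an assumed distance measure)."""
    return sum(abs(r1[k] - r2[k]) for k in r1) / len(r1)
```

Both spectra passed to `spectrum_distance` must list frequencies for the same words in the same order; sequences that lack one of the four nucleotides yield undefined (NaN) odds ratios and should be filtered before calling `signature_distance`.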
Correction for Time of Divergence

Correlation between the two groups of characteristics, (1) species distances based on GD and (2) distances based on protein sequences, may be driven by functional interdependence, but the time elapsed since species divergence may also be a factor, or even the major shared factor, contributing to the observed correlation between the two measures of species distance. In other words, it could be expected that the earlier two species diverged, the larger both kinds of distance between them would be, so that the different distance measures would be correlated even without any functional link. Despite this reasoning, we found that protein sequences of the repair-recombination group show significantly higher correlation with the GD characteristics than the remaining set of tested proteins. Nevertheless, we attempted to take the time factor into account. Namely, we represented the “time elapsed from divergence” of any two species by the sequence comparison of their 16S rRNA. To compensate for the influence of this factor on pairwise correlations, we recalculated the correlations between the discussed groups of proteins and GD by using a two-factor regression, with dialect distance and “time” (distance in 16S rRNA) as the independent variables and protein distance as the dependent variable.

Results

Some examples of distances between the species calculated for orthologous proteins and for different GD representatives are shown in table 1 (for the whole data set, see Table 5 in Supplementary Material online). As expected, orthologous DTDPs tend to display, on average, a larger evolutionary distance between the species than orthologous SMPs: the mean distance was 48.34 ± 1.92 and 38.79 ± 1.72 for DTDPs and SMPs, respectively. But in both groups there are proteins with small distances and proteins with large distances between the orthologues.
For example, in the DTDP group, the average distances between orthologous UvrA and UvrB were 35.48 and 37.08, respectively, while the distances between orthologous DnaNs are much larger (average 58.06). Similarly, within the SMP group, the average distance between orthologous AtpAs is small (33.67), while the average distance between orthologous FolAs is large (55.1).

Distance Measures Between Orthologous DNA Repair Proteins Show High Correlation with Distance Measures of the GD for Both PPCS and GS

A positive, albeit low, correlation between the species GC% content distances and all 40 protein distances was found (table 2, column 3). Similar results were obtained for the CS distances (column 4). Presumably, the latter may result from the dependence of CS measures on the species GC% content. We also tried a modified version of CS, which should not be sensitive to differences in species GC% content, by reducing the alphabet to purine-pyrimidine (resulting in PPCS). Although the correlations between many protein distances and PPCS distances were moderate, some DTDPs, especially those involved in DNA repair and recombination, showed high correlations (table 2, column 5). In fact, 9 out of the 10 highest correlations were displayed by the group of 11 repair-recombination proteins, and only one was from the group of 20 control proteins (the difference was significant at P < 5 × 10^-5 by Fisher’s exact test for a 2 × 2 table). Correlations of the distances of most proteins with GS distances were high but seemingly less discriminative between the repair-recombination proteins and all other chosen proteins (column 6). Additional instructive comparisons between different groups and subgroups of the chosen 40 proteins, with respect to the correlation of their distances with GD distances, were conducted using the nonparametric Mann-Whitney test for both PPCS and GS as dialect characteristics (table 3, columns 2 and 3).
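The Fisher’s exact test quoted above (9 of the 10 highest PPCS correlations falling among the 11 repair-recombination proteins, and 1 among the 20 controls) can be checked with a short Python sketch. The 2 × 2 layout used here, repair-recombination versus control proteins against top-10 versus not top-10, is an assumption about how the table was built, and the function name is hypothetical; only the one-sided tail is computed.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """P(top-left cell >= a) for the 2 x 2 table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution; the tail is summed from the observed value upward.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(row2, col1 - k) / total
    return p


# Assumed layout: rows = repair-recombination (11) and control (20) proteins,
# columns = inside and outside the ten highest PPCS correlations.
p_value = fisher_exact_one_sided(9, 2, 1, 19)
```

Under this reading the one-sided probability is about 2.5 × 10^-5, which is consistent with the reported bound of P < 5 × 10^-5.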
As already mentioned, DTDP distances are more highly correlated with PPCS distances than control protein distances are (the difference is significant at P < 0.033). However, the two groups cannot be distinguished by the correlations of their protein distances with GS distances (P > 0.3). A clearer differentiation was found when the subgroup of repair-recombination proteins, rather than the total DTDP group, was compared to the control: in this case, the differences in correlation to PPCS and GS were significant at P < 0.0008 and P = 0.083, respectively. Comparison of repair-recombination proteins to an extended control that included the remaining 29 proteins showed even higher significance (P = 0.00001 for PPCS and P = 0.016 for GS) (table 3). We further divided the repair-recombination group into “repair” and “recombination” subgroups. The repair subgroup showed higher significance than the recombination subgroup in its correlations to PPCS (P = 0.00032 for repair vs. control and P < 0.017 for recombination vs. control) and GS (P < 0.0406 for repair vs. control and P > 0.6 for recombination vs. control). It is interesting to note that the high correlation of the distances of the orthologous enzymes belonging to nucleotide excision repair (NER) with PPCS and GS distances is shared both by UvrA and UvrB (which have relatively low average distances, 35.48 and 37.08, respectively) and by PolA (DNA polymerase I), which has a high distance (53.07) between orthologues.

Table 1
Distances of Orthologous DTDPs and SMPs and Species GD Representatives

Name     N    CD(a)   LD(a)   AD(a)   Description
DTDPs
RpoA     20   0.00    71.73   40.53   RNA polymerase alpha subunit
RpoB     20   0.01    55.82   36.48   RNA polymerase beta subunit
RpoC     20   4.44    57.44   36.35   RNA polymerase beta prime subunit
DnaG     20   2.23    71.42   57.07   Primase
GyrA     20   4.89    67.72   45.97   DNA gyrase alpha chain
DnaE     20   1.46    76.1    52.17   DNA polymerase III alpha chain
DnaQ     20   0.00    78.72   53.91   DNA polymerase III epsilon chain
DnaB     20   0.00    71.91   49.70   Replicative DNA helicase
Rep      20   2.22    70.41   53.58   Replicative DNA helicase
DnaN     20   0.27    80.92   58.06   DNA polymerase III beta chain (clamp)
PolA     19   6.78    71.76   53.07   DNA polymerase I
UvrA     19   2.76    53.70   35.48   Excision nuclease subunit A
UvrB     19   4.16    65.46   37.08   Excision nuclease subunit B
UvrC     19   0.02    71.87   54.61   Excision nuclease subunit C
UvrD     20   1.80    70.05   52.46   DNA-dependent ATPase I and helicase II
MutS     20   5.62    79.39   54.71   DNA-binding, DNA-dependent ATPase
RecG     20   2.64    80.12   53.06   Holliday junction helicase
RecQ     17   4.04    70.85   54.12   Helicase
RuvA     19   5.91    74.63   56.46   Holliday junction DNA helicase
RecA     20   3.39    44.54   32.02   Recombinase DNA strand exchange
Average of DTDPs
              2.63    69.23   48.34
SMPs
RpLA     20   7.94    60.00   42.44   50S ribosomal protein L1
RpLB     20   3.29    52.72   34.72   50S ribosomal protein L2
RpSB     20   0.00    57.08   38.75   30S ribosomal protein S2
RpSC     20   1.40    56.66   33.12   30S ribosomal protein S3
TufA     20   9.89    36.18   25.90   Protein chain elongation factor EF-Tu
AlaS     20   6.05    65.65   47.29   Alanyl-tRNA synthetase
AroK     20   0.00    73.83   48.35   Shikimate kinase I
DapA     18   0.00    69.11   52.14   Dihydrodipicolinate synthase
FolA     20   0.00    76.86   55.10   Dihydrofolate reductase type I
LipA     18   0.00    58.02   37.50   Lipoic acid synthetase
RibE     18   0.00    73.20   55.21   Riboflavin synthase alpha chain
BioB     18   0.04    72.32   43.54   Biotin synthetase
Pgk      18   0.00    67.50   43.10   Phosphoglycerate kinase
Eno      18   6.48    53.77   35.49   2-Phosphoglycerate dehydratase (2-phospho-D-glycerate hydro-lyase)
GapA     18   3.92    63.37   43.32   Glyceraldehyde-3-phosphate dehydrogenase A
AtpA     18   3.50    49.50   33.77   Adenosine triphosphate synthase alpha chain
CarA     20   0.05    63.92   40.78   Carbamoyl-phosphate synthase, small subunit
CarB     18   0.003   51.79   33.16   Carbamoyl-phosphate synthase, large subunit
Ffh      18   4.63    63.80   42.74   GTP-binding export factor; binds to signal sequence, GTP, and RNA
MurA     20   2.86    59.00   43.51   UDP-N-acetylglucosamine 1-carboxyvinyltransferase
Average of SMPs
              2.50    61.21   38.79
GD representatives
GC%      20   0.114   40.30   14.04   Percentage of guanine plus cytosine within genomic DNA
CS(b)    20   0.00    82.00   31.93   Compositional spectra
PPCS(b)  20   0.005   85.00   27.53   Purine-pyrimidine compositional spectra
GS(c)    20   0.001   1.279   0.645   Genome signature

NOTE. Name, name of the protein in Escherichia coli; N, the number of orthologous proteins within the 20 selected Proteobacteria species; CD, the closest distance (between the two most identical orthologous sequences); LD, the largest distance (between the two most different orthologous sequences); AD, the average distance between orthologues; ATPase, adenosine triphosphatase; and GTP, guanosine triphosphate.
(a) Distance = 100 − (the percentage of identical amino acids), after Blast 2 sequences alignment.
(b) CS and PPCS distances are normalized to a scale ranging from 0 to 100.
(c) GS distances.

Table 2
Correlations Between the Distances of the Orthologous Proteins and Dialect Representatives

Name     GC%     CS      PPCS    GS      Description
DTDPs
RpoA     0.317   0.330   0.586   0.672   RNA polymerase alpha subunit
RpoB     0.341   0.357   0.602   0.699   RNA polymerase beta subunit
RpoC     0.291   0.324   0.555   0.673   RNA polymerase beta prime subunit
DnaG     0.420   0.381   0.552   0.669   Primase
GyrA     0.341   0.313   0.386   0.667   DNA gyrase alpha chain
DnaE     0.349   0.300   0.401   0.606   DNA polymerase III alpha chain
DnaQ     0.260   0.267   0.488   0.624   DNA polymerase III epsilon chain
DnaB     0.310   0.332   0.479   0.660   Replicative DNA helicase
Rep      0.375   0.364   0.590   0.670   Replicative DNA helicase
DnaN*    0.466   0.453   0.731   0.727   DNA polymerase III beta chain (clamp)
PolA*    0.493   0.498   0.722   0.754   DNA polymerase I
UvrA*    0.457   0.460   0.709   0.730   Excision nuclease subunit A
UvrB*    0.441   0.439   0.703   0.742   Excision nuclease subunit B
UvrC*    0.522   0.510   0.735   0.757   Excision nuclease subunit C
UvrD*    0.467   0.424   0.736   0.680   DNA-dependent ATPase I and helicase II
MutS     0.387   0.377   0.623   0.721   DNA-binding, DNA-dependent ATPase
RecG*    0.527   0.436   0.666   0.693   Holliday junction helicase
RecQ*    0.474   0.534   0.681   0.671   Helicase
RuvA     0.417   0.345   0.603   0.633   Holliday junction DNA helicase
RecA*    0.472   0.477   0.672   0.742   Recombinase DNA strand exchange
SMPs
RpLA     0.429   0.367   0.505   0.689   50S ribosomal protein L1
RpLB     0.327   0.352   0.580   0.671   50S ribosomal protein L2
RpSB     0.309   0.311   0.470   0.581   30S ribosomal protein S2
RpSC     0.356   0.35    0.527   0.631   30S ribosomal protein S3
TufA     0.201   0.213   0.231   0.462   Protein chain elongation factor EF-Tu
AlaS     0.486   0.41    0.665   0.63    Alanyl-tRNA synthetase
AroK     0.348   0.373   0.508   0.748   Shikimate kinase I
DapA     0.297   0.315   0.461   0.527   Dihydrodipicolinate synthase
FolA     0.509   0.457   0.558   0.759   Dihydrofolate reductase type I
LipA     0.509   0.396   0.476   0.61    Lipoic acid synthetase
RibE     0.396   0.426   0.486   0.666   Riboflavin synthase alpha chain
BioB*    0.389   0.435   0.689   0.606   Biotin synthetase
Pgk      0.442   0.421   0.577   0.797   Phosphoglycerate kinase
Eno      0.44    0.479   0.621   0.753   2-Phosphoglycerate dehydratase (2-phospho-D-glycerate hydro-lyase)
GapA     0.302   0.341   0.528   0.636   Glyceraldehyde-3-phosphate dehydrogenase A
AtpA     0.359   0.304   0.453   0.615   ATP synthase alpha chain
CarA     0.498   0.490   0.640   0.747   Carbamoyl-phosphate synthase, small subunit
CarB     0.504   0.483   0.654   0.719   Carbamoyl-phosphate synthase, large subunit
Ffh      0.393   0.383   0.600   0.719   GTP-binding export factor; binds to signal sequence, GTP, and RNA
MurA     0.423   0.387   0.624   0.699   UDP-N-acetylglucosamine 1-carboxyvinyltransferase
Average of all proteins
         0.401   0.39    0.577   0.676

NOTE. The ten proteins with the highest correlation to PPCS are marked with an asterisk. Nine of these are involved in repair-recombination processes. ATPase, adenosine triphosphatase and GTP, guanosine triphosphate.

Correction for Time of Divergence

Columns 4 and 5 in table 3 represent the significance of differences between the selected subgroups and control after correction for time of divergence.
Column 4 demonstrates lower significance of the differences between the repair-recombination group and the control group in the correlation of their distances with the PPCS distances after time correction (compared to column 2). Column 5 demonstrates a higher significance of the differences between repair-recombination and control proteins for their GS correlations after time correction (compared to column 3). The recombination helicases RecG and RecQ seem to be closer to the repair group than RecA and RuvA. Indeed, higher significance was obtained for the difference between the “repair + RecG + RecQ” group and the control group than for that between all repair-recombination enzymes, including RecA and RuvA, and the control group. Therefore, we conclude that species GD distances (PPCS and GS distances) are highly correlated with repair-recombination protein distances but display lower correlations to other protein distances. Thus, it makes sense to narrow our initial hypothesis. Namely, we suggest that the foregoing evidence points to “coevolution of the GD and DNA repair-recombination enzymes.”

Table 3
Significance of the Differences Between Selected Protein Groups in the Correlations of Their Protein Distances to Dialect Distances (Mann-Whitney U test)

                                               Dialect Distances
Groups Compared                                PPCS      GS        tPPCS     tGS
DTDPs (20) versus SMPs (20)                    0.0326    >0.3      >0.1      0.00941
RRPs (11) versus SMPs (20)                     0.00074   0.08291   0.00385   0.00295
RRPs (11) versus the rest (29)                 0.00001   0.016     0.00065   0.00346
Repair (7) versus SMPs (20)                    0.00032   0.0406    0.00567   0.00109
Recombination (4) versus SMPs (20)             0.01634   >0.6      >0.1      >0.3
(Repair + RecG + RecQ) (9) versus SMPs (20)    0.00009   0.06599   0.00114   0.00041

NOTE. tPPCS and tGS, dialect distances based on PPCS and GS, respectively, but corrected for time by using two-factor regression analysis, with species 16S rRNA distances as the second independent parameter. RRPs, DTDPs involved in repair and recombination; Repair, DTDPs involved in repair; Recombination, DTDPs involved in recombination; and RecG + RecQ, names of specific helicases. Numbers in parentheses are the total numbers of proteins included in each group.

Discussion

Rationale and Possible Mechanisms That Might Cause Coevolution of Species DNA Repair-Recombination Enzymes and GD

In an experiment with an E. coli culture with a nonfunctional MMR system, lower recombination rates, compared to the standard recombination rates, were found after 20,000 replications between the evolved population and the formerly identical lines (Vulic, Lenski, and Radman 1999). This phenomenon was interpreted in terms of an evolved “genetic barrier.” To explain these results, the authors postulated a considerable divergence of DNA organization in the repair-defective population compared to the control. We suggest that genomic variation displayed in the GD and repair-recombination enzymes derives from evolutionary processes similar, to some extent, to the events in the aforementioned experiments with E. coli (Vulic, Lenski, and Radman 1999).

A key process in prokaryotic evolution might be short-term bursts of genetic variation. Elevation of variation can result from “attenuation” of both repair fidelity and constraints on nonhomologous recombination, causing an increased mutation rate and increased recombination both within and between species (Rayssiguier, Thaler, and Radman 1989). Although a relaxation of genome stability is usually not advantageous, there may be occasions where it is adaptive: when the organism finds itself in a “time of trouble,” i.e., in stress conditions (McClintock 1984; Hoffman and Parsons 1991; Korol, I. A. Preygel, and S. I. Preygel 1994; Korol 2001). As the intensity, duration, and nature of stress are highly variable, it may be advantageous for a population to increase its genotypic heterogeneity, ensuring that at least a subset of organisms would survive the stress.
An increase in heterogeneity could be achieved through mutations in one or more DNA repair genes, the ‘‘guards’’ of the quality of the genome text. These kinds of mutations will reduce the fidelity of repair-recombination processes. Some of the repair-recombination genes are known to be ‘‘hot spots’’ for mutations due to the high levels of Simple Sequence Repeats (SSRs) within or very close to these genes (Metzgar and Wills 2000; Li et al. 2002; Massey and Buckling 2002; Rocha, Matic, and Taddei 2002). SSRs are among the fastest evolving DNA sequences and are known to increase variability during the processing of genetic information (in replication and recombination but also during gene expression [Li et al. 2004]).

If a mutation in a DNA repair-recombination gene occurs in a stress situation, an increase in genome heterogeneity and GD differentiation can be expected in future generations. Later, reversion to high functionality and fidelity of the repair-processing gene may be selected in order to conserve and stabilize the gains of the change (Korol, I. A. Preygel, and S. I. Preygel 1994). It is reasonable to assume that readaptation and phenotypic reversion to high-fidelity ‘‘normal’’ functioning may be achieved by additional mutations of the previously mutated DNA repair gene as well as by suppressing mutations in other repair genes. Then, measuring the species molecular differences will reveal correlated changes in both the repair enzyme sequence and the GD (it should be remembered that the GD now also differs from the original).

The higher correlations found between the species DNA repair and recombination protein (RRP) distances and PPCS distances, compared to the correlations between RRP distances and either GC% content or CS distances, suggest that (at least in the Proteobacteria) the purine-pyrimidine sequence organization might be important in nongenic speciation.
Nongenic speciation had been proposed earlier for the role of species differences in GC% content (Forsdyke 2004).

What Might Be the Reasons for High Evolutionary Sensitivity of the DNA Repair-Recombination Enzymes to the GD?

Both the external conditions and the internal environment of the cell of an organism seem to have a profound effect on the layout of the repair system (Aravind, Walker, and Koonin 1999). DNA is the substrate for the repair enzymes. Unlike ‘‘regular’’ substrates of other (metabolic) enzymes, this substrate is always changing. Therefore, we suggest that in order to fulfill their roles with high speed and fidelity, the structures of repair-recombination enzymes might be more evolutionarily sensitive and ‘‘responsive’’ to changes in the GD (i.e., DNA-abundant structures). Changes of the GD might result from stress events or from the two other possible triggers presented in figure 1.

In the process of DNA repair, some of the enzymes involved might use specific short tracts of nucleotides that are dispersed in their genome as targets for a nuclease function (like the d(GATC) sequence for the MutH enzyme [Lahue, Au, and Modrich 1989]) or as ‘‘anchors’’ or ‘‘toggles’’ (the octamer sequence Chi [5′-GCTGGTGG-3′] for the RecBCD enzymatic complex in the MMR of E. coli [Myers and Stahl 1994]). Different species might have different ‘‘preferred’’ short DNA ‘‘signals’’ related to the repair and recombination processes (Chedin et al. 1998; Lao and Forsdyke 2000). It is reasonable to assume that these signals are related to the species-specific GD. Some of the DNA signal targets involved in repair-recombination processes might be simple, based on two-letter (purine/pyrimidine) discrimination (as proposed also for unwinding centers in DNA transcription or replication processes [Yagil, Shimron, and Tal 1998]).
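
The two kinds of short signals mentioned above (exact tracts such as d(GATC) or Chi, and words read in a two-letter purine/pyrimidine alphabet) can be illustrated with a minimal Python sketch. It is an illustration only, not the PPCS procedure used in this study: it counts exact, overlapping words, whereas the compositional spectra rely on imperfectly matching 20-letter words. The function names and the toy sequence are hypothetical.

```python
from collections import Counter

# Purines (A, G) map to R; pyrimidines (C, T) map to Y.
PURINE_PYRIMIDINE = str.maketrans("AGCT", "RRYY")


def to_purine_pyrimidine(seq: str) -> str:
    """Recode a DNA string into the two-letter R/Y alphabet."""
    return seq.upper().translate(PURINE_PYRIMIDINE)


def count_motif(seq: str, motif: str) -> int:
    """Count occurrences of an exact motif in seq, allowing overlaps."""
    seq, motif = seq.upper(), motif.upper()
    return sum(
        1
        for i in range(len(seq) - len(motif) + 1)
        if seq.startswith(motif, i)
    )


def word_counts(seq: str, k: int) -> Counter:
    """Count all overlapping words of length k in seq (exact matches only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical toy sequence, not taken from any genome in the study.
toy = "GATCGCTGGTGGATGATC"

print(count_motif(toy, "GATC"))          # d(GATC), the MutH target -> 2
print(count_motif(toy, "GCTGGTGG"))      # Chi octamer -> 1
print(to_purine_pyrimidine("GCTGGTGG"))  # Chi in the R/Y alphabet -> RYYRRYRR
print(word_counts(to_purine_pyrimidine(toy), 4).most_common(3))
```

Recoding to R/Y before counting collapses all sequences that share the same purine/pyrimidine pattern into one word, which is why a two-letter signal is a coarser but more widely shared target than an exact four-letter tract.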
There are sequence patterns that may cause obstacles during genome replication or reduce replication fidelity and are nevertheless maintained due to their essential roles at the RNA or protein levels (like microsatellite tandem repeats within genes) (Li et al. 2004). The repair enzymes should be involved in the solution of these genome replication problems. We think that this challenge, together with the need for the repair of continuously arising damage to DNA at any stage of the cell life, might be an additional reason why the repair group of enzymes displays the highest correlation with GD within the DTDP group (see tables 2 and 3). It is interesting that Parniewski et al. (1999) showed that the major in vivo repair system for DNA damage, the NER system, affects the stability of long transcribed triplet repeat sequences in E. coli. Our results corroborate this finding well (all bacterial enzymes of the NER system described in the literature were included in our study).

In conclusion, we suggest that the DNA repair-recombination enzymes may play an essential role in GD differentiation as a potentially important step of speciation on the entire-genome level. Conversely, variation in the GD may impose a selection pressure on the evolution of the DNA repair-recombination system.

To What Extent Might Horizontal Gene Transfer of DNA Maintenance Genes Affect the Evolution of GD?

In some DNA repair and recombination processes, complexes between proteins are formed, and hence a high structural fit between the components is needed. This may impede the integration of an alien DNA repair gene that is not identical (or not sufficiently similar) to the original gene of the recipient cell and therefore cannot fulfill the ‘‘native’’ gene functions. Such lack of complementation might be one of the reasons why genes involved in information processing seem to be rarely horizontally transferred (Jain, Rivera, and Lake 1999).
Although cases of duplication of some DNA repair genes in Proteobacteria species resulting from lateral gene transfer are known (Martins-Pinheiro et al. 2004), in these cases, the native original gene is highly conserved and remains functional.

Davidsen and coauthors (Davidsen et al. 2004) proposed that the transfer of DNA maintenance genes between cells of the same species may be an important stress-adaptation process in certain prokaryotes (the three Proteobacteria species considered in their work are also included in our study). Their argument is that during stress, both the loss of function of some DNA maintenance genes in some cells and the death and lysis of the majority of other cells in the colony occur. As a result, both the probability that DNA maintenance genes from the lysed cells are released to the vicinity of the survivors and the need of the survivors to take up these genes are elevated. The analysis conducted by the authors points to the existence of structural features within maintenance genes facilitating such transformation events. In particular, within the species N. meningitidis, H. influenzae, and, possibly, P. multocida, the distribution of oligonucleotides known as DNA ‘‘uptake sequences’’ is biased toward genome maintenance genes (DTDPs, in our terminology).

We believe that the foregoing model does not contradict our interpretations. Firstly, only a few bacterial species use this mechanism (Redfield 2001). Secondly, Davidsen et al. (2004) did not subdivide the DNA maintenance group of genes (similar to our DTDP group) into replication-oriented and repair-recombination-oriented groups. We speculate that there might be significant differences between such subgroups with respect to the presence of DNA uptake sequences (in favor of the repair-recombination group). Thirdly, stress-promoted transformation and recombination events might be accompanied by small changes in these genes.
Hence, again, the result of stress might be correlated changes in both the repair-recombination enzyme sequences and the GD.

Supplementary Material

Tables 4 and 5 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

We thank D. R. Forsdyke, E. Rocha, E. Trifonov, Y. Kashi, M. Soller, P. Capy, and two anonymous reviewers for their helpful comments and suggestions. This work was supported by the Israeli Ministry of Absorption and by the Ancell-Teicher Research Foundation for Molecular Genetics and Evolution. A.P. was supported by a scholarship in Bioinformatics from the Eshkol foundation of the Israeli Ministry of Science and Technology.

Literature Cited

Aravind, L., D. R. Walker, and E. V. Koonin. 1999. Conserved domains in DNA repair proteins and evolution of repair systems. Nucleic Acids Res. 27:1223–1242.
Bebenek, A., H. K. Dressman, G. T. Carver, S. Ng, V. Petrov, G. Yang, W. H. Konigsberg, J. D. Karam, and J. W. Drake. 2001. Interacting fidelity defects in the replicative DNA polymerase of bacteriophage RB69. J. Biol. Chem. 276:10387–10397.
Becherel, O. J., R. P. Fuchs, and J. Wagner. 2002. Pivotal role of the beta-clamp in translesion DNA synthesis and mutagenesis in E. coli cells. DNA Repair (Amst.) 1:703–708.
Brendel, V., J. S. Beckmann, and E. N. Trifonov. 1986. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J. Biomol. Struct. Dyn. 4:11–20.
Brocchieri, L. 2001. Phylogenetic inferences from molecular sequences: review and critique. Theor. Popul. Biol. 59:27–40.
Burge, C., A. M. Campbell, and S. Karlin. 1992. Over- and underrepresentation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA 89:1358–1362.
Campbell, A., J. Mrazek, and S. Karlin. 1999. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl. Acad. Sci. USA 96:9184–9189.
Chedin, F., P. Noirot, V. Biaudet, and S. D. Ehrlich. 1998. A five-nucleotide sequence protects DNA from exonucleolytic degradation by AddAB, the RecBCD analogue of Bacillus subtilis. Mol. Microbiol. 29:1369–1377.
Coenye, T., and P. Vandamme. 2004. Use of the genomic signature in bacterial classification and identification. Syst. Appl. Microbiol. 27:175–185.
Courcelle, J., and P. C. Hanawalt. 2001. Participation of recombination proteins in rescue of arrested replication forks in UV-irradiated Escherichia coli need not involve recombination. Proc. Natl. Acad. Sci. USA 98:8196–8202.
Davidsen, T., E. A. Rodland, K. Lagesen, E. Seeberg, T. Rognes, and T. Tonjum. 2004. Biased distribution of DNA uptake sequences towards genome maintenance genes. Nucleic Acids Res. 32:1050–1058.
Forsdyke, D. R. 1995. A stem-loop ‘‘kissing’’ model for the initiation of recombination and the origin of introns. Mol. Biol. Evol. 12:949–958.
Forsdyke, D. R. 2004. Chromosomal speciation: a reply. J. Theor. Biol. 230:189–196.
Forsdyke, D. R., and J. R. Mortimer. 2000. Chargaff’s legacy. Gene 261:127–137.
Fox, G. E., E. Stackebrandt, R. B. Hespell et al. (19 co-authors). 1980. The phylogeny of prokaryotes. Science 25:457–463.
Gentles, A. J., and S. Karlin. 2001. Genome-scale compositional comparisons in eukaryotes. Genome Res. 11:540–546.
Gogarten, P. J., E. Hilario, and L. Olendzenski. 1996. Gene duplications and horizontal gene transfer during early evolution. Pp. 267–292 in D. M. Roberts, P. Sharp, G. Alderson, and M. Collins, eds. Evolution of microbial life: 54th Symposium of the Society for General Microbiology. Cambridge University Press, Cambridge.
Gregg, A. V., P. McGlynn, R. P. Jaktaji, and R. G. Lloyd. 2002. Direct rescue of stalled DNA replication forks via the combined action of PriA and RecG helicase activities. Mol. Cell 9:241–251.
Hoffman, A. A., and P. A. Parsons. 1991. Evolutionary genetics and environmental stress. Oxford Science publications, Oxford.
Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA 96:3801–3806.
Karlin, S., and L. R. Cardon. 1994. Computational DNA sequence analysis. Annu. Rev. Microbiol. 48:619–654.
Karlin, S., J. Mrázek, and A. M. Campbell. 1997. Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol. 179:3899–3913.
Kirzhner, V. M., A. B. Korol, A. Bolshoy, and E. Nevo. 2002. Compositional spectrum revealing patterns for genomic sequence characterization and comparison. Physica A 312:447–457.
Kirzhner, V. M., A. B. Korol, E. Nevo, and A. Bolshoy. 2003. A large-scale comparison of genomic sequences: one promising approach. Acta Biotheor. 51:73–89.
Korol, A. B. 2001. Recombination. Pp. 53–71 in Simon A. Levin, ed. Encyclopedia of biodiversity. Volume 5. Academic Press, San Diego, Calif.
Korol, A. B., I. A. Preygel, and S. I. Preygel. 1994. Recombination variability and evolution. Chapman and Hall, London.
Lahue, R. S., K. G. Au, and P. Modrich. 1989. DNA mismatch correction in a defined system. Science 245:160–164.
Lao, P. J., and D. R. Forsdyke. 2000. Crossover hot-spot instigator (Chi) sequences in Escherichia coli occupy distinct recombination/transcription islands. Gene 243:47–57.
Li, Y., A. B. Korol, T. Fahima, A. Beiles, and E. Nevo. 2002. Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol. Ecol. 11:2453–2465.
———. 2004. Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol. 21:991–1007.
Lieb, M., and A. S. Bhagwat. 1996. Very short patch repair: reducing the cost of cytosine methylation. Mol. Microbiol. 3:467–473.
López de Saro, F. J., and M. O’Donnell. 2001. Interaction of the beta sliding clamp with MutS, ligase, and DNA polymerase I. Proc. Natl. Acad. Sci. USA 98:8376–8380.
Martins-Pinheiro, M., R. S. Galhardo, C. Lage, K. M. Lima-Bessa, K. A. Aires, and C. F. Menck. 2004. Different patterns of evolution for duplicated DNA repair genes in bacteria of the Xanthomonadales group. BMC Evol. Biol. 4(1):29.
Massey, R. C., and A. Buckling. 2002. Environmental regulation of mutation rates at specific sites. Trends Microbiol. 10:580–584.
McClintock, B. 1984. The significance of responses of the genome to challenge. Science 226:792–801.
Metzgar, D., and C. Wills. 2000. Evidence for the adaptive evolution of mutation rates. Cell 101:581–584.
Myers, R. S., and F. W. Stahl. 1994. Chi and the RecBCD enzyme of Escherichia coli. Annu. Rev. Genet. 28:49–70.
Nussinov, R. 1984. Doublet frequencies in evolutionary distinct groups. Nucleic Acids Res. 10:1749–1763.
Parniewski, P., A. Bacolla, A. Jaworski, and R. D. Wells. 1999. Nucleotide excision repair affects the stability of long transcribed (CTG*CAG) tracts in an orientation-dependent manner in Escherichia coli. Nucleic Acids Res. 27:616–623.
Paz, A., D. Mester, I. Baca, E. Nevo, and A. Korol. 2004. Adaptive role of increased frequency of polypurine tracts in mRNA sequences of thermophilic prokaryotes. Proc. Natl. Acad. Sci. USA 101:2951–2956.
Rayssiguier, C., D. S. Thaler, and M. Radman. 1989. The barrier to recombination between Escherichia coli and Salmonella typhimurium is disrupted in mismatch-repair mutants. Nature 342:396–401.
Redfield, R. J. 2001. Do bacteria have sex? Nat. Rev. Genet. 2:634–639.
Robu, M. E., R. B. Inman, and M. M. Cox. 2004. Situational repair of replication forks: roles of RecG and RecA proteins. J. Biol. Chem. 279:10973–10981.
Rocha, E., I. Matic, and F. Taddei. 2002. Over-representation of repeats in stress response genes: a strategy to increase versatility under stressful conditions? Nucleic Acids Res. 30:1886–1894.
Schofield, M. J., and P. Hsieh. 2003. DNA mismatch repair: molecular mechanisms and biological function. Annu. Rev. Microbiol. 57:579–608.
Tuteja, N., and R. Tuteja. 2004. Prokaryotic and eukaryotic DNA helicases. Essential molecular motor proteins for cellular machinery. Eur. J. Biochem. 271:1835–1848.
Vulic, M., R. E. Lenski, and M. Radman. 1999. Mutation, recombination, and incipient speciation of bacteria in the laboratory. Proc. Natl. Acad. Sci. USA 96:7348–7351.
Woese, C. 1987. Bacterial evolution. Microbiol. Rev. 51:221–271.
Yagil, G., F. Shimron, and M. Tal. 1998. DNA unwinding in the CYC1 and DED1 yeast promoters. Gene 225:152–163.

Pierre Capy, Associate Editor
Accepted August 30, 2005