Coevolution of DNA-Interacting Proteins and Genome ‘‘Dialect’’
A. Paz, V. Kirzhner, E. Nevo, and A. Korol
Institute of Evolution, University of Haifa, Mount Carmel, Haifa, Israel
Several species-specific characteristics of genome organization that are superimposed on its coding aspects were proposed
earlier, including genome signature (GS), genome accent, and compositional spectrum (CS). These notions could be considered as representatives of genome dialect (GD). We measured within the Proteobacteria some GD representatives, the
relative abundance of dinucleotides or GS, the profiles of occurrence of 10 nucleotide words (CS), and the profiles of
occurrence of 20 nucleotide words, using a degenerate two-letter alphabet (purine-pyrimidine compositional spectra
[PPCS]). Here, we show that the evolutionary distances between DNA repair and recombination orthologous enzymes
(especially those of the nucleotide excision repair system) are highly correlated with PPCS and GS distances. Orthologous
proteins involved in structural or metabolic processes (control group) have significantly lower correlations of their evolutionary distances with the PPCS and GS distances. We hypothesize that the high correlation of the evolutionary distances
of the DNA repair orthologous enzymes with their GD is a result of the coevolution of the DNA repair enzymes’ structures
and GDs. Species GDs could be substantially influenced by the function of DNA polymerase I (the bacterial major DNA
repair polymerase). This might cause the correlation of species GDs differentiation with evolutionary changes of species
DNA polymerase I. Simultaneously, the structures of DNA repair-recombination enzymes might be evolutionarily sensitive and responsive to changes in the structure of their substrate—the DNA (including those that are represented by GD
differentiation). We further discuss the rationale and mechanisms of the hypothesized coevolution. We suggest that stress
might be an important cause of changes in the repair-recombination genes and the GD and the trigger of the aforementioned
coevolution process. Other triggers might be massive horizontal gene transfer and ecological selection.
Introduction
Since the 1980s, the method of choice in estimation of
evolutionary distances between species is the sequence
comparison of their 16S or 18S rRNAs (Fox et al. 1980;
Woese 1987). The advantage of rRNA for phylogenetic
reconstructions is its slow change and putative resistance
to lateral gene transfer (Gogarten, Hilario, and Olendzenski
1996; Jain, Rivera, and Lake 1999). Phylogenetic reconstructions can also be based on sequences of slow-evolving
orthologous proteins. However, many protein-based tree topologies were incongruent with those based on rRNA
(reviewed by Brocchieri 2001). These and other difficulties
in using various sequence families for phylogenetic reconstructions call for methods of species comparisons at the
entire-genome level. In the last decades, new approaches
for measuring species distances were suggested (Nussinov
1984; Brendel, Beckmann, and Trifonov 1986; Burge,
Campbell, and Karlin 1992; Kirzhner et al. 2002). In particular, species-specific ‘‘genome signature’’ (GS) analysis based on di- and/or trinucleotide relative abundances
was widely used (Karlin, Mrázek, and Campbell 1997;
Campbell, Mrazek, and Karlin 1999; Gentles and Karlin
2001; Coenye and Vandamme 2004). Similarly, ‘‘compositional spectrum’’ (CS) analysis characterizing genomes
in terms of distribution of frequencies of imperfectly
matching words (length ~10–20 letters) was developed
in our lab (Kirzhner et al. 2002, 2003). These species-specific characteristics of genome structure are superimposed on its protein-coding aspects and exhibit short-range
patterns of DNA organization that are dispersed throughout the genome. The intergenome differences are, in a sense,
analogous to variations in English pronunciation among people of different nationalities and can in total be referred to as a ‘‘genome dialect’’ (GD) (Forsdyke and Mortimer 2000). When dealing with large genomes of eukaryotic species, the two methods of characterizing GD, CS and GS, can use randomly chosen samples of genomic ‘‘texts’’ rather than whole genomes (Gentles and Karlin 2001; Kirzhner et al. 2003). A small genome can easily be sampled as a whole. In this study, we used dinucleotide GS and two versions of CS. In the first version of CS, 10mer ‘‘words’’ in the four-letter alphabet were employed. In the second version, 20-nucleotide words were used in the ‘‘degenerate’’ two-letter alphabet (R = purine and Y = pyrimidine), resulting in purine-pyrimidine compositional spectra (PPCS).
Key words: genome dialect, genome signature, compositional spectrum, DNA repair, stress.
Mol. Biol. Evol. 23(1):56–64. 2006
doi:10.1093/molbev/msj007
Advance Access publication September 8, 2005
© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved.
Our major hypothesis is that the evolution of GD, as
defined above, is strongly affected by some DNA template–
dependent processing proteins (DTDPs), their structure,
and ‘‘preferences’’ (see also Forsdyke 1995). Moreover,
we suggest that the structure of the DTDPs that ‘‘read’’
and ‘‘write’’ the DNA text, in turn, may be influenced indirectly (via evolutionary tuning) by the abundant DNA
structures of the species. Namely, after a significant change
in species GD caused by some trigger (see below), there
might also be a need for a structural change of its DNA repair enzymes. It is possible that some mutations that may
occur in these genes after a change in the GD will be evolutionarily advantageous.
We assumed that the major bacterial replicative DNA polymerase, DNA polymerase III, which is known for its high fidelity in replication, and the major DNA repair polymerase, DNA polymerase I (also known to be a high-fidelity polymerase), are better adapted to the predominant compositional organization of their genome than to rarely occurring sequence patterns. As an illustration of
this idea, one can presumably consider the known results on
the T4-related phage RB69 DNA polymerase (Bebenek
et al. 2001). Base pair substitution hot spots produced by
the intact phage DNA polymerase in vivo tended to occur at certain specific GC-rich 6mers and, especially, at GG/CC dimers in the T4 rI gene, a particularly striking tendency because of the low percentage of guanine plus cytosine (GC%) content of the T4 genome (35.3%).
FIG. 1. Factors presumably involved in the coevolution of GD and repair-recombination enzymes.
Earlier, Karlin and coworkers hypothesized that
species-specific GS (peculiarities of oligonucleotide relative abundances) may be substantially affected by the
DNA replication and repair systems (Burge, Campbell,
and Karlin 1992; Karlin, Mrázek, and Campbell 1997).
In particular, it was suggested that species-specific over- and underrepresentation of short oligonucleotides is related to the deoxycytosine methylase/very short patch repair system (Burge, Campbell, and Karlin 1992; Lieb and Bhagwat 1996). An experimental test of the effect of bacterial DNA polymerases I, II, and III on the species GS was also proposed (Karlin, Mrázek, and Campbell 1997), although such a test has not been conducted.
The objective of the current in silico investigation is to
test whether species differentiation with respect to (1) GDs
and (2) the primary structure of their DNA template–dependent enzymes are interconnected. If this is indeed the case,
one may think of several explanations for such a correlation: (1) DNA polymerase may play a substantial role in the evolution of its product (the DNA sequence); (2) the structure of repair-recombination enzymes might be evolutionarily more sensitive and ‘‘responsive’’ to changes in the GD (i.e., abundant DNA structures) than proteins involved in structural and metabolic processes (SMPs); (3) the mutations that are preserved in the DNA and change the GD are those that improve the DNA sequence organization as a target of repair-recombination (Forsdyke 1995). It is
also possible that changes in the DTDP genes and changes
in DNA across the whole genome (changes to the GD)
happen at the same time, e.g., due to genomic stress
(see below).
Hypothetical triggers of dialect-DTDP coevolution are
shown in figure 1. Closely related species from contrasted
environmental conditions may differ in codon usage and/or
preferred amino acids. An example is the strong preference for purines and purine tracts in mRNA displayed by thermophilic prokaryotes compared to mesophilic species of the same life domains (Paz et al. 2004), resulting in GD differences between the thermophiles and mesophiles. Coevolution of DTDP sequences and GD might also happen after genomic cataclysms caused by massive horizontal transfer. Acute genomic stress might be a third possible (and, presumably, the most plausible) trigger for the postulated scenario.
FIG. 2. The roles of selected DTDPs in the cells of Proteobacteria species: transcription, replication, repair, and recombination and antirecombination. Some of these enzymes are involved in more than one of these four processes, as shown with the grouping lines.
In order to test the proposed hypothesis, we estimated
the correlation between distance matrices built using GD
representatives and distance matrices based on the alignment of orthologous proteins of a certain group of DTDPs
(figs. 2 and 3). As a control to the DTDPs, a group of SMPs
was employed. The expectation was that the DTDP distances would display a higher correlation with the distances measured for GD than the control group distances would. The DTDPs selected for our comparisons have different roles in transcription, replication, repair, and recombination (fig. 2). Some DTDPs are known to be involved in more than one of these four processes: DNA polymerase III beta chain (DnaN) also participates in repair processes in addition to its role in replication (López de Saro and O’Donnell 2001; Becherel, Fuchs, and Wagner 2002); the mismatch repair (MMR) enzyme MutS, in antirecombination (Rayssiguier, Thaler, and Radman 1989; Schofield and Hsieh 2003); and the recombination enzymes RecQ and RecG, in antirecombination and repair (Courcelle and Hanawalt 2001; Gregg et al. 2002; Robu, Inman, and Cox 2004; N. Tuteja and R. Tuteja 2004).
The error-prone DNA polymerases II, IV, and V were not selected because they are missing from the genomes of several species included in this study. We also did not include proteins involved in apurinic and apyrimidinic repair, such as the DNA glycosylases Ung and AlkA, or Phr, which is involved in direct repair of pyrimidine dimers.
FIG. 3. Evaluation of the interdependence of the species distances for GD and DTDPs. As a control to the DTDP group, SMPs were taken.
FIG. 4. Examples of CS of genomic DNA of Proteobacteria species calculated based on the {R, Y} alphabet. As a reference, a stretch of Escherichia coli DNA (two million nucleotides) was employed; thus, the words in all other examples are ordered according to the PPCS ranking of this reference (the second stretch of the E. coli genome shown in the example is the rest of the genome). The abscissa represents the words of the set W placed in order of their frequency of appearance in the reference sequence, whereas the ordinate shows the observed frequencies F(W, S) of the words in the compared sequence S. The table under the PPCS graphs gives the pairwise distances between the compared species. Abbreviations: E.c 1, E. coli K-12 half genome 1; E.c 2, E. coli K-12 half genome 2; N.m, Neisseria meningitidis; S.t, Salmonella typhimurium; R.c, Rickettsia conorii.
Methods
Targeted Species Employed in the Study
Twenty Proteobacteria of the α, β, γ, and ε groups were investigated: α: Agrobacterium tumefaciens, Caulobacter crescentus, Rickettsia conorii, Rickettsia prowazekii; β: Neisseria meningitidis; γ: Buchnera aphidicola, Escherichia coli K-12, Haemophilus influenzae, Pseudomonas aeruginosa, Pasteurella multocida, Salmonella enterica, Shigella flexneri, Shewanella oneidensis, Salmonella typhimurium, Vibrio cholerae, Xanthomonas campestris, Xylella fastidiosa, Yersinia pestis; and ε: Campylobacter jejuni, Helicobacter pylori.
Targeted Proteins Employed in the Study
Altogether, sequences of 40 proteins were selected in
such a way that most of them have orthologues in the vast
majority of the 20 Proteobacteria species. As shown in figure 2
and table 3, the list included RNA polymerase subunits,
replication apparatus enzymes, and DNA repair and recombination enzymes. Dual/multiple functions of some of these
enzymes are shown in figure 2, and the references that
showed or suggested additional roles of the enzymes are
cited in the text.
The remaining 20 SMPs served as a control group. The
SMPs are involved in the following processes: protein synthesis,
RpLA, RpLB, RpSB, RpSC, TufA, AlaS; amino acid biosynthesis, AroK, DapA; biosynthesis of cofactors, prosthetic groups, and carriers, FolA, LipA, RibE, BioB;
energy metabolism, Pgk, Eno, GapA, AtpA; pyrimidine
biosynthesis, arginine biosynthesis, and urea cycle, CarA,
CarB; protein fate, Ffh; and cell envelope, MurA.
Protein sequences and 16S rRNA sequences were obtained from publicly available databases (http://www.ncbi.nim.gov/entrez/query.fegi?db=Protein and http://www.ncbi.nim.gov/entrez/query.fegi?db=Nucleotide). The accession numbers and IDs of the 40 orthologous proteins included in this study are listed in the supplemental data (Table 4, Supplementary Material online).
Calculating Species Distance for Orthologous Proteins and 16S rRNA
The Blast 2 Sequences comparison tool was used to calculate the distances for orthologous proteins with standard alignment parameters (matrix: BLOSUM62; gap open: 11; gap extension: 1; expect: 10; word size: 3). Distances between orthologous proteins are determined as follows: distance = 100 − (the percentage of identical amino acids after Blast 2 Sequences alignment).
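The distance definition above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned (for example by Blast 2 Sequences) and are supplied as equal-length strings with '-' for gaps; the function name and the choice of denominator (all alignment columns that are not gap-gap) are assumptions made for illustration and may differ from how the alignment tool reports percent identity.

```python
def identity_distance(aligned_a: str, aligned_b: str) -> float:
    """Return 100 minus the percent identity of two pre-aligned sequences.

    Assumption: '-' marks a gap, and columns where both sequences
    have a gap carry no information and are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("alignment has no informative columns")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(1 for a, b in columns if a == b and a != "-")
    percent_identity = 100.0 * identical / len(columns)
    return 100.0 - percent_identity
```

Identical orthologues give a distance of 0, and the distance grows toward 100 as fewer aligned positions are shared.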
Calculating CS of DNA Sequences
Definition
Let us take a set W of n different oligonucleotides (words) $w_i$ of length L in any alphabet, e.g., in the standard {A, T, G, C} or {R, Y} (purine, pyrimidine). For each word $w_i$ of the set W and any chosen large sequence S, one can calculate the observed number of matches $m_i = m(w_i)$, allowing for a preset number r of mismatches (i.e., r = 2). This approximate matching can be denoted as ‘‘r-mismatching’’. Now let $M = \sum_i m_i$. The frequency distribution F(W, S) of $f_i = m_i/M$ will be referred to as the CS of the sequence S relative to the set W (Kirzhner et al. 2002, 2003). The word sets can be produced using a random generator. In the current study, we employed word lengths L = 10 and L = 20 for CS analysis based on four-letter and two-letter alphabets, respectively (resulting in equal sizes of the corresponding ‘‘vocabularies’’, since $4^{10} = 2^{20}$).
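The definition can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not the authors' implementation: the helper names, the seeded random word generator, and the brute-force window scan are all hypothetical choices, and a whole-genome analysis would need a much faster matching scheme.

```python
import random

# Recode the four-letter alphabet as purine (R) / pyrimidine (Y).
RY_TABLE = str.maketrans("AGCT", "RRYY")


def to_purine_pyrimidine(seq):
    """Recode A/G as R and C/T as Y."""
    return seq.upper().translate(RY_TABLE)


def random_words(n, length, alphabet, seed=0):
    """Draw n distinct random words of the given length (hypothetical helper)."""
    if n > len(alphabet) ** length:
        raise ValueError("not enough distinct words of this length")
    rng = random.Random(seed)
    words = set()
    while len(words) < n:
        words.add("".join(rng.choice(alphabet) for _ in range(length)))
    return sorted(words)


def count_r_matches(word, seq, r):
    """Count windows of seq that differ from word in at most r positions."""
    length = len(word)
    count = 0
    for start in range(len(seq) - length + 1):
        mismatches = 0
        for a, b in zip(word, seq[start:start + length]):
            if a != b:
                mismatches += 1
                if mismatches > r:
                    break
        else:  # window scanned without exceeding r mismatches
            count += 1
    return count


def compositional_spectrum(words, seq, r):
    """Return the frequencies f_i = m_i / M for the words of the set W."""
    counts = [count_r_matches(w, seq, r) for w in words]
    total = sum(counts)
    if total == 0:
        raise ValueError("no word matched the sequence")
    return [c / total for c in counts]
```

For a PPCS-style spectrum one would, under these assumptions, call `compositional_spectrum(random_words(200, 20, "RY"), to_purine_pyrimidine(genome), 2)`.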
Visualizing CS and Measuring Genome Similarities
Our previous studies (Kirzhner et al. 2002, 2003) indicate that a random set of 200 words is sufficient to compare CS F(W, S) of different parts of the same genome or between genomes. Figure 4 illustrates the same idea using examples from the list of Proteobacteria employed in this study, with CS calculated for the degenerate (purine-pyrimidine) alphabet (PPCS). In order to display spectra, we ranked the words according to the frequency of their appearance in one of the sequences S (reference ranking). The intuitive impressions about inter- or intragenomic similarities and dissimilarities can be supported by distance metrics based on the Spearman rank correlation $\rho$. The quantity $d = (1 - \rho)/2$ ($0 \le d \le 1$) can be considered as the distance between two spectra. The maximal distance d = 1 corresponds to strictly reversed CS, whereas the minimal distance of 0 corresponds to identically ordered spectra. It should be stressed that the CS distances between species are independent of the randomly chosen sets of words (although this does not exclude sampling variation).
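As a hedged illustration of the distance $d = (1 - \rho)/2$, the sketch below computes the Spearman rank correlation as the Pearson correlation of average ranks. The function names are hypothetical, and the handling of tied frequencies (average ranks) is an assumption, since the text does not say how ties were treated.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)


def spectrum_distance(spectrum_a, spectrum_b):
    """Distance d = (1 - rho) / 2 between two spectra over the same word set."""
    rho = pearson(average_ranks(spectrum_a), average_ranks(spectrum_b))
    return (1 - rho) / 2
```

Both spectra must list the frequencies of the same words in the same order, so that position i refers to the same word $w_i$ in each.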
Calculating Species GS
Definition
Relative abundance of dinucleotides: Let $f_X$ denote the frequency of the nucleotide X in a DNA sequence S and $f_{XY}$ the frequency of the dinucleotide XY in S. Then, XY is considered of high (low) relative abundance, compared with a random association of the component mononucleotides X and Y in S, if the odds ratio $f_{XY}/(f_X f_Y)$ is sufficiently larger (lower) than 1. Based on the above indices, some distance measures were proposed (see Karlin and Cardon 1994) to estimate the closeness of compared sequences.
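A minimal sketch of the odds ratio follows. It works on a single strand and ignores refinements used in published genome signature analyses (such as combining a sequence with its reverse complement), so it should be read as an illustration of the formula only. The distance function, a mean absolute difference of odds ratios, is one plausible choice and is an assumption here, because the text does not state which measure was used.

```python
from collections import Counter
from itertools import product


def dinucleotide_odds_ratios(seq):
    """Return {XY: f_XY / (f_X * f_Y)} for dinucleotides over A, C, G, T."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    ratios = {}
    for x, y in product("ACGT", repeat=2):
        f_x, f_y = mono[x] / n_mono, mono[y] / n_mono
        if f_x == 0 or f_y == 0:
            continue  # ratio undefined when a nucleotide is absent
        ratios[x + y] = (di[x + y] / n_di) / (f_x * f_y)
    return ratios


def signature_distance(ratios_a, ratios_b):
    """Mean absolute difference of odds ratios (assumed distance measure)."""
    shared = sorted(set(ratios_a) & set(ratios_b))
    if not shared:
        raise ValueError("no dinucleotide is defined in both signatures")
    return sum(abs(ratios_a[k] - ratios_b[k]) for k in shared) / len(shared)
```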
Correction for Time of Divergence
Correlation between the two groups of characteristics, (1) species distances based on GD and (2) distances based on protein sequences, may be driven by functional interdependence, but the time elapsed since species divergence may also be a factor, or even the major shared factor, contributing to the observed correlation between the two measures of species distance. In other words, distance measures of different kinds could all be expected to grow with the time since two species diverged, which alone would make them correlated. Despite this reasoning, we found that protein sequences of the repair-recombination group show significantly higher correlation with the GD characteristics than the remaining set of tested proteins. Nevertheless, we attempted to take the time factor into account. Namely, we represented the ‘‘time elapsed from divergence’’ of any two species by the sequence comparison of their 16S rRNA. To compensate for the influence of this factor on pairwise correlations, we recalculated the correlations between the discussed groups of proteins and GD by using a two-factor regression, with dialect distance and ‘‘time’’ (16S rRNA distance) as the independent variables and protein distance as the dependent variable.
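The two-factor regression can be sketched as an ordinary least-squares fit of protein distance on dialect distance and 16S rRNA distance. The text does not say how the time-corrected correlation was derived from this regression; the partial correlation shown below is one common way to express the association that remains after the second factor is accounted for, and its use here is an assumption. All function names are hypothetical.

```python
import math


def _cross(x, y):
    """Sum of products of deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def fit_two_factor(protein, dialect, time):
    """Least-squares fit of protein = b0 + b1*dialect + b2*time."""
    s11, s22, s12 = _cross(dialect, dialect), _cross(time, time), _cross(dialect, time)
    s1y, s2y = _cross(dialect, protein), _cross(time, protein)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("the two predictors are collinear")
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    n = len(protein)
    b0 = sum(protein) / n - b1 * sum(dialect) / n - b2 * sum(time) / n
    return b0, b1, b2


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    return _cross(x, y) / math.sqrt(_cross(x, x) * _cross(y, y))


def partial_correlation(protein, dialect, time):
    """Correlation of protein and dialect distances with time held fixed."""
    r_pd, r_pt, r_dt = pearson(protein, dialect), pearson(protein, time), pearson(dialect, time)
    return (r_pd - r_pt * r_dt) / math.sqrt((1 - r_pt ** 2) * (1 - r_dt ** 2))
```

Each list holds one value per species pair, in the same pair order for all three distance measures.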
Results
Some examples of distances between the species calculated for orthologous proteins and for different GD representatives are shown in table 1 (for the whole data set, see Table 5 in Supplementary Material online). As expected, orthologous DTDPs tend to display, on average, a larger evolutionary distance between the species than orthologous SMPs: the mean distance was 48.34 ± 1.92 and 38.79 ± 1.72 for DTDPs and SMPs, respectively. But in both groups there are proteins with small and with large distances between the orthologues. For example, in the DTDP group, the average distances between orthologous UvrA and UvrB were 35.48% and 37.08%, respectively, while the distances between orthologous DnaNs are much larger (average 58.06). Similarly, within the SMP group, the average distance between orthologous AtpAs is small (33.67) while the average distance between orthologous FolAs is large (55.1).
Distance Measures Between Orthologous DNA Repair
Proteins Show High Correlation with Distance Measures
of the GD for both PPCS and GS
A positive, albeit low, correlation between the species GC% content distances and all 40 protein distances was found (table 2, column 3). Similar results were obtained for the CS distances (column 4). Presumably, the latter result reflects the dependence of CS measures on the species GC% content. We also tried a modified version of CS, which should not be sensitive to differences in species GC% content, by reducing the alphabet to purine-pyrimidine (resulting in PPCS). Although the correlations between many protein distances and PPCS distances were moderate, some DTDPs, especially those involved in DNA repair and recombination, showed high correlations (table 2, column 5). In fact, 9 of the 10 highest correlations were displayed by the group of 11 repair-recombination proteins, and only one was from the group of 20 control proteins (the difference was significant at $P < 5 \times 10^{-5}$ by Fisher’s exact test for a 2 × 2 table). Correlations of the distances of most proteins with GS distances were high but seemingly less discriminative between the repair-recombination proteins and all other chosen proteins (column 6).
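The Fisher's exact test mentioned above can be reproduced with a short sketch. The layout of the 2 × 2 table is an assumption: rows are the 11 repair-recombination proteins and the 20 control proteins, and columns are "among the ten highest PPCS correlations" versus "not among them". The two-sided rule used here (summing all tables no more probable than the observed one) is also an assumption, since the text does not state which variant was applied.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Assumed table: 9 of 11 repair-recombination proteins and
# 1 of 20 control proteins fall among the ten highest correlations.
p_value = fisher_exact_two_sided(9, 2, 1, 19)
```

Under these assumptions the P value falls below the $5 \times 10^{-5}$ threshold quoted in the text.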
Additional instructive comparisons between different groups and subgroups of the chosen 40 proteins, with respect to the correlation of their distances with GD distances, were conducted using the nonparametric Mann-Whitney test for both PPCS and GS as dialect characteristics (table 3, columns 2 and 3). As already mentioned, DTDP distances are more highly correlated with PPCS distances than those of control proteins (the difference is significant at P < 0.033). However, the two groups cannot be distinguished by the correlations of their protein distances with GS distances (P > 0.3). A clearer differentiation was found when the subgroup of repair-recombination proteins, rather than the total DTDP group, was compared to the control: in this case, the differences in correlation to PPCS and GS were significant at P < 0.0008 and P = 0.083, respectively. Comparison of repair-recombination proteins to an extended control that included the remaining 29 proteins showed even higher significance (P = 0.00001 for PPCS and P = 0.016 for GS) (table 3).
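The group comparison can be sketched with a plain Mann-Whitney U computation. The correlation values below are transcribed from the PPCS column of table 2; treating DnaN, PolA, UvrA, UvrB, UvrC, UvrD, MutS, RecG, RecQ, RuvA, and RecA as the 11 repair-recombination proteins is an inference from the group sizes in table 3, not something the text lists explicitly. The P value uses a normal approximation with continuity correction, which is an assumption; because the text does not specify the variant of the test (exact or approximate, one- or two-sided), the result need not reproduce the values in table 3.

```python
import math

# PPCS correlations from table 2 (assumed group membership, see lead-in).
RRP_PPCS = [0.731, 0.722, 0.709, 0.703, 0.735, 0.736,
            0.623, 0.666, 0.681, 0.603, 0.672]
SMP_PPCS = [0.505, 0.580, 0.470, 0.527, 0.231, 0.665, 0.508, 0.461,
            0.558, 0.476, 0.486, 0.689, 0.577, 0.621, 0.528, 0.453,
            0.640, 0.654, 0.600, 0.624]


def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a: pairs with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def normal_approx_p(u, n_a, n_b):
    """Two-sided P value by normal approximation (no tie correction)."""
    mean = n_a * n_b / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = max(0.0, (abs(u - mean) - 0.5) / sd)  # continuity correction
    return math.erfc(z / math.sqrt(2))


u_stat = mann_whitney_u(RRP_PPCS, SMP_PPCS)
p_approx = normal_approx_p(u_stat, len(RRP_PPCS), len(SMP_PPCS))
```

A U statistic close to its maximum of 11 × 20 = 220 means that almost every repair-recombination protein has a higher PPCS correlation than almost every control protein.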
We further divided the repair-recombination group into ‘‘repair’’ and ‘‘recombination’’ subgroups. The repair subgroup showed higher significance than the recombination subgroup in its correlations to PPCS (P = 0.00032 for repair vs. control and P < 0.017 for recombination vs. control) and GS (P < 0.0406 for repair vs. control and P > 0.6 for recombination vs. control). It is interesting to note that the high correlation of the distances of the orthologous enzymes belonging to the nucleotide excision repair (NER) system with PPCS and GS distances is shared by both UvrA and UvrB (which have relatively low average distances, 35.48 and 37.08, respectively) and PolA (DNA polymerase I), which has a high distance (53.07) between orthologues.

Table 1
Distances of Orthologous DTDPs and SMPs and Species GD Representatives
Name      N     CD(a)    LD(a)    AD(a)    Description
DTDPs
RpoA      20    0.00     71.73    40.53    RNA polymerase alpha subunit
RpoB      20    0.01     55.82    36.48    RNA polymerase beta subunit
RpoC      20    4.44     57.44    36.35    RNA polymerase beta prime subunit
DnaG      20    2.23     71.42    57.07    Primase
GyrA      20    4.89     67.72    45.97    DNA gyrase alpha chain
DnaE      20    1.46     76.1     52.17    DNA polymerase III alpha chain
DnaQ      20    0.00     78.72    53.91    DNA polymerase III epsilon chain
DnaB      20    0.00     71.91    49.70    Replicative DNA helicase
Rep       20    2.22     70.41    53.58    Replicative DNA helicase
DnaN      20    0.27     80.92    58.06    DNA polymerase III beta chain (clamp)
PolA      19    6.78     71.76    53.07    DNA polymerase I
UvrA      19    2.76     53.70    35.48    Excision nuclease subunit A
UvrB      19    4.16     65.46    37.08    Excision nuclease subunit B
UvrC      19    0.02     71.87    54.61    Excision nuclease subunit C
UvrD      20    1.80     70.05    52.46    DNA-dependent ATPase I and helicase II
MutS      20    5.62     79.39    54.71    DNA-binding, DNA-dependent ATPase
RecG      20    2.64     80.12    53.06    Holliday junction helicase
RecQ      17    4.04     70.85    54.12    Helicase
RuvA      19    5.91     74.63    56.46    Holliday junction DNA helicase
RecA      20    3.39     44.54    32.02    Recombinase DNA strand exchange
Average of DTDPs
                2.63     69.23    48.34
SMPs
RpLA      20    7.94     60.00    42.44    50S ribosomal protein L1
RpLB      20    3.29     52.72    34.72    50S ribosomal protein L2
RpSB      20    0.00     57.08    38.75    30S ribosomal protein S2
RpSC      20    1.40     56.66    33.12    30S ribosomal protein S3
TufA      20    9.89     36.18    25.90    Protein chain elongation factor EF-Tu
AlaS      20    6.05     65.65    47.29    Alanyl-tRNA synthetase
AroK      20    0.00     73.83    48.35    Shikimate kinase I
DapA      18    0.00     69.11    52.14    Dihydrodipicolinate synthase
FolA      20    0.00     76.86    55.10    Dihydrofolate reductase type I
LipA      18    0.00     58.02    37.50    Lipoic acid synthetase
RibE      18    0.00     73.20    55.21    Riboflavin synthase alpha chain
BioB      18    0.04     72.32    43.54    Biotin synthetase
Pgk       18    0.00     67.50    43.10    Phosphoglycerate kinase
Eno       18    6.48     53.77    35.49    (2-Phosphoglycerate dehydratase) (2-phospho-D-glycerate hydro-lyase)
GapA      18    3.92     63.37    43.32    Glyceraldehyde-3-phosphate dehydrogenase A
AtpA      18    3.50     49.50    33.77    Adenosine triphosphate synthase alpha chain
CarA      20    0.05     63.92    40.78    Carbamoyl-phosphate synthase, small subunit
CarB      18    0.003    51.79    33.16    Carbamoyl-phosphate synthase, large subunit
Ffh       18    4.63     63.80    42.74    GTP-binding export factor binds to signal sequence, GTP, and RNA
MurA      20    2.86     59.00    43.51    UDP-N-acetylglucosamine 1-carboxyvinyltransferase
Average of SMPs
                2.50     61.21    38.79
GD representatives
GC%       20    0.114    40.30    14.04    Percentage of guanine plus cytosine within genomic DNA
CS(b)     20    0.00     82.00    31.93    Compositional spectra
PPCS(b)   20    0.005    85.00    27.53    Purine-pyrimidine compositional spectra
GS(c)     20    0.001    1.279    0.645    Genome signature
NOTE. Name, name of the protein in Escherichia coli; N, the number of orthologous proteins within the 20 selected Proteobacteria species; CD, the closest distance (between the two most identical orthologous sequences); LD, the largest distance (between the two most different orthologous sequences); AD, the average distance between orthologues; ATPase, adenosine triphosphatase; and GTP, guanosine triphosphate.
(a) Distance = 100 − (the percentage of identical amino acids), after Blast 2 Sequences alignment.
(b) CS and PPCS distances are normalized to a scale ranging from 0 to 100.
(c) GS distances.

Table 2
Correlations Between the Distance of the Orthologous Proteins and Dialect Representatives
Name      GC%      CS       PPCS     GS       Description
DTDPs
RpoA      0.317    0.330    0.586    0.672    RNA polymerase alpha subunit
RpoB      0.341    0.357    0.602    0.699    RNA polymerase beta subunit
RpoC      0.291    0.324    0.555    0.673    RNA polymerase beta prime subunit
DnaG      0.420    0.381    0.552    0.669    Primase
GyrA      0.341    0.313    0.386    0.667    DNA gyrase alpha chain
DnaE      0.349    0.300    0.401    0.606    DNA polymerase III alpha chain
DnaQ      0.260    0.267    0.488    0.624    DNA polymerase III epsilon chain
DnaB      0.310    0.332    0.479    0.660    Replicative DNA helicase
Rep       0.375    0.364    0.590    0.670    Replicative DNA helicase
DnaN*     0.466    0.453    0.731    0.727    DNA polymerase III beta chain (clamp)
PolA*     0.493    0.498    0.722    0.754    DNA polymerase I
UvrA*     0.457    0.460    0.709    0.730    Excision nuclease subunit A
UvrB*     0.441    0.439    0.703    0.742    Excision nuclease subunit B
UvrC*     0.522    0.510    0.735    0.757    Excision nuclease subunit C
UvrD*     0.467    0.424    0.736    0.680    DNA-dependent ATPase I and helicase II
MutS      0.387    0.377    0.623    0.721    DNA-binding, DNA-dependent ATPase
RecG*     0.527    0.436    0.666    0.693    Holliday junction helicase
RecQ*     0.474    0.534    0.681    0.671    Helicase
RuvA      0.417    0.345    0.603    0.633    Holliday junction DNA helicase
RecA*     0.472    0.477    0.672    0.742    Recombinase DNA strand exchange
SMPs
RpLA      0.429    0.367    0.505    0.689    50S ribosomal protein L1
RpLB      0.327    0.352    0.580    0.671    50S ribosomal protein L2
RpSB      0.309    0.311    0.470    0.581    30S ribosomal protein S2
RpSC      0.356    0.35     0.527    0.631    30S ribosomal protein S3
TufA      0.201    0.213    0.231    0.462    Protein chain elongation factor EF-Tu
AlaS      0.486    0.41     0.665    0.63     Alanyl-tRNA synthetase
AroK      0.348    0.373    0.508    0.748    Shikimate kinase I
DapA      0.297    0.315    0.461    0.527    Dihydrodipicolinate synthase
FolA      0.509    0.457    0.558    0.759    Dihydrofolate reductase type I
LipA      0.509    0.396    0.476    0.61     Lipoic acid synthetase
RibE      0.396    0.426    0.486    0.666    Riboflavin synthase alpha chain
BioB*     0.389    0.435    0.689    0.606    Biotin synthetase
Pgk       0.442    0.421    0.577    0.797    Phosphoglycerate kinase
Eno       0.44     0.479    0.621    0.753    (2-Phosphoglycerate dehydratase) (2-phospho-D-glycerate hydro-lyase)
GapA      0.302    0.341    0.528    0.636    Glyceraldehyde-3-phosphate dehydrogenase A
AtpA      0.359    0.304    0.453    0.615    ATP synthase alpha chain
CarA      0.498    0.490    0.640    0.747    Carbamoyl-phosphate synthase, small subunit
CarB      0.504    0.483    0.654    0.719    Carbamoyl-phosphate synthase, large subunit
Ffh       0.393    0.383    0.600    0.719    GTP-binding export factor binds to signal sequence, GTP, and RNA
MurA      0.423    0.387    0.624    0.699    UDP-N-acetylglucosamine 1-carboxyvinyltransferase
Average of all proteins
          0.401    0.39     0.577    0.676
NOTE. The names of the ten proteins with the highest correlation to PPCS are marked with an asterisk. Nine of these are involved in repair-recombination processes. ATPase, adenosine triphosphatase and GTP, guanosine triphosphate.
Correction for Time of Divergence
Columns 4 and 5 in table 3 represent the significance of differences between the selected subgroups and the control after correction for time of divergence. Column 4 demonstrates lower significance of the differences between the repair-recombination group and the control group in the correlations of their distances with the PPCS distances after time correction (compared to column 2). Column 5 demonstrates higher significance of the differences between repair-recombination and control proteins in their GS correlations after time correction (compared to column 3). The recombination helicases RecG and RecQ seem to be closer to the repair group than RecA and RuvA. Indeed, higher significance was obtained for the differences between the ‘‘repair + RecG + RecQ’’ group and the control group than for all repair-recombination enzymes, including RecA and RuvA.
Therefore, we conclude that species GD distances
(PPCS and GS distances) are highly correlated with repair-recombination protein distances but display lower correlations to other protein distances. Thus, it makes sense to
narrow our initial hypothesis. Namely, we suggest that the
foregoing evidence points to ‘‘coevolution of the GD and
DNA repair-recombination enzymes.’’
Table 3
Significance of the Differences Between Selected Protein Groups in the Correlations of Their Protein Distances to Dialect Distances (Mann-Whitney U test)
                                                Dialect Distances
Groups Compared                                 PPCS       GS         tPPCS      tGS
DTDPs (20) versus SMPs (20)                     0.0326     >0.3       >0.1       0.00941
RRPs (11) versus SMPs (20)                      0.00074    0.08291    0.00385    0.00295
RRPs (11) versus the rest (29)                  0.00001    0.016      0.00065    0.00346
Repair (7) versus SMPs (20)                     0.00032    0.0406     0.00567    0.00109
Recombination (4) versus SMPs (20)              0.01634    >0.6       >0.1       >0.3
(Repair + RecG + RecQ) (9) versus SMPs (20)     0.00009    0.06599    0.00114    0.00041
NOTE. tPPCS and tGS, dialect distances based on PPCS and GS, respectively, but corrected for time by using two-factor regression analysis, with species 16S rRNA distances as the second independent parameter. RRPs, DTDPs involved in repair and recombination; Repair, DTDPs involved in repair; Recombination, DTDPs involved in recombination; and RecG + RecQ, names of specific helicases. Numbers in parentheses are the numbers of proteins included in each group.
Discussion
Rationale and Possible Mechanisms That Might Cause
Coevolution of Species DNA Repair-Recombination
Enzymes and GD
In an experiment with an E. coli culture with a nonfunctional MMR system, the recombination rates found after 20,000 replications between the evolved population and the formerly identical lines were lower than the standard recombination rates (Vulic, Lenski, and Radman 1999).
This phenomenon was interpreted in terms of evolved
‘‘genetic barrier.’’ To explain these results, the authors postulated a considerable divergence of DNA organization in
the repair-defective population compared to the control. We
suggest that genomic variation displayed in the GD and repair-recombination enzymes derives from evolutionary
processes similar, to some extent, to the events in the aforementioned experiments with E. coli (Vulic, Lenski, and
Radman 1999).
A key process in prokaryotic evolution might be short-term bursts of genetic variation. Elevation of variation can
result from ‘‘attenuation’’ of both repair fidelity and constraints on nonhomologous recombination, causing increased mutation rate and recombination both within and
between species (Rayssiguier, Thaler, and Radman
1989). Although usually not advantageous, there may be
occasions where a relaxation of genome stability is adaptive: when the organism finds itself in ‘‘time of trouble,’’
i.e., in stress conditions (McClintock 1984; Hoffman and
Parsons 1991; Korol, I. A. Preygel, and S. I. Preygel
1994; Korol 2001). As the intensity, duration, and nature
of stress are highly variable, it may be advantageous for
a population to increase its genotypic heterogeneity, ensuring that at least a subset of organisms would survive the
stress. An increase in heterogeneity could be achieved
through mutations in DNA repair gene/genes, the ‘‘guards’’
of the quality of the genome text. These kinds of mutations
will reduce the fidelity of repair-recombination processes.
Some of the repair-recombination genes are known
to be ‘‘hot spots’’ for mutations due to the high levels of
simple sequence repeats (SSRs) within or very close to
these genes (Metzgar and Wills 2000; Li et al. 2002; Massey
and Buckling 2002; Rocha, Matic, and Taddei 2002). SSRs
are among the fastest evolving DNA sequences and are
known to increase variability during the processing of
genetic information (in replication and recombination but
also during gene expression [Li et al. 2004]). If in stress
situations a mutation in a DNA repair-recombination gene
occurs, an increase in the genome heterogeneity and GD
differentiation can be expected in future generations. Later,
reversion to high functionality and fidelity of the repair-processing gene may be selected in order to conserve and stabilize the gains of the change (Korol, I. A. Preygel, and S. I.
Preygel 1994). It is reasonable to assume that readaptation
and phenotypic reversion to high-fidelity ‘‘normal’’ functioning may be achieved by additional mutations of the previously mutated DNA repair gene as well as suppressing
mutations in other repair genes. Then, measuring the species
molecular differences will reveal corroboration of changes
in both the repair enzyme sequence and GD (it should be
remembered that the GD also differs now from the original).
The higher correlations found between the species
DNA repair and recombination proteins (RRPs) and PPCS
distances compared to the correlations between RRPs distances and either GC% content or CS distances suggest that
(at least in the Proteobacteria) the purine-pyrimidine sequence organization might be important in nongenic speciation. A role in nongenic speciation was proposed earlier for
species differences in GC% content (Forsdyke
2004).
What Might be the Reasons for High Evolutionary
Sensitivity of the DNA Repair-Recombination Enzymes
to the GD?
Both the external conditions and internal environment
of the cell of an organism seem to have a profound effect
on the layout of the repair system (Aravind, Walker, and
Koonin 1999). DNA is the substrate for the repair enzymes.
Unlike ‘‘regular’’ substrates of other (metabolic) enzymes,
this substrate is always changing. Therefore, we suggest
that in order to fulfill their roles with high speed and fidelity,
the structures of repair-recombination enzymes might be
more evolutionarily sensitive and ‘‘responsive’’ to changes
in the GD (i.e., DNA-abundant structures). Changes of
the GD might result from stress events or from the two
other possible triggers presented in figure 1. In the process
of DNA repair, some of the enzymes involved might use
specific short tracts of nucleotides that are dispersed in
their genome as targets for a nuclease function (like the
d(GATC) sequence for the MutH enzyme [Lahue, Au, and
Modrich 1989]) or as ‘‘anchors’’ or ‘‘toggles’’ (the octamer
sequence Chi [5′-GCTGGTGG-3′] for the RecBCD enzymatic complex in the MMR of E. coli [Myers and Stahl
1994]). Different species might have different ‘‘preferred’’
short DNA ‘‘signals’’ related to the repair and recombination processes (Chedin et al. 1998; Lao and Forsdyke
2000). It is reasonable to assume that these signals are related to the species-specific GD. Some of the DNA signal
targets involved in repair-recombination processes might be
simple, based on two-letter (purine/pyrimidine) discrimination (as proposed also for unwinding centers in DNA transcription or replication processes [Yagil, Shimron, and Tal
1998]).
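The two kinds of signal mentioned above, specific short tracts such as d(GATC) or Chi and signals defined only at the purine/pyrimidine level, can be illustrated with a minimal Python sketch. It recodes a sequence into a two-letter alphabet (R for purine, Y for pyrimidine) and counts overlapping occurrences of a word on one strand. The function names and the toy sequence are hypothetical and serve only as an illustration; this is not the PPCS procedure used in the study.

```python
# Purines (A, G) are recoded as R and pyrimidines (C, T) as Y.
_RY_TABLE = str.maketrans("AGCT", "RRYY")


def to_purine_pyrimidine(seq):
    """Recode a DNA string into the two-letter R/Y alphabet.

    Characters other than A, C, G, and T (e.g., N) are left unchanged.
    """
    return seq.upper().translate(_RY_TABLE)


def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps (one strand only)."""
    if not word:
        raise ValueError("word must not be empty")
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


# Made-up toy sequence containing two GATC sites and one Chi octamer.
toy = "GATCAGCTGGTGGATC"
gatc_sites = count_overlapping(toy, "GATC")
chi_sites = count_overlapping(toy, "GCTGGTGG")
ry_sites = count_overlapping(to_purine_pyrimidine(toy), to_purine_pyrimidine("GATC"))
```

Because several four-letter words collapse onto the same two-letter word (for example, GATC and AGCT both become RRYY), a purine/pyrimidine signal is necessarily less specific than its four-letter counterpart, which is why `ry_sites` exceeds `gatc_sites` in the toy sequence.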
There are sequence patterns that may create obstacles
during genome replication or reduce replication
fidelity and are nevertheless maintained because of their essential
roles at the RNA or protein levels (like microsatellite tandem repeats within genes) (Li et al. 2004). The repair
enzymes should be involved in resolving these genome replication problems. We think that this challenge,
together with the need to repair DNA damage that arises continuously
at any stage of the cell's life, might be an
additional reason why the repair group of enzymes displays
the highest correlation with GD within the DTDP group
(see tables 2 and 3). It is interesting that Parniewski
et al. (1999) showed that the major in vivo repair system
for DNA damage, the NER system, affects the stability
of long transcribed triplet repeat sequences in E. coli.
Our results corroborate this finding (all bacterial enzymes of the NER system described in the literature were
included in our study). In conclusion, we suggest that the
DNA repair-recombination enzymes may play an essential
role in GD differentiation as a potentially important step of
speciation at the whole-genome level. Conversely, variation in the GD may impose selection pressure on the evolution of the DNA repair-recombination system.
To What Extent Might Horizontal Gene Transfer of DNA
Maintenance Genes Affect the Evolution of GD?
In some DNA repair and recombination processes,
complexes between proteins are formed and hence high
structural fit between the components is needed. This
may impede the integration of an alien DNA repair gene
that is not identical (or not sufficiently similar) to the original gene of the recipient cell to fulfill the ‘‘native’’ gene
functions. Such lack of complementation might be one
of the reasons why genes involved in information processing seem to be rarely horizontally transferred (Jain, Rivera,
and Lake 1999). Although cases of duplication of some
DNA repair genes in Proteobacteria species resulting from
lateral gene transfer are known (Martins-Pinheiro et al.
2004), in these cases, the native original gene is highly conserved and remains functional. Davidsen and coauthors
(Davidsen et al. 2004) proposed that the transfer of
DNA maintenance genes between cells of the same species
may be an important stress-adaptation process in certain
prokaryotes (the three Proteobacteria species considered
in their work are also included in our study). Their argument
is that stress causes both loss of function of some DNA
maintenance genes in some cells and death and lysis of the
majority of the other cells in the colony. As a result,
both the probability that DNA maintenance genes from the
lysed cells are released into the vicinity of the survivors
and the survivors' need to take up these genes are elevated. The analysis conducted by the authors points to the
existence of structural features within maintenance genes
facilitating such transformation events. In particular, within
the species N. meningitidis, H. influenzae, and, possibly,
P. multocida, the distribution of oligonucleotides known
as DNA ‘‘uptake sequences’’ is biased toward genome
maintenance genes (DTDPs, in our terminology).
We believe that the foregoing model does not contradict our interpretations. Firstly, only a few bacterial species
use this mechanism (Redfield 2001). Secondly, Davidsen
et al. (2004) did not subdivide the DNA maintenance group
of genes (similar to our DTDP group) into replication-oriented and repair-recombination–oriented groups. We
speculate that there might be significant differences between such subgroups with respect to the presence of
DNA uptake sequences (in favor of the repair-recombination
group). Thirdly, stress-promoted transformation and recombination events might be accompanied by small
changes in these genes. Hence, again, the result of stress
might be a corroboration of changes in both the repair-recombination enzyme sequences and the GD.
Supplementary Material
Tables 4 and 5 are available at Molecular Biology and
Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
We thank D. R. Forsdyke, E. Rocha, E. Trifonov,
Y. Kashi, M. Soller, P. Capy, and two anonymous reviewers for their helpful comments and suggestions. This
work was supported by the Israeli Ministry of Absorption
and by the Ancell-Teicher Research Foundation for
Molecular Genetics and Evolution. A.P. was supported
by a scholarship in Bioinformatics from the Eshkol Foundation of the Israeli Ministry of Science and Technology.
Literature Cited
Aravind, L., D. R. Walker, and E. V. Koonin. 1999. Conserved
domains in DNA repair proteins and evolution of repair systems. Nucleic Acids Res. 27:1223–1242.
Bebenek, A., H. K. Dressman, G. T. Carver, S. Ng, V. Petrov, G.
Yang, W. H. Konigsberg, J. D. Karam, and J. W. Drake. 2001.
Interacting fidelity defects in the replicative DNA polymerase
of bacteriophage RB69. J. Biol. Chem. 276:10387–10397.
Becherel, O. J., R. P. Fuchs, and J. Wagner. 2002. Pivotal role of
the beta-clamp in translesion DNA synthesis and mutagenesis
in E. coli cells. DNA Repair (Amst.). 1:703–708.
Brendel, V., J. S. Beckmann, and E. N. Trifonov. 1986. Linguistics of nucleotide sequences: morphology and comparison of
vocabularies. J. Biomol. Struct. Dyn. 4:11–20.
Brocchieri, L. 2001. Phylogenetic inferences from molecular
sequences: review and critique. Theor. Popul. Biol. 59:27–40.
Burge, C., A. M. Campbell, and S. Karlin. 1992. Over- and underrepresentation of short oligonucleotides in DNA sequences.
Proc. Natl. Acad. Sci. USA 89:1358–1362.
Campbell, A., J. Mrazek, and S. Karlin. 1999. Genome signature
comparisons among prokaryote, plasmid, and mitochondrial
DNA. Proc. Natl. Acad. Sci. USA 96:9184–9189.
Chedin, F., P. Noirot, V. Biaudet, and S. D. Ehrlich. 1998. A five-nucleotide sequence protects DNA from exonucleolytic degradation by AddAB, the RecBCD analogue of Bacillus subtilis.
Mol. Microbiol. 29:1369–1377.
Coenye, T., and P. Vandamme. 2004. Use of the genomic signature in bacterial classification and identification. Syst. Appl.
Microbiol. 27:175–185.
Courcelle, J., and P. C. Hanawalt. 2001. Participation of recombination proteins in rescue of arrested replication forks in UV-irradiated Escherichia coli need not involve recombination.
Proc. Natl. Acad. Sci. USA 98:8196–8202.
Davidsen, T., E. A. Rodland, K. Lagesen, E. Seeberg, T. Rognes,
and T. Tonjum. 2004. Biased distribution of DNA uptake
sequences towards genome maintenance genes. Nucleic Acids
Res. 32:1050–1058.
Forsdyke, D. R. 1995. A stem-loop ‘‘kissing’’ model for the initiation of recombination and the origin of introns. Mol. Biol.
Evol. 12:949–958.
Forsdyke, D. R. 2004. Chromosomal speciation: a reply. J. Theor.
Biol. 230:189–196.
Forsdyke, D. R., and J. R. Mortimer. 2000. Chargaff’s legacy.
Gene 261:127–137.
Fox, G. E., E. Stackebrandt, R. B. Hespell et al. (19 co-authors)
1980. The phylogeny of prokaryotes. Science 209:457–463.
Gentles, A. J., and S. Karlin. 2001. Genome-scale compositional
comparisons in eukaryotes. Genome Res. 11:540–546.
Gogarten, P. J., E. Hilario, and L. Olendzenski. 1996. Gene duplications and horizontal gene transfer during early evolution.
Pp. 267–292 in D. M. Roberts, P. Sharp, G. Alderson, and
M. Collins, eds. Evolution of microbial life: 54th Symposium
of the Society for General Microbiology. Cambridge University
Press, Cambridge.
Gregg, A. V., P. McGlynn, R. P. Jaktaji, and R. G. Lloyd. 2002.
Direct rescue of stalled DNA replication forks via the combined action of PriA and RecG helicase activities. Mol. Cell
9:241–251.
Hoffman, A. A., and P. A. Parsons. 1991. Evolutionary genetics and
environmental stress. Oxford Science publications, Oxford.
Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene
transfer among genomes: the complexity hypothesis. Proc.
Natl. Acad. Sci. USA 96:3801–3806.
Karlin, S., and L. R. Cardon. 1994. Computational DNA sequence
analysis. Annu. Rev. Microbiol. 48:619–654.
Karlin, S., J. Mrázek, and A. M. Campbell. 1997. Compositional
biases of bacterial genomes and evolutionary implications.
J. Bacteriol. 179:3899–3913.
Kirzhner, V. M., A. B. Korol, A. Bolshoy, and E. Nevo. 2002.
Compositional spectrum revealing patterns for genomic
sequence characterization and comparison. Physica A 312:
447–457.
Kirzhner, V. M., A. B. Korol, E. Nevo, and A. Bolshoy. 2003. A
large-scale comparison of genomic sequences: one promising
approach. Acta Biotheor. 51:73–89.
Korol, A. B. 2001. Recombination. Pp. 53–71 in Simon A. Levin,
ed. Encyclopedia of biodiversity. Volume 5. Academic Press,
San Diego, Calif.
Korol, A. B., I. A. Preygel, and S. I. Preygel. 1994. Recombination variability and evolution. Chapman and Hall, London.
Lahue, R. S., K. G. Au, and P. Modrich. 1989. DNA mismatch
correction in a defined system. Science 245:160–164.
Lao, P. J., and D. R. Forsdyke. 2000. Crossover hot-spot instigator
(Chi) sequences in Escherichia coli occupy distinct recombination/transcription islands. Gene 243:47–57.
Li, Y., A. B. Korol, T. Fahima, A. Beiles, and E. Nevo. 2002.
Microsatellites: genomic distribution, putative functions and
mutational mechanisms: a review. Mol. Ecol. 11:2453–2465.
———. 2004. Microsatellites within genes: structure, function,
and evolution. Mol. Biol. Evol. 21:991–1007.
Lieb, M., and A. S. Bhagwat. 1996. Very short patch repair: reducing the cost of cytosine methylation. Mol. Microbiol.
20:467–473.
López de Saro, F. J., and M. O’Donnell. 2001. Interaction of the
beta sliding clamp with MutS, ligase, and DNA polymerase I.
Proc. Natl. Acad. Sci. USA 98:8376–8380.
Martins-Pinheiro, M., R. S. Galhardo, C. Lage, K. M. Lima-Bessa,
K. A. Aires, and C. F. Menck. 2004. Different patterns of
evolution for duplicated DNA repair genes in bacteria of the
Xanthomonadales group. BMC Evol. Biol. 4(1):29.
Massey, R. C., and A. Buckling. 2002. Environmental regulation
of mutation rates at specific sites. Trends Microbiol. 10:
580–584.
McClintock, B. 1984. The significance of responses of the genome
to challenge. Science 226:792–801.
Metzgar, D., and C. Wills. 2000. Evidence for the adaptive evolution of mutation rates. Cell 101:581–584.
Myers, R. S., and F. W. Stahl. 1994. Chi and the RecBCD enzyme
of Escherichia coli. Annu. Rev. Genet. 28:49–70.
Nussinov, R. 1984. Doublet frequencies in evolutionary distinct
groups. Nucleic Acids Res. 10:1749–1763.
Parniewski, P., A. Bacolla, A. Jaworski, and R. D. Wells. 1999.
Nucleotide excision repair affects the stability of long transcribed (CTG*CAG) tracts in an orientation-dependent manner in Escherichia coli. Nucleic Acids Res. 27:616–623.
Paz, A., D. Mester, I. Baca, E. Nevo, and A. Korol. 2004. Adaptive
role of increased frequency of polypurine tracts in mRNA
sequences of thermophilic prokaryotes. Proc. Natl. Acad.
Sci. USA 101:2951–2956.
Rayssiguier, C., D. S. Thaler, and M. Radman. 1989. The barrier
to recombination between Escherichia coli and Salmonella
typhimurium is disrupted in mismatch-repair mutants. Nature
342:396–401.
Redfield, R. J. 2001. Do bacteria have sex? Nat. Rev. Genet.
2:634–639.
Robu, M. E., R. B. Inman, and M. M. Cox. 2004. Situational repair
of replication forks: roles of RecG and RecA proteins. J. Biol.
Chem. 279:10973–10981.
Rocha, E., I. Matic, and F. Taddei. 2002. Over-representation of
repeats in stress response genes: a strategy to increase versatility under stressful conditions? Nucleic Acids Res. 30:1886–
1894.
Schofield, M. J., and P. Hsieh. 2003. DNA mismatch repair:
molecular mechanisms and biological function. Annu. Rev.
Microbiol. 57:579–608.
Tuteja, N., and R. Tuteja. 2004. Prokaryotic and eukaryotic DNA
helicases. Essential molecular motor proteins for cellular machinery. Eur. J. Biochem. 271:1835–1848.
Vulic, M., R. E. Lenski, and M. Radman. 1999. Mutation, recombination, and incipient speciation of bacteria in the laboratory.
Proc. Natl. Acad. Sci. USA 96:7348–7351.
Woese, C. 1987. Bacterial evolution. Microbiol. Rev. 51:221–271.
Yagil, G., F. Shimron, and M. Tal. 1998. DNA unwinding in the
CYC1 and DED1 yeast promoters. Gene 225:152–163.
Pierre Capy, Associate Editor
Accepted August 30, 2005