Transcription-Induced Mutational Strand Bias and Its Effect on Substitution
Rates in Human Genes
Carina F. Mugal,1 Hans-Hennig von Grünberg,1 and Martin Peifer2
Institute of Chemistry, Karl-Franzens University Graz, Graz, Austria
If substitution rates are not the same on the two complementary DNA strands, a substitution is considered strand
asymmetric. Such substitutional strand asymmetries are determined here for the three most frequent types of substitution
on the human genome (C → T, A → G, and G → T). Substitution rate differences between both strands are estimated
for 4,590 human genes by aligning all repeats occurring within the introns with their ancestral consensus sequences. For
1,630 of these genes, both coding strand and noncoding strand rates could be compared with rates in gene-flanking
regions. All three rates considered are found to be on average higher on the coding strand and lower on the transcribed
strand in comparison to their values in the gene-flanking regions. This finding points to the simultaneous action of rate-increasing effects on the coding strand—such as increased adenine and cytosine deamination—and transcription-coupled
repair as a rate-reducing effect on the transcribed strand. The common behavior of the three rates leads to strong
correlations of the rate asymmetries: Whenever one rate is strand biased, the other two rates are likely to show the same
bias. Furthermore, we determine all three rate asymmetries as a function of time: the A → G and G → T rate
asymmetries are both found to be constant in time, whereas the C → T rate asymmetry shows a pronounced time
dependence, an observation that explains the difference between our results and those of an earlier work by Green et al.
(2003. Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet. 33:514–517.). Finally, we
show that, in addition to transcription, the replication process also biases the substitution rates in genes.
Introduction
Substitutional strand asymmetry in DNA evolution
and its causes is a topic of long-standing interest (Francino
and Ochman 1997; Frank and Lobry 1999). Such a strand
asymmetry is given if substitution rates do not have identical values on the two complementary strands of the DNA.
Substitutional strand asymmetry results from the processes
of transcription and replication, which both distinguish between the two DNA strands. Replication-associated substitutional asymmetry has been found in bacteria (Lobry
1996b) and human mitochondrial DNA (Tanaka and Ozawa
1994), whereas transcription-induced substitutional asymmetry has been observed in enterobacterial genes (Francino
et al. 1996; Beletskii and Bhagwat 1998; Francino and
Ochman 2001) and mammalian genes (Green et al. 2003).
In the following, we will focus mainly on transcription-induced rate asymmetries and will only briefly discuss the
effect of replication on the observed rate asymmetries (Huvet
et al. 2007).
1 Present address: Institute of Chemistry, Karl-Franzens University Graz, Graz, Austria.
2 Present address: Max-Planck Institute for Neurological Research, Cologne, Germany.
Key words: substitution rate asymmetries, transcription-coupled repair, cytosine deamination.
E-mail: [email protected].
Mol. Biol. Evol. 26(1):131–142. 2009
doi:10.1093/molbev/msn245
Advance Access publication October 29, 2008
© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: [email protected]
We study such asymmetries on the human genome for
the three most frequent types of substitution, the two transitions C → T (with its complementary rate G → A) and
A → G (connected to T → C) and the transversion G → T
(complementary to C → A). The C → T transition is the
dominant mutation for many genes (Coulondre et al. 1978;
Lindahl 1993; Hutchinson and Donnellan 1998) and is particularly important on the human genome where in a genome-wide average it has a rate of a relative strength
0.93 (Karro et al. 2008). It is followed in strength by the
A → G rate (strength 0.47) and the rate G → T (strength
0.2). These rates can be associated with certain mutagenic
lesions on the genome. Hydrolytic deamination of cytosine,
5-methylcytosine (5-mC), and adenine; oxidation of guanine to 8-hydroxyguanine (8-oxoG); and strand breaks
and depurination are the major types of spontaneous directly mutagenic events in the living cell (Lindahl 1993).
Cytosine and 5-mC are the main targets for hydrolytic
deamination (Lindahl 1993). Deamination of C leads to uracil
U; uracil simulates T, pairs with A, and after two rounds of
replication the C → T substitution becomes persistent
(Francino and Ochman 2001) (see fig. 1). The resulting substitution rate, however, is naturally affected also by the efficiency of the corresponding repair enzymes, in the C → T
case the uracil–DNA glycosylase (Lindahl and Nyberg 1974;
Lindahl et al. 1977; Duncan and Miller 1980). Hydrolytic deamination of adenine to hypoxanthine (H) is one of the reactions that can be associated with the second rate considered
here, A → G. This reaction is known to be much slower than
cytosine deamination (Shapiro and Klein 1966; Karran and
Lindahl 1980; Zhang et al. 2007). Because H forms a more
stable base pair with C than with T, it is mutagenic; two
rounds of replication are again required to produce an A:T
to G:C transition (see fig. 1). As for uracil, there exists a hypoxanthine–DNA glycosylase in mammalian cells to release
hypoxanthine from double-stranded DNA (Karran and
Lindahl 1980; Dianov and Lindahl 1991). The third mutagenic base lesion important in the context of this work is
8-oxoG, a common product of oxidative damage to DNA.
It can mispair with adenine and thus induces G → T (and
C → A) transversions after replication (Shibutani et al.
1991). This lesion is repaired through base excision repair
by 8-oxoG DNA glycosylases (David et al. 2007), enzymes
that work quite efficiently such that the substitution frequency of 8-oxoG in both Escherichia coli and mammalian
cells is comparatively low (David et al. 2007). G → T transversions can also be caused by depurination if after the release
of a guanine the DNA polymerase inserts an adenine opposite
the abasic (AP) site (see again fig. 1).
FIG. 1.—Deamination of cytosine to uracil and of adenine to hypoxanthine causes C → T and A → G transitions, respectively. The presence of 8-oxoG can lead to G → T transversions, whereas loss of a purine residue may cause A → T (Y = A) or G → T (Y = G) transversions.
In this paper, these three types of substitutions will be
shown to occur with different frequencies on the two DNA
strands in transcribed regions of the human genome. During
transcription, two processes are currently thought to be responsible for the generation of strand asymmetric substitutions: transcription-coupled repair (TCR) and hydrolytic
cytosine deamination (Francino and Ochman 2001). The
transcription-induced deamination event occurs primarily
on the coding strand, which while transiently exposed during transcription is particularly vulnerable to deamination
(Beletskii and Bhagwat 1996, 1998; Beletskii et al.
2000). In contrast, the noncoding strand is not exposed
to the environment but protected by the RNA polymerase
and the nascent RNA; therefore, hydrolytic deamination is
less frequent on this strand. In fact, Francino and Ochman
(2001) have shown that the C → T rate on noncoding strands
in enterobacteria is even lower than the average value of this
rate in the untranscribed region—an effect that the authors
attribute to TCR. TCR corrects any kind of bulky lesions
(Takahasi et al. 2007) that block transcription by causing
the RNA polymerase to stall, which then leads to preferential repair of the lesion on the template strand of a gene.
Among other things, we here try to quantify the relative
contribution of TCR and other competing mechanisms to
the observed substitutional rate asymmetries in human transcribed regions.
Green et al. (2003) have recently discovered
transcription-induced asymmetries also in mammalian
germ-line genes. Based on comparative analyses of orthologous sequences of eight different mammals containing
nine genes (in total 1.5 Mb), these authors have determined
substitution rates and, subsequently, substitutional strand
asymmetries. They find that A → G and G → A (purine
transitions) are significantly higher on the coding strand
than T → C and C → T (pyrimidine transitions) and argue
that this bias may be caused by the mammalian TCR. Because this finding is not compatible with the transcription-induced substitutional asymmetry found in enterobacteria,
which was characterized by an increased C → T rate
on the coding strand (Francino and Ochman 2001), Green
et al. claimed to have found a transcription-induced strand
asymmetry that is based on a qualitatively different mechanism. Moreover, six of nine genes showed a significant
G + T excess over A + C on the coding strand that was
again coupled to transcription. This finding suggests the interpretation that this type of compositional asymmetry in
mammalian transcribed regions is caused by transcription-induced substitutional asymmetries—a hypothesis that was
further supported by correlating expression levels of germ
cell genes with the compositional asymmetry (Majewski
2003; Comeron 2004; Webster et al. 2004).
Green et al. (2003) observed that the same asymmetries were also seen when only substitutions in interspersed
repeats in the transcribed regions were considered. This result motivated the present work. To further test the hypothesis of Green et al., we estimated the rates and rate strand
asymmetries for C → T, A → G, and G → T for 4,590
human genes by aligning all repeats occurring within the
introns of each gene with their ancestral consensus sequences. Our main result is that all three rates C → T, A → G,
and G → T are significantly higher on the coding strand of
transcribed regions when compared with their values in the
genomic region adjacent to the genes (flanking region) and
considerably lower on the template strand when measured
relative to their values in the flanking region. This common
behavior of all three rates leads to interesting correlations of
the rates. In particular, we find that A → G and C → T are
both considerably larger on the coding strand than on
the noncoding strand. Whereas the first coding strand asymmetry ([A → G]cod > [A → G]noncod) agrees with the findings of Green et al., the second one ([C → T]cod > [C → T]noncod) does not, an apparent discrepancy that we will explain with a time dependence of the C → T rate.
In a recent study, Huvet et al. (2007) were able to identify a large number of ‘‘putative’’ human replication origins.
Genes in the neighborhood of these origins should show
rate asymmetries that result from the superposed effect
of both transcription and replication. Replication-associated
strand asymmetries can either increase or decrease transcription-associated asymmetries, depending on the transcription direction and the position of the gene relative
to the origin of replication. Resorting to the results of Huvet
et al., we will demonstrate in the last part of this paper that
our three rates are indeed biased also by replication.
Methods
We briefly present our method to estimate substitution
rates. It is the same method used in the work of Karro et al.
(2008), where the method and its validation are explained in
more detail. Our underlying model is a nonhomogeneous
Markov chain, similar to that used in other standard approaches. The four-dimensional time-dependent state vector (pA(t), pC(t), pG(t), pT(t)), defining the probability of
being in each state at time t, evolves according to a substitution rate matrix q. The transition probability matrix propagating a state vector a period t forward in time reads
$$P_c(t) = \exp\left(m\, q_c\, t\right), \qquad (1)$$
where m is the absolute substitution rate while the 4 × 4
rate matrix qc reflects the relative contributions of the different types of substitutions. One should remark that the
product mqc is in principle a quantity averaged over the
period t so that the rates within qc must be considered as
time-averaged rates, with the time t in Pc(t) determining
the period over which the rates are averaged. We estimated
rates either gene by gene (the index c refers to the individual
gene) or, using the repeat data of the entire genome, in a genome-wide analysis (index c is then suppressed). For qc, we
used a 4 × 4 rate model with 12 independent substitution
rates for all possible mutual replacements within the group
of four base nucleotides. All estimates of rates are based on
the transitional probability matrix Pc(t) which we fit to RepeatMasker-generated repeat data. Human repeat information was extracted by RepeatMasker version 3.1.2 and RM
database version 20051025. Genome builds used were
downloaded from the University of California–Santa Cruz
(UCSC) browser for hg18 (human). Low-complexity repeats were removed. We were then left with M = 857 repeat families. The RepeatMasker data provide the
reconstructed ancestor of each repeat family a (a = 1,
..., M) as well as a pairwise alignment between each modern instance and the family's reconstructed ancestor that
was inserted into the genome a time ta ago. Alignment positions involving gaps are discarded. For each repeat family
a and each gene c, we determine from repeat alignments the
matrix components of Pc(ta) through a maximum likelihood
fit. From this maximum likelihood fit, the substitution rates
are obtained (which are quantities time averaged over the
period ta). We considered just one strand; rate differences
between two strands were obtained from rate differences
between a rate and its complementary rate. We first performed a genome-wide maximum likelihood fit of the M
numbers mta and the 12 rates in q on the right-hand side
of equation (1) to the matrices P(ta) on the left-hand side
of equation (1). Then the procedure was repeated for each
gene c but keeping mta fixed to the corresponding genome-wide values, resulting in the 12 rates for every gene. The
resulting pool of rates was then averaged over all M repeat
families in cases where all repeat families were included but
only over the corresponding fraction of families in cases
where the analysis was based on just a subset of families
(such as in fig. 6). One can finally compute the relative rate
asymmetries for gene c by dividing the difference between
a rate and its complementary rate by the corresponding genome-wide rate.
The LogDet time distance used in figure 6 is a measure
for the time distance proportional to the time ta at which
repeat family a was inserted into the genome. It can be obtained directly from the genome-wide transitional probability matrix P(ta) without any further fitting step (see Karro
et al. 2008).
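Why such a distance is proportional to time can be illustrated with a small Python sketch of equation (1). It is a sketch under stated assumptions, not the procedure of Karro et al. (2008): the normalization −(1/4) ln det P, the row convention (rows are source bases), the helper names, and every rate except the three quoted strengths (0.93, 0.47, 0.2) are assumptions made for illustration. Because det exp(A) = exp(tr A), the quantity −(1/4) ln det P(t) equals −(m t / 4) tr(q) and therefore grows linearly with t.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, squarings=10, terms=20):
    """Matrix exponential: Taylor series on a / 2**squarings, then squaring."""
    n = len(a)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    n = len(m)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[p][c] == 0.0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n):
                m[r][j] -= f * m[c][j]
    return d


def rate_matrix(off_diag):
    """Fill the diagonal so that every row sums to zero."""
    q = [row[:] for row in off_diag]
    for i in range(len(q)):
        q[i][i] = -sum(q[i][j] for j in range(len(q)) if j != i)
    return q


def transition_matrix(q, m, t):
    """Equation (1): P(t) = exp(m q t)."""
    return mat_exp([[m * t * x for x in row] for row in q])


def logdet_distance(p):
    """Assumed normalization: -(1/4) ln det P."""
    return -0.25 * math.log(det(p))


# Base order A, C, G, T. Illustrative strand-symmetric rates; the value 0.1
# for the remaining transversions is made up.
q = rate_matrix([[0.0, 0.1, 0.47, 0.1],
                 [0.2, 0.0, 0.1, 0.93],
                 [0.93, 0.1, 0.0, 0.2],
                 [0.1, 0.47, 0.1, 0.0]])
d1 = logdet_distance(transition_matrix(q, m=1.0, t=0.1))
d2 = logdet_distance(transition_matrix(q, m=1.0, t=0.2))  # twice d1
```

The pure-Python helpers stand in for a numerical library routine so that the sketch has no dependencies; they are adequate for 4 × 4 matrices with small m t but are not meant for general use.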
A frequently occurring context-dependent substitution
process in the mammalian genomes is CpG to TpG/CpA—a
dinucleotide rate that as such is not explicitly included in
our neighbor-independent evolutionary model. For this mutation to occur, the cytosine base needs to be located next to
a G, and it must be methylated. To cope with the CpG effect, we have systematically blocked all CpG sites, as Green
et al. (2003) have done. But even if we included all CpG
sites in our analysis, the results hardly changed.
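One way to block CpG sites when reading a repeat alignment is sketched below. The text does not say how the blocking was implemented, so this is an assumption throughout: CpG dinucleotides are detected on the ancestral consensus sequence, both positions are masked (a CpG on one strand is also a CpG on the complementary strand), and gap columns are dropped. Function names and the example sequences are hypothetical.

```python
def cpg_blocked_positions(ancestor):
    """Indices of every C and G that belongs to a CpG dinucleotide."""
    blocked = set()
    for i in range(len(ancestor) - 1):
        if ancestor[i] == "C" and ancestor[i + 1] == "G":
            blocked.update((i, i + 1))
    return blocked


def usable_pairs(ancestor, modern):
    """Aligned (ancestral, modern) bases, skipping gaps and CpG sites.

    Simplification: a CpG interrupted by a gap character in the
    alignment is not detected.
    """
    blocked = cpg_blocked_positions(ancestor)
    pairs = []
    for i, (a, m) in enumerate(zip(ancestor, modern)):
        if a == "-" or m == "-" or i in blocked:
            continue
        pairs.append((a, m))
    return pairs


# Hypothetical aligned sequences of equal length.
pairs = usable_pairs("ACGTTCA-G", "ATGTTTAAG")
```

In this toy alignment the C → T change inside the ancestral CpG is ignored, while the C → T change outside a CpG context is kept.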
A statistical test was developed to distinguish genes
with high rate asymmetries from those that rather follow
a strand symmetric substitution pattern. To this end, we fitted an asymmetric and a symmetric model to the genomic
data. The asymmetric model has 12 independent rates,
whereas in the symmetric model, every rate is set equal
to its complementary rate, resulting in just six independent
rates within qc in equation (1). Clearly, the more complex
model having more fitting parameters will fit the data better.
We therefore used the Bayesian information criterion
(Schwarz 1978) to decide whether or not the asymmetric
model describes the underlying substitution process better.
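The model comparison can be sketched as follows, using the standard form of the criterion, BIC = k ln n − 2 ln L, where the model with the lower value is preferred. What counts as the number of observations n (for example, the number of aligned sites of a gene) is not stated in the text and is an assumption here, as are the function names and the log-likelihood values.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def prefers_asymmetric(loglik_asym, loglik_sym, n_obs, k_asym=12, k_sym=6):
    """True if the 12-rate model beats the 6-rate model under BIC."""
    return bic(loglik_asym, k_asym, n_obs) < bic(loglik_sym, k_sym, n_obs)


# Hypothetical log-likelihoods for a gene with 10,000 aligned sites.
strong = prefers_asymmetric(-5000.0, -5040.0, n_obs=10_000)  # True
weak = prefers_asymmetric(-5000.0, -5010.0, n_obs=10_000)    # False
```

With six extra parameters, the asymmetric model is selected only if its log-likelihood gain exceeds 3 ln n, which is about 27.6 for n = 10,000.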
Finally, presented correlations are quantified with
Pearson's moment correlation coefficients (r) having a
P value of P < 10^-4 to be different from zero if not otherwise stated.
Results
We downloaded all transcribed regions from chromosomes 1–22 of the human genome from Known Gene at
UCSC (March 2006). As outlined in the Methods, we determined the six rates (A → C, A → T, C → G, C → A,
A → G, and C → T) together with their complementary
rates by aligning the ancestor sequence of each repeat family (857 families) with the individual repeat copies occurring within the intronic regions of each gene. CpG sites
were systematically ignored. To ensure proper statistics,
we considered only those 4,590 genes (i.e., roughly 20%
of all genes) that have more than 40 repeat copies included
in their introns (on average 78 copies per gene). In total, our
analysis is based on repetitive elements of an accumulated
length of about 350 Mb (360,000 repeat copies). Based on
the 12 estimated rates, one can compute rate strand asymmetries for each gene by computing the difference between
a rate and its complementary rate which is the same as the
difference between a rate on one strand and the same rate on
the other strand. From the resulting data, we selected 1,070
genes that show particularly high rate asymmetries. The statistical test that we developed to select these genes is described in the Methods.
FIG. 2.—Correlations between rate strand asymmetries for the three rates A → G, C → T, and G → T. A rate asymmetry is defined to be the
difference in substitution rates on the two complementary DNA strands. Data are given in percent by dividing these rate differences by the genome-wide averaged rate. Every point corresponds to one human gene. From the initially 4,590 genes considered (gray data points), 1,070 genes (black data
points) are selected showing a particularly high rate asymmetry. Correlation coefficients: r = 0.67 for A → G versus G → T, r = 0.50 for C → T
versus G → T, and r = 0.61 for C → T versus A → G.
Figure 2 shows the correlations between rate asymmetries for the three main rates A → G, C → T, and G → T.
The rate asymmetries are given in percent by relating the
rate differences between the two strands to the genome-wide averaged rate. For the genome-wide averaged rates,
no strand asymmetries were found (i.e., the rate differences
between both strands were below the statistical error of our
method). The plot documents that all three rates are highly
correlated: If one rate is significantly higher (lower) on one
strand, then also the rates of the two other substitutions are
consistently higher (lower) on that strand. Further below,
this result will be explained with similarities in the molecular mechanisms causing the asymmetries.
The 1,070 genes selected for their high rate asymmetries are highlighted by black crosses in figure 2. Our test
identifies this subgroup of genes using a substitution-type-averaged criterion for rate asymmetry. To visualize the extent to which the individual type of substitution is affected
by rate asymmetries, figure 3 shows histograms over rate
asymmetries for all six types of substitutions. The differences for the two transversions C → G and A → C show
unimodal distributions, whereas the distributions of all
other rates are clearly bimodal. A unimodal distribution
can either mean that there is no rate asymmetry at all for
that substitution type or that it is too small in magnitude
for our resolution. It is evident that relative to its genome-wide value the A → G transitional substitution
shows the strongest asymmetry, with values that can become as high as 50% of the genome-wide rate. C → T
and G → T show rate splittings of roughly the same order
of magnitude; a weak rate asymmetry can also be observed
for the A → T transversion.
Asymmetries have been claimed to correlate with
the direction of transcription (Beletskii and Bhagwat
1996; Francino and Ochman 2001; Green et al. 2003;
Touchon et al. 2003). To test whether the rate differences
[X → Y]strand 1 − [X → Y]strand 2 in figure 2 support this
idea, we determined the direction of transcription of all
genes and assigned this information to the rate asymmetries.
Figure 4 shows again the correlation diagram of figure 2
but now with a key that indicates whether strand 1 in
[X → Y]strand 1 − [X → Y]strand 2 is the coding or the noncoding strand. For all three types of substitution, the differences [X → Y]strand 1 − [X → Y]strand 2 tend to be positive
if strand 1 is the coding strand and rather negative otherwise. Both findings are summarized by the statement that
[X → Y]cod − [X → Y]noncod tends to be positive. In
words, all three rates tend to be larger on the coding strand
compared with their values on the transcribed strand.
So far, we have compared rates on the two strands with
respect to each other. To decide whether these rates decrease or increase relative to their environment, we next
compare the rates in the transcribed regions with rates determined in the immediate untranscribed neighborhood
(gene-flanking region). Processes such as replication can
produce rate asymmetries both in transcribed and untranscribed regions (Huvet et al. 2007). Thus, taking the difference between rates in genic and gene-flanking regions is
also a convenient way to filter out the effect of these underlying processes. The flanking regions were located on either
side of the transcribed region, with a sequence length ranging between 100 kbp and 1 Mbp. Genes without a flanking
region (of a length of at least 100 kbp) were not further considered, resulting in a subset of 1,733 genes from the initial
pool of 4,590 genes.
It was found that rate asymmetries within the untranscribed flanking regions were only weakly developed,
showing for all rates rather unimodal distributions similar
to those for C → G and A → C in figure 3. Only 103 flanking regions (6% of all flanking regions) showed particularly
high rate asymmetries according to our statistical test. The
corresponding genes were also excluded from the subsequent analysis, such that the comparison was performed
on the basis of 1,630 genes having flanking regions that
1) are of sufficient length and 2) show only rather weak rate
asymmetries. In figure 5, we show—for all three rates and
for coding and noncoding strands separately—the percent
difference between the rates in the transcribed region and
those in the flanking region. These are values obtained after averaging over all 1,630 genes considered. To document that also in this comparison the rate asymmetries
from figures 2–4 have an effect, we next applied our statistical test to identify among the 1,630 genes those 356
genes with particularly high asymmetries (see Methods)
and computed averaged rates also on the basis of this subset (column in fig. 5 labeled asymmetric regions). The
whole procedure was then repeated, computing rates only
on the basis of a subgroup of the 50 youngest repeat families (see below).
FIG. 3.—Normalized distributions of substitution rate strand asymmetries on 1,070 human genes with exceptionally high rate asymmetries. The
upper row of graphs shows histograms of the three types of substitutions considered in this work.
All three rates considered (C → T, A → G, and G → T)
are found to be consistently higher on the coding strand
of genes in comparison to their values in the flanking region. The difference becomes considerably larger if only
genes showing high rate strand asymmetries are included;
now rate differences between genic and flanking regions
can become as high as 40% of the mean genome-wide rate
(for the C → T and A → G transitions in fig. 5). If only the
class of youngest repeats is used in our analysis, the result
remains essentially unaltered for the A → G and G → T
rates but changes for the C → T transitions—a finding that
further below will be explained with a time dependence of
C → T transition rates. A similar result is obtained for the
noncoding strand. Figure 5 shows that on the noncoding
strand, all rates are lower when compared with their values
in the flanking region. For A → G, this reduction is more
pronounced if only genes showing high rate asymmetries
are considered, whereas for C → T and G → T, the particular choice of the subgroup of genes hardly affects the
results. Taken together, figure 5 confirms that the rates
C → T, A → G, and G → T are increased on the coding
strand (and decreased on the noncoding strand) not only
with respect to the rates on the complementary strand (as
has been shown in fig. 4) but also with respect to the rates
in the flanking region.
FIG. 4.—Same data as in figure 2 but with a different key to identify direction of transcription: If strand 1 is the coding strand, points are plotted
with black crosses, else if strand 1 is the noncoding strand, gray squares are chosen.
FIG. 5.—Relative rate differences between rates in the transcribed genic regions and in the intergenic regions adjacent to these transcribed regions
(flanking regions), averaged over all 1,630 genes considered (column labeled ‘‘all transcribed regions’’) and over a subset of 356 genes showing high
rate asymmetries (column labeled asymmetric regions). Data are based on an analysis including either all repeat families or just the subgroup of
youngest repeat families (columns labeled ‘‘young repeats’’). The error bars represent the 95% confidence intervals.
The large pool of available repeat families and repeat
copies permits us to use just a fraction of these families for
estimating rates. This opens up interesting perspectives to
derive information on the time dependence of rates. We
therefore determined for each repeat family its average LogDet time distance (Karro et al. 2008) (the ‘‘age’’ of the family) and sorted all families according to their age. We then
took a package of 50 families from this list, determined the
mean age of these families, and estimated rates on the basis
of just this group. Shifting the package of 50 families stepwise downward in the list, repeating each time the procedure to estimate rates, one obtains rates as a function of the
mean age of the respective package of families. The result
of this procedure is shown in figure 6. The rates represent
averages over all 4,590 genes. The x axis can be understood
as time axis, where a point in time must be considered as
a representative of the group of repeat families used for the
analysis. The set of 50 youngest (oldest) repeat families has
a mean time distance of about 0.07 (0.38), which is where
the curves in figure 6 are terminated. Rates based on a group
of repeat families with a mean age T are to be understood as
quantities averaged over the time period T. Thus, rates
based on the group of oldest repeat families are averaged
over the period T = 0.38, whereas those obtained with
the group of youngest repeats are averaged only over the
time period T = 0.07.
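The sliding-package procedure can be sketched as follows. This is a minimal illustration, assuming that family ages are available as a dictionary and that the package is shifted by one family per step; the step size, the function name, and the toy ages are assumptions, and the rate estimation that would be run on each package is left out.

```python
def sliding_age_windows(family_ages, window=50, step=1):
    """Return (mean_age, family_ids) for packages of age-sorted families.

    family_ages maps a repeat family identifier to its age (for example
    its LogDet time distance). Each package holds `window` consecutive
    families from the age-sorted list.
    """
    ordered = sorted(family_ages.items(), key=lambda item: item[1])
    windows = []
    for start in range(0, len(ordered) - window + 1, step):
        chunk = ordered[start:start + window]
        mean_age = sum(age for _, age in chunk) / window
        windows.append((mean_age, [fam for fam, _ in chunk]))
    return windows


# Toy example: 60 hypothetical families with evenly spaced ages.
ages = {f"fam{i}": 0.01 * i for i in range(1, 61)}
windows = sliding_age_windows(ages, window=50)
youngest_mean = windows[0][0]   # mean age of the 50 youngest families
oldest_mean = windows[-1][0]    # mean age of the 50 oldest families
```

Each returned mean age would serve as one point on the time axis, with the rates estimated from the families listed alongside it.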
Figure 6 demonstrates that for both the A → G transition and the G → T transversion, there is hardly any dependence of the rate asymmetries on the age of the repeat
group used in the analysis. For the C → T transition, however, the analysis based on older repeats leads to the asymmetry [C → T]cod > [C → T]noncod, whereas, consistent
also with the comparison in figure 5, an analysis based
on younger repeat families suggests that there is no rate
strand asymmetry for this transition. The good statistics
we have is demonstrated by the fact that even though
the rates for coding and noncoding strands are based on
completely disjoint sets of repeat data, the two resulting
curves are almost mirror inverted to each other with respect
to the zero baseline.
To study the interdependence between substitutional
rate asymmetry and compositional asymmetry, we show
in figure 7 the total compositional skew S = (T − A)/(T + A) + (G − C)/(G + C) as well as the G + T content of each
gene versus the total substitutional rate asymmetry. To
ensure that the calculations on each axis are independent,
the G + T content has been determined from the introns after
excluding all repeats. For the values on the x axis, the data
from figure 4 are taken, multiplied with −1 if strand 1 is the
noncoding strand and with +1 otherwise, such that the rate
differences in figure 7 now refer to a rate at the coding
strand minus the same rate at the noncoding strand, in that
order. We find that the compositional asymmetry is high
whenever the substitutional asymmetry is increased, a result
that we will compare to the findings of Green et al. (2003) in
the next section.
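The quantities plotted against each other can be sketched in Python. The sketch assumes the usual form of the total skew, S = (T − A)/(T + A) + (G − C)/(G + C), evaluated on the coding strand of repeat-free intronic sequence; the function names and the toy sequence are hypothetical.

```python
def composition(seq):
    """Counts of A, C, G, T; any other character is ignored."""
    seq = seq.upper()
    return {base: seq.count(base) for base in "ACGT"}


def total_skew(seq):
    """TA skew plus GC skew (assumed definition of S)."""
    n = composition(seq)
    ta_skew = (n["T"] - n["A"]) / (n["T"] + n["A"])
    gc_skew = (n["G"] - n["C"]) / (n["G"] + n["C"])
    return ta_skew + gc_skew


def gt_content_percent(seq):
    """G + T content in percent of all A, C, G, T positions."""
    n = composition(seq)
    return 100.0 * (n["G"] + n["T"]) / sum(n.values())


def coding_minus_noncoding(strand1_minus_strand2, strand1_is_coding):
    """Orient a strand 1 minus strand 2 difference as coding minus noncoding."""
    return strand1_minus_strand2 if strand1_is_coding else -strand1_minus_strand2


# Toy sequence: T=3, A=1, G=3, C=1.
s = total_skew("TTTAGGGC")
gt = gt_content_percent("TTTAGGGC")
```

The skew functions divide by T + A and G + C, so a real implementation would need to guard against sequences that lack one of these pairs.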
FIG. 6.—Rate asymmetries between the two complementary DNA strands averaged over all genes considered and plotted as a function of the mean
age of the respective group of repeats which the rate estimation procedure is based on. Color code as defined in figure 4. The error bars (dashed lines)
represent the 95% confidence intervals. The arrow is explained in the text.
We finally wanted to examine the influence of replication on our results. Replication is generally known
as an additional process leading to substitutional rate asymmetries. Touchon et al. (2005) have recently shown that, as
in prokaryotes, genomic regions in mammalian cells
situated 5′ of an origin of replication (ORI) tend to have
a negative skew, whereas regions situated 3′ of an ORI
show on average a positive skew. Thus, when comparing
the leading with the lagging strand of replication they have
found compositional strand asymmetries that are quite similar to those on the noncoding and coding strands of a gene.
They concluded that for genes located in the neighborhood
of ORIs, the strand asymmetries associated with replication
and transcription superpose each other. The compositional
skew is amplified if the coding strand happens to be the
lagging strand or attenuated otherwise. This effect should
be dependent on the distance between gene loci and the
nearest ORI.
FIG. 7.—Compositional skew S (left panel) and current G + T content (right panel) of the introns of each gene versus the total substitutional rate
asymmetry, defined as the sum of the three differences [C → T]cod − [C → T]noncod, [A → G]cod − [A → G]noncod, and [G → T]cod − [G → T]noncod.
Correlation coefficients are r = 0.80 and r = 0.83 for the correlations in the left and right plots, respectively.
FIG. 8.—Substitutional asymmetries are also affected by replication: Rate asymmetries correlated with the weighted distance of a gene's center to
the midpoint between its nearest left and right origin of replication. One data point for each of the 1,088 genes that are positioned within one of the N-domains determined by Huvet et al. (2007). Correlation coefficients are given in the plots. [Panels: a) difference between [G → T], r = 0.22; b) difference between [A → G], r = 0.34; c) difference between [C → T], r = 0.49. y axis: rel. difference [X → Y]cod − [X → Y]noncod; x axis: putative bias of replication.]
Following Touchon et al. (2005), we may assume that,
averaged over many replication processes, the counteracting effects of replication starting from two neighboring ORIs
cancel each other exactly at the midpoint between these two
ORIs. We furthermore can assume that the replication-associated bias decreases linearly from its largest value
(directly at the ORI) to zero (half way to the next ORI).
Under these assumptions, we identified those 1,088 genes
(from our pool of 4,590 genes) having positions within
one of the 678 ‘‘N-domains’’ identified by Huvet et al.
(2007), where the word N-domain refers to the region between two neighboring ORIs. For each of these genes, we
then determined the relative distance b between the center
of a gene and the center of its respective N-domain (ranging from −1 to +1) and multiplied this distance by +1 if
lagging and coding strand were the same and by −1 otherwise. Figure 8 correlates the rate asymmetries for the
1,088 genes with this weighted distance b. For all three
substitution types, we find significant correlations, with
correlation coefficients ranging between 0.22 and 0.49
as indicated in the figure. This result supports the idea
that—depending on the distance to the next ORI and
the relative direction of transcription and replication—the
rate asymmetries of genes are to some extent also produced by the process of replication.
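The construction of the weighted distance b and its correlation with a rate asymmetry can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for figure 8: the function names, the coordinate convention (domain boundaries taken as the two ORI positions), and the toy gene list are all hypothetical, and whether the coding strand is the lagging strand is supplied as an input rather than derived.

```python
from math import sqrt


def weighted_distance(gene_center, domain_start, domain_end, coding_is_lagging):
    """Return the weighted distance b for one gene.

    The relative distance runs from -1 (ORI at domain_start) through 0
    (domain midpoint) to +1 (ORI at domain_end). It is multiplied by +1
    if the coding strand is the lagging strand and by -1 otherwise.
    """
    half_length = 0.5 * (domain_end - domain_start)
    if half_length <= 0:
        raise ValueError("domain_end must be larger than domain_start")
    midpoint = domain_start + half_length
    relative = (gene_center - midpoint) / half_length
    return relative if coding_is_lagging else -relative


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy input, not data from the study:
# (gene center, domain start, domain end, coding strand is lagging)
genes = [(120, 100, 300, True), (180, 100, 300, False), (260, 100, 300, True)]
b_values = [weighted_distance(*gene) for gene in genes]
```

Given one rate asymmetry per gene in the same order, `pearson_r(b_values, asymmetries)` would yield a coefficient of the kind reported in the figure.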
Discussion
Asymmetry 1: Increased Rates on the Coding Strand of
Transcribed Genes
All three rates considered (C → T, A → G, and G → T)
are found to be on average higher on the coding strand in
comparison to their values both in the gene-flanking regions
and on the noncoding strand. Following Beletskii and
Bhagwat (1996, 1998) and Beletskii et al. (2000), this observation could be a consequence of the fact that during
transcription the coding strand is transiently exposed to
the environment and—being single stranded—is then particularly vulnerable to mutagenic reactions such as deamination or depurination. Indeed, a number of experiments
confirm that single strandedness can significantly enhance
mutational rates and that enhanced rates are therefore to be
expected in regions of single-stranded DNA generated during replication and transcription (Karran and Lindahl 1980;
Lindahl 1993). For example, the cytosine deamination rate
at 37 °C for human double-stranded DNA is 7 × 10^-13 1/s
but as high as 10^-10 1/s if the DNA is single stranded, that
is, larger by a factor of about 140 (Frederico et al. 1990).
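The size of this effect can be checked with a few lines of arithmetic. The sketch below uses only the two rate constants quoted above; the conversion to a half-life assumes simple first-order decay and is an illustration added here, not a quantity taken from the cited experiments.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Cytosine deamination rate constants at 37 °C as quoted in the text, in 1/s.
K_DOUBLE = 7e-13  # double-stranded DNA
K_SINGLE = 1e-10  # single-stranded DNA


def fold_increase(k_single, k_double):
    """Ratio of the single-stranded to the double-stranded rate constant."""
    return k_single / k_double


def half_life_years(k):
    """Half-life in years of an intact base decaying at first-order rate k (1/s)."""
    return math.log(2) / k / SECONDS_PER_YEAR


ratio = fold_increase(K_SINGLE, K_DOUBLE)
```

The ratio comes out slightly above 140, consistent with the factor stated in the text; under the first-order assumption the half-life of an individual cytosine drops from tens of thousands of years in double-stranded DNA to a few hundred years in single-stranded DNA.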
Although enhanced cytosine deamination on single and exposed strands can account for our finding of increased C → T transitions on the coding strand, the increased A → G transition rate is more difficult to explain because the situation with adenine deamination in single-stranded DNA is less clear. Under comparable conditions, the deamination of adenine residues on single strands was 40 times slower than that of cytosine (Karran and Lindahl 1980). If adenine deamination were the only mutagenic effect related to the A → G transition, then this factor of 40 would be in conflict with our results because figure 5 clearly shows that the A → G rate has an even higher relative increase on the coding strand than the C → T transition rate, a conflict that becomes even more pronounced if absolute rate differences are considered. Therefore, mutagenic effects other than adenine deamination must have an impact on the A → G rate asymmetry.
The increased G → T transversion rate on the coding strand could mean that, just like the deamination rate, the rate of creation of mutagenic 8-oxo-G is increased on single strands. In addition, mutagenesis induced by AP sites could also play a role. It is known that considerable amounts of purine bases are continuously released from double-stranded DNA (Lindahl and Nyberg 1972; Lindahl 1993). Depurination of a guanine with a preferred insertion of adenine opposite such an AP site would then lead to a G → T transversion (see fig. 1). Such a substitution mechanism would indeed be strongly affected by the single strandedness of the coding strand during transcription: It is known that the depurination rate jumps from 3 × 10^-11 1/s on double-stranded DNA to 10^-10 1/s on single-stranded DNA (Lindahl and Nyberg 1972), which makes depurination the only mutagenic reaction with rates as high as the single-stranded cytosine deamination rate. This mechanism requires a DNA polymerase that preferably inserts an A opposite an AP site. In mammalian cells, conflicting reports suggest either no specificity in the insertion of bases opposite the AP site (Gentil et al. 1992) or the preferential incorporation of an A opposite an AP site (Shibutani et al. 1997; Avkin et al. 2002; Simonelli et al. 2005), which would produce either an A → T transversion (if A was released) or a G → T transversion (if G was released). If from time to time a C was also inserted opposite an AP site, then this AP-induced mutation mechanism would also contribute to the A → G transition rate.
Asymmetry 2: Reduced Rates on the Noncoding Strand
of Transcribed Genes
All three rates tend to be lower on the noncoding
strand when compared with their values in the flanking
regions. The fact that substitution rates on the transcribed
strand of genes are considerably below their values in untranscribed regions can only mean that a mechanism is at work that removes mutagenic lesions from the transcribed strand of active genes. Such TCR mechanisms have long been known and have been shown to actively repair DNA damage, for example, the products of guanine oxidation (Mellon et al. 1987; Svejstrup 2002; Kuraoka et al. 2003; Saxowsky and Doetsch 2006). Our results in figure 5 suggest that TCR must also operate on the damage causing the A → G and C → T mutations. Candidate enzymes for TCR include the recently described NEIL enzymes that exhibit a preference for excising 8-oxoG from single-stranded DNA and from bubble structures rather than from intact double-stranded DNA, prompting the suggestion that these enzymes may play a role in the excision of damage from a transcription bubble (Dou et al. 2003; Hazra et al. 2007). Reported substrates of these enzymes also include 5-hydroxyuracil, an oxidative product of cytosine (Thiviyanathan et al. 2005), which like U pairs preferentially with adenine, thus leading to the same mutagenic effect.
The explanation that TCR reduces the rates on the transcribed strand is not unproblematic. It is thought that a lesion must block the RNA polymerase progression for TCR to occur (Scicchitano et al. 2004). However, the mutagenic lesions underlying our three rates do not usually block RNA polymerase and should therefore not be able to trigger the TCR mechanism. Specifically, high transcriptional bypass of uracil has been reported in vivo (Viswanathan et al. 1999); high lesion bypass efficiency has also been observed for AP sites, both during transcription (Kuraoka et al. 2003) and replication (Avkin et al. 2002); and even oxidatively damaged bases, such as 8-oxoG, do not usually stall the RNA polymerase (Saxowsky and Doetsch 2006). And still, TCR of 8-oxoG has clearly been observed in mammalian cells (Page, Klungland, et al. 2000; Page, Kwoh, et al. 2000; Kuraoka et al. 2003). How TCR is activated in the absence of a significant block to transcription is far from clear, and different possibilities are currently discussed (Scicchitano et al. 2004). For example, it has been suggested that such lesions might merely pause the transcription machinery and that this is already sufficient to induce TCR (Zhou and Doetsch 1993; Bregeon et al. 2003). Also, lesions that usually do not block may lead to some sort of blocking at some minimal rate under certain circumstances, which, no matter how small this rate is, would statistically favor eventual TCR: even if one polymerase bypasses the lesion, the chance that every subsequent polymerase also bypasses it decreases with each passage, and once the lesion stalls a transcription complex, it will be subjected to TCR. Whatever the detailed molecular mechanisms look like, our results clearly demonstrate that such a mechanism must exist and that it actively works toward reducing rates on the template strand of transcribed genes.
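The statistical argument for eventual TCR at a small blocking rate can be made concrete with a simple geometric model. The sketch below is illustrative only and rests on an assumption not made in the text: each polymerase passage is taken to stall at the lesion independently with the same probability `p_stall`, a hypothetical parameter with no measured value implied.

```python
def prob_stalled_within(p_stall, n_passages):
    """Probability that at least one of n_passages polymerase passages stalls.

    Assumes independent passages with identical stall probability p_stall,
    so the chance that all n passages bypass the lesion is (1 - p_stall)**n.
    """
    if not 0.0 <= p_stall <= 1.0:
        raise ValueError("p_stall must lie between 0 and 1")
    if n_passages < 0:
        raise ValueError("n_passages must be non-negative")
    return 1.0 - (1.0 - p_stall) ** n_passages


# Hypothetical example: a 1% stall probability per passage.
after_100 = prob_stalled_within(0.01, 100)
after_1000 = prob_stalled_within(0.01, 1000)
```

Under this assumption, a lesion that stalls only 1 in 100 polymerases is more likely than not to have triggered a stall after 100 passages and is almost certain to have done so after 1,000, which is the sense in which frequent transcription favors eventual repair.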
Asymmetry 3: The Correlation between Rate
Asymmetries
In enterobacterial genes, it has long been known that cytosine deamination is coupled to transcription and that it occurs primarily on the coding strand (Beletskii and Bhagwat 1996, 1998; Beletskii et al. 2000). Simultaneously, there is a reduction of the C → T rate on the noncoding strand caused by TCR. As first pointed out by Francino and Ochman (2001), these two processes are likely to occur together, both working toward an increase of the transcription-induced substitutional asymmetry of C → T transitions in bacterial genes. The simultaneous occurrence of rate-reducing effects on one strand and rate-increasing effects on the other during transcription is in essence what we have found here also for human genes and with regard not only to the C → T transition but also to the A → G and G → T types of substitution.
Interestingly, for C → T and G → T, the rate reduction on the noncoding strand in figure 5 was the same, irrespective of whether we averaged over all transcribed regions or just over the subset of highly asymmetric genes. At the same time, we observe on the coding strand a dramatic increase of the rate difference when going from an average over all transcribed regions to one over just the asymmetric genes. These two observations together suggest that for C → T and G → T particularly strong rate asymmetries are caused primarily by the rate-increasing effects on the coding strand. The fact that all three rates show a common behavior (all three being reduced on the transcribed strand and increased on the coding strand) expresses itself in the correlations observed in figure 4: If there is a strand bias for one rate, then the rates of the two other substitutions are likely to show the same bias.
Comparison to Work of Green et al.
Green et al. (2003) have recently examined transcription-induced asymmetries in nine different genes of eight different mammals and have found that A → G transitions tend to have higher rates on the coding strand than on the noncoding strand, whereas C → T transitions are found to have slightly higher rates on the noncoding strand than on the coding strand. Although the first asymmetry is compatible with our results, the C → T asymmetry seems to contradict our findings. However, the method chosen by Green et al. differs from ours. Their procedure is based on a three-species alignment between human, chimpanzee, and baboon. With their method, these authors reconstructed ancestral sequences of introns at one specific point in time, the human–chimp speciation time (relative time t = 0.0016, marked by an arrow in fig. 6), and then used these sequences to identify substitutions that have occurred between this time point and today. We, on the other hand, estimate rates on the basis of repetitive elements occurring within introns, a method that allows us to determine rates that are averaged over different periods of time, depending on the composition of the group of repeat families used in the rate estimation procedure (see fig. 6). For the C → T transition, the older repeats are found to predict the asymmetry [C → T]cod > [C → T]noncod, whereas an analysis based on the group of youngest repeat families (with mean age t = 0.07) suggests that [C → T]cod < [C → T]noncod. Unfortunately, we cannot perform our analysis exactly at the time point t = 0.0016 investigated by Green et al., but extrapolating the trend of our curves in figure 6 down to t = 0.0016, an agreement with the rate asymmetry found by Green et al. ([C → T]cod < [C → T]noncod) does not seem unlikely.
We are not able to give a biological explanation for the observed time dependence of the C → T rate asymmetry. The important point, however, is that the asymmetry [C → T]cod > [C → T]noncod found for all times t > 0.15 has been effective over much longer periods of time than the asymmetry with reversed rates and thus will be the decisive factor determining the resulting base compositional asymmetry discussed below.
Relation to the Compositional Skew
GC and AT skew diagrams were introduced in the
analysis of bacterial chromosomes for short sequences
(Lobry 1996a), complete genomes of bacteria (Lobry
1996c; Blattner et al. 1997; Grigoriev 1998), and viruses
(Grigoriev 1999). The correlation found in figure 2 has an immediate impact on the compositional skew of human genes. We focus in the following discussion on the two strongest types of substitutions, the two transitions A → G and C → T. The correlation in figure 2c can be reinterpreted as the correlation between (A → G) − (T → C) and (C → T) − (G → A) when measured on the same strand. This correlation implies that the difference of nucleotide frequencies of G and C correlates with the difference of the frequencies of T and A. In other words, the correlation found for the rate asymmetries dictates that the GC skew S_GC = (G − C)/(G + C) must correlate with the TA skew S_TA = (T − A)/(T + A). Following this line of reasoning, the total skew S = S_GC + S_TA must correlate with the sum of rate asymmetries, as we have indeed found in figure 7. This shows that compositional and substitutional strand asymmetries have a causal relation. Moreover, figure 4c implies that on the coding strand, both the GC and the TA skew are significantly higher than on the noncoding strand. Exactly this correlation, and also its association with transcription, has been found previously by Touchon et al. (2003), who examined compositional asymmetries on intron sequences from human genes.
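The three skew quantities defined above are simple functions of base counts. The following sketch shows one way to compute them for a single sequence; the function name and the handling of non-ACGT characters (ignored here) are assumptions made for illustration.

```python
from collections import Counter


def compositional_skews(sequence):
    """Return (S_GC, S_TA, S) for one strand, with S = S_GC + S_TA.

    S_GC = (G - C) / (G + C) and S_TA = (T - A) / (T + A), where the
    letters stand for base counts. Characters other than A, C, G, T
    (for example N) are ignored.
    """
    counts = Counter(sequence.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    if g + c == 0 or t + a == 0:
        raise ValueError("sequence needs at least one G/C and one T/A")
    s_gc = (g - c) / (g + c)
    s_ta = (t - a) / (t + a)
    return s_gc, s_ta, s_gc + s_ta


# Toy coding-strand sequence with an excess of G and T.
s_gc, s_ta, s_total = compositional_skews("GGGCTTTA")
```

For a sequence obeying Chargaff's second parity rule within a strand, all three values would be close to zero; an excess of G over C and of T over A gives positive skews.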
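Our rate asymmetries,
$$
\begin{aligned}
{}[C \to T]_{\mathrm{cod}} &> [C \to T]_{\mathrm{noncod}} = [G \to A]_{\mathrm{cod}},\\
{}[A \to G]_{\mathrm{cod}} &> [A \to G]_{\mathrm{noncod}} = [T \to C]_{\mathrm{cod}},\\
{}[G \to T]_{\mathrm{cod}} &> [G \to T]_{\mathrm{noncod}} = [C \to A]_{\mathrm{cod}},
\end{aligned}
\tag{2}
$$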
when acting over evolutionarily long periods of time, must have the effect that on the coding strand an excess of G and T is produced because more guanine residues and thymine residues are gained by substitutions than lost by substitutions back to A and C. Accordingly, the G + T content on the coding strand must correlate with the sum of [C → T]cod − [G → A]cod, [A → G]cod − [T → C]cod, and [G → T]cod − [C → A]cod, as we have indeed found in figure 7. A correlation coefficient as high as 0.83 has been found, confirming that the deviations of G + T from 50% are indeed produced by the strand asymmetries of the transitions. Moreover, a linear fit through the data showed that when C → T, A → G, and G → T are the same on both strands (genes with strand-symmetric substitution patterns), the regression line predicts a G + T content of roughly 50%, which is in agreement with Chargaff's second parity rule (Sueoka 1995).
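The quantities entering this comparison, and the role of the regression intercept, can be sketched as follows. The code is an illustration under assumptions: the dictionary keys, function names, and the four data points are hypothetical and synthetic (constructed to lie on a line through 50%), not values from the study.

```python
def gt_content_percent(sequence):
    """G + T content of one strand in percent of all A, C, G, T bases."""
    seq = sequence.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("T")) / total


def total_asymmetry(cod, noncod):
    """Sum of the coding minus noncoding rate differences for the three substitutions."""
    return sum(cod[key] - noncod[key] for key in ("C>T", "A>G", "G>T"))


def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


# Synthetic points: total asymmetry versus G + T content (%).
asymmetries = [-0.2, 0.0, 0.2, 0.4]
gt_percent = [48.0, 50.0, 52.0, 54.0]
slope, intercept = linear_fit(asymmetries, gt_percent)
```

The intercept is the G + T content predicted for a gene with zero total asymmetry, which is the quantity compared with the 50% expected from the parity rule.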
Green et al. (2003) explained an excess of G + T pairs over A + C pairs by coding strand rate asymmetries of the form [A → G]cod − [T → C]cod > 0 and [G → A]cod − [C → T]cod > 0 that are different from ours. They argued that G + T is in excess on the coding strand because [A → G]cod − [T → C]cod > 0 has a much stronger effect toward increasing G + T than the weak [G → A]cod − [C → T]cod > 0 asymmetry working toward reducing the G + T content. With our rate asymmetries in equation (2), all three rates work in the same direction, that is, toward building up the G + T content.
A compositional skew can only result from a substitutional asymmetry if the latter has been effective over a sufficiently long, that is, evolutionary, time span. However,
Green et al. (2003) could study rate asymmetries only at one
specific point in time and had to assume a time independence of their strand bias for their argument to hold. Figure
6 indicates that at least for the C to T rate this assumption
cannot be made. If rates are indeed time dependent, then it is
impossible to clarify the exact relationship between compositional and substitutional asymmetry on the basis of rate
asymmetries determined at just one point in time.
Relation to Replication-Induced Bias
Both replication and transcription lead to a mutational
strand bias. The idea behind studying rate differences between genic and flanking regions (fig. 5) has been to distinguish between these two asymmetry-producing effects:
Because on average the strength of the replication-induced
bias should be very similar in the genic and gene-flanking
region, its effect should be eliminated if rates in a transcribed region are subtracted from their values in the flanking region. Therefore, the asymmetries found in figure 5
should reflect primarily the effect brought about by transcription.
Figure 8, on the other hand, demonstrates that substitutional asymmetries of genes are also influenced by replication. Replication enhances the transcription-induced
asymmetry when the coding strand of a gene is identical
with the lagging strand and reduces this asymmetry if
not. Furthermore, the correlation found shows that the effect of replication tends to be largest near an ORI and can
be assumed to decrease linearly down to zero halfway to
the next ORI. Interestingly, whereas transcription has the
most pronounced effect on the A to G rate, replication
seems to be particularly important for the C to T rate asymmetry, showing by far the strongest correlation in figure 8.
We should also remark that the origins of replication that
we used in our analysis were determined by Huvet et al.
(2007) through a guided search for certain patterns of the
compositional skew. As such they are of course only putative origins. The strength of the correlations shown in figure 8
seems to confirm that these positions are indeed origins of
replication.
It seems that there are two important factors causing
the strong gene-to-gene variations of rate asymmetries observable in figures 2–4: First, if the rate asymmetries
are induced by transcription, repeated transcription increases the extent of substitutional asymmetry. Hence,
genes having a high level of germ-line expression should
have high rate asymmetries. This assumption is supported
by a previously found correlation between gene expression
and compositional skew (Majewski 2003; Comeron 2004).
Second, figure 8 shows that the extent of substitutional
asymmetry is also affected by the gene’s transcription direction and position relative to the next ORI. For genes located close to an ORI, the two processes, replication and
transcription, superpose each other in either a positive or
a negative direction, thus leading to a variation in rate asymmetry. Further work is needed to explain these variations in
a more quantitative fashion.
Acknowledgments
We are grateful to Dr John Karro for his constant support, his helpfulness, and advice and to Dr Phil Green for
reading and commenting on the first draft of this manuscript. M.P. received financial support from the Austrian
Science Foundation (FWF) under project title P18762.
C.F.M. has been supported by the ‘‘Heinrich Jörg Stiftung’’
of the Karl-Franzens University Graz.
Literature Cited
Avkin S, Adar S, Blander G, Livneh Z. 2002. Quantitative
measurement of translesion replication in human cells:
evidence for bypass of abasic sites by a replicative DNA
polymerase. Proc Natl Acad Sci USA. 99:3764–3769.
Beletskii A, Bhagwat A. 1996. Transcription-induced mutations:
increase in C to T mutations in the nontranscribed strand
during transcription in Escherichia coli. Proc Natl Acad Sci
USA. 93:13919–13924.
Beletskii A, Bhagwat A. 1998. Correlation between transcription
and C to T mutations in the non-transcribed DNA strand. Biol
Chem. 379:549–551.
Beletskii A, Grigoriev A, Joyce S, Bhagwat A. 2000. Mutations
induced by bacteriophage T7 RNA polymerase and their effects on
the composition of the T7 genome. J Mol Biol. 300:1057–1065.
Blattner F, Plunkett G, Bloch C, et al. (17 co-authors). 1997. The
complete genome sequence of Escherichia coli K-12. Science.
277:1453–1462.
Bregeon D, Doddridge Z, You H, Weiss B, Doetsch P. 2003.
Transcriptional mutagenesis induced by uracil and 8-oxoguanine in Escherichia coli. Mol Cell. 12:959–970.
Comeron J. 2004. Selective and mutational patterns associated
with gene expression in humans: influences on synonymous
composition and intron presence. Genetics. 167:1293–1304.
Coulondre C, Miller J, Farabaugh P, Gilbert W. 1978. Molecular basis of base substitution hotspots in Escherichia coli. Nature.
274:775–780.
David S, O’Shea V, Kundu S. 2007. Base-excision repair of
oxidative DNA damage. Nature. 447:941–950.
Dianov G, Lindahl T. 1991. Preferential recognition of I·T base-pairs in the initiation of excision-repair by hypoxanthine-DNA glycosylase. Nucleic Acids Res. 19:3829–3833.
Dou H, Mitra S, Hazra T. 2003. Repair of oxidized bases in DNA
bubble structures by human DNA glycosylases NEIL1 and
NEIL2. J Biol Chem. 278:49679.
Duncan B, Miller J. 1980. Mutagenic deamination of cytosine
residues in DNA. Nature. 287:560–561.
Francino M, Chao L, Riley M, Ochman H. 1996. Asymmetries
generated by transcription-coupled repair in enterobacterial
genes. Science. 272:107–109.
Francino M, Ochman H. 1997. Strand asymmetries in DNA
evolution. Trends Genet. 13:240–245.
Francino M, Ochman H. 2001. Deamination as the basis of
strand-asymmetric evolution in transcribed Escherichia coli
sequences. Mol Biol Evol. 18:1147–1150.
Frank A, Lobry J. 1999. Asymmetric substitution patterns:
a review of possible underlying mutational or selective
mechanisms. Gene. 238:65–77.
Frederico L, Kunkel T, Shaw B. 1990. A sensitive genetic assay
for the detection of cytosine deamination-determination of
rate constants and the activation-energy. Biochemistry.
29:2532–2537.
Gentil A, Cabral-Neto J, Mariage-Samson R, Margot A,
Imbach J, Rayner B, Sarasin A. 1992. Mutagenicity of
a unique apurinic/apyrimidinic site in mammalian cells. J Mol
Biol. 227:981–984.
Green P, Ewing B, Miller W, Thomas P, Green E. 2003.
Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet. 33:514–517.
Grigoriev A. 1998. Analyzing genomes with cumulative skew
diagrams. Nucleic Acids Res. 26:2286–2290.
Grigoriev A. 1999. Strand-specific compositional asymmetries in
double-stranded DNA viruses. Virus Res. 60:1–19.
Hazra T, Das A, Das S, Choudhury S, Kow Y, Roy R. 2007.
Oxidative DNA damage repair in mammalian cells: a new
perspective. DNA Repair. 6:470–480.
Hutchinson F, Donnellan J. 1998. A mutation spectra database
for bacterial and mammalian genes: 1998. Nucleic Acids Res.
26:290–291.
Huvet M, Nicolay S, Touchon M, Audit B, d’Aubenton
Carafa Y, Arneodo A, Thermes C. 2007. Human gene
organization driven by the coordination of replication and
transcription. Genome Res. 17:1278–1285.
Karran P, Lindahl T. 1980. Hypoxanthine in deoxyribonucleic
acid. Biochemistry. 19:6005.
Karro J, Peifer M, Hardison R, Kollmann M, von Grünberg H.
2008. Exponential decay of GC content detected by strand-symmetric substitution rates influences the evolution of
isochore structure. Mol Biol Evol. 25:362–374.
Kuraoka I, Endou M, Yamaguchi Y, Wada T, Handa H,
Tanaka K. 2003. Effects of endogenous DNA base lesions
on transcription elongation by mammalian RNA polymerase
II. J Biol Chem. 278:7294.
Lindahl T. 1993. Instability and decay of the primary structure of
DNA. Nature. 362:709–715.
Lindahl T, Ljungquist S, Siegert W, Nyberg B, Sperens B. 1977.
DNA N-glycosidases: properties of uracil-DNA glycosidase
from Escherichia coli. J Biol Chem. 252:3286–3294.
Lindahl T, Nyberg B. 1972. Rate of depurination of native
deoxyribonucleic acid. Biochemistry. 11:3610.
Lindahl T, Nyberg B. 1974. Heat-induced deamination of
cytosine residues in deoxyribonucleic acid. Biochemistry.
13:3405–3410.
Lobry J. 1996a. A simple vectorial representation of DNA
sequences for the detection of replication origins in bacteria.
Biochimie. 78:323–326.
Lobry J. 1996b. Asymmetric substitution patterns in the two
DNA strands of bacteria. Mol Biol Evol. 13:660–665.
Lobry J. 1996c. Origin of replication of Mycoplasma genitalium.
Science. 272:745–746.
Majewski J. 2003. Dependence of mutational asymmetry on
gene-expression levels in the human genome. Am J Hum
Genet. 73:688–692.
Mellon I, Spivak G, Hanawalt P. 1987. Selective removal
of transcription-blocking DNA damage from the transcribed strand of the mammalian DHFR gene. Cell. 51:
241–249.
Page FL, Klungland A, Barnes D, Sarasin A, Boiteux S. 2000.
Transcription coupled repair of 8-oxoguanine in murine cells.
Proc Natl Acad Sci USA. 97:8397.
Page FL, Kwoh E, Avrutskaya A, Gentil A, Leadon S, Sarasin A,
Cooper P. 2000. Transcription-coupled repair of 8-oxoguanine: requirement for XPG, TFIIH, and CSB and implications
for Cockayne syndrome. Cell. 101:159–171.
Saxowsky T, Doetsch P. 2006. RNA polymerase encounters with
DNA damage: transcription-coupled repair or transcriptional
mutagenesis? Chem Rev. 106:474.
Schwarz G. 1978. Estimating the dimension of a model. Ann
Stat. 6:461–464.
Scicchitano D, Olesnicky E, Dimitri A. 2004. Transcription and
DNA adducts: what happens when the message gets cut off?
DNA Repair. 3:1537–1548.
Shapiro R, Klein R. 1966. The deamination of cytidine and
cytosine by acidic buffer solutions. Mutagenic implications.
Biochem. 5:2358–2362.
Shibutani S, Takeshita M, Grollman A. 1991. Insertion of
specific bases during DNA synthesis past the oxidation-damaged base 8-oxodG. Nature. 349:431.
Shibutani S, Takeshita M, Grollman A. 1997. Translesional
synthesis on DNA templates containing a single abasic site.
J Biol Chem. 272:13916–13922.
Simonelli V, Narciso L, Dogliotti E, Fortini P. 2005. Base
excision repair intermediates are mutagenic in mammalian
cells. Nucleic Acids Res. 33:4405.
Sueoka N. 1995. Intrastrand parity rules of DNA base
composition and usages biases of synonymous codons.
J Mol Evol. 40:318–325. Erratum 42:323.
Svejstrup J. 2002. Mechanisms of transcription-coupled DNA
repair. Mol Cell Biol. 3:21–29.
Takahasi K, Sakuraba Y, Gondo Y. 2007. Mutational pattern and
frequency of induced nucleotide changes in mouse ENU
mutagenesis. BMC Mol Biol. 8:1–10.
Tanaka M, Ozawa T. 1994. Strand asymmetry in human
mitochondrial-DNA mutations. Genomics. 22:327–335.
Thiviyanathan V, Somasunderam A, Volk D, Gorenstein D.
2005. 5-Hydroxyuracil can form stable base pairs with all four
bases in a DNA duplex. Chem Commun. 400–402.
Touchon M, Nicolay S, Arneodo A, d’Aubenton Carafa Y,
Thermes C. 2003. Transcription-coupled TA and GC strand
asymmetries in the human genome. FEBS Lett. 555:
579–582.
Touchon M, Nicolay S, Audit B, Brodie EB, d’Aubenton
Carafa Y, Arneodo A, Thermes C. 2005. Replication-associated strand asymmetries in mammalian genomes:
toward detection of replication origins. Proc Natl Acad Sci
USA. 102:9836–9841.
Viswanathan A, You H, Doetsch P. 1999. Phenotypic change
caused by transcriptional bypass of uracil in nondividing
cells. Science. 284:159.
Webster M, Smith N, Lercher M, Ellegren H. 2004. Gene
expression, synteny, and local similarity in human noncoding
mutation rates. Mol Biol Evol. 21:1820–1830.
Zhang A, Yang B, Li Z. 2007. Theoretical study on the
hydrolytic deamination reaction mechanism of adenine–
(H2O)n (n = 1–4). J Mol Struct Theochem. 819:95–101.
Zhou W, Doetsch P. 1993. Effects of abasic sites and DNA
single-strand breaks on prokaryotic RNA polymerases. Proc
Natl Acad Sci USA. 90:6601–6605.
Manolo Gouy, Associate Editor
Accepted October 7, 2008