Exponential Decay of GC Content Detected by Strand-Symmetric
Substitution Rates Influences the Evolution of Isochore Structure
J. E. Karro,*†‡1 M. Peifer,§1 R. C. Hardison,§ M. Kollmann,‖ and H. H. von Grünberg§
*Department of Computer Science and Systems Analysis, Miami University, Ohio; †Department of Microbiology, Miami University,
Ohio; ‡Center for Comparative Genomics and Bioinformatics, Pennsylvania State University; §Institute of Chemistry, Karl-Franzens
University Graz, Graz, Austria; and ‖Institute for Theoretical Biology, Humboldt University, Berlin, Germany
The distribution of guanine and cytosine nucleotides throughout a genome, or the GC content, is associated with numerous
features in mammals; understanding the pattern and evolutionary history of GC content is crucial to our efforts to annotate
the genome. The local GC content is decaying toward an equilibrium point, but the causes and rates of this decay, as well as
the value of the equilibrium point, remain topics of debate. By comparing the results of 2 methods for estimating local
substitution rates, we identify 620 Mb of the human genome in which the rates of the various types of nucleotide
substitutions are the same on both strands. These strand-symmetric regions show an exponential decay of local GC content
at a pace determined by local substitution rates. DNA segments subjected to higher rates experience disproportionately
accelerated decay and are AT rich, whereas segments subjected to lower rates decay more slowly and are GC rich. Although
we are unable to draw any conclusions about causal factors, the results support the hypothesis proposed by Khelifi A,
Meunier J, Duret L, and Mouchiroud D (2006. GC content evolution of the human and mouse genomes: insights from the
study of processed pseudogenes in regions of different recombination rates. J Mol Evol. 62:745–752.) that the isochore
structure has been reshaped over time. If rate variation were a determining factor, then the current isochore structure of
mammalian genomes could result from the local differences in substitution rates. We predict that under current conditions
strand-symmetric portions of the human genome will stabilize at an average GC content of 30% (considerably less than the
current 42%), thus confirming that the human genome has not yet reached equilibrium.
Introduction
The (local) "GC content," or the percentage of GC
base pairs in a given genomic region, is highly variable
across many mammalian genomes (Bernardi 2000). It is associated with genomic features including gene density, intron length, replication timing, recombination rate and the
distribution of repeat elements (Mouchiroud et al. 1991;
Duret et al. 1995; Smit 1999; Lander et al. 2001; Kong
et al. 2002; Waterston et al. 2002). Investigating the causes
of variation in GC content should improve our understanding not only of the GC structure of the genome but also of
these associated features. Further, in studying the history of
a genome, it is important to determine whether the distribution of GC content is changing over time. Has the GC
content reached equilibrium or is it still evolving toward
some target point of stability? What factors are holding
it in check or driving the change?
It is well established that a genome is subject to a varying "neutral substitution rate," the rate at which neutral
(functionless) DNA undergoes changes that become fixed
in the population. This rate varies considerably, differing
between organisms, between sexes, and across the genome
(Lander et al. 2001; Waterston et al. 2002; Ellegren et al.
2003; Hardison et al. 2003; Gibbs et al. 2004; von Grünberg
et al. 2004; Arndt et al. 2005; Gaffney and Keightley 2005;
Lindblad-Toh et al. 2005; Taylor et al. 2006). Many of these
studies have investigated the interdependence between variation in the (neutral) substitution rate and the variation in
GC content, but simple explanations of causation have not
emerged.
1 These authors contributed equally to this work.
Key words: GC content, isochore decay, neutral substitution rates,
strand symmetry, genomic equilibrium.
E-mail: [email protected].
Mol. Biol. Evol. 25(2):362–374. 2008
doi:10.1093/molbev/msm261
Advance Access publication November 27, 2007
Various studies have also established an asymmetry in
the rates at which different substitutions occur; GC base
pairs become AT base pairs far more frequently than the
reverse (Lander et al. 2001; Arndt et al. 2005). Using realistic estimates for these rates, the imbalance implies that the
GC content at equilibrium will be considerably below the
current GC content. Thus, one would predict from the rate
asymmetry that GC content must be decreasing. It further
seems likely that those areas subject to higher substitution
rates experience faster GC-content decay. However, these
predictions have not been easy to test as calculating time-resolved substitution rates is a difficult task (Hardison et al.
2003; von Grünberg et al. 2004; Arndt et al. 2005; Gaffney
and Keightley 2005).
Further complicating matters, local substitution rates
are not the only factor determining the change in GC content and its eventual equilibrium. In both Meunier and Duret
(2004) and Duret et al. (2006) it was argued that recombination plays a role in the decay of GC content, that the modern human genome has not yet reached its equilibrium
point, and that the equilibrium point of any given genomic
region is determined by the interplay of the substitution and
recombination rates. Though there is some disagreement
(Alvarez-Valin et al. 2004; Antezana 2005), several studies
have supported these positions (Lercher et al. 2002; Webster
et al. 2003; Belle et al. 2004; Duret 2006; Khelifi et al.
2006). However, the rate of GC-content decay has not been
determined, nor has the relative importance of the substitution and recombination rates in influencing that decay. How
fast is GC-content decaying in a given region? What is the
importance of substitution, as opposed to recombination, in
determining that rate?
The human genome currently has a genome-wide average GC content of approximately 43% (42% chimpanzee,
42% macaque, 41% dog). We have applied 2 independent
methods for determining rates of substitutions across mammalian genomes to better understand the history and predict
the future of these genomes. We limited our analysis to the
© 2007 The Authors.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/
uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
620 Mb of human DNA in which the rate for each type
of substitution is the same on the complementary strand—
regions referred to as strand symmetric (Green et al. 2003).
This allowed us to apply a simple but powerful model for
determining substitution rates. With these rates, we predict
and confirm that the local GC content decays exponentially
over time toward a local equilibrium value GC* at a rate
determined by the local rate of substitution. We predict that
these strand-symmetric areas will stabilize their average GC
content at 30% (±4%) for each of these genomes, supporting the notion that none of these genomes are currently at
equilibrium. Furthermore, the equilibrium point of a region
is determined by the local rates of substitution and recombination, as well as by the Biased Gene Conversion (BGC)
mechanism (Nagylaki 1983; Galtier et al. 2001; Duret et al.
2002), thus independently confirming the results of Meunier
and Duret (2004). Our studies agree with the prediction that
each genome is evolving toward a new isochore structure
(as proposed by Khelifi et al. 2006). They also suggest that
the current isochore structure could result from the action of
locally varying rates on an ancestral genome, regardless of
that genome’s GC-content pattern.
Methods and Materials
Whereas calculating the GC content of a sequenced
genome is simple, estimating substitution rates and GC*
are more difficult tasks. Substitution rates are estimated using 2 different methods. The first is the ‘‘LogDet’’ method
(Barry and Hartigan 1987; Gu and Li 1996; Baake and von
Haeseler 1999; Sumner and Jarvis 2006); the second is
based on continuous Markov chains and is similar to many
of the well-studied models (e.g., Jukes and Cantor 1968;
Hasegawa et al. 1985; Ewens and Grant 2005). Like those
models, and in contrast to the well-known model of Arndt,
Burge, and Hwa (2003), Arndt, Petrov, and Hwa (2003),
and Arndt et al. (2005), we do not explicitly account for
neighbor interactions affecting substitution rates (specifically elevated CpG dinucleotide substitution rates); in
our Results section, we show that these rates are implicitly
included and thus this parameter reduction is not a source
of error.
Estimating neutral substitution rates is necessarily
based on nucleotides known to be nonfunctional. There
are 2 sources of such data frequently used: 4-fold degenerate coding sites and interspersed repeats (Hardison et al.
2003). We selected the repeat data because of its quantity.
Repeats make up approximately 45% of the genome (providing significantly more points of estimation than do the
4-fold degenerate coding sites) and, as a whole, are uniformly distributed (Lander et al. 2001)—allowing us to average out the effect of location bias. It is generally believed
that the substitution rates of repeat bases are reflective of the
local neutral substitution rate, though this is not completely
agreed upon (Hardison et al. 2003; Arndt, Burge, and
Hwa 2003; Gaffney and Keightley 2005).
Further, instead of fitting our model parameters to interspecies alignments, we follow the lead of Arndt in using
the alignment of modern interspersed repeat sequences to
their RepeatMasker-derived ancestral sequences (Smit A,
Hubley R, Green P, unpublished data). This allows us to
discard the normal Markov chain constraint of reversibility
used as a basis for many calculations. Like the Arndt studies, we also assume strand-symmetric substitution: that both
strands of the genome segment undergo any given type of
substitution at the same rate, hence complementary substitution rates are equal (e.g., the rate of A→C substitutions is
equal to that of T→G substitutions). Strand symmetry is
not a universal condition; it is clearly disrupted by selective
pressure and has also been shown to break down in transcribed regions and around replication origins (Green
et al. 2003; Touchon et al. 2005). We will shortly describe
a technique for identifying where this assumption is valid (or
is a valid approximation), which lets us identify the regions where our predictions can be tested.
Notation
Like many studies, we will break the genome into nonoverlapping windows, or partitions (Hardison et al. 2003;
Gaffney and Keightley 2005). The (instantaneous state transition) rate matrix defining a given Markov chain will be
denoted by $q$, and we use $q_c$ to denote the rate matrix that
has been estimated exclusively from the partition c (i.e.,
from the repeats falling within c). Further, $\bar{m}_c$ will denote the mean
substitution rate of c (averaged over the time span under
investigation), and $p^{GC}_c(t)$ will denote the partition's GC
content at time t.
Any method based on a Markov chain imposes certain
constraints on q. For example, Jukes and Cantor (1968) require that all off-diagonal elements be equal, Hasegawa
et al. (1985) group the substitution types into 4 different
parameter classes, and the fully reversible model allows
for 6 parameters configured to enforce "full reversibility"
(Ewens and Grant 2005). The constraints of our own model
are fully defined by the strand-symmetric property. As a result, we have a 6-parameter model—that is, 6 independent
rates appearing in the rate matrix (see Appendix, eq. 7, for
the matrix layout). These 6 rates are labeled $q_1, \ldots, q_6$ (defined in table 1). When referring to the components of the
matrix $q_c$, we will denote these rates by $q^c_1, \ldots, q^c_6$. This
strand-symmetric Markov chain was first proposed by
Sueoka (1995) and later studied by Lobry and Lobry (1999).
Estimating GC-Content Equilibrium (GC*)
The importance of limiting our analysis to a strand-symmetric $q$ is as follows. We derive a value $k_c$ from
$q_c$; specifically, $k_c$ is the sum of the rates of all substitutions between an AT base and a GC base in either direction
(see eq. 9 in the Appendix). It is shown in Lobry and Lobry
(1999) that if (and only if) the matrix $q_c$ is strand symmetric
and time independent, we have
$$p^{GC}_c(t+T) - GC^*_c = \left(p^{GC}_c(t) - GC^*_c\right) e^{-k_c \bar{m}_c T}, \qquad (1)$$
where
$$GC^*_c = \frac{q^c_1 + q^c_5}{q^c_1 + q^c_5 + q^c_4 + q^c_6}. \qquad (2)$$
Table 1
Values of the 6 Independent Components q1, ..., q6 of the q Matrix Obtained from a
Genome-Wide Analysis of the Genomes of Human (hg18), Chimpanzee (pt2), Dog (cf2),
and Macaque (rm2) (for details see Appendix)
Rate  Transition  Human                 Chimpanzee            Dog                   Macaque
q1    A→C, T→G    0.132 ± 0.003 (0.14)  0.128 ± 0.005 (0.14)  0.141 ± 0.004 (0.15)  0.130 ± 0.004 (0.14)
q2    A→T, T→A    0.130 ± 0.004 (0.13)  0.131 ± 0.005 (0.14)  0.134 ± 0.005 (0.14)  0.129 ± 0.005 (0.13)
q3    C→G, G→C    0.129 ± 0.004 (0.13)  0.130 ± 0.005 (0.13)  0.131 ± 0.005 (0.13)  0.130 ± 0.005 (0.13)
q4    C→A, G→T    0.186 ± 0.004 (0.19)  0.185 ± 0.005 (0.19)  0.182 ± 0.005 (0.19)  0.183 ± 0.005 (0.19)
q5    A→G, T→C    0.434 ± 0.005 (0.47)  0.435 ± 0.007 (0.47)  0.443 ± 0.007 (0.46)  0.436 ± 0.007 (0.46)
q6    G→A, C→T    0.989 ± 0.008 (0.93)  0.990 ± 0.011 (0.93)  0.968 ± 0.011 (0.93)  0.992 ± 0.011 (0.93)
NOTE. Values in brackets are obtained after masking out CpG sites before performing our analysis. Values are normalized
such that $\frac{1}{4}\sum_{i,j \neq i} q_{ij} = 1$, implying that $\sum_i q_i = 2$.
Consider the implications of the first equation. If c is
subject to strand-symmetric substitution (and not affected
by issues such as elevated CpG mutation rates and recombination, which we will deal with shortly), then
equation (1) means that the GC content of c must be
decaying exponentially with time, at a rate dictated by the
product of $k_c$ and the local substitution rate $\bar{m}_c$. Further,
the local GC content is converging to the value $GC^*_c$. Thus,
$GC^*_c$ is the GC content in partition c that the genome will
eventually reach at equilibrium (with notation GC* chosen
to be consistent with that used in Meunier and Duret
2004). We note that $p^{GC}_c(t+T) - GC^*_c$ can be thought of
as the "excess GC content" contained by partition c on its
way to equilibrium.
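As a rough numerical illustration of equations (1) and (2), the following Python sketch plugs the genome-wide human rates from table 1 into both formulas. The function names, the starting GC content of 0.427, and the time distances are assumptions chosen for illustration. The paper's GC* of about 30% is an average of local values over selected partitions, so the genome-wide figure computed here need not match it exactly.

```python
import math

# Genome-wide human rates q1..q6 from table 1 (normalized so that they sum to 2).
HUMAN_Q = {1: 0.132, 2: 0.130, 3: 0.129, 4: 0.186, 5: 0.434, 6: 0.989}


def gc_equilibrium(q):
    """Equation (2): AT->GC rates (q1 + q5) over all AT<->GC rates."""
    return (q[1] + q[5]) / (q[1] + q[5] + q[4] + q[6])


def k_value(q):
    """k: sum of the rates of all substitutions between AT and GC bases."""
    return q[1] + q[5] + q[4] + q[6]


def gc_after(gc_now, q, m_T):
    """Equation (1): GC content after a time distance m_T (mean rate times T)."""
    gc_star = gc_equilibrium(q)
    return gc_star + (gc_now - gc_star) * math.exp(-k_value(q) * m_T)


if __name__ == "__main__":
    print(f"sum of rates = {sum(HUMAN_Q.values()):.3f}")
    print(f"GC* = {gc_equilibrium(HUMAN_Q):.3f}, k = {k_value(HUMAN_Q):.3f}")
    for m_T in (0.0, 0.5, 1.0, 2.0):  # hypothetical time distances
        print(m_T, round(gc_after(0.427, HUMAN_Q, m_T), 3))
```

The excess GC content shrinks by the factor exp(-k m_T) at every step, so the printed values approach GC* from above without crossing it.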
One would like to test the prediction by picking 2
points in time and plugging the relevant values into equation (1). However, we know the GC content for partition c
at only one point in time: "now." In order to reflect this, we
calibrate time such that $t = 0$ refers to the present, with $t < 0$
then representing points in the past, and rewrite equation
(1) as
$$p^{GC}_c(0) - GC^*_c = \left(p^{GC}_c(-T) - GC^*_c\right) e^{-k_c \bar{m}_c T} \qquad (3)$$
with $p^{GC}_c(0)$ (denoting the current GC content) appearing
on the left and $p^{GC}_c(-T)$ (denoting the GC content T time
units in the past) occurring on the right. In the derivation
of this equation, we are assuming both strand symmetry
and time independence of the rate matrix; the model
cannot be reliably used for analysis unless we can verify
that the target region conforms to these assumptions. But
in those regions where this is the case, it follows that the
local GC content is decaying over time toward the value
$GC^*_c$ at an exponential rate determined by $k_c \bar{m}_c$.
Equation (3) describes the interdependence of 3
c-dependent (i.e., locally dependent) functions: the local
GC equilibrium value $GC^*_c$, the local GC content $p^{GC}_c(0)$,
and the product of the local substitution rate $\bar{m}_c$ and $k_c$.
The fourth local function appearing in the equation,
$p^{GC}_c(-T)$, is not accessible and will produce statistical
noise. So equation (3) describes a curve in the 3-dimensional
space spanned by $p^{GC}_c(0)$, $k_c \bar{m}_c$, and $GC^*_c$. To visualize this
curve, we will have to consider a projection of it onto a plane
that is spanned by just 2 of these 3 local quantities.
Models of Substitution
We estimated substitution rates using 2 independent
methods: a Markov chain–based method and the LogDet
method (Barry and Hartigan 1987; Gu and Li 1996; Baake
and von Haeseler 1999; Sumner and Jarvis 2006). By comparing the estimations derived from each method, we are
able to both identify strand-symmetric regions of the genome and verify that our results on these regions are independent of certain differing assumptions.
Our Markov chain–based method, which assumes
strand symmetry, estimates the rates $q^c_1, \ldots, q^c_6$ in $q_c$ and
the mean local substitution rate $\bar{m}_c$ from the repeat alignments
falling within partition c. We use standard fitting techniques
to estimate the state-transition probability matrix $P_c(t)$ (see
eq. 6 in the Appendix). Here $[P_c(t)]_{ij}$ denotes the
probability of a base with content i at time $t_0$ becoming
a base with content j at time $t_0 + t$. The problem is more
complex than with the more common applications of Markov
chain theories (e.g., as used in Hardison et al. 2003 or
Gaffney and Keightley 2005) because we are dealing with
alignments corresponding to repeat families of different
ages. Thus, we will denote a given repeat family as a,
the age of that family as $t_a$, and can then estimate $P_c(t_a)$
for different families a that intersect partition c. Given
a set of alignments corresponding to repeat family a, we
estimate both $q^c_1, \ldots, q^c_6$ and the mean local substitution rate
$\bar{m}_c$ from a maximum-likelihood fit to the matrices $P_c(t_a)$.
The final step in the application of our model is to define
and calculate a new value $s_c = \bar{m}_c - \bar{m}$, the deviation of c's
substitution rate from the genome mean substitution rate $\bar{m}$.
(Note that whenever we are interested in genome-wide values, we allow for just one partition comprising the whole
genome; the index c is then dispensable and has been omitted.) Of course, when applying this locally, the procedure is
dependent on choosing an appropriate resolution, as reflected in the partition size. We experimented with resolutions ranging from 160-kb windows to 1-Mb windows. The
accumulated length of all repeats used for our analysis
ranged from 20% to 30% of the genome, depending on
the organism. Examples of the resulting local substitution
rate variations are shown and discussed in the Appendix.
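To make the strand-symmetric Markov chain concrete, the sketch below builds a 4x4 rate matrix from the six genome-wide human rates in table 1 and computes a transition matrix as a matrix exponential. This is a minimal illustration, not the authors' fitting procedure: the base ordering, the function names, the truncated Taylor series, and the starting base frequencies are all assumptions, and the matrix layout of the paper's Appendix (eq. 7) is not reproduced here.

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Genome-wide human rates from table 1, keyed by one substitution of each
# strand-symmetric pair; the complementary substitution gets the same rate.
HUMAN_RATES = {
    ("A", "C"): 0.132,  # q1 (also T->G)
    ("A", "T"): 0.130,  # q2 (also T->A)
    ("C", "G"): 0.129,  # q3 (also G->C)
    ("C", "A"): 0.186,  # q4 (also G->T)
    ("A", "G"): 0.434,  # q5 (also T->C)
    ("G", "A"): 0.989,  # q6 (also C->T)
}


def rate_matrix(rates):
    """Build a 4x4 strand-symmetric rate matrix whose rows sum to zero."""
    q = [[0.0] * 4 for _ in range(4)]
    for (src, dst), rate in rates.items():
        for a, b in ((src, dst), (COMPLEMENT[src], COMPLEMENT[dst])):
            q[BASES.index(a)][BASES.index(b)] = rate
    for i in range(4):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    return q


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_matrix(q, distance, terms=30):
    """P = exp(q * distance) by a truncated Taylor series (small inputs only)."""
    p = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in p]
    for n in range(1, terms):
        step = [[q[i][j] * distance / n for j in range(4)] for i in range(4)]
        term = matmul(term, step)  # now equals (q * distance)^n / n!
        p = [[p[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return p


def evolve(dist, p):
    """Propagate base frequencies: row vector times transition matrix."""
    return [sum(dist[i] * p[i][j] for i in range(4)) for j in range(4)]


def gc_content(dist):
    return dist[BASES.index("C")] + dist[BASES.index("G")]


if __name__ == "__main__":
    p = transition_matrix(rate_matrix(HUMAN_RATES), 1.0)
    start = [0.2865, 0.2135, 0.2135, 0.2865]  # hypothetical frequencies, GC = 0.427
    gc_star, k = 0.566 / 1.741, 1.741          # equation (2) and q1 + q4 + q5 + q6
    print("matrix result:          ", round(gc_content(evolve(start, p)), 6))
    print("equation (1) prediction:", round(gc_star + (0.427 - gc_star) * math.exp(-k), 6))
```

Because the GC content of a strand-symmetric chain obeys a closed first-order equation, the matrix result and the prediction of equation (1) should agree to numerical precision.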
In contrast to the previously described approach,
LogDet makes no assumptions about strand symmetry and
places no constraints on the rate matrix. Let $R_c(t)$ be
the rate matrix at time t and let $\bar{R}_c$ be the average of the
$R_c(t)$ matrices over the appropriate time interval. The LogDet model is based directly on the matrix $\bar{R}_c$, hence makes
no assumptions about the changes in $R_c(t)$ or on the relationship between elements of $\bar{R}_c$. Let $l_c$ be the arithmetic
mean rate based on $\bar{R}_c$, which can be written as
$l_c = -\sum_{i=1}^{4} \bar{R}^{ii}_c / 4 = \sum_{i,j \neq i} \bar{R}^{ij}_c / 4$. Barry and Hartigan
(1987) and Zharkikh (1994) each showed that there is a direct connection between the product $l_c t$ and the transitional
probability matrix $P(t)$, which when applied to our $P_c(t_a)$
may be written as
$$d_{ac} = l_c t_a = -\frac{1}{4} \ln \det P_c(t_a) \qquad (4)$$
(see eq. 6 in Gu and Li 1996). This LogDet value $d_{ac}$ is
a "model-free" measure for the time distance $t_a$ in partition
c (i.e., it avoids the constraining assumptions of other
models, as discussed previously). We are able to compute
LogDet times genome-wide, $d_a$, and in every partition,
$d_{ac}$, and then can combine both sets of numbers by
computing the percentage deviation in partition c from the
genome-wide mean, $\Delta d_c = \langle (d_{ac} - d_a)/d_a \rangle$ (where $\langle \cdot \rangle$
stands for an average over all a's).
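Equation (4) can be checked numerically on a model with a known answer. The sketch below is an illustration under stated assumptions: it uses the analytic Jukes-Cantor transition matrix (every off-diagonal rate equal), for which the mean rate is 3 times the single rate, and the function names and per-family distances in the demo are hypothetical values, not data from the study.

```python
import math


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, sign, result = len(a), 1.0, 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign * result


def logdet_distance(p):
    """Equation (4): d = -(1/4) ln det P."""
    return -0.25 * math.log(det(p))


def jukes_cantor_p(rate, t):
    """Analytic P(t) when every off-diagonal rate equals `rate`."""
    decay = math.exp(-4.0 * rate * t)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def mean_relative_deviation(local, genome_wide):
    """Delta d_c: average over repeat families of (d_ac - d_a) / d_a."""
    total = sum((local[fam] - genome_wide[fam]) / genome_wide[fam]
                for fam in genome_wide)
    return total / len(genome_wide)


if __name__ == "__main__":
    rate, t = 0.1, 2.0
    print("LogDet distance:", logdet_distance(jukes_cantor_p(rate, t)))
    print("mean rate * t:  ", 3.0 * rate * t)
    genome = {"family_1": 0.10, "family_2": 0.05}   # hypothetical d_a values
    local = {"family_1": 0.11, "family_2": 0.055}   # hypothetical d_ac values
    print("Delta d_c:", mean_relative_deviation(local, genome))
```

For the Jukes-Cantor case the determinant of P(t) is exp(-12 rate t), so the LogDet distance equals the mean rate times t, as equation (4) states.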
Materials
Genome builds used were downloaded from the
University of California, Santa Cruz browser (Kent et al.
2002): hg18 (human), cf2 (dog), pt2 (chimpanzee), and
rm2 (macaque). Human repeat information was extracted
by RepeatMasker v. 3.1.2 and RM database version
20051025. Chimpanzee repeat information was extracted
by RepeatMasker v. 3.1.3, RM database version 20060120.
All software tools created for this project will be provided to interested parties on request.
Statistics
Linear regressions are characterized with Pearson’s
moment correlation coefficient (denoted by $r_p$) at a P value
of less than $10^{-8}$. Each correlation passed all standard diagnostic tests to ensure that $r_p$ is a legitimate characterization of the fit (results not shown).
Results
Identifying strand-symmetric regions
Our core prediction, equation (3), follows directly
from our 2 assumptions: strand symmetry and the time independence
of the rate matrix. To verify the prediction with
genomic data, we must find regions on the genome that conform to both assumptions. We do this by comparing
the local rates, as determined by our method, against the
prediction of the previously discussed LogDet method—
which relaxes both assumptions. Only in areas where the
2 predictive methods agree can we expect the assumptions
to hold. It is worth noting that as both models estimate average rates over some time period, we are actually locating
segments that experienced rates which "effectively" meet
these assumptions over that period. Thus, we will identify
both those segments that conformed to the assumptions uniformly and those that fail to do so at any instant but that
experienced rate variations over time that produced the
same effect.
As the LogDet estimations are independent of a specific model of substitution-rate relations, they are ideal
for testing the assumptions of time constancy and strand
symmetry of q made in the first model. We show in the
supplementary material (eq. S18, S19, and S23–S25, Supplementary Material online) that if these assumptions and
approximations have no effect on the result, we should find
that 1) for a genome-wide fit, the estimated $d_a$ from equation (4) must be equal to $\bar{m} t_a$ and that 2) for the partition-wide
fits, $\Delta d_c$ must be equal to $s_c/\bar{m}$ obtained with the
strand-symmetric model. Figure 1 shows both correlation
diagrams for the human genome: $d_a$ versus $\bar{m} t_a$ for 812 different
repeat families and $\Delta d_c$ versus $s_c/\bar{m}$ for the 21,000
partitions covering the genome (160-kb window). For the
comparison $d_a$ versus $\bar{m} t_a$ in figure 1a, we obtain a (Pearson's)
correlation coefficient as high as $r_p = 0.997$ (human),
with 50% of all families showing a difference smaller than
2% to the LogDet times. Equally good results were obtained for the other genomes: $r_p = 0.997$ (chimpanzee),
$r_p = 0.996$ (dog), and $r_p = 0.994$ (macaque). This strong
agreement between the predictions shows that our genome-wide assumptions, including strand symmetry, are
valid. Apparently, the effect of regions where these assumptions do not hold averages out in such a genome-wide analysis.
When performing the same investigation on a local
scale, a different picture emerges. For $\Delta d_c$ versus $s_c/\bar{m}$
in figure 1b, we obtain a correlation coefficient of $r_p = 0.92$.
Here, 50% of all partitions show a difference between
$\Delta d_c$ and $s_c/\bar{m}$ that is larger than 64% of $s_c/\bar{m}$ in that partition.
By means of this plot, we can now clearly identify
those partitions where our model assumptions are justified.
We have selected those regions where the difference between
$s_c/\bar{m}$ and $\Delta d_c$ is smaller than 5%. For the human genome,
we discarded almost 75% of all 160-kb windows,
leaving approximately 5,000 partitions representing 620
Mb of the entire human genome. Similarly, applying the
same criterion to the corresponding figures for the other genomes, we obtained 560 Mb for the dog genome, 730 Mb
for the macaque genome, and 680 Mb for the chimpanzee
genome (see table 2). In the following, we will limit our
analysis to only these partitions, where both methods produce approximately the same values (within 5%). The correlations
hardly differ whether we show $s_c/\bar{m}$ or LogDet
rates $\Delta d_c$, hence the results are independent of the method
used to compute the neutral substitution rate variation.
A full list of identified strand-symmetric regions is
included in the supplementary materials (Supplementary
Material online).
FIG. 1. (a) Correlation diagram of the human repeat family ages
estimated with our method (y axis) and with the LogDet method (x axis).
Although our estimate is based on a strand-symmetric and time-independent substitution model, no such constraining assumption has to
be made when using the LogDet method. (b) The relative local substitution
rate $s_c/\bar{m}$ in partition c computed with our method correlated against the
corresponding rate estimated using the LogDet time measure ($\Delta d_c$).
Genome-Wide, Strand-Symmetric GC Equilibrium
Value Is at about 30% for Various Mammalian Genomes
We next look at the results of fitting q both globally
and locally. In table 1, we show the results of fitting the
elements of q on a global scale, giving us a picture of
the general trend of substitution rates similar to that in
Lander et al. (2001). We observe that the transversion rate
away from GC base pairs, q4, is higher than any other transversion rate and that the transition rate away from GC base
pairs, q6, is twice the reverse rate q5 (A→G and T→C transitions) and 5 times a typical transversion rate. Also listed
are results obtained after masking CpG sites on the consensus sequences of each repeat family, leading to a small shift
of the rate q6.
When we fit $P_c(t_a)$ in a specific partition c, we obtain
$\bar{m}_c$ and the rates $q^c_1, \ldots, q^c_6$ that allow us to estimate the GC
equilibrium values $GC^*_c$ (for those partitions conforming to
our assumptions). Averaging over these partitions, we obtain an estimate for the genome-wide GC* that is representative for 620 Mb of the human genome (560–730 Mb for
the other genomes). This GC* value is compared with the
current GC content on today's genome in table 2. We observe that for the investigated species, our model predicts
that the equilibrium point GC* is about 30%, whereas the
mean GC content on today's genome is at about 42%, implying that these genomes are far away from the stationary
state. We stress that locally $GC^*_c$ (estimated with a variance
of 0.04) shows considerable deviations from the genome-wide value GC*.
Exponential GC-Content Decay Verified
We are now in the position to determine the accuracy
of prediction (3). $p^{GC}_c(0)$, the modern GC content, is easily
calculable. $GC^*_c$, $k_c$, and $s_c/\bar{m}$ can be calculated by the application of our model, with which we can determine the
argument of the exponential function in (3) relative to
the time distance $\bar{m}T$: $k_c(\bar{m}_c T)/(\bar{m}T) = k_c(1 + s_c/\bar{m})$. This
allows us to plot $p^{GC}_c(0)$ on the left-hand side of equation (3)
versus the argument of the exponential function on the
right-hand side of this equation, as we do in figure 2 for
both human and dog. As we have already observed, equation (3) describes a curve in a space spanned by 3 local
functions: $p^{GC}_c(0)$, $k_c \bar{m}_c$, and $GC^*_c$. Thus, figure 2 can be
considered to be a projection of this 3-dimensional data
cloud onto a plane spanned alone by $p^{GC}_c(0)$ and $k_c \bar{m}_c$
(i.e., onto a plane where $GC^*_c$ in equation (3) is fixed
to some value). This should then result in an exponential
relationship between $p^{GC}_c(0)$ and $k_c(\bar{m}_c T)$. Indeed, we
see this to be the case in figure 2. We also note an apparent
convergence value of GC content to the predicted GC* values
from table 2, which suggests that we should examine excess
GC content (i.e., the GC content relative to GC*). Taking the
logarithm of the excess GC content, we expect to find that
$\ln(p^{GC}_c(0) - GC^*)$ has a linear relationship to $k_c \bar{m}_c T$ (see
eq. 3). We check this in figure 3 for 4 mammalian genomes.
A very strong correlation is found, with coefficients as strong
as $r_p = -0.85$ for human, chimpanzee, and macaque and
$r_p = -0.82$ for dog (table 2). We also show the results of a least
squares fit, providing us with estimates for the parameters b and
$\bar{m}T$ in the fitting formula $y = \ln(p^{GC}_c(0) - GC^*) = b - (\bar{m}T)x$,
where $x = (k_c \bar{m}_c T)/(\bar{m}T)$. All values are given in table 2.
Figure 3e is an alternative to figure 3a in which we do not
use repeats (the basis for our calculation of distance) in the
calculation of GC content to ensure that the calculations on
each axis are independent. The minimal change between
the 2 figures shows this is not a problem.
In contrast to the Arndt model of substitution rates
(Arndt, Burge, and Hwa 2003; Arndt, Petrov, and Hwa
2003; Arndt et al. 2005), our model does not explicitly account for neighbor interaction; elevated CpG dinucleotide
rates are instead reflected in the estimation of single-base
substitution rates (e.g., compare the masked and unmasked
estimates for q6 in table 1). The concern that our failure to
explicitly consider Arndt's seventh parameter may introduce error into our results is addressed in figure 3f, where
we repeat the analysis leading to figure 3a after masking out
CpG sites. We see from this figure, and from table 2, that
calculating our values based only on non-CpG sites has virtually no effect on our calculations. Hence, it is clear that
CpG hypermutability can be ruled out as a source of error.
FIG. 2. GC content versus the time distance $k_c(\bar{m}_c T)$ for 5,000
(4,000) partitions on the human (dog) genome, representing 620 Mb (560
Mb) of the entire genome. $k_c \bar{m}_c T$ is the quantity that appears as argument
of the exponential function of the prediction (3). The local substitution
rates $\bar{m}_c$ have been computed by 2 independent methods; the solid lines
are fits to an exponential function $y = \exp(b - (\bar{m}T)x)$ with the fitting
parameters b and $\bar{m}T$ given in table 2.
Table 2
Data Collected for the Genomes of Human (hg18), Chimpanzee (pt2), Dog (cf2), and Macaque (rm2), followed by Human
with Masked CpG Sites and for Human with a GC Content Determined from the Repeat-Masked Human Genome (hg18(bw))
Genome       GC(0)  Size (Mb)  GCs(0)  GC*          rp     95% confidence interval  mT   b
Human        0.427  620        0.400   0.30         −0.85  (−0.86, −0.84)           3.6  4.0
Chimpanzee   0.416  680        0.398   0.30         −0.85  (−0.86, −0.84)           3.6  4.0
Dog          0.412  560        0.397   0.31         −0.82  (−0.83, −0.81)           6.3  8.3
Macaque      0.415  730        0.397   0.30         −0.85  (−0.86, −0.84)           3.5  3.7
Human (CpG)                            0.30 ± 0.04  −0.85  (−0.86, −0.84)           3.6  4.0
Human (bw)                                          −0.81  (−0.82, −0.80)           5.0  6.0
NOTE. The mean GC content on today's genome, GC(0), the total size of the genomic region selected in this study, the mean GC content GCs(0) over these selected
regions, the equilibrium value of the GC content, GC*, as obtained from the average of $GC^*_c$ over all selected partitions, the correlation coefficient $r_p$ for the correlation in
figure 3 with their 95% confidence intervals, and the parameters $\bar{m}T$ (column mT) and b obtained from the linear regressions.
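The log-linear fit described above can be sketched with an ordinary least-squares regression. This is a minimal illustration, not the authors' code: the function names are invented, the demo data are synthetic and noise free (generated from the human parameters b = 4.0 and mT = 3.6 in table 2 with GC* = 0.30), and the x values are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy / math.sqrt(sxx * syy)


def fit_decay(x_values, gc_now, gc_star):
    """Fit ln(p(0) - GC*) = b - (mT) x and return (b, mT, r_p).
    Every GC value must exceed gc_star, otherwise the logarithm is undefined."""
    ys = [math.log(gc - gc_star) for gc in gc_now]
    intercept, slope, r = fit_line(x_values, ys)
    return intercept, -slope, r


def predicted_gc(x, b, m_T, gc_star):
    """GC content implied by the fitted line at relative time distance x."""
    return gc_star + math.exp(b - m_T * x)


if __name__ == "__main__":
    xs = [1.5, 1.6, 1.7, 1.8, 1.9]  # hypothetical values of k_c (1 + s_c / m)
    gcs = [predicted_gc(x, 4.0, 3.6, 0.30) for x in xs]
    b, m_T, r = fit_decay(xs, gcs, 0.30)
    print(f"b = {b:.3f}, mT = {m_T:.3f}, r_p = {r:.3f}")
    # Genome-wide human k is 1.741 (sum of q1, q4, q5, q6 in table 1).
    print("GC at x = 1.741:", round(predicted_gc(1.741, 4.0, 3.6, 0.30), 3))
```

With the human parameters, the fitted line evaluated at the genome-wide k returns a GC content close to the 0.400 listed for the selected human regions in table 2, which is a useful sanity check on the sign convention of the fit.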
Including Recombination Rates
In table 1 we see that q1 þ q5 gives the substitution rate
of AT base pairs to GC base pairs, whereas q4 þ q6 gives the
substitution rate of GC base pairs to AT base pairs. Our
method provides us with estimates for these rates in every
partition c. BGC, on the other hand, is known to have the
effect of increasing the AT/GC substitution rate by an
amount proportional to the local recombination rate
(Meunier and Duret 2004; Duret 2006; Galtier and Duret
2007), which we will denote by qc. Having this in mind,
c ðqc1 þ qc5 Þ are locally
we may expect to find that the rates m
increased by bqc (where b is some proportionality con c ðqc4 þ qc6 Þ, should
stant), whereas the GC/AT rates, m
be locally reduced by the same amount. Thus, we expect
c ðqc4 þ qc6 Þ5c2 bqc
c ðqc1 þ qc5 Þ5c1 þ bqc and m
that m
with 2 constants c1 and c2 that are not material to our argument. From these relations, it follows that the difference
c ðqc1 þ qc5 ðqc4 þ qc6 ÞÞ should be equal to c1 – c2 þ 2bqc.
m
c ðqc1 þ qc5 ðqc4 þ qc6 ÞÞ against qc
Figure 4 correlates m
(using sex-averaged deCODE recombination rates from
Kong et al. 2002, resolved here with 1 Mb windows).
As a result, we see a small, but statistically significant, correlation with $r_p = 0.33$ (95% confidence interval [0.29, 0.38]). We have also correlated the recombination rate with $\bar{m}_c(q^c_1 + q^c_5 + q^c_4 + q^c_6)$, which is just $\bar{m}_c k_c$ in equation (9), and found no significant correlation.

FIG. 3. (a–d) A logarithmic plot of the excess GC content versus $k_c \bar{m}_c T$ on the genomes of human, chimpanzee, macaque, and dog; (e) shows again the GC content on the human genome versus $k_c \bar{m}_c T$, with GC content now computed from genome regions between the repeats; (f) shows the human genome data with all CpG sites on the consensus sequences being blocked out.

FIG. 4. Local difference between the AT → GC and GC → AT substitution rates versus the recombination rate on the human genome (1-Mb window). For recombination rates, we used the sex-averaged deCODE rates from Kong et al. (2002).
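The expected relations between the two rate combinations and the recombination rate can be illustrated with a small numerical sketch. Everything below is hypothetical: the constants `c1`, `c2`, the proportionality constant `b_prop`, and the recombination values in `rho` are made up, and the data are noise-free. The sketch only shows that, under the stated model, the rate difference has slope 2b in the recombination rate while the rate sum does not depend on it.

```python
def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical constants (not estimates from the paper).
c1, c2, b_prop = 0.30, 0.60, 0.02
rho = [0.5, 1.0, 1.5, 2.0, 3.0]   # assumed local recombination rates

at_to_gc = [c1 + b_prop * r for r in rho]   # raised by b * rho
gc_to_at = [c2 - b_prop * r for r in rho]   # lowered by the same amount

difference = [a - g for a, g in zip(at_to_gc, gc_to_at)]
total = [a + g for a, g in zip(at_to_gc, gc_to_at)]

slope_difference = fit_slope(rho, difference)   # equals 2 * b_prop in this model
slope_total = fit_slope(rho, total)             # vanishes in this model
print(slope_difference, slope_total)
```

This matches the qualitative pattern reported above: a correlation for the difference of the rates and none for their sum.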
Meunier and Duret (2004) have found a similar correlation between $\mathrm{GC}^*_c$ and the recombination rate. As observed by these authors, there are a number of reasons why we cannot expect this correlation to be large. Recombination appears to vary on a scale much smaller than 1 Mb, and thus its rate averages out over scales greater than 1 Mb; moreover, the 2 variables correlated here reflect processes operating on different time scales. Although recombination rates may change rapidly, $\mathrm{GC}^*_c$ can be traced back to rates that are actually time averages over long evolutionary periods. Additionally, in our study, the genomic segments under analysis were picked based on strand symmetry and time constancy. We have no reason to believe that these characteristics are correlated with recombination; thus, our plot likely superimposes partitions supporting a strong correlation on others supporting no correlation at all.
Asymmetric Regions of the Genome
Our approach allows us to easily distinguish between parts of the genome that follow a strand-symmetric, time-independent substitution rate model and those that do not. In figure 5, we graph GC content (on the x axis) against our substitution rate estimator (on the y axis) for both our selected 620 Mb of the human genome (black points, lighter fit) and the complementary set (red points, dark fit). It is evident that by selecting strand-symmetric areas we have been able to discard data that grossly deviate from the exponential decay curve, but it also becomes obvious that we have discarded a bulk of data that were compatible with our picture (the majority of red data points are covered by the black data points).
FIG. 5. The exponential decay curve of figure 2 of the human genome, now mirrored at the bisecting line (black symbols), with the green solid curve being the exponential fit in figure 2. The data of figure 2 were a selection of 5,000 partitions for which 2 methods predicted similar rates. The data for the discarded 16,000 partitions are shown as red symbols. Both sets together cover the whole genome. We observe that we discarded both data that are compatible with our picture and data that are not. The latter show an upward bend, particularly at high GC content, which other groups have fitted to a parabolic function (dark) (see text).
Discussion
In our investigation, we have concentrated on regions of the genome that we know to be subjected to strand-symmetric substitution rates. Selective pressure will frequently produce substitution asymmetries (Frank and
Lobry 1999). Nor is this disruption of symmetry limited
to areas under selection; both transcribed areas and regions
around replication origins are known to be strand asymmetric (Green et al. 2003; Touchon et al. 2005). By limiting our
analysis to the strand-symmetric regions, we eliminate associated sources of noise obscuring the pattern of decay,
and are thus able to reveal the exponential relation between
local GC-content decay and local (neutral) substitution
rates. Thus, we form a picture of the decay dictated by
the underlying neutral processes. If the values on the x axis of figure 2 are interpreted as the relative decay rate ($k_c \bar{m}_c/\bar{m}$) of the exponential decay, then we can conclude that local GC content is decaying over the genome, with genomic areas subjected to higher decay rates (i.e., experiencing a faster decay) having a disproportionately lower GC content than those subjected to lower decay rates.
This implies that, regardless of the initial distribution of
local GC content, even in the extreme case of no initial isochore structure, the exponential relationship of GC content
to local substitution rate will inevitably lead to the establishment of some sort of isochore structure simply because
of the regional differences in decay rates and the equilibrium
points—a finding similar to that of Khelifi et al. (2006).
The exponential relationship further supports the development of this structure as these regional differences grow
faster than they would under a linear relationship. One
could interpret the values on the x axis of figure 2 as time distances $k_c \bar{m}_c T$, such as those extracted from any Markov chain–based model (Ewens and Grant 2005). Then the message of figure 2 would be that the decay of GC content in partitions with a large time distance is more advanced than in those with a smaller distance. Note finally that we would expect the correlation in figure 3 to be less than 1 as there is still the unknown local GC content $p^{GC}_c(T)$ at time $T$. In fact, $p^{GC}_c(T)$ is presumably different in every partition, inducing a vertical shift for every data point in figures 2 and 3. Surprisingly, these shifts are small enough that they do not obscure the exponential decay. Figure 2 allows us to roughly estimate an upper limit of $p^{GC}_c(T)$ by the vertical distance of each data point from the fitted curve.
Comparisons with Previous Studies
The shape of our correlation is different than that
found in previous studies, but it is not contradictory; our
analysis just presents a clearer picture as we are able to
strip away a source of noise inherent in the other studies.
Consider the works by Hardison et al. (2003), Hellmann
et al. (2003, 2005), and Belle et al. (2004). In the study
by Hardison et al., we see a quadratic relation between
GC content and substitution rate, where local substitution
rate is calculated using the multispecies, repeat-based tAR
statistic. The later study by Hellmann et al. (2005) finds a similar
result (measuring substitution rates with human–chimpanzee
divergence) but reduces this to a linearly decreasing correlation when introducing CpG content as a second variable
in their regression model. In figure 5, we explain both these
results with our model. We see that when using the regions
that are not strand symmetric (or not identifiable as such),
we replicate the shape of the Hardison curve. Concentrating
on the strand-symmetric regions, we find the negative correlation predicted by Hellmann—but are able to better resolve the shape of the curve. In the Belle paper, the authors
point out that an exponential decay would make sense (an
observation also made by Gu and Li 2006), but they find the
explanation incompatible with their data (calculated using
the method of Galtier and Gouy 1998). However, their
results reflect factors not relevant to our analysis as they
base their rate estimations on the alignment of coding sequences, which are not subject to strand-symmetric substitution and are shaped by constraint-related forces.
The question of whether the genome has reached its
GC-content equilibrium has been addressed in a number
of studies. The works of Lercher et al. (2002) and Webster
et al. (2003) both support our assertion that the genome is in
a state of decay through the analysis of SNP evidence, and
the Webster et al. (2005) results are also consistent with this
hypothesis. Both Alvarez-Valin et al. (2004) and Antezana
(2005) have taken the opposite position, though the methodology of the latter is shown to be faulty in Duret (2006).
A line of studies including Duret et al. (2002), Meunier
and Duret (2004), Khelifi et al. (2006), and Duret (2006)
have made significant contributions to the understanding
of GC-content decay, the projected equilibrium point,
and the underlying causes of decay; it is important to consider how our results fit into the picture resulting from these
works. Together, these studies have looked at the relationship between the current local GC content and $\mathrm{GC}^*_c$, as well as the effect of recombination rates on these variables (presumably through the mechanism of BGC). They find pairwise correlations between recombination rate, $\mathrm{GC}^*_c$, and the GC content of $c$, and then argue for a causal relationship starting from recombination, with $\mathrm{GC}^*_c$ as a mediator.
Our analysis is consistent with their work, though it
does not lead to any conclusions about causality. Consider
equation (2) for the equilibrium GC content. This expression for $\mathrm{GC}^*$ results directly from balancing the net rates, $q_{15}\,\mathrm{AT}^* = q_{46}\,\mathrm{GC}^*$, where $q_{15} = q_1 + q_5$ is the AT → GC rate, $q_{46} = q_4 + q_6$ is the GC → AT rate, and $\mathrm{AT}^* = 1 - \mathrm{GC}^*$ is the equilibrium AT content. If $q_{46}$ were equal to $q_{15}$, then $\mathrm{GC}^*$ would be 1/2. As $q_{46}$ increases, the balance shifts and $\mathrm{GC}^*$ decreases. We find that in the genome-wide average, the GC → AT rate is roughly twice the reverse rate, implying that genome-wide $\mathrm{GC}^*$ is roughly 1/3. Of course, this argument applies not only to the whole genome but also to every partition $c$: a ratio of $q^c_4 + q^c_6$ to $q^c_1 + q^c_5$ larger than one implies a local $\mathrm{GC}^*_c$ falling below one-half.
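The balance argument can be checked numerically. The helper below is a minimal sketch (the function name and the example rates are illustrative assumptions): it solves the balance condition for the equilibrium GC content, which gives the AT → GC rate divided by the sum of both rates.

```python
def equilibrium_gc(rate_at_to_gc, rate_gc_to_at):
    """Solve rate_at_to_gc * (1 - gc) = rate_gc_to_at * gc for gc."""
    return rate_at_to_gc / (rate_at_to_gc + rate_gc_to_at)

# Equal rates in both directions.
equal_rates = equilibrium_gc(1.0, 1.0)

# GC -> AT rate twice the AT -> GC rate, as in the genome-wide average.
twice_reverse = equilibrium_gc(1.0, 2.0)

print(equal_rates, twice_reverse)
```

Only the ratio of the two rates matters here, so the absolute scale of the inputs is arbitrary.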
Following Duret et al. (2002) and Meunier and Duret
(2004), we incorporate the BGC mechanism as follows.
Consider a time and a location c where the recombination
rate has reached a high enough level to make a difference.
We expect a GC-biased fixation process and thus an increase of the local AT → GC rate by an amount proportional to the local recombination rate, implying that the reverse rate GC → AT decreases by the same amount. The sum of these 2 rates, $q^c_1 + q^c_5 + q^c_4 + q^c_6 = k_c$, is not affected, but their difference is reduced, leading to an upward shift of the local $\mathrm{GC}^*_c$ back toward one-half. Thus, by changing the local rates, the BGC process will not alter the exponential decay behavior (which depends on $k_c$ and $\bar{m}_c$) but simply shifts the local $\mathrm{GC}^*_c$ upward to an extent that is directly proportional to the recombination rate. This effect is then responsible for the correlation between $\mathrm{GC}^*_c$ and the recombination rate found by Meunier and Duret (2004).
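As a worked illustration of this argument, the shift can be written out explicitly. The symbol $\Delta_c$ is an assumption introduced here for the recombination-dependent rate change (it does not appear in the original text), and $q^c_{15} = q^c_1 + q^c_5$ and $q^c_{46} = q^c_4 + q^c_6$ are local versions of the shorthand used for the balance condition.

```latex
\begin{align*}
\mathrm{GC}^*_c &= \frac{q^c_{15}}{q^c_{15} + q^c_{46}} = \frac{q^c_{15}}{k_c}, \\
\mathrm{GC}^{*\prime}_c &= \frac{q^c_{15} + \Delta_c}{(q^c_{15} + \Delta_c) + (q^c_{46} - \Delta_c)}
  = \mathrm{GC}^*_c + \frac{\Delta_c}{k_c}.
\end{align*}
```

In this sketch the sum $k_c$ is unchanged, while the equilibrium point moves up by $\Delta_c/k_c$, which is linear in the recombination rate whenever $\Delta_c$ is.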
By the definition of an "equilibrium point," the GC content must be headed toward $\mathrm{GC}^*_c$. Over the 620 Mb of the human genome we have investigated, we have not found any partition where the GC content is currently less than $\mathrm{GC}^*_c$, and in the 14 Mb investigated by Meunier and Duret, they found only 2 Mb where this was the case. So, in practice, the local GC content in $c$ is decreasing toward the lesser $\mathrm{GC}^*_c$ value. Although a high recombination rate is able to, through BGC, raise the local AT → GC rate over some period of time, this change can stop the decay of local GC only if it is strong enough to shift $\mathrm{GC}^*_c$ above the local GC-content level, an effect we do not see.
We recall that equation (3) describes the interdependence of 3 local functions: $\mathrm{GC}^*_c$, $p^{GC}_c(0)$ (local GC content), and the value $k_c \bar{m}_c$. When projecting the 3-dimensional plot onto a plane by setting any one of the variables to a constant, we see the association between the remaining 2 variables found by Duret. To this point, we have only studied the projection derived from setting $\mathrm{GC}^*_c$ to a constant, resulting in our exponential curves. The other projections are
of interest in their own right but are quite complicated and
beyond the scope of this study.
Other Points of Consideration
We need to address our assumption of independent substitution rates in neighboring sites, specifically the issue of elevated CpG substitution rates, which are known to be significantly higher than other substitution rates (Arndt and Hwa 2004; Arndt et al. 2005). To investigate this effect, we have followed the common practice of masking out CpG sites (Meunier and Duret 2004; Gaffney and Keightley 2005; Gu and Li 2006; Taylor et al. 2006) and found no difference in our results, as illustrated by a comparison of figure 3a with figure 3f and in table 1, a result consistent with that of Gu and Li (2006).
The idea that these rates would have no effect on our calculation seems unlikely in light of results such as those of
the Arndt studies (Arndt, Burge, and Hwa 2003; Arndt,
Petrov, and Hwa 2003; Arndt et al. 2005), so the more plausible conclusion is that the CpG effect is not compatible
with one of the characteristics used to pick our partition
set. In other words, if a partition conforms to our assumptions, then the CpG effect is minimal within that partition.
Our analysis has been extended to 3 other genomes: chimpanzee, macaque, and dog. We have obtained almost identical results for all, with differences only in $\bar{m}T$ (the slope of the red curve in fig. 3) for dog compared with the other 3 species. Notably missing from this analysis
are the murids. Because mouse and rat are subjected to
a much higher substitution rate (Waterston et al. 2002; Gibbs
et al. 2004), RepeatMasker has considerably less power to
recognize older repeats. For our purposes, the amount of sufficiently older repeat data is less than is needed to get a clear
picture of the decay in those genomes.
We also investigated the possibility that SNPs occurring within repeats could introduce error into our calculations. An SNP occurring in a repeat is reflective of mutation
rate but not necessarily reflective of substitution rate; using
SNPs to calculate substitution rates could bias the results.
To check for this, we masked out all SNP locations, using
information from dbSNP build 125 downloaded from the
University of California, Santa Cruz (UCSC) Genome Browser (Sherry et al.
2001; Kent et al. 2002), and found no significant changes
(data not shown).
Finally, we addressed whether our calculation of repeat substitution rates could be biased by the faulty reconstruction of ancestors by RepeatMasker. To check this, we
repeated the experiment based on different subsets of repeat
families, both manually and randomly chosen. Each analysis resulted in the same conclusions. (See supplementary
materials for details, Supplementary Material online.)
Our results show a clear, strong exponential correlation between the substitution rate pattern and GC content in strand-symmetric areas of the genome, which
confirms a prediction that is a direct consequence of a
strand-symmetric rate model. Further investigations of the
relationship, with an aim toward determining causality, are
certainly worthwhile. A study of our identified regions of
strand symmetry (or, more appropriately, the complementary
set of regions) may also prove valuable. And although we
cannot label any factor as being causal, it is potentially revealing to consider equation (3) under a model in which the
substitution rate drives GC-content decay. The location dependence of the substitution rate pattern then allows us to
predict the evolution of the isochore structure: GC content decays faster toward $\mathrm{GC}^*_c$ in regions of high substitution rate and more slowly in those of low rate. This also implies that if
the genome started from a uniform GC content (e.g., with
no isochore structure), then the substitution rate pattern
would inevitably lead to a local variation of the GC content.
However, this model is only speculation; we could also envision a model in which GC content filled the role of determining the local substitution rate.
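The speculative scenario above, in which a uniform GC content develops local variation purely through regional differences in decay rate, can be sketched as follows. All numbers are hypothetical, the function name is an assumption, and for simplicity the equilibrium value is held fixed across partitions (the text also allows it to vary).

```python
import math

def gc_after_decay(gc_start, gc_eq, decay_rate, elapsed):
    """Exponential relaxation of GC content toward its equilibrium value."""
    return gc_eq + (gc_start - gc_eq) * math.exp(-decay_rate * elapsed)

gc_start = 0.50                      # assumed uniform starting GC content
gc_eq = 0.33                         # assumed common equilibrium value
decay_rates = [0.5, 1.0, 1.5, 2.0]   # assumed regional decay rates
elapsed = 1.0                        # assumed elapsed time, arbitrary units

gc_now = [gc_after_decay(gc_start, gc_eq, k, elapsed) for k in decay_rates]
spread = max(gc_now) - min(gc_now)

for k, gc in zip(decay_rates, gc_now):
    print(f"decay rate {k:.1f}: GC content {gc:.3f}")
print(f"spread across partitions: {spread:.3f}")
```

Partitions that start identical end up ordered by decay rate, with the fastest-decaying partition having the lowest GC content.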
Supplementary Material
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
The authors would like to thank Svitlana Tyekucheva,
Kateryna Makova, Adam Eyre-Walker, and Webb Miller
for their helpful comments; Richard C. Burhans and Nathan
Coraor for their technical support; and Laura Tabacca for
editing. John Karro worked under the support of National
Institutes of Health (NIH) grant 5K01HG003315. Martin
Peifer received financial support from the Austrian Science Foundation (FWF) under project title P18762. Ross
Hardison was supported by NIH grant 5R01DK065806.
This project is funded, in part, under a grant with the
Pennsylvania Department of Health using tobacco settlement funds. The department specifically disclaims responsibility for any analyses, interpretations, or conclusions.
Appendix
In the following, we give a brief description of our
model, the approximations used in its application, and
the tests performed to check the validity of those approximations. A detailed report is given in the supplementary
materials (Supplementary Material online).
The Rate Model
Our underlying model is a nonhomogeneous Markov
chain, similar to that used in other standard approaches. The
4-dimensional time-dependent state vector (pA(t), pC(t),
pG(t), pT(t)), defining the probability of being in each state
at time t, evolves according to a substitution rate matrix R(t)
having components
$R_{ij}(t)$. Recall that every possible $R(t)$ must satisfy $\sum_j R_{ij} = 0$, which allows us to define a mean substitution rate $m$ by averaging over all components of this rate matrix: $m = -\sum_{i=1}^{4} R_{ii}/4 = \frac{1}{4}\sum_i \sum_{j \neq i} R_{ij}$. To make this $m$ explicit in the following expressions, we replace $R_{ij}$ by $m q_{ij}$ and scale the matrix $q$ such that $\frac{1}{4}\sum_i \sum_{j \neq i} q_{ij} = 1$. The transition probability matrix propagating a state vector at time $t_0$ to time $t_0 + t$ reads in its most general form

$$P(t) = \exp\left(\int_{t_0}^{t_0+t} R(t')\,dt'\right) = \exp(\bar{R}\,t) = \exp(\overline{mq}\,t), \tag{5}$$

where an overline over a function $f(t)$ represents its time average $\bar{f} = (1/t)\int_{t_0}^{t_0+t} f(t')\,dt'$. In our application, we cannot consider the most general case and instead approximate $\overline{m q_{ij}}$ by the product of $\bar{m}$ and $q_{ij}$, with the constants $q_{ij}$ obtained from a maximum-likelihood fit, as
explained further below. Underlying this approximation is
the assumption that though rates may be time dependent,
the ratio of any 2 rates within the rate matrix does not
change with time. This approximation is necessary in order
to be able to perform fits; with the comparison in figure 1
to LogDet times (which do not depend on this
approximation), we have ensured that we consider only
those partitions where this approximation has a negligible
effect (for a discussion of this approximation, see section
A2 in supplementary material, Supplementary Material
online).
To estimate rates on a local scale, we break the genome
into Z partitions, representing each partition by an index c.
The index is dropped whenever Z 5 1, that is, when we
want to indicate that a quantity is based on a genome-wide
analysis. Using this partitioning notation and applying our
approximation, equation (5) can be written as

$$P_c(t) = \exp(\bar{m}_c\, q_c\, t). \tag{6}$$
For $q_c$ (the rate matrix estimated from partition $c$), we use a strand-symmetric rate model (Sueoka 1995; Lobry and Lobry 1999):

$$q_c = \begin{pmatrix}
-(q^c_1 + q^c_5 + q^c_2) & q^c_1 & q^c_5 & q^c_2 \\
q^c_4 & -(q^c_4 + q^c_3 + q^c_6) & q^c_3 & q^c_6 \\
q^c_6 & q^c_3 & -(q^c_6 + q^c_3 + q^c_4) & q^c_4 \\
q^c_2 & q^c_5 & q^c_1 & -(q^c_2 + q^c_5 + q^c_1)
\end{pmatrix} \tag{7}$$

with the 6 independent substitution rates $q^c_1, \ldots, q^c_6$ defined in table 1. Every strand-symmetric matrix can be transformed into a block-diagonal matrix consisting of two 2 × 2 blocks, $q^+$ and $q^-$ (Lobry and Lobry 1999), of which only

$$q^+_c = \begin{pmatrix}
-(q^c_1 + q^c_5) & q^c_1 + q^c_5 \\
q^c_4 + q^c_6 & -(q^c_4 + q^c_6)
\end{pmatrix} \tag{8}$$

is of interest here. It is the rate matrix for the 2-dimensional state vector $\vec{p}(t) = (p^{AT}(t), p^{GC}(t)) = (p^A(t) + p^T(t), p^G(t) + p^C(t))$ representing the A + T and G + C content at time $t$. Note that $q^+_c$ depends just on $q^c_1 + q^c_5$ and $q^c_4 + q^c_6$, which correspond to an A,T → G,C transition and a G,C → A,T transition, respectively (see table 1). The set of differential equations associated with this reduced 2 × 2 system, $\frac{d}{dt}\vec{p}(t) = \vec{p}(t)\,q^+$, can be integrated (see Lobry and Lobry 1999, as well as sections A5 and A6 in the supplementary material, Supplementary Material online), leading directly to equation (2), to

$$k_c = q^c_1 + q^c_5 + q^c_4 + q^c_6 \tag{9}$$

and finally to the prediction (3), which is the central equation of this paper. We emphasize that, if one applies the same transformation leading to equation (8) to a non–strand-symmetric rate matrix, one will find two 2 × 2 blocks which are still coupled, making it impossible to replace the set of 4 coupled differential equations by 2 independent sets of differential equations. Thus, prediction (3) is valid only as long as a strand-symmetric model is applicable.
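The reduction from the 4-state model to a single exponential in GC content can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the six rate values and the starting composition are hypothetical (and not rescaled to unit mean rate), states are ordered A, C, G, T, and the matrix exponential is computed with a plain truncated Taylor series, which is adequate only for small matrices of modest norm.

```python
import math

def strand_symmetric_matrix(q1, q2, q3, q4, q5, q6):
    """Strand-symmetric rate matrix with states ordered A, C, G, T."""
    return [
        [-(q1 + q5 + q2), q1, q5, q2],
        [q4, -(q4 + q3 + q6), q3, q6],
        [q6, q3, -(q6 + q3 + q4), q4],
        [q2, q5, q1, -(q2 + q5 + q1)],
    ]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, terms=40):
    """Matrix exponential via a truncated Taylor series."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def gc_after(p0, q, m_t):
    """GC content after propagating the row vector p0 with exp(q * m_t)."""
    P = mat_exp([[v * m_t for v in row] for row in q])
    p = [sum(p0[i] * P[i][j] for i in range(4)) for j in range(4)]
    return p[1] + p[2]

# Hypothetical rates q1..q6 and starting composition (A, C, G, T).
q1, q2, q3, q4, q5, q6 = 0.10, 0.05, 0.05, 0.20, 0.30, 0.40
q = strand_symmetric_matrix(q1, q2, q3, q4, q5, q6)
p0 = [0.15, 0.35, 0.25, 0.25]
m_t = 1.5   # assumed product of mean rate and elapsed time

k = q1 + q5 + q4 + q6          # decay constant of the reduced system
gc_eq = (q1 + q5) / k          # equilibrium GC content
gc0 = p0[1] + p0[2]
closed_form = gc_eq + (gc0 - gc_eq) * math.exp(-k * m_t)
numeric = gc_after(p0, q, m_t)
print(numeric, closed_form)
```

The agreement does not require the starting composition itself to be strand symmetric; only the rate matrix has to be.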
Calculation of Local Substitution Rates
All estimates of rates are based on the transitional
probability matrix P(t) on the left-hand side of equation
(5) which we fit to RepeatMasker-generated repeat data.
Low-complexity repeats were removed. We were then
left with M repeat families, where M 5 812 (human),
M 5 434 (dog), M 5 825 (chimpanzee), and M 5 779
(macaque).
The RepeatMasker data provide the reconstructed ancestor of each repeat family a (a 5 1, . . ., M) as well as
a pairwise alignment between each modern instance and
the family’s reconstructed ancestor, which was inserted into
the genome at time ta. Alignment positions involving
gaps are discarded. For each repeat family $a$ and each partition $c$, we determine from the alignments 1) a 4 × 4 matrix $k_{ac}$ counting the number of each substitution type and 2) a 4 × 1 vector $N_{ac}$ such that $N^i_{ac} = \sum_j k^{ij}_{ac}$, which is the number of nucleotides that were in state $i$ at time $t_a$. It then follows that $P_{ij,c}(t_a) = k^{ij}_{ac}/N^i_{ac}$ is the maximum-likelihood fit of $P$ based on the alignments.
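The counting step can be sketched as follows. This is a minimal illustration with a made-up toy alignment, and the function names are hypothetical. Gapped positions are discarded as described above; skipping any other non-ACGT symbol is an additional assumption made here.

```python
BASES = "ACGT"

def count_substitutions(ancestor, modern):
    """4 x 4 counts: row = ancestral base, column = modern base."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(ancestor.upper(), modern.upper()):
        if a in BASES and b in BASES:   # drops gaps and other symbols
            counts[BASES.index(a)][BASES.index(b)] += 1
    return counts

def transition_matrix(counts):
    """Row-normalize the counts: P[i][j] = counts[i][j] / row total."""
    P = []
    for row in counts:
        n = sum(row)
        P.append([c / n if n else 0.0 for c in row])
    return P

# Toy pairwise alignment (reconstructed ancestor vs. modern instance).
ancestor = "ACGTACGTAC-GT"
modern = "ACGTATGTACAGT"

counts = count_substitutions(ancestor, modern)
P = transition_matrix(counts)
for base, row in zip(BASES, P):
    print(base, [round(v, 3) for v in row])
```

In practice the counts would be accumulated over all instances of a repeat family within a partition before normalizing.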
These $P_c(t_a)$ matrices represent the input which we want to use to estimate local substitution rates. We have 2 alternatives: the method of LogDet time distance, equation (4), and a method based on the matrix in equation (7) and the approximation in equation (6) (von Grünberg et al. 2004). The LogDet time connects the transitional probability matrix $P(t)$ for a transition over time $t$ on the left-hand side of equation (5) with the product $d$ of the time $t$ and the arithmetic mean rate $\mu$ of $\bar{R}$ on the right-hand side of equation (5). Recalling that $4\mu = \sum_{i,\,j \neq i} \bar{R}_{ij} = -\mathrm{Tr}\,\bar{R} = -\sum_i \lambda_i$, with $\lambda_i$ being the eigenvalues of $\bar{R}$, the LogDet formula in equation (4) for $d = \mu t$ results directly from the following consideration:

$$\det P(t) = \det e^{\bar{R}t} = e^{\sum_i \lambda_i t} = e^{-4d}. \tag{10}$$
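Solving equation (10) for the distance gives d = -(1/4) ln det P, which is easy to check on a case with a known answer. The sketch below is illustrative only: the function names are assumptions, and the test matrix is a hypothetical equal-rates process (every off-diagonal rate identical), for which the mean rate is three times the single rate.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def logdet_distance(P):
    """d = -(1/4) * ln(det P) for a 4 x 4 transition probability matrix."""
    return -0.25 * math.log(determinant(P))

# Hypothetical equal-rates process: closed-form transition probabilities.
rate, t = 0.01, 10.0
same = 0.25 + 0.75 * math.exp(-4 * rate * t)
diff = 0.25 - 0.25 * math.exp(-4 * rate * t)
P = [[same if i == j else diff for j in range(4)] for i in range(4)]

d = logdet_distance(P)
expected = 3 * rate * t   # mean rate (3 * rate) times elapsed time
print(d, expected)
```

No rate model is fitted here, which is the sense in which the LogDet distance is model free.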
Applied to our matrices $P_c(t_a)$, the LogDet formula, equation (4), provides us with time distances $d_{ac}$ for partition $c$ and family $a$, and, when performing a genome-wide analysis, with $d_a$ as the time distance from the moment when family $a$ was inserted into the genome. Averaging $(d_{ac} - d_a)/d_a$ over all repeat families, we obtain $\Delta d_c$, the percentage deviation of the local substitution rate in partition $c$ relative to the mean.
Whereas the LogDet time produces model-free time distances, the second method is based on the rate model in equation (7) and the approximation we made in going from equation (5) to equation (6) that $\overline{mq}$ can be split into $\bar{m}$ and a constant matrix $q$. Based on equation (6), we then 1) set $Z = 1$ and perform a genome-wide maximum-likelihood fit of the $M$ numbers $\bar{m}t_a$ and the 6 rates $q_1, \ldots, q_6$ in $q$ on the right-hand side of equation (6) to the matrices $P(t_a)$ on the left-hand side of equation (6) obtained from the repeat data as described above; 2) partition the genome and calculate a maximum-likelihood fit of $\bar{m}_c/\bar{m}$ and the rates $q^c_1, \ldots, q^c_6$ in the expression $\exp(q_c(\bar{m}_c/\bar{m})(\bar{m}t_a))$ on the right-hand side of equation (6) to our data in $P_c(t_a)$, taking the values of $\bar{m}t_a$ from the genome-wide fit in the previous step. Partitions were sized to include 40 kb of repeat bases per partition ($Z = 21{,}000$), resulting in a typical window size of 160 kb at highest resolution. Our rates were then filtered for noise using a technique discussed in Peifer et al. (2003) and averaged over all $M$ repeat families. We report relative rates:

$$s_c = \frac{1}{M}\sum_a \frac{\bar{m}_c t_a - \bar{m} t_a}{\bar{m} t_a}, \tag{11}$$

which can be compared directly with the LogDet relative rates $\Delta d_c$ as done in figure 1b. An alternative way of testing our computations is to compare them with the results of more established methods. For a few repeat families, we compare our age estimates $\bar{m}t_a$ against those computed by Khan et al. (2006) in figure 6. An example of the substitution rate patterns obtained from our method is shown in figure 7 and compared with the tAR statistic (Hardison et al. 2003). After normalizing tAR to reflect variation and applying the same filtering technique, we find a close correlation between the 2 ($r_p = 0.76$). As a further test of our method, we performed a cross-validation to make sure that although the curves are derived from repeat family data, the results are independent of the actual subset of families chosen for the analysis. For example, splitting all families into 2 groups (one group containing families with a long period of activity and a complementary group with families that have been active over a short period of time), we found that the choice of group had little effect on our final results (see Supplementary figure S1, Supplementary Material online). For more information on the validation of our method, see section A4 of the supplementary material (Supplementary Material online).

FIG. 6. Age of a few examples of transposable elements of the LINE type on the human genome. The time distance $\bar{m}_c t_a$ of repeat type $a$ as estimated with our method is plotted on the x axis and correlated on the y axis with the age of the same repeat family as estimated by Khan et al. (2006). The linear relation allows us to estimate $\bar{m}$ as $\bar{m} = 2.3 \times 10^{-3}$/Myr.

FIG. 7. A plot of the local substitution rate (black curve) and filtered tAR (light curve) over 4 sample chromosomes of the human genome. Breaks in the curves correspond to centromeric regions. The inset box is a whole-genome regression of the 2 values, resulting in a linear correlation value of $r_p = 0.76$.
Literature Cited
Alvarez-Valin F, Clay O, Cruveiller S, Bernardi G. 2004. Inaccurate
reconstruction of ancestral GC levels creates a ‘‘vanishing
isochores’’ effect. Mol Phylogenet Evol. 31:788–793.
Antezana MA. 2005. Mammalian GC content is very close to
mutational equilibrium. J Mol Evol. 61:834–836.
Arndt PF, Burge CB, Hwa T. 2003a. DNA sequence evolution with
neighbor-dependent mutation. J Comput Biol. 10:313–322.
Arndt PF, Hwa T. 2004. Regional and time-resolved mutation
patterns of the human genome. Bioinformatics. 20:1482–1485.
Arndt PF, Hwa T, Petrov DA. 2005. Substantial regional
variation in substitution rates in the human genome:
importance of GC content, gene density, and telomere-specific
effects. J Mol Evol. 60:748–763.
Arndt PF, Petrov DA, Hwa T. 2003b. Distinct changes of
genomic biases in nucleotide substitution at the time of
Mammalian radiation. Mol Biol Evol. 20:1887–1896.
Baake E, von Haeseler A. 1999. Distance measures in terms of
substitution processes. Theor Popul Biol. 55:166–175.
Barry D, Hartigan J. 1987. Asynchronous distance between
homologous DNA sequences. Biometrics. 43:261–276.
Belle EM, Duret L, Galtier N, Eyre-Walker A. 2004. The decline
of isochores in mammals: an assessment of the GC content
variation along the mammalian phylogeny. J Mol Evol.
58:653–660.
Bernardi G. 2000. Isochores and the evolutionary genomics of
vertebrates. Gene. 241:3–17.
Duret L. 2006. The GC content of primates and rodents genomes
is not at equilibrium: a reply to Antezana. J Mol Evol. 62:
803–806.
Duret L, Eyre-Walker A, Galtier N. 2006. A new perspective on
isochore evolution. Gene. 385:71–74.
Duret L, Mouchiroud D, Gautier C. 1995. Statistical analysis of
vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J Mol Evol. 40:308–317.
Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N. 2002.
Vanishing GC-rich isochores in mammalian genomes.
Genetics. 162:1837–1847.
Ellegren H, Smith NG, Webster MT. 2003. Mutation rate
variation in the mammalian genome. Curr Opin Genet Dev.
13:562–568.
Ewens WJ, Grant GR. 2005. Statistical methods in bioinformatics: an introduction. New York: Springer Science.
Frank AC, Lobry JR. 1999. Asymmetric substitution patterns:
a review of possible underlying mutational or selective
mechanisms. Gene. 238:65–77.
Gaffney DJ, Keightley PD. 2005. The scale of mutational
variation in the murid genome. Genome Res. 15:1086–94.
Galtier N, Duret L. 2007. Adaptation or biased gene conversion?
Extending the null hypothesis of molecular evolution. Trends
Genet. 23:273–277.
Galtier N, Gouy M. 1998. Inferring pattern and process:
maximum-likelihood implementation of a nonhomogeneous
model of DNA sequence evolution for phylogenetic analysis.
Mol Biol Evol. 15:871–879.
Galtier N, Piganeau G, Mouchiroud D, Duret L. 2001. GC-content evolution in mammalian genomes: the biased gene
conversion hypothesis. Genetics. 159:907–911.
Gibbs R, Weinstock G, Metzker M, Muzney D, Sondergren E,
et al. 2004. Genome sequence of the Brown Norway rat yields
insights into mammalian evolution. Nature. 428:493.
Green P, Ewing B, Miller W, Thomas PJ, Green ED. 2003.
Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet. 33:514–517.
Gu J, Li WH. 2006. Are GC-rich isochores vanishing in
mammals? Gene. 385:50–56.
Gu X, Li W. 1996. Bias-corrected paralinear and logdet distances
and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. Mol Biol Evol. 13:
1375–1383.
Hardison RC, Roskin KM, Yang S, et al. 2003. Covariation in
frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13:13–26.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J
Mol Evol. 22:160–174.
Hellmann I, Ebersberger I, Ptak SE, Paabo S, Przeworski M.
2003. A neutral explanation for the correlation of diversity
with recombination rates in humans. Am J Hum Genet.
72:1527–1535.
Hellmann I, Prufer K, Ji H, Zody MC, Paabo S, Ptak SE. 2005.
Why do human diversity levels vary at a megabase scale?
Genome Res. 15:1222–1231.
Jukes TH, Cantor CR. 1969. Evolution of protein molecules. In:
Munro, HN, editor. Mammalian Protein Metabolism. New
York: Academic Press. p. 21–132.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH,
Zahler AM, Haussler D. 2002. The human genome browser at
UCSC. Genome Res. 12:996–1006.
Khan H, Smit A, Boissinot S. 2006. Molecular evolution and
tempo of amplification of human LINE-1 retrotransposons
since the origin of primates. Genome Res. 16:78–87.
Khelifi A, Meunier J, Duret L, Mouchiroud D. 2006. GC content
evolution of the human and mouse genomes: insights from the
study of processed pseudogenes in regions of different
recombination rates. J Mol Evol. 62:745–752.
Kong A, Gudbjartsson DF, Sainz J, et al. 2002. A high-resolution
recombination map of the human genome. Nat Genet.
31:241–247.
Lander ES, Linton LM, Birren B, et al. 2001. Initial sequencing
and analysis of the human genome. Nature. 409:860–921.
Lercher MJ, Smith NG, Eyre-Walker A, Hurst LD. 2002. The
evolution of isochores: evidence from SNP frequency
distributions. Genetics. 162:1805–1810.
Lindblad-Toh K, Wade CM, Mikkelsen TS, et al. 2005. Genome
sequence, comparative analysis and haplotype structure of
the domestic dog. Nature. 438:803.
Lobry JR, Lobry C. 1999. Evolution of DNA base composition
under no-strand-bias conditions when the substitution rates
are not constant. Mol Biol Evol. 16:719–723.
Meunier J, Duret L. 2004. Recombination drives the evolution of
GC-content in the human genome. Mol Biol Evol.
21:984–990.
Mouchiroud D, D’Onofrio G, Aissani B, Macaya G, Gautier C,
Bernardi G. 1991. The distribution of genes in the human
genome. Gene. 100:181–187.
Nagylaki T. 1983. Evolution of a finite population under gene
conversion. Proc Natl Acad Sci USA. 80:6278–6281.
Peifer M, Timmer J, Voss H. 2003. Non-parametric identification
of non-linear oscillating systems. J Sound Vibration.
267:1157–1167.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L,
Smigielski EM, Sirotkin K. 2001. dbSNP: the NCBI database
of genetic variation. Nucleic Acids Res. 29:308–311.
Smit AF. 1999. Interspersed repeats and other mementos of
transposable elements in mammalian genomes. Curr Opin Genet
Dev. 9:657–663.
Sueoka N. 1995. Intrastrand parity rules of DNA base
composition and usage biases of synonymous codons. J Mol
Evol. 40:318–325.
Sumner J, Jarvis P. 2006. Using the tangle: a consistent
construction of phylogenetic distance matrices for quartets.
Math Biosciences. 204:49–67.
Taylor J, Tyekucheva S, Zody M, Chiaromonte F, Makova KD.
2006. Strong and weak male mutation bias at different sites in
the primate genomes: insights from the human-chimpanzee
comparison. Mol Biol Evol. 23:565–573.
Touchon M, Nicolay S, Audit B, Brodie of Brodie EB, d’Aubenton
Carafa Y, Arneodo A, Thermes C. 2005. Replication-associated
strand asymmetries in mammalian genomes: toward detection
of replication origins. Proc Natl Acad Sci USA. 102:9836–9841.
von Grünberg HH, Peifer M, Timmer J, Kollmann M. 2004.
Variations in substitution rate in human and mouse
genomes. Phys Rev Lett. 93:208102.
Waterston RH, Lindblad-Toh K, Birney E, et al. 2002. Initial
sequencing and comparative analysis of the mouse genome.
Nature. 420:520–562.
Webster MT, Smith NG, Ellegren H. 2003. Compositional
evolution of noncoding DNA in the human and chimpanzee
genomes. Mol Biol Evol. 20:278–286.
Webster MT, Smith NG, Hultin-Rosenberg L, Arndt PF,
Ellegren H. 2005. Male-driven biased gene conversion
governs the evolution of base composition in human alu
repeats. Mol Biol Evol. 22:1468–1474.
Zharkikh A. 1994. Estimation of evolutionary distances between nucleotide sequences. J Mol Evol. 39:315–329.
Manolo Gouy, Associate Editor
Accepted November 20, 2007