Exponential Decay of GC Content Detected by Strand-Symmetric Substitution Rates Influences the Evolution of Isochore Structure

J. E. Karro,*†‡1 M. Peifer,§1 R. C. Hardison,§ M. Kollmann,‖ and H. H. von Grünberg§

*Department of Computer Science and Systems Analysis, Miami University, Ohio; †Department of Microbiology, Miami University, Ohio; ‡Center for Comparative Genomics and Bioinformatics, Pennsylvania State University; §Institute of Chemistry, Karl-Franzens University Graz, Graz, Austria; and ‖Institute for Theoretical Biology, Humboldt University, Berlin, Germany

The distribution of guanine and cytosine nucleotides throughout a genome, or the GC content, is associated with numerous features in mammals; understanding the pattern and evolutionary history of GC content is crucial to our efforts to annotate the genome. The local GC content is decaying toward an equilibrium point, but the causes and rates of this decay, as well as the value of the equilibrium point, remain topics of debate. By comparing the results of 2 methods for estimating local substitution rates, we identify 620 Mb of the human genome in which the rates of the various types of nucleotide substitutions are the same on both strands. These strand-symmetric regions show an exponential decay of local GC content at a pace determined by local substitution rates. DNA segments subjected to higher rates experience disproportionately accelerated decay and are AT rich, whereas segments subjected to lower rates decay more slowly and are GC rich. Although we are unable to draw any conclusions about causal factors, the results support the hypothesis proposed by Khelifi A, Meunier J, Duret L, and Mouchiroud D (2006. GC content evolution of the human and mouse genomes: insights from the study of processed pseudogenes in regions of different recombination rates. J Mol Evol. 62:745–752.) that the isochore structure has been reshaped over time.
If rate variation were a determining factor, then the current isochore structure of mammalian genomes could result from the local differences in substitution rates. We predict that under current conditions strand-symmetric portions of the human genome will stabilize at an average GC content of 30% (considerably less than the current 42%), thus confirming that the human genome has not yet reached equilibrium.

Introduction

The (local) "GC content," or the percentage of GC base pairs in a given genomic region, is highly variable across many mammalian genomes (Bernardi 2000). It is associated with genomic features including gene density, intron length, replication timing, recombination rate, and the distribution of repeat elements (Mouchiroud et al. 1991; Duret et al. 1995; Smit 1999; Lander et al. 2001; Kong et al. 2002; Waterston et al. 2002). Investigating the causes of variation in GC content should improve our understanding not only of the GC structure of the genome but also of these associated features. Further, in studying the history of a genome, it is important to determine whether the distribution of GC content is changing over time. Has the GC content reached equilibrium or is it still evolving toward some target point of stability? What factors are holding it in check or driving the change?

It is well established that a genome is subject to a varying "neutral substitution rate," that is, the rate at which neutral (functionless) DNA undergoes changes that become fixed in the population. This rate varies considerably, differing between organisms, between sexes, and across the genome (Lander et al. 2001; Waterston et al. 2002; Ellegren et al. 2003; Hardison et al. 2003; Gibbs et al. 2004; von Grünberg et al. 2004; Arndt et al. 2005; Gaffney and Keightley 2005; Lindblad-Toh et al. 2005; Taylor et al. 2006).
Many of these studies have investigated the interdependence between variation in the (neutral) substitution rate and the variation in GC content, but simple explanations of causation have not emerged.

1 These authors contributed equally to this work.
Key words: GC content, isochore decay, neutral substitution rates, strand symmetry, genomic equilibrium.
E-mail: [email protected].
Mol. Biol. Evol. 25(2):362–374. 2008
doi:10.1093/molbev/msm261
Advance Access publication November 27, 2007

Various studies have also established an asymmetry in the rates at which different substitutions occur; GC base pairs become AT base pairs far more frequently than the reverse (Lander et al. 2001; Arndt et al. 2005). Using realistic estimates for these rates, the imbalance implies that the GC content at equilibrium will be considerably below the current GC content. Thus, one would predict from the rate asymmetry that GC content must be decreasing. It further seems likely that those areas subject to higher substitution rates experience faster GC-content decay. However, these predictions have not been easy to test, as calculating time-resolved substitution rates is a difficult task (Hardison et al. 2003; von Grünberg et al. 2004; Arndt et al. 2005; Gaffney and Keightley 2005).

Further complicating matters, local substitution rates are not the only factor determining the change in GC content and its eventual equilibrium. Both Meunier and Duret (2004) and Duret et al. (2006) argued that recombination plays a role in the decay of GC content, that the modern human genome has not yet reached its equilibrium point, and that the equilibrium point of any given genomic region is determined by the interplay of the substitution and recombination rates. Though there is some disagreement (Alvarez-Valin et al. 2004; Antezana 2005), several studies have supported these positions (Lercher et al. 2002; Webster et al. 2003; Belle et al. 2004; Duret 2006; Khelifi et al. 2006).
However, the rate of GC-content decay has not been determined, nor has the relative importance of the substitution and recombination rates in influencing that decay. How fast is GC content decaying in a given region? What is the importance of substitution, as opposed to recombination, in determining that rate?

© 2007 The Authors. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

The human genome currently has a genome-wide average GC content of approximately 43% (42% chimpanzee, 42% macaque, 41% dog). We have applied 2 independent methods for determining rates of substitutions across mammalian genomes to better understand the history and predict the future of these genomes. We limited our analysis to the 620 Mb of human DNA in which the rate for each type of substitution is the same on the complementary strand, regions referred to as strand symmetric (Green et al. 2003). This allowed us to apply a simple but powerful model for determining substitution rates. With these rates, we predict and confirm that the local GC content decays exponentially over time toward a local equilibrium value GC* at a rate determined by the local rate of substitution. We predict that these strand-symmetric areas will stabilize their average GC content at 30% (±4%) for each of these genomes, supporting the notion that none of these genomes are currently at equilibrium. Furthermore, the equilibrium point of a region is determined by the local rates of substitution and recombination, as well as by the Biased Gene Conversion (BGC) mechanism (Nagylaki 1983; Galtier et al. 2001; Duret et al. 2002), thus independently confirming the results of Meunier and Duret (2004).
Our studies agree with the prediction that each genome is evolving toward a new isochore structure (as proposed by Khelifi et al. 2006). They also suggest that the current isochore structure could result from the action of locally varying rates on an ancestral genome, regardless of that genome's GC-content pattern.

Methods and Materials

Whereas calculating the GC content of a sequenced genome is simple, estimating substitution rates and GC* is more difficult. Substitution rates are estimated using 2 different methods. The first is the "LogDet" method (Barry and Hartigan 1987; Gu and Li 1996; Baake and von Haeseler 1999; Sumner and Jarvis 2006); the second is based on continuous Markov chains and is similar to many of the well-studied models (e.g., Jukes and Cantor 1968; Hasegawa et al. 1985; Ewens and Grant 2005). Like those models, and in contrast to the well-known model of Arndt, Burge, and Hwa (2003), Arndt, Petrov, and Hwa (2003), and Arndt et al. (2005), we do not explicitly account for neighbor interactions affecting substitution rates (specifically elevated CpG dinucleotide substitution rates); in our Results section, we show that these rates are implicitly included and thus this parameter reduction is not a source of error.

Estimating neutral substitution rates is necessarily based on nucleotides known to be nonfunctional. There are 2 sources of such data frequently used: 4-fold degenerate coding sites and interspersed repeats (Hardison et al. 2003). We selected the repeat data because of its quantity. Repeats make up approximately 45% of the genome (providing significantly more points of estimation than do the 4-fold degenerate coding sites) and, as a whole, are uniformly distributed (Lander et al. 2001), allowing us to average out the effect of location bias. It is generally believed that the substitution rates of repeat bases are reflective of the local neutral substitution rate, though this is not completely agreed upon (Hardison et al.
2003; Arndt, Burge, and Hwa 2003; Gaffney and Keightley 2005). Further, instead of fitting our model parameters to interspecies alignments, we follow the lead of Arndt in using the alignment of modern interspersed repeat sequences to their RepeatMasker-derived ancestral sequences (Smit A, Hubley R, Green P, unpublished data). This allows us to discard the normal Markov chain constraint of reversibility used as a basis for many calculations. Like the Arndt studies, we also assume strand-symmetric substitution: that both strands of the genome segment undergo any given type of substitution at the same rate, hence complementary substitution rates are equal (e.g., the rate of A→C substitutions is equal to that of T→G substitutions). Strand symmetry is not a universal condition; it is clearly disrupted by selective pressure and has also been shown to break down in transcribed regions and around replication origins (Green et al. 2003; Touchon et al. 2005). We will shortly describe a technique for identifying where this assumption is valid (or is a valid approximation), which we then use to identify regions where our predictions can be tested.

Notation

Like many studies, we will break the genome into nonoverlapping windows, or partitions (Hardison et al. 2003; Gaffney and Keightley 2005). The (instantaneous state transition) rate matrix defining a given Markov chain will be denoted by $q$, and we use $q_c$ to denote the rate matrix that has been estimated exclusively from the partition $c$ (i.e., from the repeats falling within $c$). $\bar{m}_c$ will denote the mean substitution rate of $c$ (averaged over the time span under investigation), and $p_c^{GC}(t)$ will denote the partition's GC content at time $t$.

Any method based on a Markov chain imposes certain constraints on $q$. For example, Jukes and Cantor (1968) require that all off-diagonal elements be equal, Hasegawa et al.
(1985) group the substitution types into 4 different parameter classes, and the fully reversible model allows for 6 parameters configured to enforce "full reversibility" (Ewens and Grant 2005). The constraints of our own model are fully defined by the strand-symmetric property. As a result, we have a 6-parameter model, that is, 6 independent rates appearing in the rate matrix (see Appendix, eq. 7, for the matrix layout). These 6 rates are labeled $q_1, \ldots, q_6$ (defined in table 1). When referring to the components of the matrix $q_c$, we will denote these rates by $q_1^c, \ldots, q_6^c$. This strand-symmetric Markov chain was first proposed by Sueoka (1995) and later studied by Lobry and Lobry (1999).

Estimating GC-Content Equilibrium (GC*)

The importance of limiting our analysis to a strand-symmetric $q$ is as follows. We derive a value $k_c$ from $q_c$: specifically, $k_c$ is the sum of the rates of all substitutions between an AT base and a GC base in either direction (see eq. 9 in the Appendix). It is shown in Lobry and Lobry (1999) that if (and only if) the matrix $q_c$ is strand symmetric and time independent, we have

$$p_c^{GC}(t+T) - GC_c^* = \left(p_c^{GC}(t) - GC_c^*\right) e^{-k_c \bar{m}_c T}, \qquad (1)$$

where

$$GC_c^* = \frac{q_1^c + q_5^c}{q_1^c + q_5^c + q_4^c + q_6^c}. \qquad (2)$$

Table 1
Values of the 6 Independent Components q1, ..., q6 of the q Matrix Obtained from a Genome-Wide Analysis of the Genomes of Human (hg18), Chimpanzee (pt2), Dog (cf2), and Macaque (rm2) (for details see Appendix)

Rate  Substitutions  Human                 Chimpanzee            Dog                   Macaque
q1    A→C, T→G       0.132 (0.14) ± 0.003  0.128 (0.14) ± 0.005  0.141 (0.15) ± 0.004  0.130 (0.14) ± 0.004
q2    A→T, T→A       0.130 (0.13) ± 0.004  0.131 (0.14) ± 0.005  0.134 (0.14) ± 0.005  0.129 (0.13) ± 0.005
q3    C→G, G→C       0.129 (0.13) ± 0.004  0.130 (0.13) ± 0.005  0.131 (0.13) ± 0.005  0.130 (0.13) ± 0.005
q4    C→A, G→T       0.186 (0.19) ± 0.004  0.185 (0.19) ± 0.005  0.182 (0.19) ± 0.005  0.183 (0.19) ± 0.005
q5    A→G, T→C       0.434 (0.47) ± 0.005  0.435 (0.47) ± 0.007  0.443 (0.46) ± 0.007  0.436 (0.46) ± 0.007
q6    G→A, C→T       0.989 (0.93) ± 0.008  0.990 (0.93) ± 0.011  0.968 (0.93) ± 0.011  0.992 (0.93) ± 0.011

NOTE: Values in parentheses are obtained after masking out CpG sites before performing our analysis. Values are normalized such that $\frac{1}{4}\sum_{i,\, j \neq i} q_{ij} = 1$, implying that $\sum_i q_i = 2$.

Consider the implications of the first equation. If $c$ is subject to strand-symmetric substitution (and not affected by issues such as elevated CpG mutation rates and recombination, which we will deal with shortly), then equation (1) means that the GC content of $c$ must be decaying exponentially with time, at a rate dictated by the product of $k_c$ and the local substitution rate $\bar{m}_c$. Further, the local GC content is converging to the value $GC_c^*$. Thus, $GC_c^*$ is the GC content in partition $c$ that the genome will eventually reach at equilibrium (with notation GC* chosen to be consistent with that used in Meunier and Duret 2004). We note that $p_c^{GC}(t+T) - GC_c^*$ can be thought of as the "excess GC content" contained by partition $c$ on its way to equilibrium. One would like to test the prediction by picking 2 points in time and plugging the relevant values into equation (1).
However, we know the GC content for partition $c$ at only one point in time: "now." In order to reflect this, we calibrate time such that $t = 0$ refers to the present, with $t < 0$ then representing points in the past, and rewrite equation (1) as

$$p_c^{GC}(0) - GC_c^* = \left(p_c^{GC}(-T) - GC_c^*\right) e^{-k_c \bar{m}_c T}, \qquad (3)$$

with $p_c^{GC}(0)$ (denoting the current GC content) appearing on the left and $p_c^{GC}(-T)$ (denoting the GC content $T$ time units in the past) occurring on the right. In the derivation of this equation, we are assuming both strand symmetry and time independence of the rate matrix; the model cannot be reliably used for analysis unless we can verify that the target region conforms to these assumptions. But in those regions where this is the case, it follows that the local GC content is decaying over time toward the value $GC_c^*$ at an exponential rate determined by $k_c \bar{m}_c$.

Equation (3) describes the interdependence of 3 $c$-dependent (i.e., locally dependent) functions: the local GC equilibrium value $GC_c^*$, the local GC content $p_c^{GC}(0)$, and the product of the local substitution rate $\bar{m}_c$ and $k_c$. The fourth local function appearing in the equation, $p_c^{GC}(-T)$, is not accessible and will produce statistical noise. So equation (3) describes a curve in the 3-dimensional space spanned by $p_c^{GC}(0)$, $k_c \bar{m}_c$, and $GC_c^*$. To visualize this curve, we will have to consider a projection of it onto a plane that is spanned by just 2 of these 3 local quantities.

Models of Substitution

We estimated substitution rates using 2 independent methods: a Markov chain-based method and the LogDet method (Barry and Hartigan 1987; Gu and Li 1996; Baake and von Haeseler 1999; Sumner and Jarvis 2006). By comparing the estimations derived from each method, we are able to both identify strand-symmetric regions of the genome and verify that our results on these regions are independent of certain differing assumptions.

Our Markov chain-based method, which assumes strand symmetry, estimates the rates $q_1^c, \ldots, q_6^c$ in $q_c$ and the mean local substitution rate $\bar{m}_c$ from the repeat alignments falling within partition $c$. We use standard fitting techniques to estimate the state-transition probability matrix $P_c(t)$ (see eq. 6 in the Appendix). Here, $[P_c(t)]_{ij}$ denotes the probability of a base with content $i$ at time $t_0$ becoming a base with content $j$ at time $t_0 + t$. The problem is more complex than with the more common applications of Markov chain theories (e.g., as used in Hardison et al. 2003 or Gaffney and Keightley 2005) because we are dealing with alignments corresponding to repeat families of different ages. Thus, we will denote a given repeat family as $a$ and the age of that family as $t_a$, and can then estimate $P_c(t_a)$ for different families $a$ that intersect partition $c$. Given a set of alignments corresponding to repeat family $a$, we estimate both $q_1^c, \ldots, q_6^c$ and the mean local substitution rate $\bar{m}_c$ from a maximum-likelihood fit to the matrices $P_c(t_a)$.

The final step in the application of our model is to define and calculate a new value $s_c = \bar{m}_c - \bar{m}$, the deviation of $c$'s substitution rate from the genome mean substitution rate $\bar{m}$. (Note that whenever we are interested in genome-wide values, we allow for just one partition comprising the whole genome; the index $c$ is then dispensable and has been omitted.) Of course, when applying this locally, the procedure is dependent on choosing an appropriate resolution, as reflected in the partition size. We experimented with resolutions ranging from 160-kb windows to 1-Mb windows. The accumulated length of all repeats used for our analysis ranged from 20% to 30% of the genome, depending on the organism. Examples of the resulting local substitution rate variations are shown and discussed in the Appendix.

In contrast to the previously described approach, LogDet makes no assumptions about strand symmetry and places no constraints on the rate matrix.
Let $R_c(t)$ be the rate matrix at time $t$ and let $\bar{R}_c$ be the average of the $R_c(t)$ matrices over the appropriate time interval. The LogDet model is based directly on the matrix $\bar{R}_c$, hence makes no assumptions about the changes in $R_c(t)$ or about the relationship between elements of $\bar{R}_c$. Let $l_c$ be the arithmetic mean rate based on $\bar{R}_c$, which can be written as $l_c = -\sum_{i=1}^{4} \bar{R}_c^{ii}/4 = \sum_{i,\, j \neq i} \bar{R}_c^{ij}/4$. Barry and Hartigan (1987) and Zharkikh (1994) each showed that there is a direct connection between the product $l_c t$ and the transition probability matrix $P(t)$, which when applied to our $P_c(t_a)$ may be written as

$$d_{ac} = l_c t_a = -\frac{1}{4} \ln \det P_c(t_a) \qquad (4)$$

(see eq. 6 in Gu and Li 1996). This LogDet value $d_{ac}$ is a "model-free" measure for the time distance $t_a$ in partition $c$ (i.e., it avoids the constraining assumptions of other models, as discussed previously). We are able to compute LogDet times genome-wide, $d_a$, and in every partition, $d_{ac}$, and can then combine both sets of numbers by computing the percentage deviation in partition $c$ from the genome-wide mean, $\Delta d_c = \langle (d_{ac} - d_a)/d_a \rangle$ (where $\langle \cdot \rangle$ stands for an average over all $a$'s).

Materials

Genome builds used were downloaded from the University of California, Santa Cruz browser (Kent et al. 2002): hg18 (human), cf2 (dog), pt2 (chimpanzee), and rm2 (macaque). Human repeat information was extracted by RepeatMasker v. 3.1.2 and RM database version 20051025. Chimpanzee repeat information was extracted by RepeatMasker v. 3.1.3, RM database version 20060120. All software tools created for this project will be provided to interested parties on request.

Statistics

Linear regressions are characterized with Pearson's moment correlation coefficient (denoted by $r_p$) at a P value of less than $10^{-8}$. Each correlation passed all standard diagnostic tests to ensure that $r_p$ is a legitimate characterization of the fit (results not shown).
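
Equations (1), (2), and (4) can be cross-checked numerically. The following is a minimal Python sketch, not the software used in this study. It assumes a particular layout of the strand-symmetric rate matrix (rows are the starting base, columns the resulting base, in the order A, C, G, T; the layout of eq. 7 in the Appendix is not reproduced here), takes the genome-wide human rates from table 1, and uses a made-up elapsed time and starting base composition. All function and variable names are illustrative.

```python
import math

BASES = "ACGT"

def strand_symmetric_q(q1, q2, q3, q4, q5, q6):
    """Build a 4x4 rate matrix in which complementary substitutions share a rate.

    Assumed layout: rows = starting base, columns = resulting base, order ACGT.
    """
    rates = {
        ("A", "C"): q1, ("T", "G"): q1,  # AT -> GC transversions
        ("A", "T"): q2, ("T", "A"): q2,  # AT <-> TA transversions
        ("C", "G"): q3, ("G", "C"): q3,  # CG <-> GC transversions
        ("C", "A"): q4, ("G", "T"): q4,  # GC -> AT transversions
        ("A", "G"): q5, ("T", "C"): q5,  # AT -> GC transitions
        ("G", "A"): q6, ("C", "T"): q6,  # GC -> AT transitions
    }
    q = [[rates.get((a, b), 0.0) for b in BASES] for a in BASES]
    for i in range(4):
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q

def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(m, t, squarings=10, terms=12):
    """exp(m * t) via a truncated Taylor series with scaling and squaring."""
    n = len(m)
    scaled = [[v * t / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms + 1):
        term = [[v / order for v in row] for row in matmul(term, scaled)]
        result = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def det(m):
    """Determinant by Laplace expansion (adequate for a 4x4 matrix)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * v * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j, v in enumerate(m[0]))

HUMAN = (0.132, 0.130, 0.129, 0.186, 0.434, 0.989)  # table 1, unmasked values
q1, q2, q3, q4, q5, q6 = HUMAN
Q = strand_symmetric_q(*HUMAN)

k = q1 + q5 + q4 + q6      # summed AT <-> GC rates
gc_star = (q1 + q5) / k    # equation (2)
mean_rate = sum(Q[i][j] for i in range(4) for j in range(4) if i != j) / 4

t = 0.3                    # hypothetical elapsed time, in units of mean rate 1
P = expm(Q, t)
logdet = -0.25 * math.log(det(P))   # equation (4)

# Propagate a made-up base composition (A, C, G, T) and compare with equation (1).
start = [0.25, 0.30, 0.20, 0.25]
end = [sum(start[i] * P[i][j] for i in range(4)) for j in range(4)]
gc_start, gc_end = start[1] + start[2], end[1] + end[2]
predicted = gc_star + (gc_start - gc_star) * math.exp(-k * t)

print(f"GC* = {gc_star:.4f}, LogDet distance = {logdet:.6f}, mean rate * t = {mean_rate * t:.6f}")
print(f"GC after t: propagated = {gc_end:.6f}, equation (1) = {predicted:.6f}")
```

Because the rates in table 1 are normalized to a mean rate of 1, the LogDet distance computed here should coincide with the elapsed time, and the GC content propagated through the transition matrix should agree with the closed form of equation (1). With these genome-wide rates, equation (2) evaluates to (0.132 + 0.434)/1.741, or roughly 0.325.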
Results

Identifying Strand-Symmetric Regions

Our core prediction, equation (3), follows directly from our 2 assumptions: strand symmetry and the time independence of the rate matrix. To verify the prediction with genomic data, we must find regions on the genome that conform to both assumptions. We do this by comparing the local rates, as determined by our method, against the prediction of the previously discussed LogDet method, which relaxes both assumptions. Only in areas where the 2 predictive methods agree can we expect the assumptions to hold. It is worth noting that as both models estimate average rates over some time period, we are actually locating segments that experienced rates which "effectively" meet these assumptions over that period. Thus, we will identify both those segments that conformed to the assumptions uniformly and those that fail to do so at any instant but that experienced rate variations over time that produced the same effect.

As the LogDet estimations are independent of a specific model of substitution-rate relations, they are ideal for testing the assumptions of time constancy and strand symmetry of $q$ made in the first model. We show in the supplementary material (eq. S18, S19, and S23–S25, Supplementary Material online) that if these assumptions and approximations have no effect on the result, we should find that 1) for a genome-wide fit, the estimated $d_a$ from equation (4) must be equal to $\bar{m} t_a$ and that 2) for the partition-wide fits, $\Delta d_c$ must be equal to $s_c/\bar{m}$ obtained with the strand-symmetric model. Figure 1 shows both correlation diagrams for the human genome: $d_a$ versus $\bar{m} t_a$ for 812 different repeat families and $\Delta d_c$ versus $s_c/\bar{m}$ for the 21,000 partitions covering the genome (160-kb window). For the comparison $d_a$ versus $\bar{m} t_a$ in figure 1a, we obtain a (Pearson's) correlation coefficient as high as $r_p = 0.997$ (human), with 50% of all families showing a difference smaller than 2% to the LogDet times.
Equally good results were obtained for the other genomes; rp 5 0.997 (chimpanzee), rp 5 0.996 (dog), and rp 5 0.994 (macaque). This strong agreement between the predictions shows that our genome-wide assumptions, including strand symmetry, are valid. Apparently, the effect of regions where these assumptions do not hold averages out in such a genomewide analysis. When performing the same investigation on a local scale, a different picture emerges. For Ddc versus sc =m in figure 1b, we obtain a correlation coefficient of rp 5 0.92. Here, 50% of all partitions show a difference between that is larger than 64% of sc =m in that parDdc and sc =m tition. By means of this plot, we can now clearly identify those partitions where our model assumptions are justified. We have selected those regions where the difference be and Ddc is smaller than 5%. For the human getween sc =m nome, we discarded almost 75% of all 160-kb windows, leaving approximately 5,000 partitions representing 620 Mb of the entire human genome. Similarly, applying the same criterion to the corresponding figures for the other genomes, we obtained 560 Mb for the dog genome, 730 Mb for the macaque genome, and 680 Mb for the chimpanzee genome (see table 2). In the following, we will limit our analysis to only these partitions—where both methods produce approximately the same values (within 5%). The cor or LogDet relations hardly differ whether we show sc =m rates Ddc, hence the results are independent of the method used to compute the neutral substitution rate variation. 366 Karro et al. elements of q on a global scale, giving us a picture of the general trend of substitution rates similar to that in Lander et al. (2001). We observe that the transversion rate away from GC base pairs, q4, is higher than any other transversion rate and that the transition rate away from GC base pairs, q6, is twice the reverse rate q5 (A/G and T/C transitions) and 5 times a typical transversion rate. 
Also listed are results obtained after masking CpG sites on the consensus sequences of each repeat family, leading to a small shift of the rate q6. When we fit Pc(ta) in a specific partition c, we obtain c and the rates qc1 ; . . . ; qc6 that allow us to estimate the GC m equilibrium values GC*c (for those partitions conforming to our assumptions). Averaging over these partitions, we obtain an estimate for the genome-wide GC* that is representative for 620 Mb of the human genome (560–730 Mb for the other genomes). This GC* value is compared with the current GC content on today’s genome in table 2. We observe that for the investigated species, our model predicts that the equilibrium point GC* is about 30%, whereas the mean GC content on today’s genome is at about 42%, implying that these genomes are far away from the stationary state. We stress that locally GCc* (estimated with a variance of 0.04) shows considerable deviations from the genomewide value GC*. Exponential GC-Content Decay Verified FIG. 1.—(a) Correlation diagram of the human repeat family ages estimated with our method (y axis) and with the LogDet method (x axis). Although our estimate is based on a strand-symmetric and timeindependent substitution model, no such constraining assumption has to be made when using the LogDet method. (b) The relative local substitution in partition c computed with our method correlated against the rate sc =m corresponding rate estimated using the LogDet time measure (Ddc). A full list of identified strand-symmetric regions is included in the supplementary materials (Supplementary Material online). Genome-Wide, Strand-Symmetric GC Equilibrium Value Is at about 30% for Various Mammalian Genomes We next look at the results of fitting q both globally and locally. In table 1, we show the results of fitting the We are now in the position to determine the accuracy of prediction (3). pGC c ð0Þ, the modern GC content, is easily can be calculated by the apcalculable. 
GCc , kc, and sc =m plication of our model, with which we can determine the argument of the exponential function in (3) relative to c TÞ=ðmTÞ5k This kc ðm sc =mÞ. the time distance mT: c ð1 þ ð0Þ on the left-hand side of equation (3) allows us to plot pGC c versus the argument of the exponential function on the right-hand side of this equation, as we do in figure 2 for both human and dog. As we have already observed, equation (3) describes a curve in a space spanned by 3 local c ; and GCc . Thus, figure 2 can be functions: pGC c ð0Þ, kc m considered to be a projection of this 3-dimensional data c cloud onto a plane spanned alone by pGC c ð0Þ and kc m (i.e., onto a plane where GCc ð0Þ in equation (3) is fixed to some value). This should then result in an exponential c TÞ. Indeed, we relationship between pGC c ð0Þ and kc ðm Table 2 Data Collected for the Genomes of Human (hg18), Chimpanzee (pt2), Dog (cf2), and Macaque (rm2), followed by Human with Masked CpG Sites and for Human with a GC Content Determined from the Repeat-Masked Human Genome (hg18(bw)) GC(0) Size (Mb) GCs(0) GC* rp 95% confidence interval mT b Human Chimpanzee Dog Macaque Human (CpG) Human (bw) 0.427 620 0.400 0.30 0.85 (0.86, 0.84) 3.6 4.0 0.416 680 0.398 0.30 0.85 (0.86, 0.84) 3.6 4.0 0.412 560 0.397 0.31 0.82 (0.83, 0.81) 6.3 8.3 0.415 730 0.397 0.30 0.85 (0.86, 0.84) 3.5 3.7 0.30 ± 0.04 0.85 (0.86, 0.84) 3.6 4.0 0.81 (0.82, 0.80) 5.0 6.0 NOTE.—The mean GC content on today’s genome, GC(0), the total size of the genomic region selected in this study, the mean GC content GCs(0) over these selected regions, the equilibrium value of the GC content, GC*, as obtained from the average of GCc over all selected partitions, the correlation coefficient rp for the correlation in and b obtained from the linear regressions. figure 3 with their 95% confidence intervals, and the parameters mT Exponential Decay of GC Content 367 each axis are independent. 
The minimal change between the 2 figures shows this is not a problem. In contrast to the Arndt model of substitution rates (Arndt, Burge, and Hwa 2003; Arndt, Petrov, and Hwa 2003; Arndt et al. 2005), our model does not explicitly account for neighbor interaction; elevated CpG dinucleotide rates are instead reflected in the estimation of single-base substitution rates (e.g., compare the masked and unmasked estimates for q6 in table 1). The concern that our failure to explicitly consider Arndt’s seventh parameter may introduce error into our results is addressed in figure 3f, where we repeat the analysis leading to figure 3a after masking out CpG sites. We see from this figure, and from table 2, that calculating our values based only on non-CpG sites has virtually no effect on our calculations. Hence, it is clear that CpG hypermutability can be ruled out as a source of error. c TÞ for 5,000 FIG. 2.—GC content versus the time distance kc ðm (4000) partitions on the human (dog) genome, representing 620 Mb (560 c T is the quantity that appears as argument Mb) of the entire genome. kc m of the exponential function of the prediction (3). The local substitution rates m c have been computed by 2 independent methods; the solid lines are fits to an exponential function y5expðb ðmTÞxÞ with the fitting given in table 2. parameters b and mT see this to be the case in figure 2. We also note an apparent convergence value of GC content to the predicted GC* values from table 2, which suggests that we should examine excess GC content (i.e., the GC content relative to GC*). Taking the logarithm of the excess GC content, we expect to find that c T (see lnðpGC c ð0Þ GC Þ has a linear relationship to kc m eq. 3). We check this in figure 3 for 4 mammalian genomes. A very strong correlation is found, with coefficients as high as rp 5 0.85 for human, chimpanzee, and macaque and rp 5 0.82 for dog (table 2). 
We also show the results of a least squaresfit,providinguswithestimatesfortheparametersband in the fitting formula y5lnðpGC mT c ð0Þ GC Þ5b ðmTÞx, c TÞ=ðmTÞ. All values are given in table 2. where x5ðkc m Figure 3e is an alternative to figure 3a in which we do not use repeats (the basis for our calculation of distance) in the calculation of GC content to ensure that the calculations on Including Recombination Rates In table 1 we see that q1 þ q5 gives the substitution rate of AT base pairs to GC base pairs, whereas q4 þ q6 gives the substitution rate of GC base pairs to AT base pairs. Our method provides us with estimates for these rates in every partition c. BGC, on the other hand, is known to have the effect of increasing the AT/GC substitution rate by an amount proportional to the local recombination rate (Meunier and Duret 2004; Duret 2006; Galtier and Duret 2007), which we will denote by qc. Having this in mind, c ðqc1 þ qc5 Þ are locally we may expect to find that the rates m increased by bqc (where b is some proportionality con c ðqc4 þ qc6 Þ, should stant), whereas the GC/AT rates, m be locally reduced by the same amount. Thus, we expect c ðqc4 þ qc6 Þ5c2 bqc c ðqc1 þ qc5 Þ5c1 þ bqc and m that m with 2 constants c1 and c2 that are not material to our argument. From these relations, it follows that the difference c ðqc1 þ qc5 ðqc4 þ qc6 ÞÞ should be equal to c1 – c2 þ 2bqc. m c ðqc1 þ qc5 ðqc4 þ qc6 ÞÞ against qc Figure 4 correlates m (using sex-averaged deCODE recombination rates from Kong et al. 2002, resolved here with 1 Mb windows). As a result, we see a small, but statistically significant, c T on the genomes of human, chimpanzee, macaque, and dog; (e) shows FIG. 3.—(a–d) A logarithmic plot of the excess GC content versus kc m c T, with GC content now computed from genome regions between the repeats; (f) are the human again the GC content on the human genome versus kc m genome data with all CpG sites on the consensus sequences being blocked out. 
correlation with $r_p = 0.33$ (contained by the 95% confidence interval [0.29, 0.38]). We have also correlated the recombination rate with $\bar{m}_c(q_1^c + q_5^c + q_4^c + q_6^c)$, which is just $\bar{m}_c k_c$ in equation (9), and found no significant correlation. Meunier and Duret (2004) have found a similar correlation between $\mathrm{GC}^*_c$ and the recombination rate. As observed by these authors, there are a number of reasons why we cannot expect this correlation to be large. Recombination appears to vary on a scale much smaller than 1 Mb, and thus its rate averages out over scales greater than 1 Mb; moreover, the 2 variables correlated here reflect processes operating on different time scales. Although recombination rates may change rapidly, $\mathrm{GC}^*_c$ can be traced back to rates that are actually time averages over long evolutionary periods. Additionally, in our study, the genomic segments under analysis were picked based on strand symmetry and time constancy. We have no reason to believe that this characteristic is correlated with recombination; thus, our plot likely superimposes partitions supporting a strong correlation on others supporting no correlation at all.

FIG. 4. Local difference between the AT→GC and GC→AT substitution rates versus the recombination rate on the human genome (1-Mb windows). For recombination rates, we used the sex-averaged deCODE rates from Kong et al. (2002).

Asymmetric Regions of the Genome

Our approach allows us to easily distinguish between parts of the genome that follow a strand-symmetric, time-independent substitution rate model and those that do not. In figure 5, we graph GC content (on the x axis) against our substitution rate estimator (on the y axis) for both our selected 620 Mb of the human genome (black points, lighter fit) and the complementary set (dark dots, dark fit).
It is evident that by selecting strand-symmetric areas we have been able to discard data that grossly deviate from the exponential decay curve, but it also becomes obvious that we have discarded a bulk of data that were compatible with our picture (the majority of red data points are covered by the black data points).

FIG. 5. The exponential decay curve of figure 2 of the human genome, now mirrored at the bisecting line (black symbols), with the green solid curve being the exponential fit in figure 2. The data of figure 2 were a selection of 5,000 partitions for which 2 methods predicted similar rates. The data for the discarded 16,000 partitions are shown as red symbols. Both sets together cover the whole genome. We observe that we discarded both data that are compatible with our picture and data that are not. The latter show an upward bend, particularly at high GC content, which other groups have fitted to a parabolic function (dark) (see text).

Discussion

In our investigation, we have concentrated on regions of the genome that we know to be subjected to strand-symmetric substitution rates. Selective pressure will frequently produce substitution asymmetries (Frank and Lobry 1999). Nor is this disruption of symmetry limited to areas under selection; both transcribed areas and regions around replication origins are known to be strand asymmetric (Green et al. 2003; Touchon et al. 2005). By limiting our analysis to the strand-symmetric regions, we eliminate associated sources of noise obscuring the pattern of decay and are thus able to reveal the exponential relation between local GC-content decay and local (neutral) substitution rates. Thus, we form a picture of the decay dictated by the underlying neutral processes.
If the values on the x axis of figure 2 are interpreted as the relative decay rate ($k_c \bar{m}_c/\bar{m}$) of the exponential decay, then we can conclude that local GC content is decaying over the genome, with genomic areas subjected to higher decay rates (i.e., experiencing a faster decay) having a disproportionately lower GC content than those subjected to lower decay rates. This implies that, regardless of the initial distribution of local GC content, even in the extreme case of no initial isochore structure, the exponential relationship of GC content to local substitution rate will inevitably lead to the establishment of some sort of isochore structure simply because of the regional differences in decay rates and the equilibrium points, a finding similar to that of Khelifi et al. (2006). The exponential relationship further supports the development of this structure, as these regional differences grow faster than they would under a linear relationship.

Alternatively, one could interpret the values on the x axis of figure 2 as time distances $k_c \bar{m}_c T$, such as those extracted from any Markov chain-based model (Ewens and Grant 2005). Then the message of figure 2 would be that the decay of GC content in partitions with a large time distance is more advanced than in those with a smaller distance.

Note finally that we would expect the correlation in figure 3 to be less than 1, as there is still the unknown local GC content $p_c^{GC}(T)$ at time T. In fact, $p_c^{GC}(T)$ is presumably different in every partition, inducing a vertical shift for every data point in figures 2 and 3. Surprisingly, these shifts are small enough that they do not obscure the exponential decay. Figure 2 allows us to roughly estimate an upper limit of $p_c^{GC}(T)$ by the vertical distance of each data point from the fitted curve.
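As an illustration only, the following Python sketch applies prediction (3) to a set of partitions that all start from the same GC content, and then recovers the decay parameters with the log-linear least-squares fit described for figure 3. Every numerical value (the equilibrium GC*, the starting GC content, the genome-wide value of $\bar{m}T$, and the range of relative decay rates) is a hypothetical choice rather than an estimate from our data, the helper names are illustrative, and the simulated data are noise free.

```python
import math
import random

def gc_after_decay(gc_initial, gc_star, time_distance):
    """Prediction (3): the excess over GC* shrinks by exp(-time_distance)."""
    return gc_star + (gc_initial - gc_star) * math.exp(-time_distance)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical values, for illustration only.
GC_STAR = 0.30      # equilibrium GC content GC*
GC_INITIAL = 0.55   # uniform GC content at time T (no isochore structure)
MEAN_MT = 0.8       # genome-wide mean rate times elapsed time

rng = random.Random(0)
# Relative decay rate x = k_c * m_c / m of each partition.
relative_rates = [rng.uniform(0.5, 2.0) for _ in range(200)]

# Present-day GC content: partitions with larger x have decayed further.
gc_now = [gc_after_decay(GC_INITIAL, GC_STAR, MEAN_MT * x) for x in relative_rates]

# Log-linear fit y = ln(GC - GC*) = b - (mT) x.
log_excess = [math.log(gc - GC_STAR) for gc in gc_now]
b, slope = fit_line(relative_rates, log_excess)

print(f"GC content now ranges from {min(gc_now):.3f} to {max(gc_now):.3f}")
print(f"fitted mT = {-slope:.3f}, fitted initial excess = {math.exp(b):.3f}")
```

Although every partition starts from the same GC content, the present-day values spread out according to the local decay rate, and the fit returns the assumed $\bar{m}T$ and the initial excess. With real data, the unknown $p_c^{GC}(T)$ of each partition would add scatter around the fitted line.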
Comparisons with Previous Studies The shape of our correlation is different than that found in previous studies, but it is not contradictory; our analysis just presents a clearer picture as we are able to strip away a source of noise inherent in the other studies. Consider the works by Hardison et al. (2003), Hellmann et al. (2003, 2005), and Belle et al. (2004). In the study by Hardison et al., we see a quadratic relation between GC content and substitution rate, where local substitution rate is calculated using the multispecies, repeat-based tAR statistic. The latter study by Hellmann et al. finds a similar result (measuring substitution rates with human–chimpanzee divergence) but reduces this to a linearly decreasing correlation when introducing CpG content as a second variable in their regression model. In figure 5, we explain both these results with our model. We see that when using the regions that are not strand symmetric (or not identifiable as such), we replicate the shape of the Hardison curve. Concentrating on the strand-symmetric regions, we find the negative correlation predicted by Hellmann—but are able to better resolve the shape of the curve. In the Belle paper, the authors point out that an exponential decay would make sense (an observation also made by Gu and Li 2006), but they find the explanation incompatible with their data (calculated using the method of Galtier and Gouy 1998). However, their results reflect factors not relevant to our analysis as they base their rate estimations on the alignment of coding sequences, which are not subject to strand-symmetric substitution and are shaped by constraint-related forces. The question of whether the genome has reached its GC-content equilibrium has been addressed in a number of studies. The works of Lercher et al. (2002) and Webster et al. (2003) both support our assertion that the genome is in a state of decay through the analysis of SNP evidence, and the Webster et al. 
(2005) results are also consistent with this hypothesis. Both Alvarez-Valin et al. (2004) and Antezana (2005) have taken the opposite position, though the methodology of the latter is shown to be faulty in Duret (2006).

A line of studies including Duret et al. (2002), Meunier and Duret (2004), Khelifi et al. (2006), and Duret (2006) has made significant contributions to the understanding of GC-content decay, the projected equilibrium point, and the underlying causes of decay; it is important to consider how our results fit into the picture resulting from these works. Together, these studies have looked at the relationship between local current GC content and $\mathrm{GC}^*_c$, as well as the effect of recombination rates on these variables (presumably through the mechanism of BGC). They find pairwise correlations between recombination rate, $\mathrm{GC}^*_c$, and the GC content of c, and then argue a causal relationship starting from recombination and with $\mathrm{GC}^*_c$ as a mediator. Our analysis is consistent with their work, though it does not lead to any conclusions about causality.

Consider the formula equation (2) for the equilibrium GC content. This expression for GC* results directly from balancing net rates, $q_{15}\,\mathrm{AT}^* = q_{46}\,\mathrm{GC}^*$, where $q_{15} = q_1 + q_5$ is the AT→GC rate, $q_{46} = q_4 + q_6$ is the GC→AT rate, and $\mathrm{AT}^* = 1 - \mathrm{GC}^*$ is the equilibrium AT content. Solving for GC* gives $\mathrm{GC}^* = q_{15}/(q_{15} + q_{46})$. If $q_{46}$ were equal to $q_{15}$, then GC* would be 1/2. As $q_{46}$ increases, the balance shifts and GC* decreases. We find that in the genome-wide average, the GC→AT rate is roughly twice the reverse rate, implying that genome-wide GC* is roughly 1/3. Of course, this argument applies not only for the whole genome but also for every partition c: a ratio of $q_4^c + q_6^c$ to $q_1^c + q_5^c$ larger than one implies a local $\mathrm{GC}^*_c$ falling below one-half.

Following Duret et al. (2002) and Meunier and Duret (2004), we incorporate the BGC mechanism as follows. Consider a time and a location c where the recombination rate has reached a high enough level to make a difference.
We expect a GC-biased fixation process and thus an increase of the local AT→GC rate by an amount proportional to the local recombination rate, implying that the reverse rate GC→AT decreases by the same amount. The sum of these 2 rates, $q_1^c + q_5^c + q_4^c + q_6^c = k_c$, is not affected, but their difference is reduced, leading to an upward shift of the local $\mathrm{GC}^*_c$ back toward one-half. Thus, by changing the local rates, the BGC process will not alter the exponential decay behavior (which depends on $k_c$ and $\bar{m}_c$) but simply shifts the local $\mathrm{GC}^*_c$ upward to an extent that is directly proportional to the recombination rate. This effect is then responsible for the correlation between $\mathrm{GC}^*_c$ and the recombination rate found by Meunier and Duret (2004).

By the definition of an "equilibrium point," the GC content must be headed toward $\mathrm{GC}^*_c$. Over the 620 Mb of the human genome we have investigated, we have not found any partition where the GC content is currently less than $\mathrm{GC}^*_c$, and in the 14 Mb investigated by Meunier and Duret, they found only 2 Mb where this was the case. So, in practice, the local GC content in c is decreasing toward the lesser $\mathrm{GC}^*_c$ value. Although a high recombination rate is able to, through BGC, raise the local AT→GC rate over some period of time, this change can stop the decay of local GC only if it is strong enough to shift $\mathrm{GC}^*_c$ above the local GC-content level, an effect we do not see.

We recall that equation (3) describes the interdependence of 3 local functions: $\mathrm{GC}^*_c$, $p_c^{GC}(0)$ (local GC content), and the value $k_c \bar{m}_c$. When projecting the 3-dimensional plot onto a plane by setting any one of the variables to a constant, we see the association between the remaining 2 variables found by Duret. To this point, we have only studied the projection derived from setting $\mathrm{GC}^*_c$ to a constant, resulting in our exponential curves. The other projections are of interest in their own right but are quite complicated and beyond the scope of this study.
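The balance argument and the effect of BGC on the equilibrium point can be sketched numerically. The Python snippet below is a minimal illustration, assuming hypothetical rates in which the GC→AT rate is twice the AT→GC rate; the function names, the size of the BGC shift, and the local GC content are illustrative assumptions and not quantities estimated in this study.

```python
def equilibrium_gc(rate_at_to_gc, rate_gc_to_at):
    """GC* from the balance q15 * AT* = q46 * GC*, with AT* = 1 - GC*."""
    return rate_at_to_gc / (rate_at_to_gc + rate_gc_to_at)

def apply_bgc(rate_at_to_gc, rate_gc_to_at, shift):
    """Raise the AT->GC rate and lower the GC->AT rate by the same amount."""
    return rate_at_to_gc + shift, rate_gc_to_at - shift

# Hypothetical rates: GC->AT is twice AT->GC, as in the genome-wide average.
q15, q46 = 1.0, 2.0
gc_star = equilibrium_gc(q15, q46)

# Hypothetical BGC contribution (proportional to the local recombination rate).
bgc_shift = 0.3
q15_bgc, q46_bgc = apply_bgc(q15, q46, bgc_shift)
gc_star_bgc = equilibrium_gc(q15_bgc, q46_bgc)

decay_rate = q15 + q46               # k_c before BGC
decay_rate_bgc = q15_bgc + q46_bgc   # k_c after BGC, unchanged

local_gc = 0.45  # hypothetical current GC content of the partition
still_decaying = local_gc > gc_star_bgc

print(gc_star, gc_star_bgc, decay_rate, decay_rate_bgc, still_decaying)
```

In this example the equilibrium moves from 1/3 upward by the shift divided by $k_c$, the sum of the rates is unchanged, and the partition keeps decaying because its GC content still lies above the shifted equilibrium.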
Other Points of Consideration

We need to address our assumption of independent substitution rates in neighboring sites, specifically that of elevated CpG substitution rates, which are known to be significantly higher than other substitution rates (Arndt and Hwa 2004; Arndt et al. 2005). To investigate this effect, we have followed the common practice of masking out CpG sites (Meunier and Duret 2004; Gaffney and Keightley 2005; Gu and Li 2006; Taylor et al. 2006) and found no difference in our results, as illustrated by a comparison of figure 3a against figure 3f and in table 1, a result consistent with that of Gu and Li (2006). The idea that these rates would have no effect on our calculation seems unlikely in light of results such as those of the Arndt studies (Arndt, Burge, and Hwa 2003; Arndt, Petrov, and Hwa 2003; Arndt et al. 2005), so the more plausible conclusion is that the CpG effect is not compatible with one of the characteristics used to pick our partition set. In other words, if a partition conforms to our assumptions, then the CpG effect is minimal within that partition.

Our analysis has been extended to 3 other genomes: chimpanzee, macaque, and dog. We have obtained almost identical results for all, with differences only in $\bar{m}T$ (the slope of the red curve in fig. 3) for dog compared with the other 3 species. Notably missing from this analysis are the murids. Because mouse and rat are subjected to a much higher substitution rate (Waterston et al. 2002; Gibbs et al. 2004), RepeatMasker has considerably less power to recognize older repeats. For our purposes, the amount of sufficiently old repeat data is less than is needed to get a clear picture of the decay in those genomes.

We also investigated the possibility that SNPs occurring within repeats could introduce error into our calculations.
An SNP occurring in a repeat is reflective of mutation rate but not necessarily reflective of substitution rate; using SNPs to calculate substitution rates could bias the results. To check for this, we masked out all SNP locations, using information from dbSNP build 125 downloaded from the University of California, Santa Cruz (UCSC) Genome Browser (Sherry et al. 2001; Kent et al. 2002), and found no significant changes (data not shown).

Finally, we addressed whether our calculation of repeat substitution rates could be biased by the faulty reconstruction of ancestors by RepeatMasker. To check this, we repeated the experiment based on different subsets of repeat families, both manually and randomly chosen. Each analysis resulted in the same conclusions. (See Supplementary Material online for details.)

Our results show a clear, strong exponential correlation between the substitution rate pattern and GC content in strand-symmetric areas of the genome, which confirms a prediction that is a direct consequence of a strand-symmetric rate model. Further investigations of the relationship, with an aim toward determining causality, are certainly worthwhile. A study of our identified regions of strand symmetry (or, more appropriately, the complementary set of regions) may also prove valuable. And although we cannot label any factor as being causal, it is potentially revealing to consider equation (3) under a model in which the substitution rate drives GC-content decay. The location dependence of the substitution rate pattern then allows us to predict the evolution of the isochore structure: GC content decays faster toward $\mathrm{GC}^*_c$ in regions of high substitution rate and more slowly in those of low rate. This also implies that if the genome started from a uniform GC content (i.e., with no isochore structure), then the substitution rate pattern would inevitably lead to a local variation of the GC content.
However, this model is only speculation; we could also envision a model in which GC content filled the role of determining the local substitution rate.

Supplementary Material

Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

The authors would like to thank Svitlana Tyekucheva, Kateryna Makova, Adam Eyre-Walker, and Webb Miller for their helpful comments; Richard C. Burhans and Nathan Coraor for their technical support; and Laura Tabacca for editing. John Karro worked under the support of National Institutes of Health (NIH) grant 5K01HG003315. Martin Peifer received financial support from the Austrian Science Foundation (FWF) under project title P18762. Ross Hardison was supported by NIH grant 5R01DK065806. This project is funded, in part, under a grant with the Pennsylvania Department of Health using tobacco settlement funds. The department specifically disclaims responsibility for any analyses, interpretations, or conclusions.

Appendix

In the following, we give a brief description of our model, the approximations used in its application, and the tests performed to check the validity of those approximations. A detailed report is given in the supplementary materials (Supplementary Material online).

The Rate Model

Our underlying model is a nonhomogeneous Markov chain, similar to that used in other standard approaches. The 4-dimensional time-dependent state vector $(p^A(t), p^C(t), p^G(t), p^T(t))$, defining the probability of being in each state at time t, evolves according to a substitution rate matrix R(t) having components $R_{ij}(t)$. Recall that every possible R(t) must satisfy $\sum_j R_{ij} = 0$, which allows us to define a mean substitution rate $m$ by averaging over all components of this rate matrix: $m = -\sum_{i=1}^{4} R_{ii}/4 = \frac{1}{4}\sum_i \sum_{j \neq i} R_{ij}$. To make this $m$ explicit in the following expressions, we replace $R_{ij}$ by $m q_{ij}$ and scale the matrix q such that $\frac{1}{4}\sum_i \sum_{j \neq i} q_{ij} = 1$.
The transition probability matrix propagating a state vector at time $t_0$ to time $t_0 + t$ reads in its most general form

$$P(t) = \exp\left(\int_{t_0}^{t_0+t} R(t')\,dt'\right) = \exp(\bar{R}\,t) = \exp(\overline{mq}\,t), \qquad (5)$$

where an overline over a function f(t) represents its time average $\bar{f} = (1/t)\int_{t_0}^{t_0+t} f(t')\,dt'$. In our application, we cannot consider the most general case and instead approximate $\overline{m q_{ij}}$ by the product of $\bar{m}$ and $\bar{q}_{ij}$, with the constants $\bar{q}_{ij}$ obtained from a maximum-likelihood fit, as explained further below. Underlying this approximation is the assumption that though rates may be time dependent, the ratio of any 2 rates within the rate matrix does not change with time. This approximation is necessary in order to be able to perform fits; with the comparison in figure 1 to LogDet times (which do not depend on this approximation), we have ensured that we consider only those partitions where this approximation has a negligible effect (for a discussion of this approximation, see section A2 in Supplementary Material online).

To estimate rates on a local scale, we break the genome into Z partitions, representing each partition by an index c. The index is dropped whenever Z = 1, that is, when we want to indicate that a quantity is based on a genome-wide analysis. Using this partitioning notation and applying our approximation, equation (5) can be written as

$$P_c(t) = \exp(\bar{m}_c\,\bar{q}_c\,t). \qquad (6)$$

For $\bar{q}_c$ (the rate matrix estimated from partition c), we use a strand-symmetric rate model (Sueoka 1995; Lobry and Lobry 1999):

$$\bar{q}_c = \begin{pmatrix} -(q_1^c + q_5^c + q_2^c) & q_1^c & q_5^c & q_2^c \\ q_4^c & -(q_4^c + q_3^c + q_6^c) & q_3^c & q_6^c \\ q_6^c & q_3^c & -(q_6^c + q_3^c + q_4^c) & q_4^c \\ q_2^c & q_5^c & q_1^c & -(q_2^c + q_5^c + q_1^c) \end{pmatrix} \qquad (7)$$

with the 6 independent substitution rates $q_1^c, \ldots, q_6^c$ defined in table 1. Every strand-symmetric matrix can be transformed into a block-diagonal matrix consisting of two 2 × 2 blocks, $q^+$ and $q^-$ (Lobry and Lobry 1999), of which only

$$q_c^+ = \begin{pmatrix} -(q_1^c + q_5^c) & q_1^c + q_5^c \\ q_4^c + q_6^c & -(q_4^c + q_6^c) \end{pmatrix} \qquad (8)$$

is of interest here. It is the rate matrix for the 2-dimensional state vector $\vec{p}(t) = (p^{AT}(t), p^{GC}(t)) = (p^A(t) + p^T(t),\; p^G(t) + p^C(t))$ representing the A + T and G + C content at time t. Note that $q_c^+$ depends just on $q_1^c + q_5^c$ and $q_4^c + q_6^c$, which correspond to an A,T → G,C transition and a G,C → A,T transition, respectively (see table 1). The set of differential equations associated with this reduced 2 × 2 system, $\frac{d}{dt}\vec{p}(t) = \vec{p}(t)\,q^+$, can be integrated (see Lobry and Lobry 1999, as well as sections A5 and A6 in Supplementary Material online), leading directly to equation (2), to

$$k_c = q_1^c + q_5^c + q_4^c + q_6^c \qquad (9)$$

and finally to the prediction (3), which is the central equation of this paper. We emphasize that, if one applies the same transformation leading to equation (8) to a non-strand-symmetric rate matrix, one will find two 2 × 2 blocks that are still coupled, making it impossible to replace the set of 4 coupled differential equations by 2 independent sets of differential equations. Thus, prediction (3) is valid only as long as a strand-symmetric model is applicable.

Calculation of Local Substitution Rates

All estimates of rates are based on the transitional probability matrix P(t) on the left-hand side of equation (5), which we fit to RepeatMasker-generated repeat data. Low-complexity repeats were removed. We were then left with M repeat families, where M = 812 (human), M = 434 (dog), M = 825 (chimpanzee), and M = 779 (macaque). The RepeatMasker data provide the reconstructed ancestor of each repeat family a (a = 1, ..., M) as well as a pairwise alignment between each modern instance and the family's reconstructed ancestor, which was inserted into the genome at time $t_a$. Alignment positions involving gaps are discarded. For each repeat family a and each partition c, we determine from the alignments 1) a 4 × 4 matrix $k_{ac}$ counting the number of each substitution type and 2) a 4 × 1 vector $\vec{N}_{ac}$ such that $N_{ac}^i = \sum_j k_{ac}^{ij}$, which is the number of nucleotides that were in state i at time $t_a$. It then follows that $P_{ij,c}(t_a) = k_{ac}^{ij}/N_{ac}^i$ is the maximum-likelihood fit of P based on the alignments.

These $P_c(t_a)$ matrices represent the input that we want to use to estimate local substitution rates. We have 2 alternatives: the method of LogDet time distance, equation (4), and a method based on the matrix in equation (7) and the approximation in equation (6) (von Grünberg et al. 2004).

The LogDet time connects the transitional probability matrix P(t) for a transition over time t on the left-hand side of equation (5) with the product d of the time t and the arithmetic mean rate $l$ of $\bar{R}$ on the right-hand side of equation (5). Recalling that $4l = \sum_{i,\,j \neq i} \bar{R}_{ij} = -\mathrm{Tr}\,\bar{R} = -\sum_i k_i$, with $k_i$ being the eigenvalues of $\bar{R}$, the LogDet formula in equation (4) for $d = lt$ results directly from the following consideration:

$$\det P(t) = \det e^{\bar{R}t} = e^{\sum_i k_i t} = e^{-4d}. \qquad (10)$$

Applied to our matrices $P_c(t_a)$, the LogDet formula equation (4) provides us with time distances $d_{ac}$ for partition c and family a and, when performing a genome-wide analysis, with $d_a$ as the time distance from the moment when family a was inserted into the genome. Averaging $(d_{ac} - d_a)/d_a$ over all repeat families, we obtain $\Delta d_c$, the percentage deviation of the local substitution rate in partition c relative to the mean.

Whereas the LogDet time produces model-free time distances, the second method is based on the rate model in equation (7) and the approximation we made in going from equation (5) to equation (6) that $\overline{mq}$ can be split into $\bar{m}$ and a constant matrix $\bar{q}$. Based on equation (6), we then 1) set Z = 1 and perform a genome-wide maximum-likelihood fit of the M numbers $\bar{m}t_a$ and the 6 rates $q_1, \ldots, q_6$ in $\bar{q}$ on the right-hand side of equation (6) to the matrices $P(t_a)$ on the left-hand side of equation (6) obtained from the repeat data as described above; 2) partition the genome and calculate a maximum-likelihood fit of $\bar{m}_c/\bar{m}$ and the rates $q_1^c, \ldots, q_6^c$ in the expression $\exp(\bar{q}_c(\bar{m}_c/\bar{m})(\bar{m}t_a))$ on the right-hand side of equation (6) to our data in $P_c(t_a)$, taking the values of $\bar{m}t_a$ from the genome-wide fit in the previous step. Partitions were sized to include 40 kb of repeat bases per partition (Z = 21,000), resulting in a typical window size of 160 kb at highest resolution. Our rates were then filtered for noise using a technique discussed in Peifer et al. (2003) and averaged over all M repeat families. We report relative rates:

$$s_c = \frac{1}{M}\sum_a \frac{\bar{m}_c t_a - \bar{m}t_a}{\bar{m}t_a}, \qquad (11)$$

which can be compared directly with the LogDet relative rates $\Delta d_c$, as done in figure 1b.

An alternative way of testing our computations is to compare them with the results of more established methods. For a few repeat families, we compare our age estimates $\bar{m}t_a$ against those computed by Khan et al. (2006) in figure 6. An example of the substitution rate patterns obtained from our method is shown in figure 7 and compared with the $t_{AR}$ statistic (Hardison et al. 2003). After normalizing $t_{AR}$ to reflect variation and applying the same filtering technique, we find a close correlation between the 2 ($r_p = 0.76$).

FIG. 6. Age of a few examples of transposable elements of the LINE type on the human genome. The time distance $\bar{m}t_a$ of repeat type a as estimated with our method is plotted on the x axis and correlated on the y axis with the age of the same repeat family as estimated by Khan et al. (2006). The linear relation allows us to estimate $\bar{m}$ as $\bar{m} = 2.3 \times 10^{-3}$/Myr.
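As a rough numerical check of the relations used in this appendix, the following Python sketch builds the strand-symmetric rate matrix of equation (7) from hypothetical rates, computes the transition matrix as the matrix exponential of the rate matrix times an assumed time distance, and compares the result with the LogDet identity of equation (10) and with the exponential decay governed by the rate sum of equation (9). The rate values, the time distance, the starting composition, and the small pure-Python linear algebra helpers are assumptions made for illustration; this is not the fitting code used in the study.

```python
import math

def strand_symmetric_q(q1, q2, q3, q4, q5, q6):
    """Rate matrix of equation (7); states ordered A, C, G, T, rows sum to zero."""
    return [
        [-(q1 + q5 + q2), q1, q5, q2],
        [q4, -(q4 + q3 + q6), q3, q6],
        [q6, q3, -(q6 + q3 + q4), q4],
        [q2, q5, q1, -(q2 + q5 + q1)],
    ]

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(matrix, terms=20, squarings=8):
    """Matrix exponential: Taylor series on matrix / 2**squarings, then repeated squaring."""
    n = len(matrix)
    scaled = [[v / 2 ** squarings for v in row] for row in matrix]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms):
        term = [[v / order for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    value = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            value = -value
        value *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return value

def logdet_distance(p):
    """Time distance d = -(1/4) ln det P, as in equation (10)."""
    return -0.25 * math.log(det(p))

# Hypothetical rates q1..q6, chosen so that the mean rate of q is 1 (their sum is 2).
q1, q2, q3, q4, q5, q6 = 0.2, 0.2, 0.2, 0.3, 0.4, 0.7
q = strand_symmetric_q(q1, q2, q3, q4, q5, q6)
mt = 0.1  # assumed time distance (mean rate times elapsed time)
P = expm([[mt * v for v in row] for row in q])

d = logdet_distance(P)  # should reproduce mt

# GC content after the transition, from an arbitrary starting composition (A, C, G, T).
start = [0.2, 0.3, 0.3, 0.2]
end = [sum(start[i] * P[i][j] for i in range(4)) for j in range(4)]
gc_from_matrix = end[1] + end[2]

# Same quantity from the reduced 2 x 2 system: exponential decay toward GC*.
k = q1 + q5 + q4 + q6
gc_star = (q1 + q5) / k
gc_start = start[1] + start[2]
gc_predicted = gc_star + (gc_start - gc_star) * math.exp(-k * mt)
```

Because the rate matrix is scaled to a mean rate of 1, the LogDet distance `d` should agree with the assumed `mt`, and `gc_from_matrix` should agree with `gc_predicted` up to rounding, which is the content of prediction (3) for a strand-symmetric model.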
As a further test of our method, we performed a cross-validation to make sure that although the curves are derived from repeat family data, the results are independent of the actual subset of families chosen for the analysis. For example, splitting all families into 2 groups (one group containing families with a long period of activity and a complementary group with families that have been active over a short period of time), we found that whichever group we chose had little effect on our final results (see Supplementary figure S1, Supplementary Material online). For more information on the validation of our method, see section A4 of the supplementary material (Supplementary Material online).

FIG. 7. A plot of the local substitution rate (black curve) and filtered $t_{AR}$ (light curve) over 4 sample chromosomes of the human genome. Breaks in the curves correspond to centromeric regions. The inset box is a whole-genome regression of the 2 values, resulting in a linear correlation value of $r_p = 0.76$.

Literature Cited

Alvarez-Valin F, Clay O, Cruveiller S, Bernardi G. 2004. Inaccurate reconstruction of ancestral GC levels creates a "vanishing isochores" effect. Mol Phylogenet Evol. 31:788–793.
Antezana MA. 2005. Mammalian GC content is very close to mutational equilibrium. J Mol Evol. 61:834–836.
Arndt PF, Burge CB, Hwa T. 2003a. DNA sequence evolution with neighbor-dependent mutation. J Comput Biol. 10:313–322.
Arndt PF, Hwa T. 2004. Regional and time-resolved mutation patterns of the human genome. Bioinformatics. 20:1482–1485.
Arndt PF, Hwa T, Petrov DA. 2005. Substantial regional variation in substitution rates in the human genome: importance of GC content, gene density, and telomere-specific effects. J Mol Evol. 60:748–763.
Arndt PF, Petrov DA, Hwa T. 2003b. Distinct changes of genomic biases in nucleotide substitution at the time of Mammalian radiation. Mol Biol Evol. 20:1887–1896.
Baake E, von Haeseler A. 1999.
Distance measures in terms of substitution processes. Theor Popul Biol. 55:166–175.
Barry D, Hartigan J. 1987. Asynchronous distance between homologous DNA sequences. Biometrics. 43:261–276.
Belle EM, Duret L, Galtier N, Eyre-Walker A. 2004. The decline of isochores in mammals: an assessment of the GC content variation along the mammalian phylogeny. J Mol Evol. 58:653–660.
Bernardi G. 2000. Isochores and the evolutionary genomics of vertebrates. Gene. 241:3–17.
Duret L. 2006. The GC content of primates and rodents genomes is not at equilibrium: a reply to Antezana. J Mol Evol. 62:803–806.
Duret L, Eyre-Walker A, Galtier N. 2006. A new perspective on isochore evolution. Gene. 385:71–74.
Duret L, Mouchiroud D, Gautier C. 1995. Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J Mol Evol. 40:308–317.
Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N. 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics. 162:1837–1847.
Ellegren H, Smith NG, Webster MT. 2003. Mutation rate variation in the mammalian genome. Curr Opin Genet Dev. 13:562–568.
Ewens WJ, Grant GR. 2005. Statistical methods in bioinformatics: an introduction. New York: Springer Science.
Frank AC, Lobry JR. 1999. Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene. 238:65–77.
Gaffney DJ, Keightley PD. 2005. The scale of mutational variation in the murid genome. Genome Res. 15:1086–1094.
Galtier N, Duret L. 2007. Adaptation or biased gene conversion? Extending the null hypothesis of molecular evolution. Trends Genet. 23:273–277.
Galtier N, Gouy M. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol. 15:871–879.
Galtier N, Piganeau G, Mouchiroud D, Duret L. 2001. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics. 159:907–911.
Gibbs R, Weinstock G, Metzker M, Muzney D, Sondergren E, et al. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature. 428:493.
Green P, Ewing B, Miller W, Thomas PJ, Green ED. 2003. Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet. 33:514–517.
Gu J, Li WH. 2006. Are GC-rich isochores vanishing in mammals? Gene. 385:50–56.
Gu X, Li W. 1996. Bias-corrected paralinear and logdet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. Mol Biol Evol. 13:1375–1383.
Hardison RC, Roskin KM, Yang S, et al. 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13:13–26.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 22:160–174.
Hellmann I, Ebersberger I, Ptak SE, Paabo S, Przeworski M. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. Am J Hum Genet. 72:1527–1535.
Hellmann I, Prufer K, Ji H, Zody MC, Paabo S, Ptak SE. 2005. Why do human diversity levels vary at a megabase scale? Genome Res. 15:1222–1231.
Jukes TH, Cantor CR. 1969. Evolution of protein molecules. In: Munro HN, editor. Mammalian Protein Metabolism. New York: Academic Press. p. 21–132.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The human genome browser at UCSC. Genome Res. 12:996–1006.
Khan H, Smit A, Boissinot S. 2006. Molecular evolution and tempo of amplification of human LINE-1 retrotransposons since the origin of primates. Genome Res. 16:78–87.
Khelifi A, Meunier J, Duret L, Mouchiroud D. 2006. GC content evolution of the human and mouse genomes: insights from the study of processed pseudogenes in regions of different recombination rates. J Mol Evol. 62:745–752.
Kong A, Gudbjartsson DF, Sainz J, et al. 2002.
A high-resolution recombination map of the human genome. Nat Genet. 31:241–247.
Lander ES, Linton LM, Birren B, et al. 2001. Initial sequencing and analysis of the human genome. Nature. 409:860–921.
Lercher MJ, Smith NG, Eyre-Walker A, Hurst LD. 2002. The evolution of isochores: evidence from SNP frequency distributions. Genetics. 162:1805–1810.
Lindblad-Toh K, Wade CM, Mikkelsen TS, et al. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 438:803.
Lobry JR, Lobry C. 1999. Evolution of DNA base composition under no-strand-bias conditions when the substitution rates are not constant. Mol Biol Evol. 16:719–723.
Meunier J, Duret L. 2004. Recombination drives the evolution of GC-content in the human genome. Mol Biol Evol. 21:984–990.
Mouchiroud D, D'Onofrio G, Aissani B, Macaya G, Gautier C, Bernardi G. 1991. The distribution of genes in the human genome. Gene. 100:181–187.
Nagylaki T. 1983. Evolution of a finite population under gene conversion. Proc Natl Acad Sci USA. 80:6278–6281.
Peifer M, Timmer J, Voss H. 2003. Non-parametric identification of non-linear oscillating systems. J Sound Vibration. 267:1157–1167.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. 2001. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29:308–311.
Smit AF. 1999. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 9:657–663.
Sueoka N. 1995. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons. J Mol Evol. 40:318–325.
Sumner J, Jarvis P. 2006. Using the tangle: a consistent construction of phylogenetic distance matrices for quartets. Math Biosciences. 204:49–67.
Taylor J, Tyekucheva S, Zody M, Chiaromonte F, Makova KD. 2006. Strong and weak male mutation bias at different sites in the primate genomes: insights from the human-chimpanzee comparison. Mol Biol Evol. 23:565–573.
Touchon M, Nicolay S, Audit B, Brodie of Brodie EB, d'Aubenton Carafa Y, Arneodo A, Thermes C. 2005. Replication-associated strand asymmetries in mammalian genomes: toward detection of replication origins. Proc Natl Acad Sci USA. 102:9836–9841.
von Grünberg HH, Peifer M, Timmer J, Kollmann M. 2004. Variations in substitution rate in human and mouse genomes. Phys Rev Lett. 93:208102.
Waterston RH, Lindblad-Toh K, Birney E, et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature. 420:520–562.
Webster MT, Smith NG, Ellegren H. 2003. Compositional evolution of noncoding DNA in the human and chimpanzee genomes. Mol Biol Evol. 20:278–286.
Webster MT, Smith NG, Hultin-Rosenberg L, Arndt PF, Ellegren H. 2005. Male-driven biased gene conversion governs the evolution of base composition in human alu repeats. Mol Biol Evol. 22:1468–1474.
Zharkikh A. 1994. Estimation of evolutionary distances between nucleotide sequences. J Mol Evol. 39:315–329.

Manolo Gouy, Associate Editor
Accepted November 20, 2007