Effect of Strong Directional Selection on Weakly Selected Mutations
at Linked Sites: Implication for Synonymous Codon Usage
Yuseob Kim
Department of Biological Statistics and Computational Biology, Cornell University
The fixation of weakly selected mutations can be greatly influenced by strong directional selection at linked loci. Here, I
investigate a two-locus model in which weakly selected, reversible mutations occur at one locus and recurrent strong
directional selection occurs at the other locus. This model is analogous to selection on codon usage at synonymous sites
linked to nonsynonymous sites under strong directional selection. Two approximations obtained here describe the
expected frequency of the weakly selected preferred alleles at equilibrium. These approximations, as well as simulation
results, show that the level of codon bias declines with an increasing rate of substitution at the strongly selected locus, as
expected from the well-understood theory that selection at one locus reduces the efficacy of selection at linked loci.
These solutions are used to examine whether the negative correlation between codon bias and nonsynonymous
substitution rates recently observed in Drosophila can be explained by this hitchhiking effect. It is shown that this
observation can be reasonably well accounted for if a large fraction of the nonsynonymous substitutions on genes in the
data set are driven by strong directional selection.
Introduction
Recent studies suggested that the dynamics of
a weakly selected variant can be greatly influenced by
strong selection at closely linked loci (Barton 1995;
Gillespie 2001). With little recombination between loci,
the frequency change of a weakly selected allele is
determined by its initial association with the strongly
selected allele. For example, if a weakly selected mutation
occurs on a chromosome carrying a strongly beneficial
mutation, it will increase in frequency along with the
beneficial mutation, regardless of the direction of the weak
selection. It is analogous to the effect of directional
selection on linked neutral variants, or the ‘‘hitchhiking’’
effect (Maynard Smith and Haigh 1974). However,
whereas the hitchhiking effect does not change the average
rate of substitutions at neutral loci (Birky and Walsh
1988), it does change the rate of substitution at weakly
selected loci (Birky and Walsh 1988; Gillespie 2001). The
average fixation probability of weakly beneficial (deleterious) alleles is decreased (increased) by the hitchhiking
effect of strong selection at linked loci. These changes in
fixation probabilities are generally interpreted as a reduction of the efficacy of selection at one locus because of
selection at other linked sites, commonly referred to as
Hill-Robertson effects (Hill and Robertson 1966). The
fixation probability of a weakly beneficial mutation
affected by linked selection was studied by Barton
(1995) and by Gerrish and Lenski (1998), and that for
a weakly deleterious mutation with complete linkage was
studied by Gillespie (2001). The present study obtains
slightly different solutions for the fixation probability of
newly arising weakly selected alleles, either deleterious or
beneficial, linked to a locus under recurrent directional
selection. Analytic solutions given here are intended to
describe the dynamics of molecular evolution at synonymous sites in protein-coding sequences.

Present address: Department of Biology, University of Rochester, Rochester, New York.
Key words: codon bias, linkage, interference, hitchhiking, Drosophila.
E-mail: [email protected].
Mol. Biol. Evol. 21(2):286–294. 2004
DOI: 10.1093/molbev/msh020
Advance Access publication December 5, 2003
Molecular Biology and Evolution vol. 21 no. 2
© Society for Molecular Biology and Evolution 2004; all rights reserved.

A number of studies indicate that synonymous sites in
protein-coding sequences are under weak selection (Sharp
and Li 1987; Shields et al. 1988). For each amino acid,
a certain synonymous codon (‘‘optimal’’ or ‘‘preferred’’
codon) is used more frequently than others. It is believed
that preferred codons are selectively advantageous over
others because of the differences in translational efficiency
and accuracy among alternative codons. The strength of
selection on alternative codons was found to be very low;
that is, selection coefficients are estimated to be on the
order of 1/2Ne (Akashi 1995; Akashi and Schaeffer 1997;
McVean and Vieira 2001). Therefore, the dynamics of
synonymous substitutions may be greatly influenced by
moderate or strong directional selection at linked loci. The
hitchhiking effect will increase the fixation probability of
‘‘deleterious’’ unpreferred codons and decrease the
fixation probability of ‘‘beneficial’’ preferred codons.
Therefore, the level of codon bias should decrease with
increasing rate of positively selected substitutions at linked
sites. This prediction is strongly supported by recent
observations in Drosophila. Betancourt and Presgraves
(2002) analyzed 255 genes from D. melanogaster and D.
simulans (102 from GenBank and 153 from a D. simulans
male-specific EST screen) and found a strong negative
correlation between the frequency of optimal codon usage
and the rate of nonsynonymous substitutions (dN).
Assuming that dN is indicative of the rate of positive
selection, they concluded that the Hill-Robertson interference between selection acting on codon usage bias
and amino acid substitutions caused the decline of optimal
codon usage with increasing dN. However, there are
several alternative hypotheses (see Discussion) to explain
the reduction of codon bias where amino acid sequence is
not well conserved, which has been observed earlier
(Ticher and Grauer 1989; Akashi 1994). For example,
a relaxation of functional constraints may both elevate dN
and depress the frequency of optimal codons. Therefore, to
accept the hypothesis of interference as a satisfying
explanation for the observed correlation between codon
bias and dN, one needs to demonstrate that the observed
correlation can be obtained by reasonable parameter values
of directional selection in Drosophila species.
Model and Simulation Method
A two-locus model in which weak selection occurs at
one locus and strong selection at the other is assumed.
These loci are referred to as the ‘‘weak’’ locus and the
‘‘strong’’ locus throughout this paper. At the weak locus,
which models the synonymous site of a twofold degenerate codon, mutations occur from allele A to a with
rate μ10 and from a to A with rate μ01 each generation. The
relative fitness of A and a is given by 1 and 1 − sw,
respectively. At the strong locus, the wild-type allele
b mutates to the beneficial allele B with rate μs per
generation. The relative fitness of b and B is given by 1
and 1 + ss, respectively. If B is fixed, this allele becomes
the new wild-type allele. Therefore, immediately after
fixation, all copies of B revert to b, and, subsequently,
a new mutation from b to B can occur. To simplify the
model, a haploid population of 2N chromosomes is
assumed. Then the “weak” and “strong” selection refer
to the condition that sw = O(1/N) and ss ≫ sw. The
recombination rate between the two loci is given by r per
generation.
There are four possible haplotypes in this model: AB,
Ab, aB, and ab. The dynamics of the system is therefore
simply described by the changes of four haplotype
frequencies, which are x1, x2, x3, and x4, respectively.
The frequency change in each generation is generated by
a set of equations that determines the effect of selection,
recombination, and mutation in an infinite population. The
derivation of these equations is straightforward (Ewens
1979). The solution of these equations represents the pool
of gametes from which the next generation is produced
according to the Wright-Fisher model. The multinomial
sampling is simulated using the random binomial number
generator of Press et al. (1992). This approach correctly
simulated a population at mutation-selection-drift balance
(Kim and Stephan 2000). Simulations start with x1 = x3 =
0 and x2 = x4 = 0.5. An initial phase of 4N generations
allows the system to reach the mutation-selection-drift
equilibrium. During the second phase, which is 5 × 10⁹
generations long, the allele frequency of A (= x1 + x2) is
monitored. The frequency of the optimal codon usage (Fop
[see below]) is estimated by the average frequency of A
observed over the entire period. The number of fixation
events at the strong locus is also counted.
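
The simulation cycle described above can be sketched as follows. This is a minimal illustration rather than the program used for the results: the order of selection, recombination, and mutation within a generation, the multiplicative fitness of the aB haplotype, and the function names are assumptions, and multinomial sampling is done with Python's standard library instead of the binomial generator of Press et al. (1992).

```python
import random
from collections import Counter

def next_gamete_freqs(x, sw, ss, r, mu10, mu01, mus):
    """Deterministic change in haplotype frequencies x = [AB, Ab, aB, ab].

    Selection, then recombination, then mutation; this ordering and the
    multiplicative fitness of aB are assumptions of the sketch.
    """
    w = [1 + ss, 1.0, (1 - sw) * (1 + ss), 1 - sw]  # haplotype fitnesses
    wbar = sum(xi * wi for xi, wi in zip(x, w))
    AB, Ab, aB, ab = (xi * wi / wbar for xi, wi in zip(x, w))
    # Recombination removes a fraction r of the linkage disequilibrium D.
    D = AB * ab - Ab * aB
    AB, Ab, aB, ab = AB - r * D, Ab + r * D, aB + r * D, ab - r * D
    # Reversible mutation at the weak locus: A -> a (mu10), a -> A (mu01).
    AB, aB = AB * (1 - mu10) + aB * mu01, aB * (1 - mu01) + AB * mu10
    Ab, ab = Ab * (1 - mu10) + ab * mu01, ab * (1 - mu01) + Ab * mu10
    # One-way mutation at the strong locus: b -> B (mus).
    AB, Ab = AB + Ab * mus, Ab * (1 - mus)
    aB, ab = aB + ab * mus, ab * (1 - mus)
    return [AB, Ab, aB, ab]

def wright_fisher_generation(x, n_chrom, rng, **params):
    """Sample n_chrom gametes multinomially from the deterministic pool."""
    pool = next_gamete_freqs(x, **params)
    counts = Counter(rng.choices(range(4), weights=pool, k=n_chrom))
    x = [counts[i] / n_chrom for i in range(4)]
    fixed = counts[1] + counts[3] == 0  # no b-bearing chromosome is left
    if fixed:
        # B becomes the new wild type: relabel AB -> Ab and aB -> ab.
        x = [0.0, x[0], 0.0, x[2]]
    return x, fixed

# Illustrative run with a small population (values are not from the paper).
rng = random.Random(1)
x = [0.0, 0.5, 0.0, 0.5]
params = dict(sw=5e-4, ss=0.05, r=0.0, mu10=3e-5, mu01=1e-5, mus=1e-5)
freq_A_sum = 0.0
for _ in range(100):
    x, fixed = wright_fisher_generation(x, 1000, rng, **params)
    freq_A_sum += x[0] + x[1]
print("mean frequency of A:", freq_A_sum / 100)
```

Averaging the frequency of A over a long second phase, as in the loop above, is what gives the simulated estimate of Fop.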
Theory
A formula describing the frequency of the optimal
codon usage (Fop) is obtained considering a twofold
degenerate amino acid site. I assume a very low mutation
rate at the synonymous site (the weak locus in the model
above) such that the site is fixed with either the preferred
(A) or unpreferred (a) codon at any given time with high
probability. Fop is defined as the proportion of sites fixed
for A. Then, assuming that the flux of substitutions
between A and a is at equilibrium, it has been shown that
\[
E[F_{op}] = \frac{m_{01}}{m_{01} + m_{10}}, \tag{1}
\]
where m01 (m10) is the rate of substitution from allele a to A
(A to a) (Bulmer 1991; McVean and Charlesworth 1999).
The fixation probability of a weakly selected mutation
experiencing genetic drift with effective population size
Ne is given by
\[
u(s) = \frac{1 - \exp(-4 N_e s p_0)}{1 - \exp(-4 N_e s)} \tag{2}
\]
(Ewens 1979), where s is the selection coefficient, and
p0 = 1/(2N) is the initial frequency of the mutation.
Therefore, in the absence of interference from the strong
locus, the expectation of Fop is obtained by equation 1
using Ne = N, m01 = 2Nμ01u(sw), and m10 = 2Nμ10u(−sw).
With recurrent directional selection at the strong locus and
low recombination, fixation probabilities cannot be
obtained by equation 2. I therefore attempt to derive
substitution rates at the weak locus under this situation.
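
As a numerical illustration of equations 1 and 2 without interference, the following sketch evaluates the expected Fop. The function names are hypothetical, and the example parameter values are borrowed from the simulations (N = 10⁴, 2Nμ01 = 0.002, 2Nμ10 = 0.006) with 2Nsw = 1.

```python
import math

def fixation_prob(s, N, Ne=None):
    """Equation 2 for a new mutant at frequency p0 = 1/(2N)."""
    Ne = N if Ne is None else Ne
    p0 = 1.0 / (2 * N)
    if s == 0:
        return p0  # neutral limit of equation 2
    # expm1(x) = exp(x) - 1 keeps precision when 4*Ne*s*p0 is tiny.
    return math.expm1(-4 * Ne * s * p0) / math.expm1(-4 * Ne * s)

def expected_fop(N, sw, mu01, mu10, Ne=None):
    """Equation 1 with substitution rates m = 2N * mu * u(s)."""
    m01 = 2 * N * mu01 * fixation_prob(sw, N, Ne)    # a -> A, favored
    m10 = 2 * N * mu10 * fixation_prob(-sw, N, Ne)   # A -> a, disfavored
    return m01 / (m01 + m10)

N = 10**4
print(expected_fop(N, sw=1 / (2 * N), mu01=0.002 / (2 * N), mu10=0.006 / (2 * N)))
```

With these values the result is about 0.71, compared with 0.25 under mutation bias alone (sw = 0).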
First, no recombination is assumed between the weak
and the strong loci in the model described above. The fate
of a new mutant allele at the weak locus (a or A) depends
only on genetic drift and weak selection until a substitution
at the strong locus occurs. Assume that the strongly
selected allele, B, which eventually goes to fixation,
appears t generations after the mutation occurred at the
weak locus. This mutant at the weak locus may go to
fixation before B arises or, if it is not yet fixed at time t,
through genetic hitchhiking with B. In either case, B arises
on a chromosome carrying this mutant with probability p,
the frequency of the mutant at the time of mutation at the
strong locus (Gillespie 2001). Therefore, the fixation
probability of the weakly selected mutant conditional on
the waiting time t until B arises is given by
\[
\int_0^1 p\,u(p, t)\,dp = E_t[p],
\]
where u(p, t) is the frequency density of the mutant allele after
t generations of genetic drift and weak selection. Et[p], the
mean allele frequency, should satisfy E0[p] = p0 = 0.5/N
and E∞[p] = u(s) (Ne = N), where s is the selection
coefficient of the mutant (sw for A and −sw for a). One
may derive Et[p] using a well-known diffusion equation
\[
\frac{d}{dt} E_t[g(p)] = E_t\!\left[\mu(p, t)\,\frac{d}{dp} g(p) + \frac{1}{2}\,\sigma^2(p, t)\,\frac{d^2}{dp^2} g(p)\right],
\]
where g(p) is a function of p, and μ(p, t) and σ²(p, t) are
the infinitesimal drift and diffusion parameters (Stephan,
Wiehe, and Lenz 1992). Using g(p) = p, μ(p, t) = sp(1 − p), and σ²(p, t) = p(1 − p)/2N, we obtain
\[
\frac{d}{dt} E_t[p] = s\,E_t[p(1 - p)]. \tag{3}
\]
Therefore, to solve for Et[p], one needs the solution of
Et[p²]. As this moment expansion does not close, I instead assume that the average decay of heterozygosity, Et[2p(1 − p)], is not much different from that under neutral
evolution. This assumption is not unreasonable for small t,
because the behavior of a weakly selected allele still at low
frequency is similar to that of a neutral allele. Previous
studies of the neutral model showed that Et[p(1 − p)] =
p0(1 − p0)exp(−t/2N) (Ewens 1979). After using this
approximation for equation 3 and noting that 2Np0(1 − p0) ≈ 1, we obtain Et[p] ≈ p0 +
s(1 − exp(−t/2N)). For large t, this solution is slightly smaller
than u(s) with s ≈ 1/2N, but is significantly smaller than
u(s) for large t and s ≈ −1/2N. Therefore, the final approximation is given by
\[
E_t[p] \approx
\begin{cases}
p_0 + s\left(1 - e^{-t/2N}\right), & \text{if } s > 0 \\
\max\left[p_0 + s\left(1 - e^{-t/2N}\right),\, u(s)\right], & \text{if } s < 0.
\end{cases} \tag{4}
\]
Next, the recurrent fixation of B at the strong locus is
assumed to occur with rate k per generation. With small k
and strong selection (2Nss ≫ 1), substitutions at this locus
may be approximated as a Poisson process. Then the
waiting time until the hitchhiking event, t, is approximately
exponentially distributed. Therefore, the fixation probability
of a weakly selected allele is given by
\[
f(s, k) \approx \int_0^{\infty} E_t[p]\,k e^{-kt}\,dt. \tag{5}
\]
Then the expected frequency of allele A at the weak locus
(Fop) is obtained by equation 1 using m01 = 2Nμ01 f(sw, k)
and m10 = 2Nμ10 f(−sw, k). Therefore, under hitchhiking
with zero recombination, the expected level of codon bias
is given by
\[
E[F_{op}] = \frac{\mu_{01} f(s_w, k)}{\mu_{01} f(s_w, k) + \mu_{10} f(-s_w, k)}. \tag{6}
\]
m01 and m10 defined in this way are decreasing and
increasing functions of k, respectively. Figure 1A shows
that the predicted levels of Fop by equation 6 agree well
with simulation results for values of 2Nsw ranging from
0.5 to 4.
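
Equations 4 to 6 can be evaluated numerically as in the sketch below. The function names, the midpoint rule, and the change of variable v = 1 − exp(−kt) are choices made for this illustration and are not taken from the paper. For s > 0 the integral in equation 5 has the closed form p0 + s/(1 + 2Nk), which provides a check on the numerical result.

```python
import math

def fixation_prob(s, N):
    """Equation 2 with Ne = N and p0 = 1/(2N); s must be nonzero."""
    return math.expm1(-2 * s) / math.expm1(-4 * N * s)

def mean_freq(t, s, N):
    """Equation 4: approximate mean frequency t generations after mutation."""
    approx = 1 / (2 * N) + s * (1 - math.exp(-t / (2 * N)))
    return approx if s > 0 else max(approx, fixation_prob(s, N))

def hitchhiking_fixation_prob(s, k, N, steps=20000):
    """Equation 5 by the midpoint rule.

    The substitution v = 1 - exp(-k t) maps the infinite range of t onto
    0 < v < 1 and absorbs the exponential weight k exp(-k t) dt into dv.
    """
    total = 0.0
    for j in range(steps):
        v = (j + 0.5) / steps
        t = -math.log1p(-v) / k
        total += mean_freq(t, s, N)
    return total / steps

def expected_fop_hitchhiking(N, sw, mu01, mu10, k):
    """Equation 6."""
    up = mu01 * hitchhiking_fixation_prob(sw, k, N)
    down = mu10 * hitchhiking_fixation_prob(-sw, k, N)
    return up / (up + down)

N = 10**4
sw = 1 / (2 * N)                      # 2N*sw = 1
for two_n_k in (0.01, 0.1, 1.0, 10.0):
    fop = expected_fop_hitchhiking(N, sw, 0.002 / (2 * N), 0.006 / (2 * N), two_n_k / (2 * N))
    print(f"2Nk = {two_n_k:<5} E[Fop] = {fop:.3f}")
```

The printed values decline toward the mutation-bias expectation μ01/(μ01 + μ10) = 0.25 as 2Nk grows, the qualitative pattern described for figure 1A.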
FIG. 1.—Decreasing Fop with increasing rate of substitutions at the strong locus. Pairs of simulation result (joined by dashed lines) and its theoretical prediction (continuous curve) are shown for four levels of the strength of selection at the weak locus (from top to bottom, 2Nsw = 4, 2, 1, and 0.5, respectively). N = 10⁴, 2Nμ01 = 0.002, 2Nμ10 = 0.006, 2Nss = 1,000, and r = 0. (A) Theoretical predictions are given by equation 6. (B) Theoretical prediction is given by equation 8 in which h is given by equation 19 of Stephan, Wiehe, and Lenz (1992).

The decline of Fop shown in figure 1 can be
interpreted as a decrease in the efficacy of selection with
increasing interference between selected alleles from two
loci (Hill and Robertson 1966). This reduced efficacy of
selection may be summarized by a reduced effective
population size (Ne), which is inversely proportional to the
strength of genetic drift experienced by segregating
mutants, since the strength of selection diminishes with
increasing degree of genetic drift. Therefore, one may
expect that the fixation probabilities of weakly selected
alleles would be approximated by replacing Ne in equation
2 with a proper value that takes interference from the
strong locus into account. Under the model of recurrent
hitchhiking, previous studies showed that
\[
N_e = \frac{N}{1 + 2Nk(1 - h)}, \tag{7}
\]
where h is the relative heterozygosity at a neutral locus
immediately after the fixation of the beneficial mutation
(Stephan, Wiehe, and Lenz 1992; Wiehe and Stephan
1993; Gillespie 2000a). Under the assumption of no
recombination, h = 0. It is not clear whether it is correct to
replace Ne in equation 2 with the above formula, since Ne
defined in equation 2 is the determinant of the sampling
variance of allele frequency change between generations in
the Wright-Fisher model. On the other hand, equation 7
was derived based on the effect of hitchhiking on the
coalescent. Gillespie (2000a) demonstrated the equivalence of these two definitions of Ne for neutral variants
under recurrent hitchhiking. However, it is not known
whether this equivalence can still be applied to the genetic
drift of weakly selected alleles. I examined the validity of
the approximation using equation 7 by comparing results
from simulation and theory. The new approximation for
the expected Fop is
\[
E[F_{op}] = \frac{\mu_{01} u'(s_w)}{\mu_{01} u'(s_w) + \mu_{10} u'(-s_w)}, \tag{8}
\]
where u′(·) is given by equation 2 but using Ne defined by
equation 7. Figure 1B shows that equation 8 and
simulation results agree very well for small values of
2Nsw, but the approximation gets worse with increasing
2Nsw. A possible explanation for the discrepancy in the
latter case is discussed below.
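
The effective-size approximation of equations 7 and 8 is simple to compute. The following sketch uses hypothetical function names and covers the case where h is supplied directly (h = 0 for no recombination).

```python
import math

def effective_size(N, k, h=0.0):
    """Equation 7; h = 0 corresponds to no recombination."""
    return N / (1 + 2 * N * k * (1 - h))

def fixation_prob(s, N, Ne):
    """Equation 2 with p0 = 1/(2N), so 4*Ne*s*p0 = 2*Ne*s/N; s must be nonzero."""
    return math.expm1(-2 * Ne * s / N) / math.expm1(-4 * Ne * s)

def expected_fop_reduced_ne(N, sw, mu01, mu10, k, h=0.0):
    """Equation 8: equation 2 evaluated at the reduced Ne of equation 7."""
    Ne = effective_size(N, k, h)
    up = mu01 * fixation_prob(sw, N, Ne)
    down = mu10 * fixation_prob(-sw, N, Ne)
    return up / (up + down)

N = 10**4
for two_n_k in (0.0, 0.1, 1.0, 10.0):
    fop = expected_fop_reduced_ne(N, 1 / (2 * N), 0.002 / (2 * N), 0.006 / (2 * N), two_n_k / (2 * N))
    print(f"2Nk = {two_n_k:<5} E[Fop] = {fop:.3f}")
```

Because the ratio u′(sw)/u′(−sw) equals exp[4Ne sw(1 − p0)], equation 8 reduces to E[Fop] = 1/{1 + (μ10/μ01) exp[−4Ne sw(1 − p0)]}, which makes the dependence on Ne explicit.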
With nonzero recombination between the weak and
strong loci, the hitchhiking effect on codon bias may be
predicted by equation 8 using an appropriate solution for h
in equation 7, reasoning that this approximation should be
valid for small 2Nsw as in the case of zero recombination.
Figure 2 shows the fit of equation 8, where h is given by
equation 19 of Stephan, Wiehe, and Lenz (1992), to
simulation results with 2Nsw = 1 for various recombination rates (fig. 2A) and various strengths of selection at the
strong locus (fig. 2B).

Table 1
Summary of Notation for Data Analysis
Symbol       Description
N            Effective population size without nonsynonymous substitutions
Ne           Effective population size with nonsynonymous substitutions
μ01 (μ10)    Mutation rate from unpreferred (preferred) to preferred (unpreferred) allele per generation per synonymous site
b            μ10/μ01
ss (sw)      Selection coefficient for nonsynonymous (synonymous) allele
L            The number of linked nonsynonymous sites on one side of the synonymous site where codon bias is measured
r            Recombination rate between nonsynonymous sites
c01 (c10)    Probability that an unpreferred (preferred) codon switches to a preferred (unpreferred) codon because of a nonsynonymous substitution
ne           Average number of nonsynonymous sites per codon
kN           Rate of nonsynonymous substitution per site per generation
k            Rate of strongly selected nonsynonymous substitution per site per generation
d            Fraction of nonsynonymous substitutions driven by strong positive selection

FIG. 2.—Decreasing Fop with increasing rate of substitutions at the strong locus. Pairs of simulation result (joined by dashed lines) and its theoretical prediction (continuous curve) are shown for three recombination rates between loci. Theoretical prediction is given by equation 8 in which h is given by equation 19 of Stephan, Wiehe, and Lenz (1992). (A) Three different recombination rates (from top to bottom, r/ss = 2, 0.2, and 0.02, respectively). N = 10⁴, 2Nμ01 = 0.002, 2Nμ10 = 0.006, 2Nsw = 1, and 2Nss = 1,000. (B) Three different strengths of selection at the strong locus (from top to bottom, 2Nss = 100, 250, and 1,000, respectively). N = 10⁴, 2Nμ01 = 0.002, 2Nμ10 = 0.006, 2Nsw = 1, and 4Nr = 20.

Application to Data
Next, I examined whether this theory based on the
two-locus model can explain the negative correlation
between codon bias (Fop) and dN observed among genes
sampled from D. melanogaster and D. simulans (Betancourt and Presgraves 2002). It is beyond the scope of this
paper to try to estimate the parameters of directional
selection by a rigorous statistical analysis because there are
a number of contributing factors that are not well known
for Drosophila species. Instead, reasonable ranges of
parameter values regarding positive selection in Drosophila species are explored to generate the relationship that
is not qualitatively different from the data. Table 1 summarizes the notation used in this section. The definitions of
some parameters introduced earlier have been adjusted for
data analysis.
Before applying the theory to data, a complicating
factor that was not modeled above but is obvious in the
evolution of an actual coding sequence needs to be
examined first. An optimal codon for one amino acid may
change to a nonoptimal codon for a different amino acid,
or vice versa, by a single nonsynonymous substitution. If
a large proportion of nonsynonymous substitutions cause
changes from preferred to unpreferred codons and they
remain unpreferred for a long period, a negative correlation between dN and Fop can arise. I examined how much
this process alone can explain the data using the following
model. Let c10 be the probability that a preferred allele at
a synonymous site changes its status to an unpreferred
allele by a nonsynonymous substitution at the same codon.
The probability for the other direction, c01, is similarly
defined. Then, in the presence of weak selection for
optimal codons, the expected Fop is obtained by modifying
equation 1 into
\[
E[F_{op}] = 1 \Big/ \left(1 + \frac{2N\mu_{10}\,u(-s_w) + n_e k_N c_{10}}{2N\mu_{01}\,u(s_w) + n_e k_N c_{01}}\right), \tag{9}
\]
where ne is the average effective number of nonsynonymous sites per codon and kN is the rate of nonsynonymous substitution per site per generation. c01 and
c10 may be obtained by examining the table of preferred
codons in D. melanogaster (table 1 of Akashi [1994]),
where there are 22 preferred and 37 unpreferred codons
(Met, Trp, and termination codons are excluded). Of
137 possible one-step nonsynonymous mutations from
preferred codons, 35 mutations cause switches into unpreferred codons. Therefore, c10 is estimated to be 0.255
(= 35/137) if the direction of amino acid substitution is
assumed to be random. Similarly, c01 = 0.156 (35
switching mutations out of 223 nonsynonymous mutations
from unpreferred codons). kN can be estimated directly
from dN, assuming that dN obtained from the D. melanogaster–D. simulans pair is an indicator of the long-term
rate of nonsynonymous substitution beyond the common
ancestor of these species (see below); namely, kN = dN/(2T),
where T is the divergence time in generations between D.
melanogaster and D. simulans. T is assumed to be 3 × 10⁷
using 3 million years of divergence and 10 generations
per year (McVean and Vieira 2001). I use ne = 2.25. It is
obvious from equation 9 that mutation rate at a synonymous site is a critical parameter because the preference
status can be quickly reversed by synonymous site
substitution. McVean and Vieira (2001) estimated that
the 95% credibility interval for the mutation rate per site
per generation in noncoding DNA in Drosophila is 10⁻⁹
to 2.5 × 10⁻⁹. The estimated rate per synonymous site was
slightly higher (McVean and Vieira 2001). Therefore,
similar values should be assumed for μ01 and μ10. N is
chosen to be 10⁶. However, up to a 5-fold increase in N
does not change the result as long as 2Nsw remains
constant (data not shown). Using these assumptions, the
expected relationship between Fop and dN is plotted in
figure 3, along with the actual data for various synonymous mutation rates and strengths of selection. Even with
very small synonymous mutation rates (curve “d,” μ01 =
2 × 10⁻¹⁰ and μ10 = 6 × 10⁻¹⁰), the expected Fop
is larger than observed values for most genes of high dN.
Therefore, the erosion of optimal codon usage at the site of
nonsynonymous substitutions alone is unable to account for
the observed decline of Fop with increasing dN. However, the
contribution of this process to the decline of Fop may not be
ignored. In the following analysis, this cause of codon bias
change is included along with hitchhiking effects.
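
The codon-switching model of equation 9 can be tabulated as in the sketch below. The function names are hypothetical. The defaults ne = 2.25, c01 = 0.156, c10 = 0.255, and T = 3 × 10⁷ are the values quoted above, and the example call uses the parameters given for curve “d” (2Nsw = 0.8, μ01 = 2 × 10⁻¹⁰, μ10 = 6 × 10⁻¹⁰, N = 10⁶).

```python
import math

def fixation_prob(s, N):
    """Equation 2 with Ne = N and p0 = 1/(2N); s must be nonzero."""
    return math.expm1(-2 * s) / math.expm1(-4 * N * s)

def expected_fop_with_switching(dN, N, sw, mu01, mu10,
                                ne=2.25, c01=0.156, c10=0.255, T=3e7):
    """Equation 9 with kN = dN / (2T)."""
    kN = dN / (2 * T)
    # Rate at which a site leaves the preferred state.
    loss = 2 * N * mu10 * fixation_prob(-sw, N) + ne * kN * c10
    # Rate at which a site enters the preferred state.
    gain = 2 * N * mu01 * fixation_prob(sw, N) + ne * kN * c01
    return 1 / (1 + loss / gain)

N = 10**6
for dN in (0.0, 0.05, 0.1, 0.2):
    fop = expected_fop_with_switching(dN, N, sw=0.8 / (2 * N), mu01=2e-10, mu10=6e-10)
    print(f"dN = {dN:.2f}  E[Fop] = {fop:.3f}")
```

As dN becomes very large, the switching terms dominate and the expectation approaches c01/(c01 + c10) ≈ 0.38, which sets a floor on how far this process alone can depress Fop.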
A few assumptions are needed for applying the
hitchhiking model described above to the data. First, the
theory is applicable when the substitutions between the preferred and unpreferred codons are in equilibrium such that
the codon bias at a gene remains constant throughout time.
It has been inferred that the frequency of unpreferred
codons has been increasing, presumably because of recent
relaxation of selective pressure on codon usage, since the
split of D. melanogaster and D. simulans lineages (Akashi
1996; McVean and Vieira 2001). However, the departure
of Fop from equilibrium during this period alone can be
ignored because only 5% of synonymous sites have been
subject to substitutions after the split (Betancourt and
Presgraves 2002). There is evidence that selection on
codon bias has been maintained for a much longer period,
predating the speciation of D. melanogaster and D.
simulans (Powell and Moriyama 1997; McVean and
Vieira 2001). Therefore, it may not be unreasonable to
assume that the current pattern of Fop in these species has
been shaped mainly by a long-term mutation-selection-drift balance that once attained near equilibrium. Second, I
assume that dN estimated between D. melanogaster and
D. simulans genes is proportional to the rate of positively
selected substitutions not only in these lineages but also in
the period predating the common ancestor of these species;
that is, the relative rate of adaptive evolution among genes
is assumed to remain constant through time.

FIG. 3.—Comparison of the data from Betancourt and Presgraves (2002) and predictions by equation 9. Fop and dN estimated for 253 D. simulans genes are shown as grey points. Five curves (labeled a to e) are drawn by equation 9 using the following values: a. 2Nsw = 1.2 and μ01 = 5 × 10⁻¹⁰. b. 2Nsw = 0.8 and μ01 = 10⁻⁹. c. 2Nsw = 0.8 and μ01 = 5 × 10⁻¹⁰. d. 2Nsw = 0.8 and μ01 = 2 × 10⁻¹⁰. e. 2Nsw = 0.5 and μ01 = 5 × 10⁻¹⁰. Other parameters are N = 10⁶, ne = 2.25, c01 = 0.156, c10 = 0.255, and μ10 = 3μ01.

As the cumulative effects of selected substitutions at
many nonsynonymous sites determine the level of codon
bias at a given synonymous site, the number and spatial
structure of nonsynonymous sites for each gene are
important factors. However, for many genes included in
the data of Betancourt and Presgraves (2002), this
information is not available. I therefore simplify the
analysis by considering an ‘‘ideal’’ gene that consists of
one coding region without introns. The expected Fop for
a synonymous site located in the middle of this gene is
calculated. Assume that there are L nonsynonymous sites
on each side of this synonymous site and that the rate of
recombination with the ith closest nonsynonymous site is
given by ir per generation. For r = 0, the expected Fop can
be obtained from equation 6 by setting the rate of substitution at the strong locus to 2Lk, where
k is now the number of strongly selected substitutions per
nonsynonymous site per generation. For r > 0, the
effective population size for calculating fixation probabilities is now
\[
N_e = N \Big/ \left\{1 + 4Nk \sum_{i=0}^{L} \left(1 - h(ir)\right)\right\} \tag{10}
\]
(Kim and Stephan 2000). Then the expected Fop is given
by modifying equation 8 into
\[
E[F_{op}] = 1 \Big/ \left(1 + \frac{2N\mu_{10}\,u'(-s_w) + n_e k_N c_{10}}{2N\mu_{01}\,u'(s_w) + n_e k_N c_{01}}\right), \tag{11}
\]
where u′(·) is given by equation 2 using Ne from equation 10.
For convenience, I use h(ir) = 1 − (4Nss)^(−ir/ss), which closely
approximates the diffusion solution of Stephan, Wiehe,
and Lenz (1992). k is obtained from dN, assuming that
a fraction d of nonsynonymous substitutions is driven by
positive selection. Namely, if the two species are separated
for T generations since speciation, k is given by d·dN/(2T).
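
Equations 10 and 11 can be combined as in the following sketch. The function names are hypothetical, the exponent of h(ir) is taken to be −ir/ss, and the parameter values in the example (N = 5 × 10⁶, 2Nsw = 1, 2Nss = 600, L = 1,000, r = 5 × 10⁻⁸, μ01 = 10⁻⁹, μ10 = 3μ01, d = 0.5) are illustrative only. The fraction d is passed in as a plain number here.

```python
import math

def effective_size_linked(N, k, L, r, ss):
    """Equation 10 with h(ir) = 1 - (4 N ss)**(-i r / ss)."""
    # Each term is 1 - h(ir), summed over i = 0..L as in equation 10.
    total = sum((4 * N * ss) ** (-i * r / ss) for i in range(L + 1))
    return N / (1 + 4 * N * k * total)

def fixation_prob(s, N, Ne):
    """Equation 2 with p0 = 1/(2N); s must be nonzero."""
    return math.expm1(-2 * Ne * s / N) / math.expm1(-4 * Ne * s)

def expected_fop_linked(dN, d, N, sw, ss, mu01, mu10, L, r,
                        ne=2.25, c01=0.156, c10=0.255, T=3e7):
    """Equation 11, with kN = dN / (2T) and k = d * dN / (2T)."""
    kN = dN / (2 * T)
    Ne = effective_size_linked(N, d * kN, L, r, ss)
    loss = 2 * N * mu10 * fixation_prob(-sw, N, Ne) + ne * kN * c10
    gain = 2 * N * mu01 * fixation_prob(sw, N, Ne) + ne * kN * c01
    return 1 / (1 + loss / gain)

N = 5 * 10**6
for dN in (0.0, 0.02, 0.05, 0.1, 0.2):
    fop = expected_fop_linked(dN, d=0.5, N=N, sw=1 / (2 * N), ss=600 / (2 * N),
                              mu01=1e-9, mu10=3e-9, L=1000, r=5e-8)
    print(f"dN = {dN:.2f}  E[Fop] = {fop:.3f}")
```

In this sketch the decline of Fop with dN has two sources: the reduction of Ne through k, and the direct codon-switching terms ne kN c10 and ne kN c01.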
Smith and Eyre-Walker (2002) and Fay, Wyckoff, and Wu
(2002) suggested that d for Drosophila genes is around
0.45 to 0.5. However, it is unrealistic to expect d to be
constant over different genes. For genes whose dN is
greater than the rate of synonymous substitutions, dS, most
nonsynonymous substitutions are likely to have been fixed
by positive selection. It is therefore expected that d itself is
an increasing function of dN. I consider a simple relationship d = 1 − exp(−dN/dS), where dS (= 0.1) is the mean
synonymous substitution rate observed in the data. The
observed mean number of codons, averaged over 236
genes for which this number is available, is 571. Given
that many of these are partial EST sequences, the true
average number of codons should be higher than this
value. Therefore, a value of L between 500 and 1,000
might be chosen to represent the data, assuming that about
75% of sites in an exon are effectively nonsynonymous.
Mean recombination rate per nucleotide estimated from the
data is 3 × 10⁻⁸ (= 0.003 cM/kb) (Betancourt and
Presgraves 2002). A higher value of r than this should be
used, considering that the presence of introns is ignored in
the model. T is assumed to be 3 × 10⁷ generations (see
above).
Examination of the solutions derived above reveals
that Fop at equilibrium mainly depends on the ratio of μ01
and μ10 but not their absolute values, if the hitchhiking
effect is the major force reducing codon bias. I define b =
μ10/μ01. Although analytic solutions were obtained for
twofold degenerate sites, they are also applicable to
fourfold degenerate sites by distinguishing preferred from
all unpreferred alleles. Considering also the AT-biased
mutation in Drosophila (Petrov and Hartl 1999), which
causes more substitutions toward unpreferred alleles
(Akashi 1996; Powell and Moriyama 1997), b ranging
from 2 to 4 would be reasonable when applying the theory
to the data.
In figure 4, expectations of Fop as functions of dN for
various sets of parameter values are plotted against the
data for D. simulans. Population size N, which corresponds
to the hypothetical effective population size for D.
simulans after eliminating strong directional selection over
the entire genome, is assumed to be 5 × 10⁶. The observed
mean of Fop for dN = 0 (around 0.65) and that for high dN
(around 0.3) allow quite narrow ranges of b (~3 to ~4)
and 2Nsw (~0.5 to ~1.5) to fit the data. With
combinations of the parameter values considered here,
a reasonable fit is obtained only when quite strong
selection (2Nss = 200–4,000) for nonsynonymous sites
is assumed. If a larger value of r is used, a proportionally
larger value of 2Nss needs to be used.

FIG. 4.—Comparison of the data and predictions by equation 11. Five different curves (labeled a to e) are drawn by equation 11 using the following parameter sets: a. b = 3.5, 2Nsw = 0.8, 2Nss = 200, L = 500, and r = 10⁻⁷. b. b = 3, 2Nsw = 1, 2Nss = 600, L = 1,000, and r = 5 × 10⁻⁸. c. b = 3.5, 2Nsw = 0.8, 2Nss = 2,000, L = 500, and r = 10⁻⁷. d. b = 4, 2Nsw = 1.4, 2Nss = 4,000, L = 1,000, and r = 10⁻⁷. e. b = 4, 2Nsw = 0.6, 2Nss = 1,000, L = 1,000, and r = 3 × 10⁻⁸. Other parameters are N = 5 × 10⁶, μ01 = 10⁻⁹, c01 = 0.156, c10 = 0.255, and ne = 2.25.

Discussion
Two forms of simple approximations were found for
the effect of strong directional selection on substitutions of
weakly selected mutants at linked sites. The first solution
(equation 6) explicitly models the fixation process of
a weakly selected allele linked to a beneficial allele. It uses
an approach similar to that of Gillespie (2001), which, however,
ignores genetic drift in finite populations. This method of
derivation was possible only for zero recombination. The
second solution (equation 8), which allows recombination
and provides a good approximation for small 2Nsw, is
theoretically less complete because it simply exploits the
fact that the fixation probability depends on a certain form
of effective population size (Ne). Agreement of the
simulation results and equation 8 (figs. 1B and 2) indicates
that the ‘‘coalescent effective’’ population size for neutral
variants (Gillespie 2000b) is approximately equal to the
‘‘fixation effective’’ population size (Otto and Whitlock
1997) for weakly selected alleles (2Nsw ≤ 1) under the
model considered. As the perturbation of allele frequency
caused by hitchhiking is similar to that of a population
bottleneck (Barton 1998), the nature of the stochastic force
under recurrent hitchhiking should be similar to that under
population size fluctuation. Then the failure of equation 8
for moderately selected alleles (2Nsw > 1) may be
understood by the arguments of Otto and Whitlock
(1997). They showed that a simple approximation for the
fixation of a beneficial allele with a cyclically varying
population size, using the harmonic mean of changing
population size, is possible only when the population cycle
is sufficiently faster than the time scale of the fixation
process (on the order of 1/s). Therefore, harmonic mean
approximation works best for a very weakly selected
allele. If selection is strong and thus the time scale is short,
the fixation probability will not depend on the long-term
population size change but on the specific direction of
change at the beginning of the substitution. Similarly, the
probability of fixation at the weak locus in the hitchhiking
model depends on its time scale of fixation and the rate of
substitutions at the strong locus. If selection at the weak
locus is not so weak, the time scale of fixation becomes
short compared with that of a neutral allele, and, thus, one
cannot expect that the approximation using equation 7,
which is based on the stochastic behavior of a neutral
allele, is still valid.
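The role of the harmonic mean in the argument above can be illustrated numerically. The following Python sketch is not part of the original analysis: the function names, the example population cycle, and the use of Kimura's standard diffusion formula for a new semidominant mutation in a diploid population are assumptions made for illustration, and the sketch does not reproduce equation 7 or 8.

```python
import math


def harmonic_mean(sizes):
    """Harmonic mean of a sequence of positive population sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)


def fixation_probability(s, n_initial, n_effective):
    """Kimura's diffusion approximation for a new mutation (assumed model).

    s           : selection coefficient (heterozygote advantage, additive)
    n_initial   : number of diploid individuals when the mutation arises,
                  so the initial frequency is 1 / (2 * n_initial)
    n_effective : effective population size governing drift
    """
    if s == 0.0:
        return 1.0 / (2.0 * n_initial)  # neutral limit
    numerator = -math.expm1(-2.0 * s * n_effective / n_initial)
    denominator = -math.expm1(-4.0 * n_effective * s)
    return numerator / denominator


# Hypothetical population cycle (illustrative values only).
cycle = [10_000, 1_000, 10_000, 100_000]
ne = harmonic_mean(cycle)
arithmetic_mean = sum(cycle) / len(cycle)
print(f"harmonic mean = {ne:.1f}, arithmetic mean = {arithmetic_mean:.1f}")

# Fixation probability with the harmonic mean used as the effective size.
for s in (1e-5, 1e-3):
    u = fixation_probability(s, cycle[0], ne)
    print(f"s = {s:g}: time scale 1/s = {1 / s:g} generations, u = {u:.3e}")
```

The harmonic mean is dominated by the smallest sizes in the cycle. Substituting it for the effective size is reasonable only when the cycle (here four generations) is short relative to 1/s, which is the condition discussed above.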
Codon usage bias has largely been investigated under
single-site models in which the frequency of the preferred allele is determined by mutation bias, weak selection
(on translation efficiency), and genetic drift (or effective
population size) (Bulmer 1991; McVean and Charlesworth
1999). This paradigm of codon bias prompted numerous
investigations to find and evaluate major predictors of
codon bias in the Drosophila genome, such as local recombination rate, gene length, and gene expression level
(Kliman and Hey 1993; Comeron et al. 1999; Marais,
Mouchiroud, and Duret 2001). It is not straightforward to
find a direct role of the nonsynonymous substitution rate
(dN) in a simple model. However, it was previously
reported that codon usage is more biased for amino
acids that are more conserved between species (Ticher and
Grauer 1989; Akashi 1994). Using a larger data set in
Drosophila, Betancourt and Presgraves (2002) showed
that dN is not only an additional contributing factor to
codon bias but also one whose correlation with codon bias
is much stronger than those with recombination rate and
gene length. This correlation with dN might be explained
in many ways. Akashi (1994) argued that the selection for
translational accuracy maintains a high frequency of
preferred codons for highly conserved amino acids for
which the cost of misincorporation is higher. Therefore,
a lower codon bias is expected at amino acid sites under
relaxed constraints. In a similar argument, constraints on
amino acid sites and the strength of selection on the
optimal codon usage may be correlated within a gene.
These hypotheses were not well supported, because the
exclusion of divergent amino acid sites does not change
the degree of correlation between dN and codon bias, and
because the rapid evolution of genes in the data set does
not appear to be caused by relaxed purifying selection
(Betancourt and Presgraves 2002). It is also possible that
the correlation of dN and codon bias is a secondary product
of correlation with other unobserved predictors, such as
gene expression level. Although these explanations cannot
be ruled out, this study focuses on the plausibility of the
hitchhiking effect of nonsynonymous substitutions as an
explanation for the correlation found by Betancourt and
Presgraves (2002). Because the effective population size is
a critical component in the single-site model of codon bias,
and selection on neighboring sites changes the effective
population size, it is not difficult to predict the effect of
interference (Hill and Robertson 1966; Barton 1995;
Gillespie 2001). Akashi (1996) also considered this effect
as one of the possible explanations for fast protein evolution
and reduced codon bias in the D. melanogaster lineage relative to the
D. simulans lineage.
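The dependence of the single-site prediction on effective population size can be made concrete with a short numerical sketch. The Python code below uses the commonly quoted equilibrium of the single-site mutation-selection-drift model (Bulmer 1991). It is an illustration only: the function and parameter names are hypothetical, the scaling of the selection parameter (2Ns or 4Ns) depends on the ploidy convention and should be matched to equation 6, and the mutation ratio of 2 is taken here to mean that mutation away from the preferred codon is twice as frequent as mutation toward it.

```python
import math


def expected_fop(scaled_selection, mutation_ratio):
    """Equilibrium frequency of the preferred codon at a single site.

    scaled_selection : product of effective population size and selection
                       coefficient, including the ploidy factor (assumed)
    mutation_ratio   : rate of mutation away from the preferred codon
                       divided by the rate toward it (assumed direction)
    """
    return 1.0 / (1.0 + mutation_ratio * math.exp(-scaled_selection))


# Halving the effective population size halves the scaled selection
# parameter and lowers the expected frequency of preferred codons.
mutation_ratio = 2.0
for scaled_selection in (4.0, 2.0, 1.0, 0.5, 0.0):
    fop = expected_fop(scaled_selection, mutation_ratio)
    print(f"scaled selection = {scaled_selection:3.1f} -> Fop = {fop:.3f}")
```

Under this form, any force that reduces the effective population size at a site, including hitchhiking at linked sites, moves Fop toward the value set by mutation bias alone.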
FIG. 5.—Frequency of optimal codon usage as a function of
population size N. Fop is given by equation 6 with μ10/μ01 = 2. Selection
coefficients for both loci are constant: sw = 10^-5 and ss = 10^-3. The mutation
rate at the strong locus is fixed at μs = 10^-8. Therefore, λ = 2Nμs u(ss)
(continuous curve). The dashed curve is drawn for λ = 0 (single-site model).
Using an analytic approximation based on a two-locus interaction between weak and strong selection, this
study demonstrated that the observed correlation of codon
bias and dN can be well accounted for by the hitchhiking
effect if a significant fraction of amino acid substitutions is
driven by strong directional selection. Although no rigorous
statistical parameter estimation was conducted, under
simplifying assumptions about numerous
genomic parameters the average strength of selection
(2Ns) required to produce the observed pattern of codon
bias was inferred to be at least on the order of 100 (fig. 4).
Candidate genes for male accessory gland proteins, which
comprise 24.4% of the genes in the data set of Betancourt
and Presgraves (2002), are known to be under strong
positive selection (Swanson et al. 2001). However, it is not
known in general whether other new adaptive amino acid
variants are under such strong positive selection in
Drosophila. As the required strength of selection to
explain a given level of codon bias critically depends on
the number of linked nonsynonymous sites and recombination rates, a more accurate inference of this parameter
will be possible if the structure and local recombination rate
of each individual gene are taken into account in the
analysis. Although it was assumed above that a given
synonymous site is under the influence of strong selection
at linked nonsynonymous sites in the same gene only,
positive selection acting on a flanking regulatory region
and even strong selection on a neighboring gene may
further reduce the level of codon bias by hitchhiking
effects. If the rate of adaptive change in the promoter region,
for example, increases along with the rate of positive
selection in the coding region, this could lead to an overestimation of the strength of selection at nonsynonymous
sites.
Because a good approximation for codon bias under the two-locus model was obtained through equation 7, not only
strong directional selection but also any other selective
force that reduces the effective population size at a linked
site is expected to lower codon bias. Recent studies have
considered the effect of weak selection at linked sites.
McVean and Charlesworth (2000) and Comeron and
Kreitman (2002) showed that the interaction of many
segregating alleles at tightly linked synonymous sites
(‘‘weak selection Hill-Robertson [wsHR] interference’’ or
‘‘interference selection’’) lowers the level of codon bias
from the expectation of the single-site mutation-selection-drift model. However, the preferred allele frequency under
wsHR interference deviates significantly from the standard
model only when there are a large number of synonymous
sites in tight linkage. At this point, the relative importance
of wsHR interference and the hitchhiking effect of strongly
selected nonsynonymous substitutions in determining the
degree of codon bias cannot be evaluated. Most likely,
these two forces act simultaneously in nature.
However, unless there is a correlation between the strength
of wsHR and dN, the effect of wsHR may simply be
regarded as a factor lowering 2Nsw in the model above.
Surveys of protein-coding sequences from numerous
species revealed that the degree of codon bias is similar
among species of very different census population sizes,
such as E. coli, yeast and Drosophila (Powell and
Moriyama 1997). This is a puzzling observation in
light of the single-site mutation-selection-drift model of codon
bias, which predicts intermediate codon bias only within a very narrow range of 2Nsw. It was
suggested that wsHR interference lowers codon bias from
the level predicted by the single-site model and maintains
intermediate levels of codon bias over several orders of
magnitude in population size (McVean and Charlesworth
2000). Hitchhiking effects of strong positive selection on
codon bias as modeled in this study may also reduce the
dependency of codon bias on population size. Figure 5
compares the expected levels of optimal codon usage
predicted by equation 6 (two-locus model) with and
without the hitchhiking effect. Population size varies
from 10^4 to 10^8 while selection coefficients and mutation
rates for weak and strong loci remain constant. The reduction of codon bias caused by hitchhiking shows a rather
complex pattern with increasing population size. After
population size exceeds the point where preferred alleles
are predicted to reach fixation in the single-site model,
expected codon bias decreases with increasing population
size. Obviously, the increasing input of positively selected
mutations with increasing population size intensifies
hitchhiking effects, which overpower the increasing
intensity of selection (2Nsw) at the weak locus. A similar
curve was obtained by Gillespie (2001) when he examined
the substitution rate of weakly selected alleles linked to
a strongly selected locus with varying population size.
Therefore, the uniformity of codon bias across different
species might be explained at least partially by a theory that
places the hitchhiking effect as a dominant stochastic force
governing molecular evolution and thus suggests ‘‘population size may not be relevant to a species’ evolution’’
(Gillespie 2001).
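The competing effects of population size described above can be tabulated with a brief Python sketch. It is illustrative only and does not evaluate equation 6. The parameter values are those read from the Figure 5 caption (sw = 10^-5, ss = 10^-3, mutation rate at the strong locus 10^-8), the function names are hypothetical, and the fixation probability u(ss) is replaced by the classical approximation 2ss for a strongly beneficial mutation, which is an assumption rather than the expression used in the paper.

```python
def sweep_rate(n, mu_s, s_s):
    """Rate of substitution at the strong locus, 2 * N * mu_s * u(s_s).

    u(s_s) is approximated by 2 * s_s (assumption for illustration).
    """
    return 2.0 * n * mu_s * (2.0 * s_s)


def scaled_weak_selection(n, s_w):
    """Intensity of selection at the weak locus, 2 * N * s_w."""
    return 2.0 * n * s_w


S_W = 1e-5   # selection coefficient at the weak locus
S_S = 1e-3   # selection coefficient at the strong locus
MU_S = 1e-8  # mutation rate at the strong locus

print(f"{'N':>10} {'2N*s_w':>10} {'sweep rate':>12}")
for exponent in range(4, 9):
    n = 10 ** exponent
    print(f"{n:>10d} {scaled_weak_selection(n, S_W):>10.1f} "
          f"{sweep_rate(n, MU_S, S_S):>12.2e}")
```

Both quantities grow in proportion to N, so a larger population experiences stronger selection at the weak locus and also more frequent selective sweeps at the strong locus. Which effect dominates is determined by equation 6 and is not resolved by this table.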
Acknowledgments
I thank Andrea Betancourt and Daven Presgraves for
their great support and help in the analysis of the
Drosophila data. I also thank Adam Eyre-Walker, Rasmus
Nielsen, Molly Przeworski, Wolfgang Stephan, and two
anonymous reviewers for their insights and comments that
greatly improved the manuscript. This research was
supported by National Science Foundation grant DEB-0089487 to Rasmus Nielsen.
Literature Cited
Akashi, H. 1994. Synonymous codon usage in Drosophila
melanogaster: natural selection and translational accuracy.
Genetics 136:927–935.
———. 1995. Inferring weak selection from patterns of polymorphism and divergence at ‘‘silent’’ sites in Drosophila DNA.
Genetics 139:1067–1076.
———. 1996. Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of
amino acid substitution, and larger proteins in D. melanogaster.
Genetics 144:1297–1307.
Akashi, H., and S. W. Schaeffer. 1997. Natural selection and the
frequency distributions of ‘‘silent’’ DNA polymorphism in
Drosophila. Genetics 146:295–307.
Barton, N. H. 1995. Linkage and the limits to natural selection.
Genetics 140:821–884.
———. 1998. The effect of hitch-hiking on neutral genealogies.
Genet. Res. 72:123–133.
Betancourt, A. J., and D. C. Presgraves. 2002. Linkage limits the
power of natural selection in Drosophila. Proc. Natl. Acad.
Sci. USA 99:13616–13620.
Birky, C. W., and J. B. Walsh. 1988. Effects of linkage on rates
of molecular evolution. Proc. Natl. Acad. Sci. USA 85:6414–
6418.
Bulmer, M. 1991. The selection-mutation-drift theory of
synonymous codon usage. Genetics 129:897–907.
Comeron, J. M., and M. Kreitman. 2002. Population, evolutionary and genomic consequences of interference selection.
Genetics 161:389–410.
Comeron, J. M., M. Kreitman, and M. Aguadé. 1999. Natural
selection on synonymous sites is correlated with gene length
and recombination in Drosophila. Genetics 151:239–249.
Ewens, W. J. 1979. Mathematical population genetics. Springer-Verlag, New York.
Fay, J. C., G. J. Wyckoff, and C.-I. Wu. 2002. Testing the neutral
theory of molecular evolution with genomic data from
Drosophila. Nature 415:1024–1026.
Gerrish, P. J., and R. E. Lenski. 1998. The fate of competing
beneficial mutations in an asexual population. Genetica 102/
103:127–144.
Gillespie, J. H. 2000a. Genetic drift in an infinite population: the
pseudohitchhiking model. Genetics 155:909–919.
———. 2000b. The neutral theory in an infinite population.
Gene 261:11–18.
———. 2001. Is the population size of a species relevant to its
evolution? Evolution 55:2161–2169.
Hill, W. G., and A. Robertson. 1966. The effect of linkage on the
limits to artificial selection. Genet. Res. 8:269–294.
Kim, Y., and W. Stephan. 2000. Joint effects of genetic
hitchhiking and background selection on neutral variation.
Genetics 155:1415–1427.
Kliman, R. M., and J. Hey. 1993. Reduced natural selection associated with low recombination in Drosophila melanogaster.
Mol. Biol. Evol. 10:1239–1258.
Marais, G., D. Mouchiroud, and L. Duret. 2001. Does
recombination improve selection on codon usage? Lessons
from nematode and fly complete genomes. Proc. Natl. Acad.
Sci. USA 98:5688–5692.
Maynard Smith, J., and J. Haigh. 1974. The hitch-hiking effect of
a favourable gene. Genet. Res. 23:23–35.
McVean, G. A. T., and B. Charlesworth. 1999. A population
genetic model for the evolution of synonymous codon usage:
patterns and predictions. Genet. Res. 74:145–158.
———. 2000. The effects of Hill-Robertson interference
between weakly selected mutations on patterns of molecular
evolution and variation. Genetics 155:929–944.
McVean, G. A. T., and J. Vieira. 2001. Inferring parameters of
mutation, selection and demography from patterns of
synonymous site evolution in Drosophila. Genetics 157:
245–257.
Otto, S. P., and M. C. Whitlock. 1997. The probability of
fixation in populations of changing size. Genetics 146:723–
733.
Petrov, D. A., and D. L. Hartl. 1999. Patterns of nucleotide
substitution in Drosophila and mammalian genomes. Proc.
Natl. Acad. Sci. USA 96:1475–1479.
Powell, J. R., and E. N. Moriyama. 1997. Evolution of codon
usage bias in Drosophila. Proc. Natl. Acad. Sci. USA
94:7784–7790.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P.
Flannery. 1992. Numerical recipes in C. Cambridge University Press, Cambridge, U.K.
Sharp, P. M., and W.-H. Li. 1987. The rate of synonymous
substitution in enterobacterial genes is inversely related to
codon usage bias. Mol. Biol. Evol. 4:222–230.
Shields, D. C., P. M. Sharp, D. G. Higgins, and F. Wright. 1988.
‘‘Silent’’ sites in Drosophila genes are not neutral: evidence
of selection among synonymous codons. Mol. Biol. Evol.
5:704–716.
Smith, N. G. C., and A. Eyre-Walker. 2002. Adaptive protein
evolution in Drosophila. Nature 415:1022–1024.
Stephan, W., T. H. E. Wiehe, and M. W. Lenz. 1992. The effect
of strongly selected substitutions on neutral polymorphism:
analytical results based on diffusion theory. Theor. Popul.
Biol. 41:237–254.
Swanson, W. J., A. G. Clark, H. M. Waldrip-Dail, M. F.
Wolfner, and C. F. Aquadro. 2001. Evolutionary EST
analysis identifies rapidly evolving male reproductive
proteins in Drosophila. Proc. Natl. Acad. Sci. USA 98:
7375–7379.
Ticher, A., and D. Grauer. 1989. Nucleic acid composition,
codon usage, and the rate of synonymous substitution in
protein-coding genes. J. Mol. Evol. 28:286–298.
Wiehe, T. H. E., and W. Stephan. 1993. Analysis of a genetic
hitchhiking model, and its application to DNA polymorphism
data from Drosophila melanogaster. Mol. Biol. Evol. 10:842–
854.
Adam Eyre-Walker, Associate Editor
Accepted September 18, 2003