Bio 113
Winter 2005/6
Problem Set 1 Solutions
1)
In section discussion, we saw that for n nucleotides, the probability that, starting with
nucleotide A at time zero, we end up with nucleotide A at time t is
$$\Pr[A \to A]_t = \frac{1}{n} + \frac{n-1}{n}\, e^{-n\alpha t} \qquad [1]$$
The probability that, starting with nucleotide A at time zero, we end up with a particular
nucleotide B other than A at time t is
$$\Pr[A \to B]_t = \frac{1}{n} - \frac{1}{n}\, e^{-n\alpha t} \qquad [2]$$
where α is the probability of a particular nucleotide change in one unit of time. Below
we plot functions [1] and [2] for n = 7 and α = 0.01.
[Figure: Pr[A→A]_t and Pr[A→not-A]_t plotted against t.]
Now consider a particular nucleotide site in homologous sequences of two species. We
want to know I, the probability of identity – i.e. the probability that nucleotides are the
same in the two species. Without loss of generality, let’s call the ancestor nucleotide A. I
is the probability that, given one of the ancestor possibilities, the nucleotides in both of
the present-day species are the same. In other words,
I = Pr[starting with A, we get A in both species that evolve independently up to time t]
+ Pr[starting with A, we get B in both species that evolve independently up to time t]
+ Pr[starting with A, we get C in both species that evolve independently up to time t] + ...
= Pr[starting with A, we get A in species 1] × Pr[starting with A, we get A in species 2]
+ Pr[starting with A, we get B in species 1] × Pr[starting with A, we get B in species 2] + ...
= (Pr[A→A]_t)^2 + (n - 1) (Pr[A→B]_t)^2
Where the factor (n-1) comes from the fact that, given that we fixed the ancestral
nucleotide, there are n-1 possible other nucleotides that could be found in the descendant
sequences. Substituting expressions [1] and [2], we get
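$$I = \left(\frac{1}{n} + \frac{n-1}{n}\, e^{-n\alpha t}\right)^2 + (n-1)\left(\frac{1}{n} - \frac{1}{n}\, e^{-n\alpha t}\right)^2 = \dots = \frac{1}{n} + \frac{n-1}{n}\, e^{-2n\alpha t}$$
From this, sequence divergence (probability that nucleotides that came from the same
ancestral nucleotide are different in the two species) is
$$D = 1 - I = \frac{n-1}{n} - \frac{n-1}{n}\, e^{-2n\alpha t} = \frac{n-1}{n}\left(1 - e^{-2n\alpha t}\right) \qquad [3]$$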
Now, we define K as the number of substitutions that actually happened in the two
lineages during the time t. In a given unit of time, the probability that the existing
nucleotide will change to some other nucleotide is (n - 1)α, because each of the n - 1
possible changes has probability α. Therefore, in one of the lineages, the expected
number of nucleotide changes during time t is (n - 1)αt. The number of changes
in both lineages is twice this number:
$$K = 2(n-1)\alpha t$$
(For n = 4 this is the familiar K = 6αt.) Now we wish to relate D and K. The exponent
in [3] is just equal to -nK/(n - 1), which means that
$$D = \frac{n-1}{n}\left(1 - e^{-nK/(n-1)}\right) \quad\Longrightarrow\quad e^{-nK/(n-1)} = 1 - \frac{n}{n-1}\, D$$
Taking the natural logarithm of both sides of the right-hand equation, and then multiplying
both sides by -(n - 1)/n, we get
$$K = -\frac{n-1}{n}\,\ln\left(1 - \frac{n}{n-1}\, D\right)$$
For n = 7, this becomes
$$K = -\frac{6}{7}\,\ln\left(1 - \frac{7}{6}\, D\right)$$
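The correction can be checked numerically with a round trip: compute D from equation [3] for known α and t, then recover K from D. The sketch below is only an illustration; the function names are our own, and it takes the true number of substitutions for an n-letter alphabet to be K = 2(n - 1)αt (which is 6αt when n = 4).

```python
import math

def divergence(alpha, t, n=4):
    """Expected divergence D from equation [3] for an n-letter alphabet."""
    return (n - 1) / n * (1 - math.exp(-2 * n * alpha * t))

def corrected_distance(d, n=4):
    """K = -((n-1)/n) * ln(1 - n/(n-1) * D); defined for 0 <= D < (n-1)/n."""
    limit = (n - 1) / n
    if not 0 <= d < limit:
        raise ValueError("D must lie in [0, (n-1)/n)")
    return -limit * math.log(1 - d / limit)

# Round trip for the plotted parameters n = 7, alpha = 0.01, evaluated at t = 10.
n, alpha, t = 7, 0.01, 10
true_k = 2 * (n - 1) * alpha * t
estimated_k = corrected_distance(divergence(alpha, t, n), n)
```

If the algebra is right, `estimated_k` should agree with `true_k` up to floating-point error.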
2)
Isoleucine (I) is much more similar to leucine (L) than serine (S), in terms of several
physico-chemical properties (see, for instance, Figure 1.9 of Graur and Li). Accordingly,
one possible explanation for the above pattern is that nucleotide mutations that lead to a
change from I to L are generally not as harmful to the organism as those from I to S. As a
result, purifying natural selection purges mutants of the second type more effectively
from the population than those of the first type, allowing the mutations of the first type to
fix more often.
Another possibility is that mutational patterns, as opposed to selection pressures, are
causing the difference in substitution rates. For example, note that the codons for I are
AUU, AUC, and AUA. For each of them, there is a single one-nucleotide mutation that
leads to a codon for L (CUU, CUC, and CUA, respectively; AUA can also reach UUA in
one step). By contrast, at least two nucleotides need to change in any of the codons for I
in order to reach any of the UCN codons for S (UCU, UCC, UCA, and UCG); the two
remaining S codons, AGU and AGC, are one step away, but only from AUU and AUC. Thus, if
each single-nucleotide mutation is equally likely, we expect to see a larger number of I ->
L substitutions than I -> S substitutions, even in the absence of natural selection that
favors either substitution type more than the other.
A good way to distinguish between selection and mutation in causing such patterns of
substitution is to compare them to the substitution patterns in putatively non-functional
regions, where selection is much less likely to be in effect. For instance, we can look at
the rates of substitution between different nucleotides in pseudogenes, and assume that
these “background” substitution rates represent the underlying mutational patterns. We
can then check whether these background substitution rates can explain the amino-acid
substitution patterns in protein-coding regions.
3)
In mammalian genomes, in addition to neighbor-independent single-nucleotide
substitution rates, there is a known neighbor dependence effect: when a C is upstream of
a G nucleotide, the C->T transition rate is greatly elevated. This results in a frequent
conversion of CpG dinucleotides into TpG dinucleotides. Note that when such an event
happens on one of the DNA strands, on the opposite strand, a CpG dinucleotide is
converted into a CpA dinucleotide. In other words, this phenomenon has two
consequences: the nucleotide upstream of a G is more likely to be a T than a C, and the
nucleotide downstream of a C is more likely to be an A than a G.
In the microbial genome under study, we observe the opposite pattern:
the nucleotide downstream of a C is more likely to be a G than an A. This could result
from a phenomenon opposite to that found in mammalian genomes, in which TpG
dinucleotides are converted to CpG dinucleotides at an elevated rate (or, equivalently,
on the opposite strand, CpA dinucleotides are converted to CpG).
4)
Just as in the derivation of the Jukes-Cantor correction, we require a couple of quantities.
First, we need the probability that a nucleotide starts at A (T, G, or C) and is A (T, G, or C)
at time t. This probability, independent of which nucleotide we consider, is given to us on
page 73 of Graur and Li:
X = 1/4 (1 + e^{-4βt} + 2e^{-2(α+β)t})
We also need the probability that a nucleotide, starting at A, is some different nucleotide
at time t. In the Kimura two-parameter model, this probability is split up into two mutually
exclusive cases: the first is where the nucleotide at time t differs from the initial one by a
transition (call this probability Y) and the second is where it differs by a particular
transversion (call this probability Z). Here α is the transition rate and β is the rate of
each of the two possible transversions. Again, from the text:
Y = 1/4 (1 + e^{-4βt} - 2e^{-2(α+β)t})
Z = 1/4 (1 - e^{-4βt})
Note that X + Y + 2Z = 1, as it must.
Now, as indicated in the hint, we find P = (probability two sequences differ by a
transition) and Q = (probability two sequences differ by a transversion). We will step
through the derivation of Q (P is derived with similar reasoning). Two sequences can
differ by a transversion at time t, assuming they started at the same nucleotide, in two
ways. First, one sequence stays the same with probability X and the other has a
transversion with probability Z. There are two types of transversions (from any particular
nucleotide) and we can reverse the order of which sequence stays the same so the
probability of this happening is 4XZ. The other way to get a difference by a transversion
is if one sequence has a transition and the other has a transversion. This occurs with
probability 4YZ. Thus, Q = 4XZ + 4YZ. Using similar reasoning, P = 2Z^2 + 2XY. By
plugging and chugging we get
P = 1/4 (1 + e^{-8βt} - 2e^{-4(α+β)t})
Q = 1/4 (2 - 2e^{-8βt})
We're interested in finding K, the number of substitutions per site since divergence,
which is 2(α + 2β)t. We want to manipulate P and Q so that we can construct K. By
rearranging and taking logarithms,
-8βt = ln(1 - 2Q)    (1)
-4(α + β)t = ln(1 - 2P - Q)    (2)
Divide equation (1) by -4 and equation (2) by -2, add the results, and we get K. So,
K = -1/4 ln(1 - 2Q) - 1/2 ln(1 - 2P - Q)
To get the exact result in Graur and Li, remember that -ln A = ln(1/A).
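As a sanity check on the algebra, one can generate P and Q from chosen rates and confirm that the correction returns K = 2(α + 2β)t. The sketch below does this; the function names and the parameter values are arbitrary choices for illustration, with α the transition rate and β the rate of each transversion.

```python
import math

def k2p_proportions(alpha, beta, t):
    """P (transitional) and Q (transversional) differences between two lineages."""
    e8 = math.exp(-8 * beta * t)
    e4 = math.exp(-4 * (alpha + beta) * t)
    p = 0.25 * (1 + e8 - 2 * e4)
    q = 0.25 * (2 - 2 * e8)
    return p, q

def k2p_distance(p, q):
    """K = -1/4 ln(1 - 2Q) - 1/2 ln(1 - 2P - Q)."""
    return -0.25 * math.log(1 - 2 * q) - 0.5 * math.log(1 - 2 * p - q)

# Arbitrary illustrative rates: transitions twice as fast as each transversion.
alpha, beta, t = 0.002, 0.001, 50
expected_k = 2 * (alpha + 2 * beta) * t
recovered_k = k2p_distance(*k2p_proportions(alpha, beta, t))
```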
5)
First, calculate the allele frequencies from the data. Use the formula
x = freq(A) = AA + 1/2 AO + 1/2 AB
to find the frequency of A and similar formulae to find the frequencies of B and O (y and
z). Then, use Hardy-Weinberg to find the genotype frequency in the next generation. For
three alleles, the genotype frequencies sum to 1 and are
(x + y + z)^2 = x^2 + y^2 + z^2 + 2xy + 2xz + 2yz
where the homozygotes are first and the heterozygotes last.
Allele    Frequency
A         0.26
B         0.13
O         0.61

Genotype  Frequency
AA        0.0676
BB        0.0169
OO        0.3721
AB        0.0676
AO        0.3172
BO        0.1586
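The genotype table follows directly from the multinomial expansion. As a minimal sketch (the function name is our own), the expected Hardy-Weinberg frequencies for any number of alleles can be computed like this:

```python
from itertools import combinations_with_replacement

def hardy_weinberg(allele_freqs):
    """Expected genotype frequencies under random mating.

    Homozygotes get x^2, heterozygotes get 2xy.
    """
    genotypes = {}
    for a1, a2 in combinations_with_replacement(sorted(allele_freqs), 2):
        f = allele_freqs[a1] * allele_freqs[a2]
        genotypes[a1 + a2] = f if a1 == a2 else 2 * f
    return genotypes

# Allele frequencies from the table above.
freqs = hardy_weinberg({"A": 0.26, "B": 0.13, "O": 0.61})
for genotype, f in freqs.items():
    print(f"{genotype}: {f:.4f}")
```

The six frequencies sum to 1, which is a quick way to catch a mislabeled entry.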
6)
The size of Crake's population after ten years is 100 × 2^10 = 102,400.
a. The variance effective population size is the harmonic mean of the population sizes
at each of the time points in consideration.
Ne = 1 / (1/11 (1/100 + 1/200 + 1/400 + 1/800 + 1/1600 + 1/3200 + 1/6400 + 1/12800
+ 1/25600 + 1/51200 + 1/102400))
= 550.3
b. The observed total heterozygosity is an estimator of θ, the population mutation
parameter, and θ = 4Neµ. If θ = 2.2 × 10^-9 and Ne = 550.3, then µ = 10^-12, which is three
orders of magnitude smaller than average (Crake succeeded).
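Both numbers are easy to reproduce. The sketch below assumes, as in the solution, a starting size of 100 that doubles every year for ten years, and uses θ = 4Neµ; the variable names are our own.

```python
# Population sizes at the 11 time points (years 0 through 10).
sizes = [100 * 2**year for year in range(11)]

# Harmonic mean of the population sizes.
ne = len(sizes) / sum(1 / n for n in sizes)

# Mutation rate implied by theta = 4 * Ne * mu.
theta = 2.2e-9
mu = theta / (4 * ne)

print(f"Ne = {ne:.1f}, mu = {mu:.2e}")
```

The harmonic mean is dominated by the small early generations, which is why Ne stays near 550 even though the final census size exceeds 100,000.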
7)
a) Yes. The key is that the reading frame was not specified in the question. If the first
three bases, 'TGA' were in frame, the sequence could not be coding because TGA is a
stop codon. However, if the frame starts with the second base, giving GAG as the first
codon, then there are no stop codons in the transcript and the sequence could code for a
protein.
b) Two, at positions 12 and 25.
c) Assuming that the sequence has evolved neutrally, Theta_W = NumSegSites /
Sum(1/(i - 1), i=2..n) / NumSites (Gillespie p. 43).
Then Theta_W = 2 / Sum(1/(i - 1), i=2..6) / 40 ~ 0.0219.
Under neutrality, Theta_W is an estimator of 4 * Ne * u, and we can estimate u by
Theta_W/(4 * Ne). If Ne = 10^6, then u ~ 5.47 * 10^-9 /gen.
d) Three, at positions 12, 20, and 25.
e) i) By raw divergence, we see 3 changes per 40 sites, or 3/40 changes per site.
Divergence accrues in proportion to the mutation rate and time since speciation, K = 2 *
u * t. Thus t = K / (2*u).
t = (3/40) / (2 * 5.47 * 10 ^-9) ~ 6.9 * 10^6 generations.
If the species have ten generations per year, then we estimate the time since divergence to
be 6.9 * 10^5 = 690,000 years.
ii) The Jukes-Cantor correction is K = -(3/4) * ln(1 - 4 * p / 3). Plugging in p = 3/40,
we get K ~ 0.079. (This is greater than 3/40 (0.075), which makes sense because the
observed number of differences underestimates the true number.) If K ~ 0.079, then
t = (0.079) / (2 * 5.47 * 10^-9) ~ 7.2 * 10^6 generations and
the time since speciation is estimated at 720,000 years.
iii) To use the Kimura correction, we first need to estimate P, the proportion of
transitions, and Q, the proportion of transversions. The three site differences are C & T
(a transition), A & G (a transition), and A & T (a transversion). Then P = 2/40 and Q =
1/40. The correction formula is K = (1/2) ln(1/(1-2P-Q)) + (1/4) ln(1/(1-2Q)), so:
K = (1/2) ln(1/(1-2*(2/40)-(1/40))) + (1/4) ln(1/(1-2*(1/40))) ~ 0.0796
t = (0.0796) / (2 * 5.47 * 10^-9) ~ 7.3 * 10^6 generations and
the time since speciation is estimated at 730,000 years.
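The three estimates in part (e) can be recomputed in a few lines. This is a sketch under the assumptions used above (6 simulans sequences, 40 sites, Ne = 10^6, ten generations per year); the function names are our own.

```python
import math

def watterson_theta(seg_sites, n_seqs, n_sites):
    """Per-site Watterson estimator: S / a_n / L, with a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1 / i for i in range(1, n_seqs))
    return seg_sites / a_n / n_sites

def jukes_cantor(p):
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(p, q):
    return 0.5 * math.log(1 / (1 - 2 * p - q)) + 0.25 * math.log(1 / (1 - 2 * q))

theta_w = watterson_theta(seg_sites=2, n_seqs=6, n_sites=40)
u = theta_w / (4 * 1e6)  # mutation rate per site per generation

distances = {
    "raw": 3 / 40,
    "Jukes-Cantor": jukes_cantor(3 / 40),
    "Kimura": kimura_2p(2 / 40, 1 / 40),
}
# t = K / (2u) generations, at ten generations per year.
years = {name: k / (2 * u) / 10 for name, k in distances.items()}
for name in distances:
    print(f"{name}: K = {distances[name]:.4f}, t = {years[name]:,.0f} years")
```

All three corrections give divergence times of roughly 0.7 million years; with so few differences the choice of correction matters much less than the sampling error.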
f) Only two sites differ between sim2 and mel, so the estimate for i) is 2/40 -> 460,000
years, with corresponding decreases in ii) and iii). Which estimate of divergence should
we trust, the one using sim1 or the one using sim2? It’s hard to tell from this data. Two
things that confound interpretation here are that, although we wouldn’t know it without
the rest of the simulans, we’re mistaking polymorphisms for divergence (sites 12 and 25),
which causes us to overestimate the divergence time, and our small sample size, which
gives us a large sampling error. A good idea would be to look at longer sequences to see
if they support the lower or the higher divergence estimate.
g) We’ve estimated divergence times of less than a million years, far less than the
established divergence time of 3 million years. One possibility is that we are looking at a
coding sequence, as considered in part (a). If so, lower divergence would be expected.
Another possibility is that we just happen to have a small sample which doesn’t reflect
well the true divergence of the species. Another possibility is that our estimate of the
mutation rate from the polymorphism data was inaccurate. A lower true mutation rate
would imply a longer divergence time.
(If you're comfortable with statistics, here's a more rigorous look at the small-sample
question. You don’t have to know this for the class, but it’s interesting. Typically you
see about 8% divergence at neutral loci between mel and sim. We can conceive of our
40-site sample as 40 Bernoulli trials, each independent and with probability of success
equal to 8%. The mean number of divergence sites you expect to see is 40 * 0.08 = 3.2,
and the standard deviation is sqrt( 40 * 0.08 * (1 – 0.08) ) ~ 1.7. So it’s not unlikely at all
to see three, two, or one fixed differences in a sample of size 40. In fact, the chance that
you see zero fixed differences is just (1 – 0.08) ^ 40, or about 4%. The data are not
inconsistent at all with a true divergence time of 3 million years.)
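The binomial argument can be made concrete with a short calculation. The sketch below assumes the same model as the parenthetical note: 40 independent sites, each diverged with probability 0.08.

```python
import math

n_sites, p_div = 40, 0.08

mean = n_sites * p_div
sd = math.sqrt(n_sites * p_div * (1 - p_div))

def binom_pmf(k):
    """Probability of exactly k diverged sites out of n_sites."""
    return math.comb(n_sites, k) * p_div**k * (1 - p_div) ** (n_sites - k)

p_zero = binom_pmf(0)
p_three_or_fewer = sum(binom_pmf(k) for k in range(4))
print(f"mean = {mean:.1f}, sd = {sd:.2f}, P(0) = {p_zero:.3f}, P(<=3) = {p_three_or_fewer:.2f}")
```

Seeing three or fewer differences is therefore an ordinary outcome under 8% divergence, not evidence against it.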
8)
This question does not have sharp right and wrong answers. Be sure that you can define
each of the terms and give a plausible example of how polymorphism and divergence
data would be affected by varying each term, e.g., increasing the mutation rate increases
the amount of polymorphism observed, and, all else being equal, some of these extra
polymorphisms will eventually fix and appear as increased divergence.
9)
We've posted a drift simulation on the course website. If you didn't write your own, you're encouraged
to download the program and try it out. (You'll need a Python interpreter, available from www.python.org,
if you don't have one.) If you did write your own program, you might compare its output to that of the
posted program.
10)
Middle-Earth has two distinct populations of elves: those of Lothlorien, and those of
Mirkwood. The woods of Lothlorien are much better suited to habitation than those of
Mirkwood, so there are twice as many elves in the former as in the latter. These communities
have been isolated from each other for a long enough time that allele frequencies at many loci
have changed, but they are still fully capable of interbreeding, and therefore must be
considered a single species. Each population has only 2 alleles at the disposition locus
(congenial or arrogant), but the frequencies of the alleles are not the same in the two
populations. Each population is initially in Hardy-Weinberg equilibrium with itself. Now that
the Adversary has been defeated, the communities merge on their way back to the Elf
Havens across the Sea.
a) Calculate the change in the proportion of homozygotes between the P1 generation where
the populations merge, and the F1 generation, assuming that mating is now random within the
entire Elven community.
Lothlorien
Population size = 2x
Frequency of allele = a
Mirkwood
Population size = x
Frequency of allele = b
In the P1 generation,
Allele frequency = [(freq in L)*(pop size of L) + (freq in M)*(pop size in M)]/(total pop)
= (a*2x + b*x) / (3x)
= (2a + b) / 3
Freq of homozygotes = [(freq in L)*(pop size L) + (freq in M)*(pop size M)]/(total pop)
= [(a^2)(2x) + ((1-a)^2)(2x) + (b^2)x + ((1-b)^2)(x)] / (3x)
= [2(a^2) + (2(1-a)^2) + (b^2) + ((1-b)^2)] / 3
= (2a^2 + 2 – 4a + 2a^2 + b^2 + 1 – 2b +b^2) / 3
= (4a^2 – 4a + 2b^2 – 2b + 3) / 3
In the F1 generation,
Allele frequency is unchanged (random mating does not change the total occurrence of an
allele—it merely changes the frequencies of genotypes when things were not previously in
equilibrium.)
Freq of homozygotes = [(2a + b) / 3]^2 + [1 – (2a + b) /3]^2
The difference between these frequencies of homozygotes is:
Difference = Freq after – Freq before
= [(2a + b) / 3]^2 + [1 – (2a + b) /3]^2 – [(4a^2 – 4a + 2b^2 – 2b + 3) / 3]
= (4a^2 + 4ab + b^2) / 9 + 1 – (4a + 2b)/3 + (4a^2 + 4ab + b^2) / 9
- [(4a^2 – 4a + 2b^2 – 2b + 3) / 3]
9*dif = 8a^2 + 8ab + 2b^2 + 9 – 12a – 6b – 12a^2 + 12a –6b^2 + 6b – 9
9*dif = -4a^2 – 4b^2 + 8ab
9*dif = -4( a^2 – 2ab + b^2)
dif = (-4/9) (a - b)^2
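The algebra above is easy to get wrong by a sign, so it is worth checking with exact arithmetic. The sketch below recomputes the homozygote frequencies before and after random mating for arbitrary example values of a and b (our own choices) and compares the change with (-4/9)(a - b)^2.

```python
from fractions import Fraction

def homozygosity_change(a, b):
    """Change in homozygote frequency from the merged P1 generation to F1.

    Lothlorien (allele frequency a) is twice the size of Mirkwood (frequency b).
    """
    before = (2 * (a**2 + (1 - a) ** 2) + b**2 + (1 - b) ** 2) / 3
    p = (2 * a + b) / 3  # pooled allele frequency
    after = p**2 + (1 - p) ** 2
    return after - before

# Example frequencies chosen only for illustration.
a, b = Fraction(1, 5), Fraction(7, 10)
change = homozygosity_change(a, b)
predicted = Fraction(-4, 9) * (a - b) ** 2
```

Swapping the values of a and b leaves the result unchanged, since the closed form depends only on (a - b)^2.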
b) Does the sign of the change in the proportion of homozygotes depend on whether the larger
or smaller population initially had a higher incidence of arrogance? Explain your answer.
The difference will always be negative (or 0 if the frequencies were identical to begin with).
Mixing the entire population together will decrease the total number of homozygotes, as
there would have been a relative excess of them in whichever population had an initial
frequency furthest from 0.5.
11)
The peppered moth Biston betularia can be one of two colors, white or dark brown. A single
locus with two alleles is responsible for determining the body color phenotype. Allele ‘M’ is
dominant to ‘m’, and its presence leads to a greater production of melanin that darkens the
moth’s body color.
An extremely large population of the peppered moth has thrived in a forest of dark brown and
white barked trees for many centuries. 55% of the moths in this population are white in color.
Both white and dark brown moths can survive in the forest because they are camouflaged
against trees that match their body color. When a fire sweeps through the forest, the benefits
enjoyed by the white moths are eliminated. While many of the moths and birds in the forest
survive the blaze, the dark colored moths are now better camouflaged against the dark burnt
bark of the trees. The white moths are so poorly camouflaged that they are immediately eaten
by predatory birds. None of the white moths are ever able to reproduce.
NOTE: Assume Hardy-Weinberg assumptions for the following questions including an
infinite population size.
a) What are P (MM), Q (Mm), R (mm), p (frequency of M) and q (frequency of m) for the
population of moths before the forest fire?
R = 0.55, q = R^(1/2) = 0.74, p = 1 - q = 0.26, P = p^2 = 0.068, Q = 2pq = 0.385
b) Assuming both colors of moths survive the forest fire with equal probability, what are P, Q,
R, p and q for the first generation of moths after the fire (at the time when they are born)?
The white adults are all eliminated through predation after the fire. The new genotype
frequencies among the surviving adults are:
R' = 0, P' = 0.068/(0.068 + 0.385) = 0.15, Q' = 0.385/(0.068 + 0.385) = 0.85
This means that the allele frequencies are:
p' = P' + 1/2 Q' = 0.58
q' = 1 - 0.58 = 0.42
The allele frequencies in the offspring at birth are the same as in their parents,
and the genotype frequencies are the Hardy-Weinberg ones:
P1 = (p')^2 = 0.33
Q1 = 2p'q' = 0.49
R1 = (q')^2 = 0.18
c) If white moths are destroyed generation after generation, find an equation for q_t in terms of
q_0.
This is very similar to b, but now we will express q_1 in terms of q_0 explicitly. We can then
see that there is a simple solution for q_t in the t-th generation in terms of q_0 before the fire.
$$q_1 = \frac{p_0 q_0}{p_0^2 + 2 p_0 q_0} = \frac{q_0}{p_0 + 2 q_0} = \frac{q_0}{1 + q_0}$$
similarly
$$q_2 = \frac{p_1 q_1}{p_1^2 + 2 p_1 q_1} = \frac{q_1}{p_1 + 2 q_1} = \frac{q_1}{1 + q_1} = \frac{q_0}{1 + 2 q_0}$$
and
$$q_t = \frac{q_0}{1 + t\, q_0}$$
d) After how many generations will q = 0.05? How long will it take until q is exactly 0?
$$q_t = \frac{q_0}{1 + t\, q_0} \;\Longrightarrow\; 0.05 = \frac{0.74}{1 + 0.74\, t} \;\Longrightarrow\; \frac{1}{0.05} = \frac{1}{0.74} + t$$
t = 18.65 generations
It will take infinitely many generations to reach q = 0 in a population of infinite size.
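The closed form can be compared with the generation-by-generation recursion. This is a small sketch (function names are ours) that iterates complete selection against white (mm) moths starting from q_0 = 0.74:

```python
def next_q(q):
    """Allele frequency of m after removing all mm adults and mating at random."""
    p = 1 - q
    return p * q / (p * p + 2 * p * q)

def q_after(q0, t):
    """Closed form q_t = q_0 / (1 + t * q_0)."""
    return q0 / (1 + t * q0)

q0 = 0.74
trajectory = [q0]
for _ in range(19):
    trajectory.append(next_q(trajectory[-1]))

# Solving 0.05 = q0 / (1 + t * q0) for t.
generations_to_5_percent = 1 / 0.05 - 1 / q0
```

Because t must be a whole number of generations in practice, q first drops below 0.05 in generation 19.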
12)
A great catastrophe befalls Whoville; when Horton falls asleep, some of the other residents of
the Jungle of Nool (claiming that they are acting in the best interests of the poor deranged
elephant) boil the dust speck that contains the Who's world. Only 1 male and 1 female
Who survive the horrible calamity. Horton recovers the dust speck, and with the aid of
caffeine and increased vigilance, he protects the tiny dust speck long enough for the survivors
to produce a (F1) generation of 10 Whos. The ability for these Whos to produce substantial
noise is controlled by a simple 2-allele locus, and before the catastrophe the incidence of the
loud allele was .25, while the incidence of the quiet allele was .75. Assuming that the Whos
were in Hardy-Weinberg equilibrium before the catastrophe,
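The easiest way to do this is to set up a table. Here p denotes the quiet allele
(frequency 3/4) and q the loud allele (frequency 1/4); the two survivors carry 4 alleles
and the 10 F1 Whos carry 20.

P1 type   Probability of obtaining   Prob p extinct in F1   Prob q extinct in F1
pppp      (3/4)^4                    0                      1
pppq      4[(3/4)^3](1/4)            (1/4)^20               (3/4)^20
ppqq      6[(3/4)^2][(1/4)^2]        (2/4)^20               (2/4)^20
pqqq      4(3/4)[(1/4)^3]            (3/4)^20               (1/4)^20
qqqq      (1/4)^4                    1                      0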
a) Use a Wright-Fisher model to predict the probability of the quiet allele being extinct in the
F1 generation.
The total probability that p goes extinct by F1 is the sum of the products of the probability of
getting a certain P1 type and the probability of p going extinct in F1 from that type.
In this case, that means
4(3/4)^3 (1/4)(1/4)^20 + 6(3/4)^2 (1/4)^2 (1/2)^20 + 4(3/4)(1/4)^3 (3/4)^20 + (1/4)^4 =~ 0.406%
b) Use a Wright-Fisher model to predict the probability of the loud allele being extinct in the
F1 generation.
The total probability that q goes extinct by F1 is the sum of the products of the probability of
getting a certain P1 type and the probability of q going extinct in F1 from that type.
In this case, that means:
(3/4)^4 + 4(3/4)^3 (1/4)(3/4)^20 + 6(3/4)^2 (1/4)^2 (1/2)^20 + 4(3/4)(1/4)^3 (1/4)^20 =~ 31.8%
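Both sums have the same structure, so a single helper covers parts a and b. The sketch below is an illustration of the table's logic rather than a general Wright-Fisher simulator; the function name and argument names are our own. It sums over the number of copies of the allele among the 4 founder alleles and multiplies by the chance that none of the 20 F1 alleles is that allele.

```python
from fractions import Fraction
from math import comb

def extinction_probability(freq, founder_alleles=4, offspring_alleles=20):
    """Probability that an allele with pre-catastrophe frequency `freq` is absent in F1."""
    total = Fraction(0)
    for k in range(founder_alleles + 1):  # k = copies among the founders
        p_founders = comb(founder_alleles, k) * freq**k * (1 - freq) ** (founder_alleles - k)
        p_lost = (1 - Fraction(k, founder_alleles)) ** offspring_alleles
        total += p_founders * p_lost
    return total

quiet_lost = extinction_probability(Fraction(3, 4))
loud_lost = extinction_probability(Fraction(1, 4))
print(f"quiet lost: {float(quiet_lost):.4%}, loud lost: {float(loud_lost):.1%}")
```

Nearly all of the extinction probability comes from the founders themselves lacking the allele; loss during the single generation of reproduction adds very little.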
c) Identify the shortcomings of the Wright-Fisher model in this example (i.e., what might
actually happen in reality that wouldn't be indicated in the Wright-Fisher model). Give an
example in which these shortcomings could substantially change the probabilities calculated in
parts a and b.
The Wright-Fisher model assumes that gametes are drawn from an infinite pool, and
doesn’t take into account the biology of the organisms. The actual probabilities would
depend on the genotypes of the surviving male and female and not just on the allele
frequencies. For example, if the male is homozygous for the loud allele and the female is
homozygous for the quiet allele, then all 10 offspring will be heterozygous for these alleles
and the probability of any frequency other than 50/50 will be zero. Wright-Fisher also assumes
that organisms can self. With sufficiently large populations this assumption does not matter
too much, but it does affect results significantly when the population sizes are very small.
13)
Scientists discover a very basic form of life on Mars. The genetic system is based on only four
amino acids - X, Y, Z, and W - and two nucleotides – A and B. Codons in this new system are
only two nucleotides long and code for the amino acids according to the following table.
Nucleotides Amino Acids
AA
X
AB
Y
BA
Z
BB
W
Assuming that a nucleotide changes to any other in a given time step t with probability b,
answer the following questions about the substitutions in a particular neutral amino acid
sequence.
HINTS:
• for simplicity assume that only a single substitution can occur in a given time step.
• compare the problem to Kimura’s 2-parameter correction
a) What is the equation K(t) for the expected number of amino acid changes that actually do
occur after time t.
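[Figure: the four amino acids arranged in a square, X (AA), Y (AB), W (BB), Z (BA). Each
edge of the square connects codons that differ by one nucleotide and is labeled with rate b.]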
The number of amino acid changes that take place in a given time is given
by K(t) = 2bt
b) Find a correction for the number of changes that have occurred (K) in terms of the number
of observed amino acid substitutions (Hint: there are two types of substitutions. Include both
in the formula for K.)
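For part b, it is easiest to recognize the similarity of this model to the
Kimura model.
[Figure: Kimura two-parameter model for A, T, G, and C, with transition rate a and
transversion rate b.]
Switching the positions of G and C we see the following.
[Figure: the same diagram with G and C switched.]
Setting a equal to 0 we have the following. This looks very similar to our
model.
[Figure: the diagram with a = 0, so that only the rate-b edges remain.]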
Seeing the similarity to Kimura's model we can answer the rest of the
questions. Substituting a = 0 in the Kimura equation gives
$$I(t) = \Pr\nolimits_{XX}(t) = \frac{1}{4} + \frac{1}{4} e^{-4bt} + \frac{1}{2} e^{-2bt}$$
For the Kimura model the probabilities of transitions and transversions are
given by:
$$\text{transitions:}\quad Y(t) = \frac{1}{4} + \frac{1}{4} e^{-4bt} - \frac{1}{2} e^{-2(a+b)t}$$
$$\text{transversions:}\quad Z(t) = \frac{1}{4} - \frac{1}{4} e^{-4bt}$$
In our case these equations can give us the equations we are asked for. All
we have to do is set a = 0.
$$\Pr(\text{change to a particular neighbor state}) = \Pr\nolimits_{XY}(t) = \Pr\nolimits_{XZ}(t) = \frac{1}{4} - \frac{1}{4} e^{-4bt}$$
$$\Pr(\text{change to the state two subs. away}) = \Pr\nolimits_{XW}(t) = \frac{1}{4} + \frac{1}{4} e^{-4bt} - \frac{1}{2} e^{-2bt}$$
Let A(t) and B(t) be
$$A(t) = \Pr(\text{change to any neighbor state}) = 2\Pr\nolimits_{XZ}(t) = \frac{1}{2} - \frac{1}{2} e^{-4bt}$$
$$B(t) = \Pr(\text{change to the state two subs. away}) = \Pr\nolimits_{XW}(t) = \frac{1}{4} + \frac{1}{4} e^{-4bt} - \frac{1}{2} e^{-2bt}$$
Then
$$A(t) + 2B(t) = \frac{1}{2} - \frac{1}{2} e^{-4bt} + \frac{1}{2} + \frac{1}{2} e^{-4bt} - e^{-2bt} = 1 - e^{-2bt}$$
so that
$$2bt = \ln\frac{1}{1 - A - 2B}$$
and, because K(t) = 2bt,
$$K = \ln\frac{1}{1 - A - 2B}$$
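An independent way to check this result is to build A(t) and B(t) from first principles rather than from the Kimura formulas. The sketch below assumes each of the two nucleotide sites flips between A and B independently at rate b, so that a site is unchanged after time t with probability 1/2 + 1/2 e^{-2bt}; the function names and parameter values are our own.

```python
import math

def observed_classes(b, t):
    """Return (A, B): probability of a one-step change and of a two-step change."""
    same = 0.5 + 0.5 * math.exp(-2 * b * t)  # one site ends in its starting state
    diff = 1 - same
    one_step = 2 * same * diff   # exactly one of the two sites differs (Y or Z)
    two_step = diff * diff       # both sites differ (W)
    return one_step, two_step

def corrected_k(a_obs, b_obs):
    """K = ln(1 / (1 - A - 2B))."""
    return math.log(1 / (1 - a_obs - 2 * b_obs))

# Arbitrary illustrative values.
b, t = 0.01, 30
a_obs, b_obs = observed_classes(b, t)
recovered_k = corrected_k(a_obs, b_obs)
expected_k = 2 * b * t
```

Under these assumptions the correction should return K = 2bt, and the expressions for A and B agree with those obtained by setting a = 0 in the Kimura model.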
14)
Elephant seals possess an interesting system of mating. One alpha-male lies in the center of a
harem of females and attempts to mate with as many of these females as possible throughout
the course of the mating season. The harem is also surrounded by 5 beta-males, usually
younger and smaller, who lie around the harem and protect it from invasion by other males.
When the alpha-male starts to mate, the beta-males use his distraction to do some mating of
their own with the females at the outer edges of the harem. Other males, not alphas or betas,
stay in the water and are not allowed to mate at all.
One would think that the alpha male sires the most offspring in this situation, however,
recent genetic studies have found that beta-males as a group actually have at least as much
mating success as the alpha-male. From a typical harem 50% of the children are sired from
beta males and 50% are sired from the alpha. Given that there are 40 alpha-males, and 2000
females in the population, assuming that each of the harems is exactly the same size, assuming
that all beta-males have exactly the same probability of having an offspring, and assuming that
all females are fertile and available for mating, calculate the effective population size for this
population of elephant seals.
NOTE: assume non-overlapping generations for simplicity.
In general:
number of alpha males = N
number of beta males = 5N
number of females = N_f
Let F be the autozygosity, starting at time t_0. Then
F_1 = Pr(same alpha male alleles) + Pr(same beta male alleles) + Pr(same female alleles)
More explicitly,
F_1 = Pr(both alleles from alpha males) × Pr(same allele | both from alpha males)
+ Pr(both alleles from beta males) × Pr(same allele | both from beta males)
+ Pr(both alleles from females) × Pr(same allele | both from females)
Breaking this up:
Pr(both alleles from alpha males) = (Pr(allele is from male) × Pr(from alpha | from male))^2
= (1/2 × 1/2)^2 = 1/16
Pr(same allele | both from alpha males) = 1/(2N)
Pr(both alleles from beta males) = (1/2 × 1/2)^2 = 1/16
Pr(same allele | both from beta males) = 1/(2(5N))
Pr(both alleles from females) = (1/2)^2 = 1/4
Pr(same allele | both from females) = 1/(2N_f)
so that
$$F_1 = \frac{1}{16}\cdot\frac{1}{2N} + \frac{1}{16}\cdot\frac{1}{2(5N)} + \frac{1}{4}\cdot\frac{1}{2N_f} = \frac{3N_f + 10N}{80\, N\, N_f}$$
Because F_1 is normally 1/(2N_e),
$$\frac{1}{2N_e} = \frac{3N_f + 10N}{80\, N\, N_f} \quad\Longrightarrow\quad N_e = \frac{40\, N\, N_f}{3N_f + 10N}$$
In our case N = 40 and N_f = 2000, so
$$N_e = \frac{40(40)(2000)}{3(2000) + 10(40)} = \frac{(40)(2000)}{3(50) + 10} = \frac{(4)(2000)}{3(5) + 1} = \frac{8000}{16} = 500$$
Note: you could also just work through the problem using numbers (i.e. without finding a
general equation)
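Working "using numbers" amounts to plugging the three class sizes into the expression for F_1. The sketch below does this with exact fractions, assuming (as in the problem) that alphas and betas as groups each sire half of the offspring; the function name is our own.

```python
from fractions import Fraction

def effective_size(n_alpha, n_beta, n_female):
    """Ne from F1 = 1/(2 Ne), summing over the three classes of parents."""
    f1 = (
        Fraction(1, 16) * Fraction(1, 2 * n_alpha)     # both alleles via alpha males
        + Fraction(1, 16) * Fraction(1, 2 * n_beta)    # both alleles via beta males
        + Fraction(1, 4) * Fraction(1, 2 * n_female)   # both alleles via females
    )
    return 1 / (2 * f1)

ne = effective_size(n_alpha=40, n_beta=5 * 40, n_female=2000)
print(ne)
```

Even though 2,240 animals breed, Ne is only 500, because the 40 alpha males contribute a quarter of all gene copies.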
15)
You find out that your favorite protein is translated from the following mRNA:
CUAUGGCAACAUCAUCAGCGGCA
a) Write down the amino acid sequence of the protein if translation starts at the first Met codon
encountered in the mRNA sequence
MET-ALA-THR-SER-SER-ALA-ALA
b) Now assume that translation starts at the first Tyr codon. Write down the Amino Acid
sequence of the new protein.
TYR-GLY-ASN-ILE-ILE-SER-GLY
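The two reading frames can be translated mechanically. The sketch below builds the standard genetic code from the usual UCAG-ordered amino-acid string and translates from the first occurrence of a given start codon; the helper name is our own, it returns one-letter amino acid codes, and it does not handle stop codons specially because neither frame here contains one.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_from(mrna, start_codon):
    """Translate complete codons starting at the first occurrence of start_codon."""
    start = mrna.find(start_codon)
    if start < 0:
        raise ValueError(f"{start_codon} not found")
    codons = (mrna[i:i + 3] for i in range(start, len(mrna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

mrna = "CUAUGGCAACAUCAUCAGCGGCA"
protein_a = translate_from(mrna, "AUG")  # frame used in part a
protein_b = translate_from(mrna, "UAU")  # frame used in part b
```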
c) Which protein should experience more amino acid changes per unit time if all cytosines in
your organism are methylated? Explain.
Sequence A (the protein from part a). In this sequence the cytosines in the corresponding
DNA sequence are all in 2nd positions of codons, and thus every change to a T would be
nonsynonymous. In Sequence B (the protein from part b) the cytosines in the corresponding
DNA sequence are all in 3rd positions, thus generating mostly synonymous
substitutions.
16)
Imagine that you have cloned two homologous genes in two species of Drosophila. You
sequence and align them. Below are the short parts of the alignments of the expected mRNA
sequences. The sequences are parsed into expected codons (for instance, the first codon in the
Drosophila simulans sequence of Gene 1 is AUC)
Gene 1:
Drosophila simulans:     AUC-ACC-CAC-CAA-CAG-UUC-UGU-GCU
Drosophila melanogaster: AUG-ACA-CAC-CAA-CGG-UUC-UGC-GAU
Gene 2:
Drosophila simulans:     ACA-GAU-GGU-CCU-CGC-GUG
Drosophila melanogaster: ACA-CUU-AGU-AUU-CAC-GCA
a) Above are coding sequences from two genes in two closely related species, D.simulans and
D.melanogaster. Assuming that the path with the fewest nonsynonymous substitutions
represents the true path between the two codons, calculate Ka and Ks for both genes.
Gene 1:
Pa = 3, Na = 19.33, Pa/Na = 0.15, Ka = -(3/4) ln(1 - (4/3)(Pa/Na)) = 0.17
Ps = 2, Ns = 4.06, Ps/Ns = 0.49, Ks = -(3/4) ln(1 - (4/3)(Ps/Ns)) = 0.8
Ka/Ks = 0.21
Gene 2:
Pa = 7, Na = 13, Pa/Na = 0.54, Ka = 0.95
Ps = 1, Ns = 4.63, Ps/Ns = 0.21, Ks = 0.25
Ka/Ks = 3.8
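Given the counts of differences and sites, Ka and Ks are just Jukes-Cantor corrections applied to the two proportions. The sketch below takes the counts listed above as given (it does not recount sites from the alignment) and recomputes the distances; the data structure is our own.

```python
import math

def jukes_cantor(p):
    """K = -(3/4) ln(1 - (4/3) p)."""
    return -0.75 * math.log(1 - 4 * p / 3)

# Counts of nonsynonymous (Pa/Na) and synonymous (Ps/Ns) differences and sites.
genes = {
    "Gene 1": {"Pa": 3, "Na": 19.33, "Ps": 2, "Ns": 4.06},
    "Gene 2": {"Pa": 7, "Na": 13, "Ps": 1, "Ns": 4.63},
}

results = {}
for name, g in genes.items():
    ka = jukes_cantor(g["Pa"] / g["Na"])
    ks = jukes_cantor(g["Ps"] / g["Ns"])
    results[name] = (ka, ks, ka / ks)
    print(f"{name}: Ka = {ka:.2f}, Ks = {ks:.2f}, Ka/Ks = {ka / ks:.2f}")
```

The ratios computed from unrounded Ka and Ks differ slightly in the second decimal from ratios of the rounded values, but the conclusion (well below 1 for Gene 1, well above 1 for Gene 2) is the same.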
b) Would you be surprised to learn that both genes have no function at the level of the protein?
Explain.
Yes, it would be surprising. Genes that do not have a function at the level of the protein are
expected to have Ka/Ks equal to 1. In both cases Ka/Ks seems distinct from 1, although to be
certain one needs to do appropriate statistical tests. Prima facie it is surprising.
c) Would you be surprised to learn that Gene 2 has been under strong selection to change its
protein sequence? Explain.
No, you wouldn't be. It appears that there are more amino acid substitutions than one would
expect under the assumption that all amino acid substitutions are neutral and have fixed
through genetic drift. The idea of natural selection promoting such changes would be
consistent with the observations.
17)
You are given two sequences (50 bp each) of the homologous pseudogene in two species of
yeast. The alignment is shown below:
CCTCGACGGCTTAGATCTGATCTGACCTAATGCTGCAATCGTTACAAAGT
CCTCCACGAGTAAGAGTTGATCCGACTTAGTCCTGCGATCGTTAGATAAT
You know that these species last shared a common ancestor 10 MYA and that both species go
through 50 generations a year.
a) Using Jukes-Cantor model of nucleotide substitution, estimate mutation rate per nucleotide
per year in these two species of yeast. Assume that mutation rate is the same in both species.
There are 14 substitutions in 50 bp. P = 14/50 = 0.28. K = -(3/4) ln(1 - (4/3) P) = 0.35.
K = 2T (years) × µ (mutations per bp per year)
µ = 0.35 / 20 Myr = 1.75 × 10^-8 mutations per bp per year
b) Do you believe that Jukes-Cantor model is appropriate in this case? Do you see any
evidence that you need to use Kimura 2-parameter model instead? Please explain.
There are 7 transitions and 7 transversions. Under the Jukes-Cantor model we would expect all
rates to be equal and thus to have twice as many transversions as transitions. The Kimura model
would probably be more appropriate here.
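The counts in parts a and b can be obtained directly from the alignment. The sketch below classifies each difference as a transition (purine to purine or pyrimidine to pyrimidine) or a transversion, then applies the Jukes-Cantor correction with the 10 MYA divergence time given in the problem; the variable names are our own.

```python
import math

seq1 = "CCTCGACGGCTTAGATCTGATCTGACCTAATGCTGCAATCGTTACAAAGT"
seq2 = "CCTCCACGAGTAAGAGTTGATCCGACTTAGTCCTGCGATCGTTAGATAAT"

PURINES = set("AG")
transitions = transversions = 0
for x, y in zip(seq1, seq2):
    if x == y:
        continue
    if (x in PURINES) == (y in PURINES):
        transitions += 1    # A<->G or C<->T
    else:
        transversions += 1  # purine <-> pyrimidine

p = (transitions + transversions) / len(seq1)
k = -0.75 * math.log(1 - 4 * p / 3)     # Jukes-Cantor distance
mu_per_year = k / (2 * 10_000_000)      # two lineages, 10 million years each
print(transitions, transversions, round(k, 2), mu_per_year)
```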
c) You sample 10 alleles of this pseudogene different by origin in the population of one of
these species of yeast. You sequence the same 50 bp region in all 10 cases and find that there
are 5 distinct alleles in the following proportions:
Allele    Number of copies
allele 1  6
allele 2  1
allele 3  1
allele 4  1
allele 5  1
Estimate the effective population size in this species of yeast.
First we estimate heterozygosity
H = 1 - ((0.6)^2 + 4 × (0.1)^2) = 0.6
θ = H/(1 - H) = 1.5
Per nucleotide, θ_bp = 1.5/50 = 0.03
Per generation, µ_bp = 1.75 × 10^-8 / 50 = 3.5 × 10^-10 mutations per bp per generation
4 Ne µ_bp = 0.03
and thus
Ne = 2.1 × 10^7
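The chain of conversions (allele counts to H, H to θ, θ per locus to θ per bp, per-year to per-generation mutation rate) is where unit mistakes tend to creep in, so here it is spelled out as a sketch. It follows the solution's assumptions, including θ = H/(1 - H) and θ = 4Neµ; the variable names are our own.

```python
counts = [6, 1, 1, 1, 1]          # copies of each distinct allele in the sample
n = sum(counts)

# Expected heterozygosity: one minus the sum of squared allele frequencies.
h = 1 - sum((c / n) ** 2 for c in counts)

theta_locus = h / (1 - h)          # theta for the whole 50 bp region
theta_bp = theta_locus / 50        # per nucleotide

mu_per_year = 1.75e-8              # from part a
mu_per_generation = mu_per_year / 50   # 50 generations per year

ne = theta_bp / (4 * mu_per_generation)
print(f"H = {h:.2f}, theta_bp = {theta_bp:.3f}, Ne = {ne:.2e}")
```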
18)
In Drosophila the rate of all transitions and transversions are all very similar except for the
transition from C to T (or equivalently from G to A). You know that the average GC content of
neutral, entirely unconstrained DNA in Drosophila is 34%. Please estimate the relative
probability of a C to T transitions versus T to C transition.
Let all rates except C to T be equal to a and the rate C to T be equal to b. Then the rate of
changes from A or T nucleotides to G or C nucleotides per unit time in an infinitely long
stretch of neutral DNA is
(AT%)*4a,
because both A and T have either G or C to change to and each change happens with the
rate a.
The number of changes from G or C nucleotides similarly is
(GC%)*(2a+2b)
At equilibrium the rates are equal. Thus
(AT%)*4a = (GC%)*(2a+2b)
Writing g = GC% = 0.34 and AT% = 1 - g, this gives
b/a = (2-3g)/g = 2.9
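A quick numerical check confirms both the ratio and the balance condition it came from. The sketch below sets a = 1 (only the ratio b/a matters) and uses the function name `rate_ratio` for illustration.

```python
def rate_ratio(gc):
    """b/a implied by an equilibrium GC content gc: (2 - 3g) / g."""
    return (2 - 3 * gc) / gc

g = 0.34
ratio = rate_ratio(g)

# Verify the equilibrium condition (AT%) * 4a = (GC%) * (2a + 2b) with a = 1.
a, b = 1.0, ratio
flux_at_to_gc = (1 - g) * 4 * a
flux_gc_to_at = g * (2 * a + 2 * b)
print(f"b/a = {ratio:.2f}")
```

Note that a GC content of 50% gives b/a = 1, as expected when all rates are equal.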
19)
Your friend just sequenced the full genome of a new species of bacteria. The GC content in this
species is 50%. Your friend decides to calculate the proportion of different kinds of nucleotide
pairs (dinucleotides) and finds among other things that C’s are followed by A’s 40% of the
time while 60% of the time they are followed by the other three nucleotides. This reminds you
of a similar pattern in the human genome. From what you know about methylation-dependent
deamination, please advise your friend which other dinucleotide frequencies he needs to look
at and give qualitative predictions of what he should see.
It appears that there is an overabundance of CpA dinucleotides relative to the random
expectation. Because the GC content is 50%, we would expect dinucleotides present at equal
proportions, with CpA frequency being equal to that of CpG, CpC, or CpT. We know that in
human DNA many cytosines are methylated in CpG pairs, resulting in a higher
probability of mutation to CpA or TpG. This would generate the observed increase in the
proportion of CpA pairs. In addition it would increase the proportion of TpG pairs and
decrease the proportion of CpG pairs relative to random expectation.
20-30) See problems 2-12 in the other pdf file
31)
We can write Δq = -q^2 p s / (1 - q^2 s). Since q is the frequency of the deleterious
allele, q raised to a higher power, say 2, will be close to zero, and p will be close to 1.
Thus, the denominator of Δq is approximately 1. Δq is the change in q due to selection.
The change due to mutation is (1 - q) µ, where µ is the mutation rate. These changes add
up to zero at mutation-selection equilibrium, so
(1 - q) q^2 s = (1 - q) µ
q = (µ/s)^(1/2)