Bio 113 Winter 2005/6 Problem Set 1 Solutions

1) In section discussion, we saw that for n nucleotides, the probability that, starting with nucleotide A at time zero, we end up with nucleotide A at time t is

[1] $\Pr[A \to A]_t = \frac{1}{n} + \frac{n-1}{n} e^{-n\alpha t}$

The probability that, starting with a nucleotide A at time zero, we end up with one particular nucleotide other than A (call it B) at time t is

[2] $\Pr[A \to B]_t = \frac{1}{n} - \frac{1}{n} e^{-n\alpha t}$

where $\alpha$ is the probability of a particular nucleotide change in one unit of time. Below we plot functions [1] and [2] for n = 7 and $\alpha$ = 0.01.

[Figure: Pr[A→A]_t and Pr[A→not-A]_t plotted against t]

Now consider a particular nucleotide site in homologous sequences of two species. We want to know I, the probability of identity, i.e. the probability that nucleotides are the same in the two species. Without loss of generality, let's call the ancestor nucleotide A. I is the probability that, given one of the ancestor possibilities, the nucleotides in both of the present-day species are the same. In other words,

I = Pr[starting with A, we get A in both species that evolve independently up to time t] + Pr[starting with A, we get B in both species that evolve independently up to time t] + Pr[starting with A, we get C in both species that evolve independently up to time t] + ...
= Pr[starting with A, we get A in species 1] Pr[starting with A, we get A in species 2] + Pr[starting with A, we get B in species 1] Pr[starting with A, we get B in species 2] + ...
= $(\Pr[A \to A]_t)^2 + (n-1)(\Pr[A \to B]_t)^2$

where the factor (n-1) comes from the fact that, given that we fixed the ancestral nucleotide, there are n-1 possible other nucleotides that could be found in the descendant sequences. Substituting expressions [1] and [2], we get

$I = \left(\frac{1}{n} + \frac{n-1}{n} e^{-n\alpha t}\right)^2 + (n-1)\left(\frac{1}{n} - \frac{1}{n} e^{-n\alpha t}\right)^2 = \ldots = \frac{1}{n} + \frac{n-1}{n} e^{-2n\alpha t}$

From this, sequence divergence (the probability that nucleotides that came from the same ancestral nucleotide are different in the two species) is

[3] $D = 1 - I = \frac{n-1}{n} - \frac{n-1}{n} e^{-2n\alpha t} = \frac{n-1}{n}\left(1 - e^{-2n\alpha t}\right)$

Now, we define K as the number of substitutions that actually happened in the two lineages during the time t. In a given unit of time, the probability that the existing nucleotide will change to some other nucleotide is $(n-1)\alpha$. Therefore, in one of the lineages, the expected number of nucleotide changes during time t is $(n-1)\alpha t$. The number of changes in both lineages is twice this number:

$K = 2(n-1)\alpha t$

Now we wish to relate D and K. The exponent in [3] is just equal to $-nK/(n-1)$, which means that

$D = \frac{n-1}{n}\left(1 - e^{-nK/(n-1)}\right) \quad\Rightarrow\quad e^{-nK/(n-1)} = 1 - \frac{n}{n-1} D$

Taking the natural logarithm of both sides in the right-hand equation, and then multiplying both sides by $-(n-1)/n$, we get

$K = -\frac{n-1}{n} \ln\left(1 - \frac{n}{n-1} D\right)$

For n = 7, this becomes

$K = -\frac{6}{7} \ln\left(1 - \frac{7}{6} D\right)$

2) Isoleucine (I) is much more similar to leucine (L) than to serine (S) in terms of several physico-chemical properties (see, for instance, Figure 1.9 of Graur and Li). Accordingly, one possible explanation for the above pattern is that nucleotide mutations that lead to a change from I to L are generally not as harmful to the organism as those from I to S. As a result, purifying natural selection purges mutants of the second type from the population more effectively than those of the first type, allowing the mutations of the first type to fix more often.

Another possibility is that mutational patterns, as opposed to selection pressures, are causing the difference in substitution rates. For example, note that the codons for I are AUU, AUC, and AUA. For each of them, there is a one-nucleotide mutation that leads to a codon for L (CUU, CUC, and CUA, respectively). However, at least two of the nucleotides need to change in any of the codons for I in order to reach any of the S codons UCU, UCC, UCA, and UCG.
Thus, if each single-nucleotide mutation is equally likely, we expect to see a larger number of I -> L substitutions than I -> S substitutions, even in the absence of natural selection that favors either substitution type more than the other.

A good way to distinguish between selection and mutation as the cause of such substitution patterns is to compare them to the substitution patterns in putatively non-functional regions, where selection is much less likely to be in effect. For instance, we can look at the rates of substitution between different nucleotides in pseudogenes, and assume that these "background" substitution rates represent the underlying mutational patterns. We can then check whether these background substitution rates can explain the amino-acid substitution patterns in protein-coding regions.

3) In mammalian genomes, in addition to neighbor-independent single-nucleotide substitution rates, there is a known neighbor-dependence effect: when a C is upstream of a G nucleotide, the C->T transition rate is greatly elevated. This results in a frequent conversion of CpG dinucleotides into TpG dinucleotides. Note that when such an event happens on one of the DNA strands, on the opposite strand a CpG dinucleotide is converted into a CpA dinucleotide. In other words, this phenomenon has two consequences: the nucleotide upstream of a G is more likely to be a T than a C, and the nucleotide downstream of a C is more likely to be an A than a G.

In the microbial genome under study, we observe the opposite pattern: the nucleotide downstream of a C is more likely to be a G than an A. This could result from a phenomenon opposite to that found in the mammalian genomes, so that TpG dinucleotides are converted to CpG dinucleotides at an extra-high rate (or, equivalently, CpA dinucleotides tend to be converted to CpG).
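As a numerical check on the correction derived in problem 1, the following is a minimal sketch rather than part of the original solution. It assumes that each nucleotide changes to any one of the other n-1 nucleotides at rate alpha, so that the true number of substitutions in the two lineages is K = 2(n-1)*alpha*t. The function names and the value t = 20 are illustrative choices only.

```python
import math

def p_same(n, alpha, t):
    """Equation [1]: probability that A is still A after time t."""
    return 1.0 / n + (n - 1) / n * math.exp(-n * alpha * t)

def p_other(n, alpha, t):
    """Equation [2]: probability that A has become one particular other nucleotide."""
    return 1.0 / n - 1.0 / n * math.exp(-n * alpha * t)

def identity(n, alpha, t):
    """Probability that two independently evolving lineages carry the same nucleotide."""
    return p_same(n, alpha, t) ** 2 + (n - 1) * p_other(n, alpha, t) ** 2

def corrected_k(n, d):
    """Correction K = -((n-1)/n) * ln(1 - n*D/(n-1)) for n equally likely states."""
    return -(n - 1) / n * math.log(1.0 - n * d / (n - 1))

# Assumed illustrative values: n and alpha follow the plot in problem 1, t is arbitrary.
n, alpha, t = 7, 0.01, 20.0
d = 1.0 - identity(n, alpha, t)      # observed divergence D
k_true = 2 * (n - 1) * alpha * t     # substitutions that actually occurred (assumption above)
k_est = corrected_k(n, d)            # substitutions recovered from D
print(d, k_true, k_est)
```

Under these assumptions the corrected value matches the true number of substitutions, while D itself stays below (n-1)/n however large t becomes.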
4) Just as in the derivation of the Jukes-Cantor correction, we require a couple of quantities. First, we need the probability that a nucleotide starts at A (T, G, or C) and is A (T, G, or C) at time t. This probability, independent of which nucleotide we consider, is given to us on page 73 of Graur and Li:

$X = \frac{1}{4}\left(1 + e^{-4\beta t} + 2e^{-2(\alpha+\beta)t}\right)$

We also need the probability that a nucleotide, starting at A, is some different nucleotide at time t. In the Kimura two-parameter model, this probability is split up into two mutually exclusive cases: the first is where the nucleotide at time t differs from the initial one by a transition (call this probability Y) and the second is where it differs by one particular transversion (call this probability Z). Again, from the text:

$Y = \frac{1}{4}\left(1 + e^{-4\beta t} - 2e^{-2(\alpha+\beta)t}\right)$
$Z = \frac{1}{4}\left(1 - e^{-4\beta t}\right)$

Now, as indicated in the hint, we find P = (probability two sequences differ by a transition) and Q = (probability two sequences differ by a transversion). We will step through the derivation of Q (P is derived with similar reasoning). Two sequences can differ by a transversion at time t, assuming they started at the same nucleotide, in two ways. First, one sequence stays the same with probability X and the other has a transversion with probability Z. There are two types of transversions (from any particular nucleotide) and we can reverse the order of which sequence stays the same, so the probability of this happening is 4XZ. The other way to get a difference by a transversion is if one sequence has a transition and the other has a transversion. This occurs with probability 4YZ. Thus, $Q = 4XZ + 4YZ$. Using similar reasoning, $P = 2Z^2 + 2XY$. By plugging and chugging we get

$P = \frac{1}{4}\left(1 + e^{-8\beta t} - 2e^{-4(\alpha+\beta)t}\right)$
$Q = \frac{1}{4}\left(2 - 2e^{-8\beta t}\right)$

We're interested in finding K, the number of substitutions per site since divergence, which is $2(\alpha+2\beta)t$. We want to manipulate P and Q so that we can construct K.
By rearranging and taking logarithms,

(1) $-8\beta t = \ln(1 - 2Q)$
(2) $-4(\alpha+\beta)t = \ln(1 - Q - 2P)$

Divide equation (1) by -4 and equation (2) by -2, add the results, and we get K! So,

$K = -\frac{1}{4}\ln(1 - 2Q) - \frac{1}{2}\ln(1 - Q - 2P)$

To get the exact result in Graur and Li, remember that -ln A = ln (1/A).

5) First, calculate the allele frequencies from the data. Use the formula x = freq(A) = AA + 1/2 AO + 1/2 AB to find the frequency of A, and similar formulae to find the frequencies of B and O (y and z). Then, use Hardy-Weinberg to find the genotype frequencies in the next generation. For three alleles, the genotype frequencies sum to 1 and are

$(x + y + z)^2 = x^2 + y^2 + z^2 + 2xy + 2xz + 2yz$

where the homozygotes are first and the heterozygotes last.

Allele Frequency
A = 0.26
B = 0.13
O = 0.61

Genotype Frequency
AA = 0.0676
BB = 0.0169
OO = 0.3721
AB = 0.0676
AO = 0.3172
BO = 0.1586

6) The size of Crake's population after ten years is 100 × 2^10 = 102400.

a. The variance effective population size is the harmonic mean of the population sizes at each of the time points in consideration.
Ne = 1 / (1/11 (1/100 + 1/200 + 1/400 + 1/800 + 1/1600 + 1/3200 + 1/6400 + 1/12800 + 1/25600 + 1/51200 + 1/102400)) = 550.3

b. The observed total heterozygosity is an estimator of θ, the population mutation parameter, and θ = 4Neµ. If θ = 2.2 × 10^-9 and Ne = 550.3, then µ = 10^-12, which is three orders of magnitude smaller than average (Crake succeeded).

7) a) Yes. The key is that the reading frame was not specified in the question. If the first three bases, 'TGA', were in frame, the sequence could not be coding because TGA is a stop codon. However, if the frame starts with the second base, giving GAG as the first codon, then there are no stop codons in the transcript and the sequence could code for a protein.

b) Two, at positions 12 and 25.

c) Assuming that the sequence has evolved neutrally, Theta_W = NumSegSites / Sum(1/(i - 1), i=2..n) / NumSites (Gillespie p. 43). Then Theta_W = 2 / Sum(1/(i - 1), i=2..6) / 40 ~ 0.0219.
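The Watterson estimate just computed can be reproduced with a few lines. This is only a sketch of the formula quoted above; the function and argument names are made up for illustration, and the inputs (2 segregating sites, 6 sequences, 40 sites) are the ones used in part c.

```python
def watterson_theta(num_seg_sites, num_sequences, num_sites):
    """Per-site Watterson estimator: S / a_n / L, with a_n = sum_{i=2..n} 1/(i-1)."""
    a_n = sum(1.0 / (i - 1) for i in range(2, num_sequences + 1))
    return num_seg_sites / a_n / num_sites

# Values from part c: 2 segregating sites, 6 sampled sequences, 40 sites.
theta_w = watterson_theta(2, 6, 40)
print(round(theta_w, 4))
```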
Under neutrality, Theta_W is an estimator of 4 * Ne * u, and we can estimate u by Theta_W/(4 * Ne). If Ne = 10^6, then u ~ 5.47 * 10^-9 /gen.

d) Three, at positions 12, 20, and 25.

e) i) By raw divergence, we see 3 changes per 40 sites, or 3/40 changes per site. Divergence accrues in proportion to the mutation rate and time since speciation, K = 2 * u * t. Thus t = K / (2*u).
t = (3/40) / (2 * 5.47 * 10^-9) ~ 6.9 * 10^6 generations.
If the species have ten generations per year, then we estimate the time since divergence to be 6.9 * 10^5 = 690,000 years.

ii) The Jukes-Cantor correction is K = -(3/4) * ln(1 - 4 * p / 3). Plugging in p = 3/40, we get K ~ 0.079. (This is greater than 3/40 (0.075), which makes sense because the observed number of differences underestimates the true number.) If K ~ 0.079, then
t = (0.079) / (2 * 5.47 * 10^-9) ~ 7.2 * 10^6 generations
and the time since speciation is estimated at 720,000 years.

iii) To use the Kimura correction, we first need to estimate P, the proportion of transitions, and Q, the proportion of transversions. The three site differences are C & T (a transition), A & G (a transition), and A & T (a transversion). Then P = 2/40 and Q = 1/40. The correction formula is K = (1/2) ln(1/(1-2P-Q)) + (1/4) ln(1/(1-2Q)), so:
K = (1/2) ln(1/(1-2*(2/40)-(1/40))) + (1/4) ln(1/(1-2*(1/40))) ~ 0.0796
t = (0.0796) / (2 * 5.47 * 10^-9) ~ 7.3 * 10^6 generations
and the time since speciation is estimated at 730,000 years.

f) Only two sites differ between sim2 and mel, so the estimate for i) is 2/40 -> 460,000 years, with corresponding decreases in ii) and iii). Which estimate of divergence should we trust, the one using sim1 or the one using sim2? It's hard to tell from this data.
Two things confound interpretation here. First, although we wouldn't know it without the rest of the simulans sequences, we're mistaking polymorphisms for divergence (sites 12 and 25), which causes us to overestimate the divergence time. Second, our small sample size gives us a large sampling error. A good idea would be to look at longer sequences to see if they support the lower or the higher divergence estimate.

g) We've estimated divergence times of less than a million years, far less than the established divergence time of 3 million years. One possibility is that we are looking at a coding sequence, as considered in part (a). If so, lower divergence would be expected. Another possibility is that we just happen to have a small sample which doesn't reflect well the true divergence of the species. Another possibility is that our estimate of the mutation rate from the polymorphism data was inaccurate. A lower true mutation rate would imply a longer divergence time.

(If you're comfortable with statistics, here's a more rigorous look at the small-sample question. You don't have to know this for the class, but it's interesting. Typically you see about 8% divergence at neutral loci between mel and sim. We can conceive of our 40-site sample as 40 Bernoulli trials, each independent and with probability of success equal to 8%. The mean number of divergent sites you expect to see is 40 * 0.08 = 3.2, and the standard deviation is sqrt(40 * 0.08 * (1 - 0.08)) ~ 1.7. So it's not unlikely at all to see three, two, or one fixed differences in a sample of size 40. In fact, the chance that you see zero fixed differences is just (1 - 0.08)^40, or about 4%. The data are not inconsistent at all with a true divergence time of 3 million years.)

8) This question does not have sharp right and wrong answers.
Be sure that you can define each of the terms and give a plausible example of how polymorphism and divergence data would be affected by varying each term. For example, increasing the mutation rate increases the amount of polymorphism observed, and, all else being equal, some of these extra polymorphisms will eventually fix and appear as increased divergence.

9) We've posted a drift simulation on the course website. If you didn't write your own, you're encouraged to download the program and try it out. (You'll need a Python interpreter (www.python.org) if you don't have one.) If you did write your own program, you might compare its output to that of the posted program.

10) Middle-Earth has two distinct populations of elves: those of Lothlorien, and those of Mirkwood. The woods of Lothlorien are much better suited to habitation than those of Mirkwood, so there are twice as many elves in the former as in the latter. These communities have been isolated from each other for a long enough time that allele frequencies at many loci have changed, but they are still fully capable of interbreeding, and therefore must be considered a single species. Each population has only 2 alleles at the disposition locus (congenial or arrogant), but the frequencies of the alleles are not the same in the two populations. Each population is initially in Hardy-Weinberg equilibrium with itself. Now that the Adversary has been defeated, the communities merge on their way back to the Elf Havens across the Sea.

a) Calculate the change in the proportion of homozygotes between the P1 generation where the populations merge, and the F1 generation, assuming that mating is now random within the entire Elven community.
Lothlorien: Population size = 2x, Frequency of allele = a
Mirkwood: Population size = x, Frequency of allele = b

In the P1 generation,
Allele frequency = [(freq in L)*(pop size of L) + (freq in M)*(pop size of M)] / (total pop)
= (a*2x + b*x) / (3x)
= (2a + b) / 3
Freq of homozygotes = [(freq in L)*(pop size of L) + (freq in M)*(pop size of M)] / (total pop)
= [(a^2)(2x) + ((1-a)^2)(2x) + (b^2)(x) + ((1-b)^2)(x)] / (3x)
= [2(a^2) + 2((1-a)^2) + (b^2) + ((1-b)^2)] / 3
= (2a^2 + 2 - 4a + 2a^2 + b^2 + 1 - 2b + b^2) / 3
= (4a^2 - 4a + 2b^2 - 2b + 3) / 3

In the F1 generation,
Allele frequency is unchanged (random mating does not change the total occurrence of an allele; it merely changes the frequencies of genotypes when they were not previously in equilibrium).
Freq of homozygotes = [(2a + b) / 3]^2 + [1 - (2a + b) / 3]^2

The difference between these frequencies of homozygotes is:
Difference = Freq after - Freq before
= [(2a + b) / 3]^2 + [1 - (2a + b) / 3]^2 - [(4a^2 - 4a + 2b^2 - 2b + 3) / 3]
= (4a^2 + 4ab + b^2) / 9 + 1 - (4a + 2b) / 3 + (4a^2 + 4ab + b^2) / 9 - [(4a^2 - 4a + 2b^2 - 2b + 3) / 3]
9*dif = 8a^2 + 8ab + 2b^2 + 9 - 12a - 6b - 12a^2 + 12a - 6b^2 + 6b - 9
9*dif = -4a^2 - 4b^2 + 8ab
9*dif = -4(a^2 - 2ab + b^2)
dif = (-4/9) (a - b)^2

b) Does the sign of the change in the proportion of homozygotes depend on whether the larger or smaller population initially had a higher incidence of arrogance? Explain your answer.

The difference will always be negative (or 0 if the frequencies were identical to begin with). Mixing the entire population together will decrease the total number of homozygotes, as there would have been a relative excess of them in whichever population had an initial frequency furthest from 0.5.

11) The peppered moth Biston betularia can be one of two colors, white or dark brown. A single locus with two alleles is responsible for determining the body color phenotype.
Allele 'M' is dominant to 'm', and its presence leads to a greater production of melanin that darkens the moth's body color. An extremely large population of the peppered moth has thrived in a forest of dark brown and white barked trees for many centuries. 55% of the moths in this population are white in color. Both white and dark brown moths can survive in the forest because they are camouflaged against trees that match their body color. When a fire sweeps through the forest, the benefits enjoyed by the white moths are eliminated. While many of the moths and birds in the forest survive the blaze, the dark colored moths are now better camouflaged against the dark burnt bark of the trees. The white moths are so poorly camouflaged that they are immediately eaten by predatory birds. None of the white moths are ever able to reproduce.

NOTE: Assume Hardy-Weinberg assumptions for the following questions, including an infinite population size.

a) What are P (MM), Q (Mm), R (mm), p (frequency of M) and q (frequency of m) for the population of moths before the forest fire?

White moths are the mm homozygotes, so
R = 0.55, q = R^(1/2) = 0.74, p = 1 - q = 0.26, P = p^2 = 0.068, Q = 2pq = 0.385

b) Assuming both colors of moths survive the forest fire with equal probability, what are P, Q, R, p and q for the first generation of moths after the fire (at the time when they are born)?

The white adults are all eliminated through predation after the fire. The new genotype frequencies after this are:
R' = 0, P' = 0.068/(0.068 + 0.385) = 0.15, Q' = 0.385/(0.068 + 0.385) = 0.85
This means that the allele frequencies are:
p' = P' + 1/2 Q' = 0.58
q' = 1 - 0.58 = 0.42
The allele frequencies of the offspring right after birth are the same:
p' = P' + 1/2 Q' = 0.58
q' = 1 - 0.58 = 0.42
and the genotype frequencies are the Hardy-Weinberg ones:
P1 = (p')^2 = 0.33
Q1 = 2p'q' = 0.49
R1 = (q')^2 = 0.18

c) If white moths are destroyed generation after generation, find an equation for q_t in terms of q_0.
This is very similar to b, but now we will express q_1 in terms of q_0 explicitly. We can then see that there is a simple solution for q_t in the t-th generation in terms of q_0 before the fire.

$q_1 = \frac{p_0 q_0}{p_0^2 + 2 p_0 q_0} = \frac{q_0}{p_0 + 2 q_0} = \frac{q_0}{1 + q_0}$

similarly

$q_2 = \frac{p_1 q_1}{p_1^2 + 2 p_1 q_1} = \frac{q_1}{p_1 + 2 q_1} = \frac{q_1}{1 + q_1} = \frac{q_0/(1 + q_0)}{1 + q_0/(1 + q_0)} = \frac{q_0}{1 + 2 q_0}$

and

$q_t = \frac{q_0}{1 + t q_0}$

d) After how many generations will q = .05? How long will it take until q is exactly 0?

$q_t = \frac{q_0}{1 + t q_0}$

$0.05 = \frac{0.74}{1 + 0.74\,t} \quad\Rightarrow\quad \frac{1}{0.05} = \frac{1}{0.74} + t \quad\Rightarrow\quad t = 18.65$ generations

It will take infinitely many generations to reach q = 0 in a population of infinite size.

12) A great catastrophe befalls Whoville; when Horton falls asleep, some of the other residents of the Jungle of Nool (claiming that they are acting in the best interests of the poor deranged elephant) boil the dust speck that contains the Whos' world. Only 1 male and 1 female Who survive the horrible calamity. Horton recovers the dust speck, and with the aid of caffeine and increased vigilance, he protects the tiny dust speck long enough for the survivors to produce an (F1) generation of 10 Whos. The ability of these Whos to produce substantial noise is controlled by a simple 2-allele locus, and before the catastrophe the incidence of the loud allele was .25, while the incidence of the quiet allele was .75. Assume that the Whos were in Hardy-Weinberg equilibrium before the catastrophe.

The easiest way to do this is to set up a table, where p denotes the quiet allele (frequency 3/4) and q the loud allele (frequency 1/4):

P1 type   Probability of obtaining   Prob p extinct in F1   Prob q extinct in F1
pppp      (3/4)^4                    0                      1
pppq      4[(3/4)^3](1/4)            (1/4)^20               (3/4)^20
ppqq      6[(3/4)^2][(1/4)^2]        (2/4)^20               (2/4)^20
pqqq      4(3/4)[(1/4)^3]            (3/4)^20               (1/4)^20
qqqq      (1/4)^4                    1                      0

a) Use a Wright-Fisher model to predict the probability of the quiet allele being extinct in the F1 generation.

The total probability that p goes extinct by F1 is the sum of the products of the probability of getting a certain P1 type and the probability of p going extinct in F1 from that type.
In this case, that means
4(3/4)^3 (1/4)(1/4)^20 + 6(3/4)^2 (1/4)^2 (1/2)^20 + 4(3/4)(1/4)^3 (3/4)^20 + (1/4)^4 =~ .406%

b) Use a Wright-Fisher model to predict the probability of the loud allele being extinct in the F1 generation.

The total probability that q goes extinct by F1 is the sum of the products of the probability of getting a certain P1 type and the probability of q going extinct in F1 from that type. In this case, that means:
(3/4)^4 + 4(3/4)^3 (1/4)(3/4)^20 + 6(3/4)^2 (1/4)^2 (1/2)^20 + 4(3/4)(1/4)^3 (1/4)^20 =~ 31.8%

c) Identify the shortcomings of the Wright-Fisher model in this example (i.e., what might actually happen in reality that wouldn't be indicated in the Wright-Fisher model). Give an example in which these shortcomings could substantially change the probabilities calculated in parts a and b.

The Wright-Fisher model assumes that gametes are drawn from an infinite pool, and doesn't take into account the biology of the organisms. The actual probabilities would depend on the genotypes of the surviving male and female and not just on the allele frequencies. For example, if the male is homozygous for the loud allele and the female is homozygous for the quiet allele, then all 10 offspring will be heterozygous and the probability of any frequency other than 50/50 will be zero. Wright-Fisher also assumes that organisms can self. With sufficiently large populations this assumption does not matter too much, but it does affect results significantly when the population sizes are very small.

13) Scientists discover a very basic form of life on Mars. The genetic system is based on only four amino acids (X, Y, Z, and W) and two nucleotides (A and B). Codons in this new system are only two nucleotides long and code for the amino acids according to the following table.
Nucleotides   Amino Acid
AA            X
AB            Y
BA            Z
BB            W

Assuming that a nucleotide changes to any other in a given time step t with probability b, answer the following questions about the substitutions in a particular neutral amino acid sequence.

HINTS:
• for simplicity assume that only a single substitution can occur in a given time step.
• compare the problem to Kimura's 2-parameter correction

a) What is the equation K(t) for the expected number of amino acid changes that actually do occur after time t?

[Figure: the four amino acids as a square, with X (AA) linked to Y (AB) and Z (BA), and each of those linked to W (BB); every link has rate b]

The number of amino acid changes that take place in a given time is given by K(t) = 2bt.

b) Find a correction for the number of changes that have occurred (K) in terms of the number of observed amino acid substitutions. (Hint: there are two types of substitutions. Include both in the formula for K.)

For part b it is easiest to realize the similarity of this model to the Kimura model.

[Figure: Kimura model diagram for A, T, G, and C with rates a and b]

Switching the positions of G and C we see the following.

[Figure: the same diagram with G and C exchanged]

Setting a equal to 0 we have the following. This looks very similar to our model.

[Figure: the diagram with a = 0, leaving four links of rate b]

Seeing the similarity to Kimura's model, we can answer the rest of the questions. For the Kimura model the probabilities of transitions and transversions are given by:

transitions: $Y(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} - \frac{1}{2} e^{-2(\alpha+\beta)t}$
transversions: $Z(t) = \frac{1}{4} - \frac{1}{4} e^{-4\beta t}$

In our case these equations can give us the equations we are asked for. All we have to do is set α = 0 (that is, a = 0) and write b for β. The probability of identity follows from the same substitution in the Kimura equation:

$I(t) = \Pr_{XX}(t) = \frac{1}{4} + \frac{1}{4} e^{-4bt} + \frac{1}{2} e^{-2bt}$

Pr(change to the state two subs. away) = $\Pr_{XW}(t) = \frac{1}{4} + \frac{1}{4} e^{-4bt} - \frac{1}{2} e^{-2bt}$
Pr(change to a particular neighbor state) = $\Pr_{XY}(t) = \Pr_{XZ}(t) = \frac{1}{4} - \frac{1}{4} e^{-4bt}$

Let A(t) and B(t) be

A(t) = Pr(change to any neighbor state) = 2 Pr(change to a particular neighbor state) = $2\Pr_{XZ}(t) = \frac{1}{2} - \frac{1}{2} e^{-4bt}$
B(t) = Pr(change to the state two subs. away) = $\Pr_{XW}(t) = \frac{1}{4} + \frac{1}{4} e^{-4bt} - \frac{1}{2} e^{-2bt}$

Then

$A(t) + 2B(t) = \frac{1}{2} - \frac{1}{2} e^{-4bt} + \frac{1}{2} + \frac{1}{2} e^{-4bt} - e^{-2bt} = 1 - e^{-2bt}$

$2bt = \ln\frac{1}{1 - A - 2B}$

and because K(t) = 2bt, $K = \ln\frac{1}{1 - A - 2B}$.

14) Elephant seals possess an interesting system of mating. One alpha-male lies in the center of a harem of females and attempts to mate with as many of these females as possible throughout the course of the mating season. The harem is also surrounded by 5 beta-males, usually younger and smaller, who lie around the harem and protect it from invasion by other males. When the alpha-male starts to mate, the beta-males use his distraction to do some mating of their own with the females at the outer edges of the harem. Other males, not alphas or betas, stay in the water and are not allowed to mate at all. One would think that the alpha-male sires the most offspring in this situation; however, recent genetic studies have found that beta-males as a group actually have at least as much mating success as the alpha-male. In a typical harem, 50% of the offspring are sired by beta-males and 50% are sired by the alpha. Given that there are 40 alpha-males and 2000 females in the population, and assuming that each of the harems is exactly the same size, that all beta-males have exactly the same probability of having an offspring, and that all females are fertile and available for mating, calculate the effective population size for this population of elephant seals.

NOTE: assume non-overlapping generations for simplicity.

In general:
number of alpha males = N
number of beta males = 5N
number of females = N_f

Let F be autozygosity starting at time t_0. Then
F1 = Pr(same alpha male alleles) + Pr(same beta male alleles) + Pr(same female alleles)
More explicitly:
F1 = Pr(both alleles from some alpha male) * Pr(same allele | both from alpha male) + Pr(both from some beta male) * Pr(same allele | both from some beta male) + Pr(both from some female) * Pr(same allele | both from some female)

Breaking this up:

Pr(both alleles from alpha males) = (Pr(allele is from male) * Pr(from alpha | from male))^2 = $\left(\frac{1}{2} \cdot \frac{1}{2}\right)^2 = \frac{1}{16}$
Pr(same allele | both from alpha male) = $\frac{1}{2N}$
Pr(both from some beta male) = $\left(\frac{1}{2} \cdot \frac{1}{2}\right)^2 = \frac{1}{16}$
Pr(same allele | both from beta male) = $\frac{1}{2(5N)}$
Pr(both from some female) = $\left(\frac{1}{2}\right)^2 = \frac{1}{4}$
Pr(same allele | both from female) = $\frac{1}{2N_f}$

$F_1 = \frac{1}{16} \cdot \frac{1}{2N} + \frac{1}{16} \cdot \frac{1}{2(5N)} + \frac{1}{4} \cdot \frac{1}{2N_f} = \frac{3N_f + 10N}{80 N N_f}$

Because $F_1$ is normally $\frac{1}{2N_e}$,

$\frac{1}{2N_e} = \frac{3N_f + 10N}{80 N N_f} \quad\Rightarrow\quad N_e = \frac{40 N N_f}{3N_f + 10N}$

In our case N = 40 and N_f = 2000, so

$N_e = \frac{40(40)(2000)}{3(2000) + 10(40)} = \frac{(40)(2000)}{3(50) + 10} = \frac{(4)(2000)}{3(5) + 1} = \frac{8000}{16} = 500$

Note: you could also just work through the problem using numbers (i.e. without finding a general equation).

15) You find out that your favorite protein is translated from the following mRNA:
CUAUGGCAACAUCAUCAGCGGCA

a) Write down the amino acid sequence of the protein if translation starts at the first Met codon encountered in the mRNA sequence.
MET-ALA-THR-SER-SER-ALA-ALA

b) Now assume that translation starts at the first Tyr codon. Write down the amino acid sequence of the new protein.
TYR-GLY-ASN-ILE-ILE-SER-GLY

c) Which protein should experience more amino acid changes per unit time if all cytosines in your organism are methylated? Explain.
Sequence A. In this sequence the cytosines in the corresponding DNA sequence are all in the 2nd positions of codons, and thus every change to a T would be nonsynonymous. In Sequence B the cytosines in the corresponding DNA sequence are all in 3rd positions, thus generating mostly synonymous substitutions.

16) Imagine that you have cloned two homologous genes in two species of Drosophila. You sequence and align them.
Below are short parts of the alignments of the expected mRNA sequences. The sequences are parsed into expected codons (for instance, the first codon in the Drosophila simulans sequence of Gene 1 is AUC).

Gene 1:
Drosophila simulans:     AUC-ACC-CAC-CAA-CAG-UUC-UGU-GCU
Drosophila melanogaster: AUG-ACA-CAC-CAA-CGG-UUC-UGC-GAU

Gene 2:
Drosophila simulans:     ACA-GAU-GGU-CCU-CGC-GUG
Drosophila melanogaster: ACA-CUU-AGU-AUU-CAC-GCA

a) Above are coding sequences from two genes in two closely related species, D. simulans and D. melanogaster. Assuming that the path with the fewest nonsynonymous substitutions represents the true path between the two codons, calculate Ka and Ks for both genes.

Gene 1:
Pa(1) = 3, Na(1) = 19.33, Pa/Na = 0.15, Ka = -3/4*ln(1-4/3*(Pa/Na)) = 0.17
Ps(1) = 2, Ns(1) = 4.06, Ps/Ns = 0.49, Ks = 0.8
Ka/Ks = 0.21

Gene 2:
Pa(2) = 7, Na(2) = 13, Pa/Na = 0.54, Ka = 0.95
Ps(2) = 1, Ns(2) = 4.63, Ps/Ns = 0.21, Ks = 0.25
Ka/Ks = 3.8

b) Would you be surprised to learn that both genes have no function at the level of the protein? Explain.
Yes, it would be surprising. Genes that do not have a function at the level of the protein are expected to have Ka/Ks equal to 1. In both cases Ka/Ks seems distinct from 1, although to be certain one needs to do appropriate statistical tests. Prima facie it is surprising.

c) Would you be surprised to learn that Gene 2 has been under strong selection to change its protein sequence? Explain.
No, you wouldn't be. It appears that there are more amino acid substitutions than one would expect under the assumption that all amino acid substitutions are neutral and have fixed through genetic drift. The idea of natural selection promoting such changes would be consistent with the observations.

17) You are given two sequences (50 bp each) of the homologous pseudogene in two species of yeast.
The alignment is shown below:
CCTCGACGGCTTAGATCTGATCTGACCTAATGCTGCAATCGTTACAAAGT
CCTCCACGAGTAAGAGTTGATCCGACTTAGTCCTGCGATCGTTAGATAAT
You know that these species last shared a common ancestor 10 MYA and that both species go through 50 generations a year.

a) Using the Jukes-Cantor model of nucleotide substitution, estimate the mutation rate per nucleotide per year in these two species of yeast. Assume that the mutation rate is the same in both species.
There are 14 substitutions in 50 bp. P = 14/50 = .28.
K = -(3/4)*ln(1-(4/3)*P) = 0.35
K = 2T (years) * µ (mutations per bp per year)
µ = .35/20 Myr = 1.75 × 10^-8 mutations per bp per year

b) Do you believe that the Jukes-Cantor model is appropriate in this case? Do you see any evidence that you need to use the Kimura 2-parameter model instead? Please explain.
There are 7 transitions and 7 transversions. Under the Jukes-Cantor model we would expect all rates to be equal and thus to have twice as many transversions as transitions. The Kimura model would probably be more appropriate here.

c) You sample 10 alleles of this pseudogene, different by origin, in the population of one of these species of yeast. You sequence the same 50 bp region in all 10 cases and find that there are 5 distinct alleles in the following proportions:

allele 1   allele 2   allele 3   allele 4   allele 5
6          1          1          1          1

Estimate the effective population size in this species of yeast.
First we estimate heterozygosity: H = 1 - [(0.6)^2 + 4*(0.1)^2] = 0.6
θ = H/(1-H) = 1.5
Per nucleotide, θ_bp = 1.5/50 = 0.03
Per generation, µ_bp = 1.75 × 10^-8 / 50 = 3.5 × 10^-10 mutations per bp per generation
4Ne µ_bp = 0.03 and thus Ne = 2.1 × 10^7

18) In Drosophila the rates of all transitions and transversions are very similar except for the transition from C to T (or equivalently from G to A). You know that the average GC content of neutral, entirely unconstrained DNA in Drosophila is 34%. Please estimate the relative probability of a C to T transition versus a T to C transition.
Let all rates except C to T be equal to a and the rate of C to T be equal to b. Then the rate of changes from A or T nucleotides to G or C nucleotides per unit time in an infinitely long stretch of neutral DNA is (AT%)*4a, because both A and T have either G or C to change to and each change happens with the rate a. The rate of changes from G or C nucleotides similarly is (GC%)*(2a+2b). At equilibrium the rates are equal. Thus, writing g for the GC content (so AT% = 1 - g),
(AT%)*4a = (GC%)*(2a+2b)
b/a = (2-3g)/g = 2.9

19) Your friend just sequenced the full genome of a new species of bacteria. The GC content in this species is 50%. Your friend decides to calculate the proportion of different kinds of nucleotide pairs (dinucleotides) and finds among other things that C's are followed by A's 40% of the time, while 60% of the time they are followed by the other three nucleotides. This reminds you of a similar pattern in the human genome. From what you know about methylation-dependent deamination, please advise your friend which other dinucleotide frequencies he needs to look at and give qualitative predictions of what he should see.

It appears that there is an overabundance of CpA dinucleotides relative to the random expectation. Because the GC content is 50%, we would expect dinucleotides to be present in equal proportions, with the CpA frequency being equal to that of CpG, CpC, or CpT. We know that in human DNA many cytosines are methylated in CpG pairs, resulting in a higher probability of mutation to CpA or TpG. This would generate the observed increase in the proportion of CpA pairs. In addition, it would increase the proportion of TpG pairs and decrease the proportion of CpG pairs relative to the random expectation.

20-30) See problems 2-12 in the other pdf file.

31) We can write $\Delta q = -q^2 p s / (1 - q^2 s)$. Since q is the frequency of the deleterious allele, q raised to a high power, say 2, will be close to zero, and p will be close to 1. Thus, the denominator of $\Delta q$ is approximately 1. $\Delta q$ is the change in q due to selection.
The change due to mutation is $(1 - q)\mu$, where µ is the mutation rate. These changes add up to zero at mutation-selection equilibrium, so

$(1 - q)\, q^2 s = (1 - q)\, \mu$
$q = (\mu/s)^{1/2}$
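The equilibrium in problem 31 can also be checked by iterating the two changes until q stops moving. The sketch below is an illustration, not part of the original solution. It assumes a fully recessive deleterious allele with selection coefficient s, one-way mutation at rate µ, and the arbitrary values s = 0.1 and µ = 10^-6; the function names are made up for this example.

```python
import math

def next_q(q, s, mu):
    """One generation: selection against the recessive homozygote, then mutation."""
    p = 1.0 - q
    delta_sel = -(q ** 2) * p * s / (1.0 - q ** 2 * s)  # change due to selection
    delta_mut = p * mu                                  # change due to mutation
    return q + delta_sel + delta_mut

def equilibrium_q(s, mu, q0=0.5, generations=200000):
    """Iterate the recursion from q0; the generation count is an assumed cutoff."""
    q = q0
    for _ in range(generations):
        q = next_q(q, s, mu)
    return q

s, mu = 0.1, 1e-6            # assumed illustrative values
q_hat = equilibrium_q(s, mu)
q_approx = math.sqrt(mu / s)
print(q_hat, q_approx)
```

With these values the iterated frequency and the approximation (µ/s)^(1/2) agree closely, which is what the derivation predicts when q is small.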