Evolutionary Genetics: Part 3
Coalescent 2: Effective population size
(Title-slide images: S. chilense, S. peruvianum)
Winter Semester 2012-2013
Prof Aurélien Tellier, FG Populationsgenetik

Color code
Red = important result or definition. Purple = exercise to do. Green = some bits of maths.

Population genetics: 4 evolutionary forces
Molecular diversity is shaped by: random genomic processes (mutation, duplication, recombination, gene conversion), natural selection, a random spatial process (migration) and a random demographic process (drift).

Effective population size

The coalescent
We can calculate many aspects of a genealogical (coalescent) tree for a population of size 2N:
Time to the MRCA: $E[T_{MRCA}] = 4N (1 - 1/n)$
Length of a tree: $E[L] \approx 4N \log(n-1)$
Time of coalescence of the last two lineages: $E[T_2] = 2N$
(Figure: tree with the expected waiting times while 2, 3, 4 and 5 lineages remain: 2N, 2N/3, 2N/6, 2N/10.)

Definition
The real physical population is unlikely to behave as in the Wright-Fisher model.
Most populations show some kind of structure: geographic proximity of individuals, social constraints...
The variance of the number of descendants may be > 1, the value expected under the Poisson distribution.
Effective population size = size of a Wright-Fisher population that would produce the same rate of genetic drift as the population of interest.
One consequence of drift: do two randomly picked offspring individuals have a common ancestor in the parent generation?

Definition
We will use here the inbreeding effective size, Ne, also called the identity-by-descent population size:
$N_e = 1 / (2 \, P[T_2 = 1])$
where $T_2$, given in generations, is the time until two lineages coalesce.
This depends only on the immediately preceding generation!
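The definition Ne = 1 / (2 P[T2 = 1]) can be checked with a small simulation. The sketch below is illustrative only: the function name, the number of trials and the seed are assumptions, not part of the course material. It assumes a Wright-Fisher population of 2N gene copies in which each offspring copy picks its parent uniformly at random, so two copies share a parent with probability 1/(2N) and the estimate should land close to N.

```python
import random


def estimate_inbreeding_ne(num_gene_copies, trials=200_000, seed=1):
    """Monte Carlo estimate of Ne = 1 / (2 * P[T2 = 1]).

    num_gene_copies plays the role of 2N. Two sampled offspring copies
    each pick a parent copy uniformly at random (with replacement), as
    in the Wright-Fisher model. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    same_parent = 0
    for _ in range(trials):
        parent_a = rng.randrange(num_gene_copies)
        parent_b = rng.randrange(num_gene_copies)
        if parent_a == parent_b:
            same_parent += 1
    if same_parent == 0:
        raise ValueError("no coalescence observed; increase trials")
    p_coalesce = same_parent / trials  # estimate of P[T2 = 1]
    return 1.0 / (2.0 * p_coalesce)


if __name__ == "__main__":
    # 2N = 100 gene copies, so the estimate should be near N = 50.
    print(estimate_inbreeding_ne(100))
```

Any population model for which P[T2 = 1] can be simulated could be plugged into the same definition; only the parent-sampling step would change.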
An extension of the inbreeding effective size definition is:
$N_e(t) = E[T_2] / 2$
This relates to the number of generations until an MRCA is found in the population.

Definition
For the haploid Wright-Fisher model:
$N_e = 1 / (2 \, P[T_2 = 1])$ with $P[T_2 = 1] = 1/(2N)$, so that $N_e = N$.
The extension is $N_e(t) = E[T_2] / 2$ with $E[T_2] = 2N$, so that $N_e(t) = N$.
For the Wright-Fisher model the two definitions agree.

Calculating Ne
Diploid model with different numbers of males and females:
$N_f$ = number of females, $N_m$ = number of males, $N_f + N_m = N$
$P[T_2 = 1] = (1 - 1/(2N)) \cdot N / (8 N_f N_m)$
For large N the factor $(1 - 1/(2N))$ is close to 1, so
$N_e(t) = N_e \approx 4 N_f N_m / (N_f + N_m)$
For example: when some men have a harem, $N_f = 20$ and $N_m = 1$. What is Ne?

Calculating N and Ne
Example based on the human population: many human genes have an MRCA less than 200,000 years ago.
If one generation = 20 years, this is 10,000 generations.
So if $E[T_{MRCA}] \approx 4 N_e$ generations is less than 10,000 generations, then $N_e < 200,000 / (4 \times 20)$, that is $N_e < 2,500$!
Of course N is much bigger in the human population, but Ne may be very small.
We will see how to estimate Ne from sequence data later on.

The coalescent 2: role of mutations

Coalescent tree + mutations
The distribution of mutations amongst individuals can be summarized as a tree (on a genealogy).
(Figure from "Coalescent theory", John Wakeley, 2009.)

Coalescent tree + mutations
How to add mutation on a coalescent tree?
In a Wright-Fisher model: see drawing.
µ = probability that an offspring changes its genotype by mutation, and P[no mutation] = 1 - µ.
For example, in a two-allele model with alleles A and a: mutation goes from a to A, and vice versa.
The classical model for DNA sequences is the so-called infinite site model.
Definition: each new mutation hits a new site in the genome.
So it cannot be masked by back mutation.
It is not affected by recurrent mutation.
Every mutation is visible unless it is lost by drift.

Models of mutation
There are other models of sequence evolution, but these will not be used for now.
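Returning to the two Ne calculations above (unequal numbers of males and females, and the bound from human MRCA ages), here is a minimal sketch of the arithmetic. The function names are hypothetical. The first function uses the large-N form Ne = 4 Nf Nm / (Nf + Nm), and the second uses E[TMRCA] ≈ 4 Ne generations.

```python
def ne_unequal_sex_ratio(n_females, n_males):
    """Inbreeding effective size Ne = 4 * Nf * Nm / (Nf + Nm).

    This is the large-N approximation: the factor (1 - 1/(2N)) in
    P[T2 = 1] is treated as 1.
    """
    if n_females <= 0 or n_males <= 0:
        raise ValueError("both sexes must be present")
    return 4.0 * n_females * n_males / (n_females + n_males)


def ne_upper_bound_from_tmrca(tmrca_years, generation_years):
    """Upper bound on Ne, assuming E[TMRCA] is about 4 * Ne generations."""
    generations = tmrca_years / generation_years
    return generations / 4.0


if __name__ == "__main__":
    # Harem example: 20 females, 1 male gives 80/21, about 3.81.
    print(ne_unequal_sex_ratio(20, 1))
    # Equal sex ratio recovers Ne = N: 10 females + 10 males gives 20.
    print(ne_unequal_sex_ratio(10, 10))
    # Human example: MRCA younger than 200,000 years, 20 years per generation.
    print(ne_upper_bound_from_tmrca(200_000, 20))  # 2500.0
```

The harem case shows how strongly the rarer sex dominates: with a single male, Ne stays below 4 however many females there are.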
Other models of sequence evolution:
Infinite allele model
Definition: each mutation creates a new allele. (Example on a tree.)
Finite site model
Definition: mutations fall on a finite number of sites. (Example on a tree.)

Coalescent tree + mutations
How to add mutation on a coalescent tree?
µ = probability that an offspring changes its genotype by mutation, and P[no mutation] = 1 - µ.
Do you see where this is going?
After t generations, what is the probability that there were no mutations?
$P[X > t] = (1 - \mu)^t \approx e^{-\mu t}$
So we can again draw the time until a new mutation occurs from an exponential distribution, and put this on a tree, drawing for each branch the time to a new mutation.
The mutation will be visible in all descendants from that branch.
(Figure: tree with 4 sites and sequences AAAA, AAAA, TTAA, TTTT. After one more mutation, 5 sites: AAAAA, AAAAG, TTAAA, TTTTA.)

Mutations on a tree
For neutral mutations we can do this process without changing the shape of the tree or the size of the tree.
Tree topology = shape and branching of the tree.
Branch lengths = lengths of branches, usually in units of 2N generations.
BECAUSE
Forward in time: a neutral mutation does not change the offspring distribution of an individual.
Backward in time: mutation does not change the probability of being picked as a parent.

Tree topology
Definitions: external branches and internal branches.

Tree topology and mutation
We classify mutations (= SNPs) by their frequency in the sample.
(Figure: tree of sequences 1 to 4 carrying mutations a and b.)
Mutation a is found in two sequences = doubleton.
Mutation b is found in one sequence = singleton.

Mutations on a tree
We are now interested in the number of mutations
on each branch of the tree.
For a branch of length l, the number of mutations follows a Poisson distribution with parameter lµ.
So for the total tree: Poisson(Lµ).
Remember: $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$
So we define S as the total number of mutations on a tree (on a set of sequences):
$E[S] = \mu E[L] = 4N\mu \, \frac{E[L]}{4N} = \theta \, \frac{4N \sum_{i=1}^{n-1} \frac{1}{i}}{4N} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$
with $\theta = 4 N_e \mu$

The population mutation rate $\theta = 4 N_e \mu$
This is the crucial parameter: it combines mutation and Ne.
θ is called the population mutation rate or scaled mutation rate.
We can estimate θ based on sequence data.
Two estimators have been derived:
$\theta_\pi$, derived by Tajima (1983)
$\theta_S$ (or $\theta_W$), derived by Watterson (1975)

Watterson estimator
$\theta_S = \theta_W$ is based on the number of segregating sites S in a sample, compared to the average branch length of a sample of size n. It is defined as
$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$
Remember: $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$
This is the expected average number of segregating sites per given length of tree branch.

Tajima estimator
$\theta_\pi$ is defined as the average number of differences over all pairs of sequences in a sample.
It is based on $\pi_{ij}$, the number of differences between two sequences i and j, and is defined as
$\theta_\pi = \frac{1}{\binom{n}{2}} \sum_{i<j} \pi_{ij} = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij}$
because there are n(n-1)/2 pairs of sequences.
So take all sequences, count the number of differences for all pairs, and then take the average.

Tajima estimator
(Figure: tree of sequences 1 to 4 carrying mutations a and b.)
Different mutations count differently:
Mutation a (a doubleton) is counted in four pairwise comparisons.
Mutation b (a singleton) is counted in three comparisons.
$\pi_{ij}$ and thus $\theta_\pi$ depend on how many mutations fall on internal or external branches.

Coalescent tree + mutations
Example of calculation
(Figure: tree with 4 sites, root sequence AAAA and sampled sequences ATAA, TAAT, TATA, so n = 3 and S = 4.)
$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}} = \frac{4}{1 + \frac{1}{2}} = \frac{8}{3}$
$\theta_\pi = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij} = \frac{3 + 3 + 2}{3} = \frac{8}{3}$

Watterson estimator
$\theta_S = \theta_W$ is based on the number of segregating sites S in a sample, compared to the average branch length of a sample of
size n, defined as
$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$
Remember: $E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$
(Figure: tree for n = 5; number of lineages times expected duration of each interval, in units of 2N generations: 2 × 1, 3 × 1/3, 4 × 1/6, 5 × 1/10.)
$E[L] = 2N \left( 2 + 1 + \frac{2}{3} + \frac{1}{2} \right) = 2N \cdot 2 \left( 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} \right) = 4N \sum_{i=1}^{n-1} \frac{1}{i}$

Neutral model of coalescent
$\theta = 4 N_e \mu$
Very important result: $\theta_S = \theta_\pi$ (in expectation) if the population follows a neutral coalescent model with constant population size!

Estimating Ne
It is possible to estimate Ne based on the two estimators IF and only IF you have independent data on the mutation rate:
$N_e = \theta_\pi / (4\mu) = \theta_S / (4\mu)$
This assumes:
Infinite site model
Constant Ne over time
Homogeneous population (equal coalescent probability for all pairs)

Estimating Ne
Exercise
Calculate $\theta_\pi$, $\theta_S$ and estimate Ne for two datasets:
In human populations: TNFSF-5-Humans.fas
In Drosophila populations: 055-Droso.nex
Define populations in DnaSP using: data => define sequence sets
Then => Polymorphism analysis
For Drosophila: Europe and Africa
Mutation rate in humans = 1.2 × 10^-8 per base per generation (Scally and Durbin, Nat Rev Genetics, October 2012)
Mutation rate in Drosophila = 10^-8 per base per generation
What are the differences?

Heterozygosity
Definition: Heterozygosity H is the probability that two alleles taken at random from a population are different at a random site or locus.
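Returning to the two estimators of θ and the estimate of Ne described above, the worked example (sampled sequences ATAA, TAAT, TATA) can be reproduced with a short sketch. The function names are illustrative assumptions, not DnaSP's interface. The sequences are assumed to be aligned and of equal length, and the infinite site model is assumed, so every variable column counts as one segregating site.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that show more than one distinct base."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned (equal length)")
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """theta_S = S / sum_{i=1}^{n-1} 1/i (per locus, not per site)."""
    n = len(seqs)
    harmonic = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / harmonic


def tajima_theta(seqs):
    """theta_pi = average number of differences over all n(n-1)/2 pairs."""
    n = len(seqs)
    total_diffs = sum(
        sum(x != y for x, y in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    return 2.0 * total_diffs / (n * (n - 1))


def estimate_ne(theta_per_site, mu_per_site):
    """Ne = theta / (4 * mu); both arguments must be per site."""
    return theta_per_site / (4.0 * mu_per_site)


if __name__ == "__main__":
    sample = ["ATAA", "TAAT", "TATA"]  # sampled sequences from the example
    print(watterson_theta(sample))  # 8/3, about 2.667
    print(tajima_theta(sample))     # 8/3, about 2.667
```

To turn either estimate into Ne, divide θ by the sequence length first so that it is per site, then pass it to `estimate_ne` together with a per-site mutation rate such as the values quoted in the exercise.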
Heterozygosity is a key measure of diversity in populations.
If $H_0$ is the heterozygosity at generation 0, then at generation 1, assuming no new mutations:
$H_1 = \frac{1}{2N_e} \cdot 0 + \left( 1 - \frac{1}{2N_e} \right) H_0$
With probability $1/(2N_e)$ the two alleles have the same parent at generation 0, and then the probability that they differ is 0.
With probability $1 - 1/(2N_e)$ they have different parents, and these parents have probability $H_0$ (by definition) of being different.

Heterozygosity
By iteration we get at generation t:
$H_t = \left( 1 - \frac{1}{2N_e} \right)^t H_0$
This means that in the absence of mutation, heterozygosity is lost at a rate of $1/(2N_e)$ every generation.

Heterozygosity + mutation
With the infinite allele model assumption that every new mutation creates a new allele:
Two opposing mechanisms drive the evolution of diversity in a population: genetic drift and mutation.
If they have the same strength and balance each other = mutation-drift balance.
The change in heterozygosity between two generations is:
$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu (1 - H_t)$
First term: change of heterozygosity due to random drift (always negative).
Second term: change of heterozygosity due to new mutations (always positive).
At equilibrium the value of heterozygosity is $\hat{H}$:
$\Delta H = 0 \Rightarrow \hat{H} = \frac{4 N_e \mu}{1 + 4 N_e \mu} = \frac{\theta}{1 + \theta}$
The value at equilibrium increases with increasing µ and Ne. WHY?

Mutation-drift balance
In the case of such a model, we are interested in:
What is the probability that a new mutation gets fixed?
How long does it take to get fixed?
Using a coalescent argument: fixation of the mutation has occurred if and only if the mutant is the common ancestor of the whole population; this probability = 1/(2N).
The expected time to fixation is equal to the expected time to the MRCA, so it is = 4N generations.
What do we expect for selected loci?
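The heterozygosity recursion and its equilibrium described above can be checked numerically. This is a minimal sketch with hypothetical function names. It iterates H(t+1) = H(t) - H(t)/(2Ne) + 2µ(1 - H(t)) and compares the result with θ/(1 + θ), where θ = 4Neµ. The parameter values are arbitrary illustrations.

```python
def heterozygosity_trajectory(h0, ne, mu, generations):
    """Iterate the drift + mutation recursion for heterozygosity.

    Each generation: drift removes H / (2 * Ne), mutation adds
    2 * mu * (1 - H). Returns the list [H_0, H_1, ..., H_t].
    """
    h = h0
    trajectory = [h]
    for _ in range(generations):
        h = h - h / (2.0 * ne) + 2.0 * mu * (1.0 - h)
        trajectory.append(h)
    return trajectory


def equilibrium_heterozygosity(ne, mu):
    """Equilibrium value H_hat = theta / (1 + theta), theta = 4 * Ne * mu."""
    theta = 4.0 * ne * mu
    return theta / (1.0 + theta)


if __name__ == "__main__":
    # Without mutation, H decays geometrically at rate 1 / (2 * Ne).
    decay = heterozygosity_trajectory(h0=0.5, ne=100, mu=0.0, generations=50)
    print(decay[-1], 0.5 * (1.0 - 1.0 / 200.0) ** 50)

    # With mutation, H converges to theta / (1 + theta) from any start.
    balance = heterozygosity_trajectory(h0=0.9, ne=1000, mu=1e-4,
                                        generations=20_000)
    print(balance[-1], equilibrium_heterozygosity(1000, 1e-4))
```

Running the second case from different starting values of H0 gives the same end point, which is one way to approach the WHY question: only the product Neµ sets the balance between the two terms.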
Mutation-drift balance
Substitution rate = rate at which mutations get fixed in a population/species. It is called k.
A new mutation starts with frequency 1/(2N) in a population.
The substitution rate is obtained by multiplying the number of new mutations arising in a population per generation (= 2Nµ) by the probability that one mutation gets fixed (= 1/(2N)).
So $k = 2N\mu \times \frac{1}{2N} = \mu$ (Kimura).
Most striking result: k does not depend on the effective population size.
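The two ingredients of k can be illustrated with a small forward simulation. This is a sketch under stated assumptions, not the course's method: a neutral Wright-Fisher population of 2N gene copies, a single new mutant copy, and binomial resampling each generation until the mutant is lost or fixed. The function names, trial count and seed are hypothetical.

```python
import random


def simulate_fixation_probability(num_gene_copies, trials=20_000, seed=1):
    """Estimate the fixation probability of one new neutral mutant copy.

    num_gene_copies plays the role of 2N. Each generation, every copy
    is drawn independently and is mutant with probability equal to the
    current mutant frequency (Wright-Fisher resampling).
    """
    rng = random.Random(seed)
    fixed = 0
    for _ in range(trials):
        count = 1  # the new mutation starts as a single copy
        while 0 < count < num_gene_copies:
            freq = count / num_gene_copies
            count = sum(1 for _ in range(num_gene_copies)
                        if rng.random() < freq)
        if count == num_gene_copies:
            fixed += 1
    return fixed / trials


def substitution_rate(num_gene_copies, mu, p_fix):
    """k = (new mutations per generation) * (fixation probability)."""
    return num_gene_copies * mu * p_fix


if __name__ == "__main__":
    two_n = 20
    p_fix = simulate_fixation_probability(two_n)
    print(p_fix)  # should be close to 1 / (2N) = 0.05
    # With p_fix = 1 / (2N), k equals mu whatever the population size.
    for copies in (20, 200, 2000):
        print(copies, substitution_rate(copies, 1e-8, 1.0 / copies))
```

The small population size keeps the run time short. Larger values of 2N give a smaller fixation probability and need more trials for the same precision.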