Lecture 3: Effective population size, mutations, heterozygosity

Evolutionary Genetics: Part 3
Coalescent 2 – Effective Population size
S. chilense
S. peruvianum
Winter Semester 2012-2013
Prof Aurélien Tellier
FG Populationsgenetik
Color code:
Red = Important result or definition
Purple = exercise to do
Green: some bits of maths
Population genetics: 4 evolutionary forces
Random genomic processes (mutation, duplication, recombination, gene conversion)
Natural selection
Random spatial process (migration)
Random demographic process (drift)
[Diagram: these four forces act on molecular diversity]
Effective population size
The coalescent
We can calculate many aspects of a genealogical (coalescent) tree for a
population of size 2N
Time to MRCA :
E[TMRCA] = 4N (1 – 1/n)
Length of a tree: E[L] ≈ 4N log(n-1)
Time of coalescence of last two lineages : E[T2] = 2N
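[Figure: coalescent tree for n = 5; expected time spent with 2, 3, 4 and 5 lineages: 2N, 2N/3, 2N/6 and 2N/10 generations]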
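The expectations above can be checked numerically. This is a minimal sketch, assuming time is counted in generations and that the expected time spent with k lineages is 2N divided by the number of pairs k(k-1)/2, which reproduces the values 2N, 2N/3, 2N/6 and 2N/10. The function names are illustrative only.

```python
from math import comb

def expected_interval(k, N):
    """Expected number of generations during which the sample has k lineages."""
    return 2 * N / comb(k, 2)

def expected_tmrca(n, N):
    """Sum of the intervals from n lineages down to 2; equals 4N(1 - 1/n)."""
    return sum(expected_interval(k, N) for k in range(2, n + 1))

def expected_tree_length(n, N):
    """An interval with k lineages contributes k branches to the total length."""
    return sum(k * expected_interval(k, N) for k in range(2, n + 1))

N, n = 1000, 5
print([expected_interval(k, N) for k in range(2, n + 1)])
print(expected_tmrca(n, N), 4 * N * (1 - 1 / n))
print(expected_tree_length(n, N), 4 * N * sum(1 / i for i in range(1, n)))
```

The last line compares the exact tree length with the sum form 4N (1 + 1/2 + ... + 1/(n-1)); the log formula above is an approximation of that sum.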
Definition
The real physical population is likely not to behave as in the Wright – Fisher model
Most populations show some kind of structure:
Geographic proximity of individuals,
Social constraints…
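The variance in the number of descendants may be larger than 1, the value expected under the (approximately Poisson) offspring distribution of the Wright – Fisher model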
Effective population size = size of a Wright – Fisher population that would
produce the same rate of genetic drift as the population of interest
One consequence of drift: do two randomly picked offspring individuals have a common
ancestor in the parent generation?
Definition
We will use here the inbreeding effective size = Ne
Also called identity by descent population size
Ne = 1/ (2 * P[T2 = 1])
Where T2 is given in generations, T2 = time until two lineages coalesce
This depends on the immediate previous generation!
An extension is:
Ne(t) = E[T2 ] / 2
This relates to the number of generations until a MRCA is found in the population
Definition
For the haploid Wright – Fisher model with 2N gene copies
Ne = 1/ (2 * P[T2 = 1])
With P[T2 = 1] = 1 / (2N)
So that Ne = N
The extension is:
Ne(t) = E[T2 ] / 2
With E[T2] = 2N
So that Ne(t) = N
For the Wright – Fisher model the two definitions agree
Calculating Ne
Diploid model with different numbers of males and females
Nf = number of females
Nm = number of males
Nf + Nm = N
P[T2 = 1] = (1- 1/(2N)) * N /(8NfNm)
Ne(t) = Ne = 4NfNm/(Nf+ Nm)
For example: when one male has a harem, Nf = 20 and Nm = 1
What is Ne ?
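A quick way to explore the formula Ne = 4NfNm/(Nf + Nm) is to evaluate it for a few sex ratios. This is a small sketch; the function name is an assumption made for illustration.

```python
def ne_two_sexes(n_females, n_males):
    """Ne = 4 Nf Nm / (Nf + Nm), the formula given above."""
    return 4 * n_females * n_males / (n_females + n_males)

print(ne_two_sexes(20, 1))    # harem example: Nf = 20, Nm = 1
print(ne_two_sexes(10, 10))   # equal sex ratio, for comparison
```

With an equal sex ratio the formula gives Ne = N; the more unequal the ratio, the closer Ne gets to four times the number of the rarer sex.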
Calculating N and Ne
Example based on the human population: many human genes have an MRCA less than
200,000 years ago
If one generation = 20 years, this is 200,000 / 20 = 10,000 generations
Since E[TMRCA] ≈ 4Ne generations, we need 4Ne < 10,000
Ne < 200,000 / (4*20) => Ne < 2,500 !
Of course N is much bigger in the human population, but Ne may be very small ☺
We will see how to estimate Ne from sequence data later on
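The arithmetic of this bound can be written out as a short sketch, assuming E[TMRCA] ≈ 4Ne generations as above; the function name is hypothetical.

```python
def ne_upper_bound(tmrca_years, generation_time_years):
    """Upper bound on Ne when the TMRCA (about 4 Ne generations) is below tmrca_years."""
    generations = tmrca_years / generation_time_years
    return generations / 4

print(ne_upper_bound(200_000, 20))
```

Changing the assumed generation time changes the bound proportionally, so the result should be read as an order of magnitude.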
The coalescent – 2
role of mutations
Coalescent tree + mutations
The distribution of mutations amongst individuals can be summarized as a tree (on
a genealogy)
“Coalescent theory” John Wakeley, 2009
Coalescent tree + mutations
How to add mutation on a coalescent tree?
In a Wright Fisher model: see drawing
µ = probability that an offspring changes its genotype (mutates)
And P[no mutation] = 1- µ
This means for example: for a two allele model A and a: mutation goes
from a to A, and vice versa
The classical model for DNA sequences is the so-called infinite site model
Definition: each new mutation hits a new site in the genome
So it cannot be masked by back mutation
Not affected by recurrent mutation
Every mutation is visible except if lost by drift
Models of mutation
There are other models of sequence evolution, but these will not be used
for now.
Infinite allele model
Definition: each mutation creates a new allele
Example on a tree
Finite site model
Definition: mutations fall on a finite number of sites
Example on a tree
Coalescent tree + mutations
How to add mutation on a coalescent tree?
µ = probability that an offspring changes its genotype (mutates)
And P[no mutation] = 1- µ
Do you see where this is going?
After t generations, what is the probability that there were no mutations?
$P[X > t] = (1 - \mu)^t \approx e^{-\mu t}$, where X is the waiting time (in generations) until the next mutation
So we can again draw the time until a new mutation occurs from an
exponential distribution
And put this on a tree, drawing for each branch the time to a new mutation
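The quality of the exponential approximation, and the drawing of waiting times, can be sketched as follows. The values of µ and t and the seed are arbitrary choices for illustration.

```python
import math
import random

def p_no_mutation_exact(mu, t):
    """Probability of no mutation in t generations: (1 - mu)^t."""
    return (1 - mu) ** t

def p_no_mutation_exponential(mu, t):
    """Exponential approximation exp(-mu * t)."""
    return math.exp(-mu * t)

def draw_time_to_mutation(mu, rng):
    """Waiting time (in generations) drawn from an exponential with rate mu."""
    return rng.expovariate(mu)

mu, t = 1e-3, 1000
rng = random.Random(1)
draws = [draw_time_to_mutation(mu, rng) for _ in range(20000)]
mean_wait = sum(draws) / len(draws)
print(p_no_mutation_exact(mu, t), p_no_mutation_exponential(mu, t))
print(mean_wait)  # should be close to 1/mu
```

The two probabilities agree closely as long as µ is small, which is the usual situation for per-site mutation rates.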
Coalescent tree + mutations
How to add mutation on a coalescent tree?
The mutation will be visible in all descendants from that branch
[Figure: coalescent tree with mutations; sequences at the tips (4 sites): AAAA, AAAA, TTAA, TTTT]
Coalescent tree + mutations
How to add mutation on a coalescent tree?
The mutation will be visible in all descendants from that branch
[Figure: the same tree before and after one more mutation]
4 sites: AAAA, AAAA, TTAA, TTTT
One more mutation
5 sites: AAAAA, AAAAG, TTAAA, TTTTA
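Under the infinite site model, each mutation adds one new variable site, and every sequence descending from the mutated branch carries the new base. The sketch below rebuilds the four-site and five-site alignments shown here. The set of carriers for each mutation is read off the sequences, not taken from the drawing, and the function name is made up for illustration.

```python
def add_mutation(sequences, carriers, ancestral="A", derived="T"):
    """Append one new site; sequences whose index is in `carriers` get the derived base."""
    return [
        seq + (derived if i in carriers else ancestral)
        for i, seq in enumerate(sequences)
    ]

# Start from four empty sequences (indices 0 to 3) and add one site per mutation.
seqs = ["", "", "", ""]
for carriers in ({2, 3}, {2, 3}, {3}, {3}):
    seqs = add_mutation(seqs, carriers)
four_sites = seqs

# One more mutation on the branch leading to the second sequence only.
five_sites = add_mutation(four_sites, {1}, derived="G")
print(four_sites)
print(five_sites)
```

A mutation carried by a single sequence sits on an external branch; one carried by several sequences sits on an internal branch.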
Mutations on a tree
For neutral mutations we can do this process without changing the shape of the
tree or the size of the tree
Tree topology = shape and branching of the tree
Branch lengths = length of branches usually in units of 2N generations
BECAUSE
Forward in time: a neutral mutation does not change the offspring distribution
of an individual
Backward in time: mutation does not change the probability to be picked as a
parent
Tree topology
Definitions: external branches and internal branches
Tree topology and mutation
We classify mutations (= SNPs) according to their frequency in the sample
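[Figure: tree with four sequences (1 to 4), mutation a on an internal branch and mutation b on an external branch]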
Mutation a is found in two sequences = doubleton
Mutation b is found in one sequence = singleton
Mutations on a tree
We are now interested in the number of mutations on each branch of the tree
For a branch of length l
The number of mutations follows a Poisson distribution with parameter (l µ)
So for the total tree: Poisson (Lµ)
Remember
$$E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$
So we define S as the total number of mutations on a tree (on a set of sequences)
$$E[S] = \mu\, E[L] = \frac{\theta}{4N} \cdot 4N \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$
With θ=4Neµ (here Ne = N, so µ = θ/4N)
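The expected number of segregating sites grows only slowly with the sample size, because it depends on n through the sum of 1/i. A minimal sketch, with an arbitrary θ = 2 chosen for illustration and hypothetical function names:

```python
def harmonic(n):
    """Sum of 1/i for i = 1 .. n-1 (the factor multiplying theta in E[S])."""
    return sum(1 / i for i in range(1, n))

def expected_segregating_sites(theta, n):
    """E[S] = theta * sum_{i=1}^{n-1} 1/i."""
    return theta * harmonic(n)

for n in (2, 5, 10, 50):
    print(n, expected_segregating_sites(2.0, n))
```

Going from 10 to 50 sequences adds far fewer new segregating sites than going from 2 to 10.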
The population mutation rate
θ=4Neµ
This is the crucial parameter: combines mutation and Ne
θ is called the population mutation rate or scaled mutation rate
We can estimate θ based on sequence data
Two estimators have been derived:
θπ derived by Tajima (1983)
θS (or θW) derived by Watterson (1975)
Watterson estimator
θS = θW is based on the number of segregating sites S in a sample, divided by the
expected total branch length of the tree (in units of 4N generations) for a sample of size n
defined as
$$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}$$
remember:
$$E[L] = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$
This is the number of segregating sites per unit of expected tree length
Tajima estimator
θπ is defined as the average number of differences over all pairs of sequences
in a sample
Based on πij which is the number of differences between two sequences i
and j
Defined as
$$\theta_\pi = \frac{1}{\binom{n}{2}} \sum_{i<j} \pi_{ij} = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij}$$
Because there are n(n-1)/2 pairs of sequences
So take all sequences, count the number of differences for every pair,
and then take the average
Tajima estimator
Based on πij which is the number of differences between two sequences i and j
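[Figure: tree with four sequences (1 to 4), mutation a on an internal branch and mutation b on an external branch]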
Different mutations count differently
Mutation a (a doubleton) is counted in four pairwise comparisons
Mutation b (a singleton) is counted in three pairwise comparisons
πij and thus θπ depends on how many mutations fall on internal or external
branches
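The counts four and three follow from a simple rule: a mutation carried by k of the n sequences distinguishes every carrier from every non-carrier, that is k(n - k) pairs. A small sketch with illustrative function names:

```python
def pairs_counting_mutation(k, n):
    """Number of pairs that differ at a site whose mutation is carried by k of n sequences."""
    return k * (n - k)

def contribution_to_theta_pi(k, n):
    """Share of one such mutation in the average over all n(n-1)/2 pairs."""
    return pairs_counting_mutation(k, n) / (n * (n - 1) / 2)

n = 4
for k in range(1, n):
    print(k, pairs_counting_mutation(k, n), contribution_to_theta_pi(k, n))
```

Mutations at intermediate frequency therefore weigh more in θπ than singletons, whereas every segregating site counts once in θS.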
Coalescent tree + mutations
Example of calculation
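4 sites
Sequence at the root of the tree: AAAA
Sample (n = 3): ATAA, TAAT, TATA
All 4 sites are segregating, so S = 4
$$\theta_S = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}} = \frac{4}{1 + \frac{1}{2}} = \frac{8}{3}$$
$$\theta_\pi = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij} = \frac{2}{3 \cdot 2}\,(3 + 3 + 2) = \frac{8}{3}$$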
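Both estimators can be computed directly from an alignment. The sketch below assumes the sample consists of the three sequences ATAA, TAAT and TATA (n = 3, with AAAA read as the sequence at the root), which is the reading consistent with the value 8/3 for both estimators. Function names are illustrative.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Number of alignment columns showing more than one base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def theta_watterson(seqs):
    """theta_S = S / sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    return segregating_sites(seqs) / sum(1 / i for i in range(1, n))

def theta_pi(seqs):
    """Average number of pairwise differences over all n(n-1)/2 pairs."""
    n = len(seqs)
    total = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    return total / (n * (n - 1) / 2)

sample = ["ATAA", "TAAT", "TATA"]
print(segregating_sites(sample), theta_watterson(sample), theta_pi(sample))
```

These are values per locus (here 4 sites); dividing by the number of sites gives per-site estimates.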
Watterson estimator
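Number of lineages × expected duration (in units of 2N generations), for n = 5:
2 * 1
3 * 1/3
4 * 1/6
5 * 1/10
$$E[L] = 2N\left(2 + 1 + \frac{2}{3} + \frac{1}{2}\right) = 2N \cdot 2\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right) = 4N \sum_{i=1}^{n-1} \frac{1}{i}$$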
Neutral model of coalescent
θ=4Neµ
Very important result:
θS = θπ (in expectation: E[θS] = E[θπ] = θ)
If the population follows
a neutral model of coalescent with constant population size!!!!
Estimating Ne
It is possible to estimate Ne based on the two estimators
IF and only IF you have independent data on the mutation rate
Ne = θπ / (4µ) = θS / (4µ)
This assumes:
Infinite site model
Constant Ne over time
Homogeneous population (equal coalescent probability for all pairs)
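Once θ and µ are expressed in the same units (both per site, or both per locus), the estimate of Ne is a single division. In the sketch below the θ value is made up purely for illustration, and µ is the per-base human rate quoted in this lecture; the function name is hypothetical.

```python
def estimate_ne(theta_per_site, mu_per_site):
    """Ne = theta / (4 mu); both arguments must be per site per generation."""
    return theta_per_site / (4 * mu_per_site)

theta_hypothetical = 0.001  # made-up per-site value, for illustration only
mu_human = 1.2e-8           # per base per generation
print(estimate_ne(theta_hypothetical, mu_human))
```

If the software reports θ per locus, divide it by the number of sites analysed before applying the formula.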
Estimating Ne
Exercise: Calculate θπ, θS and estimate Ne
For two datasets:
In human populations: TNFSF-5-Humans.fas
In Drosophila populations: 055-Droso.nex
Define populations in DnaSP using: data => define sequence sets
Then => Polymorphism analysis
For Drosophila: Europe and Africa
Mutation rate in humans = 1.2 × 10^-8 per base per generation (Scally and Durbin,
Nat Rev Genetics October 2012)
Mutation rate in Drosophila = 10^-8 per base per generation
What are the differences?
Heterozygosity
Definition: Heterozygosity H is the probability that two alleles taken
at random from a population are different at a random site or locus.
It is a key measure of diversity in populations
If H0 is the heterozygosity at generation 0, then at generation 1:
Assuming no new mutations
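$$H_1 = \frac{1}{2N_e} \cdot 0 + \left(1 - \frac{1}{2N_e}\right) H_0$$
First term: with probability 1/(2Ne) the two alleles have the same parent at
generation 0, and then the probability that they are different is 0
Second term: with probability 1 − 1/(2Ne) the two alleles have
different parents, and these parents have
probability H0 (by definition) of being different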
Heterozygosity
By iteration we get at generation t
$$H_t = \left(1 - \frac{1}{2N_e}\right)^t H_0$$
This means that in the absence of mutation, heterozygosity is lost at
a rate of 1/(2Ne) every generation
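The loss of heterozygosity can be followed generation by generation and compared with the closed form. This is a small sketch; the values H0 = 0.5 and Ne = 100 are arbitrary choices for illustration.

```python
def heterozygosity_by_iteration(h0, ne, generations):
    """Apply H_{t+1} = (1 - 1/(2 Ne)) H_t repeatedly."""
    h = h0
    for _ in range(generations):
        h = (1 - 1 / (2 * ne)) * h
    return h

def heterozygosity_closed_form(h0, ne, generations):
    """H_t = (1 - 1/(2 Ne))^t H_0."""
    return (1 - 1 / (2 * ne)) ** generations * h0

h0, ne = 0.5, 100
for t in (0, 10, 100, 1000):
    print(t, heterozygosity_by_iteration(h0, ne, t), heterozygosity_closed_form(h0, ne, t))
```

The smaller Ne is, the faster the decay: drift removes diversity quickly in small populations.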
Heterozygosity + mutation
With the infinite allele model assumption that every new mutation
creates a new allele:
Two contrary mechanisms drive the evolution of diversity in population:
genetic drift and mutation
If they have the same strength and balance each other = mutation-drift balance
The change in heterozygosity between two generations is:
$$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu (1 - H_t)$$
Heterozygosity + mutation
$$\Delta H = H_{t+1} - H_t = -\frac{1}{2N_e} H_t + 2\mu (1 - H_t)$$
First term: change of heterozygosity due to
random drift (always negative)
Second term: change of heterozygosity due to new
mutations (always positive)
At equilibrium the value of heterozygosity is Ĥ:
$$\Delta H = 0 \Rightarrow \hat{H} = \frac{4N_e\mu}{1 + 4N_e\mu}$$
Ĥ=θ / (1+ θ)
The value at equilibrium increases with increasing µ and Ne
WHY?
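Iterating the recursion for ΔH shows the heterozygosity converging to Ĥ = θ/(1 + θ) whatever the starting value. A minimal sketch, with arbitrary values Ne = 100 and µ = 10^-3 chosen so that convergence is fast; function names are illustrative.

```python
def next_heterozygosity(h, ne, mu):
    """One generation of drift (loss) and mutation (gain)."""
    delta = -h / (2 * ne) + 2 * mu * (1 - h)
    return h + delta

def equilibrium_heterozygosity(ne, mu):
    """H-hat = theta / (1 + theta) with theta = 4 Ne mu."""
    theta = 4 * ne * mu
    return theta / (1 + theta)

ne, mu = 100, 1e-3
h = 0.0
for _ in range(5000):
    h = next_heterozygosity(h, ne, mu)
print(h, equilibrium_heterozygosity(ne, mu))
```

Raising either Ne (weaker drift) or µ (more new alleles) raises θ and hence the equilibrium value.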
Mutation – Drift balance
In such a model, we are interested in:
What is the probability that a new mutation becomes fixed?
How long does it take to become fixed?
Using a coalescent argument: far enough in the future, the whole population descends
from a single gene copy of the present generation. Fixation of the mutation occurs if and only
if the mutant is that ancestor; this has probability 1/(2N)
The expected time to fixation is equal to the expected time to the MRCA,
so it is ≈ 4N generations
What do we expect for selected loci?
Mutation – Drift balance
Substitution rate = rate at which mutations get fixed in a
population/species
It is called k
A new mutation starts with frequency 1/(2N) in a population
The substitution rate is obtained by multiplying the number of new mutations arising in the
population per generation = 2Nµ
by the probability that one mutation becomes fixed = 1/(2N)
So k = 2Nµ * (1/(2N)) = µ
(Kimura)
Most striking result: k does not depend on the effective population size
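The fixation probability 1/(2N) can be checked with a small forward simulation. The sketch below assumes a neutral Wright – Fisher population of 2N = 20 gene copies and a fixed random seed; population size, number of replicates, µ and the function names are all arbitrary choices for illustration.

```python
import random

def mutation_fixes(two_n, rng):
    """Follow one new neutral mutation (1 copy among two_n) until loss or fixation."""
    copies = 1
    while 0 < copies < two_n:
        freq = copies / two_n
        # Wright-Fisher resampling: each offspring copy picks its parent at random
        copies = sum(1 for _ in range(two_n) if rng.random() < freq)
    return copies == two_n

def fixation_probability(two_n, replicates, seed=0):
    """Fraction of replicate new mutations that reach fixation."""
    rng = random.Random(seed)
    fixed = sum(mutation_fixes(two_n, rng) for _ in range(replicates))
    return fixed / replicates

two_n, mu = 20, 1e-6
p_fix = fixation_probability(two_n, replicates=5000)
k = two_n * mu * p_fix  # new mutations per generation x fixation probability
print(p_fix, 1 / two_n)
print(k, mu)
```

Because the simulation is stochastic, the estimate only approximates 1/(2N); the two factors 2N and 1/(2N) cancel, which is why k stays close to µ for any population size.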