Estimating pairwise genetic distance and nucleotide substitution models
(Models of sequence evolution)
The Phylogenetic Handbook: Chapter 4
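Distance methods
Sequence alignment (Clustal X) -> pairwise genetic distance (distance measures) -> tree (UPGMA / NJ)

Aligned sequences:
Gorilla   GGTCCTAGGCC
Human     GGTCACATGTC
Chimp     GGTCATATCTC
Orang     GATACCAGCAC

Pairwise distance matrix (G = Gorilla, C = Chimp, H = Human, O = Orang):
     G   C   H   O
G    0   5   4   5
C        0   2   6
H            0   6
O                0

Tree tips (figure): Gorilla, Chimp, Human, Orang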
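S1   AACTGCATGGTAACAGGTTC   (length 20)
S2   AAGTGCATGGTAACAGATTC   (length 20)
S3   AAGTGGATGGTAATAGATTC   (length 20)
S4   AAGTGGACGGTTATAGATTC   (length 20)

Genetic distance d (p distance) = number of differences between a pair of sequences / sequence length

      S1    S2    S3    S4
S1    0     2     4     6
S2    0.1   0     2     4
S3    0.2   0.1   0     2
S4    0.3   0.2   0.1   0
(above the diagonal: number of differences; below the diagonal: p distance)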
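The matrix above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `p_distance` is chosen here for illustration, and it assumes the sequences are already aligned and contain no gaps or missing data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(x != y for x, y in zip(a, b))
    return differences / len(a)


# The four aligned sequences from the slide.
sequences = {
    "S1": "AACTGCATGGTAACAGGTTC",
    "S2": "AAGTGCATGGTAACAGATTC",
    "S3": "AAGTGGATGGTAATAGATTC",
    "S4": "AAGTGGACGGTTATAGATTC",
}

# Print the p distance for every pair of sequences.
for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2):
    print(name_a, name_b, round(p_distance(seq_a, seq_b), 2))
```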
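p distance = number of substitutions between a pair of sequences / total length of the sequence

S1   AACTGCATGGTAACAGGTTC   (length 20)
S2   ????GCATGGTAACAGATTC   (length 16)
S3   AAGTGGATGGTAA???????   (length 13)
S4   AAGTGGA?????ATAGATTC   (length 15)

      S1     S2     S3     S4
S1    0      1      2      4
S2    0.06   0      1      2
S3    0.15   0.11   0      0
S4    0.26   0.18   0      0
(above the diagonal: number of differences; below the diagonal: p distance)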
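The values on this slide are consistent with comparing only the sites that are present in both sequences (for example, S2 and S3 share 9 sites and differ at 1, giving about 0.11). The sketch below assumes that reading; the function name and the `missing` parameter are illustrative, not part of the original slides.

```python
def p_distance_pairwise_deletion(a: str, b: str, missing: str = "?") -> float:
    """p distance over the sites where neither sequence has missing data."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == missing or y == missing:
            continue  # skip sites that are missing in either sequence
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("the sequences share no comparable sites")
    return differences / compared


# Sequences with missing data, as shown on the slide.
partial = {
    "S1": "AACTGCATGGTAACAGGTTC",
    "S2": "????GCATGGTAACAGATTC",
    "S3": "AAGTGGATGGTAA???????",
    "S4": "AAGTGGA?????ATAGATTC",
}

print(round(p_distance_pairwise_deletion(partial["S2"], partial["S3"]), 2))
```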
Genetic distance (evolutionary distances)
• They measure the total number of substitutions that occurred on both lineages since divergence from the last common ancestor.
• p distance: d = number of substitutions between a pair of sequences / sequence length
• Expressed in substitutions per site
• Can be used to calculate the average rate of substitution.
• Rate = d / 2t, where t is the time since divergence (the factor 2 accounts for the two lineages)
[Figure: an ancestor sequence diverging into sequence 1 and sequence 2]
sequence 1:   ATGTTCCATT   (1 substitution from the ancestor)
ancestor:     ATGTTGCATT
sequence 2:   ATCTTACATT   (2 substitutions from the ancestor)
Pairwise genetic difference = 2
Pairwise genetic distance d = 0.2 (also called p distance)
When the p distance is large, it underestimates the actual divergence.
Correction for multiple hits
species 1:           ATGTTCCATT   (1 substitution)
ancestral species:   ATGTTGCATT
species 2:           ATCTTACATT   (2 substitutions)

              Observed   Actual
Difference    2          3
p distance    0.2        0.3
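The observed and actual counts in this example can be checked directly from the three sequences. The sketch below assumes, as the slide labels do, that every difference between the ancestor and a descendant corresponds to exactly one substitution; the helper name `count_differences` is illustrative.

```python
def count_differences(a: str, b: str) -> int:
    """Number of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


ancestor = "ATGTTGCATT"
species1 = "ATGTTCCATT"
species2 = "ATCTTACATT"

# Differences visible when only the two living species are compared.
observed = count_differences(species1, species2)

# Substitutions along both lineages (assumes one substitution per difference
# from the ancestor, as labelled on the slide).
actual = count_differences(ancestor, species1) + count_differences(ancestor, species2)

print("observed:", observed, "p distance:", observed / len(ancestor))
print("actual:", actual, "distance:", actual / len(ancestor))
```

Site 6 changed on both lineages, so two substitutions show up as a single observed difference.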
[Figure: panels a) to e), each showing the base at one site in the ancestral species and in the daughter species, and comparing the actual number of changes (Act. d) with the observed number of differences (Obs. d)]
The problem of hidden or multiple changes
• d (true genetic distance) ≥ fraction of observed differences (p)
• d = p + hidden changes
• Through hypotheses about the nature of the base substitution
process (models), it becomes possible to estimate d from
observed differences between sequences.
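[Figure: observed d (p distance) and actual d plotted against time; the gap between the two curves is the correction]
[Figure: p distance and actual d plotted against time for a neutral marker and for a protein coding gene]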
Models of sequence evolution / nucleotide substitution
Jukes-Cantor model (1969)
Assumes:
• All substitutions are equally likely
• No among-site rate variation (each site is equally likely to undergo substitution)
• Base frequencies are in equilibrium
Equilibrium base frequencies: A = T = C = G = 1/4 = 0.25 (all four bases are present in equal proportion in the sequence)
AAAAAAAA
TAGCACTG
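Jukes-Cantor model (1969)
[Figure: base A with arrows labelled $P_{AG}$, $P_{AC}$ and $P_{AT}$ to G, C and T]
$P_{AG} = P_{AT} = P_{AC}$
Similarly:
$P_{GA} = P_{GT} = P_{GC}$
$P_{TA} = P_{TG} = P_{TC}$
$P_{CA} = P_{CT} = P_{CG}$
It is also a reversible model: $P_{AG} = P_{GA}$
$P_{ij}(t)$ = probability of change from state i to state j at time t
$P_{ii}(t)$ = probability of no change at time t
[Figure: the four bases A, G, C and T, each with a $P_{ii}$ self-loop and $P_{ij}$ arrows to the other three bases]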
The probability of no mutation at a site over different time intervals approximates a (negative) exponential distribution.
Probability of no mutation: $e^{-v}$
Probability of mutation: $1 - e^{-v}$
where $v = \mu t$
$\mu$ = mutation rate = $10^{-4}$
$t$ = time, in million years or in generations
[Figure: probability of no mutation, $e^{-v}$ (y axis, 0 to 1), plotted against generations (x axis, 0 to 35000)]
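The curve described above can be tabulated with a short script. This is a sketch using the mutation rate given on the slide; the function name `prob_no_mutation` and the chosen generation values are illustrative.

```python
import math


def prob_no_mutation(mu: float, t: float) -> float:
    """Probability that a site has not mutated after time t, with v = mu * t."""
    v = mu * t
    return math.exp(-v)


mu = 1e-4  # mutation rate from the slide

# Probability of no mutation and of mutation at a few points along the x axis.
for generations in (0, 5000, 10000, 20000, 35000):
    p_none = prob_no_mutation(mu, generations)
    print(generations, round(p_none, 3), round(1 - p_none, 3))
```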
Jukes and Cantor (1969)
[Figure: base A with arrows labelled $P_{AC}$, $P_{AG}$ and $P_{AT}$ to C, G and T]
$P_{ii}(t)$ = probability of no change at time t
Two ways to get the same base
Possible bases at a site: A, T, G, C
$P_{AA}(t)$: probability of picking any base × probability of picking the same base = 1 × 1/4 = 1/4
Probability of picking a different base = 3/4
Probability that it does not undergo mutation: 3/4 × $e^{-v}$
$P_{ii}(t) = \frac{1}{4} + \frac{3}{4}e^{-v}$
Probability of no mutation at a given site is $e^{-v}$, where $v = \mu t$
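$P_{ij}(t)$ = probability of change from state i to state j at time t
$P_{ij}$ = probability of picking any base × probability of picking a particular base (1/4) × probability of mutation
Probability of mutation at a given site is $1 - e^{-v}$
[Figure: base A with arrows $P_{AG}$, $P_{AC}$ and $P_{AT}$ to G, C and T, each equal to $\frac{1}{4}(1 - e^{-v})$]
$P_{ij}(t) = \frac{1}{4}(1 - e^{-v})$
$P_{ii}(t) = \frac{1}{4} + \frac{3}{4}e^{-v}$
[Figure: the four bases A, G, C and T, each with a $P_{ii}$ self-loop and $P_{ij}$ arrows to the other three bases]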
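As a quick check on the two probabilities above, the sketch below builds the full 4 × 4 Jukes-Cantor matrix for a given v and confirms that each row sums to 1 (one $P_{ii}$ plus three $P_{ij}$). The function name and the value v = 0.5 are illustrative.

```python
import math

BASES = "ACGT"


def jc69_transition_matrix(v: float) -> dict:
    """Jukes-Cantor probabilities P[i][j] for a branch with v = mu * t."""
    same = 0.25 + 0.75 * math.exp(-v)        # P_ii(t)
    different = 0.25 * (1.0 - math.exp(-v))  # P_ij(t), i != j
    return {i: {j: (same if i == j else different) for j in BASES} for i in BASES}


P = jc69_transition_matrix(0.5)
for i in BASES:
    row = [round(P[i][j], 4) for j in BASES]
    print(i, row, "row sum =", round(sum(P[i].values()), 6))
```

At v = 0 the matrix is the identity; as v grows, every entry approaches 1/4, the equilibrium base frequency.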
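$P_{ij}(t) = \frac{1}{4}(1 - e^{-v})$
$p = \frac{3}{4}(1 - e^{-2v})$
Number of mutations along a branch = v
$v = -\frac{1}{2}\ln\left(1 - \frac{4}{3}p\right)$
Actual distance d = mutations along two branches × fraction of possible changes (3 of the 4 bases differ from the current one)
$d = 2v \cdot \frac{3}{4}$
$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$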
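The final formula converts an observed p distance into a corrected distance. A minimal sketch follows; the function name `jc69_distance` is illustrative, and the guard reflects the fact that the logarithm is undefined once p reaches 3/4.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance d = -3/4 * ln(1 - 4/3 * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# The correction is small for small p and grows quickly as p increases.
for p in (0.1, 0.2, 0.3, 0.5):
    print(p, round(jc69_distance(p), 4))
```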
Transitions, Ts (α): purine <-> purine (A <-> G) or pyrimidine <-> pyrimidine (C <-> T)
Transversions, Tv (β): purine <-> pyrimidine
Nuclear DNA: Ts = 2-4 × Tv
mtDNA: Ts = 8-10 × Tv
Base frequencies: A = T = G = C, p = 0.25
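Kimura's two-parameter distance model
• Hypotheses of the model:
Substitutions occur according to two probabilities: one for transitions, one for transversions.
Transitions: G <-> A or C <-> T
Transversions: other changes
• Parameters: α (transitions), β (transversions)
Assumptions:
Equal base frequencies (all four bases are present in equal proportion in the sequence), f: A = T = G = C
No among-site rate variation
$d = -\frac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$
where P is the proportion of sites with transitional differences and Q is the proportion of sites with transversional differences
Kimura (1980) J. Mol. Evol. 16:111
[Figure: A <-> G and C <-> T joined by α (transitions); each purine joined to each pyrimidine by β (transversions)]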
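A sketch of the Kimura two-parameter distance for a pair of aligned sequences is given below. It assumes the sequences contain only A, C, G and T (no gaps or ambiguity codes); the function name `k2p_distance` is illustrative, and the example pair reuses sequences S1 and S4 from the p distance slide.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a: str, b: str) -> float:
    """Kimura two-parameter distance from transition and transversion proportions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1    # A <-> G or C <-> T
        else:
            transversions += 1  # purine <-> pyrimidine
    P = transitions / len(a)
    Q = transversions / len(a)
    # math.log raises ValueError if the sequences are too divergent (argument <= 0).
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


s1 = "AACTGCATGGTAACAGGTTC"
s4 = "AAGTGGACGGTTATAGATTC"
print(round(k2p_distance(s1, s4), 4))
```

When transversions are exactly twice as frequent as transitions (Q = 2P), the formula reduces to the Jukes-Cantor correction.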
Equal base frequencies (equilibrium frequency assumption)
all four bases are present in equal proportion in the sequence
f : A=T=G=C
Cytochrome b, 1140 bp (% base composition)
Species          T(U)   C      A      G
H. agilis        24.3   33.8   28.3   13.6
H. sapiens       25.2   34.3   28.7   11.9
M. mulatta       25.2   34.0   29.1   11.7
P. pygmaeus      25.2   34.1   29.2   11.5
G. gorilla       24.8   34.6   28.9   11.7
P. troglodytes   24.9   34.4   29.0   11.7
Average          24.9   34.2   28.9   12.0
Expected         25     25     25     25
Variation in GC content (25-75%)
If some bases are more common than others then we might expect
some substitutions to be more common than others.
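Base composition of the kind tabulated above can be computed from any sequence. The sketch below is illustrative only: the function names are hypothetical, characters other than A, C, G and T are ignored, and the example reuses the short sequence S1 from the p distance slide rather than the cytochrome b data.

```python
from collections import Counter


def base_composition(sequence: str) -> dict:
    """Percentage of A, C, G and T among the unambiguous bases of a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def gc_content(sequence: str) -> float:
    """Percentage of G plus C."""
    composition = base_composition(sequence)
    return composition["G"] + composition["C"]


example = "AACTGCATGGTAACAGGTTC"
print(base_composition(example))
print(gc_content(example))
```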
Models                   Corrections
Jukes and Cantor 1969    multiple hits; but all substitutions equally likely, equal base frequencies
Kimura 1980              multiple hits, Ts vs Tv; but equal base frequencies
Felsenstein 1981         multiple hits, unequal base frequencies; but all substitutions equally likely
Hasegawa et al. 1985     multiple hits, unequal base frequencies, Ts vs Tv
General reversible       multiple hits, unequal base frequencies, all six substitutions have different rates
Add-on:
Gamma correction         among-site rate variation
GTR model: Substitution probability matrix for V
Each substitution class has a different rate
6 parameter values have to be estimated
     A   G   C   T
A    x   a   b   d
G    a   x   c   e
C    b   c   x   f
T    d   e   f   x
Diagonal entry for row A: $x_A = 1 - (a + b + d)$
[Figure: the four bases A, G, C and T joined by six arrows labelled a to f]
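The matrix layout above can be sketched as a small function that fills the six off-diagonal classes symmetrically and sets each diagonal entry so that its row sums to 1, as in $x_A = 1 - (a + b + d)$. The function name and the numeric values are hypothetical, chosen only to illustrate the structure.

```python
BASES = ("A", "G", "C", "T")


def gtr_matrix(a: float, b: float, c: float, d: float, e: float, f: float) -> dict:
    """Symmetric 4 x 4 matrix with six substitution classes a to f."""
    off_diagonal = {
        ("A", "G"): a, ("A", "C"): b, ("A", "T"): d,
        ("G", "C"): c, ("G", "T"): e, ("C", "T"): f,
    }
    matrix = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                # Look up the pair in either order: the matrix is symmetric.
                matrix[i][j] = off_diagonal.get((i, j), off_diagonal.get((j, i)))
        # Diagonal entry makes the row sum to 1.
        matrix[i][i] = 1.0 - sum(matrix[i].values())
    return matrix


# Hypothetical values for the six classes (not taken from the slides).
M = gtr_matrix(a=0.10, b=0.02, c=0.03, d=0.01, e=0.02, f=0.12)
for i in BASES:
    print(i, [round(M[i][j], 2) for j in BASES])
```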
Among-site rate variation / rate heterogeneity
The substitution rate can vary substantially among positions in a sequence.
Protein coding genes: third position > first position > second position
Introns > exons
Gamma distribution, shape parameter alpha (α)
α > 1: bell-shaped curve, weak rate heterogeneity
α < 1: L-shaped curve, strong rate heterogeneity
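The effect of the shape parameter can be illustrated by drawing per-site rates from a gamma distribution with mean 1. This is a simulation sketch, not the discrete-gamma procedure used by phylogenetics software; the function name, the seed, the number of sites and the 0.1 threshold are all illustrative choices.

```python
import random


def sample_site_rates(alpha: float, n_sites: int, seed: int = 0) -> list:
    """Draw relative substitution rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # Shape alpha and scale 1/alpha give mean 1 and variance 1/alpha.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20000)
    mean_rate = sum(rates) / len(rates)
    nearly_invariant = sum(r < 0.1 for r in rates) / len(rates)
    print("alpha =", alpha,
          "mean rate =", round(mean_rate, 2),
          "fraction of sites with rate < 0.1 =", round(nearly_invariant, 3))
```

With α < 1 a large share of sites evolve very slowly while a few evolve fast (the L-shaped curve); with α > 1 most sites have rates close to the mean.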
The genetic code
TTT Phe    TCT Ser    TAT Tyr     TGT Cys
TTC Phe    TCC Ser    TAC Tyr     TGC Cys
TTA Leu    TCA Ser    TAA stop    TGA stop
TTG Leu    TCG Ser    TAG stop    TGG Trp
CTT Leu    CCT Pro    CAT His     CGT Arg
CTC Leu    CCC Pro    CAC His     CGC Arg
CTA Leu    CCA Pro    CAA Gln     CGA Arg
CTG Leu    CCG Pro    CAG Gln     CGG Arg
ATT Ile    ACT Thr    AAT Asn     AGT Ser
ATC Ile    ACC Thr    AAC Asn     AGC Ser
ATA Ile    ACA Thr    AAA Lys     AGA Arg
ATG Met    ACG Thr    AAG Lys     AGG Arg
GTT Val    GCT Ala    GAT Asp     GGT Gly
GTC Val    GCC Ala    GAC Asp     GGC Gly
GTA Val    GCA Ala    GAA Glu     GGA Gly
GTG Val    GCG Ala    GAG Glu     GGG Gly
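One way to connect this table to the ordering third position > first position > second position is to count, for each codon position, how many single-base changes leave the amino acid unchanged. The sketch below does this for the standard code shown above; the function name is illustrative, stop codons are excluded as starting points, and degeneracy is only one of the factors behind the rate differences.

```python
from itertools import product

BASES = "TCAG"
# Amino acids (one-letter codes, * = stop) in the same order as the table:
# first base varies slowest, third base fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that do not change the encoded amino acid."""
    synonymous = 0
    total = 0
    for codon, aa in GENETIC_CODE.items():
        if aa == "*":
            continue  # start only from sense codons
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            synonymous += GENETIC_CODE[mutant] == aa
    return synonymous / total


for position in range(3):
    print("codon position", position + 1, round(synonymous_fraction(position), 3))
```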
Gamma distribution (shape parameter α)
http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-Criteria/Phylogeny-Criteria.html
Other assumptions:
1) All nucleotide sites change independently of each other.
2) The substitution rate is constant over time and in different lineages
(Molecular clock assumption).
3) Base composition across taxa is at equilibrium.
Which model to choose?
Usually, more complex models fit the data better.
But the analysis becomes computationally intensive (it takes more time).
More parameters have to be estimated from the same amount of data, so each estimate carries more error.
For mtDNA genes, HKY85 is often used.
Model selection: likelihood ratio test (Modeltest), AIC (jModelTest)
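The trade-off between fit and number of parameters can be sketched with the AIC, defined as 2k - 2 ln L, and with the likelihood ratio statistic 2(ln L1 - ln L0). Everything numeric below is hypothetical: the log-likelihoods are invented for illustration, and the parameter counts follow one common convention (substitution-model parameters only, branch lengths excluded). This is not how Modeltest or jModelTest are invoked.

```python
def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_parameters - 2 * log_likelihood


def lrt_statistic(lnl_simple: float, lnl_complex: float) -> float:
    """Likelihood ratio statistic for two nested models."""
    return 2 * (lnl_complex - lnl_simple)


# Free substitution-model parameters under one common counting convention
# (for example GTR: 5 relative rates + 3 base frequencies).
FREE_PARAMETERS = {"JC69": 0, "K80": 1, "F81": 3, "HKY85": 4, "GTR": 8}

# Hypothetical log-likelihoods, invented for this example.
hypothetical_lnl = {
    "JC69": -2150.0, "K80": -2105.0, "F81": -2120.0,
    "HKY85": -2070.0, "GTR": -2068.5,
}

scores = {model: aic(hypothetical_lnl[model], k) for model, k in FREE_PARAMETERS.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest AIC:", best)
print("LRT HKY85 vs GTR:", lrt_statistic(hypothetical_lnl["HKY85"], hypothetical_lnl["GTR"]))
```

In this made-up example GTR has the highest likelihood, but its extra parameters are not justified by the small gain, so the AIC prefers HKY85.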