Models of sequence evolution III: time reversibility

Model assumptions and violations
Taxon          A     C     G     T
Hemiechinus    0.34  0.20  0.11  0.36
Hedgehog       0.32  0.21  0.12  0.36
Echinosorex    0.34  0.21  0.11  0.34
Pipistrellus   0.32  0.24  0.12  0.32
Echinops       0.30  0.24  0.12  0.34
Mogera         0.34  0.24  0.12  0.29
Urotrichus     0.34  0.25  0.12  0.29
Dormouse       0.31  0.25  0.12  0.33
Thryonomys     0.33  0.25  0.11  0.30
...
Horse          0.31  0.30  0.12  0.26
Rhinolophus    0.30  0.30  0.13  0.27
Dugong         0.29  0.30  0.14  0.27
Hippo          0.32  0.31  0.13  0.25
Donkey         0.31  0.31  0.12  0.26
Pika           0.30  0.31  0.12  0.27
SpermWhale     0.31  0.32  0.12  0.25
Macaca         0.31  0.32  0.12  0.26
Baboon         0.31  0.32  0.12  0.25
Gorilla        0.29  0.32  0.12  0.26
Chimp          0.29  0.33  0.12  0.26
Human          0.29  0.33  0.12  0.26
Gibbon         0.29  0.33  0.13  0.25
Orangutan      0.29  0.34  0.12  0.25
Recall: nucleotide frequency bias. Large variability across taxa is easy to see. Contrast especially the hedgehogs with the hominoid primates.
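As a rough illustration of how per-taxon frequencies like these can be computed and compared, here is a minimal Python sketch. The function name `base_frequencies` and the two toy sequences are assumptions made for the example, not part of the lecture data.

```python
from collections import Counter

def base_frequencies(seq):
    """Return the proportion of A, C, G and T in a DNA sequence.

    Symbols other than A, C, G, T (gaps, ambiguity codes) are ignored.
    """
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}

# Hypothetical toy sequences, NOT real data.
toy = {
    "taxon_1": "ATTTACGTTA",  # T-rich, C-poor
    "taxon_2": "ACCCAGCTCA",  # C-rich, T-poor
}
freqs = {name: base_frequencies(seq) for name, seq in toy.items()}

# Range of each base's frequency across taxa.
# A wide range for any base hints at nonstationary base composition.
spread = {
    base: max(f[base] for f in freqs.values()) - min(f[base] for f in freqs.values())
    for base in "ACGT"
}

for name, f in freqs.items():
    print(name, {base: round(value, 2) for base, value in f.items()})
print("spread", {base: round(value, 2) for base, value in spread.items()})
```

In this toy case the spread is concentrated in C and T, the same two bases that separate hedgehogs from hominoid primates in the table.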
• All models are subsets of the General Time Reversible (GTR) model. Assumptions:
– Stationarity: no base composition bias across taxa (across the tree).
– Symmetric substitution matrix, implying time reversibility: p(A>T)=p(T>A), p(A>C)=p(C>A)...
• Actual sequence data:
– Asymmetric substitution matrix: time irreversibility
– Nonstationary nucleotide frequencies that vary across taxa: base composition bias.
Hedgehogs have very different nucleotide frequencies from other mammals
[Figure: principal component plot of nucleotide frequencies, 1st PC vs. 2nd PC. Hedgehogs sit at the low C, high T end; the opposite end is high C, low T.]
General Time Reversible model
GTR, the most general time-reversible model, has a symmetric substitution matrix:
p(A>T)=p(T>A), p(A>C)=p(C>A)...

      A   C   G   T
  A   -   α   β   δ
  C   α   -   γ   ε
  G   β   γ   -   µ
  T   δ   ε   µ   -
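The symmetric matrix above can be turned into a small numerical check. The sketch below assumes the usual GTR parameterization, in which each symmetric parameter is multiplied by the frequency of the target base (q_ij = s_ij * pi_j), and tests time reversibility as detailed balance (pi_i * q_ij = pi_j * q_ji). The parameter values, the base frequencies and the function names are illustrative assumptions, not values from the lecture.

```python
BASES = "ACGT"

def gtr_rate_matrix(exch, freqs):
    """Build a 4x4 rate matrix with q_ij = s_ij * pi_j for i != j.

    exch maps an unordered base pair (frozenset) to its symmetric parameter;
    freqs maps each base to its equilibrium frequency. Rows sum to zero.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = exch[frozenset((a, b))] * freqs[b]
        q[i][i] = -sum(q[i])  # diagonal balances the off-diagonal rates
    return q

def is_time_reversible(q, freqs, tol=1e-12):
    """Detailed balance: pi_i * q_ij equals pi_j * q_ji for every pair."""
    pi = [freqs[b] for b in BASES]
    return all(
        abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
        for i in range(4)
        for j in range(4)
    )

# Hypothetical values for the six symmetric parameters of the matrix.
exch = {
    frozenset("AC"): 1.0,  # alpha
    frozenset("AG"): 4.0,  # beta
    frozenset("AT"): 1.0,  # delta
    frozenset("CG"): 1.0,  # gamma
    frozenset("CT"): 4.0,  # epsilon
    frozenset("GT"): 1.0,  # mu
}
freqs = {"A": 0.29, "C": 0.33, "G": 0.12, "T": 0.26}  # illustrative frequencies

q = gtr_rate_matrix(exch, freqs)
print(is_time_reversible(q, freqs))  # True

# Break the symmetry: double the C>T rate only, then fix the diagonal.
q_asym = [row[:] for row in q]
q_asym[1][3] *= 2
q_asym[1][1] = 0.0
q_asym[1][1] = -sum(q_asym[1])
print(is_time_reversible(q_asym, freqs))  # False
```

Keying the parameters by unordered pair makes the symmetry structural: there is simply no way to give A>C and C>A different values without editing the finished matrix, as the second half of the example does.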
Model violations (ML & Bayesian)
• Asymmetric substitution matrix from MacClade
– Easy to check graphically. Appears asymmetric.
– e.g. observed: p(C>T) > p(T>C)
[Figure: MacClade charts of observed and expected changes among A, C, G, T]
• Statistical test: view the graph as a table, export it, and analyze it statistically.
Model violations (ML & Bayesian)
• Asymmetric substitution matrix
– Ho: symmetric substitution rates
– e.g. p(A>G) > p(G>A)…

63 mammals, 11 kb mtDNA: asymmetry of 12 nucleotide changes

Position   χ² (6 df)   p
1st        322         < 0.001
2nd        4.3         NS
3rd        1539        << 0.001

[Figures: 63 mammals, 11 kb mtDNA. Panel 1: transitions. Panel 2: transitions & transversions.]
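The slides do not say which statistic produced these χ² values. One standard choice that gives exactly 6 degrees of freedom for a 4x4 table of changes is Bowker's test of symmetry, which compares each of the six pairs of opposite changes. The sketch below assumes that test; the count table is hypothetical and the function names are made up for the example.

```python
import math

def chi2_sf_6df(x):
    """Upper-tail probability of a chi-square variable with 6 df (closed form)."""
    half = x / 2.0
    return math.exp(-half) * (1.0 + half + half * half / 2.0)

def bowker_symmetry_test(counts):
    """Bowker's test of symmetry on a 4x4 table of change counts.

    counts[i][j] is the number of inferred changes from base i to base j
    (order A, C, G, T). Returns (chi2, df, p). Every pair of opposite
    changes must have at least one observation, so df is always 6.
    """
    if len(counts) != 4 or any(len(row) != 4 for row in counts):
        raise ValueError("expected a 4x4 table")
    chi2 = 0.0
    for i in range(4):
        for j in range(i + 1, 4):
            total = counts[i][j] + counts[j][i]
            if total == 0:
                raise ValueError("a pair of opposite changes has no observations")
            chi2 += (counts[i][j] - counts[j][i]) ** 2 / total
    return chi2, 6, chi2_sf_6df(chi2)

# Hypothetical change counts (rows: from, columns: to; order A, C, G, T).
counts = [
    [0, 10, 40, 12],
    [8, 0, 9, 90],
    [20, 11, 0, 7],
    [10, 45, 9, 0],
]
chi2, df, p = bowker_symmetry_test(counts)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4g}")
```

In this made-up table the A>G versus G>A and C>T versus T>C pairs contribute nearly all of the statistic, which is the kind of pattern the transition panels are meant to show.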
Whale “lice” (cyamids) COI sequence
• Asymmetry is especially problematic for ancient divergences.
• The cyamid tree has much more recent divergences than the mammal tree.
• But it still appears asymmetric.

Position   χ² (6 df)   p
1st        10.3        NS
2nd        6.7         NS
3rd        73.5        << 0.001

– Small sample (n=31): randomly asymmetric.
– 1st position (n=166): looks a little asymmetric, but p > 0.05.
– 3rd position (n=848): truly departs from symmetry.
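The role of sample size in these results can be seen in a small worked example. The sketch below assumes a Bowker-style symmetry statistic (the slides do not name the test) and uses a hypothetical count table. Multiplying every count by 10 keeps the proportions identical but multiplies the statistic by 10, moving it across 12.59, the 0.05 critical value for 6 degrees of freedom.

```python
def symmetry_chi2(counts):
    """Sum of (n_ij - n_ji)^2 / (n_ij + n_ji) over pairs of opposite changes."""
    size = len(counts)
    return sum(
        (counts[i][j] - counts[j][i]) ** 2 / (counts[i][j] + counts[j][i])
        for i in range(size)
        for j in range(i + 1, size)
        if counts[i][j] + counts[j][i] > 0
    )

CRITICAL_6DF_05 = 12.59  # chi-square critical value, 6 df, alpha = 0.05

# Hypothetical small table of changes (rows: from, columns: to; order A, C, G, T).
small = [
    [0, 2, 4, 1],
    [1, 0, 1, 6],
    [2, 1, 0, 1],
    [1, 3, 1, 0],
]
# Same proportions, ten times as many observed changes.
large = [[10 * count for count in row] for row in small]

for label, table in (("small", small), ("large", large)):
    n = sum(sum(row) for row in table)
    stat = symmetry_chi2(table)
    verdict = "reject symmetry" if stat > CRITICAL_6DF_05 else "NS"
    print(f"{label}: n = {n}, chi2 = {stat:.1f}, {verdict}")
```

A 2:1 excess that is indistinguishable from noise with a couple of dozen changes becomes a clear departure from symmetry with a few hundred.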
Why do we care?
• No easy solutions at this point
– Time irreversibility is still difficult to implement
• Nevertheless, knowledge helps
– Gives clues to why we obtain odd results
– In turn, lowers or heightens our faith in the results.
– Inspires math nerds to come up with better (not bigger)
models for us.
• Use maximum likelihood:
– Statistically robust to departures from model
assumptions.
Suggested Reading
• Rosenberg, MS. 2005. MySSP: Non-stationary evolutionary sequence simulation, including indels. Evolutionary Bioinformatics Online. 81-83.
• Gadagkar, SR, S Kumar. 2005. Letter. Maximum Likelihood Outperforms Maximum Parsimony Even When Evolutionary Rates Are Heterotachous. Molecular Biology and Evolution 22(11):2139-2141.
• Jayaswal, V, LS Jermiin, J Robinson. 2005. Estimation of Phylogeny Using a General Markov Model. Evolutionary Bioinformatics Online. 62-80.
• Galtier, N, M Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Molecular Biology and Evolution 15(7):871-9.