Simple stochastic models

Models for DNA substitution
http://www.stat.rice.edu/~mathbio/Polanski/stat655/
Plan
• Basics
• Models in discrete time
• Models in continuous time
• Parameter estimation
Nucleotides
Purines:
• Adenine (A) or (a)
• Guanine (G) or (g)
Pyrimidines:
• Cytosine (C) or (c)
• Thymine (T) or (t)
Substitution
Transitions (purine → purine, pyrimidine → pyrimidine):
A→G, G→A, C→T, T→C
Transversions (purine → pyrimidine, pyrimidine → purine):
A→T, T→A, A→C, C→A
G→T, T→G, G→C, C→G
Other
Deletions, insertions
Insertions in reverse order
Hypothesis
Substitution of nucleotides in the evolution
of DNA sequences can be modeled by a
Markov chain or Markov process
Other assumptions
• Stationarity
• Reversibility
Transition matrix
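\[
P = \begin{pmatrix}
p_{aa} & p_{ag} & p_{ac} & p_{at} \\
p_{ga} & p_{gg} & p_{gc} & p_{gt} \\
p_{ca} & p_{cg} & p_{cc} & p_{ct} \\
p_{ta} & p_{tg} & p_{tc} & p_{tt}
\end{pmatrix}
\]
Rows and columns are ordered a, g, c, t; p_ij is the probability that nucleotide i is followed by nucleotide j after one step.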
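The Markov chain hypothesis can be illustrated by following a single site through repeated steps. This is a minimal sketch, assuming the state order a, g, c, t; the function name `simulate_site` and the example matrix (every substitution with probability 0.1) are hypothetical choices for illustration only.

```python
import random

NUCLEOTIDES = "agct"  # assumed row/column order of the transition matrix

def simulate_site(P, start, n_steps, rng):
    """Follow one site for n_steps steps of the chain; return the visited states."""
    state = start
    path = [state]
    for _ in range(n_steps):
        row = P[NUCLEOTIDES.index(state)]
        # Draw the next nucleotide with probabilities given by the current row.
        state = rng.choices(NUCLEOTIDES, weights=row, k=1)[0]
        path.append(state)
    return "".join(path)

# Hypothetical matrix: every substitution has probability 0.1 per step.
P_example = [[0.7 if i == j else 0.1 for j in range(4)] for i in range(4)]
# Identity matrix: no substitution can ever occur.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

rng = random.Random(0)  # fixed seed so the run is repeatable
path = simulate_site(P_example, "a", 20, rng)
frozen = simulate_site(identity, "a", 5, rng)
print(path, frozen)
```

With the identity matrix the site never changes, which is a quick sanity check that each row is being used as the distribution of the next state.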
Models – discrete time
Jukes – Cantor model
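All substitutions are equally probable, each with probability α:
\[
P = \begin{pmatrix}
1-3\alpha & \alpha & \alpha & \alpha \\
\alpha & 1-3\alpha & \alpha & \alpha \\
\alpha & \alpha & 1-3\alpha & \alpha \\
\alpha & \alpha & \alpha & 1-3\alpha
\end{pmatrix}
\]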
Stationary distribution
\[
(\pi_a, \pi_g, \pi_c, \pi_t) = (0.25, 0.25, 0.25, 0.25)
\]
Spectral decomposition of P^n
\[
P^n =
\begin{pmatrix}
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25
\end{pmatrix}
+ (1-4\alpha)^n
\begin{pmatrix}
0.75 & -0.25 & -0.25 & -0.25 \\
-0.25 & 0.75 & -0.25 & -0.25 \\
-0.25 & -0.25 & 0.75 & -0.25 \\
-0.25 & -0.25 & -0.25 & 0.75
\end{pmatrix}
\]
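The closed form for P^n in the Jukes – Cantor model can be checked numerically against direct matrix multiplication. The sketch below assumes the matrix with diagonal entries 1 − 3α and off-diagonal entries α; the helper names and the values α = 0.05, n = 10 are illustrative assumptions.

```python
def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def matpow(P, n):
    """P raised to the n-th power by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(n):
        result = matmul(result, P)
    return result

def jc_matrix(alpha):
    """Jukes-Cantor one-step matrix: 1 - 3*alpha on the diagonal, alpha elsewhere."""
    return [[1 - 3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]

def jc_power_closed_form(alpha, n):
    """P^n from the spectral decomposition: 0.25*J + (1 - 4*alpha)^n * (I - 0.25*J)."""
    lam = (1 - 4 * alpha) ** n
    return [[0.25 + lam * (0.75 if i == j else -0.25) for j in range(4)]
            for i in range(4)]

alpha, n = 0.05, 10  # hypothetical values
direct = matpow(jc_matrix(alpha), n)
closed = jc_power_closed_form(alpha, n)
max_err = max(abs(direct[i][j] - closed[i][j])
              for i in range(4) for j in range(4))
print(max_err)  # should be at the level of floating-point rounding
```

As n grows, (1 − 4α)^n tends to 0 and every row of P^n approaches the stationary distribution (0.25, 0.25, 0.25, 0.25).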


Remark
• When learning and researching Markov models for nucleotide substitution, it helps greatly to use software for symbolic computation, such as Mathematica, Maple, or Scientific Workplace.
Kimura models
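I)
α - probability of a transition
β - probability of a specific transversion
\[
P = \begin{pmatrix}
1-\alpha-2\beta & \alpha & \beta & \beta \\
\alpha & 1-\alpha-2\beta & \beta & \beta \\
\beta & \beta & 1-\alpha-2\beta & \alpha \\
\beta & \beta & \alpha & 1-\alpha-2\beta
\end{pmatrix}
\]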
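II) Kimura 3ST model
α - probability of: A↔G, C↔T
β - probability of: A↔C, G↔T
γ - probability of: A↔T, C↔G
\[
P = \begin{pmatrix}
1-\alpha-\beta-\gamma & \alpha & \beta & \gamma \\
\alpha & 1-\alpha-\beta-\gamma & \gamma & \beta \\
\beta & \gamma & 1-\alpha-\beta-\gamma & \alpha \\
\gamma & \beta & \alpha & 1-\alpha-\beta-\gamma
\end{pmatrix}
\]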
Stationary distribution
\[
(\pi_a, \pi_g, \pi_c, \pi_t) = (0.25, 0.25, 0.25, 0.25)
\]
Generalizations of Kimura models
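By Ewens:
α - probability of: A↔G, C↔T
β - probability of: A→C, A→T, G→C, G→T
γ - probability of: C→A, T→A, C→G, T→G
\[
P = \begin{pmatrix}
1-\alpha-2\beta & \alpha & \beta & \beta \\
\alpha & 1-\alpha-2\beta & \beta & \beta \\
\gamma & \gamma & 1-\alpha-2\gamma & \alpha \\
\gamma & \gamma & \alpha & 1-\alpha-2\gamma
\end{pmatrix}
\]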
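Stationary distribution
\[
(\pi_a, \pi_g, \pi_c, \pi_t) =
\left( \frac{\gamma}{2(\beta+\gamma)},\ \frac{\gamma}{2(\beta+\gamma)},\ \frac{\beta}{2(\beta+\gamma)},\ \frac{\beta}{2(\beta+\gamma)} \right)
\]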
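The stationary distribution of the Ewens model can be recovered by a short balance argument. The sketch assumes β is the probability of a purine changing to a specific pyrimidine and γ the probability of a pyrimidine changing to a specific purine; the symbols π_R and π_Y (total stationary mass of purines and of pyrimidines) are introduced only for this derivation.

```latex
% Flow between the purine class R = {a, g} and the pyrimidine class Y = {c, t}
\[
\pi_R \cdot 2\beta = \pi_Y \cdot 2\gamma, \qquad \pi_R + \pi_Y = 1
\;\Longrightarrow\;
\pi_R = \frac{\gamma}{\beta+\gamma}, \quad \pi_Y = \frac{\beta}{\beta+\gamma}
\]
% Balance at a and at g; subtracting the two equations gives \pi_a = \pi_g
\[
\pi_a(\alpha + 2\beta) = \alpha\pi_g + \gamma\pi_Y, \qquad
\pi_g(\alpha + 2\beta) = \alpha\pi_a + \gamma\pi_Y
\;\Longrightarrow\;
(\pi_a - \pi_g)(2\alpha + 2\beta) = 0
\]
% The same argument applies to c and t, so each class is split evenly
\[
\pi_a = \pi_g = \frac{\gamma}{2(\beta+\gamma)}, \qquad
\pi_c = \pi_t = \frac{\beta}{2(\beta+\gamma)}
\]
```

Setting β = γ gives the uniform distribution of the Kimura models.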
Spectral decomposition (Kimura model I, i.e. the case β = γ)
\[
P^n =
\begin{pmatrix}
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25
\end{pmatrix}
+ (1-4\beta)^n
\begin{pmatrix}
0.25 & 0.25 & -0.25 & -0.25 \\
0.25 & 0.25 & -0.25 & -0.25 \\
-0.25 & -0.25 & 0.25 & 0.25 \\
-0.25 & -0.25 & 0.25 & 0.25
\end{pmatrix}
+ (1-2(\alpha+\beta))^n
\begin{pmatrix}
0.5 & -0.5 & 0 & 0 \\
-0.5 & 0.5 & 0 & 0 \\
0 & 0 & 0.5 & -0.5 \\
0 & 0 & -0.5 & 0.5
\end{pmatrix}
\]
By Blaisdell:
α - probability of: A→G, C→T
β - probability of: G→A, T→C
γ - probability of: A→C, A→T, G→C, G→T
δ - probability of: C→A, T→A, C→G, T→G
\[
P = \begin{pmatrix}
1-\alpha-2\gamma & \alpha & \gamma & \gamma \\
\beta & 1-\beta-2\gamma & \gamma & \gamma \\
\delta & \delta & 1-\alpha-2\delta & \alpha \\
\delta & \delta & \beta & 1-\beta-2\delta
\end{pmatrix}
\]
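Stationary distribution
\[
(\pi_a, \pi_g, \pi_c, \pi_t) =
\left(
\frac{\delta(\beta+\gamma)}{\kappa(\alpha+\beta+2\gamma)},\
\frac{\delta(\alpha+\gamma)}{\kappa(\alpha+\beta+2\gamma)},\
\frac{\gamma(\beta+\delta)}{\kappa(\alpha+\beta+2\delta)},\
\frac{\gamma(\alpha+\delta)}{\kappa(\alpha+\beta+2\delta)}
\right)
\]
where
\[
\kappa = \gamma + \delta
\]
Remark: this model is not reversible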
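Both claims for the Blaisdell model, the form of the stationary distribution and the lack of reversibility, can be checked numerically. The sketch assumes α: A→G, C→T; β: G→A, T→C; γ: purine to a specific pyrimidine; δ: pyrimidine to a specific purine; and the abbreviation κ = γ + δ. The parameter values are hypothetical.

```python
def blaisdell_matrix(alpha, beta, gamma, delta):
    """Blaisdell one-step matrix, states ordered a, g, c, t."""
    return [
        [1 - alpha - 2 * gamma, alpha, gamma, gamma],
        [beta, 1 - beta - 2 * gamma, gamma, gamma],
        [delta, delta, 1 - alpha - 2 * delta, alpha],
        [delta, delta, beta, 1 - beta - 2 * delta],
    ]

def blaisdell_stationary(alpha, beta, gamma, delta):
    """Closed-form stationary distribution (pi_a, pi_g, pi_c, pi_t)."""
    kappa = gamma + delta
    pur = kappa * (alpha + beta + 2 * gamma)
    pyr = kappa * (alpha + beta + 2 * delta)
    return [delta * (beta + gamma) / pur,
            delta * (alpha + gamma) / pur,
            gamma * (beta + delta) / pyr,
            gamma * (alpha + delta) / pyr]

# Hypothetical parameter values chosen only for illustration.
alpha, beta, gamma, delta = 0.05, 0.02, 0.01, 0.03
P = blaisdell_matrix(alpha, beta, gamma, delta)
pi = blaisdell_stationary(alpha, beta, gamma, delta)

# Stationarity: pi P should equal pi.
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
stationarity_err = max(abs(pi_next[j] - pi[j]) for j in range(4))

# Reversibility would require pi_a * p_ag == pi_g * p_ga (detailed balance).
flux_ag = pi[0] * P[0][1]
flux_ga = pi[1] * P[1][0]
print(stationarity_err, flux_ag, flux_ga)
```

The two fluxes differ whenever α ≠ β, which is the failure of detailed balance behind the remark above.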
Felsenstein model
Probability of substitution of any nucleotide by another is proportional to the stationary probability of the substituting nucleotide
\[
P = \begin{pmatrix}
1-u+u\pi_a & u\pi_g & u\pi_c & u\pi_t \\
u\pi_a & 1-u+u\pi_g & u\pi_c & u\pi_t \\
u\pi_a & u\pi_g & 1-u+u\pi_c & u\pi_t \\
u\pi_a & u\pi_g & u\pi_c & 1-u+u\pi_t
\end{pmatrix}
\]
Stationary distribution
\[
(\pi_a, \pi_g, \pi_c, \pi_t)
\]
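That (π_a, π_g, π_c, π_t) is stationary for the Felsenstein model follows from writing the matrix as a mixture of the identity and a rank-one matrix. In the sketch below, the row vector π and the column vector of ones, written as a bold 1, are notation introduced only for this derivation.

```latex
\[
P = (1-u)\,I + u\,\mathbf{1}\pi, \qquad
\pi = (\pi_a\ \ \pi_g\ \ \pi_c\ \ \pi_t), \quad
\mathbf{1} = (1\ \ 1\ \ 1\ \ 1)^{T}
\]
% Stationarity, using \pi\mathbf{1} = \pi_a + \pi_g + \pi_c + \pi_t = 1
\[
\pi P = (1-u)\,\pi + u\,(\pi\mathbf{1})\,\pi = (1-u)\,\pi + u\,\pi = \pi
\]
% Powers of P, using (\mathbf{1}\pi)^2 = \mathbf{1}(\pi\mathbf{1})\pi = \mathbf{1}\pi
\[
P^n = (1-u)^n\, I + \bigl(1-(1-u)^n\bigr)\,\mathbf{1}\pi
\]
```

The last line shows that P has only two distinct eigenvalues, 1 and 1 − u, the latter with multiplicity three.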
HKY model
Hasegawa, Kishino, Yano
Different rates for transitions and transversions
\[
P = \begin{pmatrix}
1-u\pi_g-v(\pi_c+\pi_t) & u\pi_g & v\pi_c & v\pi_t \\
u\pi_a & 1-u\pi_a-v(\pi_c+\pi_t) & v\pi_c & v\pi_t \\
v\pi_a & v\pi_g & 1-u\pi_t-v(\pi_a+\pi_g) & u\pi_t \\
v\pi_a & v\pi_g & u\pi_c & 1-u\pi_c-v(\pi_a+\pi_g)
\end{pmatrix}
\]
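Eigenvalues of P
\[
\lambda_1 = 1
\]
\[
\lambda_2 = 1 - v
\]
\[
\lambda_3 = 1 - u(\pi_c+\pi_t) - v(\pi_a+\pi_g)
\]
\[
\lambda_4 = 1 - v(\pi_c+\pi_t) - u(\pi_a+\pi_g)
\]
Left (row) eigenvectors
\[
l_1 = (\pi_a\ \ \pi_g\ \ \pi_c\ \ \pi_t)
\]
\[
l_2 = \bigl((\pi_c+\pi_t)\pi_a\ \ \ (\pi_c+\pi_t)\pi_g\ \ \ -(\pi_a+\pi_g)\pi_c\ \ \ -(\pi_a+\pi_g)\pi_t\bigr)
\]
\[
l_3 = (0\ \ 0\ \ 1\ \ -1)
\]
\[
l_4 = (1\ \ -1\ \ 0\ \ 0)
\]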
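Right (column) eigenvectors
\[
r_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix},\quad
r_2 = \begin{pmatrix} \dfrac{1}{\pi_a+\pi_g} \\[2mm] \dfrac{1}{\pi_a+\pi_g} \\[2mm] -\dfrac{1}{\pi_c+\pi_t} \\[2mm] -\dfrac{1}{\pi_c+\pi_t} \end{pmatrix},\quad
r_3 = \begin{pmatrix} 0 \\[2mm] 0 \\[2mm] \dfrac{\pi_t}{\pi_c+\pi_t} \\[2mm] -\dfrac{\pi_c}{\pi_c+\pi_t} \end{pmatrix},\quad
r_4 = \begin{pmatrix} \dfrac{\pi_g}{\pi_a+\pi_g} \\[2mm] -\dfrac{\pi_a}{\pi_a+\pi_g} \\[2mm] 0 \\[2mm] 0 \end{pmatrix}
\]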
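The eigenvalues and eigenvectors of the HKY matrix can be verified by rebuilding P as the sum of λ_k r_k l_k over k = 1, ..., 4. The sketch assumes the state order a, g, c, t and right eigenvectors normalized so that l_i r_j = 1 if i = j and 0 otherwise; the values of π, u and v are hypothetical.

```python
def hky_matrix(pi, u, v):
    """HKY one-step matrix: u scales transitions, v scales transversions."""
    a, g, c, t = pi
    return [
        [1 - u * g - v * (c + t), u * g, v * c, v * t],
        [u * a, 1 - u * a - v * (c + t), v * c, v * t],
        [v * a, v * g, 1 - u * t - v * (a + g), u * t],
        [v * a, v * g, u * c, 1 - u * c - v * (a + g)],
    ]

def hky_spectrum(pi, u, v):
    """Eigenvalues with matching left (row) and right (column) eigenvectors."""
    a, g, c, t = pi
    R, Y = a + g, c + t  # total purine and pyrimidine frequencies
    lams = [1.0, 1 - v, 1 - u * Y - v * R, 1 - v * Y - u * R]
    left = [[a, g, c, t],
            [Y * a, Y * g, -R * c, -R * t],
            [0.0, 0.0, 1.0, -1.0],
            [1.0, -1.0, 0.0, 0.0]]
    right = [[1.0, 1.0, 1.0, 1.0],
             [1 / R, 1 / R, -1 / Y, -1 / Y],
             [0.0, 0.0, t / Y, -c / Y],
             [g / R, -a / R, 0.0, 0.0]]
    return lams, left, right

pi = (0.1, 0.2, 0.3, 0.4)  # hypothetical stationary frequencies
u, v = 0.3, 0.1            # hypothetical rate parameters
P = hky_matrix(pi, u, v)
lams, left, right = hky_spectrum(pi, u, v)

# Biorthogonality: l_i . r_j should be the identity matrix.
gram = [[sum(left[i][k] * right[j][k] for k in range(4)) for j in range(4)]
        for i in range(4)]
# Spectral reconstruction: P_ij = sum_k lambda_k * r_k[i] * l_k[j].
rebuilt = [[sum(lams[k] * right[k][i] * left[k][j] for k in range(4))
            for j in range(4)] for i in range(4)]
rebuild_err = max(abs(P[i][j] - rebuilt[i][j])
                  for i in range(4) for j in range(4))
print(rebuild_err)
```

Replacing each λ_k by its n-th power in the same sum gives P^n.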
General 12 parameter model
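Tavaré, 1986
\[
P = \begin{pmatrix}
1-uW & uA\pi_g & uB\pi_c & uC\pi_t \\
uD\pi_a & 1-uX & uE\pi_c & uF\pi_t \\
uG\pi_a & uH\pi_g & 1-uY & uI\pi_t \\
uJ\pi_a & uK\pi_g & uL\pi_c & 1-uZ
\end{pmatrix}
\]
\[
W = A\pi_g + B\pi_c + C\pi_t
\]
\[
X = D\pi_a + E\pi_c + F\pi_t
\]
\[
Y = G\pi_a + H\pi_g + I\pi_t
\]
\[
Z = J\pi_a + K\pi_g + L\pi_c
\]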
Stationary distribution
\[
(\pi_a, \pi_g, \pi_c, \pi_t)
\]
Reversibility
A=D, B=G, C=J, E=H, F=K, I=L
Conclusion – the most general reversible model has
12 – 6 = 6 free parameters
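Reversibility means detailed balance, π_i p_ij = π_j p_ji for all pairs i, j. The sketch below builds the 12-parameter matrix from a table of the letters A to L and measures the largest violation of detailed balance, once with the six constraints imposed and once with A ≠ D. The function names and all numeric values are hypothetical.

```python
def tavare_matrix(pi, u, rates):
    """Off-diagonal entries u * rates[i][j] * pi[j]; diagonal makes rows sum to 1."""
    P = [[u * rates[i][j] * pi[j] if i != j else 0.0 for j in range(4)]
         for i in range(4)]
    for i in range(4):
        P[i][i] = 1 - sum(P[i])  # equals 1 - u*W, 1 - u*X, 1 - u*Y, 1 - u*Z
    return P

def max_detailed_balance_violation(pi, P):
    """Largest |pi_i * p_ij - pi_j * p_ji| over all pairs of states."""
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
               for i in range(4) for j in range(4))

pi = (0.1, 0.2, 0.3, 0.4)  # hypothetical frequencies, order a, g, c, t
u = 0.2                    # hypothetical scale parameter

# Letters laid out row by row: [., A, B, C], [D, ., E, F], [G, H, ., I], [J, K, L, .]
# Constrained case: A=D, B=G, C=J, E=H, F=K, I=L (a symmetric table).
reversible_rates = [[0.0, 1.0, 0.5, 0.3],
                    [1.0, 0.0, 0.8, 0.2],
                    [0.5, 0.8, 0.0, 1.5],
                    [0.3, 0.2, 1.5, 0.0]]
# Unconstrained case: A changed to 2.0 while D stays 1.0.
general_rates = [row[:] for row in reversible_rates]
general_rates[0][1] = 2.0

violation_rev = max_detailed_balance_violation(
    pi, tavare_matrix(pi, u, reversible_rates))
violation_gen = max_detailed_balance_violation(
    pi, tavare_matrix(pi, u, general_rates))
print(violation_rev, violation_gen)
```

With the constraints in place, π_i p_ij = u π_i π_j times a symmetric coefficient, so the violation vanishes up to rounding.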
Continuous – time models
Matrix of transition probabilities
\[
P(t) = \exp(Qt)
\]
Q – intensity matrix
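Jukes – Cantor model
\[
Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\]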
Spectral decomposition of P(t)
\[
P(t) = \exp(Qt) =
\begin{pmatrix}
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25
\end{pmatrix}
+ \exp(-4\alpha t)
\begin{pmatrix}
0.75 & -0.25 & -0.25 & -0.25 \\
-0.25 & 0.75 & -0.25 & -0.25 \\
-0.25 & -0.25 & 0.75 & -0.25 \\
-0.25 & -0.25 & -0.25 & 0.75
\end{pmatrix}
\]
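The closed form for P(t) in the continuous-time Jukes – Cantor model can be compared with a direct evaluation of exp(Qt). The sketch below sums a truncated power series, which is adequate only for small matrices and moderate values of Qt; a library routine for the matrix exponential would be the usual choice in practice. The function names and the values α = 0.1, t = 2 are assumptions.

```python
import math

def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def expm_series(Q, t, terms=40):
    """exp(Qt) as the truncated series sum over k < terms of (Qt)^k / k!."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (Qt)^k / k! by multiplying the previous term by Qt / k.
        scaled = [[Q[i][j] * t / k for j in range(4)] for i in range(4)]
        term = matmul(term, scaled)
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    return result

def jc_intensity(alpha):
    """Jukes-Cantor intensity matrix: -3*alpha on the diagonal, alpha elsewhere."""
    return [[-3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]

def jc_closed_form(alpha, t):
    """P(t) = 0.25*J + exp(-4*alpha*t) * (I - 0.25*J)."""
    decay = math.exp(-4 * alpha * t)
    return [[0.25 + decay * (0.75 if i == j else -0.25) for j in range(4)]
            for i in range(4)]

alpha, t = 0.1, 2.0  # hypothetical values
series = expm_series(jc_intensity(alpha), t)
closed_t = jc_closed_form(alpha, t)
max_err_t = max(abs(series[i][j] - closed_t[i][j])
                for i in range(4) for j in range(4))
print(max_err_t)  # should be at the level of floating-point rounding
```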
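Kimura model
\[
Q = \begin{pmatrix}
-\alpha-2\beta & \alpha & \beta & \beta \\
\alpha & -\alpha-2\beta & \beta & \beta \\
\beta & \beta & -\alpha-2\beta & \alpha \\
\beta & \beta & \alpha & -\alpha-2\beta
\end{pmatrix}
\]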
Spectral decomposition of P(t)
\[
P(t) =
\begin{pmatrix}
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25 \\
0.25 & 0.25 & 0.25 & 0.25
\end{pmatrix}
+ \exp(-4\beta t)
\begin{pmatrix}
0.25 & 0.25 & -0.25 & -0.25 \\
0.25 & 0.25 & -0.25 & -0.25 \\
-0.25 & -0.25 & 0.25 & 0.25 \\
-0.25 & -0.25 & 0.25 & 0.25
\end{pmatrix}
+ \exp(-2(\alpha+\beta)t)
\begin{pmatrix}
0.5 & -0.5 & 0 & 0 \\
-0.5 & 0.5 & 0 & 0 \\
0 & 0 & 0.5 & -0.5 \\
0 & 0 & -0.5 & 0.5
\end{pmatrix}
\]
Parameter estimation
Jukes – Cantor model
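Three configurations are equivalent due to reversibility:
[Figure: ancestor A with two descendants D1 and D2; a path from D1 through A to D2; a path from D2 through A to D1]
Probability that the nucleotides are different in two descendants, each separated from the ancestor by time t (total separation 2t):
\[
p = p(t) = 0.75\,(1 - \exp(-8\alpha t))
\]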
Estimating p
We have two DNA sequences of length N
D1: ACAATACAGGGCAGATAGATACAGATAGACACAGACAGAGCAGAGACAG
D2: ACAATACAGGACAGTTAGATACAGATAGACACAGACAGAGCAGAGACAG
\[
\hat{p} = \frac{\text{number of differences}}{N}
\]
\[
\widehat{\alpha t} = -\frac{1}{8}\log\left(1 - \frac{4}{3}\hat{p}\right)
\]
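Applied to the two sequences D1 and D2 above, the estimate takes only a few lines. This is a sketch: the function name `jc_estimate` is hypothetical, and the formula assumes p = 0.75(1 − exp(−8αt)), so the estimate is undefined when the observed proportion of differences reaches 0.75.

```python
import math

D1 = "ACAATACAGGGCAGATAGATACAGATAGACACAGACAGAGCAGAGACAG"
D2 = "ACAATACAGGACAGTTAGATACAGATAGACACAGACAGAGCAGAGACAG"

def jc_estimate(seq1, seq2):
    """Return (number of differences, p_hat, estimate of alpha*t)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    n_diff = sum(x != y for x, y in zip(seq1, seq2))
    p_hat = n_diff / len(seq1)
    if p_hat >= 0.75:
        raise ValueError("p_hat >= 0.75: the logarithm is undefined")
    # Invert p = 0.75 * (1 - exp(-8 * alpha * t)) for alpha * t.
    alpha_t = -math.log(1 - 4 * p_hat / 3) / 8
    return n_diff, p_hat, alpha_t

n_diff, p_hat, alpha_t = jc_estimate(D1, D2)
print(len(D1), n_diff, p_hat, alpha_t)
```

Only the product αt is identifiable from the sequences; separating α from t needs outside information, such as a known divergence time.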
Kimura model
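p – probability of two different purines or two different pyrimidines (a transition-type difference)
q – probability of a purine and a pyrimidine (a transversion-type difference)
\[
p = p(t) = 0.25 + 0.25\exp(-4\beta t) - 0.5\exp(-2(\alpha+\beta)t)
\]
\[
q = q(t) = 0.5 - 0.5\exp(-4\beta t)
\]
These expressions are read directly from the spectral decomposition of P(t), so here t is the total time separating the two sequences.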
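Solving the two equations for p and q gives βt = −(1/4) log(1 − 2q) and αt = −(1/2) log(1 − 2p − q) + (1/4) log(1 − 2q). The sketch below applies this to the sequences D1 and D2 from the Jukes – Cantor example; the function name `kimura_estimate` is hypothetical, and t is taken to be the total time separating the two sequences, as in the formulas for p(t) and q(t).

```python
import math

D1 = "ACAATACAGGGCAGATAGATACAGATAGACACAGACAGAGCAGAGACAG"
D2 = "ACAATACAGGACAGTTAGATACAGATAGACACAGACAGAGCAGAGACAG"
PURINES = {"A", "G"}

def kimura_estimate(seq1, seq2):
    """Return (transitions, transversions, alpha*t estimate, beta*t estimate)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine/purine or pyrimidine/pyrimidine
        else:
            transversions += 1  # purine/pyrimidine
    n = len(seq1)
    p_hat, q_hat = transitions / n, transversions / n
    if 1 - 2 * q_hat <= 0 or 1 - 2 * p_hat - q_hat <= 0:
        raise ValueError("too many differences: the logarithms are undefined")
    beta_t = -math.log(1 - 2 * q_hat) / 4
    alpha_t = -math.log(1 - 2 * p_hat - q_hat) / 2 - beta_t
    return transitions, transversions, alpha_t, beta_t

transitions, transversions, alpha_t, beta_t = kimura_estimate(D1, D2)
print(transitions, transversions, alpha_t, beta_t)
```

Substituting the estimates back into the formulas for p(t) and q(t) reproduces the observed proportions, which is a convenient check on the algebra.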