Estimating the Nucleotide Substitution Matrix
Using a Full Four-State Transition Rate Matrix
Ho-Lan Peng and Andrew R. Aschenbrenner
1 Introduction
The nucleotide substitution rate matrix is of great importance in molecular
evolution because it describes the rates of evolutionary change across a set
of DNA sequences. The substitution process has previously been modeled as a
continuous-time Markov chain (CTMC), but various structures have been imposed
on the infinitesimal matrix for ease of computation. For example, the
Jukes-Cantor (one-parameter) model assumes that all substitutions are equally
likely. Even the most complex model, the General Time Reversible model, does
not treat each nucleotide substitution as a separate parameter. Furthermore,
these models focus on estimating evolutionary distance rather than on fully
estimating the transition probability matrix. Our approach addresses this
issue by estimating the transition matrix with 12 unknown parameters, one for
each off-diagonal entry of the 4 x 4 infinitesimal matrix.
2 Method
2.1 Non-Homogeneous Differential Equation Approach
We assume that $\{Y_i(t)\}$ follows a continuous-time Markov chain, where $i$
is the site number and $t$ corresponds to the species' DNA sequence. The
infinitesimal matrix $Q$ is:




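\[
Q = \begin{pmatrix}
-\sum_{j \neq 1} q_{1j} & q_{12} & q_{13} & q_{14} \\
q_{21} & -\sum_{j \neq 2} q_{2j} & q_{23} & q_{24} \\
q_{31} & q_{32} & -\sum_{j \neq 3} q_{3j} & q_{34} \\
q_{41} & q_{42} & q_{43} & -\sum_{j \neq 4} q_{4j}
\end{pmatrix}
\]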
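As a minimal sketch of how the 12 free rates determine the infinitesimal matrix, the snippet below fills in the off-diagonal entries and sets each diagonal entry to minus the sum of the other entries in its row. The function name `build_q`, the zero-based state indices, and the numeric rates are illustrative assumptions, not values from this study.

```python
from itertools import permutations


def build_q(rates):
    """Return the 4x4 infinitesimal matrix as a list of rows.

    `rates` maps ordered pairs (j, k), j != k, with j, k in 0..3, to
    non-negative substitution rates q_jk (12 entries in total).
    """
    pairs = set(permutations(range(4), 2))
    if set(rates) != pairs:
        raise ValueError("expected exactly the 12 off-diagonal rates")
    if any(r < 0 for r in rates.values()):
        raise ValueError("rates must be non-negative")
    q = [[0.0] * 4 for _ in range(4)]
    for (j, k), r in rates.items():
        q[j][k] = float(r)
    for j in range(4):
        # The diagonal entry makes every row sum to zero.
        q[j][j] = -sum(q[j][k] for k in range(4) if k != j)
    return q


# Hypothetical rates, chosen only to exercise the function.
example_rates = {(j, k): 0.01 * (1 + j + 2 * k)
                 for j, k in permutations(range(4), 2)}
Q = build_q(example_rates)
```

Storing only the off-diagonal rates keeps the parameter vector at exactly 12 entries, which is the dimension an optimizer would work in.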




Solving the backward Kolmogorov differential equations allows us to determine the transition probabilities $P_{jk}(t)$ in terms of these 12 rates $q_{jk}$. This requires
finding the general solution of a cubic polynomial to obtain the eigenvalues of $Q$.
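The reason a cubic suffices for a 4 x 4 matrix can be sketched as follows; the coefficient names $a$, $b$, $c$ are introduced here only for illustration. Because every row of $Q$ sums to zero, the vector of ones is a right eigenvector with eigenvalue 0, so the quartic characteristic polynomial factors:

```latex
\[
Q\mathbf{1} = \mathbf{0}
\;\Longrightarrow\;
\det(\lambda I - Q) = \lambda \left( \lambda^{3} + a\lambda^{2} + b\lambda + c \right),
\qquad a = -\operatorname{tr}(Q) = \sum_{j \neq k} q_{jk}.
\]
```

The three remaining eigenvalues are the roots of the cubic factor; for a general, non-reversible $Q$ they can be complex.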
The log-likelihood function is
\[
\log L = \sum_{i=1}^{m} \sum_{l=2}^{n}
\log\left[ P_{Y_i(t_{i,l-1}),\, Y_i(t_{i,l})}\left(t_{i,l} - t_{i,l-1}\right) \right],
\]
where $Y_i(t_{i,l})$ is the nucleotide base at the $i$th site for sequence
$t_{i,l}$. Since estimation is conducted via Newton-Raphson, we need a good
set of initial values to reach the global maximum reliably. Good starting
values help avoid non-convergence, or convergence to a point that is not an
MLE and hence has an unreasonable standard error. We take as initial values:
\[
q_{jk} = \frac{\text{total \# of changes from nucleotide } j \text{ to nucleotide } k}
              {\text{total \# of changes from nucleotide } j \text{ to nucleotide } j}
\]
2.2 Spectral Decomposition of $e^{Qt}$
The general solution of the Kolmogorov differential equations is
$P(t) = e^{Qt}$. Using the spectral (eigenvalue) decomposition of $Q$,
abbreviated SVD here, we can rewrite $P(t)$ as $U e^{\Lambda t} U^{-1}$,
where $\Lambda$ denotes the diagonal matrix of the eigenvalues of $Q$ and $U$
is the matrix with the corresponding eigenvectors as columns. This allows us
to compute $P(t)$, which can be used in the likelihood to estimate the
parameters of the infinitesimal matrix.
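A minimal sketch of this computation with NumPy is given below, assuming $Q$ is diagonalizable. It forms $P(t)$ from the eigendecomposition and plugs it into the log-likelihood of Section 2.1. The function names, the integer coding of nucleotides as 0..3, the assumption that all sites share the same times, and the example rates are illustrative choices, not the implementation used in this study.

```python
import numpy as np


def transition_matrix(Q, t):
    """P(t) = U exp(Lambda t) U^{-1} from the eigendecomposition of Q."""
    eigvals, U = np.linalg.eig(np.asarray(Q, dtype=float))
    # Scale eigenvector column k by exp(lambda_k * t), then map back.
    P = (U * np.exp(eigvals * t)) @ np.linalg.inv(U)
    # Complex eigenvalues occur in conjugate pairs, so P is real up to rounding.
    return P.real


def log_likelihood(Q, sequences, times):
    """Sum log P over sites i and consecutive sequences l-1 -> l.

    sequences[l][i] is the state (0..3) at site i of sequence l, and
    times[l] is the time attached to sequence l (assumed shared by all sites).
    """
    total = 0.0
    for l in range(1, len(sequences)):
        P = transition_matrix(Q, times[l] - times[l - 1])
        prev = np.asarray(sequences[l - 1])
        curr = np.asarray(sequences[l])
        total += np.log(P[prev, curr]).sum()
    return total


# Hypothetical rates for illustration only; each row sums to zero.
Q = np.array([
    [-0.6, 0.1, 0.2, 0.3],
    [0.2, -0.9, 0.4, 0.3],
    [0.1, 0.5, -0.8, 0.2],
    [0.3, 0.1, 0.1, -0.5],
])
seqs = [[0, 1, 2, 3, 0], [0, 1, 2, 1, 0], [2, 1, 2, 1, 0]]  # 3 sequences, 5 sites
ll = log_likelihood(Q, seqs, times=[0.0, 0.1, 0.25])
```

A numerical optimizer can then maximize `log_likelihood` over the 12 off-diagonal rates without any closed-form expression for the eigenvalues.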
3 Data Analysis
We analyzed aligned sequence data from the p53 gene comprising 1,539 sites
from 14 homologous species. We applied our method to estimate the
substitution matrix and compared the result with the Jukes-Cantor, Kimura, and
General Time Reversible models. All evolutionary models were fitted
using MEGA5. The comparisons between the methods, along with the Akaike
Information Criterion (AIC), are displayed in Table 1.
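For reference, the AIC is $2k - 2\log L$, where $k$ is the number of estimated parameters, and lower values indicate a better trade-off between fit and complexity. The sketch below uses hypothetical log-likelihoods and parameter counts; how MEGA5 counts parameters is not stated here, so these numbers do not reproduce Table 1.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (log-likelihood, number of parameters) pairs.
candidates = {"model_a": (-5000.0, 1), "model_b": (-4990.0, 12)}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
```

In this toy case the richer model gains 10 log-likelihood units but pays for 11 extra parameters, so the simpler model is preferred.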
4 Simulation
4.1 Simulation from Proposed Approach
We used the estimated $q_{ij}$'s from the p53 dataset as the true values and
simulated 1000 DNA sequence datasets, each with 20 species, using this
transition matrix. We assumed that the DNA substitution process is a
continuous-time Markov chain and simulated the sequence data accordingly. Of
these 1000 simulated datasets, 28 could not be estimated by Newton-Raphson.
The coverage probabilities, bias, and average standard errors of the
estimators for the remaining 972 datasets are given in Table 2.
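One way to simulate such data is sketched below: for each site, draw exponential holding times and jump according to the off-diagonal rates, recording the state at each observation time. The uniform starting distribution, the equally spaced times, the state order A, T, C, G, and the rates are assumptions made for illustration; the simulation settings of this study are not given in that detail.

```python
import random


def simulate_site(Q, start, times, rng):
    """Simulate one site of a CTMC and record its state at each time in `times`."""
    state, clock, observed = start, times[0], [start]
    for t_next in times[1:]:
        while True:
            rate_out = -Q[state][state]
            if rate_out <= 0:          # absorbing state: no further jumps
                break
            wait = rng.expovariate(rate_out)
            if clock + wait > t_next:  # no jump before the next observation
                break
            clock += wait
            # Jump to k != state with probability q_{state,k} / rate_out.
            others = [k for k in range(4) if k != state]
            weights = [Q[state][k] for k in others]
            state = rng.choices(others, weights=weights)[0]
        clock = t_next  # valid restart point because holding times are memoryless
        observed.append(state)
    return observed


def simulate_alignment(Q, n_sites, times, seed=0):
    """Return one sequence per observation time, each of length n_sites."""
    rng = random.Random(seed)
    sites = [simulate_site(Q, rng.randrange(4), times, rng) for _ in range(n_sites)]
    # Transpose so that each string is one sequence across all sites.
    return ["".join("ATCG"[site[l]] for site in sites) for l in range(len(times))]


# Hypothetical rate matrix; each row sums to zero.
Q = [
    [-0.6, 0.1, 0.2, 0.3],
    [0.2, -0.9, 0.4, 0.3],
    [0.1, 0.5, -0.8, 0.2],
    [0.3, 0.1, 0.1, -0.5],
]
times = [0.1 * l for l in range(20)]  # 20 sequences ("species")
alignment = simulate_alignment(Q, n_sites=50, times=times, seed=1)
```

Repeating the call with different seeds gives independent datasets on which the estimator can be evaluated.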
4.2 Comparison with the MSM Method
The MSM package, published on December 10, 2012, fits multi-state Markov and
hidden Markov models in continuous time. We used its msm function to estimate
the infinitesimal matrix for our 972 simulated datasets and obtained the
coverage probability, bias, and average standard error as well; see Table 3.
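The three summary measures reported for both methods can be computed per parameter as in the sketch below. It assumes Wald-type 95% intervals (estimate plus or minus 1.96 standard errors), which is an assumption, since the interval type is not stated here; the function name and the numbers are hypothetical.

```python
def summarize(estimates, std_errors, true_value, z=1.96):
    """Coverage probability, bias, and average SE across simulated datasets."""
    n = len(estimates)
    # Count the datasets whose interval contains the true rate.
    covered = sum(
        1 for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    bias = sum(estimates) / n - true_value
    avg_se = sum(std_errors) / n
    return covered / n, bias, avg_se


# Hypothetical estimates of one rate from four simulated datasets.
coverage, bias, avg_se = summarize(
    estimates=[0.10, 0.11, 0.09, 0.20],
    std_errors=[0.01, 0.01, 0.01, 0.02],
    true_value=0.10,
)
```

A coverage probability close to 0.95 indicates that the standard errors are well calibrated for that parameter.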
5 Discussion
Preliminary results comparing the evolutionary models (via AIC) look
promising, but more simulation studies are needed to assess whether the full
CTMC approach improves upon classical evolutionary models. The
non-homogeneous DE approach will be difficult to extend to a larger number of
states because of the complexity of solving higher-order polynomials. Complex
values in the likelihood also tend to hinder the estimation process.
The SVD method introduces a numerical approach that can sidestep these issues
and has the potential to extend the estimation to n states. This would open
up new applications that require estimating large finite-state transition
matrices, such as codons (nucleotide triplets), disease classification, and
cancer staging, among others. Further investigation of the SVD method is
warranted to determine its efficacy and usefulness as an estimation tool.
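As a small illustration of the point about complex values, the sketch below uses a hypothetical 3-state cyclic generator whose nonzero eigenvalues are complex. A numerical eigendecomposition still yields a real transition matrix, because the imaginary parts cancel up to rounding error. The matrix and the time point are assumptions chosen for illustration and are not taken from the data analyzed here.

```python
import numpy as np

# Hypothetical 3-state cyclic generator: 0 -> 1 -> 2 -> 0, each at rate 1.
Q = np.array([
    [-1.0, 1.0, 0.0],
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
])

eigvals, U = np.linalg.eig(Q)  # one zero eigenvalue and a complex-conjugate pair
t = 0.5
P_complex = (U * np.exp(eigvals * t)) @ np.linalg.inv(U)
P = P_complex.real  # imaginary parts cancel up to rounding error
```

The same lines work unchanged for any number of states, provided the generator is diagonalizable.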
Model                          Estimated Substitution Matrix                    AIC
                                      A         T/U       C         G

Spectral Decomposition CTMC    A      [78.57]   3.61      5.52      12.28*    9271.56
                               T/U    3.29      [79.53]   13.94*    3.22
                               C      4.17      10.86*    [81.25]   3.70
                               G      10.31*    2.67      4.14      [82.86]

Jukes-Cantor (1 parameter)     A      [75.01]   8.33      8.33      8.33*     11236.65
                               T/U    8.33      [75.01]   8.33*     8.33
                               C      8.33      8.33*     [75.01]   8.33
                               G      8.33*     8.33      8.33      [75.01]

Kimura (2 parameter)           A      [74.99]   4.64      4.64      15.73*    10948.70
                               T/U    4.64      [74.99]   15.73*    4.64
                               C      4.64      15.73*    [74.99]   4.64
                               G      15.73*    4.64      4.64      [74.99]

General Time Reversible        A      [75.41]   3.72      6.25      14.62*    10946.13
                               T/U    4.28      [71.28]   20.32*    4.12
                               C      5.25      14.81*    [75.37]   4.57
                               G      13.65*    3.33      5.07      [77.95]

* indicates transitional substitutions
[ ] indicates changes to itself
Each row sums to 100

Table 1: Estimated Substitution Matrices for Various Molecular Evolution
Models
Parameter   Coverage Probability   Bias        Average SE
q̂12         0.9414                 4.56e-5     0.0039
q̂13         0.9372                 6.52e-5     0.0048
q̂14         0.9506                 9.07e-5     0.0070
q̂21         0.9527                 6.38e-5     0.0037
q̂23         0.9506                 2.82e-4     0.0074
q̂24         0.9496                 3.24e-5     0.0035
q̂31         0.9455                 -7.28e-5    0.0035
q̂32         0.9537                 1.15e-4     0.0057
q̂34         0.9496                 -6.44e-5    0.0032
q̂41         0.9455                 1.11e-4     0.0058
q̂42         0.9362                 4.35e-5     0.0029
q̂44         0.9609                 -3.10e-6    0.0036

Table 2: Bias, standard error, and coverage probability for the estimated
transition rates using the proposed method
Parameter   Coverage Probability   Bias        Average SE
q̂12         0.7078                 5.87e-3     0.0041
q̂13         0.5957                 8.43e-3     0.0050
q̂14         0.8981                 4.19e-3     0.0071
q̂21         0.7305                 5.48e-3     0.0038
q̂23         0.9198                 2.72e-3     0.0074
q̂24         0.6296                 5.92e-3     0.0037
q̂31         0.7994                 4.26e-3     0.0037
q̂32         0.7654                 1.15e-3     0.0037
q̂34         0.5463                 6.30e-3     0.0034
q̂41         0.9362                 5.44e-4     0.0058
q̂42         0.4516                 4.22e-3     0.0020
q̂44         0.2840                 5.84e-3     0.0063

Table 3: Bias, standard error, and coverage probability for the estimated
transition rates using the MSM method