ALGORITHMS FOR BUILDING MODELS OF MOLECULAR MOTION FROM
SIMULATIONS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Nina Singhal Hinrichs
September 2007
© Copyright by Nina Singhal Hinrichs 2007
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Vijay S. Pande)
Principal Co-Advisor
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Serafim Batzoglou)
Principal Co-Advisor
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Leonidas Guibas)
Approved for the University Committee on Graduate Studies.
Abstract
Many important processes in biology occur at the molecular scale. A detailed understanding of these
processes can lead to significant advances in the medical and life sciences – for example, many diseases are caused by protein aggregation or misfolding. One approach to studying these systems is
to use physically-based computational simulations to model the interactions and movement of the
molecules. While molecular simulations are computationally expensive, it is now possible to simulate many independent molecular dynamics trajectories in a parallel fashion by using distributed
computing methods such as Folding@Home.
The analysis of these large, high-dimensional data sets presents new computational challenges.
This dissertation presents a novel approach to analyzing large ensembles of molecular dynamics trajectories to generate a compact model of the dynamics. The model groups conformations into discrete states and describes the dynamics as Markovian, or history-independent, transitions between
the states. We will discuss why the Markovian state model (MSM) is suitable for macromolecular dynamics, and how it can be used to answer many interesting and relevant questions about the
molecular system. We will also present new approaches for many of the computational and statistical challenges in building such a model, specifically a novel algorithm for defining the states,
methods for comparing different state definitions and determining the optimal number of
states, efficient error analysis techniques to determine the statistical reliability, and adaptive algorithms to efficiently design new simulations. The methods are applied to model systems as well as
molecular dynamics simulation data of several small peptides.
Acknowledgements
I would first like to thank my family and friends for their love and support: my parents Kumud and
Kishore, who always inspired and encouraged me; my sister Monica, for her guidance and advice;
Tim Knight, Eran Guendelman, and Andrea Tompa for filling graduate school with fun memories;
and especially Tim Hinrichs, my best friend and husband, for sharing this wonderful experience
with me.
I would also like to acknowledge some of my collaborators: Peter Kasson for collaborations on
lipid vesicle simulations; John Chodera and Bill Swope for interesting discussions about Markov
state models and excellent collaborations on state decomposition algorithms; and all of the Pande
lab members, past and present, with whom I had the pleasure of working.
There were numerous people who made their simulation data available for the analysis presented
in this thesis: Christopher Snow for the trpzip2 data set (Chapter 2); Eric Sorin for the Fs peptide
data set (Chapter 3); Jed Pitera for the trpzip2 data set (Chapter 3); John Chodera for the alanine
data set (Chapters 4 and 6); and Guha Jayachandran for the villin headpiece model (Chapter 6).
Several people had helpful comments on various parts of this thesis: Hans Andersen and Frank
Noé for enlightening conversations on the nature of Markov chain models; Vishal Vaidyanathan
for assistance with clustering algorithms (Chapter 3); Jed Pitera for insightful discussions and constructive comments on Chapter 3; Libusha Kelly, David Mobley and Guha Jayachandran for critical
comments on Chapter 3; Kishore Singhal for insightful discussions about sensitivity analysis (Chapters 5 and 6); and John Chodera for helpful comments on Chapters 4 and 6.
My thesis committee members deserve special thanks: Axel Brunger, as the chair of my orals
committee; Jean-Claude Latombe for inspiration about graphical kinetic models and for serving on
my orals committee; Leonidas Guibas for discussions about the geometric nature of conformation
space and for serving as a committee member; Serafim Batzoglou, as my co-advisor and for discussions
about alignment which helped motivate many of the ideas in this thesis; and especially my advisor
Vijay Pande, for his help and guidance throughout my graduate career.
Contents

Abstract  iv
Acknowledgements  v

1 Introduction  1

2 Markovian state models  5
  2.1 Introduction  5
  2.2 Theory and methods  7
    2.2.1 Direct rate calculations  7
    2.2.2 Sampling of paths  8
    2.2.3 MSM generation  9
    2.2.4 Post-processing of MSMs  10
    2.2.5 Reweighting of edges  12
    2.2.6 Mean first passage time and Pfold calculation  15
  2.3 Results  17
    2.3.1 Model system  17
    2.3.2 Trpzip2 kinetics  22
  2.4 Discussion and conclusions  26

3 Automatic state decomposition  28
  3.1 Introduction  28
  3.2 Theory  32
    3.2.1 Markov chain and master equation models of conformational dynamics  32
    3.2.2 Markov model construction from simulation data given a state partitioning  34
    3.2.3 Requirements for a useful Markov model  35
    3.2.4 Validation of Markov models  36
  3.3 The automatic state decomposition algorithm  38
    3.3.1 Practical considerations for an automatic state decomposition algorithm  38
    3.3.2 Sketch of the method  39
    3.3.3 Implementation  42
    3.3.4 Validation  45
  3.4 Applications  46
    3.4.1 Alanine dipeptide  46
    3.4.2 The Fs helical peptide  51
    3.4.3 The trpzip2 β-peptide  56
  3.5 Discussion  59
  3.6 Supporting Information  62

4 Model selection  63
  4.1 Introduction  63
  4.2 Methods  65
    4.2.1 Bayesian Networks  65
    4.2.2 Parameter estimation in Bayesian Networks  65
    4.2.3 Scoring of Bayesian Networks  67
    4.2.4 Markovian state models as Bayesian Networks  70
    4.2.5 Comparison between different Markovian state models  71
    4.2.6 Non-equilibrium data  73
  4.3 Results  74
    4.3.1 Model system  74
    4.3.2 Alanine peptide  80
  4.4 Conclusions  83

5 Error analysis methods  86
  5.1 Introduction  86
  5.2 Methods  88
    5.2.1 Mean first passage times  88
    5.2.2 Transition probability distribution  89
    5.2.3 Sampling based error analysis methods  91
    5.2.4 Non-sampling based error analysis method  95
    5.2.5 Adaptive sampling algorithm  96
    5.2.6 Extension to large systems  99
  5.3 Results  102
    5.3.1 Demonstration of method 1  103
    5.3.2 Validity of approximations  103
    5.3.3 Adaptive sampling  105
  5.4 Discussion and conclusions  108

6 Eigenvalue and eigenvector error analysis  111
  6.1 Introduction  111
  6.2 Methods  113
    6.2.1 Eigenvalue and eigenvector equations  114
    6.2.2 Transition probability distribution  114
    6.2.3 Distribution of eigenvalues and eigenvectors  116
    6.2.4 Adaptive sampling  119
  6.3 Results  120
    6.3.1 Eigenvalue distributions  121
    6.3.2 Eigenvector distributions  122
    6.3.3 Adaptive sampling  126
  6.4 Discussion and Conclusions  129

7 Conclusions  132

A Sampling from a Dirichlet distribution  134
B Sampling from a Multivariate Normal distribution  135
C MFPT sensitivity analysis  137
D Solving a bordered sparse matrix  139
E Eigenvalue sensitivity analysis  141
F Eigenvector sensitivity analysis  144

Bibliography  146
List of Tables

3.1 Macrostates from a 20-state state decomposition of the Fs helical peptide.  52
4.1 Four state definitions for the transition model between 9 conformations.  78
5.1 Summary of sampling based methods for calculating the error of the MFPT from the initial state due to sampling.  94
5.2 Means and standard deviations of the MFPT distributions generated for the four sampling and the non-sampling based error analysis methods.  106
5.3 Running times for the error analysis methods on calculating the MFPT distribution of an 87 state example.  106
List of Figures

2.1 The shooting algorithm for sampling paths.  9
2.2 Clustering of MSM points.  11
2.3 Clustering of nodes to guarantee that all nodes can reach the final state.  11
2.4 Contour graph of the potential energy, E(x, y), of the model energy landscape.  18
2.5 The correlation between Pfold values calculated directly from many simulations and MSM simulations on the model energy landscape.  20
2.6 The comparison between the MFPT calculated directly from many simulations and from the MSM simulations as a function of temperature.  21
2.7 The comparison between the MFPT calculated from many simulations to the MFPT calculated from reweighted versions of a single MSM as a function of temperature.  22
2.8 Error analysis of direct simulations and the various MSM techniques.  23
2.9 The effect of clustering cutoff on the calculated MFPT for the model system and trpzip2 peptide.  25
3.1 Flowchart of the automatic state decomposition algorithm.  41
3.2 Potential of mean force and manual state decomposition for alanine dipeptide.  47
3.3 Comparison of manual and automatic state decompositions for alanine dipeptide.  49
3.4 Stability and recovery of optimal state decomposition for alanine dipeptide.  51
3.5 Implied time scales of the Fs peptide as a function of lag time for 20-state automatic state decomposition.  54
3.6 Reproduction of observed state population evolution by Markov model for the Fs peptide.  56
3.7 Comparison of some trpzip2 macrostates found by automatic state decomposition with misregistered hydrogen bonding states identified in a previous study.  58
3.8 Implied time scales of trpzip2 as a function of lag time for 40-state automatic state decomposition.  59
4.1 The Dynamic Bayesian Network corresponding to a Markovian state model.  71
4.2 The transition probabilities and state definitions for a simple model with 9 conformations.  74
4.3 The difference in scores between a 9-state and 3-state definition of the transition model of 9 conformations for different lag times τ and number of data instances M.  77
4.4 Comparison of MSMs corresponding to the subdivision of states.  79
4.5 Several state decompositions for the terminally blocked alanine dipeptide.  81
4.6 Comparison of different state definitions for the terminally blocked alanine peptide.  82
5.1 Distributions of the mean first passage time as generated by the first sampling based method on the 87 state example.  104
5.2 Distribution of the mean first passage time as calculated by the five error analysis methods.  105
5.3 Effect of adaptive sampling on the variance of the mean first passage time.  107
5.4 Relationship between the number of samples and the variance for the even and adaptive sampling algorithms.  108
5.5 Percent of samples required for each state in the optimal allocation of samples per state.  109
6.1 Potential of mean force and manual state decomposition for terminally-blocked alanine peptide.  121
6.2 Distributions of the five non-unit eigenvalues of the system shown in Fig. 6.1.  123
6.3 The percent contribution of each state to the variance for the five non-unit eigenvectors (Eq. 6.24).  124
6.4 Distributions of the eigenvector components corresponding to the third and fifth eigenvalues of the six-state model of the terminally blocked alanine peptide.  125
6.5 The contributions to the variance of the eigenvector components as decomposed by transitions from each state.  126
6.6 The mean and standard deviation of the variance of the largest non-unit eigenvalue of the six-state model of terminally blocked alanine peptide and the 2454-state model of the villin headpiece for different sampling algorithms.  128
Chapter 1
Introduction
Many important processes in biology occur at the molecular scale. A detailed understanding of these
processes can lead to significant advances in the medical and life sciences – for example, many diseases are caused by protein aggregation or misfolding [Dob03] and potential drug molecules can be
designed by understanding their binding properties and conformational changes. These processes
have typically been studied through experiments. While such experiments can yield a wealth of
insight, they are often insufficient to describe the system dynamics on an atomic scale, which is desirable for many of the problems of interest. An alternate approach is to use physically-based computational simulations to model the interactions and movement of the molecules. These simulations
are typically performed in atomic detail and with small time steps in order to accurately reproduce
the underlying dynamics. The interesting movements of these systems occur over a relatively long
time scale (protein folding times range from microseconds to seconds) and, unfortunately, atomistic
molecular dynamics simulations of this length would take thousands of CPU-years to complete.
Several methods have been developed for making the computational problem tractable. It is
possible to trade accuracy for speed through the use of simplified models to simulate the molecular system (e.g., coarse-grained atoms, lattice models, simplified forcefields). When the resulting
loss in accuracy is unacceptable, another approach harnesses the power of parallel and distributed
computing. One strategy is to parallelize a single simulation across multiple processors. However, parallelizing a single simulation requires significant inter-processor communication because
of the long-range interactions between atoms. An alternate strategy is to perform a large number
of independent simulations, distributed across multiple processors with little or no communication.
While this strategy does not reduce the required time to generate long trajectories, it can efficiently
generate a large number of short, independent trajectories. With sufficient sampling, a set of short,
independent simulations can capture all of the short segments of the system dynamics, which
can then be merged to describe the overall behavior.
This dissertation involves the analysis of large ensembles (tens of thousands) of short, independent molecular dynamics trajectories (nanoseconds to microseconds) to extract as much information as possible about the underlying system by constructing a compact model of the dynamics. The
actual dynamics are a series of transitions between the detailed configurations of the system, for
example, the Cartesian coordinates of all the atoms. To overcome the high dimensionality of the
configuration space, we developed a model which groups sets of similar configurations into discrete
states and describes the dynamics as transitions between these states. Once the states are defined, it
is possible to calculate the transition probabilities between these states by counting the transitions
that occur in the simulation data between the configurations assigned to each state.
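The counting step described above can be sketched as follows; the function, its name, and its inputs (trajectories already discretized into arrays of state indices) are illustrative, not the implementation actually used in this work:

```python
import numpy as np

def transition_matrix(state_sequences, n_states, lag=1):
    """Estimate MSM transition probabilities by counting observed
    transitions at a given lag in each discretized trajectory."""
    counts = np.zeros((n_states, n_states))
    for seq in state_sequences:
        for t in range(len(seq) - lag):
            counts[seq[t], seq[t + lag]] += 1
    # Row-normalize the counts into probabilities; states with no
    # observed outgoing transitions default to self-transitions.
    totals = counts.sum(axis=1, keepdims=True)
    return np.where(totals > 0, counts / np.maximum(totals, 1), np.eye(n_states))
```

Each row of the resulting matrix sums to one, so it can be used directly as the transition kernel of the Markov chain.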
We call the states and transition probabilities a Markovian state model (MSM), because the
model assumes that the transitions between states are history independent. When the transitions of
the underlying system are Markovian over the defined states, a random walk in the model will mimic
the true dynamics. It is possible to build this model from short (nanosecond length) trajectories, but
then use the constructed model to project to infinite timescales. We can thus study the mechanism
of long time scale events, such as protein folding, which were previously inaccessible through simulations. The model naturally handles complex kinetic behavior and can accurately describe kinetic
mechanisms through intermediate and trap states in the landscape. Important quantities, such as the
rate of folding, can be computed efficiently from this graph based model and compared with experimental results to validate the model. The MSM has been applied to small peptides, non-biological
polymers, and vesicle fusion, with good agreement to experimental rates.
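Projecting the model to long time scales amounts to repeated application of the transition matrix to a population vector. A minimal sketch, using an invented three-state matrix whose numbers are purely illustrative:

```python
import numpy as np

# Invented three-state example: 0 = unfolded, 1 = intermediate, 2 = folded.
T = np.array([[0.90, 0.09, 0.01],
              [0.05, 0.90, 0.05],
              [0.00, 0.02, 0.98]])

# A population starting entirely in the unfolded state, propagated for
# 1000 lag times, far longer than any single short trajectory.
p = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    p = p @ T

# At long times the populations approach the stationary distribution,
# the left eigenvector of T with eigenvalue 1.
w, v = np.linalg.eig(T.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()
```

The decay of the remaining eigenvalues toward the unit eigenvalue sets the relaxation time scales of the model, which is what allows nanosecond trajectories to inform microsecond kinetics.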
Chapter 2 describes how to build a Markovian state model from molecular dynamics data and
how to efficiently compute kinetic properties from this model, such as the probability that a conformation will fold before unfolding, and the average time to transition between two states. It also
describes a method for modifying the MSM to calculate kinetic properties at parameters other than
the simulation parameters. The “proof of concept” of the Markovian state model is demonstrated
on dynamics over a simple two-dimensional energy landscape and simulation data of the 12-residue
β-peptide trpzip2. The material presented in this chapter was published in the Journal of Chemical
Physics [SSP04].
One of the main difficulties in building a MSM is in defining the states, as the model is only
useful and representative of the underlying dynamics if transitions on the defined states are, in fact,
Markovian. Previous descriptions of states usually relied on order parameters that assume that the
relevant degrees of freedom are known or on structural clustering. In Chapter 3, we describe an
iterative algorithm for automatically finding important states from the simulation data. Because the
configuration space is high dimensional, an automatic algorithm is required to avoid any inadvertent
subjective bias that could be introduced through manual methods. Our randomized heuristic algorithm combines clustering by structural similarity with clustering by kinetic similarity to produce
states that faithfully describe the dynamics of the system. Including kinetic similarity measures
greatly improves the history-independence of the states, since structurally similar configurations
may have different kinetic properties, and structurally diverse configurations may behave similarly.
The state decomposition algorithm is applied to three peptide systems: the terminally blocked alanine peptide, the 21-residue helical Fs peptide, and trpzip2. This chapter describes joint work
with John Chodera and William Swope which was published in the Journal of Chemical Physics
[CSP+07], and I contributed jointly to the design and implementation of the algorithm and the
application to the test systems.
MSMs are not unique and models with different numbers of states may all describe the system
dynamics. Chapter 4 adapts and applies concepts from structure learning of Bayesian Networks
to quantitatively compare models with different numbers of states and state definitions. We show
how to convert a MSM into a Bayesian Network and then evaluate how well the MSM fits the
data through maximum likelihood and Bayesian scoring functions. The advantage of the Bayesian
scoring function is that it automatically accounts for the amount of simulation data and can better
discriminate the appropriate number of states in the MSM. No previous methods for evaluating
Markovian state models explicitly quantified the tradeoff between the number of states in the model
and the amount of simulation data needed for their parameterization. The scoring functions are
evaluated on MSMs corresponding to a simple transition model and the alanine peptide.
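The maximum-likelihood part of such a score has a simple closed form: with the maximum-likelihood transition probabilities T_ij = C_ij / C_i computed from the observed counts C, the log-likelihood of the data is the sum over i, j of C_ij log T_ij. A sketch of just this part (the Bayesian score, which additionally integrates over the transition probabilities under a prior, is not shown):

```python
import numpy as np

def log_likelihood(counts):
    """Maximum-likelihood score of a state decomposition: with MLE
    transition probabilities T_ij = C_ij / C_i, the log-likelihood
    of the observed transitions is sum_ij C_ij * log(T_ij)."""
    totals = counts.sum(axis=1, keepdims=True)
    T = counts / np.maximum(totals, 1e-300)  # guard empty rows
    mask = counts > 0                        # zero-count terms contribute nothing
    return float((counts[mask] * np.log(T[mask])).sum())
```

Because subdividing states can only increase this likelihood, it cannot by itself select the number of states; that is precisely the role of the Bayesian score's built-in penalty for extra parameters.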
Once a model has been created, there are numerous quantities which can be calculated from
it, and determining their accuracy is an important task. Since there is only a finite amount of simulation data, the transition probabilities in the MSM, which are defined from this data, will have
uncertainties associated with them. We developed techniques to assess the resulting uncertainty of
many important kinetic properties that can be calculated from the model. Previous methods for
uncertainty analysis relied on repeatedly sampling possible transition probabilities and calculating
the quantity of interest, which is computationally expensive and does not scale well for large models. We instead derived efficient closed-form functions that approximate the distributions of several
properties. Chapter 5 describes methods for the error analysis of the average time to transition
between two states and Chapter 6 extends these techniques to calculate the uncertainties in such
properties as the equilibrium distribution and reaction rates. We show that these functions are excellent approximations to the distributions obtained through repeated sampling and present efficient
sparse matrix techniques that allow these calculations to scale to large systems.
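The sampling-based baseline that these chapters improve upon can be sketched as follows: each row of the transition matrix is drawn from a Dirichlet distribution over the observed counts, and the quantity of interest (here the MFPT) is recomputed for each draw. The counts, the 1/2 pseudocount, and the helper function are illustrative assumptions, not the actual procedure of Chapters 5 and 6:

```python
import numpy as np

rng = np.random.default_rng(0)

def mfpt(T, target, lag=1.0):
    """Mean first passage time to `target` from every state, from the
    linear system (I - T') m = lag * 1, where T' is T with the target
    row and column removed."""
    n = T.shape[0]
    keep = [i for i in range(n) if i != target]
    A = np.eye(n - 1) - T[np.ix_(keep, keep)]
    out = np.zeros(n)
    out[keep] = np.linalg.solve(A, lag * np.ones(n - 1))
    return out

# Draw transition matrices row-by-row from Dirichlet posteriors over
# invented counts, and recompute the MFPT from state 0 to state 2.
counts = np.array([[80.0, 20.0, 0.0], [10.0, 80.0, 10.0], [0.0, 5.0, 95.0]])
samples = [mfpt(np.array([rng.dirichlet(row + 0.5) for row in counts]), target=2)[0]
           for _ in range(200)]
mean, std = float(np.mean(samples)), float(np.std(samples))
```

Each sample requires a full linear solve, which is what makes this approach expensive for large models and motivates the closed-form approximations developed in these chapters.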
In addition, we developed methods to identify the states that contribute the most to the uncertainty and designed an adaptive sequential sampling scheme where one can selectively start new
simulations to reduce the uncertainty with greatly reduced computational cost. When applied to
a MSM of the 12-residue trpzip2 (Chapter 5), the adaptive sampling scheme required 20 times
fewer simulations as compared with the usual unguided simulations. When applied to a MSM of
the 36-residue α-helical villin headpiece (Chapter 6), the adaptive sampling scheme required three
orders of magnitude fewer simulations, thus demonstrating the power of simulation planning. Chapter 5 and Chapter 6 are based on papers published previously in the Journal of Chemical Physics
[SP05a, HP07].
Chapter 2
Markovian state models
We introduce an efficient new method for predicting protein folding rate constants and mechanisms
from molecular dynamics simulations. The Markovian state model (MSM) is a discrete representation of the underlying kinetics of the molecular system. Using the MSM framework, we show
how to quickly calculate the folding probability (Pfold) and mean first passage time of all the sampled conformations. In addition, we provide techniques for evaluating these values under perturbed
conditions, specifically different temperatures, without expensive recomputations. To demonstrate
this method on a challenging system, we apply these techniques to a two-dimensional model energy
landscape and the folding of a tryptophan zipper β-hairpin.
2.1 Introduction
The direct simulation of protein folding has been a “grand challenge” of computational biology
for several decades [SB01]. Simulating protein folding is particularly challenging due to the long
time scales involved. While the fastest proteins fold on the microsecond to millisecond time scale,
atomistic molecular dynamics simulations are typically constrained to the nanosecond time scale.
In order to overcome this fundamental computational barrier, several new computational methods
have been proposed.
One such approach to study protein folding events is transition path sampling [DBCC98]. Given
an initial trajectory between the unfolded and folded regions, this method generates an ensemble of
different pathways that join the unfolded and folded regions. From these path ensembles, Bolhuis
and coworkers determined the formation order of hydrogen bonds and the hydrophobic core in a
β-hairpin [Bol03]. Using the fluctuation-dissipation theorem [Cha87], it is possible to calculate
folding rates from these ensembles [DBCC98]. More recently, a new method called transition interface sampling [vEMB03] introduced an alternate way to calculate transition rates. One drawback
of these methods is that they do not utilize all the simulation results. To ensure that trajectories are
decorrelated, only every fifth or tenth pathway generated is added to the path ensemble. Also, these
methods require many individual path sampling simulations corresponding to different boundary
conditions in order to calculate rates. Since path sampling methods are computationally demanding, it is interesting to consider whether one can construct an algorithm which can more efficiently
utilize simulation data (e.g. folding trajectories) in order to predict folding rates and mechanisms.
There are also techniques that analyze the nature and kinetics of the folding process by representing possible pathways in a graph, or roadmap. These methods sample configuration space
and connect nearby points with weights according to their Monte Carlo probabilities. From these
graphs, it is possible to calculate such properties as the shortest path, most probable path, and Pfold
values [ABG+02], as well as analyze the order in which secondary structures form [STD+03].
The primary challenge of these techniques is how to sample the conformational states in order to
construct the pathway graph. The graph representation of protein folding pathways does not solve
the sampling problem, but recasts it, and sampling any continuous, high dimensional space is still
a difficult challenge. Previous graph-based methods have sampled configuration space uniformly
(e.g., choosing conformations at random) or used sampling methods biased towards the native state.
Clearly, as the protein size increases, it becomes exponentially difficult to sample the biologically
important conformations with random sampling. In addition, while probabilistic roadmap methods
can predict Pfold values [ABG+02] and suggest structure formation order [STD+03], they have not
included the time involved in the transitions. Because of this, one cannot predict time dependent
properties such as folding rates, and thus it is difficult to assess the experimental validity of these
methods.
This chapter outlines a novel combination of the techniques described above. We propose transforming the simulation data gathered from transition path sampling algorithms into a probabilistic
roadmap that includes information about the transition times. As opposed to traditional transition
path sampling analysis, this method would incorporate all of the simulated data into the results,
therefore potentially yielding an increase in efficiency. This method is also general enough to work
on any molecular dynamics data sets, not just those gathered from path sampling simulations. We
call our model a Markovian state model, or MSM, as it assumes Markovian transitions between
states. From this MSM we can quickly and simultaneously calculate such properties as the Pfold
for all configurations sampled and the mean first passage time (MFPT) from the unfolded state to
the folded state from a single transition path sampling simulation. In addition, this method provides
a compact representation of the possible pathways in the system, which may be useful for understanding the mechanisms involved in folding. We suggest that our method would improve on the
current roadmap techniques by sampling points using molecular dynamics, thereby greatly increasing the probability that the configurations that are included are kinetically relevant. In addition, the
simulation time between points would inherently capture transition times, making the calculation of
folding rates possible.
In the following sections we describe the algorithms necessary to transform molecular trajectories into a MSM with the correct transition probabilities and times (Secs. 2.2.3 and 2.2.4). We
also provide methods that allow for data gathered at one set of parameters, such as temperature,
to be analyzed easily at other parameter values without the need for additional simulations (Sec.
2.2.5). We then describe linear algebra techniques to quickly calculate such values as P_fold and
MFPT (Sec. 2.2.6). We first give results on a model energy landscape and find that they are in good
agreement with results from direct simulations (Sec. 2.3.1). Finally, we apply these methods to the
analysis of existing simulation data of the folding of a small protein: the tryptophan zipper β-hairpin
[SQD+ 04] (Sec. 2.3.2).
2.2 Theory and methods
2.2.1 Direct rate calculations
One purpose of the MSM is to understand kinetics when one cannot easily simulate transitions from
one state to another (e.g., for slow transitions from the unfolded state to the folded state). However,
to validate the new methods for calculating kinetic properties, it is important to test the methods on
systems in which the direct kinetics simulations can be performed. In this case, one can calculate the
mean first passage time (in terms of number of Monte Carlo steps for Monte Carlo (MC) simulations
and simulated time for Langevin simulations) directly from many independent simulations, even if
these simulations are each shorter than the mean folding time.
If one assumes first order kinetics, the probability that a particle has reached the final state at
some time t is given by
P_f(t) = 1 - e^{-kt},  (2.1)

where t is the time, k is the rate, and P_f(t) is the probability of having reached a final state by time
t. By running many independent simulations shorter than 1/k, one can estimate the cumulative
distribution P_f(t), and hence fit the value for the rate, k. The mean first passage time is the average
time when a particle will first reach the final state, given that it is in an initial state at t = 0,
MFPT = \int_{t=0}^{\infty} \left( \frac{d}{dt} P_f(t) \right) t \, dt = \int_{t=0}^{\infty} k t e^{-kt} \, dt.  (2.2)
Integrating by parts yields the solution
MFPT = \frac{1}{k}.  (2.3)
One could also find the MFPT by directly calculating the average time when each simulation
first reached a final state. However, if, because of simulation time constraints, some simulations are
stopped before reaching the final state, the calculated MFPT will be too low. By first fitting the
rate to the P_f(t) data (which can be calculated accurately even if some simulations do not finish),
the MFPT value will be more accurate. For simple systems (such as the two-dimensional energy
landscape presented below), one can directly simulate kinetics on long time scales.
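For illustration, the following self-contained Python sketch (not from the thesis; the synthetic exponential data, bin count, and regression-through-the-origin fit are my assumptions) fits k to the empirical P_f(t) from censored first-passage times and contrasts MFPT = 1/k with the naive average over only the runs that finished:

```python
import math
import random

def fit_rate(first_passage_times, t_max, n_bins=20):
    """Fit k from the empirical cumulative distribution P_f(t) = 1 - exp(-kt).

    Times recorded as None are runs stopped at t_max without reaching the
    final state; they still contribute to the denominator of P_f(t).
    """
    n_total = len(first_passage_times)
    finished = sorted(t for t in first_passage_times if t is not None)
    # Evaluate the empirical CDF on a grid and regress -ln(1 - P_f) on t.
    ts, ys = [], []
    for i in range(1, n_bins + 1):
        t = t_max * i / n_bins
        p = sum(1 for ft in finished if ft <= t) / n_total
        if p < 1.0:
            ts.append(t)
            ys.append(-math.log(1.0 - p))
    return sum(t * y for t, y in zip(ts, ys)) / sum(t * t for t in ts)

random.seed(0)
k_true = 0.02
t_max = 30.0          # many runs are censored: t_max < 1/k_true = 50
samples = [random.expovariate(k_true) for _ in range(20000)]
censored = [t if t <= t_max else None for t in samples]

k_fit = fit_rate(censored, t_max)
naive = (sum(t for t in censored if t is not None)
         / sum(1 for t in censored if t is not None))
print(1.0 / k_fit, naive)
```

With the censoring cutoff well below 1/k, the naive average of the finished runs is far too low, while the fitted 1/k recovers the true MFPT.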
2.2.2 Sampling of paths
For systems where the probability of reaching the final state is very low, the above direct method
would require a large number of simulations to get a reasonable estimate of the rate of folding
and the mean first passage time. The method that we describe below is a modified version of the
shooting algorithm [DBC98] that has been shown to efficiently generate a sample of uncorrelated
transition paths leading from the initial region to the final region.
First, we must obtain some initial path between the initial and final regions. This can be obtained
from previous data, high temperature unfolding simulations, direct MC or Langevin simulations,
or some other means. We keep points on this path such that successive points are separated by
some time interval, τ. We can label the points along this path as {p_0, p_1, ..., p_n}, where n is the
length of the path. We generate new paths by picking a random point along the current path, p_i,
and shooting a new path from it by starting a new simulation from this point. Points are recorded
along this path every τ and are labeled {np_0, np_1, ..., np_m}. If neither the initial nor final state
is reached within some simulation time cutoff, we reject this path and the current path remains the
same for the next iteration. Otherwise, if either of these states is reached, we stop the simulation
at that time point and define our new current path as the combination of the previous current path
and the newly generated path. If the new path reached the initial state, then the new current path
Figure 2.1: The shooting algorithm for sampling paths. The solid path shows an original path
between the initial and final regions. The red paths represent two possible new path segments,
corresponding to the new path reaching either the initial or final regions. In the case of the path
reaching the initial state, the accepted path would be {np′_2, np′_1, np′_0, p_4, p_5, p_6}. In the case of the
path reaching the final state, the accepted path would be {p_0, p_1, p_2, p_3, np_0, np_1, np_2}.
is {np_m, np_{m-1}, ..., np_0, p_i, p_{i+1}, ..., p_n}. If the new path reached the final state, then the new
current path is {p_0, p_1, ..., p_i, np_0, np_1, ..., np_m} (Fig. 2.1). We repeat this shooting step for some
set number of trials.
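The splicing rules of the shooting step can be sketched as follows; the one-dimensional diffusive walk, the region definitions, and the step size below are illustrative assumptions, not the systems studied in this chapter:

```python
import random

def shooting_move(path, step, in_initial, in_final, max_steps, rng):
    """One accept/reject step of the shooting algorithm described above.

    `path` is a list of recorded points; `step` advances the dynamics by
    one recording interval tau. Returns the (possibly unchanged) path.
    """
    i = rng.randrange(len(path))
    new_seg = []
    x = path[i]
    for _ in range(max_steps):
        x = step(x, rng)
        new_seg.append(x)
        if in_initial(x):
            # Reverse the new segment and prepend it to the tail of the old path.
            return list(reversed(new_seg)) + path[i:]
        if in_final(x):
            # Keep the head of the old path and append the new segment.
            return path[: i + 1] + new_seg
    return path  # neither region reached within the cutoff: reject

# Toy 1D system (an assumption for illustration): diffusive walk between an
# initial region x <= 0 and a final region x >= 1.
rng = random.Random(1)
step = lambda x, r: x + r.gauss(0.0, 0.15)
in_I = lambda x: x <= 0.0
in_F = lambda x: x >= 1.0
path = [0.1 * j for j in range(11)]  # a straight initial path from I to F
for _ in range(200):
    path = shooting_move(path, step, in_I, in_F, 400, rng)
print(len(path))
```

Note that the splicing preserves the invariant that the current path always starts in the initial region and ends in the final region.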
This sampling strategy will capture paths between the boundaries of the initial and final region.
If we are to calculate the MFPT between the initial and final regions, we must also simulate the time
a molecule can spend within the initial region. To do this, we start many simulations from within
the initial region and stop the simulations once the boundary of that region has been crossed.
2.2.3 MSM generation
Here, we describe how to generate the MSM of conformational states, including the probability and
time to traverse from node to node in the MSM. Each conformation in the paths accepted while
sampling paths is represented by a node in the MSM, node_i, for some unique index i. Successive
points in each accepted path segment are represented by edges in the MSM, edge_ij, representing
an edge between node_i and node_j. Each edge has associated with it the simulation time taken
to traverse that edge, time_ij, and the probability of taking that edge, P_ij. We initialize all edge
probabilities to one and renormalize in the post-processing step. If there is no edge between states i
and j, we set the probability P_ij to zero. The MSM may be generated from data from the transition
path sampling shooting algorithm as above, or from any existing simulation data, so long as the time
between points in the simulation is known. Simulations of different time resolutions may also be
included in a single MSM.
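A minimal sketch of this bookkeeping might look as follows; for illustration, the trajectories here already use shared integer state labels (in the actual method each raw conformation is unique until the clustering step described next):

```python
from collections import defaultdict

def build_msm(trajectories):
    """Accumulate MSM edges from trajectories.

    Each trajectory is a list of (state_index, time) samples; consecutive
    samples define an edge.  Edge "probabilities" are stored as raw counts
    (one per observed transition) and only normalized in post-processing;
    edge times are count-weighted averages, so repeated observations of the
    same edge combine as in Eqs. 2.4 and 2.5.
    """
    counts = defaultdict(float)   # (i, j) -> number of observed i -> j transitions
    times = defaultdict(float)    # (i, j) -> mean transit time of the edge
    for traj in trajectories:
        for (i, t0), (j, t1) in zip(traj, traj[1:]):
            dt = t1 - t0
            c = counts[(i, j)]
            # Fold the new observation into the existing parallel edge.
            times[(i, j)] = (times[(i, j)] * c + dt) / (c + 1.0)
            counts[(i, j)] = c + 1.0
    return counts, times

# Two short trajectories over pre-labeled states (illustrative data); note
# that simulations of different time resolutions may be mixed freely.
trajs = [[(0, 0.0), (1, 0.005), (2, 0.010)],
         [(0, 0.0), (1, 0.007), (0, 0.012)]]
counts, times = build_msm(trajs)
print(counts[(0, 1)], times[(0, 1)])
```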
The MSM is designed to embody the possible pathways that the molecule may take while
traversing the conformation space. Different paths generated by our simulation methods may pass
through very similar conformations, but since the conformation space is continuous, these points
will never be exactly the same. However, we wish to capture the fact that these paths reach essentially the same point. We can do this by clustering nearby points in conformation space according to
some metric. We define some cutoff value that represents how close two points need to be in order
for us to consider them to be the same kinetically. Then, we combine points that are within this
distance from one another according to some clustering algorithm. We may choose different cutoffs
for the different regions of conformation space, the initial region, the final region, and the transition
region.
To combine two points, we remove all the incoming and outgoing edges from one of the points
and connect them to the other point. If there are now multiple edges between two nodes, we combine
them into a single edge with the following values (Fig. 2.2):
P^{new}_{ij} = P^{1}_{ij} + P^{2}_{ij},  (2.4)

time^{new}_{ij} = \frac{P^{1}_{ij} \cdot time^{1}_{ij} + P^{2}_{ij} \cdot time^{2}_{ij}}{P^{1}_{ij} + P^{2}_{ij}}.  (2.5)
The coordinates of clustered points are represented as the weighted average of all points belonging
to the cluster.
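The edge-combination rule can be sketched as a small helper that merges one node into another, combining any resulting parallel edges with Eqs. 2.4 and 2.5 (the dictionary-of-edges layout is an illustrative choice, not the thesis's data structure):

```python
def merge_nodes(P, time, keep, drop):
    """Merge node `drop` into node `keep` in an MSM.

    P and time are dicts keyed by (i, j), holding unnormalized edge weights
    and mean edge times.  All edges touching `drop` are redirected to
    `keep`; parallel edges are combined per Eqs. 2.4 and 2.5.
    """
    for (i, j) in list(P):
        if drop not in (i, j):
            continue
        ni = keep if i == drop else i
        nj = keep if j == drop else j
        p2, t2 = P.pop((i, j)), time.pop((i, j))
        p1, t1 = P.get((ni, nj), 0.0), time.get((ni, nj), 0.0)
        P[(ni, nj)] = p1 + p2                              # Eq. 2.4
        time[(ni, nj)] = (p1 * t1 + p2 * t2) / (p1 + p2)   # Eq. 2.5
    return P, time

# Merging node 2 into node 1 creates parallel edges 0->1 and 1->3,
# which are combined (illustrative weights and times).
P = {(0, 1): 1.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0}
t = {(0, 1): 0.004, (0, 2): 0.006, (1, 3): 0.010, (2, 3): 0.020}
merge_nodes(P, t, keep=1, drop=2)
print(P, t)
```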
2.2.4 Post-processing of MSMs
We need to ensure that every node in the MSM is able to reach a final state. Otherwise, since these
nodes will have an infinite mean first passage time, calculations done on the MSM will fail. We
identify the nodes that can reach a final state by performing a depth first search from the final states
over the incoming edges, and marking all nodes that are reachable.
We propose two different methods for removing the nodes that were not marked. In the first,
we simply delete those nodes, thus ensuring that all nodes in the MSM can reach a node in the final
state. If there are not many such nodes, this should not bias the results very much. However, if there
are many unmarked nodes, deleting these nodes could distort the results. An alternate method for
removing nodes that cannot reach the final state is to merge each into its closest neighbor until all
nodes can reach the final state (Fig. 2.3). This nearest neighbor provides the best guess to the future
dynamics of the unmarked node with respect to reaching the final state.
Figure 2.2: Clustering of MSM points. If two nodes are closer than a cutoff for some metric, we
cluster together these points by replacing them with a new point containing all of their edges. The
left picture shows the nodes before clustering, with the dotted circle indicating the nodes that will
be merged. The right picture shows the nodes after this clustering step.
In addition, we normalize the probabilities on all the edges so that on each node, the sum of the
probabilities for all outgoing edges is one,
P^{new}_{ij} = \frac{P_{ij}}{\sum_k P_{ik}}.  (2.6)
The probability on each edge equals the number of times that transition was made divided by the
total number of transitions from that node. Given sufficient sampling, these probabilities will converge towards the actual probabilities of each transition from that node.
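Both post-processing steps (the deletion variant of the reachability pruning, and the normalization of Eq. 2.6) can be sketched as follows; the edge-dictionary layout is an illustrative choice:

```python
from collections import defaultdict

def postprocess(P, final_states):
    """Keep only nodes that can reach a final state, then normalize rows.

    P maps (i, j) to an unnormalized edge weight.  A depth-first search
    from the final states over *incoming* edges marks reachable nodes;
    unmarked nodes and their edges are dropped (the deletion variant
    described above), and outgoing probabilities are renormalized (Eq. 2.6).
    """
    incoming = defaultdict(list)
    for (i, j) in P:
        incoming[j].append(i)
    marked, stack = set(final_states), list(final_states)
    while stack:
        j = stack.pop()
        for i in incoming[j]:
            if i not in marked:
                marked.add(i)
                stack.append(i)
    P = {(i, j): w for (i, j), w in P.items() if i in marked and j in marked}
    row = defaultdict(float)
    for (i, j), w in P.items():
        row[i] += w
    return {(i, j): w / row[i] for (i, j), w in P.items()}

# Node 3 is a dead end that cannot reach the final state {2}, so it and
# its edges are removed before normalization.
P = {(0, 1): 2.0, (0, 3): 1.0, (1, 2): 1.0, (3, 3): 1.0}
print(postprocess(P, final_states={2}))
```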
Figure 2.3: Clustering of nodes to guarantee that all nodes can reach the final state. In the left
picture, the gray nodes and edges are not able to reach a final state. The dotted circle indicates
which two nodes will be merged. The right picture shows the MSM after this step. All nodes can
now reach the final state.
2.2.5 Reweighting of edges
The MSM now represents a discrete sampling of the conformation space, and the edges represent
the transitions between these states, weighted with the correct probabilities. If some parameters of
the system were to change, one could simply adjust the edge weights by the relative probabilities at
each value of the parameters to generate a MSM at the new parameters. This assumes that the states
and transitions that would be sampled at the new parameters are the same as those sampled at the
original parameters. For example, it is common to examine folding at a series of temperatures (T );
instead of rerunning the calculation for each temperature, it would be ideal if one could reweight an
existing MSM for different temperatures.
This reweighting scheme is loosely analogous to thermodynamic reweighting schemes [Iba01].
While our methodology is for kinetic properties, both methods share the idea of reweighting an
ensemble generated at one temperature to yield information at another, and thus both rest on the assumption that the ensemble generated would be useful under the perturbed conditions. Accordingly,
one would not expect reasonable results for perturbations that are too large (e.g., temperatures far
from the original sampling).
The probability of moving between nodes depends on the nature of the dynamics used, since the
dynamics simulations were used to estimate the transition probabilities. Below, we will examine
how to reweight edges to build a MSM at a different temperature under two dynamics schemes,
Monte Carlo dynamics and Langevin dynamics.
Monte Carlo dynamics
First, consider simulations performed using the Metropolis Monte Carlo algorithm to generate
moves. Given a current point, x, a new point x′ is chosen from a distribution η(x, x′). This move is
accepted according to the Metropolis criterion [MRR+53],
P_{acc} = \begin{cases} 1 & E(x′) \le E(x) \\ e^{-[E(x′)-E(x)]/k_B T} & E(x′) > E(x) \end{cases},  (2.7)
where E(x) is the energy at point x, T is the temperature, and kB is Boltzmann’s constant.
The transition probability between two nodes as defined by this algorithm is then

P_{ij} = \eta(node_i, node_j) \begin{cases} 1 & E(node_j) \le E(node_i) \\ e^{-[E(node_j)-E(node_i)]/k_B T} & E(node_j) > E(node_i) \end{cases}.  (2.8)
To reweight the edges at a new temperature, we need the relative probability of each transition at
the two temperatures. Dividing Eq. 2.8 at the two temperatures of interest, we get

\frac{P_{ij}(T_1)}{P_{ij}(T_2)} = \frac{\eta(node_i, node_j, T_1)}{\eta(node_i, node_j, T_2)} \begin{cases} 1 & E(node_j) \le E(node_i) \\ \frac{e^{-[E(node_j,T_1)-E(node_i,T_1)]/k_B T_1}}{e^{-[E(node_j,T_2)-E(node_i,T_2)]/k_B T_2}} & E(node_j) > E(node_i) \end{cases}.  (2.9)
Assuming that η(node_i, node_j) and E(node) are independent of temperature, we get the following
equation for the transition probabilities at the new temperature:
P_{ij}(T_2) = \begin{cases} P_{ij}(T_1) & E(node_j) \le E(node_i) \\ e^{-\Delta E[(1/k_B T_2)-(1/k_B T_1)]} P_{ij}(T_1) & E(node_j) > E(node_i) \end{cases},  (2.10)

where \Delta E = E(node_j) - E(node_i).
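Applied to stored edge probabilities, Eq. 2.10 becomes a simple per-edge factor. In this sketch the energy table and k_B = 1 units are illustrative assumptions:

```python
import math

def reweight_mc(P1, E, T1, T2, kB=1.0):
    """Reweight Monte Carlo edge probabilities from T1 to T2 (Eq. 2.10).

    P1 maps (i, j) to the edge probability at T1, and E maps a node index
    to its energy (kB = 1 here, i.e., energies in units of kB).  The
    result must still be renormalized per node as in Eq. 2.6.
    """
    P2 = {}
    for (i, j), p in P1.items():
        dE = E[j] - E[i]
        if dE <= 0.0:
            P2[(i, j)] = p    # downhill moves are always accepted: unchanged
        else:
            P2[(i, j)] = p * math.exp(-dE * (1.0 / (kB * T2) - 1.0 / (kB * T1)))
    return P2

# Illustrative two-node MSM: lowering T suppresses the uphill 0 -> 1 edge.
E = {0: 0.0, 1: 1.0}
P1 = {(0, 1): 0.2, (1, 0): 0.8}
P2 = reweight_mc(P1, E, T1=1.0, T2=0.5)
print(P2)
```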
Langevin dynamics
Langevin dynamics is likely more representative of dynamical properties than Metropolis Monte
Carlo, especially since the kinetic interpretation of Monte Carlo relies on the physical nature of the
Monte Carlo moves chosen, η(x, x′). In Langevin simulations, one performs simulations using the
Langevin equation of motion,
F_{ext} - m\gamma \frac{dx}{dt} + R = 0,  (2.11)

\langle R(t)R(0) \rangle = 2 m \gamma k_B T \, \delta(t),  (2.12)
to move particles, where F_{ext} are the external forces acting on the particle, m is the mass, γ is the
friction coefficient, and R is a random force, assumed to be a Gaussian random variable with the
property given by Eq. 2.12. Rewriting this equation, we can find the change in position with respect
to the forces,
\Delta x = \frac{R\Delta t}{m\gamma} + \frac{F_{ext}\Delta t}{m\gamma},  (2.13)

P\!\left(\frac{R\Delta t}{m\gamma}\right) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(R\Delta t/m\gamma)^2/2\sigma^2}, \qquad \sigma^2 = \frac{2 k_B T \Delta t}{m\gamma},  (2.14)
where T is the temperature and the random displacement is distributed according to a normal distribution with standard deviation σ defined in Eq. 2.14 [AT91].
We stress that if a Langevin probability is to be used for transition probabilities, it is imperative
that successive nodes be highly related conformationally. Otherwise, if one tries to take large steps
(e.g., large ∆t), the constant external force approximation will not hold and the transition probabilities become irrelevant. This is the result of overextending the Langevin integrator, and it is unclear
whether the resulting MSM probabilities will have the desired physical interpretation.
We can now define the transition probability between two nodes as the probability of the random
displacement needed,
\frac{R\Delta t}{m\gamma} = x(node_j) - x(node_i) - \frac{F_{ext}(x(node_i))\Delta t}{m\gamma},  (2.15)

P_{ij} = \prod_\alpha \frac{1}{\sigma\sqrt{2\pi}} e^{-(\Delta x_R^\alpha)^2/2\sigma^2},  (2.16)
where α indexes the dimensions of the system. We again wish to compute the relative probabilities
at two temperatures, so we divide the above equation at the two temperatures,
\frac{P_{ij}(T_1)}{P_{ij}(T_2)} = \frac{\prod_\alpha \frac{1}{\sigma(T_1)\sqrt{2\pi}} e^{-(\Delta x_R^\alpha(T_1))^2/2\sigma(T_1)^2}}{\prod_\alpha \frac{1}{\sigma(T_2)\sqrt{2\pi}} e^{-(\Delta x_R^\alpha(T_2))^2/2\sigma(T_2)^2}}.  (2.17)
Assuming that the forces, mass, and friction coefficient are independent of temperature, and substituting for σ(T) as defined in Eq. 2.14, we get the following equation:
\frac{P_{ij}(T_1)}{P_{ij}(T_2)} = \prod_\alpha \sqrt{\frac{T_2}{T_1}} \, e^{-[(\Delta x_R^\alpha)^2 m\gamma/4\Delta t]\,[(1/k_B T_1)-(1/k_B T_2)]}.  (2.18)
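A sketch of the Langevin reweighting: Eq. 2.18 gives the ratio P_ij(T1)/P_ij(T2), so dividing the stored T1 probabilities by this ratio (and renormalizing afterwards) yields weights at T2. The per-edge random displacements and parameter values below are illustrative assumptions:

```python
import math

def langevin_ratio(dxR, T1, T2, m, gamma, dt, kB=1.0):
    """P_ij(T1) / P_ij(T2) from Eq. 2.18.

    dxR holds the per-dimension random displacement needed for the i -> j
    transition (Eq. 2.15); positions and forces, and hence dxR, are assumed
    temperature-independent.
    """
    ratio = 1.0
    for dx in dxR:
        ratio *= math.sqrt(T2 / T1) * math.exp(
            -(dx * dx * m * gamma / (4.0 * dt))
            * (1.0 / (kB * T1) - 1.0 / (kB * T2)))
    return ratio

def reweight_langevin(P1, dxR_edges, T1, T2, m, gamma, dt):
    """Edge probabilities at T2 given those at T1 (renormalize afterwards)."""
    return {e: p / langevin_ratio(dxR_edges[e], T1, T2, m, gamma, dt)
            for e, p in P1.items()}

# Illustrative two-edge example in two dimensions: lowering the temperature
# suppresses the large-displacement edge relative to the zero-displacement one.
P1 = {(0, 1): 0.5, (0, 0): 0.5}
dxR = {(0, 1): (0.3, -0.1), (0, 0): (0.0, 0.0)}
P2 = reweight_langevin(P1, dxR, T1=1.0, T2=0.5, m=1.0, gamma=91.0, dt=1.0)
print(P2)
```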
For both the Monte Carlo and Langevin dynamics schemes, we can now reweight MSMs generated from simulation data at one temperature to build a MSM at a different temperature. Since
we know the transition probability at the first temperature from the current MSM, we can easily
calculate the probability at the new temperature using Eq. 2.10 for Monte Carlo dynamics and Eq.
2.18 for Langevin dynamics. We again normalize all edge probabilities so that for each node, the
sum of the outgoing probabilities is one. In this way, we generate a MSM at a different temperature
without additional simulations. This analysis can be done on any parameter where it is possible to
define the relative transition probabilities in terms of the two parameter values.
2.2.6 Mean first passage time and P_fold calculation
The MSM consists of a set of nodes and a set of transitions or edges between these nodes. Each edge
has a probability associated with it as well as the time taken to traverse that edge. One can define the
P_fold of a node as the probability that a particle started at that node would reach the final state before
reaching the initial state [DPG+98]. P_fold values have been shown to be useful in understanding
the nature of the folding pathway in simplified [DPG+98, PGTR98] and atomistic [PR99, LS01,
GC02] models. Typically, one calculates P_fold values by running multiple simulations (differing by
random number seeds or initial velocities) and recording the fraction that fold before they unfold.
While this is computationally tractable (compared with a full folding simulation starting from the
unfolded state) and naturally parallelizable on massively-parallel or grid-computing architectures, it
can still be a demanding computational task, especially if the P_fold values for many conformations
are sought.
Following Apaydin et al. [ABG+02], we will use the MSM to calculate P_fold values. The P_fold
can be defined conditionally based on the first transition made from the node,
P_{fold}(node_i) = \sum_j P(transition(i \to j)) \, P_{fold}(node_i \mid transition(i \to j)),  (2.19)
where the sum is over all possible transitions from node_i. The possible transitions must be mutually
exclusive and the sum of their probabilities must be one. The probabilities of the transitions
from node_i, P(transition(i → j)), are simply the P_ij values defined previously. By the post-processing step, these probabilities satisfy the above condition. P_fold(node_i | transition(i → j)) is
simply the P_fold of node_j, which results in the following set of equations:
P_{fold}(node_i) = \sum_j P_{ij} P_{fold}(node_j),

P_{fold}(node_i) = 1, \quad node_i \in F,

P_{fold}(node_i) = 0, \quad node_i \in I,  (2.20)
where I is the initial region and F is the final region. This definition results in a system of n equations,
one for each of the n nodes in the system, in n unknowns, the P_fold(node_i) variables. This system of
equations can be solved by iteration as follows:
Initialize:  P^{0}_{fold}(node_i) = 1, \quad node_i \in F,
             P^{0}_{fold}(node_i) = 0, \quad node_i \notin F;

Iterate:     P^{t+1}_{fold}(node_i) = \sum_j P_{ij} P^{t}_{fold}(node_j),  (2.21)
until each P_fold converges. This iteration method is known as Jacobi iteration [GvL96]. Instead
of always using the P_fold values from the previous iteration, one can use the new values from the
current iteration as soon as they become available. This results in the following iterative scheme,
known as Gauss-Seidel iteration, which converges twice as fast as the Jacobi method [GvL96],
P^{t+1}_{fold}(node_i) = \sum_{j<i} P_{ij} P^{t+1}_{fold}(node_j) + \sum_{j \ge i} P_{ij} P^{t}_{fold}(node_j).  (2.22)
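A Gauss-Seidel sketch of Eqs. 2.20-2.22, checked on a symmetric four-node chain whose exact P_fold values are 1/3 and 2/3 (the edge-dictionary layout and the toy chain are illustrative choices):

```python
def pfold_gauss_seidel(P, nodes, initial, final, tol=1e-12, max_iter=10000):
    """Solve Eq. 2.20 by Gauss-Seidel iteration (Eq. 2.22).

    P maps (i, j) to normalized transition probabilities; boundary nodes
    in `initial` and `final` are held at 0 and 1 respectively.
    """
    pf = {i: 1.0 if i in final else 0.0 for i in nodes}
    for _ in range(max_iter):
        delta = 0.0
        for i in nodes:
            if i in initial or i in final:
                continue
            new = sum(p * pf[j] for (a, j), p in P.items() if a == i)
            delta = max(delta, abs(new - pf[i]))
            pf[i] = new          # use updated values immediately (Gauss-Seidel)
        if delta < tol:
            break
    return pf

# Symmetric chain I - a - b - F with unbiased steps; the exact commitment
# probabilities are P_fold(a) = 1/3 and P_fold(b) = 2/3.
P = {('a', 'I'): 0.5, ('a', 'b'): 0.5, ('b', 'a'): 0.5, ('b', 'F'): 0.5}
pf = pfold_gauss_seidel(P, ['I', 'a', 'b', 'F'], initial={'I'}, final={'F'})
print(pf['a'], pf['b'])
```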
Analogously, we can also get rate information from the MSM. Indeed, rates are a primary means
of comparison to experiment and are thus a critically important quantity to calculate in order to
experimentally validate any folding simulation. Rates have not been previously calculated from a
roadmap-type representation of states. Below we present a natural generalization of the method used to
calculate P_fold values to the calculation of rates in an efficient and precise manner.
One can define the mean first passage time (MFPT) of any node as the average time taken to get
from that node to any node in the final state. The MFPT can be defined conditionally based on the
first transition made from the node,
MFPT(node_i) = \sum_j P(transition(i \to j)) \, MFPT(node_i \mid transition(i \to j)),  (2.23)
where the sum is over all possible transitions from node_i. The MFPT of node_i, given that a transition
to node_j was made, is the time it took to get from node_i to node_j plus the MFPT from node_j. This
leads to an equation for the MFPT of
MFPT(node_i) = \sum_j P_{ij} \left( time_{ij} + MFPT(node_j) \right).  (2.24)
In addition, we define the boundary condition
MFPT(node_i) = 0, \quad node_i \in F.  (2.25)
This system of linear equations can be iterated in the same way as above except that the initial values
for the system should be
MFPT^{0}(node_i) = 0, \quad node_i \in F,

MFPT^{0}(node_i) = \infty, \quad node_i \notin F.  (2.26)
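A fixed-point sketch of Eqs. 2.24-2.25 on a toy two-state model. Note one deliberate deviation, flagged here as an assumption: instead of the infinite initial values of Eq. 2.26, this sketch starts from zero, which also converges (from below) whenever every node can reach the final state:

```python
def mfpt_iteration(P, times, nodes, final, tol=1e-10, max_iter=100000):
    """Solve Eqs. 2.24-2.25 by fixed-point iteration.

    P and times map (i, j) to the normalized transition probability and
    the edge time.  Final-state nodes are held at MFPT = 0 (Eq. 2.25);
    all other nodes start at zero rather than infinity (see lead-in).
    """
    m = {i: 0.0 for i in nodes}
    for _ in range(max_iter):
        delta = 0.0
        for i in nodes:
            if i in final:
                continue
            new = sum(p * (times[(a, j)] + m[j])
                      for (a, j), p in P.items() if a == i)
            delta = max(delta, abs(new - m[i]))
            m[i] = new
        if delta < tol:
            break
    return m

# Toy model: from 'u', each step of duration 1 reaches 'F' with probability
# 0.5, so the number of steps is geometric and the exact MFPT is 2.
P = {('u', 'u'): 0.5, ('u', 'F'): 0.5}
times = {('u', 'u'): 1.0, ('u', 'F'): 1.0}
m = mfpt_iteration(P, times, ['u', 'F'], final={'F'})
print(m['u'])
```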
2.3 Results
2.3.1 Model system
We first test the methods outlined above on a simple, two-dimensional model system. Due to their
tractability, such model systems are useful for demonstrating the benefits of the proposed method.
Our model system is defined by an energy potential of
E(x, y) = \frac{4(1 - x^2 - y^2)^2 + 2(x^2 - 2)^2 + ((x+y)^2 - 1)^2 + ((x-y)^2 - 1)^2 - 2}{6}  (2.27)
and has been used previously to test transition path sampling methods [DBCC98]. The initial and
final regions were defined by circles centered at (-1,0) with a radius of 0.2 and (1,0) with a radius
of 0.3 respectively. A contour graph of this energy landscape is shown in Fig. 2.4. Since this model
system is computationally tractable, we can directly compare our proposed methods to direct, brute
force simulations of the kinetics. In particular, we will compare the two kinetics methods described
in Sec. 2.2.5: Monte Carlo and Langevin dynamics.
We performed 10,000 Monte Carlo simulations for each temperature ranging from 0.1 to 1.0, at
intervals of 0.1. The move set was defined in each dimension as a normal distribution centered on
the current point,
\eta(x_\alpha, x'_\alpha) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x'_\alpha - x_\alpha)^2/2\sigma^2}.  (2.28)
The standard deviation was defined according to the distance the particle is expected to travel because of diffusion,
\sigma = \sqrt{\frac{D\Delta t}{2}},  (2.29)
where ∆t is the time step, defined as 0.0001 ps, and D is the diffusion constant, equal to
91.0 ps^{-1}. In addition, we also ran 10,000 Langevin simulations for each temperature ranging from
0.2 to 1.0, at intervals of 0.1. The forces, F_{ext}, were defined as the gradient of the energy potential
given in Eq. 2.27, the mass m was defined to be 1, and the viscosity γ was defined as 91.0. The
time step ∆t for these simulations was 1.
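The model potential and one Metropolis move can be written down directly from Eqs. 2.27-2.29 (k_B = 1 units, the temperature, and the length of the short run below are illustrative choices):

```python
import math
import random

def energy(x, y):
    """Model potential of Eq. 2.27."""
    return (4.0 * (1.0 - x * x - y * y) ** 2
            + 2.0 * (x * x - 2.0) ** 2
            + ((x + y) ** 2 - 1.0) ** 2
            + ((x - y) ** 2 - 1.0) ** 2
            - 2.0) / 6.0

def mc_step(x, y, T, sigma, rng):
    """One Metropolis move: Gaussian proposal (Eq. 2.28), acceptance per Eq. 2.7."""
    xn, yn = x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)
    dE = energy(xn, yn) - energy(x, y)
    if dE <= 0.0 or rng.random() < math.exp(-dE / T):   # kB = 1
        return xn, yn
    return x, y

# sigma per Eq. 2.29 with the stated dt = 0.0001 ps and D = 91.0 ps^-1.
sigma = math.sqrt(91.0 * 0.0001 / 2.0)
rng = random.Random(0)
x, y = -1.0, 0.0                      # start in the initial basin
for _ in range(1000):
    x, y = mc_step(x, y, 0.4, sigma, rng)
print(energy(-1.0, 0.0), energy(0.0, 1.0), energy(0.0, 0.0))
```

The energies at the minima (±1, 0), the saddles (0, ±1), and the central hill (0, 0) come out to 0, 1, and 2, matching the barrier heights described in the caption of Fig. 2.4.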
Figure 2.4: Contour graph of the potential energy, E(x, y), of the model energy landscape. The
initial and final regions are represented by the circles labeled I and F respectively. The energy
difference between the stable regions and the valley in between them is approximately 1 and between
the stable regions and the hill in between them is approximately 2 (energy in arbitrary units).
For each temperature and type of simulation, five sets of 10,000 independent simulations were
started from the initial state, and the time at which they reached the final state was recorded. The
initial point in each simulation was sampled randomly from points on the border of the initial region.
The mean first passage time for each set was calculated from these 10,000 trials.
MSM generation
To test Monte Carlo kinetics, MSMs were generated on the model energy landscape at temperatures
ranging from 0.1 to 1.0, at intervals of 0.1. The time step, ∆t, was 0.0001 and the interval at which
points on the paths were recorded, τ , was 0.005. Each shooting step was stopped if neither the
initial nor final regions were reached after a time of 1.0. Four independent MSMs were generated
at each temperature, and each MSM consisted of 10,000 attempted shooting moves. In addition,
50 paths were sampled from the initial state for each MSM. All points in either the initial or final
regions were clustered together. For the remaining points, the distance metric chosen was Euclidean
distance and the clustering cutoff for each simulation was σ√5, where σ is the standard deviation
of the normal distribution from which the moves were selected, as defined in Eq. 2.29. Points were
clustered hierarchically with average-link clustering – the distance between two clusters is equal
to the average distance from any member of one cluster to any member of the other cluster. After
clustering, any points that could not reach the final state were deleted.
Analogously, Langevin dynamics was examined by generating MSMs at temperatures ranging
from 0.2 to 1.0, at intervals of 0.1. The time step was 1.0 and the interval at which points on the
paths were recorded was 10.0. Each shooting step was stopped if neither the initial nor final regions
were reached after a time of 10,000. Five MSMs were generated at each temperature, and each
MSM consisted of 10,000 attempted shooting moves. In addition, 50 paths were sampled from the
initial state for each MSM. Clustering was as above with the clustering cutoff as σ√1.5, where σ is
the standard deviation of the normal distribution from which the random component of each move
was selected, as defined in Eq. 2.14.
P_fold comparison
For one MC and one Langevin MSM at each temperature, the P_fold values were calculated for every
node. Since it would be too time consuming to compute all P_fold values from many direct simulations, about 25-30 nodes were chosen at random from each MSM for comparison. 10,000 MC or
Langevin simulations were started at the given temperature from each of these coordinates to compute its P_fold value directly. Figure 2.5 shows the P_fold values calculated by many direct simulations
compared to those calculated from the MSMs for both simulation types.
The correlation coefficient between the direct MC values and MSM values is 0.989 over all
temperature values. For each individual MSM at a given temperature, the correlation coefficient
ranges from 0.986 to 0.994. The correlation coefficient between the direct Langevin values and
MSM values is 0.990 over all temperatures. The correlation coefficient ranges from 0.986 to 0.999
for each MSM at a given temperature. This shows excellent agreement of P_fold values over a wide
range of temperatures for both simulation types.
MFPT comparison
In addition to being able to calculate P_fold values at every node, the use of simulation data allows
us to estimate the transition times between nodes, and therefore to estimate the MFPT from the
initial state. We can compare the MFPTs calculated from the MSMs with the MFPTs we calculated
Figure 2.5: The correlation between P_fold values calculated directly from many simulations and
MSM simulations on the model energy landscape. The left graph shows the comparison for Monte
Carlo simulations and the right one shows the same comparison for Langevin simulations.
directly from many MC or Langevin simulations (Fig. 2.6).
The MFPTs calculated from the MSMs agree well with those calculated from direct simulations,
although the variance among MSM simulations is greater for all temperatures in the MC simulations
and for high temperatures in the Langevin simulations.
MFPT from reweighting of edges
We also tested how well our formulation for the reweighting of MSM edges based on temperature
was able to predict MFPTs at the new temperatures. For both MC and Langevin dynamics, five
additional MSMs were generated at temperatures of 0.2, 0.6, and 1.0. The edges on these MSMs
were reweighted to give MSMs at temperatures of 0.2 to 1.0 at an interval of 0.1. The MFPTs
calculated from these reweighted MSMs were then compared with those from the direct simulations
(Fig. 2.7).
For the MC simulations, the MFPT calculated from the reweighted MSMs generated at all three
temperatures agrees reasonably well over the entire temperature range. The MSMs generated at a
temperature of 0.2 show a systematic overestimation of the MFPT when reweighted to high temperatures. For the Langevin simulations, the MFPT calculated from the reweighted MSMs generated at
Figure 2.6: The comparison between the MFPT calculated directly from many simulations and from
the MSM simulations as a function of temperature. The left graph shows the result for Monte Carlo
simulations and the right one shows the result for Langevin simulations.
temperatures of 0.6 and 1.0 also agreed well over the entire temperature range. However, the MFPT
calculated from the MSMs generated at a temperature of 0.2 greatly overestimated the MFPT for
higher temperatures.
When generating a MSM at lower temperatures, we may not be sampling the transitions relevant
at the higher temperatures. We examined this possibility by looking at the shortest possible path between the initial and final regions in the MSMs generated at different temperatures. For the Monte
Carlo simulations, the shortest paths in the MSMs generated at temperatures of 0.6 and 0.2 were 40%
and 80% longer, respectively, than the shortest path in the MSM generated at a temperature of 1.0. For
the Langevin simulations, the shortest paths in the MSMs generated at temperatures of 0.6 and 0.2 were
90% and 300% longer, respectively, than that in the MSM generated at a temperature of 1.0. These
differences may account for the low-temperature MSMs' inability to scale, since the faster transitions
are never sampled.
To estimate the error in the reweighted MSMs compared to the direct simulations and to MSMs
generated individually at each temperature, we compare the MFPT standard deviations at different
temperatures for the various methods (Fig. 2.8). Specifically, we examine the standard deviation
relative to the average MFPT at each temperature calculated from the five direct simulations, from
the four MSMs generated individually at that temperature, and from the five reweighted MSMs
generated at temperatures of 0.2, 0.6, and 1.0 and reweighted to that temperature.
Figure 2.7: The comparison between the MFPT calculated from many simulations to the MFPT
calculated from reweighted versions of a single MSM as a function of temperature. The square,
diamond, and circle points represent the MSMs generated at temperatures of 0.2, 0.6, and 1.0 respectively. The cross points are from the direct MFPT calculations. The left graph shows the results
for Monte Carlo simulations and the right graph shows the results for Langevin simulations.
For both the MC and Langevin simulations, the reweighted MSMs have essentially the same
error as the MSMs generated individually at each temperature except for the Langevin MSM generated at a temperature of 0.2. At low temperatures, the MFPT calculated directly from simulations
has a percent standard deviation which is approximately half that of the MSMs. However, at higher
temperatures, the error of the MFPT calculated directly from simulations is only slightly lower than
that of the MSMs for MC simulations, and is higher for Langevin simulations.
2.3.2 Trpzip2 kinetics
In addition to the model energy landscape, we applied our methods above to a small protein, the
12-residue tryptophan zipper β-hairpin, trpzip2 [CSS01]. Trpzip2 has previously been simulated
on Folding@Home [SP00]. Our goal here is to use these trajectories from Folding@Home to
build a MSM to further study the folding of trpzip2. This is a much more challenging test of
our methods than the simple two-dimensional example above. If successful, we propose that MSM-based methods would allow one to extend the Folding@Home distributed computing methods to
examine the folding of slower and more complex proteins. Indeed, MSM methods combined with
Figure 2.8: Error analysis of direct simulations and the various MSM techniques. The graphs show
the standard deviation to the average direct simulation value at each temperature divided by that
average value. The left graph shows the results for Monte Carlo simulations and the right graph
shows the results for Langevin simulations.
Folding@Home-based sampling may also be able to tackle some fundamental issues in the simulation of protein folding, especially that of proteins with non-single-exponential folding kinetics
[Fer02]. The results for trpzip2 presented here are intended as a “proof of concept” application of
our methods to fully atomistic simulation.
Using Folding@Home [SP00], trpzip2 folding has been simulated using the OPLS-AA all-atom
parameter set [JMTR96] and the generalized Born/surface area implicit solvent model [QSHS97]
at a temperature of 296 K. Trajectories were started from an extended conformation and ranged in
length from 10 nanoseconds to 450 nanoseconds. To define the initial and final regions, we used
a combination of alpha carbon root mean square distance (RMSD) to the native state (pdb code
1LE1 [CSS01]) and hydrogen bond and trp-trp distances. In particular, we track the following set
of interatomic distances, where the number indicates the residue, n indicates the backbone amide
nitrogen, o indicates the backbone carbonyl, and w indicates the CD2 side chain atom:
d1 = (n3 − o10) + (o3 − n10) + (n5 − o8) + (o5 − n8) + (w2 − w11) + (w2 − w9) + (w9 − w4)
d2 = d1 + (n1 − o12) + (o1 − n12).   (2.30)
The initial region was defined as any configuration that had (RMSD ≥ 2.5 or RMSD + 0.125d1 ≥
7.75) and (RMSD + 0.125d1 ≥ 9.5). The final region was defined as any configuration that had
(RMSD < 2.5) and (RMSD + 0.125d1 < 7.75) and (d2 < 45). These cutoffs follow Snow et al.,
except for the dependence on d2, which was added to discriminate between native states and a set
of frayed native-like states. For more on the simulation details, see Snow et al. [SQD+ 04].
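These threshold definitions can be written down directly as predicates. The sketch below is illustrative only: the function names and the packaging of the RMSD (in Å), d1, and d2 as plain floats are our own, not taken from the analysis code of Snow et al.

```python
def in_initial_region(rmsd, d1):
    # Unfolded (initial) region test, using the cutoffs quoted above.
    return (rmsd >= 2.5 or rmsd + 0.125 * d1 >= 7.75) and (rmsd + 0.125 * d1 >= 9.5)

def in_final_region(rmsd, d1, d2):
    # Folded (final) region test; the d2 cutoff excludes frayed native-like states.
    return rmsd < 2.5 and rmsd + 0.125 * d1 < 7.75 and d2 < 45

folded = in_final_region(rmsd=1.0, d1=30.0, d2=40.0)    # compact, low-RMSD conformation
unfolded = in_initial_region(rmsd=8.0, d1=20.0)         # extended, high-RMSD conformation
```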
To generate the MSM, we chose a tenth of this data set at random, resulting in 1,750 independent
trajectories. Of these trajectories, 14 reached the final folded state. Frames from the non-folding
trajectories were selected every 10 ns and frames from the folding trajectories were selected every
250 ps. This was done so that there would be more representative conformations in the transition and
final states, while still allowing the number of nodes to stay manageable. As discussed in the MSM
generation section (Sec. 2.2.3), because the edges contain the time taken to traverse them, multi-resolution data can be accommodated. This selection of data resulted in a total of approximately
22,400 nodes.
The distance metric for clustering was defined as the root mean square deviation of the inter-heavy-atom distance matrix for two conformations. The clustering was performed hierarchically
using the average-link distance as the distance between two clusters. After clustering, nodes which
could not reach the final state were merged into their nearest neighbor.
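As a rough sketch of this agglomerative step, the following implements average-link clustering on toy one-dimensional data; the helper name, data, and cutoff are illustrative, and the real metric is the RMSD between inter-heavy-atom distance matrices rather than a scalar difference.

```python
def average_link_cluster(points, dist, cutoff):
    # Repeatedly merge the pair of clusters with the smallest average-link
    # distance until that distance exceeds the cutoff.
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average-link distance: mean over all cross-cluster pairs.
                d = sum(dist(a, b) for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

clusters = average_link_cluster([0.0, 0.1, 0.2, 5.0, 5.1],
                                lambda a, b: abs(a - b), cutoff=1.0)
```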
The usefulness of an MSM depends upon the type of ensemble used for its construction (similar to the concept of a basis set). Here, the underlying ensemble is fairly one-sided, has relatively few
transitions, and is not at equilibrium. Specifically, very few trajectories reached the final state and
even fewer unfolded after having folded, so the set of transitions to the initial state was not very
well sampled. Accordingly, any Pfold values calculated would have been biased, since the Pfold
measures the percentage of trajectories that reach the final state before reaching the initial state. On
the other hand, a good estimate of the MFPT was possible since it measures the average time taken
for a particle to reach the final state having started in the initial state, which is exactly what our data
set represents.
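For a plain discrete-time MSM with a uniform step time, the MFPT to an absorbing final state reduces to a linear solve. This is a generic sketch, not the edge-weighted graph computation used in this chapter: the 3-state row-stochastic matrix and the `mfpt` helper are illustrative.

```python
def mfpt(T, final, dt=1.0):
    # Row-stochastic T (T[i][j] = probability i -> j). With `final` absorbing,
    # the MFPTs satisfy (I - Q) m = dt * 1 over the remaining states, solved
    # here by Gaussian elimination with partial pivoting.
    n = len(T)
    states = [i for i in range(n) if i != final]
    A = [[(1.0 if si == sj else 0.0) - T[si][sj] for sj in states] for si in states]
    b = [dt] * len(states)
    for c in range(len(states)):
        p = max(range(c, len(states)), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, len(states)):
            f = A[r][c] / A[c][c]
            for k in range(c, len(states)):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    m = [0.0] * len(states)
    for r in reversed(range(len(states))):
        m[r] = (b[r] - sum(A[r][k] * m[k] for k in range(r + 1, len(states)))) / A[r][r]
    out = [0.0] * n
    for idx, si in enumerate(states):
        out[si] = m[idx]
    return out

T = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.0, 1.0]]   # state 2 is the absorbing "final" region
times = mfpt(T, final=2)
```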
We examined a wide range of clustering cutoffs for both the initial and transitional regions to
estimate the MFPT. To compare the effects of the clustering cutoff, we also performed a similar
experiment on the model energy landscape. We ran 500 MC simulations started from the initial
state at a temperature of 0.6 with a time step of 0.0001, total time of 0.1, and recording points every
0.005. Of these trajectories, 135 reached the final state. Again, we varied the clustering cutoffs in
the initial and transitional regions (Fig. 2.9).
Figure 2.9: The effect of clustering cutoff on the calculated MFPT for the model system and trpzip2 peptide. One axis shows the cutoff in the initial region and the other shows the cutoff in the transitional region. The vertical axis shows the MFPT at each point. The graph on the left is from the model energy landscape and the graph on the right is for the trpzip2 protein.

Holding the transitional region cutoff constant, increasing the initial clustering cutoff causes
no change in the MFPT in the model system. For the trpzip2 data, increasing the initial region
clustering cutoff causes the MFPT to increase until a cutoff of 2 Å and then plateau. One reason
why the trpzip2 data shows an initial increase in MFPT that is not seen in the model system data is that the trpzip2 conformations are in a much higher-dimensional space. If we sample equally
in these two spaces, we expect the points in the higher dimensional space to be farther apart. After
clustering the two-dimensional data to 0.0005, there are 87% fewer nodes. In comparison, after clustering the initial region of the trpzip2 data to 2 Å, there are only 76% fewer nodes. There is an
increase in the model system data, but only when the transitional cutoff is zero.
Holding the initial region cutoff constant and increasing the transitional region cutoff causes the
MFPT to decrease in both systems and for all values of the initial cutoff. In the transitional region,
we expect that the molecule will go through a series of sequential points on the way to the final
state. The rationale behind clustering is that we can merge together points which are similar by
some metric, thus assuming that transitions into or from one of the points is equally likely to go to
or come from the other point. This does not apply for points which are sequential along a pathway.
Therefore, merging these points causes the MFPT to decrease since we are essentially shortening
the length of the path. Because we do not expect any sequential patterns within states in the initial
region, increasing the clustering cutoff within this area does not have the same decreasing effect on the MFPT as clustering in the transitional region.
Over reasonable ranges of cutoffs for the initial (>2 Å) and transitional (1–2.5 Å) regions, we can estimate the MFPT as 2–9 microseconds. This estimate agrees well with experimental results
of 1.8 ± 0.01µs from fluorescence and 2.47 ± 0.05µs from IR [SQD+ 04]. This estimate also agrees
well with previous analysis of this simulation data of 8µs by fitting the rate directly for the full data
set and 4.5µs for the random one-tenth sample used in the MSM analysis.
2.4 Discussion and conclusions
We have introduced new computational tools for efficiently analyzing the data collected from Monte
Carlo or molecular dynamics simulations. These methods capture the probabilistic and time-dependent nature of the kinetics in a compact Markovian state model representation, which can easily be analyzed for properties such as the Pfold and MFPT of every node.
The Pfold values calculated for a wide range of temperatures for both Monte Carlo and Langevin simulation types show excellent agreement with those calculated by brute force. One area for improvement is that while the MSM values give the same average MFPT as the MFPT
calculated from direct simulations, the variance is much higher. This is probably because the transition path sampling simulations used in building the MSM are somewhat dependent on the initial
path chosen. The only way in which edges can lead from the initial state is either from the initial path or from the sampling of the initial state. Edges resulting from the shooting algorithm all
lead into the initial state. One could fix this problem by having many initial paths. However, in a
protein example, folding trajectories may be very difficult to generate beforehand. Another way to
fix this problem and achieve more precise results would be to include shooting paths which move
backwards in time, thus allowing for more edges leading from the initial state. Monte Carlo and
Langevin dynamics cannot be run backward in time because the velocity is not maintained between
steps. If one were to use some other molecular dynamics simulation system that maintained velocity,
then the trajectories may be run backwards in time and this problem could be averted.
In addition to the algorithms necessary to create the MSM from simulation data, we have also
described methods for reweighting the edges of the MSM to analyze the system at different parameter values. In particular, we provided transformations for the edge weights at different temperatures, given that the simulations were from either MC or Langevin dynamics. These methods show
promise since we can analyze the system at many different parameter values without the need for
any additional simulations. The reweighting of the MSMs seemed to work well in general and gave
results with similar errors as MSMs generated individually at each temperature. The one exception
was the Langevin MSMs generated at a temperature of 0.2. These MSMs did not give accurate
results when rescaled for temperatures greater than 0.3. One reason for this may be that at the low
temperature, the system did not have enough samples of the relevant transitions to scale to higher
temperatures. It may be necessary to generate composite MSMs consisting of data from many
different temperatures in order to properly rescale to a wider range of temperatures.
This method was then applied to existing folding simulation data from a small β-hairpin protein.
We were able to calculate folding rates which were in reasonable agreement with experimental data
and previous analysis of the simulation data. The majority of time in these simulations is spent
in the clustering step. Currently, we compute the full n² distance matrix between all nodes in the
MSM, a very expensive calculation. Better clustering algorithms can reduce this computation time.
Chapter 3
Automatic state decomposition
In the previous chapter, we described a method for modeling the conformational dynamics of biological macromolecules over long time scales – a discrete-state Markovian state model, which can
be built from short molecular dynamics simulations. To construct useful models that faithfully represent dynamics at the time scales of interest, it is necessary to decompose configuration space into
a set of kinetically metastable states. Previous attempts to define these states have relied upon either
prior knowledge of the slow degrees of freedom or on the application of conformational clustering
techniques, as in Chapter 2, which assume that conformationally distinct clusters are also kinetically
distinct. In this chapter, we present a first version of an automatic algorithm for the discovery of
kinetically metastable states that is generally applicable to solvated macromolecules. Given molecular dynamics trajectories initiated from a well-defined starting distribution, the algorithm discovers
long-lived, kinetically metastable states through successive iterations of partitioning and aggregating conformation space into kinetically related regions. We apply this method to three peptides in
explicit solvent — terminally blocked alanine, the 21-residue helical Fs peptide, and the engineered
12-residue β-hairpin trpzip2 — to assess its ability to generate physically meaningful states and
faithful kinetic models.
3.1 Introduction
Many biomolecular processes are fundamentally dynamic in nature. Protein folding, for example,
involves the ordering of a polypeptide chain into a particular topology over the course of microseconds to seconds, a process which can go awry and lead to misfolding or aggregation, causing disease
[Dob03]. Enzymatic catalysis may involve transitions between multiple conformational substates,
only some of which may allow substrate access or catalysis [EBAK02, YR06, BMDW06]. Post-translational modification events, ligand binding, or catalytic events may alter the transition kinetics
among multiple conformational states by modulating catalytic function, allowing work to be performed, or transducing a signal through allosteric change [FMA+ 01, CE05, MMGD06]. A purely
static description of these processes is insufficient for mechanistic understanding — the dynamical
nature of these events must be accounted for as well.
Unfortunately, these processes may involve molecular time scales of microseconds or longer,
placing them well outside the range of typical detailed atomistic simulations employing explicit
models of solvent. However, due to the presence of many energetic barriers on the order of the thermal energy, the uncertainty in initial microscopic conditions, and the stochasticity introduced into
the system by the surrounding solvent in contact with a heat bath, any suitable description of conformational dynamics must by necessity be statistical in nature. This has motivated the development
of stochastic kinetic models of macromolecular dynamics which might conceivably be constructed
from short dynamics simulations, yet provide a useful and accurate statistical description of dynamical evolution over long times.
Several approaches have been used to construct these models. Transition interface sampling
(TIS) [MvEB04], milestoning [FE04], and methods based on commitment probability distributions
[RP05, BS05] describe dynamics on a one-dimensional reaction coordinate, but can only be applied
if an appropriate reaction coordinate can be identified such that relaxation transverse to this coordinate is fast compared to diffusion along it. Discrete-state, continuous-time master equation models,
characterized by a matrix of phenomenological rate constants describing the rate of interconversion
between states [vK97], can be constructed by identifying local potential energy minima as states and
estimating interstate transition rates by transition state theory [CE90, KB95, BB98, LJB01, MW01,
MEW02, EW04]. Unfortunately, the number of minima, and hence the number of states, grows
exponentially with system size, making the procedure prohibitively expensive for larger proteins
or systems containing explicit solvent molecules. Others have suggested that stochastic models
of dynamics can be constructed by expansion of the appropriate dynamical operator in a basis set
[Sha96, US98, SF03], but this approach appears to be limited by the great difficulty of choosing
rapidly-convergent basis sets for large molecules, a process that is not fundamentally different from
identifying the slow degrees of freedom.
Instead, much work has focused on the construction of discrete- or continuous-time Markov
models to describe dynamics among a small number of states which may each contain many minima
within large regions of configuration space [GT94, dGDMG01, SPS+ 04b, SSP04, AFGL05, SP05c,
SKH05, SHCT05, SP05a, EPP05b, PP06]. In these models, it is hoped that a separation of time
scales between fast intrastate motion and slow interstate motion allows the statistical dynamics
to be modeled by stochastic transitions among the discrete set of metastable conformational states
governed by first-order kinetics. Consider, for example, the isomerization of butane, which has
three main metastable conformational states (gauche-plus, gauche-minus, and trans). At sufficiently
low temperature, dynamics is dominated by long dwell times within each of these three states,
punctuated by infrequent transitions between them. The slow interstate transition process is well-described by first-order reaction kinetics for observation intervals longer than the fast molecular
relaxation time for intrastate dynamics due to the presence of a separation of time scales [Cha78].
Such a separation of time scales would be a natural consequence of the widely held belief that
the nature of the energy landscape of biomacromolecules is hierarchical [ABB+ 85, BF89, BK97,
LJB01, LJB02]. If the system reaches local equilibrium within the state before attempting to exit,
the probability of transitioning to any other state will be independent of all but the current state.
This allows the process to be modeled with either a discrete-time Markov chain (e.g. Ref. [SSP04])
or a continuous-time master equation model with coarse-grained time (e.g. Ref. [SKH05]). In either
model, processes occurring on time scales faster than the time to reach equilibrium within each state
cannot be resolved.
Markov models embody a concise description of the various kinetic pathways and their relative
likelihood, facilitating comparison with experimental data and providing a powerful tool for mechanistic insight. Once the model is constructed and the time scale for Markovian behavior determined,
it can be used to compute the stochastic temporal evolution of either a single macromolecule or a
population of noninteracting macromolecules, allowing direct comparison of simulated and experimental observables for both single-molecule and ensemble kinetics experiments. In addition, useful
properties difficult to access experimentally, such as state lifetimes [SPS04a], relaxation from experimentally inaccessible prepared states [CSPD06b], mean first passage times [SSP04], the existence
of hidden intermediates [ODB02], and Pfold values or transmission coefficients [LZSP04], can easily be obtained. This allows for both a thorough understanding of mechanism and the generation of
new, experimentally testable hypotheses.
To build such a model, it is necessary to decompose configuration space into an appropriate set
of metastable states. If the low-dimensional manifold containing all the slow degrees of freedom
is known a priori, then this can be partitioned into free energy basins to define the states, such as
by examination of the potential of mean force [SPS+ 04b, SKH05, SP05c, EPP05b, CSPD06b]. In
the absence of this knowledge, others have turned to conformational clustering techniques to identify conformationally distinct regions which may also be kinetically distinct [KTB93, dGDMG01,
SSP04, AFGL05].
In this chapter, we adopt a strategy first suggested for the discovery of metastable states in
molecular systems by researchers at the Konrad-Zuse-Zentrum für Informationstechnik [SFHD99].
The principal idea is this: If configuration space could be decomposed into a large number of
small cells, the probability of transitioning between these cells in a fixed evolution time could be
measured. This probability is a measure of kinetic connectivity among the cells, which allows the
identification of aggregates of these cells that approximate true metastable states [SH02]. Unfortunately, the choice of how to divide configuration space into cells is not straightforward. Suppose one
is analyzing some fixed amount of simulation data. If configuration space is decomposed very finely, the boundaries between metastable states can in principle be well-approximated,
but the estimated cell-to-cell transition probabilities will become statistically unreliable. On the
other hand, if configuration space is decomposed too coarsely, the transition probabilities may be
well-determined, but the boundaries between metastable states cannot be clearly resolved, potentially disrupting or destroying the Markovian behavior of interstate dynamics. An optimal choice
would ultimately require knowledge of the metastable regions in order to determine the best decomposition of space into cells.
We propose an iterative procedure to determine both the choice of cells and their aggregates
to approximate the desired metastable states. We use a conformational clustering method to carve
configuration space into an initial crude set of cells (splitting), and a Monte Carlo simulated annealing procedure to collect metastable collections of cells into states (lumping). This cycle is repeated,
with the splitting procedure now applied individually to each state to generate a new set of cells, and
the lumping procedure applied to the entire set of cells to redefine states until further application
of this procedure leaves the approximations to metastable states unchanged. This procedure allows
state boundaries to be iteratively refined, as regions that mistakenly have been included in one state
can be split off and regrouped with the proper state. Throughout this process, we require that the
cells never become so small that estimation of the relevant transition matrix elements is statistically unreliable. Our proposed method is efficient, of O(N) complexity in the number of stored
configurations, and can easily be parallelized.
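The lumping step of this split-and-lump cycle can be caricatured in a few lines: assign cells to states so as to maximize a metastability score (here, the trace of the lumped column-stochastic transition matrix) by simulated annealing over single-cell reassignments. The 4-cell count matrix, the objective, the annealing schedule, and the final greedy polish are all illustrative choices, not the algorithm of Section 3.3.

```python
import math
import random

def lumped_trace(C, assign, n_states):
    # Trace of the lumped column-stochastic matrix built from cell-level
    # transition counts C[j][i] (observed transitions from cell i to cell j).
    n = len(C)
    stay = [0.0] * n_states
    total = [0.0] * n_states
    for i in range(n):
        for j in range(n):
            total[assign[i]] += C[j][i]
            if assign[j] == assign[i]:
                stay[assign[i]] += C[j][i]
    return sum(stay[s] / total[s] for s in range(n_states) if total[s] > 0)

def lump(C, n_states, steps=2000, t0=1.0, seed=1):
    # Simulated annealing over single-cell reassignment moves, followed by a
    # greedy polish so the result is a local maximum of the metastability.
    rng = random.Random(seed)
    n = len(C)
    assign = [rng.randrange(n_states) for _ in range(n)]
    score = lumped_trace(C, assign, n_states)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9
        cell, new = rng.randrange(n), rng.randrange(n_states)
        old = assign[cell]
        if new == old:
            continue
        assign[cell] = new
        new_score = lumped_trace(C, assign, n_states)
        if new_score >= score or rng.random() < math.exp((new_score - score) / temp):
            score = new_score
        else:
            assign[cell] = old
    improved = True
    while improved:  # accept any single-cell move that still helps
        improved = False
        for cell in range(n):
            for s in range(n_states):
                old = assign[cell]
                if s == old:
                    continue
                assign[cell] = s
                new_score = lumped_trace(C, assign, n_states)
                if new_score > score + 1e-12:
                    score, improved = new_score, True
                else:
                    assign[cell] = old
    return assign, score

# Four cells: 0 and 1 interconvert rapidly, 2 and 3 interconvert rapidly,
# with rare 1 <-> 2 hops; the metastable lumping is {0,1} and {2,3}.
C = [[90, 10, 0, 0],
     [10, 89, 1, 0],
     [0, 1, 89, 10],
     [0, 0, 10, 90]]
assign, score = lump(C, n_states=2)
```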
This chapter is organized as follows: In Section 3.2, we give an overview of the Markov chain
model and its construction, elaborate on desirable properties of an algorithm to partition configuration space into states, and outline the principles underlying the algorithm we present here. In
Section 3.3, we provide a detailed description of the automatic state decomposition algorithm and
its implementation. In Section 3.4, we apply this algorithm to three model peptide systems in explicit solvent to assess its performance: alanine dipeptide, the 21-residue Fs helix-forming peptide,
and the 12-residue engineered trpzip2 hairpin. Finally, in Section 3.5, we discuss the advantages
and shortcomings of our algorithm, with the hope that future state decomposition algorithms can
address the remaining challenges.
3.2 Theory
Some discussion of the stochastic model of kinetics considered here and the theory underlying
the method is appropriate before describing the algorithmic implementation. The actual implementation of the algorithm used here is described in detail in Section 3.3.
3.2.1 Markov chain and master equation models of conformational dynamics
Consider the dynamics of a macromolecule immersed in solvent, where the solvent is at equilibrium
at some particular temperature of interest. We presume that all of configuration space has already
been decomposed into a set of nonoverlapping regions, or states, which together form a complete
decomposition of configuration space. The method by which these states are identified is described
in subsequent sections.
If we observe the evolution of this system at times t = 0, τ, 2τ, . . ., where τ denotes the observation interval, we can represent this sequence of observations in terms of the state the system visits
at each of these discrete times. The sequence of states produced is a realization of a discrete-time
stochastic process. For this process to be described by a Markov chain, it must satisfy the Markov
property, whereby the probability of observing the system in any state in the sequence is independent of all but the previous state. For a stationary process on a finite set of L states, this process can
be completely characterized by an L × L transition matrix[1] T(τ) dependent only on the observation
interval, or lag time, τ . The element Tji (τ ) denotes the probability of observing the system in state
j at time t given that it was previously in state i at time t−τ . If this process satisfies detailed balance
(which we will assume to be the case for physical systems of the sort we consider here [vK97]) we
[1] In this chapter, we adopt the notation for a column-stochastic transition matrix, in which the columns sum to unity. This differs from the notation in some previously cited references and the other chapters, which use a row-stochastic transition matrix, equal to the transpose of the column-stochastic matrix used here.
additionally have the requirement
Tji peq,i = Tij peq,j,   (3.1)

where peq,i denotes the equilibrium probability of state i.
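Eq. 3.1 is straightforward to check numerically for a candidate transition matrix; the two-state column-stochastic matrix below, with equilibrium distribution (2/3, 1/3), is an illustrative example.

```python
def satisfies_detailed_balance(T, p_eq, tol=1e-12):
    # T is column-stochastic: T[j][i] is the probability of moving from i to j.
    # Check T_ji * p_i == T_ij * p_j for every pair of states (Eq. 3.1).
    n = len(p_eq)
    return all(abs(T[j][i] * p_eq[i] - T[i][j] * p_eq[j]) < tol
               for i in range(n) for j in range(n))

T = [[0.9, 0.2],
     [0.1, 0.8]]
p_eq = [2 / 3, 1 / 3]
ok = satisfies_detailed_balance(T, p_eq, tol=1e-9)
```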
The vector of probabilities of occupying any of the L states at time t (here also referred to as
the vector of state populations, such as in an experiment involving a population of noninteracting
macromolecules) can be written as p(t). If the initial probability vector is given by p(0), we can
write the probability vector at some later time t = nτ as
p(nτ) = T(nτ)p(0) = [T(τ)]^n p(0).   (3.2)
This is a form of the Chapman-Kolmogorov equation.
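Eq. 3.2 can be exercised directly by repeated application of a column-stochastic matrix to a population vector; the two-state matrix here is illustrative.

```python
def propagate(T, p0, n):
    # Apply the column-stochastic transition matrix n times: p(n*tau) = T^n p(0).
    p = list(p0)
    for _ in range(n):
        p = [sum(T[j][i] * p[i] for i in range(len(p))) for j in range(len(p))]
    return p

T = [[0.9, 0.2],
     [0.1, 0.8]]
p_long = propagate(T, [1.0, 0.0], 200)   # relaxes toward equilibrium (2/3, 1/3)
```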
Alternatively, the process can be characterized in continuous time by a matrix of phenomenological rate constants K, where the element Kji, j ≠ i, denotes the nonnegative phenomenological rate from state i to state j. The diagonal elements are determined by Kii = −Σ_{j≠i} Kji to ensure the columns sum to zero so as to conserve probability mass. Time evolution is then governed by the equation

ṗ(t) = Kp(t),   (3.3)
where the dot represents differentiation with respect to time. This evolution equation has formal
solution
p(t) = e^(Kt) p(0),   (3.4)
where the exponential denotes the formal matrix exponential. Eq. 3.3 is often referred to as a master
equation [vK97, OSW77] describing evolution among a discrete set of states in continuous time.
It is important to note that, despite the fact that p(t) is formally defined for all times t, we do not
expect Eq. 3.4 to hold for all times t for physical systems of the sort we consider here. In particular,
for states of finite extent in configuration space, there exists a corresponding limit for the time
resolution for which dynamics will appear Markovian; processes that occur on time scales shorter
than this will be incorrectly described by the master equation.
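As a sketch of Eq. 3.4, the matrix exponential acting on a population vector can be evaluated with a scaled-and-squared Taylor series; the two-state rate matrix, whose columns sum to zero as required, is illustrative.

```python
def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def expm_times_vec(K, t, p0, terms=30, squarings=10):
    # Evaluate exp(K t) p0 by applying exp(K t / 2**squarings) repeatedly,
    # with each small step computed by a truncated Taylor series on a vector.
    s = 2 ** squarings
    def small_step(v):
        out = list(v)
        term = list(v)
        for k in range(1, terms):
            term = [x * (t / s) / k for x in mat_vec(K, term)]
            out = [a + b for a, b in zip(out, term)]
        return out
    p = list(p0)
    for _ in range(s):
        p = small_step(p)
    return p

K = [[-1.0, 2.0],
     [1.0, -2.0]]   # columns sum to zero; equilibrium is (2/3, 1/3)
p_t = expm_times_vec(K, 5.0, [1.0, 0.0])
```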
There is an obvious relationship between the transition matrix T(τ ) and the rate matrix K
evident from comparison of Eqs. 3.2 and 3.4:
T(τ) = e^(Kτ).   (3.5)
If the process can be described by a continuous-time Markov process at all times, then this process
can be equivalently described at discrete time intervals by the corresponding transition matrix. The
converse may not always be true due to sampling errors in T(τ ), though methods exist to recover
rate matrices K consistent with the observed data and the requirements of detailed balance and
nonnegative rates [GT94, SKH05].
The transition and rate matrices have eigenvalues µk(τ) and λk, respectively, and share corresponding right eigenvectors uk. The detailed balance requirement additionally ensures that all eigenvalues are real, and we here presume them to be sorted in descending order. µk(τ) and λk are related by

µk(τ) = e^(λk τ).   (3.6)

The eigenvalues each imply a time scale

τk = −λk^(−1) = −τ [ln µk(τ)]^(−1),   (3.7)

and the associated eigenvector gives information about the aggregate conformational transitions that are associated with this time scale [Sch99, SFHD99, Hui01, SH02]. In particular, the components of uk sum to zero for each k ≥ 2, and the aggregate dynamical mode corresponds to transitions from states with positive eigenvector components to states with negative components, and vice versa, with the degree of participation in the mode governed by the magnitude of the eigenvector component. This property can be useful in identifying metastable states.
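For a two-state model, Eqs. 3.6–3.7 can be evaluated in closed form, since the nontrivial eigenvalue of a 2 × 2 stochastic matrix is the trace minus one; the numbers below are illustrative.

```python
import math

def implied_timescale_2state(T, lag):
    # For a 2x2 stochastic matrix the eigenvalues are 1 and trace - 1.
    mu2 = T[0][0] + T[1][1] - 1.0
    return -lag / math.log(mu2)   # Eq. 3.7, using Eq. 3.6 to invert mu_2 = exp(lambda_2 * tau)

T = [[0.9, 0.2],
     [0.1, 0.8]]
tau2 = implied_timescale_2state(T, lag=1.0)   # mu_2 = 0.7
```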
For the remainder of this chapter, we will refer exclusively to the discrete-time Markov chain
model picture without loss of generality (Eq. 3.2).
3.2.2 Markov model construction from simulation data given a state partitioning
Once a statistical-mechanical ensemble describing equilibrium and a microscopic model describing
dynamical evolution in phase space have been selected, the transition matrix T(τ ) can be estimated
from molecular dynamics simulations. For a system in which dynamical evolution is Newtonian
and, at equilibrium, configurations are distributed according to a canonical distribution at a given
temperature, Swope et al. [SPS04a] show that the transition probability Tji (τ ) can be written as the
following ratio of canonical ensemble averages:
Tji(τ) = ∫ dz(0) e^(−βH(z(0))) χj(z(τ)) χi(z(0)) / ∫ dz(0) e^(−βH(z(0))) χi(z(0))   (3.8)

       = ⟨χj(τ) χi(0)⟩ / ⟨χi⟩,   (3.9)

where z(t) denotes a point in phase space visited by a trajectory at time t, χi(z) denotes the indicator function for state i (which assumes a value of unity if z is in state i, and zero otherwise), β ≡ (kB T)^(−1) the inverse temperature, H(z) the Hamiltonian, and ⟨A⟩ the canonical ensemble expectation of a phase function A(z) at inverse temperature β.
Given a set of simulations initiated from an equilibrium distribution, the expectations in Eq. 3.9
can be computed independently by standard analysis methods [AT91]. Estimation of the correlation
function in the numerator can make use of both the stationarity of an equilibrium distribution (by
considering overlapping intervals of time τ ), and the microscopic reversibility (by considering also
time-reversed versions of the simulations) of Newtonian trajectories. Alternatively, if an equilibrium
distribution within each state can be prepared, one can also directly estimate a column of transition
matrix elements by computing the fraction of trajectories initially at equilibrium within state i that
terminate in state j a time τ later. More elaborate methods based on equilibrium ensembles prepared
within special selection cells that are not coincident with the states [SPS04a, SPS+ 04b] or partition
of unity restraints [Web06] can also be used to compute transition matrix elements efficiently.
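In the common case where the data are long discretized trajectories rather than ensembles prepared within each state, the transition matrix can be estimated by counting transitions over a lag of τ with overlapping windows (exploiting stationarity, as noted above) and normalizing each column. A minimal sketch, with a toy two-state trajectory:

```python
def estimate_transition_matrix(trajs, n_states, lag):
    # counts[j][i] accumulates observed transitions i -> j over `lag` frames,
    # using every (overlapping) window; columns are then normalized so the
    # result is column-stochastic, as in this chapter's convention.
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajs:
        for t in range(len(traj) - lag):
            i, j = traj[t], traj[t + lag]
            counts[j][i] += 1
    T = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        col_total = sum(counts[j][i] for j in range(n_states))
        for j in range(n_states):
            T[j][i] = counts[j][i] / col_total if col_total else 0.0
    return T

trajs = [[0, 0, 1, 1, 0, 0, 0, 1, 1, 1]]
T = estimate_transition_matrix(trajs, n_states=2, lag=1)
```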
3.2.3 Requirements for a useful Markov model
For any given state partitioning, the dynamics of the system will be Markovian on some time scale.
For example, if the lag time τ is so long as to approach the time for the system to relax to an equilibrium distribution from any arbitrary starting distribution, a single application of the transition matrix
T(τ ) produces the invariant equilibrium distribution. However, if this τ exceeds the time scale of
the process of interest, our model is not useful for describing it, and therefore it is advantageous to attempt to find a state decomposition that is Markovian on a shorter time scale in order to extract useful dynamical information about this process. (Equilibrium probabilities can still be extracted from the stationary eigenvector, the one corresponding to an eigenvalue of unity, of such a transition matrix, which may have some utility if one had constructed the transition matrix from trajectories not initiated from distributions at equilibrium globally.)

For a given state i, we will define its internal equilibration time, τint,i, as the characteristic time
one must wait before the system, initially in a configuration within state i, generates a new uncorrelated configuration within the state by dynamical evolution. This internal equilibration time, or
memory time, closely related to the molecular relaxation time scale τmol in Chandler’s reactive flux
formulation of transition state theory [Cha78], depends, of course, on the choice of state decomposition. We can denote the longest of these times over all states by τint . If the lag time is longer than
τint , we will expect the system to have lost memory of its previous location within any state it may
have been in, either remaining within that state or transitioning to a new one, and for dynamics on
this set of states to be independent of history. On the other hand, for lag times shorter than τint , we
cannot guarantee that transition probabilities are independent of history everywhere. This suggests
a way in which the utility of various decompositions can be measured. For a fixed number of states,
the most useful model will partition configuration space to yield the shortest τint , as this model can
be used to study the widest range of dynamical processes.
In addition to producing transition probabilities that are history independent at a relevant lag
time, we impose additional conditions on our states to ensure the resulting model also provides
physical and chemical insight. In order for the states to be defined such that equilibration within a
state is rapid, we desire that the region of configuration space defining each state be connected. A
state composed of two or more unconnected regions of configuration space defies the assumption
that equilibration within the state is much faster than the characteristic time to leave it.
3.2.4 Validation of Markov models
Once a decomposition of configuration space is chosen, we are faced with the task of determining
the observation time interval τ at which dynamics in this state space appears Markovian. Unfortunately, we cannot directly compute the internal state equilibration times, though examination of the
eigenvalues of the transition matrix restricted to a state may give a lower bound on this time in the
absence of statistical uncertainty [MSF05]. The most rigorous test for Markovian behavior would
be a direct check of history independence. The simplest test of this type is to compute second order
transition probabilities and compare them to the appropriate products of the first order transition
probabilities to see if their disagreement is statistically significant. While it is possible to estimate
the second order probabilities from the simulation data, this requires the estimation of three-time
correlation functions, which often possess statistical uncertainties so large as to render them useless
for this kind of test [CSPD06a]. Additionally, this would miss possible yet unlikely higher order
history dependencies.
Information-Theoretic Metric
Another approach, from Park et al. [PP06], uses concepts from information theory to compute
the conditional mutual information conveyed by the second-to-last state, which quantifies the discrepancy between observed second-order transition probabilities and the estimate modeled from
first-order transition probabilities. The result of this analysis is a scalar that quantifies the degree of
history dependence. For a pure first-order Markov process, the mutual information will be zero, as
no additional information is gained by including additional history. While this method also requires
computing three-time correlation functions, which may individually have substantial uncertainties,
the weighted combination of these into a single value reduces the uncertainty in the resulting metric. Unfortunately, there is no rigorous criterion for how small this measure must be in order for the model to be considered acceptably Markovian.
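As an illustration (a minimal sketch of our own, not code from Ref. [PP06]; function and variable names are invented), the conditional mutual information I(s_{t+1}; s_{t-1} | s_t) can be estimated from counts of state triples observed in a discrete state trajectory:

```python
import numpy as np

def conditional_mutual_information(dtraj, n_states):
    """Estimate I(s_{t+1}; s_{t-1} | s_t) from counts of state triples.
    Zero for a first-order Markov chain; positive when the second-to-last
    state carries extra information about the next transition."""
    P = np.zeros((n_states,) * 3)
    for a, b, c in zip(dtraj[:-2], dtraj[1:-1], dtraj[2:]):
        P[a, b, c] += 1.0
    P /= P.sum()
    info = 0.0
    for a in range(n_states):
        for b in range(n_states):
            for c in range(n_states):
                p_abc = P[a, b, c]
                if p_abc == 0.0:
                    continue
                p_b = P[:, b, :].sum()    # p(s_t)
                p_ab = P[a, b, :].sum()   # p(s_{t-1}, s_t)
                p_bc = P[:, b, c].sum()   # p(s_t, s_{t+1})
                info += p_abc * np.log(p_abc * p_b / (p_ab * p_bc))
    return info

markov = conditional_mutual_information([0, 1] * 200, 2)        # period-2 chain: Markovian
hidden = conditional_mutual_information([0, 1, 0, 2] * 100, 3)  # next state depends on history
```

For the periodic two-state chain the metric is zero; for the second trajectory the state visited before each return to state 0 determines the next transition, so the metric is positive.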
Chapman-Kolmogorov
Alternatively, one can raise the transition matrix to a power n (hence summing over the intermediate states) and compare the result with the observed transition probabilities at a lag time of nτ, effectively testing whether the Chapman-Kolmogorov equation (Eq. 3.2) is satisfied; the summation helps to reduce the uncertainty enough to make the test practical. This is equivalent to propagating in time a probability distribution initially confined to each state i and comparing the model evolution with the observed transition probabilities over times much longer than τint.
This serves as a check to ensure that the model is at least consistent with the dataset from which
it was constructed, to within the statistical uncertainty of the transition matrices obtained from the
dataset. This method was employed, for example, in Refs. [SPS04a, CSPD06b], and is used here as
well.
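A minimal sketch of this Chapman-Kolmogorov check (names are illustrative; here T2 plays the role of the directly estimated T(2τ)):

```python
import numpy as np

def ck_deviation(T_tau, T_ntau, n):
    """Largest absolute elementwise deviation between T(tau)^n and the
    transition matrix T(n*tau) estimated directly at the longer lag time."""
    return float(np.max(np.abs(np.linalg.matrix_power(T_tau, n) - T_ntau)))

# Toy two-state chain: for a truly Markovian process, T(2*tau) = T(tau)^2.
T1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
T2 = T1 @ T1          # plays the role of the "observed" T(2*tau)
dev = ck_deviation(T1, T2, 2)
```

In practice the deviation would be compared against the statistical uncertainty of the estimated matrices rather than against zero.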
Implied Timescales
Swope et al. [SPS04a] suggested a number of additional tests for signatures of Markov behavior,
the most sensitive of which appears to be examining the behavior of the implied time scales of the
transition matrix T(τ ), which can be computed from the eigenvalues of the transition matrix by
Eq. 3.7, as a function of increasing lag time τ [CSPD06a]. At sufficiently large τ , the implied
time scales will be independent of τ , implying that exponentiation of the transition matrix is nearly
identical to constructing the transition matrix using longer observation time intervals (Eq. 3.2). The
CHAPTER 3. AUTOMATIC STATE DECOMPOSITION
38
shortest observation time interval for which this holds can be correlated with the internal equilibration time τint , and descriptions of the behavior of the system using that state decomposition should
be Markovian for all lag times τ ≥ τint . This is also a test of whether the Chapman-Kolmogorov
equation holds, but as it computes only L numbers and orders them by time scale, it allows emphasis
to be placed on the longest time scales in the system. Implied time scales were used for all systems
considered here.
Unfortunately, this last method has some drawbacks. First, small uncertainties in the eigenvalues of the transition matrix can induce very large uncertainties in the implied time scales. With
increasing lag time τ , the number of statistically independent observed transitions from which T(τ )
is estimated diminishes, and the statistical uncertainty in the implied time scales τk will grow. Second, while stability of the implied time scales with respect to lag time is a necessary consequence
of history independence, it is not itself sufficient to guarantee history independence, though we may
be unlikely to encounter physical systems for which this is problematic. However, tests on simple
models indicate that the information theoretic metric suggests the emergence of Markovian behavior
on similar lag times to this method, suggesting some degree of fundamental equivalence [PP06].
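Assuming Eq. 3.7 is the standard relation τk = −τ / ln μk(τ) for the eigenvalues μk of T(τ), the implied time scales, and their independence of lag time for an exactly Markovian chain, can be sketched as:

```python
import numpy as np

def implied_timescales(T, tau, n_scales=1):
    """Implied time scales tau_k = -tau / ln(mu_k), computed from the
    sorted real eigenvalues mu_k of T(tau); the stationary eigenvalue
    mu_1 = 1 is skipped."""
    mu = np.sort(np.linalg.eigvals(T).real)[::-1]
    return [-tau / np.log(mu[k]) for k in range(1, n_scales + 1)]

# Two-state chain observed at tau = 1: mu_2 = 0.8, so tau_2 = -1/ln(0.8).
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])
ts = implied_timescales(T, tau=1.0)
ts2 = implied_timescales(T @ T, tau=2.0)  # same chain observed at lag 2*tau
```

Because the chain is exactly Markovian, the implied time scale computed at lag 2τ matches the one at lag τ, which is precisely the plateau behavior the test looks for.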
3.3 The automatic state decomposition algorithm
Based on the theory above, we provide a list of practical considerations for an automatic state decomposition algorithm and then present an algorithm that meets them. The algorithm operates on an
ensemble of molecular dynamics trajectories where conformations have been stored at regular time
intervals. In this work, we apply the method to a set of equilibrium trajectories at the temperature
of interest, but the algorithm can in principle be applied to trajectories generated from biased initial
conditions, provided the unbiased transition probabilities between regions of configuration space
can be computed. We stress that the algorithm presented here is simply a first attempt at a truly
general and automatic algorithm for use with biomacromolecules.
3.3.1 Practical considerations for an automatic state decomposition algorithm
There are several desirable properties that a state decomposition should possess to be both useful
and practical:
1. It is not uncommon for simulations performed on distributed computing platforms such as
Folding@Home [SP00, PBC+ 03], supercomputers such as Blue Gene [FGM+ 03, GFR+ 05],
or even computer clusters to generate datasets that may contain 10^5 to 10^7 configurations in up to 10^4 trajectories, therefore rendering impractical the use of any algorithm with a computational complexity greater than O(N log N) in the number of configurations.
2. We assume configurations lie exclusively in the configuration space of the macromolecule.
We presume decorrelation of momenta and reorganization of the solvent are faster than processes of interest[3].
3. Molecules may have symmetries due to the presence of chemically equivalent atoms such as
in aromatic rings, methyl protons, and the oxygens of carboxylate groups. The state decomposition should be invariant to permutations of these atoms.
4. The state decomposition algorithm should produce a decomposition for which dynamics appears to be Markovian at the shortest possible lag time τ , so as to produce the most useful
model.
5. The resulting model should not include so many states that the elements of the transition matrix become statistically unreliable.
3.3.2 Sketch of the method
A state decomposition algorithm intended to produce the most useful Markov models, as discussed
in Section 3.2.3 above, would generate models that minimize the internal equilibration time τint , the
minimum time for which the model behaves in a Markovian fashion. If states can be constructed
where the time scale for equilibration within each state is much shorter than the time scale for transitions among the states, we would expect interstate dynamics to be well-modeled by a Markov chain
after sufficiently long observation intervals. Unfortunately, τint is difficult to determine directly, so
we are instead forced to identify some surrogate quantity whose maximization will hopefully lead
to improved separation between the time scales for intrastate and interstate transitions. Following
the approach of Ref. [HS05], we define a measure of the metastability Q of a partitioning into L
macrostates as the sum of the self-transition probabilities for a given lag time τ :
Q ≡ Σ_{i=1}^{L} Tii(τ).    (3.10)
[3] We recognize that solvent coordinates may be critical in some phenomena, but dealing with solvent degrees of freedom would also require accounting for the indistinguishability of solvent molecules upon their exchange. We leave this to further versions of the algorithm.
For τ = 0, Q = L; Q decays to unity as τ grows large enough for the self-transition probabilities Tii to reach the equilibrium probabilities of each macrostate. Poor partitionings will result in a
small Q, as trajectories started in some states will rapidly exit; conversely, good partitionings into
strongly metastable states will result in a large Q, as trajectories will remain in each macrostate for
long times. In the absence of statistical uncertainty, Q is bounded from above by the sum of the L
largest eigenvalues of the true dynamical propagator for the system [HS05].
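In code, Q is just the trace of the row-normalized macrostate transition matrix. A minimal sketch, using a simple count estimator in place of the reversible estimator of Eq. 3.9 (names are illustrative):

```python
import numpy as np

def transition_matrix(dtraj, n_states, lag):
    """Row-normalized transition-count estimate of T(lag) from one
    discrete trajectory (ignores stationarity/reversibility for simplicity)."""
    C = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        C[i, j] += 1.0
    return C / C.sum(axis=1, keepdims=True)

def metastability(T):
    """Metastability Q of Eq. 3.10: the sum of self-transition
    probabilities, i.e. the trace of the macrostate transition matrix."""
    return float(np.trace(T))

# Toy trajectory over two macrostates: three 0->0 counts, one 0->1 count,
# and three 1->1 counts, so T = [[0.75, 0.25], [0, 1]] and Q = 1.75.
T = transition_matrix([0, 0, 0, 0, 1, 1, 1, 1], 2, lag=1)
Q = metastability(T)
```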
The goal of our algorithm is to identify a partitioning into L contiguous macrostates that maximizes the metastability Q. While in principle, the boundaries between these macrostates can be
varied directly to optimize Q, in analogy to variational transition state theory [TGK96], a complicated parameterization may be necessary to describe the potentially highly convoluted hypersurfaces
separating the states, and Q may have multiple maxima in these parameters. Instead, we choose an
approach based on splitting the conformation space into a large number of small contiguous microstates and then lumping these microstates into macrostates to maximize the metastability.
This approach is similar to the approach of Schütte and coworkers described in Ref. [SFHD99],
but with a substantial difference. In their work, each degree of freedom of the molecule (such as a
torsion angle) is subdivided independently to produce a multidimensional grid. As the number of
states is exponential in the number of degrees of freedom, this approach quickly becomes intractable
for macromolecules that possess large numbers of degrees of freedom, even if the sparsity of the
transition matrix is taken into account. Instead, we choose to let the data define the low-dimensional
manifold of configuration space accessible to the macromolecule, and we can apply any clustering
algorithm that is O(N log N ) in the number of configurations to decompose the sampled conformation space into a set of K contiguous microstates. This step corresponds to the first split step in
Figure 3.1.
Once the conformation space is divided into K microstates, we lump the microstates together
to produce L < K macrostates with high metastability, Q. This corresponds to the first lump step
in Figure 3.1. The difficulty here is that the uncertainty in the metastability of a partitioning can
be large if any macrostate contains very few configurations. Since a macrostate may consist of a
single microstate, the microstates must be large enough for the self-transition elements to be statistically well-determined. This comes at a price: with large microstates, the procedure may have
difficulty accurately determining the boundaries between macrostates because the resolution of partitioning is limited by the finite extent of the microstates. Additionally, the choice of decomposition
into microstates is arbitrary, whereas we would like the state decomposition algorithm to produce
equivalent sets of macrostates regardless of the quality of the initial partitioning.
Figure 3.1: Flowchart of the automatic state decomposition algorithm: SPLIT (K-medoid partitioning of entire sampled space); LUMP (maximize trace of macrostate transition matrix); then ITERATE: SPLIT (K-medoid partitioning on each macrostate) and LUMP (lump over all microstates to maximize trace). We consider K microstates which are used as the basis to construct L < K macrostates that are the approximations to the true metastable states in the system.
To overcome these difficulties, we iterate the aforementioned procedure. After microstates
are combined into macrostates, each macrostate is again fragmented into a new set of microstates
(the second split step in Figure 3.1). The refined set of all microstates is then lumped to form
refined macrostates (the second lump step in Figure 3.1). In this way, the boundaries between
macrostates are iteratively refined, and regions incorrectly lumped in previous iterations may be split
off and lumped with the correct macrostate in subsequent iterations. At convergence, no shuffling
of conformations between macrostates should occur.
There is unfortunately no unambiguous way to choose the number of states L. If there is a
clean separation of time scales, examination of the eigenvalue spectrum of the microstate transition
matrix may suggest an appropriate value of L [SH02]. In a hierarchical system, there will be many
gaps in the eigenvalue spectrum and many choices of L will lead to good Markovian models of
varying complexity. There is, however, a tradeoff between the number of states and the amount
of data needed to obtain a model with the same statistical precision. It may be necessary to apply
the algorithm repeatedly with different choices of L to produce a model adequate for describing
the time scales of interest. L could even be chosen dynamically at each iteration of the algorithm,
though we did not choose to do so in this version.
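The eigenvalue-gap heuristic for suggesting L can be sketched as follows (a hypothetical helper of our own, not part of the dissertation's algorithm):

```python
import numpy as np

def suggest_num_macrostates(T_micro):
    """Return L at the largest gap in the sorted (real) eigenvalue
    spectrum of the microstate transition matrix, so that L eigenvalues
    sit above the gap (eigenvalues near 1 correspond to slow, metastable
    modes)."""
    mu = np.sort(np.linalg.eigvals(T_micro).real)[::-1]
    gaps = mu[:-1] - mu[1:]
    return int(np.argmax(gaps)) + 1

# Four microstates forming two metastable pairs {0,1} and {2,3}: the
# spectrum is (1.0, 0.98, 0.80, 0.78), with the big gap after the second
# eigenvalue, so L = 2 is suggested.
T = np.array([[0.89, 0.10, 0.01, 0.00],
              [0.10, 0.89, 0.00, 0.01],
              [0.01, 0.00, 0.89, 0.10],
              [0.00, 0.01, 0.10, 0.89]])
L = suggest_num_macrostates(T)
```

In a hierarchical system with several gaps, one would inspect the spectrum directly rather than trust a single argmax.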
3.3.3 Implementation
There are a number of implementation choices to be made in the algorithm given above, and here
we briefly summarize and justify our selections.
Splitting
For the split step, we choose to apply K-medoid clustering [HTF01] for a fixed number of iterations
because of its O(KN ) time complexity (where K can be taken to be constant) and ease of parallelization. Additionally, K-medoid clustering has an advantage over the more popular K-means
clustering [Mac67] in this application, as it does not require averaging over conformations, which
may produce nonsensical constructs when drastically different conformations are included in the
average. Splitting by K-medoid clustering is initiated from a random choice of K unique conformations to function as generators. All conformations are assigned to the microstate identified by
the generator they are closest to by some distance metric (defined below). Next, an attempt is made
to update the generator of each microstate. K members of the microstate, drawn at random, are
evaluated to see if they reduce the intrastate variance of some distance metric from the generator. If
so, the configuration for which the intrastate variance is minimal is assigned as the new generator.
All conformations are then reassigned to the closest generator, and the process of updating the generators is repeated. In standard K-medoid applications, this procedure is iterated to convergence,
but since the purpose of the splitting phase is simply to divide the sampled manifold of configuration
space into contiguous states, ensuring that each state is significantly populated, only five iterations
of this procedure were used.
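A toy version of this split step might look as follows; Euclidean distance on 2-D points stands in for the RMSD metric, the generators are seeded deterministically with the first and last samples rather than a random draw, and a few candidate generators per microstate (including the current one) are tried, so all names and details are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def k_medoid_split(X, K, n_iter=5):
    """K-medoid splitting: assign points to the nearest generator, then
    try a few members of each microstate as replacement generators,
    keeping whichever minimizes the summed squared distance to members
    (a proxy for the intrastate variance criterion in the text)."""
    gen = np.array([X[0], X[-1]]) if K == 2 else X[:K].copy()
    for _ in range(n_iter):
        assign = np.linalg.norm(X[:, None] - gen[None], axis=2).argmin(axis=1)
        for k in range(K):
            members = X[assign == k]
            if len(members) == 0:
                continue
            picks = members[rng.choice(len(members), min(K, len(members)), replace=False)]
            cand = np.vstack([gen[k][None], picks])  # keep current generator as a candidate
            cost = [np.sum(np.linalg.norm(members - c, axis=1) ** 2) for c in cand]
            gen[k] = cand[int(np.argmin(cost))]
    return np.linalg.norm(X[:, None] - gen[None], axis=2).argmin(axis=1)

# Two well-separated clouds are recovered as two contiguous microstates.
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels = k_medoid_split(X, 2)
```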
For the distance metric, we selected the root-mean-square deviation (RMSD), computed after a
minimizing rigid body translation and rotation using the rapid algorithm of Theobald [The05]. In the
first splitting iteration, only Cα atoms were used to compute the RMSD due to the expense of having
to cluster all conformations in the dataset; in subsequent iterations, all heavy atoms (excepting
those indistinguishable by symmetry) were used, as well as sidechain polar hydrogens. This metric
was chosen because it possesses all the qualities of a proper distance metric [Ste02], accounts for
both local similarities between pairs of conformations as well as global ones, and runs in time
proportional to the number of atoms, as opposed to a metric such as distance matrix error (DME
or dRMSD), which scales as the square of the number of atoms. In molecules with additional
symmetry, the distance metric can be adjusted accordingly. Our choice of distance metric is not the
only one that would suffice; any distance metric which can distinguish between kinetically distinct
conformations is sufficient for this algorithm. In contrast, using something like backbone RMSD
throughout the process may be a poor distance metric since it would ignore potentially relevant
sidechain kinetics.
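For illustration, the minimum RMSD after optimal superposition can be computed with the classic SVD-based Kabsch construction (a sketch of ours; the dissertation uses Theobald's faster quaternion-based method [The05]):

```python
import numpy as np

def rmsd(P, Q):
    """Minimum RMSD between two conformations (N x 3 coordinate arrays)
    after removing the optimal rigid-body translation and rotation,
    via the Kabsch algorithm."""
    P = P - P.mean(axis=0)            # remove translation
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # optimal proper rotation
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))

# A rigidly rotated and translated copy has RMSD of essentially zero.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])    # 90 degree rotation about z
Q = P @ Rz.T + np.array([5.0, -3.0, 2.0])
```

This runs in time linear in the number of atoms, as the text requires, though Theobald's method avoids the explicit SVD.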
Lumping
Lumping to L states so as to maximize the metastability Q of the macrostates proceeds in two
stages. In the first stage, information on the metastable state structure contained in the eigenvectors
associated with the slowest time scales [Sch99, Hui01, DHFS00, SH02] is used to construct an
initial guess at the optimal lumping. Because the eigenvectors contain statistical noise, this guess may not actually be optimal, so we include a second stage that uses a Monte Carlo simulated annealing
(MCSA) optimization algorithm to further improve the metastability. Though the MCSA algorithm
could in principle be used without the first stage to find optimal lumpings, we find its convergence
is greatly accelerated by use of the initial guess. Ensuring connectivity during the lumping stage
would be difficult due to the need to enumerate neighbors in configuration space, but in practice, we
find this unnecessary.
In the first stage, a transition matrix among microstates is computed (using Eq. 3.9) taking
advantage of both stationarity and time-reversibility, for a short lag time τ , typically the shortest interval at which configurations were stored. Motivated by the Perron cluster cluster analysis (PCCA)
algorithm of Deuflhard et al. [DHFS00], an initial guess for the optimal lumping of microstates
to macrostates is generated using the left eigenvectors[4] associated with the largest eigenvalues of
the microstate transition matrix. We begin by assigning all microstates to a single macrostate. For
each eigenvalue, the corresponding eigenvector contains information about an aggregate transition
between the set of microstates with positive eigenvector components and the set with negative components, with a time scale determined by the eigenvalue. Equilibration within each set must occur
on a faster time scale, provided the eigenvalues are non-degenerate. We can therefore use this information to identify one macrostate to divide in two. We select the macrostate with the largest
L1 norm of eigenvector components (restricted to microstates belonging to the macrostate), after
subtracting the mean of these components. In Ref. [DHFS00], the sign structure alone was used to
split these sets, but since we restrict the splitting to a single macrostate, we split about the mean, so
that microstates with eigenvector components above the mean become one macrostate and the rest
go into another. This procedure is performed for eigenvectors k = 2, . . . , L in order, which should
correspond to the slowest processes in the system, generating a total of L macrostates.
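A toy version of a single eigenvector-based split (a hedged sketch; the full procedure applies this for k = 2, …, L, each time restricted to the components of one selected macrostate):

```python
import numpy as np

def split_by_second_eigenvector(T):
    """Split microstates into two macrostates using the left eigenvector
    of the second-largest eigenvalue of T: components above the mean go
    to one macrostate, the rest to the other."""
    w, vl = np.linalg.eig(T.T)   # columns of vl are left eigenvectors of T
    v2 = vl[:, np.argsort(w.real)[::-1][1]].real
    return (v2 > v2.mean()).astype(int)

# Four microstates forming two metastable pairs {0,1} and {2,3}.
T = np.array([[0.89, 0.10, 0.01, 0.00],
              [0.10, 0.89, 0.00, 0.01],
              [0.01, 0.00, 0.89, 0.10],
              [0.00, 0.01, 0.10, 0.89]])
macro = split_by_second_eigenvector(T)
```

The second eigenvector has one sign on {0, 1} and the opposite sign on {2, 3}, so the mean-split recovers the metastable pairs regardless of the eigenvector's arbitrary overall sign.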
Due to statistical noise in the eigenvectors and near-degeneracy in the eigenvalues, this procedure does not always result in the lumping with the maximal metastability Q. Therefore, in the
second stage, the metastability was maximized using a Monte Carlo simulated annealing (MCSA)
algorithm, using the eigenvector-generated lumping as an initial seed. In each step of the Monte
Carlo procedure, a microstate was selected with uniform probability and assigned to a random
macrostate. If this proposed move would leave a macrostate empty or did not change the partitioning, it was rejected immediately. The proposed partitioning was accepted with probability
min{1, e^{β∆Q}}. The effective inverse temperature parameter β was set equal to the step number, and the MCSA procedure was run for 20,000 steps. Twenty independent MCSA runs were initiated
from the initial eigenvector-based partitioning, and the partitioning with the highest metastability
sampled in any run was selected to define the lumping into macrostates. No attempt was made to
optimize the annealing schedule. It should be noted that the metastability Q is not the only surrogate
that could be optimized in order to produce a useful state decomposition[5].
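The second stage might be sketched as below. This toy version evaluates Q by aggregating microstate transition counts under a candidate lumping (row-normalizing, then taking the trace) and anneals with β equal to the step number, as in the text; for brevity it starts from a near-random lumping rather than the eigenvector seed, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def lumped_metastability(C, lump, L):
    """Q of a candidate lumping: aggregate microstate transition counts
    into macrostates, row-normalize, and sum the diagonal (Eq. 3.10)."""
    M = np.zeros((L, L))
    K = len(C)
    for i in range(K):
        for j in range(K):
            M[lump[i], lump[j]] += C[i, j]
    return float(np.trace(M / M.sum(axis=1, keepdims=True)))

def mcsa_lump(C, L, n_steps=2000):
    """MCSA over lumpings: move one microstate to a random macrostate;
    reject moves that empty a macrostate or change nothing; accept with
    probability min{1, exp(beta * dQ)}, with beta equal to the step number."""
    K = len(C)
    lump = np.concatenate([np.arange(L), rng.integers(0, L, K - L)])
    Q = lumped_metastability(C, lump, L)
    for step in range(1, n_steps + 1):
        i = int(rng.integers(K))
        new = int(rng.integers(L))
        if new == lump[i] or np.sum(lump == lump[i]) == 1:
            continue
        trial = lump.copy()
        trial[i] = new
        Q_trial = lumped_metastability(C, trial, L)
        if Q_trial >= Q or rng.random() < np.exp(step * (Q_trial - Q)):
            lump, Q = trial, Q_trial
    return lump, Q

# Transition counts among four microstates with two metastable pairs
# {0,1} and {2,3}; the optimal lumping recovers those pairs.
C = np.array([[90,  9,  1,  0],
              [ 9, 90,  0,  1],
              [ 1,  0, 90,  9],
              [ 0,  1,  9, 90]])
lump, Q = mcsa_lump(C, 2)
```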
[4] The left eigenvector vk is simply related to the right eigenvector uk by (vk)i = p_eq,i^−1 (uk)i [OSW77].

[5] One could choose to maximize the largest eigenvalues or fastest time scales of the lumped transition matrix, the product of eigenvalues (which would give more weight to faster time scales), or even a weighted sum of the eigenvalues, where the weights might be due to the equilibrium importance of the eigenmode in dynamics or in modeling a process of interest. Unfortunately, these quantities all necessitate computing some eigenvalues or the determinant of the lumped transition matrix for every proposed lumping to be evaluated by the MCSA algorithm, which would add a significant
Iteration
For the remaining iterations, the K-medoid clustering is repeated independently on each macrostate
for five iterations. In general, we split each macrostate into 10 microstates, unless otherwise noted.
However, we wish to ensure statistical reliability of the transition probability matrix, so if the expected microstate size (estimated by the population of the macrostate divided by K) falls below
some threshold (100 configurations unless otherwise noted), we split to a smaller number of states
such that the expected size is above the threshold. The lumping step is then repeated on all resulting
microstates. The entire procedure of splitting and lumping is repeated for a total of 10 iterations,
which for the applications considered here was sufficient for convergence of the metastability.
3.3.4 Validation
To validate the model, we examine the largest implied time scales as a function of lag time, as computed from the eigenvalues of the transition matrix by Eq. 3.7. In particular, we attempt to determine
the minimum lag time after which the implied time scales appear to be independent of lag time to
within the estimated statistical uncertainty (see Section 3.2.4). To estimate statistical uncertainties
in the implied time scales and other quantities, we perform a bootstrap procedure [Efr79] on the
pool of independent trajectories. Forty bootstrap replicates, each consisting of a number of trajectories equal to the number of independent trajectories in the dataset pool, are generated by drawing
from the pool with replacement. For alanine dipeptide, 100 bootstrap replicates were used. For each
replicate, the implied time scales or other quantity is computed, and either the standard deviation
over the sample of replicates computed (if reported in the text as a ± b) or a 68% confidence interval
centered on the sample mean estimated (if depicted in a figure as vertical error bars).
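The bootstrap-over-trajectories procedure can be sketched as follows (the toy estimator here is just the mean trajectory length; in practice one would recompute an implied time scale from each resampled pool):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_std(trajectories, estimator, n_replicates=40):
    """Standard deviation of an estimator over bootstrap replicates, each
    built by drawing len(trajectories) whole trajectories with replacement."""
    values = []
    for _ in range(n_replicates):
        idx = rng.integers(0, len(trajectories), len(trajectories))
        values.append(estimator([trajectories[i] for i in idx]))
    return float(np.std(values))

# Toy pool of four "trajectories" of different lengths.
pool = [np.zeros(n) for n in (10, 20, 30, 40)]
err = bootstrap_std(pool, lambda ts: float(np.mean([len(t) for t in ts])))
```

Resampling whole trajectories, rather than individual configurations, preserves the temporal correlations within each trajectory, which is why the text draws from the pool of independent trajectories.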
[5, cont.] computational burden. Alternatively, other quantities could be computed from the transition matrix directly, such as the state lifetimes estimated from the self-transition probabilities as τL,i = (1 − Tii)^−1. However, the combination of computational and theoretical convenience makes the use of metastability a natural choice here.

We also estimate the number of statistically independent visits to each macrostate. Since sequential samples from a single trajectory are temporally correlated, we compute the integrated autocorrelation time [SABW82, Jan02] τac,i for each macrostate i. Ignoring statistical uncertainty, this correlation time is an upper bound on the equilibration time within a state; long-lived states will necessarily have long autocorrelation times, but trajectories trapped within them may contain many uncorrelated samples if the internal equilibration time is short. In the absence of a convenient way to quantify the internal equilibration time for each state, the autocorrelation time provides a better estimate of the appropriate time scale than the time to reach global equilibrium τeq. The effective number of independent samples for each state is estimated by summing the number of independent samples from each trajectory (which are assumed independent), where the effective number of independent samples of state i from trajectory n is computed as Nn,i^eff ≈ min{1, Nn,i/gi}, where Nn,i is the number of configurations from trajectory n in state i, and gi = 1 + 2 τac,i/τsample is the statistical inefficiency of state i, where τsample is the sampling interval between conformations.
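A minimal sketch of the statistical-inefficiency estimate g = 1 + 2 τac/τsample from a state's indicator series, truncating the autocorrelation sum at its first non-positive term (a common heuristic; names are ours):

```python
import numpy as np

def statistical_inefficiency(indicator, tau_sample=1.0):
    """g = 1 + 2 * tau_ac / tau_sample, with the integrated
    autocorrelation time tau_ac of the state-indicator series estimated
    by summing normalized autocorrelations until they turn non-positive."""
    x = np.asarray(indicator, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    tau_ac = 0.0
    for t in range(1, len(x)):
        c = np.dot(x[:-t], x[t:]) / ((len(x) - t) * var)
        if c <= 0.0:
            break
        tau_ac += c * tau_sample
    return 1.0 + 2.0 * tau_ac / tau_sample

g_corr = statistical_inefficiency([0] * 50 + [1] * 50)  # one long block: highly correlated
g_alt = statistical_inefficiency([0, 1] * 50)           # alternating: no positive correlation
```

Long-lived occupancy (the block series) yields g much greater than one, so each trajectory contributes far fewer independent samples than configurations.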
3.4 Applications
3.4.1 Alanine dipeptide
We first demonstrate application of the automatic state decomposition algorithm to a simple model
system, terminally blocked alanine peptide (sequence Ace-Ala-Nme) in explicit solvent. Because
the slow degrees of freedom (φ and ψ torsions, labeled in Figure 3.2, left) are known a priori[6], it is relatively straightforward to manually identify metastable states from examination of
the potential of mean force, making it a popular choice for the study of biomolecular dynamics
[AFC99, BDC00, MW01, HK03, CIL04, CSPD06b]. Previously, a master equation model constructed using six manually identified states (Figure 3.2, right) was shown to reproduce dynamics
over long times (with the time to reach equilibrium over 100 ps at 302 K) given trajectories only
6 ps in length [CSPD06b]. We therefore determine whether the automatic algorithm can recover
a model of equivalent utility to this manually constructed six-state decomposition for this system,
as well as study its convergence properties. Because the algorithm uses the solute Cartesian coordinates, rather than the (φ,ψ) torsions, this is a good test of whether good approximations to the
true metastable states can be discovered without prior knowledge of the slow degrees of freedom.
For ease of visualization, however, we project the state assignments onto the (φ,ψ) torsion map for
comparison with our manually constructed states.
Simulation details
Trajectories were obtained from the 400 K replica of a 20 ns/replica parallel tempering simulation[7] described in Ref. [CSPD06b], and consisted of an equilibrium pool of 1,000 constant-energy,

[6] Simulations of alanine dipeptide examining the committor distribution have implicated solvent coordinates as the next-slowest degrees of freedom [BDC00, MD05], but we have previously verified that the φ and ψ torsions form a sufficient basis for the slow degrees of freedom on time scales of 6 ps and greater [CSPD06b].

[7] Note that only 10 ns/replica were used in Ref. [CSPD06b]; the data presented here includes an additional 10 ns/replica of production simulation. Additionally, configurations containing cis-ω torsions discussed in the text were not observed in the first 10 ns/replica cited in the previous study; these conformations only appeared in the latter 10
Figure 3.2: Potential of mean force and manual state decomposition for alanine dipeptide. Left: The terminally-blocked alanine peptide (Ace-Ala-Nme) with φ, ψ, and ω backbone torsions labeled. Right: The potential of mean force in the (φ, ψ) torsions at 400 K estimated from the parallel tempering simulation, truncated at 10 kBT (white regions), with reference scale (far right) labeled in units of kBT. Boundaries defining the six states manually identified in Ref. [CSPD06b] from examining the 302 K PMF are superimposed, and the states labeled.
constant-volume trajectory segments 20 ps in length with configurations stored every 0.1 ps. The
peptide was modeled by the AMBER parm96 forcefield [KDC+ 97], and solvated in TIP3P water
[JCM+ 83]. The previous study [CSPD06b] considered the dynamics at 302 K, but resorted to a focused sampling strategy where a number of trajectories were initiated from equilibrium distributions
within constricted selection cells [SPS04a] in order to obtain statistically reliable estimates of the
transition matrix. Here, as the focus was on locating these metastable states from equilibrium data,
we found it necessary to use equilibrium data from a higher temperature — here, the 400 K replica
— in order to obtain sufficient numbers of trajectories covering the entirety of the landscape. A 2D
potential of mean force (PMF) at 400 K in the (φ, ψ) backbone torsions was estimated from the
parallel tempering simulation using the weighted histogram analysis method [KBS+ 92, CSP+ 06]
by discretizing each degree of freedom into 10° bins (Figure 3.2). Because the (φ, ψ) torsions are
supposed to be the only slow degrees of freedom in the system, we can associate basins in the potential of mean force with metastable states. The six such states identified from the 302 K PMF in
the previous study [CSPD06b], identified as dark lines in Figure 3.2, can be seen to adequately separate the free energy basins observed at 400 K. We take this decomposition as our reference “gold
standard”, and compare the one obtained from our automatic state decomposition algorithm with it.
[7, cont.] ns/replica.
Automatic State Decomposition
First, the automatic state decomposition method described in Section 3.3 was applied to this dataset
in a fully automatic way to obtain six macrostates that could be compared with the “gold standard”.
Since there is only one Cα atom in the peptide, we opted to use the backbone RMSD (including
the amide proton and carbonyl oxygen) in the first stage, splitting to 100 microstates; subsequent
iterations used the distance metric and splitting procedure described in Section 3.3.3. A single
sampling interval — 0.1 ps — was used for the calculation of the metastability metric Q used in
lumping, as described in Section 3.3.2. Application of the state decomposition algorithm to the
entire dataset revealed a state that heavily overlapped with several others when projected onto the
(φ, ψ) map, along with an extremely long time scale associated with its transitions (data not shown).
Closer examination of the ensembles of configurations contained in this overlapping state revealed
that the overlapping regions differed by a peptide bond isomerization; a small population of the
trajectories contained an N-terminal ω peptide bond in the cis state, rather than the typical trans
state. The number of trajectories that connected these states was extremely small. Examination of
the parallel tempering data revealed that the majority of these transitions had occurred at a much
higher temperature, and that the cis-ω configurations found at 400 K had reached this temperature
by annealing from the higher temperature; in the majority of trajectories at 400 K that contained
cis-ω configurations, the peptide remained in this state over the duration of the trajectory. This
is a clear demonstration of how the automatic algorithm can discover additional slow degrees of
freedom that the experimenters may not realize are important. For subsequent investigation, due to
the extremely small number of transitions, trajectories containing conformations that included cis-ω
bonds (a total of 25 trajectories) were removed from the set of trajectories used for analysis, leaving
975 trajectories.
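The metastability metric Q used during lumping is described in Section 3.3.2. As an illustrative sketch (not the Fortran implementation referenced in Section 3.6), and assuming Q is the trace of the row-stochastic macrostate transition matrix estimated at the chosen sampling interval (consistent with the Q values near the number of states quoted for these decompositions), the computation might look like:

```python
import numpy as np

def metastability(state_trajs, n_states, lag=1):
    """Metastability Q: trace (sum of self-transition probabilities) of the
    row-stochastic transition matrix estimated at the given lag, where the
    lag is expressed in sampling intervals."""
    counts = np.zeros((n_states, n_states))
    for s in state_trajs:
        for a, b in zip(s[:-lag], s[lag:]):
            counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # row-normalize, guarding against unvisited states
    T = np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
    return float(np.trace(T))

# A highly metastable trajectory (one transition in 100 frames) gives Q near 2:
Q = metastability([[0] * 50 + [1] * 50], n_states=2)
```

A rapidly mixing trajectory (e.g., strict alternation between two states) drives Q toward zero, which is why maximizing Q favors kinetically metastable lumpings.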
Comparison with manual state decomposition
The results of the automatic state decomposition algorithm applied to this reduced dataset can
be seen in Figure 3.3, in comparison with the “gold standard” manual state decomposition from
Ref. [CSPD06b] and a “poor” manual decomposition that is expected to fail to reproduce kinetics
well because its states include internal kinetic barriers.8 Independent applications of the automatic
8. The poor partitioning was defined as follows: (1) φ ∈ (179, −135], ψ ∈ (98, 48]; (2) φ ∈ (−135, −60], ψ ∈ (98, 48]; (3) φ ∈ (179, −135], ψ ∈ (48, 98]; (4) φ ∈ (−135, −60], ψ ∈ (48, 98]; (5) φ ∈ (−60, 179], ψ ∈ (98, −45]; (6) φ ∈ (−60, 179], ψ ∈ (−45, 98]. Specified intervals denote intervals on the torus, which is continuous from −180 to +180. All torsion angles are specified in degrees.
[Figure 3.3: four state-partitioning panels labeled "manual good" (Q = 5.59 ± 0.02), "manual poor" (Q = 3.21 ± 0.05), and two "automatic" (Q = 5.64 ± 0.02 and Q = 5.64 ± 0.03); right panels plot time scales τk (ps) against lag time τ (ps) over 0–10 ps.]
Figure 3.3: Comparison of manual and automatic state decompositions for alanine dipeptide. The
left panels depict state partitionings, and the right panels the associated time scales (in picoseconds)
as a function of lag time with uncertainties shown, as estimated from the procedure described in
Section 3.3.4. Axes are the same in all plots. Top two panels: Manual “good” or “gold standard”
state decomposition from Ref. [CSPD06b] and manual “poor” state decomposition, where the state
boundaries are grossly distorted so as to include internal kinetic barriers within the states. Bottom
two panels: Two nearly-equivalent partitionings obtained from the automatic state decomposition
algorithm.
method were observed to yield two distinct decompositions whose metastabilities agree to within statistical uncertainty, both slightly exceeding the metastability of the manual decomposition (Figure
3.3, bottom two plots). In the first automatic decomposition, six states in the same general locations
as the manual “gold standard” decomposition are observed, though the boundaries are somewhat
perturbed. However, the time scales as a function of lag time are not significantly different from
those of the manual “gold standard” decomposition (Figure 3.3, right). In the other automatic decomposition, states 3 and 4 of the manual decomposition (numbering given in Figure 3.2) have been
merged into a single state, and state 5 of the manual decomposition has been fragmented into two
states. Despite this, the time scales as a function of lag time again appear to be statistically indistinguishable from those of the “gold standard”, suggesting that this model may have equal utility.
This suggests that the phenomenological rates may not be very sensitive to the exact choice of state
boundaries after the Markov time, as recrossings will have been suppressed by this time. The fact
that this lumping does not disrupt the behavior of the model substantially is not altogether surprising, because the barrier separating states 3 and 4 is rather small, and these states act like a single
state even on time scales of a few picoseconds or greater. In contrast, the “poor” decomposition has
extremely short time scales which do not appear to level off over the course of 10 ps.
Stability of state decomposition
To examine the ability of the algorithm to recover optimal partitionings, the automatic state decomposition algorithm was applied to both the “gold standard” and “poor” manual decompositions
(Figure 3.4) to see whether these partitionings would be maintained over the course of subsequent
iterations. Ten iterations were conducted, with each macrostate split to ten microstates in the first
iteration, rather than the entire configuration space being split into 100 states. In both cases, the
algorithm converged to nearly equivalent partitionings after ten iterations (Figure 3.4), as verified
by examination of the converged time scales (data not shown). This suggests the method yields partitionings that are relatively stable and optimal. From the “poor” manual decomposition, however, a
number of conformations in manual states 5 and 6 are incorrectly grouped with state 2, though this
did not significantly affect the time scales. Further investigation showed that the algorithm never
split these conformations from state 2, partly because they comprise only 1% of the population of
the state. Splitting each macrostate into more microstates should alleviate this problem.
[Figure 3.4: "good" (top) and "poor" (bottom) manual decompositions, showing initial (left) and final (right) partitionings.]
Figure 3.4: Stability and recovery of optimal state decomposition for alanine dipeptide. Top: Ten
cycles of automatic state decomposition applied to a “good” manual partitioning (left) to yield an
automatic partitioning (right). Bottom: Ten cycles of automatic state decomposition applied to a
“poor” manual partitioning (left) to yield an automatic partitioning (right).
3.4.2 The Fs helical peptide
To illustrate behavior of the automatic state decomposition method on a larger peptide system with
fast kinetics, we applied it to the 21-residue helix-forming Fs peptide, which has been studied
extensively both experimentally [LK92, LK93, WCG+ 96, TEH97, LKSA01] and computationally
[GS02, ZLCD04, SP05b, SP05c]. Since helix formation occurs on the nanosecond time scale,
Sorin et al. were able to reach equilibrium from both helix and coil conformations and observe
equilibrium conformational dynamics using ensembles of molecular dynamics trajectories on the
distributed computing platform Folding@Home [SP05c]. Two sets of 1,000 trajectories at 302 K
of varying length of the capped Fs peptide (sequence Ace-A5 [AAARA]3 A-Nme), one set initiated
from an ideal helix and another from a random coil, were obtained from Sorin et al. [SP05c]; details
of the simulation protocol are available therein. The first 40 ns of each trajectory, a conservative
overestimate of the time to reach equilibrium from either helix or coil, was discarded, and the two
sets of trajectories combined to yield a total of 1,689 trajectories varying in length from 10 ns to
95 ns with a sampling interval of 100 ps. In total, this equilibrium dataset contained nearly 65 µs
state   members   τac (ns)     state   members   τac (ns)     state   members   τac (ns)
  1     358 712     3.1          8      11 053     2.2          15      4 396      4.3
  2      98 222     0.9          9      11 024     2.0          16      1 856      5.0
  3      46 921     1.4         10       7 976     2.2          17        955     10.3
  4      22 559     0.6         11       7 808     1.2          18        531     47.0
  5      22 367     4.0         12       7 771     1.6          19        525     29.1
  6      15 859     1.3         13       5 978    11.3          20        490     15.2
  7      11 975     1.6         14       5 626     2.3
Table 3.1: Macrostates from a 20-state state decomposition of the Fs helical peptide. The backbone
is depicted in alpha carbon trace, and arginine sidechains are shown in blue (Arg10), magenta
(Arg15), and green (Arg20) for clarity.
of simulation data in 642,604 conformations. The peptide was modeled using the AMBER-99φ
forcefield [WCK00, SP05c] and solvated in TIP3P water [JCM+ 83]. Though the Berendsen weak-coupling scheme [BPvG+ 84] was employed for thermal and pressure control,9 we presume the
trajectories still obey microscopic reversibility when only the coordinates of the macromolecular
solute are considered for the purposes of computing transition probabilities.
Comparison of states
We performed automatic state decomposition on this dataset to generate a set of 20 macrostates
through 10 iterations of splitting and lumping. In the first iteration, the sampled region of conformation space was split into 400 microstates. In subsequent iterations, each macrostate was split into
50 microstates (or, if the expected microstate size was less than 500 configurations, the maximum
number of microstates such that the expected microstate size was above 500).
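The adaptive microstate count described above can be written as a one-line rule. The sketch below is illustrative (the function and parameter names are not those of the reference implementation):

```python
def n_microstates(macrostate_size, target=50, min_expected=500):
    """Number of microstates to split a macrostate into: at most `target`,
    reduced so that the expected microstate population (macrostate size
    divided by microstate count) stays at or above `min_expected`
    configurations."""
    return max(1, min(target, macrostate_size // min_expected))
```

For example, a macrostate of 100,000 configurations is split into the full 50 microstates, one of 10,000 configurations into 20 (expected size 500), and one below 1,000 configurations is left unsplit.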
Automatic state decomposition produced a structurally diverse set of states (Table 3.1), ranging
in size from over 350,000 members to 500 members, with the majority containing from 5,000 to
9. We note that Berendsen thermal control, here applied independently to the peptide and solvent, modulates the velocities of the peptide atoms during the course of the simulation, which may have a nonphysical effect on dynamics and affect interstate transition rates. However, since we compare our Markov model with the original simulation dataset, rather than directly with experiment, this is not of concern.
20,000 members. The states include a large state (state 1 of Table 3.1), consisting of slightly over
half the total conformations in the dataset and containing both extended coil and helical conformations;
a pure helix state (state 15); a number of helix/coil states which are bent in half to different degrees
to form tertiary contacts (states 2–14); and a number of smaller helical states which are bent into
circles to form tertiary interactions (states 16–20). A previous analysis [SP05c] of this data clustered conformations into states based on various order parameters: the number of helical residues,
number of helical segments (stretches of helical residues), length of the longest helical segment, and
radius of gyration. We compared the macrostates generated by the automatic algorithm with these
clusters, and found that while some states are similar, namely the bi-nucleated helices of different
sizes, most were quite different. The most significant difference was the grouping of helix and coil
conformations into a single macrostate in the lumping phase of the automatic algorithm, whereas
the order parameter-based clustering kept helix and coil states distinct [SP05c]. When examining
individual trajectories, we noticed conformations would rapidly transition between helices and coils
between consecutive 100 ps frames of the trajectory, suggesting that their rapid interconversion
justifies their lumping into a single macrostate. Additionally, the clustering based on helical order
parameters was unable to distinguish certain structures that involved long-lived tertiary contacts,
such as the bent and circular helical states. Interestingly, a previous study employing the related
AMBER parm03 forcefield [DWC+ 03] identified similar configurations to those noted by the automatic state decomposition, terming these states helix (state 15), helix-turn-helix (states 3, 6 – 8),
adjusted helix-turn-helix (states 4 – 5, 9 – 12, 14), and globular helix (states 16 – 20).
Kinetic analysis
We then examined the implied time scales as a function of lag time (Figure 3.5). Lumping appeared
to preserve the longest time scales found in the microstate transition matrix (data not shown), indicating that our lumping scheme had been successful in identifying a nondestructive lumping into
kinetically metastable states at each iteration. Over the course of 10 iterations, the metastability
(as optimized with a lag time of 100 ps) increased from 12.5 ± 0.3 to 14.5 ± 0.1, suggesting that
the iterative refinement was improving the quality of the state decomposition. On the first iteration,
the longest time scales increase nearly linearly with lag time, while on the last iteration, some of
the longest time scales become stable by a lag time of 4 – 5 ns, suggesting Markovian behavior for
some of the processes.
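The implied time scales plotted as a function of lag time follow the standard relation τk = −τ / ln λk(τ), where λk are the non-unit eigenvalues of the transition matrix estimated at lag time τ. A minimal sketch (an illustrative reimplementation, not the code used for this analysis):

```python
import numpy as np

def implied_timescales(T, lag_time, n=5):
    """Implied time scales tau_k = -lag_time / ln(lambda_k), computed from
    the largest non-unit eigenvalues of a row-stochastic transition
    matrix T estimated at the given lag time (same time units)."""
    evals = np.sort(np.real(np.linalg.eigvals(T)))[::-1]   # descending; evals[0] ~ 1
    lam = evals[1:n + 1]
    lam = lam[(lam > 0.0) & (lam < 1.0)]                   # keep decaying modes only
    return -lag_time / np.log(lam)

# Symmetric two-state example: lambda_2 = 0.8, so tau_2 = -1/ln(0.8) ~ 4.48
ts = implied_timescales(np.array([[0.9, 0.1], [0.1, 0.9]]), lag_time=1.0)
```

Plotting these values for increasing lag time and looking for a plateau is exactly the diagnostic shown in Figure 3.5.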
Using the interpretation of eigenvector components in terms of aggregate modes described in
Section 3.2.1, the longest time scale was found to correspond to movement between the extended
[Figure 3.5: plot of implied time scales τk (ns), 0–35 ns, against lag time τ (ns), 0–10 ns.]
Figure 3.5: Implied time scales of the Fs peptide as a function of lag time for 20-state automatic state
decomposition. The five longest time scales are shown. Circles represent the maximum likelihood
estimate, and vertical bars depict 68% symmetric confidence intervals about the mean. Note that the time scales associated with two processes appear to cross; here, colors are assigned and uncertainties estimated, via the bootstrap procedure, by ordering the time scales computed from each bootstrap replicate by rank. This may cause the uncertainties depicted here to underestimate the true uncertainties of each process.
helix/coil state (state 1) and one of the twisted helix-turn-helix states (state 18), which has only about 500 members. We found, however, that state 18 appeared only a few times in each of thirty trajectories, but
over 450 times in a single trajectory. Further examination revealed that conformations belonging to
this state were almost exclusively temporally adjacent to conformations belonging to state 5, and
structural comparison of conformations of these two states showed they were strikingly similar. This
suggests that slight conformational differences between conformations in states 18 and 5 allowed
the K-medoid clustering algorithm to partition between these states in a splitting step, and since
state 18 was mainly isolated in a single trajectory, its self-transition probability was maximized by
not lumping it with state 5, even though the two behaved in a similar kinetic fashion. Indeed, when
we manually lump states 18 and 5, the longest time scale, corresponding to transitions involving
state 18, disappears, but the remaining time scales are all preserved (data not shown).
A potential cause of the increase with lag time observed in some of the other long time scales is the finite length of the trajectories. If a state is long-lived and occurs near the beginning or end of a trajectory, the estimated self-transition probability Tii artificially increases as a function of lag time. This effect is most pronounced when a state occurs in very few
trajectories, and appears to be mitigated when the state occurs in many trajectories at random times
within the trajectory.
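This finite-trajectory effect can be reproduced with a toy example: a long-lived state visited only at the end of a trajectory yields a self-transition estimate pinned at unity, because exits from the state are never observed, whereas the same visit mid-trajectory shows the expected decay with lag. The estimator below is a simple sliding-window sketch, not the one used in the text:

```python
import numpy as np

def self_transition_prob(traj, state, lag):
    """Estimated T_ii at the given lag (in frames) from one discrete
    trajectory, using simple sliding-window counting."""
    starts, ends = traj[:-lag], traj[lag:]
    from_i = starts == state
    if not from_i.any():
        return np.nan
    return float(np.mean(ends[from_i] == state))

# A 10-frame visit to state 1 in the middle of a 100-frame trajectory,
# versus the same 10-frame visit at the very end of the trajectory:
mid = np.array([0] * 45 + [1] * 10 + [0] * 45)
end = np.array([0] * 90 + [1] * 10)
```

At a lag of 5 frames, the mid-trajectory estimate is 0.5 (half the windows starting in the state exit it), while the end-of-trajectory estimate remains 1.0 for all lags shorter than the visit.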
In order to determine which states are poorly characterized, we estimated the number of statistically independent visits to each macrostate using the autocorrelation time given in Sec. 3.3.4. As
the correlation functions became statistically unreliable at times larger than 10 ns, a least squares
linear fit to the log of the computed correlation function over the first 10 ns was used to estimate
the tail of the function at times greater than 10 ns, and this combined correlation function was integrated to obtain the autocorrelation time. Computed state autocorrelation times are given in Table
3.1. For many states, the correlation time was 1 – 2 ns, giving thousands of independent samples;
however, for five states, including the four involved in the four longest time scales, the correlation
times were between 10 and 50 ns, suggesting that the dataset contained less than 50 independent
samples of these states. Currently, in the automatic state decomposition algorithm, we try to reduce
the statistical uncertainty in the transition matrix by limiting the expected population of each state
to be greater than some minimum number of configurations. Since the conformations appearing
within some states may be highly correlated, the number of conformations within a state is not the
best measure of how statistically well-determined its transition elements are; instead, it may be advantageous to place a lower limit on the effective number of independent visits to each state, which
is far less than the number of configurations it contains. Alternatively, it may be necessary to ensure
better characterization of these states by conducting additional simulations from them, provided the
equilibrium transition probabilities can still be computed.
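The tail-corrected autocorrelation time described above can be sketched as follows: the empirical autocorrelation of a state-indicator function is summed up to the cutoff, and a least-squares linear fit to its logarithm supplies an analytic exponential tail beyond it. This is an illustrative reimplementation of the idea, not the exact procedure of Sec. 3.3.4:

```python
import numpy as np

def autocorr_time(indicator, dt, t_cut):
    """Integrated autocorrelation time of a state-indicator time series.
    The empirical autocorrelation is summed up to t_cut; a least-squares
    linear fit to its logarithm supplies the analytic exponential tail."""
    x = indicator - indicator.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)                    # zero-padded FFT autocorrelation
    acf = np.fft.irfft(f * np.conj(f))[:n]
    acf = acf / acf[0]                           # normalize so acf[0] = 1
    k = int(round(t_cut / dt))
    acf, t = acf[:k], np.arange(k) * dt
    pos = acf > 0                                # fit log(acf) where it is defined
    slope, intercept = np.polyfit(t[pos], np.log(acf[pos]), 1)
    tau_fit = -1.0 / slope
    tail = np.exp(intercept - t_cut / tau_fit) * tau_fit   # integral beyond t_cut
    return dt * acf.sum() + tail

# Two-state chain with flip probability 0.05: the indicator autocorrelation
# decays as 0.9^t, so the integrated time should come out near 10 steps.
rng = np.random.default_rng(0)
s = np.empty(200_000, dtype=int)
s[0] = 0
for i in range(1, len(s)):
    s[i] = s[i - 1] ^ (rng.random() < 0.05)
tau = autocorr_time((s == 0).astype(float), dt=1.0, t_cut=50.0)
```

Dividing the total simulation time in a state by this correlation time then gives the effective number of independent visits used in Table 3.1.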
We constructed a Markov model from the transition matrix estimated at a 5 ns lag time, by which some (though apparently not all) of the time scales appear to have stabilized. The Chapman-Kolmogorov
test (Sec. 3.2.4) can assess how well the model reproduces the observed kinetics. The time evolution
of probability density out of three states (state 2, a populous state; state 13, a moderately populated
state; and state 19, a sparsely populated state) over the course of 50 ns is shown in Figure 3.6. The
Markov model appears to do a very reasonable job of predicting the time evolution of the system to
within statistical uncertainty over many times longer than the lag time used to construct it. In fact,
the time evolution was well modeled for evolution out of all states, except for state 13, for which
dynamics seemed to be particularly poorly reproduced. This state has a long correlation time, and
many trajectories seem to contain only a single configuration that is part of this state, suggesting its
boundaries are simply poorly resolved. Regardless, the time evolution is generally well-modeled
for this system.
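The Chapman-Kolmogorov comparison can be sketched directly: propagate the lag-τ model k steps and compare with a transition matrix estimated independently at lag kτ. The count-based estimator below is a simplified illustration with hypothetical names, not the estimator of Sec. 3.2.4:

```python
import numpy as np

def estimate_T(trajs, lag, n_states):
    """Row-stochastic transition matrix from discrete-state trajectories,
    counted with a sliding window at the given lag (in frames)."""
    C = np.zeros((n_states, n_states))
    for s in trajs:
        np.add.at(C, (s[:-lag], s[lag:]), 1)
    return C / C.sum(axis=1, keepdims=True)

def ck_discrepancy(trajs, lag, k, n_states):
    """Chapman-Kolmogorov check: max |[T(lag)^k]_ij - [T(k*lag)]_ij|.
    Small values indicate the lag-`lag` model propagates consistently."""
    Tk = np.linalg.matrix_power(estimate_T(trajs, lag, n_states), k)
    return np.abs(Tk - estimate_T(trajs, k * lag, n_states)).max()

# Synthetic check: a chain generated from a known 3-state matrix should
# pass the test with small discrepancy.
T_true = np.array([[0.95, 0.05, 0.00],
                   [0.05, 0.90, 0.05],
                   [0.00, 0.05, 0.95]])
rng = np.random.default_rng(1)
s = np.empty(30_000, dtype=int)
s[0] = 0
for i in range(1, len(s)):
    s[i] = rng.choice(3, p=T_true[s[i - 1]])
d = ck_discrepancy([s], lag=1, k=5, n_states=3)
```

In the text, the same comparison is made at the level of state populations evolved out of selected starting states, with bootstrap confidence intervals in place of a single discrepancy number.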
[Figure 3.6: state populations (0–1) as a function of time (0–50 ns), in three panels.]
Figure 3.6: Reproduction of observed state population evolution by Markov model for the Fs peptide. The time evolution of the Markov model constructed from the 5 ns lag time transition matrix is
shown by the filled circles with flat error bars, which denote the 68% confidence interval estimated
from a sample of 40 bootstrap realizations, each realization computed from a 5 ns transition matrix estimated from a bootstrap sample of trajectories. Vertical bars without flat ends denote the 68%
confidence interval centered on the sample mean for the probability of finding the system in the
20 macrostates a given time after initial preparation in a specific state. The system was originally
prepared in state 2 (top, red), 13 (middle, yellow), or 19 (bottom, purple). The most populous states
are colored green (state 1), red (state 2), and blue (state 3).
3.4.3 The trpzip2 β-peptide
As an illustration of the application of the state decomposition algorithm to a system with complex
kinetics implying the existence of multiple metastable states [YG04], we considered the engineered
12-residue β-peptide trpzip2 [CSS01]. A set of 323 10 ns constant-energy, constant-volume simulations of the unblocked peptide,10 performed using the AMBER parm96 forcefield [KDC+ 97] in
TIP3P water [JCM+ 83] was obtained from Pitera et al. [PHS06]; details of the simulation protocol
are provided therein. The trajectories were initiated from an equilibrium sampling of configurations at 425 K, a temperature high enough to observe repeated unfolding and refolding events at
equilibrium. Configurations were sampled every 10 ps, giving a total of 3.23 µs of data in 323,000
configurations.
10. Note that the peptide studied experimentally in Refs. [CSS01] and [YG04] was synthesized with an amidated C-terminus, whereas the termini of the simulated peptide in the dataset considered here were left zwitterionic.
Comparison of states
The automatic state decomposition method was applied to obtain a set of 40 macrostates in 10
iterations of splitting and lumping. The algorithm was performed as described in Section 3.3.3,
except for the first iteration, where the conformations were split into 400 microstates.
Figure 3.7 depicts some of the final set of 40 macrostates compared with a set of states identified by consideration of backbone hydrogen bonding patterns in the previous study by Pitera et al.
[PHS06].11 As the trajectories considered here were resampled to 10 ps intervals (rather than 1 ps in
Ref. [PHS06]) we found less than five examples of the +2 and -2 hydrogen bonding states identified
in Ref. [PHS06], and therefore exclude them from comparison. The automatic state decomposition
method recovers states corresponding to the native, +1C, and +1N hydrogen bonding patterns, and
often further resolves them based on the orientation of the tryptophan sidechains (Figure 3.7, A,
C, D). However, the -1N hydrogen bonding pattern is not further resolved, and instead is grouped
into a state of mostly disordered hairpins; further examination is necessary to determine whether
the algorithm failed to resolve this state or whether the state is simply not long-lived. In addition
to recovering most of the manually identified misregistered states, the algorithm was also able to
greatly resolve the state labeled as “unfolded” in Pitera et al. (in that it did not conform to any
of the enumerated hydrogen bonding patterns) into substates which exhibit considerable structure
(E–J). Some of these kinetically resolved states have distinct hydrogen bonding patterns, such as
where both strands are rotated (H), causing the tryptophan sidechains to appear on the opposite
face, or where the misregistration is greater than two residues (G, J). This demonstrates the utility
of the method in identifying additional kinetically relevant states that were not initially part of the
experimental hypothesis space.
Kinetic analysis
Figure 3.8 depicts the implied time scales of the kinetic model as a function of lag time. The longest
time scale ranges between 25 and 35 ns and appears to stabilize over the range of lag times considered, though the uncertainty is quite large. Eigenvector analysis (described in Sec. 3.2.1) shows
that this time scale corresponds to transitions between the unfolded and disordered hairpin states
(E) and the hairpin with both strands rotated (H). The states labeled H together totaled 935 conformations, but appeared in only 13 trajectories, with over 95% of the conformations appearing in a
single trajectory. Correlation time analysis (Sec. 3.3.4) suggests there are less than 10 independent
11. The complete set of macrostates is shown in a figure included as Supplementary Information of Ref. [CSP+ 07].
[Figure 3.7: hydrogen bonding states from Pitera et al. (native, -1N, +1C, +1N, and "unfolded", with strand registration numbering) at left; macrostates A–J from automatic state decomposition at right.]
Figure 3.7: Comparison of some trpzip2 macrostates found by automatic state decomposition with
misregistered hydrogen bonding states identified in a previous study. Left: The five hydrogen bonding patterns enumerated in Pitera et al. [PHS06] that occurred in sufficient numbers in the subsampled trpzip2 dataset used here, with representative conformational ensembles. Blue squares denote
backbone amide hydrogen bond donors, and red circles denote backbone carbonyl hydrogen bond
acceptors. Right: A selection of macrostates discovered by automatic state decomposition that contain the largest numbers of hydrogen bonding pattern states. The backbone is depicted in alpha
carbon trace, and tryptophan sidechains are shown in light blue (Trp2), orange (Trp4), magenta
(Trp9), and teal (Trp11).
[Figure 3.8: plot of implied time scales τk (ns), 0–50 ns, against lag time τ (ns), 0–7 ns.]
Figure 3.8: Implied time scales of trpzip2 as a function of lag time for 40-state automatic state
decomposition. The five longest time scales are shown.
samples for each of the three states, so proper resolution of this time scale would require more data.
The second longest time scale grows to about 15 ns, levels off by around 4 ns, and corresponds to
transitions between the unfolded and disordered hairpin states (E) and the native backbone states
(A). The states involved in this transition are much better characterized, with a total of over 25,000
conformations appearing in over half the trajectories. The next three longest time scales were all
between 3 and 4 ns and correspond to movement between the unfolded state (E) and various sets
of misregistered states, namely the newly identified misregistered states I and J, and the +1C state
(C). Unfortunately, these time scales are on the order of the time to reach global equilibrium, so it
is difficult to characterize these transitions well.
3.5 Discussion
Markov models are expected to be effective and efficient ways to statistically summarize information
about the pathways (mechanism) and time scales for heterogeneous biomolecular processes such
as protein folding. The great challenge in their use lies in defining an appropriate state space.
Here, we have presented a new algorithm for automatically generating a set of configurational states
that is appropriate for describing peptide conformational dynamics in terms of a Markov model,
though we expect it to be applicable to macromolecular dynamics in general. The algorithm uses
molecular dynamics simulations as input, and generates state definitions using information about the
temporal order of conformations seen in the trajectories. The importance of having an automatic
algorithm, i.e., one that requires little or no human intervention, is that without it, human bias
may inadvertently produce incorrect interpretations of the mechanism of conformational change by
imposing a particular view of the simulation data. Additionally, molecular simulation datasets are
becoming so large and complex that effectively summarizing the data or extracting insight becomes
increasingly impractical unless the experimenter analyzes the data with a specific hypothesis in
mind. Construction of a Markov model, however, allows for a “hypothesis-free” investigation of
conformational dynamics, provided that the state space is sufficiently well sampled.
Our algorithm is based on the availability of large numbers of molecular dynamics simulations
of appropriate simulation length such as might be generated by a supercomputer or a large (possibly
distributed) cluster. Current technology allows for the production of thousands of simulations that
can be tens of nanoseconds in length, hundreds of trajectories of up to hundreds of nanoseconds
in length, or dozens that are on the order of a microsecond in length. Since our goal has been
to develop Markov models that accurately characterize the time evolution of ensembles of macromolecules over experimental time scales (that can range from microseconds to milliseconds) from
short simulations of single molecules, our approach places strong emphasis on the longest time
scales observed in molecular simulations. For example, recognizing that ill-formed states often
result in artificially shortened time scales, we sought to find states that maximize the time scales
implied by their corresponding transition matrix for a particular choice of lag time and number
of states. This resulted in the maximization of the metastability as a computationally convenient
surrogate for minimizing the internal equilibration time τint .
For the three data sets to which we have applied the method, there have been a number of
important successes. For alanine dipeptide, the algorithm discovered a distinct manifold of states
that consisted of conformations containing a cis-ω peptide bond. This manifold was discovered
because it was kinetically distinct, rather than structurally distinct. Also, for alanine dipeptide,
the method produces states that are robust and structurally very similar to the best ones produced
manually, as well as kinetically indistinguishable to within statistical uncertainty according to our
validation metrics. The application of the method to the Fs peptide data set produced a set of states
somewhat different from those identified previously from the clustering of helical order parameters
[SP05c]. The states produced by the algorithm properly identified many very long lived (metastable)
conformations whose lifetimes and kinetics might be experimentally relevant. The Markov model
produced from this state decomposition and a 5 ns transition matrix was shown to reproduce the
observed state populations over 50 ns to within statistical uncertainty. Finally, for the application
of the method to the trpzip2 peptide the states constructed were consistent with ones previously
identified [PHS06]. This was very encouraging since the previously constructed states used an intramolecular hydrogen bonding criterion and the automatic algorithm utilized different observables
and metrics, heavy atom RMSD and kinetics, to resolve states. Moreover, the automatic algorithm
more finely resolved what was considered to be the “unfolded” ensemble into metastable states that
were not identified by the decomposition based on hydrogen bonding patterns.
Therefore, the algorithm is achieving many of its design objectives. It provides a method for
identifying and characterizing the slower degrees of freedom of a molecular system. It correctly
identifies metastable states, dividing structurally similar conformations into multiple sets that have
short times for intraconversion but long times for interconversion, and combines conformations that
rapidly interconvert even though they may be structurally diverse. This is a prerequisite to capturing
a concise description of the pathways for conformational changes. Once meaningful states are
identified, the transition matrix itself encapsulates the branching ratios for various pathways and the
time scales for overall relaxation to equilibrium from any arbitrary starting ensemble.
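As a small worked illustration of the last point, repeated application of the transition matrix propagates any starting ensemble toward the equilibrium populations (the left eigenvector of T with eigenvalue 1). The 3-state matrix below is hypothetical, chosen only so the result is easy to verify by detailed balance:

```python
import numpy as np

# Hypothetical 3-state row-stochastic transition matrix (illustrative only)
T = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])

# Equilibrium populations: the left eigenvector of T with eigenvalue 1.
# Detailed balance gives pi = (0.25, 0.5, 0.25) for this matrix.
w, v = np.linalg.eig(T.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# Relaxation of an arbitrary starting ensemble (all population in state 0):
p = np.array([1.0, 0.0, 0.0])
for _ in range(500):
    p = p @ T
```

After many applications of T, the propagated distribution p coincides with pi regardless of the starting ensemble, and the rate of that relaxation is governed by the non-unit eigenvalues, exactly as discussed above.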
Work is ongoing to establish standards for the amount and nature of simulation data (number
and length of simulations) needed to develop useful and sufficiently precise Markov models as well
as investigations of the effect of quality metrics other than the metastability on the nature of the
resulting states and time scales. Metrics for assessing the quality of the resulting model also need
to be examined to complement, or as alternatives to, seeking stability of the implied time scales
with respect to lag time. Finally, alternative approaches to performing this state decomposition
are a further matter of current study, such as the method of Noé and coworkers appearing in this
issue, motivated by much the same ideas of metastability but employing different methods for the
construction of a microstate space [NHSS07].
A general observation about the models produced using states defined by our method is that
Markovian behavior is not obtained until lag times that are only an order of magnitude shorter than
the longest time scales. Recall that the utility of a state space depends to a large extent on how early
Markovian behavior is observed compared to the processes of interest. There are multiple possibilities for why this might be the case. For some molecular systems, there may be no identifiable
metastable states in the usual sense. The existence of experimentally observed metastable states in
protein systems (e.g., native, intermediate, unfolded) combined with the observation of metastable
states in models of small solvated peptides [CSPD06b] argues that this is unlikely. It could be that
statistical uncertainty is undermining both the metastability quality metric and the tests for Markovian behavior. Alternatively, the way we establish boundaries between states may not be flexible
enough to adequately divide true metastable regions. It may also be that we simply need to allow
more states to be produced, resulting in subdivision of states that have internal barriers, to reduce the
Markov times. Both of these latter possibilities could in principle be easily addressed by allowing
the creation of more states. However, the creation of more states, especially ones with low populations, leads inevitably to situations where transition probabilities become statistically unreliable
given a fixed quantity of equilibrium data.
Long time scales are ultimately the result of infrequent events, and even for large but finite equilibrium datasets these will be small in number, with resulting small off-diagonal transition probabilities that are statistically unreliable. This has placed us in the particularly difficult but unavoidable
situation of attempting to optimize a statistically uncertain objective function. One solution to this
problem, of course, is to consider this algorithm as only the first step of an iterative process where
important states and transitions are identified, and then further simulations are performed to improve
the characterization of important regions of conformation space. This will allow refinement of the
state space and improved precision for important selected transition probabilities. Information from
the subsequent simulations could be combined with that from the first set using the selection cell approach described previously [SPS04a]. Selection of states, or regions of configuration space, from
which further simulations should be initiated could be chosen based on uncertainty considerations,
to be described in Chapters 5 and 6.
3.6 Supporting Information
A Fortran 90/95 implementation of the automatic state decomposition algorithm presented here is
available for download as part of the Supplementary Information of Ref. [CSP+07]. The latest version of the code, along with the alanine dipeptide dataset, can be obtained from
http://www.dillgroup.ucsf.edu/~jchodera/code/automatic-state-decomposition/.
The trpzip2 dataset is available directly from WCS upon request (E-mail: swope@almaden.ibm.com).
A gallery of all macrostates produced by the 40-state decomposition of the trpzip2
peptide is also available as part of the Supplementary Information of Ref. [CSP+07].
Chapter 4
Model selection
A common approach for describing the conformational dynamics of biological molecules is to
model the dynamics as a discrete-state Markovian model. The previous chapter gave an automatic
method for building a state decomposition, given a target number of states. However, it is currently
difficult to determine the correct number of states in the decomposition for a given set of simulation
data. This chapter outlines a maximum likelihood score and a Bayesian score for the fit of a given
model to the observed data. We show how these scores can be used to compare state decompositions, which may differ in the state definitions themselves and the number of states. The scoring
functions are tested on decompositions of a simple transition model between 9 conformations and
on state decompositions of the terminally blocked alanine peptide. We demonstrate how the maximum likelihood score always prefers a state definition with more states, while the Bayesian score
correctly assesses the tradeoff between the number of states and the amount of data.
4.1 Introduction
Computational simulations are often used to study the movement of biological molecules. A Markovian state model (MSM) is a convenient method for analyzing the simulation data by first discretizing the conformation space of the molecule into some number of states. The kinetics of the system
are then described as Markovian, or history-independent, transitions between the states. We have
described an automatic algorithm which tries to find stable states in the conformation space such
that the dynamics over the states will be Markovian (Chapter 3). Once the states are defined, we
find the Markovian transition probabilities by simply counting the number of transitions between
states observed in the simulation data, at some lag time between successive conformations, τ . The
model is then a compact description of the dynamics, and can be projected out to long time scales.
There are a few tests for evaluating how well a given MSM describes the simulation data. One
previous method for evaluating a MSM involves calculating the implied time scales, which are
calculated from the eigenvalues of the transition probability matrix, as a function of the lag time at
which the transitions are counted [SPS04a]. If the model is Markovian after some lag time, τint , the
implied time scales will be constant for lag time τ ≥ τint . A “good” model is one for which τint
is short. An alternate method for evaluating a MSM is to calculate the metastability of the model
[HS05]. The metastability, Q, of a MSM is defined as the sum of the self-transition probabilities
calculated at some lag time τ .
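Both diagnostics are easy to compute from an estimated transition matrix. The following is a minimal numerical sketch (not part of the dissertation's Fortran implementation; the 3-state matrix is invented): the implied time scales follow from the eigenvalues of T(τ), and the metastability Q is the sum of the self-transition probabilities.

```python
import numpy as np

def implied_timescales(T, tau):
    # Implied time scales t_k = -tau / ln(lambda_k), computed from the
    # eigenvalues of the row-stochastic transition matrix T(tau),
    # excluding the stationary eigenvalue lambda_1 = 1.
    evals = np.sort(np.real(np.linalg.eigvals(T)))[::-1]
    return -tau / np.log(evals[1:])

def metastability(T):
    # Metastability Q: the sum of the self-transition probabilities.
    return np.trace(T)

# A hypothetical 3-state transition matrix at lag time tau = 1.
T = np.array([[0.97, 0.02, 0.01],
              [0.02, 0.95, 0.03],
              [0.01, 0.03, 0.96]])
print(implied_timescales(T, tau=1))
print(metastability(T))
```

A Markovian model is one whose implied time scales stop changing once the lag time exceeds τint.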
One way to reduce the lag time after which the MSM will appear to be Markovian is to add more
states to the MSM. Adding more states will make the extent of each state smaller, therefore reducing
the lag time after which the dynamics will appear to be Markovian. However, by making the states
smaller, there will be more parameters in the MSM – the transition probabilities between the states.
By adding more states to the model, we are estimating more parameters with the same amount of
data, thus increasing the uncertainty in the parameters. The previous metrics for evaluating a MSM
do not take into account the amount of simulation data, and therefore this increased uncertainty in
the transition probabilities. Errors in estimating the transition probabilities, in turn, lead to errors in
the prediction of kinetic properties such as the rates of folding, as we will discuss in later chapters
(Chapters 5 and 6).
In this chapter, we introduce a maximum likelihood scoring function and a Bayesian scoring
function which are based on the scoring of Bayesian Networks to evaluate how appropriate a MSM
is for a given data set. We first introduce Bayesian Networks and scoring functions, and then show
how to use the scoring functions to compare between different MSMs. The maximum likelihood
and Bayesian scoring functions are tested on MSMs of a simple transition model between 9 conformations. We show that the scoring functions are able to select the correct 3-state decomposition of
the transition model. We also show how the Bayesian scoring function is better than the maximum
likelihood scoring function in determining the appropriate number of states in the decomposition,
which depends on both the amount of data and the computed transition probabilities. The Bayesian
scoring function is also used to determine which state in a MSM should be subdivided in order to
best improve the model. The scoring functions are then tested on state decompositions of the terminally blocked alanine peptide, and we show how they are able to provide more information about
the quality of the resulting MSMs than previous techniques.
4.2 Methods
In this section, we first introduce the basic formalism of a Bayesian Network (Sec. 4.2.1), including
techniques for estimating the parameters (Sec. 4.2.2) and different scoring functions (Sec. 4.2.3). We
then describe how a Markovian state model is transformed into a corresponding Bayesian Network
(Sec. 4.2.4) and how the scoring functions are used to compare different MSMs (Sec. 4.2.5).
4.2.1 Bayesian Networks
Assume that there exists a set of variables {X_1, X_2, ..., X_N}, each of which has possible values x_{1j} ∈ Val(X_1), x_{2j} ∈ Val(X_2), ..., x_{Nj} ∈ Val(X_N). Here, we adopt the notation of
an uppercase letter representing a variable and a lowercase letter representing possible values for
that variable. The variables {X_1, X_2, ..., X_N} have a joint probability distribution over them,
P(X_1, X_2, ..., X_N). A Bayesian Network (BN) is a way to compactly represent this joint distribution by assuming certain conditional independence relationships between the variables [HM81, Pea88].
Explicitly, a Bayesian Network is a directed acyclic graph, G, in which the probability of each
node is given solely in terms of its parents. If a variable X_i has parents Pa(X_i), then
P(X_i | X_1, ..., X_{i-1}, X_{i+1}, ..., X_N) = P(X_i | Pa(X_i)). For each variable X_i, the Bayesian Network has a set of parameters Θ_{X_i|Pa(X_i)}, where each parameter θ_{x_{ij}|π_i} defines the
probability that a certain value of X_i, x_{ij}, occurs given specific values π_i for all the parents.
Sometimes, the data set we are interested in does not consist of a fixed set of variables, but instead involves observations at different points through time. A Dynamic Bayesian Network (DBN)
can be used to represent this type of time-series data. In a DBN, at each time slice, we have some set
of variables whose joint probability distribution is represented as a Bayesian Network. In addition,
we can have edges between variables in different time slices, indicating how the variables evolve
through time, given previous values. The process modeled by a DBN is assumed to be stationary,
that is, the dependencies between time slices are independent of time.
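As a concrete illustration of this factorization (with made-up conditional probability tables), a two-node network X → Y stores only P(X) and P(Y | X), and the joint distribution follows from their product:

```python
import numpy as np

# Illustrative CPTs for a two-node Bayesian Network X -> Y.
p_x = np.array([0.6, 0.4])            # P(X)
p_y_given_x = np.array([[0.9, 0.1],   # P(Y | X = 0)
                        [0.2, 0.8]])  # P(Y | X = 1)

# The network represents the joint compactly: P(X, Y) = P(X) P(Y | X).
joint = p_x[:, None] * p_y_given_x
```

In a DBN the same idea recurs across time slices: P(Y | X) plays the role of the transition model, reused at every step because the process is assumed stationary.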
4.2.2 Parameter estimation in Bayesian Networks
Assume that we know the structure of a Bayesian Network (the set of variables and edges) and
that we want to estimate values for the parameters, Θ. Also assume that we are given complete
instances of the variables, drawn from their joint probability distribution. Let M be the number
of data instances we have, where each data instance consists of an assignment of values to all the
variables: {x_1[m], x_2[m], ..., x_N[m]}. We can summarize the data, D, with count variables, where
M[(···)] is the number of data instances in which (···) holds.
Maximum likelihood estimation
The likelihood of observing the data, D, given some value of the parameters, Θ, is simply the
product over all the variables of the product over all the data instances of the probability of observing
that instance of the variable:
L(\Theta : D) = \prod_{i=1}^{N} \prod_{m=1}^{M} \theta_{x_i[m] \mid \pi_i[m] \in Pa(X_i)} .   (4.1)
The maximum likelihood estimates of the parameters are simply those values which maximize the
likelihood function in Eq. 4.1. By grouping terms and using the count variables, the likelihood
function reduces to
L(\Theta : D) = \prod_{i=1}^{N} \prod_{\pi_i \in Pa(X_i)} \prod_{j=1}^{k_i} \theta_{x_{ij} \mid \pi_i}^{\,M[x_{ij}, \pi_i]} ,   (4.2)

where k_i is the number of possible instances of X_i, |Val(X_i)|. Typically, we take the natural
log of the likelihood function:
l(\Theta : D) = \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \sum_{j=1}^{k_i} M[x_{ij}, \pi_i] \ln \theta_{x_{ij} \mid \pi_i} .   (4.3)
The likelihood function decomposes by variable Xi , and by maximizing the log-likelihood, the
maximum likelihood estimates of the parameters Θ_{X_i|Pa(X_i)} are given as

\hat{\theta}_{x_{ij} \mid \pi_i \in Pa(X_i)} = \frac{M[x_{ij}, \pi_i]}{M[\pi_i]} ,   (4.4)

the fraction of instances where x_{ij} occurs, restricted to the case where the parents have value π_i.
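Eq. 4.4 is simply per-parent-configuration frequency counting. A small sketch with invented counts:

```python
import numpy as np

# Counts M[x_ij, pi_i] for a 3-valued variable X_i under two parent
# configurations pi_i (illustrative numbers, not from the dissertation).
counts = np.array([[40, 35, 25],    # parent configuration pi_i = 0
                   [10, 70, 20]])   # parent configuration pi_i = 1

# Eq. 4.4: theta_hat = M[x_ij, pi_i] / M[pi_i], row by row.
theta_hat = counts / counts.sum(axis=1, keepdims=True)
```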
Bayesian estimation
Instead of calculating a single value for the parameters in the Bayesian Network, it is possible
to ascribe a distribution over the parameters. Using Bayes’ rule, the probability of a particular
parameter is
P(\theta_{x_{ij} \mid \pi_i \in Pa(X_i)} \mid D) = \frac{P(D \mid \theta_{x_{ij} \mid \pi_i}) \, P(\theta_{x_{ij} \mid \pi_i})}{P(D)} ,   (4.5)

where P(\theta_{x_{ij} \mid \pi_i}) is some prior probability over the parameters.
A convenient choice for the prior distribution is the Dirichlet distribution, which is the conjugate
prior of the multinomial distribution from which our data is observed. The Dirichlet distribution
with variables p and parameters u is defined as
\mathrm{Dirichlet}(p; u) = \frac{1}{Z(u)} \prod_{i=1}^{K} p_i^{u_i - 1} ,   (4.6)

Z(u) = \frac{\prod_{i=1}^{K} \Gamma(u_i)}{\Gamma\left(\sum_{i=1}^{K} u_i\right)} ,   (4.7)

where Z(u) is a normalizing constant and Γ is the gamma function.
If we define the prior of the parameters \Theta_{X_i \mid \pi_i \in Pa(X_i)} as a Dirichlet distribution with parameters
\alpha_{x_{i1} \mid \pi_i}, \alpha_{x_{i2} \mid \pi_i}, \ldots, \alpha_{x_{ik_i} \mid \pi_i}, and we observe counts M[x_{i1}, \pi_i], M[x_{i2}, \pi_i], \ldots, M[x_{ik_i}, \pi_i], then
conjugacy gives the posterior distribution in closed form:

P(\Theta_{X_i \mid \pi_i \in Pa(X_i)} \mid D) = \mathrm{Dirichlet}\left(\Theta_{X_i \mid \pi_i};\; \alpha_{x_{i1} \mid \pi_i} + M[x_{i1}, \pi_i], \ldots, \alpha_{x_{ik_i} \mid \pi_i} + M[x_{ik_i}, \pi_i]\right) .   (4.8)
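Because the Dirichlet is conjugate to the multinomial, this update amounts to adding the observed counts to the prior parameters. A minimal sketch with invented numbers:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior parameters (assumed)
counts = np.array([12, 3, 5])       # observed counts M[x_ij, pi_i]

# The posterior is Dirichlet(alpha + counts); its mean is a smoothed
# alternative to the maximum likelihood estimate of Eq. 4.4.
posterior = alpha + counts
theta_mean = posterior / posterior.sum()
```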
4.2.3 Scoring of Bayesian Networks
If we have a Bayesian Network, we can evaluate how well it represents the data by calculating
different likelihoods of observing the data given the model. Below, we present two scoring functions
corresponding to the maximum likelihood and the marginal likelihood.
Maximum likelihood scoring function
One choice of scoring function is the maximum likelihood scoring function (scoreL ), which is the
maximum possible likelihood of the data given the Bayesian Network:
\mathrm{score}_L(G, D) = \max_{\Theta} \; l(\langle G, \Theta \rangle : D) .   (4.9)
The maximum likelihood occurs when the maximum likelihood parameters, Θ̂, are used, which
were defined above in Eq. 4.4:
\mathrm{score}_L(G, D) = l(\langle G, \hat{\Theta} \rangle : D) .   (4.10)
The maximum likelihood score is therefore equal to
\mathrm{score}_L(G, D) = \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \sum_{j=1}^{k_i} M[x_{ij}, \pi_i] \ln \hat{\theta}_{x_{ij} \mid \pi_i}

= \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \sum_{j=1}^{k_i} M[x_{ij}, \pi_i] \ln \frac{M[x_{ij}, \pi_i]}{M[\pi_i]} .   (4.11)
The major downside of using scoreL is that as more parameters and dependencies are added
to the Bayesian Network, the fit to the data will never decrease. Even if the underlying probability
distribution satisfies conditional independence between variables, it is unlikely that the empirical
data will also satisfy these independencies. While a more complicated model will provide a better
fit to the given data set, it is likely that it will overfit to the training data, and the model may lose its
ability to predict new data, as each parameter must be estimated with fewer data samples.
Bayesian scoring function
The maximum likelihood scoring function prefers more complicated models since it selects the
best parameters, Θ̂, to calculate the score. An alternative scoring function is the Bayesian score
(scoreB ), which uses the entire distribution over the parameters to calculate the marginal likelihood.
As opposed to Eq. 4.9, where the maximum likelihood values for Θ were chosen, in the Bayesian
score, we integrate over all possible values for Θ:
\mathrm{score}_B(G, D) = \ln \int_{\Theta} L(\langle G, \Theta \rangle : D) \, P(\Theta \mid G) \, d\Theta .   (4.12)
Substituting in the likelihood of observing the data, D, given the graph, G, and the parameters, Θ,
the Bayesian scoring function becomes
\mathrm{score}_B(G, D) = \ln \int_{\Theta} P(D \mid G, \Theta) \, P(\Theta \mid G) \, d\Theta

= \ln \int_{\Theta} \prod_{i=1}^{N} \prod_{\pi_i \in Pa(X_i)} \prod_{j=1}^{k_i} \theta_{x_{ij} \mid \pi_i}^{\,M[x_{ij}, \pi_i]} \; P(\Theta \mid G) \, d\Theta ,   (4.13)
where P (Θ|G) is some prior distribution of the parameters, given the graph, G.
For each set of parameters \Theta_{X_i \mid \pi_i \in Pa(X_i)}, if the prior distribution P(\Theta_{X_i \mid \pi_i \in Pa(X_i)} \mid G) is defined as a Dirichlet distribution with parameters \alpha_{x_{ij} \mid \pi_i}, as in Eq. 4.8, the above integral has a
closed-form solution [CH92]:

\mathrm{score}_B(G, D) = \ln \prod_{i=1}^{N} \prod_{\pi_i \in Pa(X_i)} \frac{\Gamma(\alpha_{X_i \mid \pi_i})}{\Gamma(\alpha_{X_i \mid \pi_i} + M[\pi_i])} \prod_{j=1}^{k_i} \frac{\Gamma(\alpha_{x_{ij} \mid \pi_i} + M[x_{ij}, \pi_i])}{\Gamma(\alpha_{x_{ij} \mid \pi_i})}

= \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \left[ \ln \frac{\Gamma(\alpha_{X_i \mid \pi_i})}{\Gamma(\alpha_{X_i \mid \pi_i} + M[\pi_i])} + \sum_{j=1}^{k_i} \ln \frac{\Gamma(\alpha_{x_{ij} \mid \pi_i} + M[x_{ij}, \pi_i])}{\Gamma(\alpha_{x_{ij} \mid \pi_i})} \right] ,   (4.14)

where \alpha_{X_i \mid \pi_i} \equiv \sum_{j=1}^{k_i} \alpha_{x_{ij} \mid \pi_i}.
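Numerically, this closed form is evaluated with log-gamma functions to avoid overflow. A sketch for a single variable/parent-configuration family of Eq. 4.14 (the prior and counts are invented):

```python
import numpy as np
from scipy.special import gammaln

def family_log_marginal(alpha, counts):
    # One [i, pi_i] term of Eq. 4.14:
    #   ln Gamma(a0)/Gamma(a0 + M) + sum_j ln Gamma(a_j + M_j)/Gamma(a_j)
    a0, m0 = alpha.sum(), counts.sum()
    return (gammaln(a0) - gammaln(a0 + m0)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

alpha = np.array([0.5, 0.5, 0.5])   # assumed Dirichlet prior
counts = np.array([8.0, 1.0, 1.0])  # assumed counts M[x_ij, pi_i]
log_ml = family_log_marginal(alpha, counts)
```

Sanity check: with a uniform Dirichlet(1, 1) prior and a single observation, the marginal likelihood is 1/2, so the function returns ln(1/2).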
Comparison of scores
If we have two different Bayesian Networks, we may wish to evaluate which is a better model for
the data. Often, this test is done in terms of a likelihood ratio, known as a Bayes factor when the
marginal likelihood is used. A likelihood ratio is simply the ratio of the probabilities of observing
each of the two events, which in our case are the probabilities that the given Bayesian Network
produced the data. If we assume that the two BNs are equally likely a priori, then
\mathrm{ratio} = \frac{P(BN_2 \mid D)}{P(BN_1 \mid D)} ,   (4.15)
where the probability is calculated either as the maximum likelihood (scoreL ) or as the marginal
likelihood (scoreB ). The ratio value gives how much more likely BN2 is to have generated the data
than BN1 . Since the scoring functions we have outlined calculate the natural log of the likelihood,
to compare the scores we simply take the natural log of Eq. 4.15:
\mathrm{logratio} = \ln P(BN_2 \mid D) - \ln P(BN_1 \mid D) .   (4.16)
For each of the scoring functions, the difference of the scores between the two BNs being compared gives the log of how much more likely the second BN is than the first to have
produced the data set. When comparing scores, we will use the normalized log ratio, where we
normalize by the number of data instances, M :
\mathrm{nlr} = \frac{\ln P(BN_2 \mid D) - \ln P(BN_1 \mid D)}{M} .   (4.17)
This normalized log ratio tells, on average, how much more likely (in natural-log units) the second BN is than the first on any particular data instance.
4.2.4 Markovian state models as Bayesian Networks
Assume we have a molecular dynamics trajectory of the form {c(0), c(1), c(2), . . .}, where c(t)
represents the conformation of the molecule at time t, and τ = 1 is the lag time between consecutive
observations. The conformation of the molecule may, for example, be represented as the spatial
coordinates of all of the atoms of the molecule. Estimating the dynamics between conformations is
difficult because the high dimensionality of the conformation space makes it unlikely that a given
conformation will be visited more than once in the trajectory data.
A Markovian state model attempts to reduce this dimensionality by mapping each conformation
of the molecule to one of kS discrete states:
s(t) = f(c(t)) ,   (4.18)
where f is the function of the Markovian state model which maps conformations to states. The goal
in a Markovian state model is to group conformations together into states such that the conformations within a state will transition between each other rapidly. When the transitions within a state
are faster than the transitions between states, the transitions between states are well approximated
as Markovian, or history-independent transitions.
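The estimation step that follows from this definition can be sketched as below (the toy trajectory is invented): a discretized state trajectory goes in, non-overlapping pair counts at lag τ come out, and row normalization yields the transition probability estimates.

```python
import numpy as np

def count_transitions(states, tau, n_states):
    # Count non-overlapping (s(t), s(t + tau)) pairs, as when each
    # data instance is treated as independent.
    C = np.zeros((n_states, n_states), dtype=int)
    for t in range(0, len(states) - tau, 2 * tau):
        C[states[t], states[t + tau]] += 1
    return C

traj = np.array([0, 0, 1, 1, 0, 0, 0, 1])   # toy state trajectory s(t)
C = count_transitions(traj, tau=1, n_states=2)
T = C / C.sum(axis=1, keepdims=True)        # Markov transition estimates
```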
We can represent a Markovian state model with the Dynamic Bayesian Network shown in Figure
4.1. In this network, the variables X and Y represent the state which the system is in at time points
separated by lag time τ . The variables A and B represent the conformation which the system is in
Figure 4.1: The Dynamic Bayesian Network corresponding to a Markovian state model. The variables X and Y represent the state of the system at consecutive time points and the variables A and B represent the corresponding conformations.
at the corresponding times. We assume the transitions between states are Markovian, and thus the
probability of state Y only depends on the previous state X. We also assume that the probability of
a conformation at a given time, A or B, is only dependent on the current state, X or Y respectively.
4.2.5 Comparison between different Markovian state models
If we have two different Markovian state models, M SM1 and M SM2 , each will correspond to
different functions f which map the conformations to states, f1 , and f2 . These functions may differ
both in the number of states, kS1 and kS2 , and in the mapping itself. Each of these Markovian
state models will correspond to different Dynamic Bayesian Networks. The goal of this work is to
determine which Markovian state model is more likely, given the data set, D. We therefore measure
how well each Dynamic Bayesian Network fits the data set, D, with the different scoring functions
described in Sec. 4.2.3.
The mapping from trajectory data to DBN data instance is relatively straightforward. We first
divide the trajectories into non-overlapping pairs of conformations separated by lag time τ , since we
assume that each data instance is independent. Then, for each pair of conformations, we calculate
the values of the corresponding state variables using the functions f1 and f2 . For M SM1 , we call
the corresponding DBN G1 , which has data instances
\{X = f_1(c(t)),\; Y = f_1(c(t + \tau)),\; A = c(t),\; B = c(t + \tau)\} ,   (4.19)
and for M SM2 , we call the corresponding DBN G2 , which has data instances
\{X = f_2(c(t)),\; Y = f_2(c(t + \tau)),\; A = c(t),\; B = c(t + \tau)\} .   (4.20)
Since the probability of a conformation given a state (P (A|X) or P (B|Y ) in Fig. 4.1) is independent of the time slice, we share one set of parameters, ΘC|S , instead of the two sets, ΘA|X and ΘB|Y .
The parameters ΘC|S are estimated using all the data instances {X[m], A[m]} ∪ {Y [m], B[m]}.
This reduces the number of parameters in the DBN and will give more precise estimates of the
probability of a conformation given a state.
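This parameter sharing amounts to pooling the (state, conformation) counts from both time slices before estimating Θ_{C|S}. A small sketch with invented toy data:

```python
import numpy as np

def pooled_cs_counts(x, a, y, b, n_states, n_confs):
    # Accumulate counts for the shared parameters Theta_{C|S} from the
    # instances {X[m], A[m]} and {Y[m], B[m]} together.
    C = np.zeros((n_states, n_confs), dtype=int)
    np.add.at(C, (x, a), 1)   # first time slice
    np.add.at(C, (y, b), 1)   # second time slice
    return C

x = np.array([0, 0, 1]); a = np.array([0, 1, 2])   # toy instances
y = np.array([0, 1, 1]); b = np.array([1, 3, 2])
C_cs = pooled_cs_counts(x, a, y, b, n_states=2, n_confs=4)
```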
Calculating the maximum likelihood score (scoreL ) for each of the two DBNs (G1 and G2 ) is
straightforward using Eq. 4.11, and is simply:
\mathrm{score}_L(G, D) = \sum_{x=1}^{k_S} M[x] \ln \frac{M[x]}{M} + \sum_{x=1}^{k_S} \sum_{y=1}^{k_S} M[y, x] \ln \frac{M[y, x]}{M[x]} + \sum_{s=1}^{k_S} \sum_{c=1}^{k_C} M[c, s] \ln \frac{M[c, s]}{M[s]} .   (4.21)
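Eq. 4.21 can be evaluated directly from count matrices. In the sketch below (both matrices invented), C_xy[x, y] holds M[y, x] and C_cs[s, c] holds M[c, s]; scipy's xlogy handles the 0 ln 0 = 0 convention for empty counts.

```python
import numpy as np
from scipy.special import xlogy

def score_L_msm(C_xy, C_cs):
    # Eq. 4.21: state, transition, and conformation terms.
    Mx = C_xy.sum(axis=1)    # M[x]
    M = Mx.sum()             # total data instances
    Ms = C_cs.sum(axis=1)    # M[s]
    return (xlogy(Mx, Mx / M).sum()
            + xlogy(C_xy, C_xy / Mx[:, None]).sum()
            + xlogy(C_cs, C_cs / Ms[:, None]).sum())

C_xy = np.array([[8, 2], [3, 7]])               # transition counts
C_cs = np.array([[6, 4, 0, 0], [0, 0, 3, 7]])   # conformation counts
score = score_L_msm(C_xy, C_cs)
```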
To calculate the Bayesian score (scoreB ) for each DBN, we must also define the prior distribution
over the parameters. The scoring function assumes the prior distribution is a Dirichlet distribution,
but we still need to select the prior parameters. The weight of a prior distribution is equal to the
sum of the prior distribution parameters. We could select a uniform distribution for the prior by
setting all prior parameter values to one, but then different DBNs may have different prior weights,
since a DBN with more states would have more prior parameters. Instead, we chose to use a BDe
prior [HGC95], so that the weight of the prior is the same over all DBNs. If we define some joint
probability distribution over the variables, P'(X_1, X_2, ..., X_N), and some total prior weight, M',
the BDe prior is defined as:
\alpha_{x_{ij} \mid \pi_i \in Pa(X_i)} = M' \, P'(x_{ij}, \pi_i) .   (4.22)
In our case, we calculated the prior parameters of the distributions of X and Y, α_X and α_{Y|X},
with prior weight M' = 1 and uniform distribution P'(X, Y). Therefore, the prior parameters for
the two DBNs are
\alpha_x^{G_1} = \frac{1}{k_{S1}} ; \qquad \alpha_{y \mid x}^{G_1} = \frac{1}{k_{S1}^2} ; \qquad \alpha_x^{G_2} = \frac{1}{k_{S2}} ; \qquad \alpha_{y \mid x}^{G_2} = \frac{1}{k_{S2}^2} .   (4.23)
We calculated the prior parameters of the distribution of the conformations, α_{C|S}, with prior weight
M' = 1 and the distribution P'(C, S) as uniform where s = f(c) and zero otherwise.
Once the prior is defined, we calculate the Bayesian score for each DBN by substituting these
prior parameter values into Eq. 4.14:
\mathrm{score}_B(G, D) = \ln \frac{1}{\Gamma(1 + M)} + \sum_{x=1}^{k_S} \sum_{y=1}^{k_S} \ln \frac{\Gamma(1/k_S^2 + M[y, x])}{\Gamma(1/k_S^2)}

+ \sum_{s=1}^{k_S} \left[ \ln \frac{\Gamma(\alpha_{C \mid s})}{\Gamma(\alpha_{C \mid s} + M[s])} + \sum_{c : f(c) = s} \ln \frac{\Gamma(\alpha_{c \mid s} + M[c, s])}{\Gamma(\alpha_{c \mid s})} \right]   (4.24)
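The state-transition part of Eq. 4.24 (in which the X terms cancel against the per-x normalizers, leaving 1/Γ(1 + M)) can be sketched with log-gamma functions; the conformational C|S term is evaluated analogously. The count matrix here is invented:

```python
import numpy as np
from scipy.special import gammaln

def score_B_states(C_xy):
    # ln 1/Gamma(1 + M) + sum_{x,y} ln Gamma(1/kS^2 + M[y,x]) / Gamma(1/kS^2),
    # the transition part of Eq. 4.24 under the BDe prior of Eq. 4.23
    # with total prior weight M' = 1.
    kS = C_xy.shape[0]
    M = C_xy.sum()
    a = 1.0 / kS**2
    return -gammaln(1.0 + M) + np.sum(gammaln(a + C_xy) - gammaln(a))

C_xy = np.array([[18, 2], [2, 18]])   # hypothetical transition counts
score = score_B_states(C_xy)
```

Sanity checks: for a single state (k_S = 1) the marginal likelihood is 1, so the score is 0, and with no data the score is likewise 0.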
4.2.6 Non-equilibrium data
All of the above analysis assumed that each data instance was selected from the joint equilibrium
distribution over all the variables. For molecular dynamics simulations, this assumption corresponds
to the assumption that the simulation is at equilibrium. However, simulations typically take a long
time to equilibrate, and it will be useful to use non-equilibrium data as well.
In the context of scoring the corresponding Dynamic Bayesian Networks, it is possible to model
non-equilibrium data using interventions. An intervention is when the value of one or more of the
variables, Xi , is assigned to a particular value Xi = xi . The Bayesian Network corresponding to a
data instance with an intervention simply removes all incoming edges to the variables whose values
were assigned and defines P (Xi = xi ) = 1.
To model non-equilibrium data, we assume that we assign the value of the first conformation
of each data instance (A in Fig. 4.1) instead of selecting it from its equilibrium distribution. This
corresponds to an intervention at the variable A, and we thus remove the edge from X to A in the
DBN shown in Fig. 4.1. The probability distribution associated with A is then defined as P (A =
a) = 1, where a is the specific conformation for a given data instance. Since the value of the
variable X is determined as a function of the first conformation, we also define the probability
P (X = x) = 1, where x = f (a).
We can now calculate the maximum likelihood and Bayesian scores of non-equilibrium data.
We remove the term corresponding to P (X), and we calculate the term P (C|S) using only the data
instances {Y [m], B[m]} as opposed to the set {X[m], A[m]} ∪ {Y [m], B[m]}. It is also possible
to calculate the scores of a mixture of equilibrium and non-equilibrium data [Pe’03].
Figure 4.2: The transition probabilities and state definitions for a simple model with 9 conformations. Self-transition probabilities are set so the outgoing probability for each conformation is equal to one. All other transition probabilities not specified are equal to zero. The conformations are grouped into three states, as shown by the dotted lines.
4.3 Results
4.3.1 Model system
To test the methods outlined above, we first use a simple transition model between 9 conformations
(Figure 4.2) which has been studied previously [SPS04a]. To generate data from this transition
model, we first select a conformation at random from its equilibrium distribution, and then select a
transition at random from the possible transitions from that conformation, with probabilities given
in Fig. 4.2. The transition probabilities given are the transition probabilities at lag time τ = 1.
To generate data at different lag times, we can generate sequential transitions and only record data
at intervals corresponding to τ , or equivalently, we can generate data from the transition matrix
raised to the power τ . To better represent molecular dynamics data, we will assume that each time
we visit ci , we generate a unique instance of that conformation, which is not equivalent to other
visits to ci (this is important when estimating ΘC|S ). While this model has transitions between 9
conformations, Swope et al. have shown that by grouping the conformations into 3 states (shown in
Fig. 4.2), the dynamics appear to be Markovian after some lag time [SPS04a].
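The data-generation procedure can be sketched as follows. Since the exact edge probabilities live in Fig. 4.2, the transition matrix below is an invented stand-in with 3 conformations; lag-τ data are drawn from T raised to the power τ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(T, pi, tau, M):
    # Draw M independent (x, y) pairs: x from the equilibrium
    # distribution pi, y from the tau-step matrix T^tau.
    T_tau = np.linalg.matrix_power(T, tau)
    x = rng.choice(len(pi), size=M, p=pi)
    y = np.array([rng.choice(len(pi), p=T_tau[i]) for i in x])
    return x, y

# Stand-in transition matrix at lag time tau = 1 (not the Fig. 4.2 values).
T = np.array([[0.98, 0.02, 0.00],
              [0.01, 0.97, 0.02],
              [0.00, 0.03, 0.97]])
# Equilibrium distribution: the left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

x, y = sample_pairs(T, pi, tau=10, M=1000)
```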
Comparison of different state decompositions
As described in Sec. 4.2.5, the scoring functions can be used to compare MSMs with different state
definitions. In this section, we focus only on the effect of the state definition. We compare the
scores of different possible state definitions with kS = 3 over the transition model shown in Fig.
4.2. We computed the scores for the 27 possible MSMs corresponding to all 3-state decompositions
of the conformations, where the conformations in a state had to be sequential. For lag times τ
ranging from 1 – 1,000 and total number of data instances M ranging from 1,000 – 50,000, we
generated independent data sets from the transition model at the given lag time. We then calculated
the maximum likelihood score and Bayesian score for each of the DBNs (Fig. 4.1) corresponding to
the 27 MSMs. Over this range of lag times and amount of data, the highest scoring MSM for both
the maximum likelihood and Bayesian scoring functions had the state definition:
{c1, c2, c3}, {c4, c5, c6}, {c7, c8, c9}.   (4.25)
The next highest MSMs were consistently the following state definitions:
• {c1, c2, c3}, {c4, c5}, {c6, c7, c8, c9}
• {c1, c2}, {c3, c4, c5, c6}, {c7, c8, c9}
• {c1, c2, c3}, {c4, c5, c6, c7}, {c8, c9}
• {c1, c2, c3, c4}, {c5, c6}, {c7, c8, c9}
The decomposition with the highest score for both scoring functions was the one designed such that
equilibration within each state would be faster than transitions out of the state [SPS04a]. Indeed, by
examining the eigenvectors of the transition matrix between the 9 conformations, we can see that the
conformations grouped together in Eq. 4.25 behave the same for the two longest time scales of the
system (data not shown). Thus, the scoring functions are able to select meaningful state definitions.
Comparing MSMs with different numbers of states
We now demonstrate how the scoring functions perform when comparing MSMs with different
numbers of states. Using the transition model between 9 conformations depicted in Fig. 4.2, we
define two Markovian state models over the conformations. The first MSM is defined such that the
first three conformations map to one state, the second three conformations map to a second state,
and the last three conformations map to a third state, as shown in Fig. 4.2, and as validated as the
best 3-state definition in the previous section. We call the DBN (Fig. 4.1) corresponding to this
MSM, G1 . The second MSM is simply defined such that each conformation maps to a unique state.
We call the DBN corresponding to this MSM, G2 .
We compare the scores of these two DBNs for differing lag times τ and total independent data
instances M . The top panel of Fig. 4.3 shows the normalized difference in maximum likelihood
scores for the two DBNs and the bottom panel shows the normalized difference in Bayesian scores,
as computed with Eq. 4.17. Values above zero indicate that DBN G2 , corresponding to the MSM
with 9 states, is that many orders of magnitude more likely to have produced an average data instance
than DBN G1 , the MSM with 3 states.
We see that the maximum likelihood score (top panel) always prefers the MSM with 9 states,
since the score difference is always greater than zero. The preference is stronger when the lag
time τ is short, compared to when it is longer. A lag time of τ corresponds to selecting transitions with
probabilities given by the transition matrix of Fig. 4.2 raised to the power τ. It is easy to verify
that as the lag time increases, the transition probabilities on conformations within the same state
in the 3-state MSM become more similar. Therefore, the relative difference of the probabilities
of the 3-state MSM and the 9-state MSM is smaller than at shorter lag times, when the transition
probabilities within each state are more different. The maximum likelihood score difference is
relatively insensitive to changes in the number of data instances.
Conversely, the Bayesian score (bottom panel) differs in preference between the 9-state MSM
and the 3-state MSM depending on both the lag time τ and the number of data instances M . The
contour indicating equal probability between the two MSMs is shown in bold. For a fixed amount
of data, the Bayesian score prefers the 9-state MSM at short lag times, but switches preference to
the 3-state MSM at longer lag times. As discussed above, this is because the transition probabilities
for the 3-state MSM are better approximations of the true transition probabilities between the 9
conformations at longer lag times, since the transition probabilities of conformations belonging to
a single state become more similar. In addition, the Bayesian score also depends on the number of
data instances. For a fixed lag time τ , the Bayesian score prefers the 3-state MSM for low amounts
of data and switches to the 9-state MSM at higher amounts of data. This is because, even though
the true underlying kinetic model can only be represented by the 9-state MSM, at small amounts
of data, the 3-state MSM is more predictive of future data instances since there are fewer parameters
which need to be estimated.
Figure 4.3: The difference in scores between a 9-state and 3-state definition of the transition model of 9 conformations for different lag times τ and number of data instances M. The top contour plot shows the average normalized score difference (Eq. 4.17) for the maximum likelihood score. The bottom contour plot shows the average normalized score difference for the Bayesian score, where the bold contour corresponds to zero. Both plots are averaged over 20 independent data sets for each lag time and number of data instances.
MSM        State definition
MSM_orig   {c1, c2, c3}      {c4, c5, c6}      {c7, c8, c9}
MSM_1      {c1}, {c2}, {c3}  {c4, c5, c6}      {c7, c8, c9}
MSM_2      {c1, c2, c3}      {c4}, {c5}, {c6}  {c7, c8, c9}
MSM_3      {c1, c2, c3}      {c4, c5, c6}      {c7}, {c8}, {c9}
Table 4.1: Four state definitions for the transition model between 9 conformations.
The Bayesian score is thus superior to the maximum likelihood score for determining an appropriate number of states in a state decomposition. While the maximum likelihood score will always prefer dividing the conformation space into more states, the Bayesian score will only prefer more states if there is sufficient data to justify adding more parameters to the DBN. When the transition probabilities from the suggested new states differ significantly from one another, as in the above case for short lag times τ, less data is needed to justify the subdivision; as the transition probabilities become more similar, as for longer lag times τ, more data is needed.
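For a single set of multinomial transition rows with a symmetric Dirichlet prior, both scores admit simple closed forms. The sketch below is illustrative only (the function names and example counts are not from the dissertation, and the Bayesian score is written here as the standard Dirichlet-multinomial log marginal likelihood):

```python
import math

def ml_score(counts):
    """Maximum log-likelihood of transition counts, using p_hat_ij = z_ij / n_i."""
    score = 0.0
    for row in counts:
        n = sum(row)
        for z in row:
            if z > 0:
                score += z * math.log(z / n)
    return score

def bayesian_score(counts, alpha=1.0):
    """Log marginal likelihood of the counts under a symmetric
    Dirichlet(alpha) prior on each row of transition probabilities."""
    score = 0.0
    for row in counts:
        K, n = len(row), sum(row)
        score += math.lgamma(K * alpha) - math.lgamma(K * alpha + n)
        for z in row:
            score += math.lgamma(alpha + z) - math.lgamma(alpha)
    return score

# Hypothetical transition counts for a 3-state model.
counts = [[8, 1, 1], [2, 7, 1], [1, 1, 8]]
L = ml_score(counts)
B = bayesian_score(counts, alpha=1.0)
```

Because the marginal likelihood averages the likelihood over the prior rather than maximizing it, the Bayesian score is always at most the maximum likelihood score; the gap grows with the number of parameters, which is the penalty described above.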
Determination of weakest state
By designing state decompositions which are hierarchical, we can determine which state in a given
decomposition should be subdivided in order to create a better decomposition. In this section, we
will show how this state corresponds to the least Markovian state which has sufficient data instances
to warrant further subdivision.
We define 4 MSMs over the transition model shown in Fig. 4.2. The first MSM, MSM_orig, maps the conformations to 3 states, as shown in Fig. 4.2 and as described in the previous sections. We also define three MSMs, each with 5 states, which subdivide one of the states into 3 states corresponding to the conformations. The MSMs which subdivide states s1, s2, or s3 are named MSM_1, MSM_2, or MSM_3, respectively. The state definitions are summarized in Table 4.1.
We compute the maximum likelihood and Bayesian scores of the DBNs (Fig. 4.2) corresponding to each of the 4 MSMs over a range of lag times τ and total numbers of data instances M. Figure 4.4 shows the normalized score difference between either MSM_1, MSM_2, or MSM_3 and MSM_orig. Scores greater than zero indicate the given subdivided MSM is preferred, and scores less than zero indicate the original MSM is preferred.
The top panels show the normalized score difference for a constant amount of data M = 10, 000
Figure 4.4: Comparison of MSMs corresponding to the subdivision of states. The normalized score
difference for either the maximum likelihood score (left panels) or the Bayesian score (right panels)
is shown for a constant number of data instances M = 10,000 (top panels) or a constant lag time
τ = 400 (bottom panels). Each line compares the listed MSM with MSM_orig. Each data point
shows the mean and standard deviation as calculated from 20 independent data sets of the given size
and lag time.
and varying lag time τ for the maximum likelihood score (left) and the Bayesian score (right). If
we were to compute the differences of the scores of the subdivided MSMs, MSM_1, MSM_2, and MSM_3, we would see that both scoring functions prefer MSM_1, followed by MSM_2, followed by MSM_3. This corresponds to the scoring functions wanting to subdivide states s1, s2, and s3 in
that order. By calculating the eigenvalues and eigenvectors of the transition matrix between all 9
conformations, we see that state s1 takes the longest time to equilibrate within the state, followed by
state s2 , followed by state s3 . The scoring functions are thus able to discern that subdividing state
s1 would improve the MSM the most, since s1 has the longest internal equilibration time, and thus
is the least Markovian.
The bottom panels of Fig. 4.4 show the normalized score difference for a constant lag time
τ = 400 and differing amounts of data. We see that the maximum likelihood score is relatively
insensitive to the amount of data, though at very low amounts of data the preference for the subdivided MSMs is slightly higher. This is because there are greater fluctuations in the empirical data, causing more deviation between the transition probabilities of conformations belonging to a single state in MSM_orig. The Bayesian score, on the other hand, prefers the original MSM for low amounts
of data and only prefers the subdivided MSMs when there is sufficient data to be able to correctly
parameterize the additional transition probabilities.
For this example, each state has the same equilibrium probability, and thus there are the same
number of data instances, on average, for transitions out of each of the three states. We can vary
the amount of data for each state by not selecting the first conformation (A in Fig. 4.1) from its
equilibrium distribution, and then model the resulting DBN using interventions as discussed in
Sec. 4.2.6. When we do this, the Bayesian score correctly varies the preference between the three
subdivided MSMs depending on the number of data samples from each state, while the maximum
likelihood score always retains the same preferences as in the equal data case (data not shown).
Thus, the Bayesian score is better at discriminating when it is possible to subdivide a state to produce
a better model.
4.3.2 Alanine peptide
We now evaluate how well the scoring functions perform on several previously generated state
decompositions of the terminally blocked alanine peptide. We use a data set consisting of 975
trajectories from the 400 K replica of a 20 ns/replica parallel tempering simulation with conformations stored every 0.1 ps [CSPD06b]. The peptide was modeled by the AMBER parm96 forcefield
[KDC+ 97], and solvated in TIP3P water [JCM+ 83]. Full simulation details can be found in Ref.
[CSPD06b].
The alanine peptide molecule is depicted on the left side of Fig. 4.5, with the main degrees of
freedom, the φ and ψ torsion angles, labeled. The middle panels show four previously computed
state decompositions for the alanine peptide, projected into the φ-ψ space (Chapter 3). The right
panels show the implied time scales as a function of lag time τ , and the metastability Q, computed
at lag time τ = 0.1 ps. The implied time scales are calculated from the non-unit eigenvalues of the
transition matrix as
τk = −
τ
,
ln λk
(4.26)
Figure 4.5: Several state decompositions for the terminally blocked alanine dipeptide. The φ and
ψ angles of the alanine peptide are shown in the left figure. Four state decompositions, MSM_1,
MSM_2, MSM_3, and MSM_4, are shown in the center panels, with the state definitions labeled for
MSM_1. The right panels show the implied time scales as a function of lag time for each MSM, as
well as the metastability Q, calculated at a lag time of τ = 0.1 ps.
where τ is the lag time, λk is the kth eigenvalue of the transition matrix, and τk is the corresponding
kth implied time scale [SPS04a]. If the dynamics over the states in the MSM are Markovian after
some lag time τint , then the implied time scales will be constant for τ > τint . The metastability Q
is calculated as the sum of the self-transition probabilities of the transition probability matrix.
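Both quantities are straightforward to compute from a transition matrix. A minimal sketch, assuming a row-stochastic transition matrix T with a real spectrum (the 3-state example matrix is hypothetical, not from the dissertation):

```python
import numpy as np

def implied_timescales(T, tau):
    """Implied time scales tau_k = -tau / ln(lambda_k) (Eq. 4.26) from the
    non-unit eigenvalues of a row-stochastic transition matrix T."""
    evals = np.sort(np.linalg.eigvals(T).real)[::-1]  # assume a real spectrum
    nonunit = evals[1:]                               # drop the unit eigenvalue
    return -tau / np.log(nonunit)

def metastability(T):
    """Metastability Q: the sum of the self-transition probabilities."""
    return np.trace(T)

# Hypothetical 3-state transition matrix at lag time tau = 1.
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
ts = implied_timescales(T, tau=1.0)
Q = metastability(T)
```

If the dynamics are Markovian at the chosen lag time, repeating this calculation at longer lag times should leave the implied time scales approximately constant.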
We compare the MSMs defined by the four 6-state decompositions depicted in Fig. 4.5: a manually defined good state decomposition (MSM_1), a manually defined poor state decomposition (MSM_2), an automatically generated state decomposition nearly equivalent to the good decomposition (MSM_3), and an automatically generated state decomposition which groups together two of the manually defined good states (3 and 4 as labeled in Fig. 4.5) and subdivides another of the
Figure 4.6: Comparison of different state definitions for the terminally blocked alanine peptide.
The left panels show the normalized maximum likelihood score difference for the four MSMs at
different lag times, and the right panels show the Bayesian scores. The differences are calculated
with respect to the average score over the four MSMs (top) or with respect to the average of the
three good MSMs, MSM_1, MSM_3, and MSM_4 (bottom).
manually defined good states (5 as labeled in Fig. 4.5) (MSM_4). We calculate the maximum likelihood and Bayesian scores for the DBNs (Fig. 4.1) corresponding to the four MSMs over a range of lag times. Figure 4.6 shows the maximum likelihood scores (left panels) and Bayesian scores (right panels) for the four MSMs.
We see that the manually defined poor state decomposition (MSM_2) scores much worse than the other state decompositions for both the maximum likelihood and Bayesian scores over all lag times. The scores of the other three state decompositions are more similar. This is consistent with the other metrics for evaluating state decompositions, the metastability and the implied time scales, as shown in Fig. 4.5.
For the three good state definitions, there are some variations in the preferences with lag time,
which are emphasized in the bottom panels of Fig. 4.6. For short lag times, both the maximum
likelihood and the Bayesian scores prefer MSM_1 and MSM_3 over MSM_4. By examining the eigenvectors at short lag times τ, we see that the average time to transition between manually defined states 3 and 4 (as labeled in Fig. 4.5) is approximately 1–2 ps. The preference for MSM_1 and MSM_3, which separate these states, over MSM_4, which groups these states together, indicates that at short lag times τ we have sufficient data to support separating these states in order to capture their different transition probabilities.
At longer lag times, the maximum likelihood scores and Bayesian scores disagree over which
is the preferred MSM. The maximum likelihood scores give equal preference to the three good
decompositions, while the Bayesian score prefers MSM_4 over MSM_1 and MSM_3. For a given data set, the number of independent data instances decreases with lag time, since we take non-overlapping pairs of conformations, separated by lag time τ, to ensure independence of the data
instances. At the longer lag times τ , the transition probabilities from manually defined states 3 and
4 are more similar, and we no longer have sufficient data to justify separating them into their own
states. The preferences determined by the Bayesian score take into account the amount of data,
and thus differ from the maximum likelihood score preferences when the number of data instances decreases.
The previous scoring metrics, the metastability and implied time scales shown in Fig. 4.5, were unable to distinguish between the manual decomposition MSM_1, the automatic decomposition MSM_3, and the automatic decomposition MSM_4. The maximum likelihood and Bayesian scoring functions, on the other hand, were able to resolve finer preferences between the MSMs, which can be validated by examining the eigenvectors of the transition matrix.
4.4 Conclusions
It is becoming common to study the dynamic properties of biomolecular systems through computer
simulations. From these simulations, it is possible to build a Markovian state model which assumes
Markovian transitions between discrete regions of the conformation space. In the previous chapters, we have given algorithms for automatically decomposing the conformation space into states
(Chapter 3), and methods for efficiently computing kinetic properties such as the average time for
the molecule to fold (Chapter 2).
The main intuition behind a Markovian state model is that if we define states such that the
conformations within a state transition between each other rapidly, then the dynamics are well approximated by a Markov chain over the states. If the conformations within a state interconvert
quickly, then their outgoing transition probabilities will become similar at lag times longer than
the time it takes for the interconversion. In this chapter, we showed how to convert a MSM into a
Dynamic Bayesian Network, and then how to use functions for scoring DBNs, the maximum likelihood scoring function and the Bayesian scoring function, to compare between different MSMs.
These functions give higher scores when the conformations within a state have similar transition
probabilities. The key advantage of the Bayesian scoring function over the maximum likelihood
scoring function is that the maximum likelihood scoring function always prefers MSMs which divide the conformation space into more states, while the Bayesian scoring function will only prefer
a MSM with more states if there is sufficient data to characterize the new states.
For a simple transition model between 9 conformations, we have shown how the scoring functions are able to select the best 3-state decomposition of the space. We have also shown how the
Bayesian scoring function can determine the appropriate number of states for a given amount of
data. One of the unresolved problems in performing a state decomposition (Chapter 3) is in knowing how small (in terms of number of member conformations) we can allow a state to become.
Previously, we bounded the lower size of a state based on some arbitrary number of conformations.
However, we can use the Bayesian scoring function to determine the optimal size of any given state.
We can generate a new MSM which subdivides any state into substates, and then, as we showed for
the model system, compare the Bayesian scores of the two MSMs, thus determining whether we
have sufficient data in that region of conformation space to support the new state definition.
We have also compared different state definitions of the terminally blocked alanine peptide.
Four state decompositions for the alanine peptide were previously defined, a manually defined good
decomposition, a manually defined poor decomposition, and two decompositions automatically created by the state decomposition algorithm (Chapter 3). The four decompositions were analyzed
previously using the implied time scales as a function of lag time and the metastability of the decomposition. These metrics found that the manually defined poor decomposition was worse than the
other three decompositions, which performed nearly equivalently. Using the maximum likelihood
and Bayesian scoring functions, we were better able to characterize these state decompositions.
The scoring functions gave preferences between the three good decompositions as a function of lag
time, which were validated by looking at the eigenvalues and eigenvectors of the transition matrices. The Bayesian scoring function additionally gave preferences based on the decreasing amount
of independent data with lag time.
This chapter has shown how to distinguish between different MSMs to determine which is the
“best” MSM for the data set. But, being the best MSM for the data set does not imply that the MSM
is adequate for studying the underlying problem of biological importance. In order to use a MSM to
calculate kinetic properties of the system, the transitions between the states need to be Markovian
at the lag time used to calculate the transition probabilities. We can calculate this lag time using
the implied time scale test previously mentioned [SPS04a]. If the MSM is not Markovian at the
lag time at which we wish to use the model, we can try to reduce this Markovian lag time τint by
subdividing states and testing the new state definitions using the Bayesian score as above. However,
if the Bayesian score prefers the original MSM, this indicates that we need more data in that region
of the conformation space. Integrating the state decomposition problem, model selection problem,
and simulation planning problem to build MSMs is the subject of future work.
Chapter 5
Error analysis methods
Once a Markovian state model (MSM) has been built for a given set of simulation data, there are
numerous properties which we can calculate from the model. In this chapter, we analyze the errors
in the model caused by finite sampling to the calculated mean first passage time (MFPT) from the
initial to the final states. We give different methods with various approximations to determine the
precision of the reported MFPTs. These approximations are validated on an 87 state toy Markovian
system. In addition, we propose an efficient and practical sampling algorithm that uses these error
calculations to build a MSM that has the same precision in mean first passage time values but
requires an order of magnitude fewer samples. We also show how these methods can be scaled to
large systems using sparse matrix methods.
5.1 Introduction
To meet the challenge of modeling the conformational dynamics of biological macromolecules over
long time scales, much recent effort has been devoted to constructing stochastic kinetic models, often in the form of discrete-state Markovian state models, from short molecular dynamics simulations
[SSP04, SPS04a, SPS+ 04b]. It is efficient to calculate kinetic properties such as the probability that
a conformation will fold (P_fold) or the average time taken for a given conformation to fold (MFPT)
from this type of model [SSP04]. The MSM also allows one to easily combine and analyze simulation data started from various conformations and naturally handles intermediate states and traps.
This approach has been applied to small protein systems [SSP04, SPS+ 04b], a non-biological polymer [EP04, EPP05b], and vesicle fusion [KKS+ 06] with good agreement with experimental rates.
While these kinetic models agree well with experiments, only a single value for the rate was
used in the comparison. It is also important to determine the uncertainty in this value, so one can
know the confidence of the results. One main source of error is caused by grouping conformations
into states and assuming that transitions between these states are Markovian. If we look at a protein
and consider each conformation as its own state, on the tens of picosecond and longer time scale,
the transitions follow a Markovian pattern. Unfortunately, sampling transitions between an infinite
number of states is impractical; therefore, the Markovian state model groups conformations into a
finite number of discrete states. However, it has been shown that if the conformations are grouped
incorrectly, the state space is no longer Markovian, and any analysis that assumes a Markovian process may produce incorrect results [SPS04a]. Even if the states are defined such that the transitions
between them are Markovian, the results could still be in error. This second source of error results
from the finite sampling of transitions between states, which gives uncertainties in the transition
probability estimates and in turn leads to uncertainties in the values we calculate, such as the MFPT.
There has been some recent work on error analysis in these kinetic models of a protein conformation space. Swope et al. focused on the problem of defining states which meet the Markovian
criteria. They provided tests for whether or not a given state space definition is history independent
[SPS04a]. Here, we will focus on the error caused by finite sampling. Some recent work has involved a Bayesian approach to sampling possible transition probability matrices and solving each
sample for the value of interest [SKH05]. While this approach can estimate errors, it does not scale
well for systems with large numbers of states. In addition, we want to determine the transitions that
contribute the most to the uncertainty, which the current techniques do not allow. If we can identify
these transitions, additional simulations can be started from them to increase the overall precision.
In this chapter, we will discuss novel methods for computing the error in a Markovian state
model for molecular dynamics caused by finite sampling of transitions. We will give different
methods for calculating the error from finite sampling and how it translates into errors in the mean
first passage time and other estimates. The methods employ a set of approximations and lead to an
efficient and practical closed-form solution for the uncertainty. We will also present a new sequential
sampling algorithm that uses these error estimates to improve the sampling efficiency by over an
order of magnitude. In addition, we discuss how the use of sparse matrix techniques will allow these
methods to scale to systems with large numbers of states. These algorithms are then tested and the
approximations validated on a toy Markovian system.
5.2 Methods
Molecular dynamics simulations are often used to understand protein kinetics. A question then
arises as to how best to analyze these molecular dynamics trajectories. In Chapter 3, we discussed
new methods for clustering conformations from the trajectories into discrete states, which tried to
ensure that the transitions between the states were Markovian, or history independent. The transition
probabilities between states were estimated by counting the number of times each transition was
observed in the trajectories. From this Markovian state model (MSM), it was possible to efficiently
calculate kinetic properties such as the P_fold and the mean first passage time (MFPT).
In this chapter, we are interested in determining the uncertainty in the kinetic results that can be
calculated from this graph-based model. We will assume that we can define a Markovian state space
for the protein, though forming states that meet this criterion is not a trivial task [SPS04a, CSP+ 07].
Even with this assumption, one can still have errors in the results. Since we can only finitely
sample the transitions between states, we will have some statistical uncertainty in the transition
probabilities. Therefore, any value we calculate from the transition probabilities will also have an
uncertainty associated with it.
In this section, we will first discuss how to calculate the MFPT from the transition probabilities.
We then derive the distribution for the transition probabilities, and define both sampling-based and non-sampling-based methods for calculating the distribution of the MFPT. From these error estimates,
we develop an efficient adaptive sampling technique designed to increase precision. Lastly, we will
show how to use sparse matrix manipulations that permit the scaling of these methods to systems
with large numbers of states.
5.2.1 Mean first passage times
In a Markovian state model, we represent the conformation space by K discrete states, each of
which corresponds to some distinct group of protein conformations. Let us define the probability
of transitioning from state i to state j at a time step of ∆t as pij . We also assume that these states
are Markovian, i.e., that the transitions between them are history independent at ∆t. We can use
the transition probabilities to calculate kinetic properties of the system such as the probability of
folding or mean first passage time to reach the final state. These quantities are defined by sets of
linear equations that are based on the transition probabilities.
For example, the equations for the mean first passage time, x, from any state to the final state,
are of the form

    x_i = Δt + Σ_{j=1}^{K} x_j p_ij ,   i ≠ K ;        x_K = 0 ,    (5.1)
where the Kth state represents the final state [SSP04]. Writing this in matrix form, we have



    [ p_11 − 1    p_12        ···   p_1K     ] [ x_1     ]   [ −Δt ]
    [ p_21        p_22 − 1    ···   p_2K     ] [ x_2     ]   [ −Δt ]
    [    ⋮                              ⋮     ] [   ⋮     ] = [   ⋮ ]
    [ p_(K−1)1    ···  p_(K−1)(K−1) − 1   p_(K−1)K ] [ x_(K−1) ]   [ −Δt ]
    [ 0           ···   0            1        ] [ x_K     ]   [  0  ]    (5.2)
where the last line is the boundary condition that the mean first passage time from the final state is
zero. The matrix on the left side of Eq. 5.2 will be referred to as A, with rows ai . We will use these
mean first passage time equations as an example throughout the chapter.
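A minimal sketch of solving this linear system, assuming a row-stochastic transition matrix in which the final state is absorbing (the 3-state matrix and the time step are hypothetical):

```python
import numpy as np

def mfpt(P, dt=1.0):
    """Solve Eq. 5.2: rows of (P - I) x = -dt for non-final states,
    with the boundary condition that the MFPT from the final state is 0.
    P is a K x K row-stochastic matrix; the last state is the final state."""
    K = P.shape[0]
    A = P - np.eye(K)
    b = -dt * np.ones(K)
    A[K - 1, :] = 0.0
    A[K - 1, K - 1] = 1.0   # last row: x_K = 0
    b[K - 1] = 0.0
    return np.linalg.solve(A, b)

# Hypothetical 3-state model; state 3 is the absorbing final state.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.0, 1.0]])
x = mfpt(P, dt=1.0)
```

Each solution component x_i satisfies Eq. 5.1 directly: the MFPT from state i is one time step plus the transition-probability-weighted average of the MFPTs from the successor states.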
5.2.2 Transition probability distribution
Finite sampling causes uncertainties in the estimates of the transition probabilities between states.
In this section, we derive a distribution over possible transition probability vectors. Define p*_ij as
the actual transition probability from state i to j at a time step of ∆t. The sum of the transition
probabilities from state i is equal to 1:
    Σ_{j=1}^{K} p*_ij = 1.    (5.3)
We do not know these actual transition probabilities, but we can estimate them by sampling transitions between states.
Since we make the assumption that our state space is Markovian, each transition sample originating from state i will be a random variable with K possible values occurring with probabilities
p∗ij for j = 1 . . . K. Define the transition count zij as the total number of transition samples which
start in state i and end in state j, and define ni as the total number of samples originating from state
i:
    Σ_{j=1}^{K} z_ij = n_i .    (5.4)
The distribution of the z_ij variables follows the multinomial distribution with parameters n_i, p*_i1, p*_i2, ..., p*_iK [JKB97]. From these transition counts, we can calculate the maximum likelihood
estimates of the transition probabilities, p̂ij , which, for the multinomial distribution, are simply the
number of transitions from state i to state j divided by the total number of transitions from state i
[JKB97]. Thus,
    p̂_ij = z_ij / n_i .    (5.5)
However, the maximum likelihood estimates give no indication of the uncertainties in the transition probabilities. Using these same transition counts, zij , and Bayesian analysis, we can compute
the distribution over all possible vectors of transition probabilities, as opposed to simply calculating
the most likely transition probability vector. Each set of possible transition probabilities, p_i1, p_i2, ..., p_iK, where 0 ≤ p_ij ≤ 1 and Σ_{j=1}^{K} p_ij = 1, has some chance of producing the transition counts, z_ij, that we observed. The probability of a particular vector p_i being the true transition probability vector, given the observed transition counts, is, from Bayes' rule,

    P(p_i | z_i) ∝ P(z_i | p_i) P(p_i) = p_i1^{z_i1} p_i2^{z_i2} ··· p_iK^{z_iK} P(p_i),    (5.6)
where P (pi ) is the prior probability over the transition probability vectors, i.e., the distribution of
transition probability vectors before observing any data.
A typical choice for the prior is the Dirichlet distribution, the conjugate prior of the multinomial
distribution. This means that if the prior, P (pi ), is a Dirichlet distribution, then the posterior,
P (pi |zi ) , is also a Dirichlet distribution [KBJ00]. The Dirichlet distribution with variables p and
parameters u is defined as
    Dirichlet(p; u) = (1 / Z(u)) ∏_{i=1}^{K} p_i^{u_i − 1} ,    (5.7)
where Z(u) is a normalizing constant defined in Appendix A. If we define the prior of the transition
probabilities as a Dirichlet distribution with parameters αi1 , αi2 , . . ., αiK and we observe transition
counts zi1 , zi2 , . . ., ziK , the posterior of the transition probabilities is a Dirichlet distribution with
parameters αi1 + zi1 , αi2 + zi2 , . . ., αiK + ziK . For notational convenience, we define the Dirichlet
counts as
    u_ij = α_ij + z_ij .    (5.8)
Therefore, assuming a Dirichlet prior, the distribution of the transition probabilities, pi , given the
observed data counts is Dirichlet(pi ; ui ).
Choosing the parameters for the prior completes the description of the distribution. The Dirichlet distribution is non-informative for any parameter αij = 0. If we set αij = 0 and do not observe
any transitions from state i to state j, the posterior of pij will always equal zero. However, for
molecular dynamics, not seeing a particular transition over some finite sampling does not imply that
the transition can never occur. So, we will restrict ourselves to positive priors. Possible choices for
the prior distribution are the uniform distribution, αi1 = αi2 = . . . = αiK = 1, and the symmetric
Dirichlet, αi1 = αi2 = . . . = αiK . In the limit, as the sampling (and therefore transition counts)
increases, the distribution of the transition probabilities will not depend on the choice of the prior
distribution, therefore making further calculations insensitive to the choice of prior distribution.
It will be useful to state the expected values of the posterior distribution of the transition probabilities for future reference, where wi is a normalizing weight variable [KBJ00]:
    p̄_ij = E(p_ij) = u_ij / w_i ,        w_i = Σ_{j=1}^{K} u_ij .    (5.9)
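The posterior update and its summaries can be sketched in a few lines using NumPy's Dirichlet sampler; the transition counts below are illustrative and assume a uniform prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition counts z_ij out of one state i (K = 3 states),
# with a uniform Dirichlet prior alpha_ij = 1.
z = np.array([8, 1, 1])
alpha = np.ones_like(z)
u = alpha + z            # Dirichlet counts, Eq. 5.8
w = u.sum()              # normalizing weight, Eq. 5.9
p_bar = u / w            # posterior mean E(p_ij), Eq. 5.9

# Draw possible transition-probability vectors from the posterior
# Dirichlet(p_i; u_i) implied by Eqs. 5.6-5.8.
samples = rng.dirichlet(u, size=10000)
```

Each drawn vector is a valid probability vector (non-negative, summing to one), and the spread of the samples around the posterior mean directly reflects the sampling uncertainty in the transition probabilities.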
5.2.3 Sampling based error analysis methods
In Chapter 2, we used p̂ij in Eq. 5.2 to calculate an estimate of the true values of the mean first
passage times, x. We now wish to calculate the distribution of x given the distribution of the pi
values. In particular, we are interested in the MFPT from the initial state to the final state, since this
value can be converted to the rate of folding and compared with experiments. Therefore, we are
interested in the distribution of the term x1 .
We propose a number of sampling-based methods for calculating the distribution of x1 . All of
these methods involve repeatedly generating a sample of transition probabilities and converting to
a sample of x1 . We can either sample the transition probabilities from the Dirichlet distributions or
from approximate multivariate normal (MVN) distributions. Then, we can either solve the above
set of linear equations or we can substitute into a first order Taylor series approximation to the set
of equations. Each of these four options will be described below, and can be combined to give four
sampling-based methods for calculating error, which are summarized below.
Sampling from the Dirichlet distribution
As shown in Sec. 5.2.2, if we assume a Dirichlet prior, the posterior distribution of pi , given the
transition counts, is a Dirichlet distribution with parameters ui , as defined in Eq. 5.8. A method for
generating samples from the Dirichlet distribution is given in Appendix A.
A sample of the A matrix, defined in Eq. 5.2, consists of K independent samples from Dirichlet
distributions, each corresponding to one row of the matrix. As was shown in Appendix A, each sample from a Dirichlet distribution takes expected time O(8KQ), where Q is the time to sample from
a normal distribution. Therefore, each sample of transition probabilities requires time O(8K²Q).
Sampling from a Multivariate Normal distribution
Sampling from the Dirichlet distribution is very expensive. In an attempt to reduce this cost, the
true Dirichlet distribution of the pi parameters is approximated by a multivariate normal distribution
(MVN). In addition, the MVN has some nice properties that we will exploit in Sec. 5.2.4.
If pi is distributed as Dirichlet(pi ; ui ), then by the central limit theorem, the distribution of
pi converges to a multivariate normal distribution [Rao73] with mean µi and covariance matrix Σi
given by
    μ_i = u_i / w_i ,    (5.10)

    Σ_i = [ w_i Diag(u_i) − u_i u_i^T ] / ( w_i² (w_i + 1) ) ,    (5.11)
where the superscript “T” denotes the transpose and Diag(ui ) represents a matrix with entries uij
along the diagonal. A method for creating samples of the pi variables from this approximate MVN
distribution is given in Appendix B.
For each sample of the A matrix, we must generate K independent samples from the MVN
distributions. As described in Appendix B, each sample of a MVN distribution requires time
O(KQ + K), where Q is the time to sample from a normal distribution. Therefore, each sample of transition probabilities takes time O(K²Q + K²). In addition, there is a one-time cost of O(K) for each of the K MVN distributions.
This approximation assumes that the central limit theorem holds and that the transition probabilities are well approximated by multivariate normal distributions. One drawback of the MVN
approximation is that it permits negative values of the transition probabilities. While these negative
values are invalid from a physical perspective, they generally do not affect the error calculations.
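A sketch of this approximation, computing μ_i and Σ_i from Eqs. 5.10 and 5.11 for illustrative Dirichlet counts u (the covariance is singular along the all-ones direction, so sampled vectors still sum to one, but individual entries can go slightly negative, as noted above):

```python
import numpy as np

def mvn_params(u):
    """Mean and covariance of the MVN approximation to Dirichlet(p; u),
    following Eqs. 5.10 and 5.11."""
    w = u.sum()
    mu = u / w
    cov = (w * np.diag(u) - np.outer(u, u)) / (w ** 2 * (w + 1))
    return mu, cov

u = np.array([20.0, 5.0, 5.0])   # illustrative Dirichlet counts for one state
mu, cov = mvn_params(u)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, cov, size=5000)
```

Because Σ_i · 1 = 0, the all-ones direction carries zero variance; the constraint Σ_j p_ij = 1 is preserved by construction up to floating-point noise.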
Solving sets of linear equations
Once we have generated a sample of transition probabilities and therefore the A matrix from either
Dirichlet or MVN distributions, we can simply solve Eq. 5.2 to find the MFPT vector. This can
be done by factoring the matrix into the form A = LU where L is a lower triangular matrix and
U is an upper triangular matrix with unit entries along the diagonal followed by forward and back
substitutions [GvL96]. The cost of this algorithm is (1/3)K^3 for the factoring plus O(K^2) for the substitutions.
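For illustration, here is a minimal Python sketch of this linear-algebra step on a hypothetical 3-state chain. SciPy's generic LU factorization stands in for the Crout-style factorization described above, and Eq. 5.2 is taken in the form (P - I)x = -Δt·1 with the MFPT pinned to zero at the absorbing state, matching Eq. 5.29:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

dt = 10.0  # time step, ns
# Hypothetical 3-state chain; state 2 is the absorbing (folded) state.
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.7, 0.2],
              [0.0, 0.0, 1.0]])

A = P - np.eye(3)          # (-I + P), as in Eq. 5.29
b = -dt * np.ones(3)       # each transition advances the clock by dt
A[2] = [0.0, 0.0, 1.0]     # boundary condition: MFPT from the folded state is 0
b[2] = 0.0

lu, piv = lu_factor(A)     # O((1/3)K^3) factorization
x = lu_solve((lu, piv), b) # O(K^2) forward and back substitution; x is the MFPT vector
```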
Taylor series approximation
Solving the set of linear equations for each sample directly is expensive. Instead, we can approximate the solution using a first order Taylor series expansion. The mean first passage time from the
initial state, as given by Eq. 5.2, depends on all the parameters aij as well as the parameter ∆t. We
assume for now that ∆t is a constant, but can modify this if we allow variable time steps in the
simulation data. Thus,
    x_1 = f(a_{11}, a_{12}, \ldots, a_{KK}) = f(A).    (5.12)
The function f does not, in general, have a simple form. We therefore perform a first order Taylor
series expansion of x_1 around the expected values of the parameters, as given in Eq. 5.9:
    x_1 = \bar{x}_1 + \Delta x_1
        = f(\bar{A}) + \frac{\partial f}{\partial a_{11}}\Big|_{\bar{A}} \Delta a_{11}
          + \frac{\partial f}{\partial a_{12}}\Big|_{\bar{A}} \Delta a_{12}
          + \cdots + \frac{\partial f}{\partial a_{KK}}\Big|_{\bar{A}} \Delta a_{KK},    (5.13)
where x̄1 ≡ f (Ā) and the ∆aij are small perturbations in the parameters. Thus,
    \Delta x_1 = \frac{\partial f}{\partial a_{11}}\Big|_{\bar{A}} \Delta a_{11}
               + \frac{\partial f}{\partial a_{12}}\Big|_{\bar{A}} \Delta a_{12}
               + \cdots + \frac{\partial f}{\partial a_{KK}}\Big|_{\bar{A}} \Delta a_{KK}.    (5.14)
Appendix C gives an efficient way for computing all the terms of the form ∂f /∂aij |Ā in Eq. 5.14.
Once we have generated a sample of the transition probabilities from either Dirichlet or MVN
distributions, we convert to a sample of aij using Eq. 5.2 and then a sample of ∆aij by subtracting
the expected values āij . We next substitute these values into Eq. 5.13 to find a sample from the
distribution of x_1. There are K^2 terms in Eq. 5.13, so the substitution will take time O(K^2) per sample. In addition, as shown in Appendix C, there is a one-time cost of O((1/3)K^3 + 3K^2) to
both solve for x̄1 = f (Ā) and generate the partial derivative terms in the Taylor series.
                         Dirichlet distribution                Multivariate normal distribution

Linear algebra           Method 1                              Method 2
                         N(8K^2 Q + (1/3)K^3)                  K^2 + N(K^2 Q + K^2 + (1/3)K^3)
                         = O(N K^3)                            = O(N K^3)
                         No assumptions                        Assumes central limit theorem holds;
                                                               permits negative transition probabilities

Taylor series            Method 3                              Method 4
approximation            (1/3)K^3 + K^2 + N(8K^2 Q + K^2)      (1/3)K^3 + K^2 + K^2 + N(K^2 Q + K^2 + K^2)
                         = O(K^3 + N K^2)                      = O(K^3 + N K^2)
                         Ignores higher-order terms            Assumes central limit theorem holds;
                         in Taylor series                      ignores higher-order terms in Taylor series
Table 5.1: Summary of sampling-based methods for calculating the error of the MFPT from the initial state due to sampling. Each cell gives the running time, where N is the number of samples, K is the number of states, and Q is the time taken to sample from a normal distribution, along with the advantages or assumptions of each method.
Sampling methods summary
Combining the techniques for sampling and the solution of the linear system given above, we get
four sampling-based methods for finding the uncertainty in the MFPT from the initial state. Table
5.1 shows the running times and various assumptions of each of the four sampling-based methods,
where Q is the time to sample from a normal distribution, and we assume that we take a total of N
samples to estimate the distribution of x1 .
Method 1 is similar to previously proposed methods for calculating error in these models [SKH05]. We have also introduced new methods that use different approximations and improve the running time of the error analysis.
5.2.4 Non-sampling based error analysis method
The methods in Sec. 5.2.3 all relied on sampling possible transition probabilities from either the
Dirichlet or MVN distributions to get samples from the distribution of x1 . If we make both the
MVN approximation and the Taylor series expansion, we can derive a closed-form representation
for the distribution of x1 .
First, we will rewrite Eq. 5.14 by grouping K terms at a time as

    \Delta x_1 = \left[ \frac{\partial f}{\partial a_{11}}\Big|_{\bar{A}} \cdots \frac{\partial f}{\partial a_{1K}}\Big|_{\bar{A}} \right]
                 \begin{bmatrix} \Delta a_{11} \\ \vdots \\ \Delta a_{1K} \end{bmatrix}
                 + \cdots +
                 \left[ \frac{\partial f}{\partial a_{K1}}\Big|_{\bar{A}} \cdots \frac{\partial f}{\partial a_{KK}}\Big|_{\bar{A}} \right]
                 \begin{bmatrix} \Delta a_{K1} \\ \vdots \\ \Delta a_{KK} \end{bmatrix}.    (5.15)
For notational convenience, we define the above vectors as the sensitivity si and deviation ∆ai ,
    s_i^T = \left[ \frac{\partial f}{\partial a_{i1}}\Big|_{\bar{A}} \cdots \frac{\partial f}{\partial a_{iK}}\Big|_{\bar{A}} \right],
    \qquad
    \Delta a_i^T = \left[ \Delta a_{i1} \cdots \Delta a_{iK} \right].    (5.16)
Therefore,
    \Delta x_1 = \sum_{i=1}^{K} s_i^T \Delta a_i.    (5.17)
The vector ∆ai is equal to ai − āi and, with the MVN approximation, has mean 0 and covariance
matrix Σi given by Eq. 5.11. As described in Appendix B, linear combinations of MVN distributions are also MVN distributions, and Eq. B.3 gives that ∆x1 is distributed as normal with mean 0
and variance σ 2 , where
    \sigma^2 = \sum_{i=1}^{K} s_i^T \Sigma_i s_i.    (5.18)
Substituting Eq. 5.11 for Σi , we see that
    \sigma^2 = \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \, s_i^T \left[ w_i \,\mathrm{Diag}(u_i) - u_i u_i^T \right] s_i
             = \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left[ w_i \, s_i^T \mathrm{Diag}(u_i) s_i - (s_i^T u_i)(u_i^T s_i) \right].    (5.19)
Therefore, x1 , which equals x̄1 + ∆x1 , has a normal distribution with mean x̄1 and variance given
by Eq. 5.19. These closed-form expressions for the mean and variance of x1 give the closed-form
distribution of x1 .
In addition, the distribution of x_1 can be computed efficiently. As described in the Taylor series expansion section (Sec. 5.2.3), we can calculate x̄_1 and all the partial derivative terms in the sensitivity vectors in time O((1/3)K^3 + 3K^2). Since the variance is a sum of vector dot products (rather than matrix-vector products), we can calculate the sum in Eq. 5.19 in time O(K^2). The running time for calculating the distribution of x_1 is thus O((1/3)K^3 + 4K^2), a large improvement over the running time of any of the sampling-based methods.
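The entire closed-form computation fits in a few lines. The sketch below assumes the MFPT equations in the form used earlier ((P̄ - I)x = -Δt·1 with an absorbing final state) and obtains the sensitivities s_i from a single adjoint solve, which is one way to realize the Appendix C computation; the count matrix is hypothetical:

```python
import numpy as np

def mfpt_mean_and_variance(u, dt=10.0):
    """Closed-form mean and variance of the MFPT from state 0 (Eqs. 5.9, 5.18-5.19).

    u: K x K Dirichlet counts (prior + observed transitions); the last state
    is treated as the absorbing boundary.
    """
    K = u.shape[0]
    w = u.sum(axis=1)
    P = u / w[:, None]                     # expected probabilities p_bar (Eq. 5.9)

    A = P - np.eye(K)                      # (-I + P_bar)
    b = -dt * np.ones(K)
    A[-1] = 0.0; A[-1, -1] = 1.0; b[-1] = 0.0   # boundary condition
    x = np.linalg.solve(A, b)              # x_bar = f(A_bar)

    # One adjoint solve gives every sensitivity: df/da_ij = -y_i x_j, A^T y = e_0.
    y = np.linalg.solve(A.T, np.eye(K)[0])
    S = -np.outer(y, x)                    # S[i] is the sensitivity vector s_i

    var = 0.0
    for i in range(K - 1):                 # the boundary row carries no uncertainty
        s, ui, wi = S[i], u[i], w[i]
        var += (wi * s @ (ui * s) - (s @ ui) ** 2) / (wi**2 * (wi + 1))
    return x[0], var

u = np.array([[20.0,  4.0, 1.0],
              [ 3.0, 15.0, 5.0],
              [ 0.1,  0.1, 0.1]])          # hypothetical counts; state 2 is absorbing
mean, var = mfpt_mean_and_variance(u)
```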
5.2.5 Adaptive sampling algorithm
To generate molecular dynamics data, the typical method is either to start all simulations from one
state, or to generate a representative set of starting conformations from, for example, high temperature unfolding [DL93] or replica exchange [SO99], and start a number of simulations from each
of these conformations. If our trajectories are sufficiently long or we have enough trajectories from
relevant conformations, we can hope to sample all the important transitions. In the framework of a Markovian state model with a defined state space, however, we are sampling transitions with no guidance as to which ones are more or less uncertain given our current data.
In this section, we present an algorithm that uses the error analysis techniques described in the
previous sections to achieve much higher precision in the quantities of interest for the same number
of total simulations. In addition to the total error in the values, which we have already discussed, we
also want to determine the main contributors to this error so that we can selectively add simulations
to those regions that give rise to the greatest uncertainties. One advantage of the Taylor series
methods (sampling based methods 3 and 4 and the non-sampling based method) outlined above is
that they naturally decompose the contribution of each element in the matrix to the variation in x1 .
Since the elements in the same row of A are not independent, we look at the combined contributions
associated with each row. Each row of A corresponds to transitions from a single state, so if we find
that one row contributes the most to the uncertainty in x1 , we can decrease the uncertainty from that
row by generating new transitions from that state.
In principle, we could also find the main error contributors directly from the set of linear equations using techniques such as analysis of variance and statistical design of experiments [BHH78].
However, these are computationally expensive when the problem dimension is high. In this section, we will focus only on the non-sampling based method for calculating the error. First, we will
show how to minimize the variance given the actual transition probability matrix. In practice, we do
not know this matrix, but it will be useful to compare the variance achieved by different sampling
algorithms to this “optimal” variance.
Assume that we know the actual transition probability matrix, P∗ . With a total of M simulations, we can calculate the optimal allocation of simulations per row that minimizes the variance in
the MFPT from the initial state. If we allocate wi simulations to row i, the expected counts for that
row are u∗ij = p∗ij wi . Substituting into Eq. 5.19, the variance of x1 is
    \sigma^2 = \sum_{i=1}^{K} \frac{\nu_i^*}{w_i + 1},
    \qquad
    \nu_i^* = \frac{1}{w_i^2} \, s_i^T \left[ w_i \,\mathrm{Diag}(u_i^*) - u_i^* u_i^{*T} \right] s_i
            = s_i^T \left[ \mathrm{Diag}(p_i^*) - p_i^* p_i^{*T} \right] s_i,    (5.20)
where we separate out the νi∗ terms which do not depend on the allocation of simulations, wi .
We can minimize the quantity σ 2 with respect to the variables wi subject to the constraint that
the total number of simulations is equal to M ,
    \sum_{i=1}^{K} w_i = M.    (5.21)
Solving this minimization problem gives
    w_i = \frac{(M + K)\sqrt{\nu_i^*}}{\sum_{j=1}^{K} \sqrt{\nu_j^*}} - 1.    (5.22)
If we know the transition probability matrix and make the MVN and Taylor series approximations,
Eq. 5.22 gives the optimal number of simulations per row that minimizes the variance of the MFPT
from the initial state. Strictly speaking, the variables wi should be constrained to be positive integers.
However, for large sample size M , we can round the values to get good approximations.
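Eq. 5.22 is straightforward to evaluate; a sketch in Python with hypothetical ν_i* values, noting that the allocations sum to M by construction:

```python
import numpy as np

def optimal_allocation(nu, M):
    """Optimal number of simulations per row (Eq. 5.22)."""
    root = np.sqrt(nu)
    return (M + len(nu)) * root / root.sum() - 1.0

nu = np.array([4.0, 1.0, 0.25])      # hypothetical per-row variance factors nu_i*
w = optimal_allocation(nu, M=100)    # allocate 100 simulations across 3 rows
# Rows with larger nu_i* receive more simulations; round to integers in practice.
```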
However, in general, we do not know the true transition probability matrix. We now outline an
adaptive sampling algorithm that attempts to approximate the optimal number of simulations per
row. Assume that we have observed some transitions and have the Dirichlet transition counts of uij .
From Eq. 5.19, the variance of x1 is
    \sigma^2 = \sum_{i=1}^{K} \frac{\bar{\nu}_i}{w_i + 1},
    \qquad
    \bar{\nu}_i = s_i^T \left[ \mathrm{Diag}(\bar{p}_i) - \bar{p}_i \bar{p}_i^T \right] s_i,    (5.23)
since p̄ij = uij /wi . This equation is similar to Eq. 5.20, but we have replaced the actual values of
the transition probabilities, p∗i , with the expected values of the transition probabilities, p̄i .
Our goal is to decrease the variance of x1 , σ 2 , given a total number of simulations, M . Since
the expected values of the transition probabilities are just estimates of the actual values, they may
change as we add new simulations. Therefore, we will use these estimates to start a few new
simulations, re-evaluate the expected values, and repeat.
Let us assume that with our current expected value estimates, we can start m more simulations
from any states. In the simplest implementation of the adaptive sampling algorithm, we will start
all m simulations from the same state j, but we could easily modify this using the analysis above to
start a total of m simulations from different states. The expected values of the transition probabilities
may change after these m simulations, but our best guess for them is the current expected values.
Therefore, the only term in Eq. 5.23 that changes with these additional simulations is the term
corresponding to the jth row. The expected change of variance in x1 is
    \Delta\sigma^2 = \frac{\bar{\nu}_j}{w_j + m + 1} - \frac{\bar{\nu}_j}{w_j + 1},    (5.24)
(5.24)
which is simply the difference of the jth term with m additional simulations and the original jth
term. Using the above equation, we calculate the expected decrease in variance caused by adding
m more simulations to any specified row, select the row that reduces the variance the most, and
start m more simulations from that state. Repeating this process, we adaptively add samples to our
transition counts.
The adaptive sampling algorithm is given as Algorithm 1. The tolerance criteria for the while
loop could be that the total number of simulations is less than some maximum (as we motivated
above), the total variance σ 2 is larger than some tolerance, or the decrease in σ 2 is larger than
some value. Using this algorithm, we either decrease the total variance with the same number of
simulations, or decrease the number of simulations necessary for a given precision.
Algorithm 1 The adaptive sampling algorithm
1: Generate initial simulations and transition counts
2: while some tolerance criteria do
3:     ν̄_i ← s_i^T [Diag(p̄_i) − p̄_i p̄_i^T] s_i
4:     best ← argmax_j ( ν̄_j/(w_j + 1) − ν̄_j/(w_j + m + 1) )
5:     Start m more simulations from state best
6: end while
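A self-contained Python sketch of Algorithm 1, exercised against a hypothetical 4-state chain that stands in for the true transition matrix. The sensitivities come from the adjoint solve used by the non-sampling method, and all numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, m, K = 10.0, 30, 4       # time step, simulations per round, number of states

# Hypothetical true chain; state 3 is absorbing.
P_true = np.array([[0.90, 0.08, 0.02, 0.00],
                   [0.05, 0.85, 0.08, 0.02],
                   [0.00, 0.10, 0.80, 0.10],
                   [0.00, 0.00, 0.00, 1.00]])

counts = np.full((K, K), 1.0 / K)            # symmetric Dirichlet prior
for i in range(K - 1):                       # initial simulations: 10 per row
    counts[i] += rng.multinomial(10, P_true[i])

def row_variances(counts):
    """nu_bar_i (line 3 of Algorithm 1) and the current row weights w_i."""
    w = counts.sum(axis=1)
    P = counts / w[:, None]
    A = P - np.eye(K); b = -dt * np.ones(K)
    A[-1] = 0.0; A[-1, -1] = 1.0; b[-1] = 0.0
    x = np.linalg.solve(A, b)
    y = np.linalg.solve(A.T, np.eye(K)[0])
    S = -np.outer(y, x)                      # sensitivity vectors s_i
    nu = np.array([S[i] @ (P[i] * S[i]) - (S[i] @ P[i]) ** 2 for i in range(K)])
    return nu, w

for _ in range(50):                          # adaptive rounds
    nu, w = row_variances(counts)
    gain = nu / (w + 1) - nu / (w + m + 1)   # expected variance decrease (Eq. 5.24)
    best = int(np.argmax(gain[:K - 1]))      # never sample the absorbing state
    counts[best] += rng.multinomial(m, P_true[best])

nu, w = row_variances(counts)
variance = (nu[:K - 1] / (w[:K - 1] + 1)).sum()   # Eq. 5.23
```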
5.2.6 Extension to large systems
As the simulated system becomes large, or if we include spatial degrees of freedom in the MSM,
the number of states required may become unwieldy. In these cases, both the storage requirements
and the cost associated with the linear algebra of Eq. 5.2 become prohibitive.
In general, since we are looking at molecular dynamics at a small time step, we will not see
transitions between all pairs of states. We only expect to see transitions between states that are sufficiently close conformationally to move between each other in time ∆t. For this reason, we expect
that the observed transition counts, zij , will be sparse, i.e., most of them will equal zero. However,
the Dirichlet distribution of the transition probabilities also depends on the prior probability distribution, which may not be sparse. In this section, we will describe how to maintain the sparsity of
the transition counts in our calculations, even with a dense prior, since sparse matrix calculations
are much more efficient in terms of both storage and computation.
First we show how to decompose the transition probabilities into a dense term and a sparse term
and how to write this as a bordered sparse matrix [BR74]. Then, we show how to efficiently solve
this system using sparse matrix techniques. We also discuss the implementation of the adaptive
sampling algorithm using similar techniques.
We will focus first on solving Eq. 5.2 at the expected transition probability values p̄ij . In our
system, the matrix Z, with elements zij , is sparse. Assume that the prior probabilities αij are
symmetric for each row.
    \alpha_{11} = \alpha_{12} = \cdots = \alpha_{1K} = c_1,
    \alpha_{21} = \alpha_{22} = \cdots = \alpha_{2K} = c_2,
      \vdots
    \alpha_{K1} = \alpha_{K2} = \cdots = \alpha_{KK} = c_K.    (5.25)
We can represent this prior compactly as the product of two vectors,
    \alpha = c \mathbf{1}^T,    (5.26)
where 1 is a column vector with all unit entries. Each expected transition probability value, p̄ij , is
defined by Eq. 5.9 as
    \bar{p}_{ij} = \frac{z_{ij} + \alpha_{ij}}{w_i},
    \qquad
    w_i = \sum_{j=1}^{K} (z_{ij} + \alpha_{ij}).    (5.27)
Therefore, the expected values of the transition probabilities are
    \bar{P} = W^{-1}(Z + \alpha),    (5.28)
where W is a diagonal matrix with entries wi along the diagonal. Eq. 5.2 can be rewritten as
    (-I + \bar{P}) x = b,
    (-I + W^{-1}(Z + c \mathbf{1}^T)) x = b,    (5.29)
where I is the identity matrix and Z is sparse. Technically, we need the last row of the matrix to
correspond to the boundary condition. This does not change the sparse structure of the matrix, so
we will ignore it for notational simplicity.
The matrix in Eq. 5.29 is generally dense. However, noting that it is a rank one update of a
matrix with sparse structure, Z, simple algebra gives the augmented system
    \begin{bmatrix} F & c \\ \mathbf{1}^T & -1 \end{bmatrix}
    \begin{bmatrix} x \\ y \end{bmatrix}
    =
    \begin{bmatrix} W b \\ 0 \end{bmatrix},    (5.30)
where
    F = Z - W,    (5.31)
and F is a sparse matrix. We have thus shown how to rewrite Eq. 5.2 as a bordered sparse matrix.
Using standard LU decomposition to solve the system of equations in Eq. 5.30 takes time O((1/3)K^3). However, using sparse matrix techniques, it is possible to store only the nonzero
entries and solve for the LU factors of the matrix F efficiently, where both L and U are sparse
[DER86]. The complexity of solving the sparse matrix is very system dependent; however, the running time is much less than O((1/3)K^3). Appendix D shows how to efficiently solve the system
given in Eq. 5.30 by using the LU factors of F. Thus, we can leverage sparse matrix algorithms
even with a dense prior.
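The construction can be checked numerically. The sketch below builds a small hypothetical chain, forms the bordered system of Eq. 5.30 with SciPy's generic sparse LU solver (standing in for the Appendix D procedure), and compares it against a dense solve of Eq. 5.29; the last rows are replaced by the boundary condition, as noted above:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

dt, K = 10.0, 5
# Sparse observed counts for a hypothetical 5-state chain; state 4 is absorbing.
Z = np.zeros((K, K))
for i in range(K - 1):
    Z[i, i] = 60.0
    Z[i, max(i - 1, 0)] += 20.0
    Z[i, i + 1] += 20.0
Z[K - 1, K - 1] = 100.0

c = np.full(K, 1.0 / K)            # dense symmetric prior, alpha = c 1^T
w = Z.sum(axis=1) + K * c          # row normalizers w_i (Eq. 5.27)
b = -dt * np.ones(K)

# Bordered system (Eq. 5.30): [[F, c], [1^T, -1]] [x; y] = [W b; 0], with the
# last row overwritten by the boundary condition x_{K-1} = 0.
F = Z - np.diag(w)                 # Eq. 5.31
rhs = w * b
F[K - 1] = 0.0; F[K - 1, K - 1] = 1.0
c_col = c.copy(); c_col[K - 1] = 0.0
rhs[K - 1] = 0.0

M = sp.vstack([sp.hstack([sp.csr_matrix(F), sp.csr_matrix(c_col[:, None])]),
               sp.csr_matrix(np.append(np.ones(K), -1.0)[None, :])]).tocsc()
sol = splu(M).solve(np.append(rhs, 0.0))
x_sparse = sol[:K]                 # MFPT vector; sol[K] is the auxiliary y = 1^T x

# Dense check: solve (-I + W^{-1}(Z + c 1^T)) x = b directly (Eq. 5.29).
A = (Z + np.outer(c, np.ones(K))) / w[:, None] - np.eye(K)
bd = b.copy(); A[K - 1] = 0.0; A[K - 1, K - 1] = 1.0; bd[K - 1] = 0.0
x_dense = np.linalg.solve(A, bd)
```

In a real application F would be stored sparsely from the start; it is built densely here only to keep the comparison short.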
Recall that this decomposition is possible because the expected values of the transition probabilities can be separated into a sparse term and a dense term. We cannot use these sparse matrix
schemes for methods 1 and 2 outlined above. In those cases, we need to solve the system of linear
equations after generating a sample of the transition probabilities, which is unlikely to be the sum of
a sparse term and a low-rank dense term. However, we can use these schemes for the Taylor series
methods, since they rely on solving Eq. 5.2 at the expected values of the transition probabilities, p̄ij ,
which is what we have outlined above. This reduces the time for finding x̄1 = f (Ā) and the partial
derivative terms in the Taylor series from O(K^3) to O(K^2 + sparse matrix time). In particular, this reduces the running time of the non-sampling based method to O(K^2 + sparse matrix time).
The above discussion assumes a symmetric prior, but we can generalize these results to any
prior that is the outer product of two vectors. Specifically, we can have a prior that is the Boltzmann
probability of transitioning between two states as given by their energy difference, which may be a
more natural choice of prior parameters for molecular dynamics:

    \alpha \propto \begin{bmatrix} e^{E_1/T} \\ \vdots \\ e^{E_K/T} \end{bmatrix}
    \cdot \begin{bmatrix} e^{-E_1/T} & \cdots & e^{-E_K/T} \end{bmatrix}.    (5.32)
In addition to using sparse matrices for the error analysis, we can also use these techniques
during the adaptive sampling algorithm. Since each iteration of the adaptive sampling algorithm
only adds simulations from a single state i, only the ith row of the matrix F, as defined by Eq. 5.31,
is updated in this iteration. Say we observe z'_{ij} new transitions from state i to each state j. The new ith row of the matrix F will now equal

    f'_{ij} = \begin{cases}
    z_{ij} + z'_{ij}, & i \neq j, \\
    z_{ii} + z'_{ii} - w_i - \sum_{j=1}^{K} z'_{ij}, & i = j.
    \end{cases}    (5.33)
We can represent this change as a rank one update to F,

    F' = F + e_i \cdot \Big[ z'_{i1} \;\cdots\; z'_{i(i-1)} \;\; z'_{ii} - \sum_{j=1}^{K} z'_{ij} \;\; z'_{i(i+1)} \;\cdots\; z'_{iK} \Big],    (5.34)
and use the previously described techniques for converting to a bordered matrix to reuse the LU
factors and reduce the computation time. After some number of adaptive iterations, it will be worthwhile to add the updated counts to the F matrix and re-factor this matrix, since each update increases
the size of the system by one and the updated rows and columns of the factors are generally dense.
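The update of Eqs. 5.33-5.34 touches a single row, which the following sketch verifies on random stand-in data (the counts are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
K, i = 4, 1
F = rng.normal(size=(K, K))                        # stand-in for Z - W (Eq. 5.31)
z_new = rng.integers(0, 5, size=K).astype(float)   # new transitions out of state i

# Rank-one update (Eq. 5.34): row i gains z'_ij off the diagonal, and the
# diagonal additionally absorbs the increase of w_i by the total new counts.
v = z_new.copy()
v[i] -= z_new.sum()
F_new = F + np.outer(np.eye(K)[i], v)
```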
5.3 Results
The error analysis methods presented in this paper assume that the defined state space is Markovian.
For molecular systems, it is difficult to define states that meet this criterion. Though there are tests
for Markovian behavior in a system [SPS04a], it is unclear whether these tests are both necessary
and sufficient. Therefore, since we have assumed that the state space is Markovian in order to
calculate the error from sampling, we test the methods given above on a toy system with 87 states
and Markovian transitions between the states.
We want the transition probabilities to be representative of molecular kinetics, which random
transition probabilities are not. Therefore, we construct the transition probabilities of the toy system,
p∗ij , from existing simulation data of a small protein, the 12-residue tryptophan zipper β-hairpin,
TZ2 [CSS01]. We took a subset of 1,750 independent molecular dynamics trajectories generated by
Snow et al. [SQD+ 04] which were started from the unfolded state and taken at a resolution of 10
ns (a total of approximately 12,000 conformations). These conformations were then clustered using
hierarchical clustering with a cutoff of 3.25 angstroms to result in a total of 87 states [SSP04].
We define the transition count zij as the sum over all trajectories of the number of transitions
from state i to state j at a time step ∆t of 10 ns. We define the transition probability matrix P∗ of
our toy system as the expected transition probabilities as defined by Eq. 5.9,
    P^* = \begin{bmatrix}
    \frac{u_{11}}{w_1} & \cdots & \frac{u_{1K}}{w_1} \\
    \vdots & \ddots & \vdots \\
    \frac{u_{K1}}{w_K} & \cdots & \frac{u_{KK}}{w_K}
    \end{bmatrix},    (5.35)
where u_{ij} = \alpha_{ij} + z_{ij}, we use a symmetric Dirichlet prior of \alpha_{i1} = \alpha_{i2} = \cdots = \alpha_{iK} = 1/K,
and wi are the normalization constants. Since this cluster space is not Markovian on the time scales
of the source trajectories, the transition matrix does not represent a Markovian model for protein
folding and thus we will not use the analysis presented below to draw conclusions about the protein
system. However, there certainly exists a Markovian model with these transition probabilities, and
thus we can use this matrix in our analysis as long as we restrict our comments to the nature of the
error analysis and sampling, which is the goal of this chapter. The results presented below are on
the toy Markovian system with 87 states, transition probabilities given by Eq. 5.35, and a time step
∆t of 10 ns.
5.3.1 Demonstration of method 1
Given a transition probability matrix P∗ we can calculate the MFPT from the initial state, x∗1 ,
using Eq. 5.2. In this section we will demonstrate that if we sample transitions from the matrix P∗
and use method 1 on these transition counts, the distribution of x1 which we calculate is a good
approximation to x∗1 .
For our toy system, the transition probability matrix P∗ is given by Eq. 5.35. We sample transitions from this matrix by first selecting a row i, and then choosing a transition j, with probability
p∗ij . We generate transition counts by sampling transitions from each row of P∗ independently and
summing the number of transitions from state i to state j. For each of these transition counts, we use
method 1 to estimate the distribution of x1 . We have taken 10,000 independent samples of possible
transition probability matrices and solved each for x1 . Figure 5.1 shows the actual value of x∗1 for
the matrix P∗ as well as the distributions of x1 for six different transition count matrices, generated
with either 500, 1,000, 5,000, 10,000, 50,000, or 100,000 independent transition samples per row.
It is easy to verify that the actual value x∗1 falls within each distribution. Also, as the number
of transition samples increases, the distribution of x1 becomes narrower and centers around the
actual value, x∗1 . While we have only shown one distribution for each number of samples per row
in the figure, these results are typical. Also, we repeated these experiments for random transition
probability matrices P∗ and found similar results (data not shown).
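The procedure of this section can be sketched compactly on a hypothetical 3-state stand-in for P* (the actual experiments use the 87-state matrix of Eq. 5.35):

```python
import numpy as np

rng = np.random.default_rng(3)
dt, K, N = 10.0, 3, 2000

# Hypothetical stand-in for P*; state 2 is absorbing.
P_star = np.array([[0.85, 0.13, 0.02],
                   [0.10, 0.80, 0.10],
                   [0.00, 0.00, 1.00]])

def mfpt_from_initial(P):
    A = P - np.eye(K); b = -dt * np.ones(K)
    A[-1] = 0.0; A[-1, -1] = 1.0; b[-1] = 0.0   # boundary condition
    return np.linalg.solve(A, b)[0]

x_star = mfpt_from_initial(P_star)

# Sample transition counts from P*, then apply method 1: draw transition
# matrices from the Dirichlet distribution and solve for x_1 for each draw.
counts = np.array([rng.multinomial(5000, row) for row in P_star]) + 1.0 / K
x_samples = np.array([mfpt_from_initial(np.array([rng.dirichlet(u) for u in counts]))
                      for _ in range(N)])
```

The spread of `x_samples` around `x_star` plays the role of the dotted distributions in Figure 5.1, narrowing as the number of transition samples per row grows.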
5.3.2 Validity of approximations
We have demonstrated above that given transition counts, we can calculate the distribution of x1 by
sampling from possible transition matrices that could have produced the observed data and solving
the system of linear equations for each sample. But, as described in Sec. 5.2, this procedure is
Figure 5.1: Distributions of the mean first passage time as generated by the first sampling based
method on the 87 state example. The solid line shows the true value of the mean first passage
time for the toy system given by the transition probability matrix P∗ . The dotted distributions were
generated using method 1 for transition count matrices sampled from P∗ with total numbers of 500,
1,000, 5,000, 10,000, 50,000, or 100,000 samples per row.
computationally expensive. Therefore, we proposed two approximations, the MVN approximation
to the Dirichlet distribution and the Taylor series approximation to the set of linear equations, which
when taken together, give an efficient closed-form approximation to the distribution of x1 . We now
demonstrate the validity of these approximations.
First, we generated transition counts from 2,000 independent samples per row of the toy system
as described above. We then ran methods 1 – 4, as well as the non-sampling based method to calculate the distribution of x1 from these transition counts. For each of methods 1 – 4, we used 10,000
independent samples of transition probability matrices. Figure 5.2 shows the resulting distributions
of x1 for each of the four sampling based methods, as well as the density for the non-sampling based
method.
Methods 1 and 2 overlay almost exactly, as do methods 3 and 4, showing that the MVN approximation to the Dirichlet distribution is valid. Between the linear equation methods (1 and 2) and the
Taylor series methods (3 and 4), there is a slight difference since the Taylor series method ignores
higher-order terms. However, this difference is mostly in the tails of the distributions and is minimal
if one only cares about the mean and variance of the distribution (Table 5.2). In addition, it is clear
that the non-sampling based method overlays sampling based method 4 exactly, which is expected
Figure 5.2: Distribution of the mean first passage time as calculated by the five error analysis methods. The vertical line indicates the mean first passage time at the expected values of the transition
probabilities, x̄1 .
since they solve the same approximations to the problem.
For this example, we compared the running times of the various methods. The code was implemented in MATLAB and run on a Dual Athlon MP 2200+ (1.8 GHz) computer. Table 5.3 gives
the running times for the five different methods required to generate the histograms shown in Figure
5.2. While we did not fully optimize the code for sampling from the Dirichlet and MVN distributions and did not yet implement the sparse matrix solver, these running times clearly demonstrate
the superiority of the non-sampling based method for error analysis. These tests were repeated on
random matrices and the toy system with varying levels of total transitions per row with similar
results (data not shown). In addition, we tried different prior probability distributions and found no
noticeable change in the results for small priors.
5.3.3 Adaptive sampling
Our goals for this chapter were to both calculate the error in the MFPT as well as use this error to
improve the results. Above, we have shown how to efficiently calculate the error in the MFPT, and
                  Mean    Standard deviation
Method 1          6178    973
Method 2          6199    1023
Method 3          6026    940
Method 4          6041    930
Non-Sampling      6044    931
Table 5.2: Means and standard deviations of the MFPT distributions generated by the four sampling based and the non-sampling based error analysis methods (all units are in nanoseconds). All methods show reasonable agreement for these quantities, though the linear equation methods (methods 1 and 2) differ slightly from the Taylor series methods (methods 3, 4, and the non-sampling method).
we have demonstrated that the approximations we made were reasonable. Now, we show how using
these error estimates improves the sampling through an adaptive algorithm. We demonstrate how
the adaptive sampling algorithm compares to both an even allocation of samples and the optimal
allocation of samples.
We will generate samples from the toy system given by the transition probability matrix, P∗ ,
defined in Eq. 5.35. Assume that we can take m transition samples in each round, that we can decide where to allocate the samples before each round, and that we have a limit on the total number of samples.
An even sampling algorithm will always take the same number of samples from each state in each
round, m/K. In the simplest implementation of the adaptive sampling algorithm, we calculate the
contribution to the variance of x1 for each row and add all m samples to the row that is expected to
decrease the variance the most.
We ran both the even and adaptive sampling algorithms by generating samples from the matrix
                  Pre-processing    Sampling    Solving    Total for 10,000 samples
Method 1          NA                4164 s      14.5 s     4178.50 s
Method 2          3.05 s            273 s       15.2 s     291.25 s
Method 3          NA                4164 s      4.6 s      4168.60 s
Method 4          3.05 s            273 s       5.1 s      281.15 s
Non-Sampling      0.05 s            NA          0.02 s     0.07 s

Table 5.3: Running times for the error analysis methods on calculating the MFPT distribution of an 87 state example. The sampling based methods 1 – 4 used 10,000 independent samples of the transition probabilities.
Figure 5.3: Effect of adaptive sampling on the variance of the mean first passage time. The blue
points are the variance generated from the even sampling algorithm and the purple points are the
variance from the adaptive sampling algorithm. The black line shows the variance when the samples
are distributed optimally.
P∗ . We started with the Dirichlet prior of αij = 1/K and an initial 10 samples per row and added
870 samples in each round until we had a maximum of 500,000 samples. Figure 5.3 shows the
variance of x1 versus the total number of samples over 20 independent runs of each algorithm, with
the blue points from the even sampling algorithm and the purple points from the adaptive sampling
algorithm. The dark blue and purple lines on the figure represent the variance in x1 for the even and
adaptive sampling schemes respectively, generated by keeping the transition counts proportional to
P∗ . The black line on the figure is the variance of x1 when the samples are distributed optimally, as
described in Sec. 5.2.5. These solid lines were generated by scaling each row in the matrix by the
desired number of samples for that row. We can see that the adaptive sampling algorithm rapidly
achieves the optimal variance.
It is clear that the adaptive sampling algorithm achieves either a much higher precision in MFPT
with the same number of samples or, conversely, requires many fewer samples to achieve a given
precision. Figure 5.4 shows these relations and demonstrates that the adaptive sampling algorithm,
Figure 5.4: Relationship between the number of samples and the variance for the even and adaptive
sampling algorithms. The top panel shows the ratio of the number of even samples to the number of
adaptive samples required for a desired variance. The bottom panel shows the ratio of the variance
of the even sampling to the variance of the adaptive sampling for the same number of samples.
for this data set, achieved a greater than 20-fold reduction in the number of samples or increase in
precision. If we look at the optimal allocation of samples per row, we see that the distribution is far
from uniform across the states (Figure 5.5). Again, this algorithm was tested on random matrices
with similar results (data not shown).
5.4 Discussion and conclusions
Given that we can generate a large number of molecular dynamics trajectories using distributed
computing methods, such as Folding@Home, it is important to develop efficient techniques for
analyzing the data. In previous chapters, we described a technique for building a graph of the
important states of a protein and estimating transition probabilities between these states. In this
Figure 5.5: Percent of samples required for each state in the optimal allocation of samples per state.
chapter, we discuss methods for ascertaining the uncertainties in important kinetic properties that
can be calculated from this graph.
The focus of this chapter is in the computation of error from finite sampling. Given that a state
definition is Markovian, we have shown that the distribution of transition probabilities, estimated
from molecular dynamics data, is Dirichlet if one assumes a Dirichlet prior. From these distributions, we gave a method which, given sufficient samples of transitions, can calculate the distribution
of a desired quantity, such as the mean first passage time. We also presented and validated two
approximations that have little effect on the accuracy of the results and improve the efficiency of the
first method. When taken together, these approximations yield an efficient closed-form approximation to the uncertainty, which can be calculated in the same amount of time as one solution to the
set of linear equations.
While we have not gone into detail in this chapter, it should be noted that the analysis presented
above could easily be modified to calculate errors in P_fold values or any other quantity that can be
represented by a set of linear equations. Similarly, it is possible to modify the sensitivity analysis to
calculate the error of other functions of the MFPT vector, such as the sum of errors or the norm.
The analysis presented here assumes that it is possible to define states that behave in a Markovian
manner at a given time step. This is not a trivial task, and incorrect state definitions may lead to
unpredictable results. In this chapter, we did not address the errors that may arise from incorrect
state definitions. Instead, we cite tests that can be performed on a specific state definition to see
whether or not it is Markovian. Some of these tests rely on finding eigenvalues or other properties
of transition probability matrices. The methods that we presented here for finding uncertainties in
solutions to linear equations can also be generalized to finding uncertainties in eigenvalues [VS83],
which we will do in Chapter 6. It is important when running these Markovian tests to find the errors
from finite sampling, since it may be possible to pass the tests within the error, or we may find that
the sampling errors are too large to draw any meaningful conclusions about the Markovian-ness of
the system.
For systems that can be reduced to a small number of states, the efficiency gains in the uncertainty estimates are minimal compared to the time it takes to generate the molecular dynamics data
and cluster the conformations. However, systems may have important translational and rotational
degrees of freedom, e.g., two proteins moving in relation to one another or a binding event. In these
cases, for each relevant spatial state of the protein, we would need separate states for each conformation of the protein, and an MSM with tens of thousands of states may be necessary. Sampling-based methods are hardly practical for such large systems. Our non-sampling-based, closed-form solution for the uncertainty, when taken together with the sparse matrix manipulations given in
Sec. 5.2.6, would provide an efficient way to measure these uncertainties.
In addition to the error analysis techniques, we also presented an adaptive sampling method that
can produce a given precision with over an order of magnitude reduction in the number of samples
required by a naive sampling algorithm. We outlined how this algorithm can be applied efficiently
for systems with many states and demonstrated the large gains in either sampling time or precision.
We also outlined sparse matrix techniques that can improve the efficiency of both the error analysis
and adaptive sampling.
In conclusion, we have developed efficient and practical computational tools and algorithms
that find the main sources of error in a Markovian state model caused by finite sampling. We have
shown that our approximate solutions are in good agreement with the actual distributions, and are
computationally far more efficient. In addition, we gave an algorithm that uses the error analysis to
greatly reduce the number of samples necessary to build an MSM with the same precision. Lastly,
we gave techniques for using sparse matrix manipulations that will allow the handling of systems
with large numbers of states.
In the future, when the adaptive sampling algorithm generates new simulations of molecular
systems, some modifications to the algorithm may be necessary. For example, it is likely that we will
need to re-cluster the conformations as we gather new data in order to meet the Markovian criteria.
Also, if we begin new simulations from a given state, we should pick the starting conformation at
random from all the conformations in the state, so that we do not bias the transitions from that state.
Chapter 6
Eigenvalue and eigenvector error analysis
Markovian state models (MSMs) are a convenient and efficient means to compactly describe the
kinetics of a molecular system as well as a formalism for using many short simulations to predict
long time scale behavior. Building an MSM consists of grouping the conformations into states and
estimating the transition probabilities between these states. In the previous chapter, we described
an efficient method for calculating the uncertainty due to finite sampling in the mean first passage
time between two states. In this chapter, we extend the uncertainty analysis to derive similar closed-form solutions for the distributions of the eigenvalues and eigenvectors of the transition matrix,
quantities that have numerous applications when using the model. We demonstrate the accuracy
of the distributions on a six-state model of the terminally blocked alanine peptide. We also show
how to significantly reduce the total number of simulations necessary to build a model with a given
precision using these uncertainty estimates for the blocked alanine system and for a 2454-state MSM
of the villin headpiece.
6.1 Introduction
One approach to studying the movement of biomolecules is to use molecular dynamics. After
generating large ensembles of molecular dynamics simulations, we wish to analyze these trajectories
to find thermodynamic properties such as the equilibrium conformational distribution of the protein
and kinetic properties such as the rate and mechanism of folding. A recent approach for such
analysis involves graph-based models of protein kinetics that divide the conformation space into
discrete states and calculate transition probabilities or rates between the states based on molecular
dynamics trajectories [KTB93, GT94, SSP04, SPS04a, SPS+ 04b, SKH05, AFGL05, SHCT05].
These Markovian state models (MSMs) allow one to easily combine and analyze simulation data
started from arbitrary conformations and naturally handle the existence of intermediate states and
traps. This approach has been applied to small protein systems [SSP04, SPS+ 04b, JVP06], a non-biological polymer [EP04, EPP05a], and vesicle fusion [KKS+ 06, KZP+ 07] with good agreement
with experimental rates. The MSM uses discrete states, and we expect the transition probabilities to
be insensitive to the exact state boundaries after sufficient transition time [Cha78, AD81, VD85]. An
alternative approach uses fuzzy partitions and partial membership of conformations into states, and
may be able to better characterize transition regions and describe dynamics at shorter time scales
[WG02, Web06].
For any quantities which can be calculated from the MSM, such as the mean first passage
times between states [SSP04], probability of folding from a given conformation [SSP04], or rates
[SPS04a], it is also important to determine the uncertainty in these values, so that one can form an
idea about the confidence of the results. One main source of error is caused by grouping conformations into states and assuming that transitions between these states are Markovian. It has been
shown that if the conformations are grouped incorrectly or if the transition probabilities are calculated from a time step which is too short, the transitions are no longer history independent, and any
analysis that assumes a Markovian process may produce incorrect results [SPS04a]. Even if the
states are defined such that the transitions between them are Markovian, the results could still be
in error. This second source of error results from the finite sampling of transitions between states,
which gives uncertainties in the transition probability estimates and, in turn, leads to uncertainties
in the values we calculate.
In the previous chapter (Chapter 5), we focused on the uncertainties caused by finite sampling
and showed how to efficiently calculate the resulting uncertainty in the mean first passage time
between two states. Those methods can easily be applied to calculate the uncertainty in any quantity that can be expressed as the solution of a set of linear equations of the transition probabilities.
However, many interesting collective properties of the system are described using the eigenvalues
and eigenvectors of the transition matrix. For example, the eigenvalues correspond to the aggregate time scales of the system, and thus can be compared with experiments to validate the model
[KTB93, SPS+ 04b]. Additionally, they are used in some tests for determining the time at which the
system becomes Markovian [SPS04a]. The eigenvectors are useful in determining the states which
participate in the relaxation process corresponding to a given eigenvalue, and can be used to group
kinetically similar states [Sch99, SFHD99, Hui01, SH02, DW04].
In this chapter, we will extend the uncertainty analysis methods presented previously [SP05a]
to estimate the uncertainties in the eigenvalues and eigenvectors of a transition matrix caused by
finite sampling. These error estimates can again be calculated with efficient closed-form solutions.
Moreover, these error estimates can be used to adaptively direct further simulations to reduce the
uncertainties of functions of the eigenvalues or eigenvectors. The validity of the error estimates is
demonstrated on a small system, the terminally blocked alanine peptide, and the power of adaptive
sampling is demonstrated on the alanine peptide and a model of the villin headpiece.
6.2 Methods
Molecular dynamics simulations are a popular tool for understanding molecular motion, but analyzing the resulting trajectories to extract kinetic information is a difficult task. Recent work [KTB93, GT94,
SSP04, SPS04a, SPS+ 04b, SKH05, AFGL05, SHCT05] has involved modeling the system as a
Markovian state model, where the conformation space of the molecule is divided into discrete regions, or states, and transition probabilities are calculated between the states. If the transitions
between the states are Markovian, or history independent on some time scale, it is possible to model
the long time scale behavior of the system as a Markov chain on the Markovian state model graph.
Determining a state space over which transitions are Markovian is a difficult task and there
has been much work on determining appropriate decompositions [CSP+ 07, NHSS07]. Even if an
appropriate decomposition can be found for which the dynamics are Markovian at some lag time,
the kinetic properties calculated from the model still have uncertainties. Since we can only sample
a finite number of transitions between states, the estimated transition probabilities between states
will have statistical uncertainty. Therefore, any value calculated from the transition probabilities
will also have an uncertainty associated with it. In the previous chapter, we mapped the uncertainty
in the transition probabilities to uncertainties in the mean first passage time between two states, or
other similar quantities that are solutions of linear equations in the transition probabilities.
In the following section, we calculate efficient closed-form expressions for the uncertainties in
the eigenvalues and eigenvectors of the Markovian state model, which describe the full kinetics of
the system. The basis for the derivation and many of the equations are similar to those for the mean
first passage time (Chapter 5). However, we reproduce them here for clarity.
6.2.1 Eigenvalue and eigenvector equations
In a Markovian state model, we represent the conformation space by K discrete states, each of
which corresponds to some distinct group of molecular conformations. Let us define the probability
of transitioning from state i to state j at a time step of ∆t as pij .
An eigenvalue λ of a matrix P is defined as

P v_\lambda = \lambda v_\lambda,    (6.1)
where v_λ is the eigenvector corresponding to eigenvalue λ. We define the matrix A, with rows a_i, as

A = P - \lambda I = \begin{pmatrix} p_{11}-\lambda & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22}-\lambda & \cdots & p_{2K} \\ \vdots & & \ddots & \vdots \\ p_{K1} & \cdots & p_{K(K-1)} & p_{KK}-\lambda \end{pmatrix},    (6.2)
where I is the identity matrix. Eq. 6.1 is then equivalent to

A v_\lambda = 0,    (6.3)

which has a non-trivial solution v_λ only when the determinant of the matrix is zero,

\det(A) = 0.    (6.4)
6.2.2 Transition probability distribution
Finite sampling causes uncertainties in the estimates of the transition probabilities between states. A
derivation and complete explanation of the distribution over transition probability vectors has been
given before in Sec. 5.2.2. Here, we summarize the main results.
We define p∗ij as the actual transition probability from state i to j at a time step of ∆t, where
the sum of the transition probabilities from state i is equal to one. We can estimate these transition
probabilities by independently sampling transitions between states i and j, either through independent simulations or by only including transitions separated by the lag time at which the transitions
are Markovian. We generate counts zij which are the total number of transition samples from state
i to state j. We define n_i as the total number of samples originating from state i,

n_i = \sum_{j=1}^{K} z_{ij}.    (6.5)

The distribution of the z_{ij} variables follows the multinomial distribution with parameters n_i, p^*_{i1}, p^*_{i2}, \ldots, p^*_{iK} [JKB97].
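This counting model can be simulated directly. The sketch below uses hypothetical values for the true probabilities p* and the sample count n_i (not data from this work):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical true transition probabilities out of state i, and sample count.
p_star = np.array([0.90, 0.08, 0.02])
n_i = 5000

# Simulated transition counts z_ij follow a multinomial distribution.
z_i = rng.multinomial(n_i, p_star)

# Eq. 6.5: the counts out of state i sum to n_i.
assert z_i.sum() == n_i

# The count fractions estimate p* to within a few standard errors.
se_max = np.sqrt(p_star * (1.0 - p_star) / n_i).max()
assert np.allclose(z_i / n_i, p_star, atol=5 * se_max)
```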
Using Bayesian analysis, we can compute the distribution over all possible vectors of transition
probabilities. The probability of a particular column vector pi being the true transition probability
vector, given the observed transition counts, is, from Bayes' rule,

P(p_i \mid z_i) \propto P(z_i \mid p_i)\, P(p_i) = p_{i1}^{z_{i1}} p_{i2}^{z_{i2}} \cdots p_{iK}^{z_{iK}}\, P(p_i),    (6.6)
where P (pi ) is the prior probability over the transition probability vectors, i.e., the distribution
representing the state of knowledge of transition probability vectors before observing any data.
A convenient choice for the prior is the Dirichlet distribution, the conjugate prior of the multinomial distribution. The Dirichlet distribution with variables p and parameters u is defined as

\mathrm{Dirichlet}(p; u) = \frac{1}{Z(u)} \prod_{i=1}^{K} p_i^{u_i - 1},    (6.7)

where Z(u) is a normalizing constant, defined in Appendix A in terms of the gamma function Γ. If we
define the prior of the transition probabilities as a Dirichlet distribution with parameters αi1 , αi2 , . . .,
αiK and we observe transition counts zi1 , zi2 , . . ., ziK , the posterior of the transition probabilities
is a Dirichlet distribution with parameters αi1 + zi1 , αi2 + zi2 , . . ., αiK + ziK . For notational
convenience, we define the Dirichlet counts as
uij = αij + zij .
(6.8)
Therefore, assuming a Dirichlet prior, the distribution of the transition probabilities, pi , given the
observed data counts is Dirichlet(pi ; ui ). In the limit, as the sampling (and therefore transition
counts) increases, the distribution of the transition probabilities will not depend on the choice of the
prior distribution.
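The posterior update is a one-line computation. The sketch below uses hypothetical counts for a 3-state row and the uniform prior α_ij = 1/K adopted later in this chapter, and checks the analytic posterior mean (Eq. 6.9) against Monte Carlo draws from the Dirichlet posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed transition counts out of one state (hypothetical values) and a
# uniform Dirichlet prior alpha_ij = 1/K with K = 3 states.
z_i = np.array([480.0, 15.0, 5.0])
alpha_i = np.full(3, 1.0 / 3.0)

# Posterior Dirichlet counts (Eq. 6.8) and normalizing weight (Eq. 6.9).
u_i = alpha_i + z_i
w_i = u_i.sum()

# Posterior mean transition probabilities (Eq. 6.9).
p_bar = u_i / w_i
assert np.isclose(p_bar.sum(), 1.0)

# Monte Carlo draws from the Dirichlet posterior agree with the analytic mean.
samples = rng.dirichlet(u_i, size=200000)
assert np.allclose(samples.mean(axis=0), p_bar, atol=1e-3)
```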
It will be useful to state the expected values of the posterior distribution of the transition probabilities for future reference, where the w_i are normalizing weight variables [KBJ00]:

\bar{p}_{ij} = E(p_{ij}) = \frac{u_{ij}}{w_i}, \qquad w_i = \sum_{j=1}^{K} u_{ij}.    (6.9)
6.2.3 Distribution of eigenvalues and eigenvectors
It is possible to repeatedly sample from the transition probability posterior distribution and find
the eigenvalues and eigenvectors for each sample to determine the posterior distributions of these
quantities. However, this method is very expensive, both because many samples are required to accurately describe the distribution and because the solution of the eigenvalue system is expensive (O(K^3) plus some small number of iterations [GvL96]) for each sample. For these reasons, we will make two approximations that will yield efficient closed-form solutions for the distributions of the eigenvalue λ
and the corresponding eigenvector vλ . If the distributions of multiple eigenvalue/eigenvector pairs
are desired, this procedure would need to be repeated independently for each pair.
Taylor series approximation
First, we will approximate the eigenvalue and eigenvector of interest with a Taylor series expansion
about these values calculated at the mean values of the transition probabilities. We define the mean
matrix Ā as

\bar{A} = \begin{pmatrix} \bar{p}_{11}-\lambda & \bar{p}_{12} & \cdots & \bar{p}_{1K} \\ \bar{p}_{21} & \bar{p}_{22}-\lambda & \cdots & \bar{p}_{2K} \\ \vdots & & \ddots & \vdots \\ \bar{p}_{K1} & \cdots & \bar{p}_{K(K-1)} & \bar{p}_{KK}-\lambda \end{pmatrix},    (6.10)
where the variables \bar{p}_{ij} are defined in Eq. 6.9. The mean eigenvalue λ̄ satisfies the equation

\det(\bar{A}) = 0,    (6.11)

and the mean eigenvector v̄_λ satisfies the equation

\bar{A}\big|_{\bar\lambda}\, \bar{v}_\lambda = 0.    (6.12)
The first-order Taylor series expansion for the eigenvalue λ as a function of the transition probabilities is

\lambda = \bar\lambda + \frac{\partial \lambda}{\partial p_{11}}\bigg|_{\bar{A}} \Delta p_{11} + \frac{\partial \lambda}{\partial p_{12}}\bigg|_{\bar{A}} \Delta p_{12} + \cdots + \frac{\partial \lambda}{\partial p_{KK}}\bigg|_{\bar{A}} \Delta p_{KK},    (6.13)

where the Δp_{ij} are small perturbations in the parameters. Appendix E shows how to compute the terms \partial\lambda/\partial p_{ij}|_{\bar{A}} in Eq. 6.13 efficiently.
Similarly, the first-order Taylor series expansion for the eigenvector v_λ as a function of the transition probabilities is

v_\lambda = \bar{v}_\lambda + \frac{\partial v_\lambda}{\partial p_{11}}\bigg|_{\bar{A}} \Delta p_{11} + \frac{\partial v_\lambda}{\partial p_{12}}\bigg|_{\bar{A}} \Delta p_{12} + \cdots + \frac{\partial v_\lambda}{\partial p_{KK}}\bigg|_{\bar{A}} \Delta p_{KK}.    (6.14)

Appendix F shows how to calculate all the terms \partial v_\lambda/\partial p_{ij}|_{\bar{A}} in Eq. 6.14 efficiently.
Multivariate normal approximation
As shown in Sec. 6.2.2, the transition probabilities p_i are distributed according to Dirichlet distributions. If the sample size is sufficiently large, then, by the central limit theorem, the distribution of p_i converges to a multivariate normal distribution (MVN) with mean μ_i and covariance matrix Σ_i given by

\mu_i = \frac{u_i}{w_i},    (6.15)

\Sigma_i = \frac{1}{w_i^2 (w_i + 1)} \left[ w_i \,\mathrm{Diag}(u_i) - u_i u_i^T \right],    (6.16)

where the superscript "T" denotes the transpose, Diag(u_i) represents a matrix with entries u_{ij} along the diagonal, and the w_i terms are the normalizing weight variables defined in Eq. 6.9 [Rao73]. The covariance matrix in this distribution enforces the constraint that each possible transition probability vector p_i must sum to unity.
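A short sketch, again with hypothetical counts, computes these MVN moments and confirms the stated constraint: because each p_i must sum to one, every row of Σ_i sums to zero, so there is no variance along the all-ones direction:

```python
import numpy as np

# Dirichlet counts u_i for transitions out of one state (hypothetical values).
u_i = np.array([480.0, 15.0, 5.0])
w_i = u_i.sum()

# MVN moments of the posterior row distribution (Eqs. 6.15 and 6.16).
mu_i = u_i / w_i
Sigma_i = (w_i * np.diag(u_i) - np.outer(u_i, u_i)) / (w_i**2 * (w_i + 1.0))

# The sum-to-one constraint makes Sigma_i singular: each row sums to zero.
assert np.allclose(Sigma_i.sum(axis=1), 0.0)
assert np.allclose(Sigma_i, Sigma_i.T)       # covariance is symmetric
assert np.isclose(mu_i.sum(), 1.0)
```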
Closed-form solutions
Making both the Taylor series and multivariate normal approximations leads to closed-form expressions for the distributions of the eigenvalue λ and its corresponding eigenvector v_λ. For notational convenience, we define the deviation vector Δp_i, the eigenvalue sensitivity vector s_i^λ, and the eigenvector sensitivity matrix S_i^{v_λ}:

\Delta p_i^T = \left[ \Delta p_{i1} \cdots \Delta p_{iK} \right], \qquad
\left( s_i^\lambda \right)^T = \left[ \frac{\partial \lambda}{\partial p_{i1}}\bigg|_{\bar{A}} \cdots \frac{\partial \lambda}{\partial p_{iK}}\bigg|_{\bar{A}} \right], \qquad
S_i^{v_\lambda} = \begin{pmatrix} \frac{\partial v_1^\lambda}{\partial p_{i1}}\Big|_{\bar{A}} & \cdots & \frac{\partial v_1^\lambda}{\partial p_{iK}}\Big|_{\bar{A}} \\ \vdots & \ddots & \vdots \\ \frac{\partial v_K^\lambda}{\partial p_{i1}}\Big|_{\bar{A}} & \cdots & \frac{\partial v_K^\lambda}{\partial p_{iK}}\Big|_{\bar{A}} \end{pmatrix}.    (6.17)
We can then rewrite Eqs. 6.13 and 6.14 by grouping K terms at a time as

\lambda = \bar\lambda + \sum_{i=1}^{K} \left( s_i^\lambda \right)^T \Delta p_i, \qquad
v_\lambda = \bar{v}_\lambda + \sum_{i=1}^{K} S_i^{v_\lambda} \Delta p_i.    (6.18)

The vector Δp_i is equal to p_i − \bar{p}_i and, with the MVN approximation, has mean 0 and covariance matrix Σ_i given by Eq. 6.16. Linear combinations of MVN random variables are also MVN random variables, as described in Appendix B [Rao73].
Therefore, λ has a normal distribution with mean λ̄ and variance σ²,¹

\lambda \approx N(\bar\lambda, \sigma^2),    (6.19)

where

\sigma^2 = \sum_{i=1}^{K} \left( s_i^\lambda \right)^T \Sigma_i \, s_i^\lambda.    (6.20)

Substituting Eq. 6.16 for Σ_i, we see that

\sigma^2 = \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left( s_i^\lambda \right)^T \left[ w_i \,\mathrm{Diag}(u_i) - u_i u_i^T \right] s_i^\lambda
= \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left[ w_i \left( s_i^\lambda \right)^T \mathrm{Diag}(u_i)\, s_i^\lambda - \left( \left( s_i^\lambda \right)^T u_i \right) \left( u_i^T s_i^\lambda \right) \right].    (6.21)

¹ If the distribution of multiple eigenvalues is desired, it is possible to group terms in Eq. 6.13 similarly to how we group terms for the eigenvectors in Eq. 6.14 to find the covariance matrix between the eigenvalues.
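The pipeline (mean matrix, sensitivity vectors, closed-form variance of Eq. 6.21) can be validated against brute-force Dirichlet sampling. The sketch below uses hypothetical counts for a 3-state model and obtains the sensitivities s_i^λ from standard first-order eigenvalue perturbation theory, ∂λ/∂p_ij = y_i v_j / (y·v) with y and v the left and right eigenvectors; this is one standard route to the derivative terms that Appendix E computes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Dirichlet counts u_ij for a 3-state model (rows index the
# starting state); illustrative values only, not thesis data.
U = np.array([[900.0, 90.0, 10.0],
              [60.0, 920.0, 20.0],
              [10.0, 30.0, 960.0]])
w = U.sum(axis=1)
P_bar = U / w[:, None]                      # posterior mean matrix (Eq. 6.9)

# Mean second-largest eigenvalue with its right and left eigenvectors.
eigvals, V = np.linalg.eig(P_bar)
k = np.argsort(eigvals.real)[::-1][1]
lam_bar = eigvals.real[k]
v = V[:, k].real
eigvals_T, W_left = np.linalg.eig(P_bar.T)
y = W_left[:, np.argmin(np.abs(eigvals_T.real - lam_bar))].real

# First-order perturbation theory: d(lambda)/d(p_ij) = y_i v_j / (y . v),
# giving the sensitivity vectors s_i of Eq. 6.17 as the rows of S.
S = np.outer(y, v) / (y @ v)

# Closed-form variance of the eigenvalue, Eq. 6.21.
sigma2 = sum((w[i] * S[i] @ np.diag(U[i]) @ S[i] - (S[i] @ U[i]) ** 2)
             / (w[i] ** 2 * (w[i] + 1.0)) for i in range(3))

# Brute-force check (method 1 below): sample transition matrices from the
# Dirichlet posterior and solve each for its second-largest eigenvalue.
lams = [np.sort(np.linalg.eigvals(
            np.vstack([rng.dirichlet(U[i]) for i in range(3)])).real)[-2]
        for _ in range(5000)]
assert np.isclose(np.var(lams), sigma2, rtol=0.25)
```

The eigenvalues of this illustrative matrix are well separated, so rank shifting under perturbation (discussed in Sec. 6.3.1) does not contaminate the sampled distribution.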
Similarly, v_λ has a multivariate normal distribution with mean v̄_λ and covariance matrix Σ²_{v_λ},

v_\lambda \approx \mathrm{MVN}\left( \bar{v}_\lambda, \Sigma^2_{v_\lambda} \right),    (6.22)

where

\Sigma^2_{v_\lambda} = \sum_{i=1}^{K} S_i^{v_\lambda} \Sigma_i \left( S_i^{v_\lambda} \right)^T
= \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left[ w_i \, S_i^{v_\lambda} \mathrm{Diag}(u_i) \left( S_i^{v_\lambda} \right)^T - \left( S_i^{v_\lambda} u_i \right) \left( u_i^T \left( S_i^{v_\lambda} \right)^T \right) \right].    (6.23)
Computational cost
The closed-form solutions given in Eqs. 6.19 and 6.22 require solving for λ̄ and v̄_λ, which takes
time O(K^3) [GvL96]. Appendix E shows that we can find all the partial derivative terms for the
eigenvalue in time O(K^2), and Appendix F shows we can find all the partial derivative terms for the
eigenvector in time O(K^3). Since the variance in Eq. 6.21 for the eigenvalue is a sum of vector
dot products (rather than matrix-vector products), we can calculate it in time O(K^2). Similarly,
since the covariance matrix in Eq. 6.23 for the eigenvector is a sum of matrix-vector products
(rather than matrix-matrix products), we can calculate it in time O(K^3).
6.2.4 Adaptive sampling
As described previously (Sec. 5.2.5), we can decompose the closed-form normal or multivariate
normal distributions for the eigenvalue or eigenvector to calculate the contribution to the variance
from the elements in each row of the transition matrix, corresponding to the transitions from a single
state. We can then start new simulations from the states which contribute the most to the variance
in order to improve the overall precision.
The variance of the eigenvalue λ decomposes as

\sigma^2 = \sum_{i=1}^{K} \frac{\bar{q}_i}{w_i + 1}, \qquad
\bar{q}_i = \left( s_i^\lambda \right)^T \left[ \mathrm{Diag}(\bar{p}_i) - \bar{p}_i \bar{p}_i^T \right] s_i^\lambda,    (6.24)

where we have separated out the \bar{q}_i terms, which do not depend on the allocation of samples w_i.
If we were to add m more samples and assume that the expected transition probabilities remain
constant, we can choose the state i that will decrease this variance the most as

i = \operatorname*{argmax}_i \left( \frac{\bar{q}_i}{w_i + 1} - \frac{\bar{q}_i}{w_i + m + 1} \right).    (6.25)
Similar calculations can be performed to obtain the state which contributes the most to any function
of the covariance matrix of the eigenvector.
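The selection rule of Eq. 6.25 is a few lines of code. The sketch below, with hypothetical q̄_i and w_i values, shows that it favors states with large variance contributions and few existing samples:

```python
import numpy as np

def best_state(q_bar, w, m):
    """Eq. 6.25: pick the state whose next m samples most reduce the
    eigenvalue variance of Eq. 6.24, assuming mean probabilities stay fixed."""
    gain = q_bar / (w + 1.0) - q_bar / (w + m + 1.0)
    return int(np.argmax(gain))

# Hypothetical per-state variance contributions and current sample counts.
q_bar = np.array([0.02, 0.50, 0.10])
w = np.array([100.0, 100.0, 5000.0])

# State 1 has the largest q_bar and few existing samples, so it wins.
assert best_state(q_bar, w, m=60) == 1

# With very many existing samples, even a large q_bar yields little gain.
w2 = np.array([100.0, 100000.0, 5000.0])
assert best_state(q_bar, w2, m=60) == 0
```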
6.3 Results
To test the closed-form solutions for the distribution for an eigenvalue λ given in Eq. 6.19 and an
eigenvector vλ given in Eq. 6.22, we compare the distributions with those obtained from sampling
from the posterior transition probability distribution and solving each sample for the eigenvalue or
eigenvector of interest. We can test all combinations of the two assumptions given above using
different sampling and solving methods. Namely, method 1 will sample from the Dirichlet distributions and solve for the eigenvalues or eigenvectors directly, method 2 will sample from the MVN
distributions and solve for the eigenvalues or eigenvectors directly, method 3 will sample from the
Dirichlet distributions and substitute into the Taylor series approximations, and method 4 will sample from the MVN distributions and substitute into the Taylor series approximations. In this way, we
can determine independently whether the MVN approximations and the Taylor series approximations are valid. The equations derived above (Eqs. 6.19 and 6.22) are simply closed-form solutions
to the sampling-based method 4.
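The sampling halves of these four methods are easy to compare in isolation. The sketch below draws one row of the transition matrix from its Dirichlet posterior (as in methods 1 and 3) and from its MVN approximation (as in methods 2 and 4), using hypothetical counts, and checks that the two ensembles agree in their first two moments:

```python
import numpy as np

rng = np.random.default_rng(2)

# Dirichlet counts for one row of the transition matrix (hypothetical values).
u = np.array([900.0, 90.0, 10.0])
w = u.sum()
mu = u / w                                                       # Eq. 6.15
Sigma = (w * np.diag(u) - np.outer(u, u)) / (w**2 * (w + 1.0))   # Eq. 6.16

n = 100000
dirichlet_rows = rng.dirichlet(u, size=n)                # methods 1 and 3
mvn_rows = rng.multivariate_normal(mu, Sigma, size=n)    # methods 2 and 4

# At this count depth the two ensembles agree in mean and covariance,
# which is exactly what the MVN approximation asserts.
assert np.allclose(dirichlet_rows.mean(axis=0), mvn_rows.mean(axis=0),
                   atol=2e-3)
assert np.allclose(np.cov(dirichlet_rows.T), np.cov(mvn_rows.T), atol=2e-5)
```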
We apply these methods to calculate the distributions of eigenvalues and eigenvectors in the
terminally blocked alanine peptide (Fig. 6.1) to demonstrate that the multivariate normal and Taylor
series approximations are valid. Stable states on the conformational landscape have previously
been identified [CSPD06b]. A set of 30,000 shooting trajectories (5,000 initiated from equilibrium
distributions within each of the six states) at 302 K was obtained from Chodera et al. [CSPD06b].
We count transitions between these states at a lag time ∆t of 0.1 ps, counting only one transition
per trajectory to ensure independence of the data. The state decomposition is non-Markovian at this
lag time; therefore, the eigenvalues and eigenvectors of the transition matrix may not correspond
to the true underlying alanine dynamics. However, it is still important to determine the error from
sampling in the eigenvalues and eigenvectors, since these values are used in tests for Markovian
behavior [SPS04a] and clustering of states [CSP+ 07]. Further, the primary focus in this chapter is
Figure 6.1: Potential of mean force and manual state decomposition for terminally-blocked alanine
peptide. Left: The terminally-blocked alanine peptide with the φ and ψ backbone torsions labeled.
Right: The potential of mean force in the (φ, ψ) torsions at 302 K estimated from the parallel
tempering simulation. Boundaries defining the six states manually identified by Chodera et al.
[CSPD06b] are superimposed and the states labeled.
in validating the mathematical modeling of the distributions of eigenvalues and eigenvectors. The
counts for this system are

Z = \begin{pmatrix}
4830 & 153 & 15 & 2 & 0 & 0 \\
211 & 4788 & 1 & 0 & 0 & 0 \\
169 & 1 & 4604 & 226 & 0 & 0 \\
3 & 13 & 158 & 4823 & 3 & 0 \\
0 & 0 & 0 & 4 & 4978 & 18 \\
7 & 5 & 0 & 0 & 62 & 4926
\end{pmatrix},    (6.26)
and we set the prior αij = 1/6, as previously described [SP05a].
6.3.1 Eigenvalue distributions
Figure 6.2 shows the distributions for the five non-unit eigenvalues as calculated from the normal
distribution in Eq. 6.19 (red lines), and from the four sampling based methods described above. It
is clear that for the fifth and sixth eigenvalue, the normal distributions are excellent matches with
the sampling based distributions. For the second eigenvalue, there are slight discrepancies between
the Dirichlet samples and the MVN samples. For the third and fourth eigenvalues, there appear to
be differences between the methods which solve for the eigenvalues directly and those which make
the Taylor series approximation.
When there are multiple eigenvalues that are close in magnitude, small perturbations in the transition probabilities may result in the shifting of the rank of the eigenvalues of the corresponding
perturbed matrix with respect to the original eigenvalues. Therefore, when the eigenvalues of the
perturbed matrix are solved directly, one cannot simply take, for example, the second largest eigenvalue of the perturbed matrix to calculate the distribution of the eigenvalue that was second largest
in the original matrix. In this system, the third and fourth eigenvalues overlap in range. In the direct
solutions of eigenvalues, the distribution of the third largest eigenvalue is therefore shifted to the
right of the distribution of the eigenvalue ranked third in the original matrix. It is possible to match
eigenvalues based on their corresponding eigenvectors, but we have not done that here. A benefit of the Taylor series approximation is that it automatically calculates deviations to the particular
eigenvalue of interest, and thus is insensitive to these changes in rank.
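Matching eigenvalues by eigenvector overlap, which the text mentions but does not pursue, can be sketched as follows. This is a greedy version with illustrative matrices; a full assignment algorithm would guard against duplicate matches:

```python
import numpy as np

def match_eigenvalues(V_ref, P_pert):
    """Pair each reference eigenvector with the perturbed eigenvalue whose
    eigenvector overlaps it most, instead of matching eigenvalues by rank.
    Greedy per-column matching; the Hungarian algorithm would be safer."""
    lam_pert, V_pert = np.linalg.eig(P_pert)
    V_ref = V_ref / np.linalg.norm(V_ref, axis=0)
    V_pert = V_pert / np.linalg.norm(V_pert, axis=0)
    overlap = np.abs(V_ref.conj().T @ V_pert)   # |<v_ref_k, v_pert_l>|
    return lam_pert[np.argmax(overlap, axis=1)]

# Hypothetical transition matrix and a small perturbation of its first row.
P = np.array([[0.90, 0.09, 0.01],
              [0.06, 0.92, 0.02],
              [0.01, 0.03, 0.96]])
P_pert = P.copy()
P_pert[0] = [0.88, 0.11, 0.01]

lam_ref, V_ref = np.linalg.eig(P)
matched = match_eigenvalues(V_ref, P_pert)

# Each perturbed eigenvalue stays close to the reference one it was matched
# to, regardless of how the ranks shift.
assert np.allclose(lam_ref.real, matched.real, atol=0.05)
```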
The Taylor series expansion also immediately decomposes into contributions from the transitions out of each state, as discussed in Sec. 6.2.4. Figure 6.3 shows the contribution of each state
to the variance in each eigenvalue (normalized such that the total contribution for each eigenvalue
sums to one). These values are the q̄i in Eq. 6.24. If we wished to determine from which states to
start new simulations to reduce the variance in any of the eigenvalues, we would use Eq. 6.25, since
the expected decrease in variance depends on the current number of samples from a given state.
However, since the shooting trajectories have an equal number of samples from each state, we can
use Fig. 6.3 to see that, for example, we should add more samples to state 5 to decrease the variance
of the second eigenvalue. It would be very difficult to extract this information if one were to sample
possible transition probabilities and solve each sample for the eigenvalues.
6.3.2 Eigenvector distributions
In addition to the distributions of the eigenvalues, we are also interested in the distribution of the
eigenvectors. Figure 6.4 shows the mean and variance calculated from the closed form distribution
(Eq. 6.22) and the four sampling based methods for the eigenvector components corresponding to
the third (top panel) and fifth (bottom panel) eigenvalues. The inset in the top panel shows the full
distributions for the second eigenvector component. Since the eigenvalues may shift in rank with the
perturbations to the transition probabilities, the eigenvector component distribution generated from
solving for the eigenvectors directly is clearly bimodal. Because the third and fourth eigenvalues
Figure 6.2: Distributions of the five non-unit eigenvalues of the system shown in Fig. 6.1. The
red lines indicate the normal distributions calculated using Eq. 6.21, and the magenta, green, blue,
and cyan density plots indicate the distributions generated from the four sampling based methods, Dirichlet sampling and direct solving, MVN sampling and direct solving, Dirichlet sampling
and Taylor series substitution, and MVN sampling and Taylor series substitution, respectively, for
20,000 samples.
Figure 6.3: The percent contribution of each state to the variance of the five non-unit eigenvalues
(Eq. 6.24).
overlap in range (as shown in Fig. 6.2), the eigenvectors calculated by ranking the eigenvalues
in order actually correspond to different processes. As in the case with the eigenvalues, the Taylor
series methods are insensitive to rank ordering changes, and calculate the true, unimodal distribution
of the eigenvector components.
The fifth eigenvalue, however, is well separated from the other eigenvalues, and the full distribution of the third eigenvector component (shown in the bottom inset of Fig. 6.4) is a good
approximation of the actual distribution. The distributions of eigenvector components are not independent – they also have some covariance between them, which is not shown here. While we have
only shown the mean and variances for two of the eigenvectors and the full distributions for two of
the components, the results are similar across eigenvectors and components (data not shown).
The variance of each eigenvector component can again be decomposed into contributions from
transitions leaving each state. Figure 6.5 shows this decomposition for the eigenvectors corresponding to the third (top panel) and fifth (bottom panel) eigenvalues. The values shown are the percent
contribution to the sum of the variances of all the components. We can see that for the third eigenvector, the sixth component has the most variance and can be improved by adding more samples
from states 5 and 6. For the fifth eigenvector, the fifth and sixth components are quite precise, and
the remaining four components depend to different degrees on the transitions from the first four
states. Again, this information would be very difficult to extract by sampling from the transition
matrix and solving the eigenvectors for each sample.
Figure 6.4: Distributions of the eigenvector components corresponding to the third (top panel) and
fifth (bottom panel) eigenvalues of the six-state model of the terminally blocked alanine peptide.
Distributions are calculated either from the MVN distribution given in Eq. 6.22 or from the samples
obtained by methods 1–4 described above. The insets show the actual distribution for the second
eigenvector component (top inset) and third eigenvector component (bottom inset).
Figure 6.5: The contributions to the variance of the eigenvector components as decomposed by
transitions from each state. The top panel corresponds to the third eigenvalue and the bottom panel
corresponds to the fifth eigenvalue.
6.3.3 Adaptive sampling
In addition to efficiently calculating the uncertainties in the eigenvalues and eigenvectors, we also
wish to use these estimates to improve the sampling as described in Sec. 6.2.4. We compare the
adaptive sampling algorithm to equilibrium sampling, where the number of trajectories initiated
from each state is proportional to the equilibrium probability of the state, and even sampling, where
an equal number of trajectories are initiated from each state.
Assume that we can take m transition samples in each round, that we can decide where to allocate
the samples before each round, and that we have a limit on the total number of samples. In the
equilibrium sampling algorithm, for each new transition sample, we will choose the state from which to
initiate the sample with probability equal to the equilibrium probability of the state. The equilibrium
probability of a state is the eigenvector, properly normalized, corresponding to the unit eigenvalue
of the transition probability matrix, calculated from all the simulation data. The even sampling
algorithm will always take the same number of samples from each state in each round, m/K. In
the simplest implementation of the adaptive sampling algorithm, we calculate the contribution from
each row to the variance of the quantity of interest and add all m samples to the row that is expected
to decrease the variance the most.
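In outline, one round of each allocation rule can be sketched as follows. This is a minimal illustration under the assumptions above, not the code used in this work; the function name and argument names are hypothetical.

```python
import numpy as np

def allocate_samples(strategy, m, K, pi_eq=None, var_reduction=None, seed=0):
    """Decide how many of the m new transition samples start from each state.

    strategy: 'equilibrium', 'even', or 'adaptive' (names are illustrative).
    pi_eq: equilibrium probabilities, used by 'equilibrium' sampling.
    var_reduction: expected decrease in the variance of the quantity of
        interest per state, used by 'adaptive' sampling.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(K, dtype=int)
    if strategy == "even":
        counts[:] = m // K                 # m/K trajectories from every state
    elif strategy == "equilibrium":
        # draw each sample's initial state with probability pi_eq
        starts = rng.choice(K, size=m, p=pi_eq)
        np.add.at(counts, starts, 1)
    elif strategy == "adaptive":
        # put all m samples on the state expected to reduce the variance most
        counts[int(np.argmax(var_reduction))] = m
    return counts
```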
Terminally blocked alanine peptide
For the alanine peptide, we choose to adaptively sample in order to reduce the variance of the largest
non-unit eigenvalue, which corresponds to transitions between states 1, 2, 3, and 4 and states 5 and
6. In general, one would probably determine some function of the variances of multiple eigenvalues
to minimize. Given our dataset, we can simulate the sampling algorithms by randomly selecting
trajectories without replacement from the set of shooting trajectories initiated from each state. In
this way, we can estimate either how many shooting trajectories would be necessary to achieve a
given precision in the eigenvalue or the possible precision with a given number of total trajectories.
The transition matrix used to calculate the equilibrium probabilities of each state was the count
matrix given in Eq. 6.26, normalized such that each row summed to one.
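The row normalization and the extraction of the equilibrium distribution as the properly normalized eigenvector of the unit eigenvalue can be illustrated on a small, hypothetical count matrix (the values below are not the counts of Eq. 6.26):

```python
import numpy as np

# Hypothetical 3-state count matrix standing in for an observed count matrix.
C = np.array([[8.0, 1.0, 1.0],
              [2.0, 6.0, 2.0],
              [1.0, 1.0, 8.0]])

# Normalize each row to sum to one, giving the transition probability matrix.
P = C / C.sum(axis=1, keepdims=True)

# Equilibrium distribution: the left eigenvector of P for the unit eigenvalue,
# normalized so that its entries sum to one.
w, vl = np.linalg.eig(P.T)
pi = np.real(vl[:, np.argmax(np.real(w))])
pi /= pi.sum()
```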
We began by setting the prior αij = 1/6 and selecting 100 trajectories at random from each of
the six states. In the equilibrium sampling algorithm, for each round, the initial state for each of
60 additional trajectories was selected with probability equal to the equilibrium probability of each
state. A random trajectory from each of these states was then selected. In the even sampling algorithm, in each round ten random trajectories were added from each state. In the adaptive sampling
algorithm, in each round we determined which state would decrease the variance of λ2 the most
(Eq. 6.25) and added 60 random trajectories from that state. For each of the sampling algorithms,
this procedure was repeated until all the trajectories from any state were selected.
Figure 6.6 (top panel) shows the variance of λ2 as a function of the number of samples for the
equilibrium, even, and adaptive sampling algorithms. Plotted are the mean and standard deviation
of the variance of λ2 over 20 independent trials of each of the three algorithms. It is clear that
the adaptive sampling algorithm outperforms the even sampling algorithm on average by a factor
of around five. The equilibrium sampling strategy performs very poorly for this example, with a variance one to two orders of magnitude larger than the even or adaptive sampling strategies for the same number of total simulations.

Figure 6.6: The mean and standard deviation of the variance of the largest non-unit eigenvalue of the six-state model of the terminally blocked alanine peptide (top panel) and the 2454-state model of the villin headpiece (bottom panel) as a function of the total number of samples for twenty independent trials each of the equilibrium (green dots), even (blue circles), and adaptive (purple squares) sampling algorithms.
Villin headpiece
The adaptive sampling algorithm was also performed on a 2454-state model built from simulation
data of the 36-residue alpha-helical villin headpiece. A complete description of the simulation
details and Markovian state model construction is given by Jayachandran et al. [JVP06]. This
model was constructed at a lag time of 10 ns and has been shown to reproduce the simulation data at
this resolution. Because the median number of transition samples from a state was less than 500, we
chose to run the sampling algorithms on the dataset with replacement. As in the alanine example,
we compare the variance of the largest non-unit eigenvalue for the three sampling algorithms.
The prior is initialized to αij = 1/2454 and for each sampling algorithm we began by selecting
ten transition samples at random from each state (with replacement). In the equilibrium sampling
algorithm, the equilibrium distribution of each state was calculated from the transition probability
matrix, obtained from Jayachandran et al. [JVP06]. In each round, the initial state for each of 2454
additional samples was selected with probability equal to the equilibrium probability of the state.
A random transition sample from each of these states was then selected. In the even sampling
algorithm, in each round, one random sample was added from each state. In the adaptive sampling
algorithm, in each round we determined which state would decrease the variance of λ2 the most
(Eq. 6.25) and added 2454 random transition samples from that state. For each of the sampling
algorithms, this procedure was repeated for a total of 100 rounds.
Figure 6.6 (bottom panel) shows the mean and standard deviation of the variance of λ2 as a function of
the total number of transition samples for 20 independent trials each of the equilibrium, even, and
adaptive sampling algorithms. After a few rounds, the variance of λ2 from the adaptive sampling
algorithm is over three orders of magnitude less than the variance from the equilibrium and even
sampling algorithms.
6.4 Discussion and Conclusions
Given that we can generate a large number of molecular dynamics trajectories using distributed
computing methods, such as Folding@Home, it is important to develop efficient techniques for
analyzing the data. One compact way to model the data is to build a graph of the important states
of the molecule and model kinetics as a Markov chain on this graph. In the previous chapter, we
discussed methods for calculating the uncertainties due to finite sampling in kinetic properties such
as the mean first passage time and other solutions to linear equations that can be calculated from the
model.
However, many applications of the model use the eigenvalues and eigenvectors of the transition matrix, since these correspond to the aggregate rates of the system and to the states that participate in those rates. For example, the eigenvalues directly correspond to the rates between sets of states and thus can be compared with experiments; the eigenvalues and their implied time scales are used in tests for Markovian behavior [SPS04a]; and the eigenvectors guide a clustering of states based
on kinetic similarity [SH02]. In all these applications, it is useful to know the uncertainty of the
eigenvalues and eigenvectors, since the uncertainties may influence any decisions made with these
values.
This chapter gives efficient closed-form solutions for the uncertainty in the eigenvalues and
eigenvectors of the transition matrix caused by finite sampling. By making two simple approximations, that the distribution of transition probabilities is well approximated by a multivariate normal
distribution and that a first-order Taylor series expansion is adequate to describe the eigenvalues
and eigenvectors, we have shown how to calculate the distribution of eigenvectors and eigenvalues.
The closed-form solutions can be calculated in roughly the same amount of time as simply solving for the eigenvalues and eigenvectors of the expected transition matrix, and are therefore much more efficient than any sampling-based scheme. The method is thus scalable to large systems with many
states. In addition, we may expect the transition counts to be sparse, and as we previously discussed
(Sec. 5.2.6) [SP05a], it is possible to leverage sparse matrix techniques to solve for a limited number
of eigenvalues and eigenvectors of the system [Ruh98, LSY98], and to update the transition counts
using bordered systems [BR74].
We have shown on a simple alanine peptide system that the distributions of eigenvectors and
eigenvalues are in good agreement with those obtained from sampling possible transition probability matrices and solving for the eigenvalues and eigenvectors of each sample. An additional
benefit of the closed-form solutions is that they automatically account for changes in the rank of
the eigenvalues due to perturbations in the transition matrix. There is no need to determine the
correspondence of eigenvalues between different samples of the transition matrix. While we only
presented results on this six-state system for ease of visualization, we have tested these methods on
larger systems with similar results.
One downside of these methods is that they assume the counts from state i to state j are all
independently observed. However, this assumption is only used in the derivation of the posterior
CHAPTER 6. EIGENVALUE AND EIGENVECTOR ERROR ANALYSIS
131
transition probability distribution. If we were to relax this assumption, we could still use the Taylor
series approximations with samples from whatever distribution we believe the transition probabilities arise from. For example, if we have data at shorter intervals than the lag time ∆t, we could use
overlapping segments to calculate the transition counts. These counts would no longer be independent, but as long as a multivariate normal approximation to their distribution could be calculated,
the closed-form distributions derived here could easily be modified. Some other properties, such
as enforcing detailed balance, may not easily be approximated by multivariate normal distributions.
In these cases, if one can generate samples from the distribution, one can substitute these into Eqs.
6.13 and 6.14 to approximate the eigenvalues and eigenvectors of interest. This is still much more
efficient than solving for the eigenvalues and eigenvectors directly for each sample.
The variances of the distributions for the eigenvalues and eigenvectors decompose easily into
contributions relating to the transition probabilities from each state. We showed how to use adaptive
sampling techniques to leverage this information and intelligently select simulations from our data
set to reduce the variance of the largest non-unit eigenvalue for the six-state alanine system and for
a 2454-state model of the villin headpiece. The gain in precision or reduction in the total number
of samples for the alanine system was modest because there were only six states in the model.
However, for the villin system, the gain in precision from the adaptive sampling algorithm was over
three orders of magnitude. While most of the states in the villin system were moderately populated
at equilibrium, the largest non-unit eigenvalue was only sensitive to the transitions from a handful of
states. We fully expect to see similar benefits for other molecular systems. The ability to calculate
errors in a closed-form manner allows one to easily perform adaptive sampling techniques to reduce
the uncertainty. Because the required simulation time for sampling one transition is on the order of
one CPU day for many protein systems, an iterative, adaptive sampling algorithm will be easy to
integrate into a distributed computing framework, such as Folding@Home.
In conclusion, we have developed error analysis methods to calculate the distributions of eigenvalues and eigenvectors in a Markovian state model caused by finite sampling. We have shown that
the approximate solutions are in good agreement with the actual distributions, and are computationally far more efficient. We have also shown how to perform adaptive sampling to reduce the
computational cost needed to build a model with a given precision.
Chapter 7
Conclusions
In this thesis, we have investigated a method for efficiently using large amounts of molecular dynamics simulation data to study the kinetic properties of the molecular system. The Markovian state
model, introduced in Chapter 2, is a powerful framework for studying different biological properties through simulations, and models the dynamics of the system as Markovian stochastic transitions
between discrete states in the conformation space of the molecule. By dividing the conformation
space into states and only characterizing transitions at short times, it is possible to use many short
trajectories, which can be efficiently simulated using a distributed network such as Folding@Home.
The MSM captures complex kinetic behavior, and scales to systems with many intermediate and
trap-like states.
In this dissertation, we described numerous important tools for building a Markovian state model
from molecular dynamics simulation data. Chapter 2 introduced the basic idea of the MSM and described how important kinetic properties can be efficiently calculated from the model. In Chapter 3,
we presented a novel algorithm which used both structural characteristics and the kinetic information encoded in the trajectories to automatically find states in the conformation space such that the
transitions between them will be Markovian, given a target number of states. Chapter 4 addressed
the problem of determining the appropriate number of states in the MSM. We defined a Bayesian
scoring function which, depending on both the transitions and amount of simulation data, could
determine the correct complexity of the MSM. There are many kinetic properties which one can
calculate from a Markovian state model, and Chapters 5 and 6 gave closed-form functions for the
uncertainties in the mean first passage time to the final state, equilibrium distributions, and reaction
rates. These error analysis methods were several orders of magnitude faster than previous methods
to determine the uncertainties in these values. If we wish to reduce these uncertainties, we
can perform more simulations, and Chapters 5 and 6 also introduced a new sequential sampling
algorithm, which used the error estimates to reduce the amount of sampling by up to three orders of
magnitude depending on the system size.
When put together, these methods give a powerful new way to perform simulations to study
important biological problems. We first explore the configuration space by starting simulations
from a broad sampling of configurations. Then, we can build state decompositions (Chapter 3) and
determine the appropriate number of states for the data (Chapter 4). To refine the model, we can
selectively start more simulations from those areas that contribute the most to the error in whatever
property we are interested in studying (Chapters 5 and 6). As we add more simulation data, both
the appropriate number of states and the state decomposition will need to be updated. In this way,
we can greatly reduce the amount of simulation required to study important biological properties.
Appendix A
Sampling from a Dirichlet distribution
If p is distributed by a Dirichlet distribution with parameters u, then

Dirichlet(p; u) = (1/Z(u)) ∏_{i=1}^{K} p_i^{u_i − 1},

where Z(u) is a normalizing constant defined as

Z(u) = ∏_{i=1}^{K} Γ(u_i) / Γ(∑_{i=1}^{K} u_i).    (A.1)
We can sample from p by using the fact that if Y_1, ..., Y_K are independent gamma random variables with parameters u_i > 0, respectively, and Y_0 = ∑_{i=1}^{K} Y_i, then p_1 = Y_1/Y_0, ..., p_K = Y_K/Y_0 are distributed by the Dirichlet distribution with parameters u_1, u_2, ..., u_K [Dev86]. There are many algorithms to sample from a gamma distribution. We use Best's rejection algorithm (1978) for parameters greater than one and Best's RGS algorithm (1983) for parameters less than or equal to one [Dev86].
Each Dirichlet distribution sample needs K gamma distribution samples. For the gamma sampling algorithms chosen, each gamma sample requires two normal distribution samples and has a
rejection constant of less than four. We ignore the time taken for mathematical functions like exponentiation and square roots. Therefore, the maximum expected time to generate one gamma sample
is O(8Q), and the expected time to generate one Dirichlet sample is O(8KQ), where Q is the time
taken to sample from a normal distribution.
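The gamma-to-Dirichlet construction can be sketched as follows. Note that NumPy's generator uses its own internal gamma algorithms rather than Best's rejection and RGS algorithms, so this illustrates only the Y_i/Y_0 construction; the function name is ours.

```python
import numpy as np

def sample_dirichlet(u, rng):
    """One Dirichlet(u) sample via independent gamma draws [Dev86].

    If Y_1, ..., Y_K are independent Gamma(u_i) variables and
    Y_0 = sum_i Y_i, then p_i = Y_i / Y_0 is Dirichlet(u)-distributed.
    """
    y = rng.gamma(shape=np.asarray(u, dtype=float))
    return y / y.sum()

rng = np.random.default_rng(42)
p = sample_dirichlet([1.0 / 6.0] * 6, rng)   # e.g. prior alpha_ij = 1/6, K = 6
```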
Appendix B
Sampling from a Multivariate Normal distribution
To sample from MVN(µ, Σ), we use the fact that if we have a vector

y ∼ MVN(µ, Σ),    (B.1)

and we define a new vector

y' = Ry + b,    (B.2)

then y' is distributed as [Rao73]

y' ∼ MVN(Rµ + b, RΣR^T).    (B.3)

To generate samples from MVN(µ_i, Σ_i), we first generate a sample

y ∼ MVN(0, I),    (B.4)

and then calculate

p_i = L_i y + µ_i,    (B.5)

where

L_i L_i^T = Σ_i.    (B.6)
There will be an initial cost of (1/6)K^3 to perform the decomposition given in Eq. B.6. Then,
for each desired sample, we need K independent samples from the normal distribution N(0, 1), and K^2 multiplications in Eq. B.5. Thus, the total time per sample is O(KQ + K^2), where Q is the time to sample from the normal distribution.
However, the covariance matrices of the MVN distributions in which we are interested are structured; they are rank-one updates of diagonal matrices (Eq. 5.11). Therefore, it is possible to implicitly calculate the matrix L_i in Eq. B.6 in time O(K), and to perform the multiplication in Eq. B.5 in time O(K), using slight modifications of the methods described by Gill et al. [GGMS74]. Thus, the revised total time per sample is O(KQ + K).
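Equations B.4–B.6 translate directly into a short sketch; here a dense Cholesky factorization stands in for the structured O(K) shortcut, and the function name is ours.

```python
import numpy as np

def sample_mvn(mu, Sigma, rng):
    """One sample from MVN(mu, Sigma) via Eqs. B.4-B.6."""
    L = np.linalg.cholesky(Sigma)          # L L^T = Sigma       (Eq. B.6)
    y = rng.standard_normal(len(mu))       # y ~ MVN(0, I)       (Eq. B.4)
    return L @ y + mu                      # p_i = L_i y + mu_i  (Eq. B.5)
```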
Appendix C
MFPT sensitivity analysis
In this appendix we show how the terms in the Taylor series expansion in Eq. 5.14 can be computed
efficiently using adjoint systems. For details, see Vlach and Singhal [VS83].
Consider a system of linear equations of the form

Ax = b.    (C.1)

We are interested in the sensitivity of x to small perturbations in the elements of A, a_ij. Let us differentiate each side of the equation with respect to a_ij:

(∂A/∂a_ij) x + A (∂x/∂a_ij) = 0.    (C.2)

The derivative of the A matrix with respect to one of the entries is simply a matrix with a unit value at position (i, j) and zeros elsewhere. This is compactly written as

∂A/∂a_ij = e_i e_j^T,    (C.3)

where e_i is the basis vector given by the ith column of the identity matrix. Thus, substituting Eq. C.3 into Eq. C.2 gives

∂x/∂a_ij = −A^{-1} e_i e_j^T x.    (C.4)

For our application, we are only interested in the sensitivity of the first element of x, x_1:

∂x_1/∂a_ij = e_1^T (∂x/∂a_ij) = −e_1^T A^{-1} e_i e_j^T x.    (C.5)

Let us define the column vector ϕ as

ϕ^T = e_1^T A^{-1}.    (C.6)

Substituting Eq. C.6 into Eq. C.5 gives

∂x_1/∂a_ij = −ϕ^T e_i e_j^T x = −ϕ_i x_j.    (C.7)

The vector ϕ is called the adjoint vector and is the solution of the adjoint system of equations

A^T ϕ = e_1.    (C.8)
We wish to evaluate the ∂x_1/∂a_ij terms at the matrix Ā, which corresponds to the expected values of the parameters. Determination of the vectors x̄ = x|_Ā and ϕ̄ = ϕ|_Ā involves solving the sets of linear equations

Ā x̄ = b,    (C.9)

Ā^T ϕ̄ = e_1.    (C.10)

We can then find all the ∂x_1/∂a_ij|_Ā terms by simply taking the outer product of ϕ̄ and x̄: ϕ̄ x̄^T.
We can calculate the vector x̄ by solving Eq. C.9 in O((1/3)K^3) operations by factoring Ā into its LU factors [GvL96]. The solution of the adjoint system in Eq. C.10 only requires O(K^2) operations, as the same LU factors can be used. All the terms ∂x_1/∂a_ij|_Ā can be computed from these two solutions in O(K^2) operations.
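As a dense illustration of the adjoint procedure, with generic linear solves standing in for the shared LU factors (the function name is hypothetical):

```python
import numpy as np

def x1_sensitivities(A, b):
    """All dx_1/da_ij terms from one forward and one adjoint solve.

    x solves A x = b (Eq. C.9); phi solves A^T phi = e_1 (Eq. C.10);
    then dx_1/da_ij = -phi_i x_j (Eq. C.7), i.e. -outer(phi, x).
    """
    K = len(b)
    x = np.linalg.solve(A, b)
    phi = np.linalg.solve(A.T, np.eye(K)[:, 0])
    return -np.outer(phi, x), x
```

A finite-difference check against a perturbed matrix confirms the formula on a small example.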
Appendix D
Solving a bordered sparse matrix
In this appendix, we show how to efficiently solve the system of linear equations given in Eq. 5.30,

[F, c; 1^T, −1] [x; y] = [Wb; 0],

where F is a sparse matrix and rows are separated by semicolons, by using the LU factors of F, L_F and U_F, to find the LU factors of the matrix on the left-hand side of Eq. 5.30, which we will name G. We are looking for factors L_G and U_G such that

[L_11, 0; l_21, l_22] [U_11, u_12; 0, 1] = [F, c; 1^T, −1] = G,    (D.1)

with L_G and U_G the two factors on the left, where we place zeros and ones such that L_G and U_G are lower and upper triangular, respectively, and U_G has unit entries along the diagonal. We have the following four conditions from Eq. D.1:

L_11 U_11 = F,    L_11 u_12 = c,    l_21 U_11 = 1^T,    l_21 u_12 + l_22 = −1.    (D.2)
To satisfy the first equation, we simply set L_11 and U_11 to the LU factors of F, L_F and U_F. The other three equations are then easily solved for u_12, l_21, and l_22 via either forward or back substitution. From these LU factors, we can efficiently solve Eq. 5.30 and therefore our original equations for the mean first passage times.
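For illustration, the same bordered system can be solved with two solves against F; this Schur-complement formulation is algebraically equivalent to reusing F's LU factors, though it is a dense sketch rather than the sparse implementation described here.

```python
import numpy as np

def solve_bordered(F, c, rhs):
    """Solve [[F, c], [1^T, -1]] [x; y] = [rhs; 0] with two solves against F.

    Eliminating y leaves a scalar Schur complement, so both solves could
    share one (sparse) LU factorization of F, as in Appendix D.
    """
    x1 = np.linalg.solve(F, rhs)       # F x1 = rhs
    x2 = np.linalg.solve(F, c)         # F x2 = c
    y = x1.sum() / (1.0 + x2.sum())    # border row: 1^T x - y = 0
    return x1 - y * x2, y
```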
Appendix E
Eigenvalue sensitivity analysis
In this appendix we show how the terms in the Taylor series expansion in Eq. 6.13 can be computed
efficiently. For details, see Vlach and Singhal [VS83].
The objective is to find ∂λ/∂p_ij, where λ is an eigenvalue and is defined such that

det(P − λI) = 0,    (E.1)

where we name the matrix P − λI as A. Consider the factors of A,

A = LU,    (E.2)

where L is a lower triangular matrix and U is an upper triangular matrix with unit entries along the diagonal. The determinant of a product is the product of the determinants,

det(A) = det(L) det(U).    (E.3)

The determinants of triangular matrices are simply the product of the diagonal elements. Since the matrix U has unit values along its diagonal, its determinant is equal to one. Thus, for the determinant of A to equal zero, the matrix L must have a zero element along its diagonal. Assume that this zero element is in the last row: l_KK = 0 (partial or full pivoting may be needed to ensure this [VS83]).
We can relate the partial derivative of λ to the derivative of l_KK using the chain rule,

dl_KK/dp_ij = (∂l_KK/∂λ)(∂λ/∂p_ij) + ∂l_KK/∂p_ij = 0,    (E.4)

where the derivative must equal zero since the value of l_KK is fixed at zero for λ to be an eigenvalue. To find the terms ∂l_KK/∂λ and ∂l_KK/∂p_ij, we differentiate Eq. E.2 above by a general parameter h,

∂A/∂h = (∂L/∂h) U + L (∂U/∂h).    (E.5)
We define vectors x and x^a as follows:

Ux = e_K,    L^T x^a = 0,    (E.6)

where e_K is the column vector corresponding to the Kth column of the identity matrix. We force a non-trivial solution for x^a by setting

x^a_K = 1.    (E.7)
We then pre- and post-multiply Eq. E.5 by the vectors x^a and x:

(x^a)^T (∂A/∂h) x = (x^a)^T (∂L/∂h) Ux + (x^a)^T L (∂U/∂h) x.    (E.8)

Substituting the definitions in Eq. E.6 into Eq. E.8 gives

(x^a)^T (∂A/∂h) x = (x^a)^T (∂L/∂h) e_K + 0^T (∂U/∂h) x.    (E.9)

The first term on the right-hand side can be reduced since L is a lower triangular matrix. Post-multiplying its derivative by e_K gives a vector in which all entries are zero, except the last one, which is ∂l_KK/∂h. We pre-multiply by (x^a)^T, which gives x^a_K · ∂l_KK/∂h = ∂l_KK/∂h, since we defined x^a_K = 1 in Eq. E.7. The second term on the right-hand side of Eq. E.9 is equal to zero. Therefore, Eq. E.9 can be rewritten as

(x^a)^T (∂A/∂h) x = ∂l_KK/∂h.    (E.10)
We can now calculate the remaining terms in Eq. E.4 by setting h in Eq. E.10 equal to either λ or p_ij,

((x^a)^T (∂A/∂λ) x) (∂λ/∂p_ij) + (x^a)^T (∂A/∂p_ij) x = 0.    (E.11)

Matrix A is defined in Eq. 6.2, and it is easy to see that ∂A/∂λ = −I and ∂A/∂p_ij = e_i e_j^T. Therefore,

∂λ/∂p_ij = (x^a)^T e_i e_j^T x / ((x^a)^T x) = x^a_i x_j / ((x^a)^T x).    (E.12)

We wish to evaluate the ∂λ/∂p_ij terms at the matrix Ā, which corresponds to the expected values of the parameters. Determination of the vectors x̄ = x|_Ā and x̄^a = x^a|_Ā involves decomposing the matrix Ā into factors L̄ and Ū and then solving the following sets of linear equations:

Ū x̄ = e_K,    (E.13)

L̄^T x̄^a = 0.    (E.14)

We can then find all the ∂λ/∂p_ij|_Ā terms by simply normalizing the outer product of x̄^a and x̄: x̄^a x̄^T / ((x̄^a)^T x̄). Factoring Ā into its LU factors takes time O((1/3)K^3). The solutions of Eqs. E.13 and E.14 take time O(K^2) for forward or backward substitution. All the terms ∂λ/∂p_ij|_Ā can be computed from these two solutions in O(K^2) operations. This must be done independently for each eigenvalue for which the uncertainty is desired.
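Since x and x^a are, up to scaling, the right and left eigenvectors of P for the eigenvalue in question, Eq. E.12 can be checked numerically with a dense eigensolver standing in for the LU-based construction; this sketch and its finite-difference test are illustrative only.

```python
import numpy as np

def eigenvalue_sensitivities(P, k):
    """All dlambda_k/dp_ij = x^a_i x_j / ((x^a)^T x) terms (Eq. E.12).

    x and x^a are, up to scale, the right and left eigenvectors of P for
    lambda_k; a dense eigensolver stands in for the LU-based route.
    """
    w, vr = np.linalg.eig(P)
    order = np.argsort(-np.real(w))        # sort eigenvalues descending
    lam = np.real(w[order[k]])
    x = np.real(vr[:, order[k]])           # right eigenvector
    wl, vl = np.linalg.eig(P.T)
    xa = np.real(vl[:, np.argmin(np.abs(wl - lam))])   # left eigenvector
    return np.outer(xa, x) / (xa @ x), lam
```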
Appendix F
Eigenvector sensitivity analysis
In this appendix, we show how the partial derivative terms in Eq. 6.14 can be computed efficiently.
We start with the eigenvector equation, Eq. 6.3:

A v_λ = 0.    (F.1)

To differentiate this equation with respect to a parameter p_ij, we must use the chain rule, since A is a function of p_ij and λ, and λ is a function of p_ij,

(∂A/∂p_ij + (∂A/∂λ)(∂λ/∂p_ij)) v_λ + A (∂v_λ/∂p_ij) = 0.    (F.2)

From Eq. 6.2, ∂A/∂λ = −I and ∂A/∂p_ij = e_i e_j^T. As derived in Appendix E, ∂λ/∂p_ij = x^a_i x_j / ((x^a)^T x), along with the definitions for x and x^a. We therefore have the system of linear equations,
A (∂v_λ/∂p_ij) = −(∂A/∂p_ij − (∂λ/∂p_ij) I) v_λ,    (F.3)

where all the terms on the right-hand side are known. We wish to evaluate the partial derivative terms at the matrix Ā, which corresponds to the expected values of the parameters,

Ā (∂v_λ/∂p_ij)|_Ā = −(∂A/∂p_ij − (∂λ/∂p_ij) I) v̄_λ.    (F.4)
However, the matrix Ā is singular, so we must enforce one additional constraint. As eigenvectors are only determined to within a constant factor, they are often normalized such that their magnitude is equal to one:

(v_λ)^T v_λ = 1.    (F.5)

Differentiating this constraint gives

2 (v_λ)^T (∂v_λ/∂p_ij) = 0.    (F.6)
Combining this constraint with Eq. F.4 gives the stacked system (rows separated by semicolons)

[Ā; (v̄_λ)^T] (∂v_λ/∂p_ij) = [−(∂A/∂p_ij − (∂λ/∂p_ij) I) v̄_λ; 0].    (F.7)

Equation F.7 can be separated into two parts based on the terms on the right-hand side. Substituting in the values for the partial derivatives of A and λ, we get

[Ā; (v̄_λ)^T] b_ij^1 = [−e_i e_j^T v̄_λ; 0],    [Ā; (v̄_λ)^T] b_ij^2 = [(x^a_i x_j / ((x^a)^T x)) v̄_λ; 0],    ∂v_λ/∂p_ij = b_ij^1 + b_ij^2.    (F.8)
Simple algebra shows that one can introduce the K vectors c_i and the vector d and solve the following sets of equations:

[Ā; (v̄_λ)^T] c_i = [−e_i; 0]  ∀i,    [Ā; (v̄_λ)^T] d = [v̄_λ; 0],    (F.9)

and then compute

∂v_λ/∂p_ij = (v̄_λ)_j c_i + (x̄^a_i x̄_j / ((x̄^a)^T x̄)) d.    (F.10)

We can solve each of the systems of equations in Eq. F.9 by augmenting the LU factors of Ā, calculated to determine the sensitivity of the eigenvalue λ, in time O(K^2). Since there are K + 1 equations, the total time to calculate all the sensitivity terms is O(K^3).
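A dense sketch of the augmented solve: a least-squares solver on the stacked (K+1) × K system replaces the augmented LU factors of Eq. F.9, and the function name is hypothetical.

```python
import numpy as np

def eigenvector_sensitivity(P, k, i, j):
    """dv_lambda/dp_ij via the augmented system of Eq. F.7 (a sketch).

    The (K+1) x K system stacks the singular matrix A = P - lambda*I with
    the normalization row v^T (Eqs. F.5-F.6); lstsq stands in for the
    augmented LU solves of Eq. F.9.
    """
    K = P.shape[0]
    w, vr = np.linalg.eig(P)
    order = np.argsort(-np.real(w))
    lam = np.real(w[order[k]])
    v = np.real(vr[:, order[k]])
    v /= np.linalg.norm(v)                               # v^T v = 1 (Eq. F.5)
    wl, vl = np.linalg.eig(P.T)
    xa = np.real(vl[:, np.argmin(np.abs(wl - lam))])     # left eigenvector
    dlam = xa[i] * v[j] / (xa @ v)                       # Eq. E.12
    A = P - lam * np.eye(K)
    M = np.vstack([A, v[None, :]])                       # [A; v^T]
    top = -np.eye(K)[i] * v[j] + dlam * v                # RHS of Eq. F.7
    rhs = np.concatenate([top, [0.0]])
    return np.linalg.lstsq(M, rhs, rcond=None)[0], lam
```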
Bibliography
[ABB+85]
Anjum Ansari, Joel Berendzen, Samuel F. Bowne, Hans Frauenfelder, Icko E. T.
Iben, Todd B. Sauke, Erramilli Shyamsunder, and Robert D. Young. Protein states
and proteinquakes. Proc. Nat. Acad. Sci. USA, 82:5000–5004, 1985.
[ABG+02]
M. Serkan Apaydin, Douglas L. Brutlag, Carlos Guestrin, David Hsu, and Jean-Claude Latombe. Stochastic roadmap simulation: An efficient representation and algorithm for analyzing molecular motion. In Proc. ACM Int. Conf. on Computational Biology (RECOMB), pages 12–21, 2002.
[AD81]
John E. Adams and Jimmie D. Doll. Dynamical aspects of precursor state kinetics.
Surface Science, 111:492–502, 1981.
[AFC99]
Joannis Apostolakis, Philippe Ferrara, and Amedeo Caflisch. Calculation of conformational transitions and barriers in solvated systems: Application to the alanine
dipeptide in water. J. Chem. Phys., 110(4):2099–2108, 1999.
[AFGL05]
Michael Andrec, Anthony K. Felts, Emilio Gallicchio, and Ronald M. Levy. Protein
folding pathways from replica exchange simulations and a kinetic network model.
Proc. Natl. Acad. Sci. USA, 102:6801–6806, 2005.
[AT91]
M. P. Allen and D. J. Tildesley. Computer simulation of liquids. Clarendon Press,
Oxford, 1991.
[BB98]
Keith D. Ball and R. Stephen Berry. Realistic master equation modeling of relaxation on complete potential energy surfaces: Kinetic results. J. Chem. Phys.,
109(19):8557–8572, 1998.
[BDC00]
Peter G. Bolhuis, Christoph Dellago, and David Chandler. Reaction coordinates of
biomolecular isomerization. Proc. Natl. Acad. Sci., 97(11):5877–5882, 2000.
[BF89]
Y. S. Bai and M. D. Fayer. Time scales and optical dephasing measurements: Investigation of dynamics in complex systems. Phys. Rev. B, 39:11066–11084, 1989.
[BHH78]
George E. P. Box, William G. Hunter, and J. Stuart Hunter. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New
York, 1978.
[BK97]
Oren M. Becker and Martin Karplus. The topology of multidimensional potential
energy surfaces: Theory and application to peptide structure and kinetics. J. Chem.
Phys., 106(4):1495–1517, 1997.
[BMDW06] David D. Boehr, Dan McElheny, H. Jane Dyson, and Peter E. Wright. The dynamic energy landscape of dihydrofolate reductase catalysis. Science, 313:1638–
1642, 2006.
[Bol03]
Peter G Bolhuis. Transition-path sampling of beta-hairpin folding. Proc. Natl. Acad.
Sci., 100(21):12129–12134, Oct 2003.
[BPvG+84] H. J. C. Berendsen, J. P. M. Postma, W. F. van Gunsteren, A. DiNola, and J. R. Haak.
Molecular dynamics with coupling to an external bath. J. Chem. Phys., 81(8):3684–
3690, 1984.
[BR74]
J. R. Bunch and D. R. Rose. Partitioning, tearing and modification of sparse linear
systems. J. Math. Anal. Appl., 48:574–593, 1974.
[BS05]
Alexander Berezhkovskii and Attila Szabo. One-dimensional reaction coordinates for
diffusive activated rate processes in many dimensions. J. Chem. Phys., 122:014503,
2005.
[CE90]
Ryszard Czerminski and Ron Elber. Reaction path study of conformational transitions
in flexible systems: Application to peptides. J. Chem. Phys., 92(9):5580–5601, 1990.
[CE05]
Jean-Pierre Changeux and Stuart J. Edelstein. Allosteric mechanisms of signal transduction. Science, 308:1424–1428, 2005.
[CH92]
Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of
probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.
[Cha78]
David Chandler. Statistical mechanics of isomerization dynamics in liquids and the
transition state approximation. J. Chem. Phys., 68(6):2959–2970, 1978.
[Cha87]
David Chandler. Introduction to Modern Statistical Mechanics. Oxford University
Press, New York, 1987.
[CIL04]
Dmitry S. Chekmarev, Tateki Ishida, and Ronald M. Levy. Long-time conformational
transitions of alanine dipeptide in aqueous solution: Continuous and discrete-state
kinetic models. J. Phys. Chem. B, 108:19487–19495, 2004.
[CSP+06]
John D. Chodera, William C. Swope, Jed W. Pitera, Chaok Seok, and Ken A. Dill.
Use of the weighted histogram analysis method for the analysis of simulated and
parallel tempering simulations. Submitted to J. Chem. Theor. Comput., 2006.
[CSP+07]
John D Chodera, Nina Singhal, Vijay S Pande, Ken A Dill, and William C Swope.
Automatic discovery of metastable states for the construction of Markov models of
macromolecular conformational dynamics. J. Chem. Phys., 126(15):155101, Apr
2007.
[CSPD06a]
John D. Chodera, William C. Swope, Jed W. Pitera, and Ken A. Dill. Describing protein folding kinetics by molecular dynamics simulations. 3. Validation of state space
decomposition, with application to terminally-blocked alanine in explicit solvent. In
preparation, 2006.
[CSPD06b]
John D. Chodera, William C. Swope, Jed W. Pitera, and Ken A. Dill. Long-time protein folding dynamics from short-time molecular dynamics simulations. Multiscale
Model. Simul., 5:1214–1226, 2006.
[CSS01]
Andrea G. Cochran, Nicholas J. Skelton, and Melissa A. Starovasnik. Tryptophan
zippers: Stable, monomeric β-hairpins. Proc. Natl. Acad. Sci., 98(10):5578–5583,
2001.
[DBC98]
Christoph Dellago, Peter G. Bolhuis, and David Chandler. Efficient transition path
sampling: Application to the Lennard-Jones cluster rearrangements. J. Chem. Phys.,
108(22):9236–9245, 1998.
[DBCC98]
Christoph Dellago, Peter G. Bolhuis, Félix S. Csajka, and David Chandler. Transition
path sampling and the calculation of rate constants. J. Chem. Phys., 108(5):1964,
1998.
[DER86]
Iain S. Duff, Albert M. Erisman, and John K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, New York, 1986.
[Dev86]
Luc Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New
York, 1986.
[dGDMG01] Bert L. de Groot, Xavier Daura, Alan E. Mark, and Helmut Grubmüller. Essential
dynamics of reversible peptide folding: Memory-free conformational dynamics governed by internal hydrogen bonds. J. Mol. Biol., 309:299–313, 2001.
[DHFS00]
P. Deuflhard, W. Huisinga, A. Fischer, and Ch. Schütte. Identification of almost
invariant aggregates in reversible nearly uncoupled Markov chains. Linear Algebra
Appl., 315(1-3):39–59, August 2000.
[DL93]
V Daggett and M Levitt. Protein unfolding pathways explored through molecular
dynamics simulations. J. Mol. Biol., 232(2):600–619, Jul 1993.
[Dob03]
Christopher M. Dobson. Protein folding and misfolding. Nature, 426:884–890, 2003.
[DPG+ 98]
Rose Du, Vijay S. Pande, Alexander Yu. Grosberg, Toyoichi Tanaka, and Eugene S.
Shakhnovich. On the transition coordinate for protein folding. J. Chem. Phys.,
108(1):334–350, 1998.
[DW04]
Peter Deuflhard and Marcus Weber. Robust Perron cluster analysis in conformation
dynamics. Linear Algebra Appl., 398:161–184, Mar 2004.
[DWC+ 03]
Yong Duan, Chun Wu, Shibasish Chowdhury, Mathew C. Lee, Guoming Xiong, Wei
Zhang, Rong Yang, Piotr Cieplak, Ray Luo, Taisung Lee, James Caldwell, Junmei
Wang, and Peter Kollman. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J.
Comput. Chem., 24:1999–2012, 2003.
[EBAK02]
Elan Zohar Eisenmesser, Daryl A. Bosco, Mikael Akke, and Dorothee Kern. Enzyme
dynamics during catalysis. Science, 295:1520–1523, 2002.
[Efr79]
B. Efron. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1–26,
1979.
[EP04]
Sidney P. Elmer and Vijay S. Pande. Foldamer simulations: Novel computational
methods and applications to poly-phenylacetylene oligomers. J. Chem. Phys.,
121:12760, 2004.
[EPP05a]
Sidney P. Elmer, Sanghyun Park, and Vijay S. Pande. Foldamer dynamics expressed
via Markov state models. I. Explicit solvent molecular-dynamics simulations in acetonitrile, chloroform, methanol, and water. J. Chem. Phys., 123:114902, 2005.
[EPP05b]
Sidney P. Elmer, Sanghyun Park, and Vijay S. Pande. Foldamer dynamics expressed
via Markov state models. II. State space decomposition. J. Chem. Phys., 123:114903,
2005.
[EW04]
David A. Evans and David J. Wales. Folding of the GB1 hairpin peptide from discrete
path sampling. J. Chem. Phys., 112(2):1080, 2004.
[FE04]
Aton K. Faradjian and Ron Elber. Computing time scales from reaction coordinates
by milestoning. J. Chem. Phys., 120(23):10880–10889, 2004.
[Fer02]
Alan R. Fersht. On the simulation of protein folding by short time scale molecular
dynamics and distributed computing. Proc. Nat. Acad. Sci. USA, 99:14122–14125,
2002.
[FGM+ 03]
B. G. Fitch, R. S. Germain, M. Mendell, J. Pitera, M. Pitman, A. Rayshubskiy,
Y. Sham, F. Suits, W. Swope, T. J. C. Ward, Y. Zhestkov, and R. Zhou. Blue Matter,
an application framework for molecular simulation on Blue Gene. J. Parallel Distrib.
Comput., 63:759–773, 2003.
[FMA+ 01]
Hans Frauenfelder, Benjamin H. McMahon, Robert H. Austin, Kevin Chu, and
John T. Groves. The role of structure, energy landscape, dynamics, and allostery in
the enzymatic function of myoglobin. Proc. Nat. Acad. Sci. USA, 98(5):2370–2374,
2001.
[GC02]
Jörg Gsponer and Amedeo Caflisch. Molecular dynamics simulations of protein folding from the transition state. Proc. Natl. Acad. Sci., 99(10):6719–6724, May 2002.
[GFR+ 05]
Robert S. Germain, Blake Fitch, Aleksandr Rayshubskiy, Maria Eleftheriou,
Michael C. Pitman, Frank Suits, Mark Giampapa, and T.J. Christopher Ward. Blue
Matter on Blue Gene/L: Massively parallel computation for biomolecular simulation.
In CODES+ISSS ’05: Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 207–212, New
York, NY, USA, 2005. ACM Press.
[GGMS74]
Philip E. Gill, Gene H. Golub, Walter Murray, and Michael A. Saunders. Methods
for updating matrix factorizations. Math. Comput., 28:505–535, 1974.
[GS02]
Angel E. García and Kevin Y. Sanbonmatsu. α-helical stabilization by side chain
shielding of backbone hydrogen bonds. Proc. Nat. Acad. Sci. USA, 99:2782–2787,
2002.
[GT94]
Helmut Grubmüller and Paul Tavan. Molecular dynamics of conformational substates
for a simplified protein model. J. Chem. Phys., 101(6):5047–5057, September 1994.
[GvL96]
Gene H. Golub and Charles F. van Loan. Matrix Computations. The Johns Hopkins
University Press, London, third edition, 1996.
[HGC95]
David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning,
20(3):197–243, 1995.
[HK03]
Gerhard Hummer and Ioannis G. Kevrekidis. Coarse molecular dynamics of a peptide
fragment: Free energy, kinetics, and long-time dynamics computations. J. Chem.
Phys., 118(23):10762–10773, June 2003.
[HM81]
Ronald. A. Howard and James. E. Matheson. Influence diagrams. In Ronald. A.
Howard and James. E. Matheson, editors, Readings on the Principles and Applications of Decision Analysis, volume II, pages 721–762. Strategic Decisions Group,
Menlo Park, CA, 1981.
[HP07]
Nina Singhal Hinrichs and Vijay S. Pande. Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics. J. Chem.
Phys., 126:244101, 2007.
[HS05]
Wilhelm Huisinga and Bernd Schmidt. Advances in Algorithms for Macromolecular Simulation, chapter Metastability and dominant eigenvalues of transfer operators.
Lecture Notes in Computational Science and Engineering. Springer, 2005.
[HTF01]
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning. Springer, 2001.
[Hui01]
Wilhelm Huisinga. Metastability of Markov systems: A transfer operator based approach in application to molecular dynamics. PhD thesis, Free University of Berlin,
Berlin, Germany, May 2001.
[Iba01]
Yukito Iba. Extended ensemble Monte Carlo. Int. J. Mod. Phys. C, 12:623, 2001.
[Jan02]
Wolfhard Janke. Statistical analysis of simulations: Data correlations and error estimation. In J. Grotendorst, D. Marx, and A. Muramatsu, editors, Quantum Simulations of
Complex Many-Body Systems: From Theory to Algorithms, volume 10, pages 423–
445. John von Neumann Institute for Computing, 2002.
[JCM+ 83]
W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein.
Comparison of simple potential functions for simulating liquid water. J. Chem. Phys.,
79:926, 1983.
[JKB97]
Norman L. Johnson, Samuel Kotz, and N. Balakrishnan. Discrete Multivariate Distributions. Wiley, New York, 1997.
[JMTR96]
W. L. Jorgensen, D. S. Maxwell, and J. Tirado-Rives. Development and testing of
the OPLS all-atom force field on conformational energetics and properties of organic
liquids. J. Am. Chem. Soc., 118(45):11225–11236, 1996.
[JVP06]
Guha Jayachandran, V. Vishal, and Vijay S. Pande. Using massively parallel simulation and Markovian models to study protein folding: examining the dynamics of the
villin headpiece. J. Chem. Phys., 124(16):164902, Apr 2006.
[KB95]
Ralph E. Kunz and R. Stephen Berry. Statistical interpretation of topographies and
dynamics of multidimensional potentials. J. Chem. Phys., 103(5):1904–1912, 1995.
[KBJ00]
Samuel Kotz, N. Balakrishnan, and Norman L. Johnson. Continuous Multivariate
Distributions. Wiley, New York, 2000.
[KBS+ 92]
Shankar Kumar, Djamal Bouzida, Robert H. Swendsen, Peter A. Kollman, and
John M. Rosenberg. The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J. Comput. Chem., 13(8):1011–1021, 1992.
[KDC+ 97]
P. A. Kollman, R. Dixon, W. Cornell, T. Fox, C. Chipot, and A. Pohorille. The
development/application of a “minimalist” organic/biochemical molecular mechanics
force field using a combination of ab initio calculations and experimental data. In
A. Wilkinson, P. Weiner, and W. F. van Gunsteren, editors, Computer Simulation of
Biomolecular Systems, volume 3, pages 83–96. Kluwer/Escom, 1997.
[KKS+ 06]
Peter M Kasson, Nicholas W Kelley, Nina Singhal, Marija Vrljic, Axel T Brunger,
and Vijay S Pande. Ensemble molecular dynamics yields submillisecond kinetics and
intermediates of membrane fusion. Proc. Natl. Acad. Sci., 103(32):11916–11921,
Aug 2006.
[KTB93]
Mary E. Karpen, Douglas J. Tobias, and Charles L. Brooks III. Statistical clustering
techniques for the analysis of long molecular dynamics trajectories: Analysis of 2.2ns trajectories of YPGDV. Biochemistry, 32:412–420, 1993.
[KZP+ 07]
Peter M. Kasson, Afra Zomorodian, Sanghyun Park, Nina Singhal, Leonidas J.
Guibas, and Vijay S. Pande. Persistent voids: A new structural metric for membrane
fusion. Bioinformatics, 23(14):1753–1759, 2007.
[LJB01]
Yaakov Levy, Joshua Jortner, and Oren M. Becker. Dynamics of hierarchical folding
on energy landscapes of hexapeptides. J. Chem. Phys., 115(22):10533–10547, 2001.
[LJB02]
Yaakov Levy, Joshua Jortner, and R. Stephen Berry. Eigenvalue spectrum of the
master equation for hierarchical dynamics of complex systems. Phys. Chem. Chem.
Phys., 4:5052–5058, 2002.
[LK92]
David J. Lockhart and Peter S. Kim. Internal Stark effect measurement of the electric
field of the amino terminus of an α helix. Science, 257:947–951, 1992.
[LK93]
David J. Lockhart and Peter S. Kim. Electrostatic screening of charge and dipole
interactions with the helix backbone. Science, 260:198–202, 1993.
[LKSA01]
Igor K. Lednev, Anton S. Karnoup, Mark C. Sparrow, and Sanford A. Asher. Transient UV Raman spectroscopy finds no crossing barrier between the peptide α-helix
and fully random coil conformation. J. Am. Chem. Soc., 123:2388–2392, 2001.
[LS01]
Lewyn Li and Eugene I. Shakhnovich. Constructing, verifying, and dissecting the
folding transition state of chymotrypsin inhibitor 2 with all-atom simulations. Proc.
Natl. Acad. Sci., 98(23):13014–13018, Nov 2001.
[LSY98]
R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users’ Guide: Solution of
Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM
Publications, Philadelphia, 1998.
[LZSP04]
Peter Lenz, Bojan Zagrovic, Jessica Shapiro, and Vijay S. Pande. Folding probabilities: A novel approach to folding transitions and the two-dimensional Ising model. J.
Chem. Phys., 120(14):6769–6778, April 2004.
[Mac67]
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical statistics and
probability, pages 281–297. University of California Press, 1967.
[MD05]
Ao Ma and Aaron R. Dinner. Automatic method for identifying reaction coordinates
in complex systems. J. Phys. Chem. B, 109:6769–6779, 2005.
[MEW02]
Paul N. Mortenson, David A. Evans, and David J. Wales. Energy landscapes of model
polyalanines. J. Chem. Phys., 117(3):1363–1376, 2002.
[MMGD06] Nazli Maki, Karobi Moitra, Pratiti Ghosh, and Saibal Dey. Allosteric modulation
bypasses the requirement for ATP hydrolysis in regenerating low affinity transition
state conformation of human P-glycoprotein. J. Biol. Chem., 281(16):10769–10777,
2006.
[MRR+ 53]
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21(6):1087–
1092, 1953.
[MSF05]
Eike Meerbach, Christof Schütte, and Alexander Fischer. Eigenvalue bounds on restrictions of reversible nearly uncoupled Markov chains. Linear Algebra Appl., 398:141–
160, 2005.
[MvEB04]
Daniele Moroni, Titus S. van Erp, and Peter G. Bolhuis. Investigating rare events by
transition interface sampling. Physica A, 340:395–401, 2004.
[MW01]
Paul N. Mortenson and David J. Wales. Energy landscapes, global optimization and
dynamics of the polyalanine Ac(ala)8 NHMe. J. Chem. Phys., 114(14):6443–6454,
2001.
[NHSS07]
Frank Noé, Illia Horenko, Christof Schütte, and Jeremy C. Smith. Hierarchical analysis of conformational dynamics in biomolecules: Transition networks of metastable
states. J. Chem. Phys., 126(15):155102, Apr 2007.
[ODB02]
S. Banu Ozkan, Ken A. Dill, and Ivet Bahar. Fast-folding protein kinetics, hidden
intermediates, and the sequential stabilization model. Protein Science, 11:1958–1970,
2002.
[OSW77]
Irwin Oppenheim, Kurt E. Shuler, and George H. Weiss, editors. Stochastic processes
in chemical physics: The master equation. MIT Press, 1977.
[PBC+ 03]
Vijay S. Pande, Ian Baker, Jarrod Chapman, Sidney P. Elmer, Siraj Khaliq, Stefan M.
Larson, Young Min Rhee, Michael R. Shirts, Christopher D. Snow, Eric J. Sorin, and
Bojan Zagrovic. Atomistic protein folding simulations on the submillisecond time
scale using worldwide distributed computing. Biopolymers, 68:91–109, 2003.
[Pe’03]
Dana Pe’er. From Gene Expression to Molecular Pathways. PhD thesis, Hebrew
University, 2003.
[Pea88]
Judea Pearl. Probabilistic Reasoning in Intelligent Systems : Networks of Plausible
Inference. Morgan Kaufmann, September 1988.
[PGTR98]
Vijay S. Pande, Alexander Yu Grosberg, Toyoichi Tanaka, and Daniel S. Rokhsar.
Pathways for protein folding: Is a “new view” needed? Curr. Opin. Struct. Biol.,
8(1):68–79, 1998.
[PHS06]
Jed W. Pitera, Imran Haque, and William C. Swope. Absence of reptation in the high-temperature folding of the trpzip2 β-hairpin peptide. J. Chem. Phys., 124:141102,
2006.
[PP06]
Sanghyun Park and Vijay S. Pande. Validation of Markov state models using Shannon’s entropy. J. Chem. Phys., 124:054118, 2006.
[PR99]
Vijay S. Pande and Daniel S. Rokhsar. Molecular dynamics simulations of unfolding
and refolding of a beta-hairpin fragment of protein G. Proc. Natl. Acad. Sci. USA,
96:9062–9067, 1999.
[QSHS97]
D. Qiu, P. S. Shenkin, F. P. Hollinger, and W. C. Still. The GB/SA continuum model
for solvation. A fast analytical method for the calculation of approximate Born radii.
J. Phys. Chem. A, 101(16):3005–3014, 1997.
[Rao73]
C. Radhakrishna Rao. Linear Statistical Inference and Its Applications. Wiley, New
York, second edition, 1973.
[RP05]
Young Min Rhee and Vijay S. Pande. One-dimensional reaction coordinate and the
corresponding potential of mean force from commitment probability distribution. J.
Phys. Chem. B, 109:6780–6786, 2005.
[Ruh98]
Axel Ruhe. Rational Krylov: A practical algorithm for large sparse nonsymmetric
matrix pencils. SIAM J. Sci. Comput., 19(5):1535–1551, 1998.
[SABW82]
William C. Swope, Hans C. Andersen, Peter H. Berens, and Kent R. Wilson. A computer simulation method for the calculation of equilibrium constants for the formation
of physical clusters of molecules: Application to small water clusters. J. Chem. Phys.,
76(1):637–649, 1982.
[SB01]
Joan-Emma Shea and Charles L. Brooks III. A review and assessment of simulation
studies of protein folding and unfolding. Annu. Rev. Phys. Chem., 52:499–535, 2001.
[Sch99]
Christof Schütte. Conformational dynamics: Modelling, theory, algorithm, and application to biomolecules. PhD thesis, Konrad-Zuse-Zentrum Berlin, Berlin, Germany, 1999.
[SF03]
Min-yi Shen and Karl F. Freed. Long time dynamics of met-enkephalin: Tests of
mode-coupling theory and implicit solvent models. J. Chem. Phys., 118(11):5143–
5156, 2003.
[SFHD99]
Ch. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard. A direct approach to conformational dynamics based on Hybrid Monte Carlo. J. Comput. Phys., 151:146–168,
1999.
[SH02]
Christof Schütte and Wilhelm Huisinga. Biomolecular conformations can be identified as metastable sets of molecular dynamics. In P. G. Ciaret and J.-L. Lions, editors,
Handbook of Numerical Analysis - special volume on computational chemistry, volume X. Elsevier, 2002.
[Sha96]
David Shalloway. Macrostates of classical stochastic systems. J. Chem. Phys.,
105(22):9986–10007, 1996.
[SHCT05]
Verena Schultheis, Thomas Hirschberger, Heiko Carstens, and Paul Tavan. Extracting
Markov models of peptide conformational dynamics from simulation data. J. Chem.
Theor. Comput., 2005.
[SKH05]
Saravanapriyan Sriraman, Ioannis G. Kevrekidis, and Gerhard Hummer. Coarse master equation from Bayesian analysis of replica molecular dynamics simulations. J.
Phys. Chem. B, 109:6479–6484, 2005.
[SO99]
Yuji Sugita and Yuko Okamoto. Replica-exchange molecular dynamics method for
protein folding. Chem. Phys. Lett., 314:141–151, 1999.
[SP00]
Michael Shirts and Vijay S. Pande. Screen savers of the world unite! Science,
290(5498):1903–1904, December 2000.
[SP05a]
Nina Singhal and Vijay S. Pande. Error analysis and efficient sampling in Markovian
state models for molecular dynamics. J. Chem. Phys., 123:204909, 2005.
[SP05b]
Eric J. Sorin and Vijay S. Pande. Empirical force-field assessment: The interplay
between backbone torsions and noncovalent term scaling. J. Comput. Chem., 26:682–
690, 2005.
[SP05c]
Eric J. Sorin and Vijay S. Pande. Exploring the helix-coil transition via all-atom
equilibrium ensemble simulations. Biophys. J., 88:2472–2493, 2005.
[SPS04a]
William C. Swope, Jed W. Pitera, and Frank Suits. Describing protein folding kinetics
by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B, 108:6571–6581,
2004.
[SPS+ 04b]
William C. Swope, Jed W. Pitera, Frank Suits, Mike Pitman, Maria Eleftheriou,
Blake G. Fitch, Robert S. Germain, Aleksandr Rayshubski, T. J. C. Ward, Yuriy Zhestkov, and Ruhong Zhou. Describing protein folding kinetics by molecular dynamics
simulations: 2. Example applications to alanine dipeptide and a beta-hairpin peptide.
J. Phys. Chem. B, 108:6582–6594, 2004.
[SQD+ 04]
Christopher D. Snow, Linlin Qiu, Deguo Du, Feng Gai, Stephen J. Hagen, and Vijay S. Pande. Trp zipper folding kinetics by molecular dynamics and temperature-jump spectroscopy. Proc. Natl. Acad. Sci. USA, 101(12):4077–4082, 2004.
[SSP04]
Nina Singhal, Christopher D. Snow, and Vijay S. Pande. Using path sampling to
build better Markovian state models: Predicting the folding rate and mechanism of a
tryptophan zipper beta hairpin. J. Chem. Phys., 121(1):415–425, 2004.
[STD+ 03]
G. Song, S. Thomas, K. Dill, J. Scholtz, and N. Amato. A path planning-based study
of protein folding with a case study of hairpin formation in protein G and L. Proc.
Pacific Symposium on Biocomputing (PSB), 2003.
[Ste02]
Boris Steipe. A revised proof of the metric properties of optimally superimposed
vector sets. Acta Cryst., A58:506, 2002.
[TEH97]
Peggy A. Thompson, William A. Eaton, and James Hofrichter. Laser temperature
jump study of the helix⇌coil kinetics of an alanine peptide interpreted with a ‘kinetic
zipper’ model. Biochem., 36:9200–9210, 1997.
[TGK96]
Donald G. Truhlar, Bruce C. Garrett, and Stephen J. Klippenstein. Current status of
transition-state theory. J. Phys. Chem., 100:12771–12800, 1996.
[The05]
Douglas L. Theobald. Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Cryst., A61:478–480, 2005.
[US98]
Alex Ulitsky and David Shalloway. Variational calculation of macrostate transition
rates. J. Chem. Phys., 109(5):1670–1686, August 1998.
[VD85]
Arthur F. Voter and Jimmie D. Doll. Dynamical corrections to transition state theory
for multistate systems: Surface self-diffusion in the rare-event regime. J. Chem.
Phys., 82(1):80–92, 1985.
[vEMB03]
Titus S. van Erp, Daniele Moroni, and Peter G. Bolhuis. A novel path sampling
method for the calculation of rate constants. J. Chem. Phys., 118(17):7762–7774,
2003.
[vK97]
N. G. van Kampen. Stochastic processes in physics and chemistry. Elsevier, second
edition, 1997.
[VS83]
Jiri Vlach and Kishore Singhal. Computer Methods for Circuit Analysis and Design.
Van Nostrand Reinhold, New York, 1983.
[WCG+ 96]
Skip Williams, Timothy P. Causgrove, Rudolf Gilmanshin, Karen S. Fang, Robert H.
Callender, William H. Woodruff, and R. Brian Dyer. Fast events in protein folding:
Helix melting and formation in a small peptide. Biochem., 35:691–697, 1996.
[WCK00]
Junmei Wang, Piotr Cieplak, and Peter A. Kollman. How well does a restrained
electrostatic potential (RESP) model perform in calculating conformational energies
of organic and biological molecules? J. Comput. Chem., 21(12):1049–1074, 2000.
[Web06]
Marcus Weber. Meshless methods in conformation dynamics. PhD thesis, Free University of Berlin, 2006.
[WG02]
Marcus Weber and Tobias Galliat. Characterization of transition states in conformational dynamics using fuzzy sets. Technical Report ZR-02-12, Konrad-Zuse-Zentrum
Berlin, 2002.
[YG04]
Wei Yuan Yang and Martin Gruebele. Detection-dependent kinetics as a probe of
folding landscape microstructure. J. Am. Chem. Soc., 126:7758–7759, 2004.
[YR06]
Ben Youngblood and Norbert O. Reich. Conformational transitions as determinants
of specificity for the DNA methyltransferase EcoRI. J. Biol. Chem., 281(37):26821–
26831, 2006.
[ZLCD04]
Wei Zhang, Hongxing Lei, Shibasish Chowdhury, and Yong Duan. Fs-21 peptides
can form both single helix and helix-turn-helix. J. Phys. Chem. B, 108:7479–7489,
2004.