Modeling molecular dynamics from
simulations
Nina Singhal Hinrichs
Departments of Computer Science and Statistics
University of Chicago
January 28, 2009
Motivation
• Proteins are essential parts of living
organisms
– enzymes, cell signaling, membrane
transport . . .
• Composed of chain of amino acids
• Fold to unique 3-dimensional
structure
• Misfolding can cause diseases
– Alzheimer’s, Mad cow, Huntington’s . . .
• How do proteins fold?
Molecular dynamics
• Represent atoms of molecule
and solvent
• Model forces on atoms
• Integrate laws of motion
• Small integration time step
compared to motion
timescales
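The "integrate laws of motion" step can be sketched as follows. This is only an illustration: the slide names neither an integrator nor a force field, so the velocity Verlet scheme and the one-dimensional harmonic spring (`k`, `m`, `dt`) are assumptions chosen for brevity.

```python
import math

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations for one particle in one dimension."""
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # acceleration at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update, averaged acceleration
        a = a_new
        trajectory.append((x, v))
    return trajectory

# Toy system (assumption): harmonic spring with k = 1 and m = 1, period 2*pi.
k, m = 1.0, 1.0
period = 2.0 * math.pi
dt = period / 1000.0   # time step much smaller than the motion timescale

def energy(x, v):
    return 0.5 * m * v * v + 0.5 * k * x * x

traj = velocity_verlet(1.0, 0.0, lambda x: -k * x, m, dt, 1000)
x_end, v_end = traj[-1]
```

With a time step this small relative to the period, the total energy stays close to its initial value of 0.5 and the particle returns near its starting position after one period.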
Folding@Home: Distributed computing for
biomolecular simulation
• Perform multiple
simulations in
parallel
• Total simulation times – hundreds of microseconds
(hundreds of CPU years)
Very powerful computational resource
– ~200 Teraflops sustained performance
– >1,000,000 total CPUs; 200,000 active
Challenge: How to analyze?
• Enormous datasets
– Describe dynamics in microscopic detail
• Questions we want to answer
– Rate of folding, mechanism of folding . . .
• How can we extract these properties from
our data?
Outline
• Markovian state model for molecular motion
– Model description, uses, examples
• New algorithms for building these models
– Defining states and transition probabilities
• New methods for dealing with finite sampling
– Model complexity, uncertainty analysis, targeted
sampling
Chemical intuition
Chemical reactions often exhibit stochastic behavior
n-butane
Chandler, Journal of Chemical Physics (1977)
Markovian state model
Define states in the conformation space
[Figure: conformation space divided into states 1-5]
Define transition probabilities, or edges, between states
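[Figure: graph of states 1-5 connected by edges]
$$
P = \begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1N} \\
p_{21} & p_{22} & \cdots & p_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
p_{N1} & p_{N2} & \cdots & p_{NN}
\end{pmatrix}
$$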
Uses of the model
• Populations of states over time
[Figure: state populations p versus time t. Chodera et al., Multiscale
Modeling and Simulation (2006)]
• Eigenvalues and eigenvectors – conformational changes
• Kinetic properties – virtually any kinetic property
• Mechanistic properties – most likely path,
probability of transitions as graph algorithms
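Two of these uses can be sketched in a few lines: populations over time follow from repeated multiplication by the transition matrix, and the eigenvalues give relaxation timescales. The 3-state matrix `P` below is hypothetical, and NumPy is an assumed tool, not something named in the talk.

```python
import numpy as np

# Hypothetical 3-state row-stochastic transition matrix for one lag time.
P = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.85, 0.05],
              [0.02, 0.08, 0.90]])

def propagate(p0, P, n_steps):
    """Populations after 0..n_steps lag times, using p(t+1) = p(t) P."""
    populations = [np.asarray(p0, dtype=float)]
    for _ in range(n_steps):
        populations.append(populations[-1] @ P)
    return np.array(populations)

pops = propagate([1.0, 0.0, 0.0], P, 200)

# Left eigenvectors of P are right eigenvectors of P.T.
eigvals, eigvecs = np.linalg.eig(P.T)
order = np.argsort(-eigvals.real)
eigvals = eigvals.real[order]

# The eigenvector with eigenvalue 1 is the equilibrium distribution.
stationary = eigvecs.real[:, order[0]]
stationary = stationary / stationary.sum()

# Relaxation timescales of the remaining modes, in units of the lag time.
timescales = -1.0 / np.log(eigvals[1:])
```

Starting from all population in state 1, `pops` relaxes to `stationary`; the slowest timescale corresponds to the slowest conformational change in this toy model.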
Example models
• alanine peptide – Chodera et al., Multiscale Modeling and Simulation (2006)
• lipid vesicle fusion – Kasson et al., PNAS (2006)
• alpha helix – Sorin and Pande, Biophysical Journal (2005)
• villin headpiece – Jayachandran et al., Journal of Structural Biology (2006)
Computational and statistical
challenges
• Building Markovian state model
– Defining states that are Markovian
– Calculating the transition probabilities
• Refining Markovian state model
– Finding the best model
– Determining model uncertainty
– Designing new simulations
[Figure: five-state model and its transition probability matrix]
Automatic state decomposition
• Building Markovian State Model
– Defining states that are Markovian
– Calculating the transition probabilities
• Challenge: Find appropriate states
• Using individual conformations as states does not scale
• Group conformations into discrete states
• Structural clustering is insufficient
• Basic algorithm – combine structural and kinetic
similarity
J. D. Chodera*, N. Singhal*, V. S. Pande, K. A. Dill, and W. C. Swope. Automatic discovery of metastable states
for the construction of Markov models of macromolecular conformational dynamics. Journal of Chemical Physics,
126, 155101 (2007). (*These authors contributed equally to this work)
Comparison of structural and kinetic
clustering
[Figure: trpzip2 (Cochran et al. PNAS 98:5578, 2001), structural clustering versus kinetic clustering]
State decomposition – splitting
Cluster conformations by root
mean square distance (RMSD)
State decomposition – lumping
Group states that interconvert quickly
State decomposition – resplitting
Cluster conformations, restricted
to each state
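The splitting step can be illustrated with a small sketch. The slides do not say which clustering algorithm is used, so greedy k-centers is an assumption here, and `rmsd` skips the structural superposition that a real RMSD calculation would perform first. The toy two-atom conformations are made up.

```python
import math

def rmsd(a, b):
    """RMSD between two conformations given as equal-length lists of (x, y, z).
    Assumes the conformations are already superimposed (no alignment is done)."""
    total = sum((p - q) ** 2 for u, w in zip(a, b) for p, q in zip(u, w))
    return math.sqrt(total / len(a))

def k_centers(conformations, k):
    """Greedy k-centers: repeatedly promote the conformation farthest from all
    current centers, then reassign conformations that are closer to it."""
    centers = [0]
    dist = [rmsd(c, conformations[0]) for c in conformations]
    assignment = [0] * len(conformations)
    while len(centers) < k:
        new = max(range(len(conformations)), key=dist.__getitem__)
        centers.append(new)
        for i, c in enumerate(conformations):
            d = rmsd(c, conformations[new])
            if d < dist[i]:
                dist[i] = d
                assignment[i] = len(centers) - 1
    return centers, assignment

# Toy data (hypothetical): four two-atom conformations in two obvious groups.
conformations = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)],
    [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
    [(5.0, 0.1, 0.0), (6.0, 0.0, 0.0)],
]
centers, assignment = k_centers(conformations, 2)
```

Lumping would then merge the resulting clusters by how quickly they interconvert, and resplitting would rerun the clustering inside each lumped state.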
Blocked alanine peptide
[Figure: states 1-6 of the blocked alanine peptide in the (φ, ψ) torsion plane, axis ticks at -60 and 60. Chodera et al., Multiscale Modeling and Simulation (2006)]
Automatic state decomposition of
alanine peptide
Black state sits on top of
multiple other states!
These conformations had
an unusual peptide bond
[Figure: automatic state decomposition shown in the (φ, ψ) plane]
Benefit of automatic
algorithm
Stability of decomposition
TrpZip peptide
Transition probabilities
• Building Markovian State Model
– Defining states that are Markovian
– Calculating the transition probabilities
Discretize trajectories into series of states
[Figure: a trajectory passing through states 1-5, discretized as 1223435]
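Count number of transitions between all pairs of states
$$
\underbrace{\begin{pmatrix}
z_{11} & z_{12} & \cdots & z_{1N} \\
z_{21} & z_{22} & \cdots & z_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
z_{N1} & z_{N2} & \cdots & z_{NN}
\end{pmatrix}}_{\text{transition counts}}
\;\xrightarrow{\text{normalize}}\;
\underbrace{\begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1N} \\
p_{21} & p_{22} & \cdots & p_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
p_{N1} & p_{N2} & \cdots & p_{NN}
\end{pmatrix}}_{\text{transition probabilities}}
$$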
N. Singhal, C. D. Snow, and V. S. Pande. Using path sampling to build better Markovian state models: Predicting
the folding rate and mechanism of a trp zipper beta hairpin. Journal of Chemical Physics, 121(1), 415-425 (2004).
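A minimal sketch of the count-and-normalize procedure, applied to the example trajectory 1223435 from this slide. The function names are illustrative, and leaving a never-exited state as a zero row is a choice made here, not something the slides specify.

```python
def count_transitions(trajectory, n_states):
    """Count transitions between consecutive frames; states are numbered from 1."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(trajectory, trajectory[1:]):
        counts[a - 1][b - 1] += 1
    return counts

def normalize_rows(counts):
    """Divide each row by its sum: p_ij = z_ij / sum_k z_ik.
    A state with no outgoing transitions is left as a row of zeros."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([z / total if total else 0.0 for z in row])
    return probs

trajectory = [int(c) for c in "1223435"]
counts = count_transitions(trajectory, 5)
probs = normalize_rows(counts)
```

For this trajectory state 2 is followed once by 2 and once by 3, so its row of `probs` is split evenly between those two states, while state 5 is only visited at the end and has no outgoing counts.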
Model selection
• Refining Markovian State Model
– Finding the best model
– Determining model uncertainty
– Designing new simulations
• Challenge: How many states should we
have?
– More states are more Markovian
– More states have more parameters
• How do we evaluate this tradeoff?
N. S. Hinrichs and V. S. Pande. Bayesian metrics for validating and improving Markovian state models for
molecular dynamics simulations. (In preparation)
Hidden Markov Model formulation
• Formulate the problem as a Hidden Markov Model
structure scoring question
[Figure: hidden Markov model with a layer of states and a layer of observations]
• Different discretizations of continuous space
• Benefits of Bayesian scores
– Naturally handles tradeoff between complexity of model and
amount of data
– Avoids over-fitting of parameters
Alanine peptide results
Score of Hidden Markov
models for different lag
times
Last model is worse at
shorter times but
preferred at longer times
No previous evaluation
methods could
distinguish these models
Uncertainty analysis
• Refining Markovian State Model
– Finding the best model
– Determining model uncertainty
– Designing new simulations
Goal: Once we have the states, what is the uncertainty in the
model?
Uncertainty caused by finite sampling
[Figure: two graphs over states 1-5]
Both are reasonable but give different transition
probabilities
⇒ Different MFPT, Pfold, eigenvalues, eigenvectors ...
N. Singhal and V. S. Pande. Error analysis and efficient sampling in Markovian state models for protein
folding. Journal of Chemical Physics, 123, 204909-204921 (2005).
N. S. Hinrichs and V. S. Pande. Calculation of the distribution of eigenvalues and eigenvectors in Markovian
state models for molecular dynamics. Journal of Chemical Physics, 126, 244101 (2007).
Transition probabilities
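Recall that we calculate transition probabilities by counting:
$$
p_{ij} = \frac{z_{ij}}{\sum_k z_{ik}}
$$
[Figure: transitions from state i to states j and k, once with counts 700 and 300, once with counts 70 and 30]
Instead of getting a single value,
we can talk about the distribution
of transition probabilities
Bayes’ Rule:
$$
P(p_{i*} \mid \text{counts}) \propto P(\text{counts} \mid p_{i*})\, P(p_{i*})
$$
[Figure: distribution of p_ij]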
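To make the 700/300 versus 70/30 comparison concrete, here is a small sketch of the posterior for one row of counts. It assumes a symmetric Dirichlet prior (uniform when `prior=1.0`), which is the conjugate choice for multinomial counts; the slide does not state which prior is used.

```python
import math

def dirichlet_posterior(counts, prior=1.0):
    """Posterior mean and standard deviation of each p_ij for one row of counts,
    assuming a symmetric Dirichlet(prior) prior on the row."""
    alpha = [z + prior for z in counts]
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    std = [math.sqrt(a * (a0 - a) / (a0 * a0 * (a0 + 1.0))) for a in alpha]
    return mean, std

mean_big, std_big = dirichlet_posterior([700, 300])
mean_small, std_small = dirichlet_posterior([70, 30])
```

Both rows give a posterior mean close to 0.7 for the first transition, but the row with ten times fewer counts has a posterior roughly three times wider.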
Sampling approach
Possible solution to get distribution of eigenvalues:
[pij] → solve for eigenvalue → λ
[pij] → solve for eigenvalue → λ
[pij] → solve for eigenvalue → λ
Problem:
• sampling can be expensive
• solving per sample can be expensive
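A sketch of the sampling approach under stated assumptions: each row of the transition matrix is drawn from an independent Dirichlet posterior (uniform prior), and the eigenvalue of interest is taken to be the second largest. The count matrix is hypothetical, and NumPy is an assumed tool.

```python
import numpy as np

def sample_second_eigenvalue(counts, n_samples=500, prior=1.0, seed=0):
    """Draw transition matrices from row-wise Dirichlet posteriors and
    return the second-largest eigenvalue (by real part) of each draw."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(counts, dtype=float) + prior
    samples = np.empty(n_samples)
    for s in range(n_samples):
        P = np.vstack([rng.dirichlet(row) for row in alpha])
        # One full eigenvalue solve per sampled matrix.
        real_parts = np.sort(np.linalg.eigvals(P).real)
        samples[s] = real_parts[-2]
    return samples

# Hypothetical 3-state count matrix, and the same matrix with 10x the data.
counts = np.array([[90, 8, 2],
                   [10, 85, 5],
                   [1, 9, 90]])
few = sample_second_eigenvalue(counts)
many = sample_second_eigenvalue(10 * counts)
```

The spread of `many` is narrower than that of `few`, and the loop makes the cost visible: every sample requires both a set of Dirichlet draws and an eigenvalue solve.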
Closed-form solution
Idea: trade exact distribution for efficient approximation
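Eigenvalue equation:
$$
\det(\underbrace{P - \lambda I}_{A}) = 0
$$
Efficient to calculate using adjoint systems
Taylor series expansion:
$$
\lambda \approx \bar{\lambda}
+ \left.\frac{\partial \lambda}{\partial p_{11}}\right|_{A} \Delta p_{11}
+ \left.\frac{\partial \lambda}{\partial p_{12}}\right|_{A} \Delta p_{12}
+ \cdots
+ \left.\frac{\partial \lambda}{\partial p_{NN}}\right|_{A} \Delta p_{NN}
$$
where $\bar{\lambda}$ is the eigenvalue at the expansion point and $\Delta p_{ij}$ is the deviation of $p_{ij}$ from it.
Multivariate normal approximation of $p_{i*}$
⇒ Closed-form normal distribution for $\lambda$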
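The first-order idea can be sketched as below. Note the swap: instead of the adjoint-system calculation named on the slide, this sketch uses the standard eigenvector sensitivity formula for a simple eigenvalue, dλ/dp_ij = u_i v_j with left and right eigenvectors scaled so that u · v = 1. The row covariances are those of independent Dirichlet posteriors with a uniform prior, which is also an assumption, as is the hypothetical count matrix.

```python
import numpy as np

def eigenvalue_variance(counts, index=1, prior=1.0):
    """First-order estimate of the mean and variance of one eigenvalue,
    with per-state variance contributions. Assumes real, distinct eigenvalues."""
    alpha = np.asarray(counts, dtype=float) + prior
    a0 = alpha.sum(axis=1)
    P = alpha / a0[:, None]                 # posterior mean transition matrix

    w, V = np.linalg.eig(P)                 # columns of V: right eigenvectors
    k = np.argsort(-w.real)[index]
    lam = w[k].real
    v = V[:, k].real
    u = np.linalg.inv(V)[k, :].real         # matching left eigenvector, u . v = 1

    grad = np.outer(u, v)                   # grad[i, j] = d(lambda) / d(p_ij)

    per_state = np.empty(len(P))
    for i in range(len(P)):
        # Covariance of row i under its Dirichlet posterior.
        cov = (np.diag(P[i]) - np.outer(P[i], P[i])) / (a0[i] + 1.0)
        per_state[i] = grad[i] @ cov @ grad[i]
    return lam, per_state.sum(), per_state

# Hypothetical 3-state count matrix with 1000 transitions observed per state.
counts = 10 * np.array([[90, 8, 2],
                        [10, 85, 5],
                        [1, 9, 90]])
lam, var, per_state = eigenvalue_variance(counts)
```

A single eigendecomposition replaces the many solves of the sampling approach, and the variance comes out as a sum of one term per state.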
Uncertainty results
5000 trajectories from each state
[Figure: alanine system, states 1-6]
Transition counts:
$$
\begin{pmatrix}
4380 & 153 & 15 & 2 & 0 & 0 \\
211 & 4788 & 1 & 0 & 0 & 0 \\
169 & 1 & 4604 & 226 & 0 & 0 \\
3 & 13 & 158 & 4823 & 3 & 0 \\
0 & 0 & 0 & 4 & 4978 & 18 \\
7 & 5 & 0 & 0 & 62 & 4926
\end{pmatrix}
$$
Running times
                  6 states          87 states
Sampling-based:   40 seconds        3600 seconds
Closed-form:      < 0.01 seconds    < 0.07 seconds
Sampling strategies
• Refining Markovian State Model
– Finding the best model
– Determining model uncertainty
– Designing new simulations
Problem: Simulations are expensive. Even with
Folding@Home, we run simulations for months
How to intelligently allocate our resources?
Common approaches:
• equilibrium sampling – sample each conformation from
its equilibrium distribution
• even sampling – sample equally from each state
New sequential approaches
N. Singhal and V. S. Pande. Error analysis and efficient sampling in Markovian state models for protein
folding. Journal of Chemical Physics, 123, 204909-204921 (2005).
N. S. Hinrichs and V. S. Pande. Calculation of the distribution of eigenvalues and eigenvectors in Markovian
state models for molecular dynamics. Journal of Chemical Physics, 126, 244101 (2007).
Adaptive sampling
Goal: Reduce uncertainty of eigenvalue
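Uncertainty analysis decomposes by transitions from each state
$$
\begin{aligned}
\lambda \approx \bar{\lambda}
&+ \left[ \left.\frac{\partial \lambda}{\partial p_{11}}\right|_{A} \Delta p_{11} + \cdots + \left.\frac{\partial \lambda}{\partial p_{1N}}\right|_{A} \Delta p_{1N} \right] \\
&+ \left[ \left.\frac{\partial \lambda}{\partial p_{21}}\right|_{A} \Delta p_{21} + \cdots + \left.\frac{\partial \lambda}{\partial p_{2N}}\right|_{A} \Delta p_{2N} \right] \\
&+ \cdots \\
&+ \left[ \left.\frac{\partial \lambda}{\partial p_{N1}}\right|_{A} \Delta p_{N1} + \cdots + \left.\frac{\partial \lambda}{\partial p_{NN}}\right|_{A} \Delta p_{NN} \right]
\end{aligned}
$$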
Variance depends on both uncertainty of and sensitivity to
transition probabilities
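One plausible way to turn this decomposition into a sampling rule is sketched below. It assumes that state i's variance contribution scales as 1 / (w_i + 1), where w_i is the number of transitions already observed from state i, so that one more trajectory from that state is expected to shrink its term by the factor (w_i + 1) / (w_i + 2). The function name and the numbers are hypothetical.

```python
def choose_next_state(per_state_variance, totals):
    """Pick the state whose next simulation is expected to cut the variance most.
    Expected gain for state i: var_i - var_i * (w_i + 1) / (w_i + 2) = var_i / (w_i + 2)."""
    gains = [var / (w + 2.0) for var, w in zip(per_state_variance, totals)]
    best = max(range(len(gains)), key=gains.__getitem__)
    return best, gains

# Hypothetical per-state variance contributions and observed transition totals.
per_state_variance = [4e-6, 1e-6, 9e-6]
totals = [1000, 100, 5000]
best, gains = choose_next_state(per_state_variance, totals)
```

In this toy case the third state contributes the most variance, but the second state is selected because it has been sampled so little that one more trajectory there is expected to help more.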
Adaptive sampling – alanine
On 6-state alanine system,
select trajectories randomly
for 3 sampling strategies
Transition counts:
$$
\begin{pmatrix}
4380 & 153 & 15 & 2 & 0 & 0 \\
211 & 4788 & 1 & 0 & 0 & 0 \\
169 & 1 & 4604 & 226 & 0 & 0 \\
3 & 13 & 158 & 4823 & 3 & 0 \\
0 & 0 & 0 & 4 & 4978 & 18 \\
7 & 5 & 0 & 0 & 62 & 4926
\end{pmatrix}
$$
Adaptive sampling – villin
2454 states
• Benefits
– Very quickly reduce the
variance
– Reduce the total number of
simulations
– Need less computational
power
– Can study more complex
systems
Villin Headpiece
Jayachandran, et al.,
Journal of Chemical Physics (2006)
Summary
• Markovian state models are convenient
methods to describe molecular motion
• Automatic state decomposition
– Scalable to large systems
• Model selection
– Evaluate tradeoff between model complexity and
amount of data
• Uncertainty analysis
– Efficient and decomposable
• Adaptive sampling
– Reduce number of simulations
Acknowledgements
• Vijay Pande – Stanford University adviser
• Bill Swope, Jed Pitera – IBM collaborators
• John Chodera – state decomposition work