When is a Potential Accurate Enough for
Structure Prediction?: Theory and Application to
a Random Heteropolymer Model of Protein
Folding
Joseph D. Bryngelson
Complex Systems Group (T-13), Theoretical Division
Los Alamos National Laboratory, Los Alamos, NM 87545, USA
October 27, 1992
Abstract

Attempts to predict molecular structure often try to minimize some potential function over a set of structures. Much effort has gone into creating potential functions and algorithms for minimizing these potential functions. This paper develops a formalism that addresses a complementary question: What are the accuracy requirements for a potential function that predicts molecular structure? The formalism is applied to a simple model of a protein structure potential. The results of this calculation show that for high accuracy predictions (~1 Å rms deviation) of a typical protein, the monomer-monomer interaction energies must be accurate to within five to fifteen percent. The paper closes with a discussion of the implications of these results for practical structure prediction.
1 Introduction
The theoretical prediction of the structure of a molecule or an assembly of molecules, such as a cluster, frequently involves the minimization of a potential function. Examples of this activity range from using sophisticated techniques of modern quantum chemistry to obtain high accuracy predictions of structures of small molecules in the gas phase, to using semi-empirical potentials of mean force to predict the structures of macromolecules in solution. Particularly for large molecules, much effort has gone into developing accurate potentials that require tractable amounts of computer time for their evaluation, and into developing efficient algorithms for finding the deepest minimum of these potentials. This paper addresses another aspect of structure prediction: the accuracy required of a potential that predicts molecular structure. If the potential function is not accurate enough, then the best minimization algorithm possible is still useless for predicting structure. However, if the accuracy requirements are known, then definite goals for potential creation exist, and researchers can concentrate on problems that are solvable with the present potentials. The formalism derived here is general, and applicable to any structure prediction problem. One of the most important unsolved problems in molecular structure prediction is the prediction of protein structure from amino acid sequence. Therefore, this paper then applies the formalism to a simple protein model to estimate the accuracy needed for a potential that predicts protein structure.
This paper will discuss the problem of predicting the full three-dimensional or tertiary structure of a protein. Most attempts to predict protein tertiary structure are, at least implicitly, based on the thermodynamic hypothesis, which states that a protein in solution folds to the configuration that minimizes the free energy of the protein plus solvent system. [1] Typically the solvent is water. This hypothesis suggests a general strategy for predicting protein tertiary structure from sequence. First, the researcher develops a semi-empirical potential function $V_{approx}(q, s)$, which approximates the free energy of the protein-solvent system as a function of $q$, the three-dimensional configuration of the protein, and $s$, the sequence of the protein. Henceforth the dependence on $s$ will be suppressed in my notation. Next, the researcher attempts to solve the problem by finding the configuration $q_{min}^{approx}$ that minimizes $V_{approx}(q)$. The configuration $q_{min}^{approx}$ is the predicted protein structure. Although the above general strategy has successfully predicted the structure of small polypeptides, [2] it has met with limited to non-existent success in predicting the structure of globular proteins. This failure has typically been attributed to the difficulty of finding the minimum of the potential functions, so a great deal of effort has gone into algorithms for optimizing these potential functions. This paper analyzes a complementary question, the potential accuracy question: How accurate must the potential function, $V_{approx}(q)$, be so that $q_{min}^{approx}$ is the correct structure?
The protein calculation presented here has a forerunner in the form of a paper by Shaknovitch and Gutin on the probability of a neutral mutation in a protein. [3] Their calculation is related to a special case of the calculation presented here. The present paper also makes explicit the mathematical approximations and the notion of structure implicit in the mutation paper. I have used some of the same notation as the mutation paper so the reader can easily compare it afterwards.
The following Section defines how the term "structure" is used in this paper, and poses the above question in a mathematically precise manner. Sections Three and Four formally solve the potential accuracy problem for some simple cases. A simple model of a protein potential function is described in Section Five and the results of Sections Three and Four are applied to this model in Section Six. The paper concludes with a critical discussion of these results, their implications for real protein structure prediction, and a look at some future directions. Readers not interested in technical details should skip Section Six.
2 Posing the Problem
Consider a representation of the arrangement of atoms in three-dimensional space. Coarse-grain this configuration space so there are a countable number of discrete states. If the representation of the arrangement of atoms was discrete to begin with, then there is no need to coarse-grain. When I refer to a "structure" or (for emphasis) "discrete structure" in this paper, I mean one of these discrete states. Each structure will be labeled with an integer. The value of the potential function for structure $i$ will be denoted $E_i$, and I will refer to it as the "energy" of structure $i$, even though it is really a free energy or the value of a potential of mean force. This loose terminology will prove convenient.
The notion of coarse-graining and discrete structure may become clearer with a simple example. Consider a small molecule with only one flexible bond, so the spatial arrangement of atoms can be specified by denoting a bond angle, $\theta$. To coarse-grain the bond angle, I could consider the molecule to be in structure one if $0^\circ \le \theta < 10^\circ$, structure two if $10^\circ \le \theta < 20^\circ$, and so on. The value of a potential function for each of these discrete structures must be determined by some fixed algorithm. To continue this example, if I am given a continuous potential $V^{cont}_{approx}(\theta)$, then I could define the value of this potential for structure one to be $E_1 = V^{cont}_{approx}(5^\circ)$, $E_2 = V^{cont}_{approx}(15^\circ)$, and so on. I can specify the spatial arrangement of atoms more or less accurately by making the discretization finer or coarser. The formalism discussed in this paper is independent of the manner of representing the arrangement of atoms and the details of the coarse-graining procedure, which may be chosen to suit the application at hand.
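To make the coarse-graining procedure above concrete, the following short sketch discretizes a single bond angle into $10^\circ$ bins and assigns each discrete structure the value of a continuous potential at the midpoint of its bin. The continuous torsional potential used here is a hypothetical stand-in; only the binning scheme follows the example in the text.

    import math

    def v_cont(theta_deg):
        """A hypothetical continuous bond-angle potential (arbitrary units)."""
        return math.cos(3.0 * math.radians(theta_deg))

    # Coarse-grain the bond angle into 10-degree bins: structure i covers
    # the interval [10*(i-1), 10*i) degrees and is assigned the value of the
    # continuous potential at the midpoint of its bin.
    bin_width = 10.0
    energies = {}
    for i in range(1, int(360.0 / bin_width) + 1):
        midpoint = bin_width * (i - 1) + bin_width / 2.0
        energies[i] = v_cont(midpoint)

    # The predicted structure is the discrete state with the lowest energy.
    predicted = min(energies, key=energies.get)
    print(predicted, energies[predicted])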
The approximate potential $V_{approx}(q)$ assigns a real number to each structure $q$. The structure with the lowest $V_{approx}(q)$ is labeled $q_{min}^{approx}$. The structure labeled $q_{min}^{approx}$ is the predicted structure. There is also some real, exact energy of the molecule or the molecule-solvent system, given by a potential $V_{real}(q)$. The potential $V_{real}(q)$ also assigns a real number to each of the structures. Denote by $q_{min}^{real}$ the discrete structure with the lowest value of $V_{real}(q)$. The structure $q_{min}^{approx}$ is the structure predicted by the approximate potential $V_{approx}(q)$; the structure $q_{min}^{real}$ is the structure found in nature, so if $q_{min}^{real} = q_{min}^{approx}$ then $V_{approx}(q)$ predicts the correct structure. The approximate potential function will in general have inaccuracies due to inaccuracies in parameters, neglect of physical effects, and so on. The errors in $V_{approx}(q)$ are denoted by

$$ \delta E(q) = V_{approx}(q) - V_{real}(q) \qquad (1) $$

and may be thought of as noise added to the true potential. Notice that $\delta E(q)$ is a function of each discrete state. Since the real potential $V_{real}(q)$ is unknown, the $\delta E(q)$ is also unknown. However, the knowledge used in constructing the potential can also be used to find a probability distribution for $\delta E$. The expected size of the inaccuracies in the approximate potential can usually be estimated. For example, one can estimate the size of the physical effects that are ignored, and the inaccuracies in the parameters. These estimates of inaccuracy can be interpreted to mean that the best one can really do is to give a probability distribution for the energy, parameter, etc. with a width given by the estimate of inaccuracy. The value of $\delta E$ is given by the sum of all of these inaccuracies, and if there are a large number of terms in this sum, which will typically be true for a large molecule potential, the central limit theorem implies that $p(\delta E)$, the probability distribution of $\delta E$, is given by a Gaussian whose mean and width are found from the inaccuracy estimates. Now the potential accuracy question can be put into a precise, mathematical form: Given $p(\delta E)$, what is the probability that $q_{min}^{real}$ and $q_{min}^{approx}$ are the same structure?

3 Deterministic Case
To illustrate the formal solutions to the potential accuracy question, I will start with the simplest examples and proceed to examples of greater complexity. After developing the formalism sufficiently, I will apply it to a simple model of a protein potential function.
The simplest problem is that of two structures. A scientist uses an approximate potential $V_{approx}(q)$ to calculate the energies, $E'_0$ and $E'_1$, for structures 0 and 1 respectively. Without loss of generality, I assume $E'_0 < E'_1$, so $q_{min}^{approx} = 0$. For concreteness, one could think of a spin in a solid that could point in one of only two directions, surely the simplest kind of "structure." These structures also have real energies $E_0$ and $E_1$, which are related to the approximate energies by

$$ E'_0 = E_0 + \delta E_0 , \qquad E'_1 = E_1 + \delta E_1 \qquad (2) $$

where the $\delta E_i$ are the errors in $V_{approx}(q)$ and the distribution of the $\delta E_i$ is $p(\delta E_i)$. For this simple case the potential accuracy question is: What is the probability, $R$, that $E_0 < E_1$? Note that $E_0 < E_1$ implies

$$ E'_1 - E'_0 > \Delta E \qquad (3) $$

where I have used the natural variable

$$ \Delta E \equiv \delta E_1 - \delta E_0 . \qquad (4) $$

If I denote the distribution of $\Delta E$ by $P(\Delta E)$, then $R$, the probability of predicting the correct structure, is given by

$$ R(E'_0, E'_1) = \int_{-\infty}^{E'_1 - E'_0} P(\Delta E)\, d(\Delta E) . \qquad (5) $$
To complete the formal solution to the two structure problem, $P(\Delta E)$ must be expressed in terms of $p(\delta E)$. In general there is no connection between these two distributions. For example, the errors in the potential function may be so closely correlated that $\delta E_0 \approx \delta E_1$, in which case $P(\Delta E)$ becomes, to a good approximation, a Dirac delta function at zero! Here I will only consider the simplest case, which must be solved before others. The simplest assumption is that the errors in the energy of one structure are independent of the errors in the energies of all the other structures. With this assumption, the distribution of $\Delta E$ becomes

$$ P(\Delta E) = \int_{-\infty}^{+\infty} p(\delta E_0)\, p(\Delta E + \delta E_0)\, d(\delta E_0) . \qquad (6) $$

Equations (5) and (6) are the desired solution of the two structure problem with the independent error assumption.
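For the special case of independent, zero-mean Gaussian errors with a common width $\sigma$, the convolution (6) is itself a Gaussian of width $\sigma\sqrt{2}$, and the integral (5) reduces to an error function. A minimal sketch of this evaluation is given below; the calculated energy gap and the error widths are illustrative numbers, not values from the text.

    import math

    def prob_correct_two_structures(gap, sigma):
        """Equations (5) and (6) for independent, zero-mean Gaussian errors of
        width sigma: P(Delta E) is Gaussian with width sigma*sqrt(2), so
        R = 0.5*(1 + erf(gap / (2*sigma))) with gap = E'_1 - E'_0."""
        return 0.5 * (1.0 + math.erf(gap / (2.0 * sigma)))

    # Illustrative numbers: a calculated gap of 1 energy unit, several error widths.
    for sigma in (0.1, 0.5, 1.0, 2.0):
        print(sigma, prob_correct_two_structures(1.0, sigma))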
Before proceeding further, it will do well to discuss consequences of the independent error assumption and of the correlations that are expected in real systems. In practice, the independent error assumption is probably a worst case, because the kinds of correlations that occur in the inaccuracies of real potential functions will tend to narrow the distribution $P(\Delta E)$. For example, perhaps the most important error correlation is due to the similarities of two structures. If two structures are similar, they will have many of the same interactions, e.g., the same hydrogen bonds, which are represented by the same (possibly inaccurate) terms in the potential function. The inaccuracies due to these common interactions are the same, so the difference between the errors in these two structures is due only to the interactions that the structures do not have in common. This effect will tend to narrow $P(\Delta E)$.
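The narrowing effect of shared interactions can be made concrete with a small simulation: give two structures $m$ interaction terms each, of which a fraction $f$ are common to both, draw an independent error for every distinct term, and compare the spread of $\Delta E$ with the fully independent case. The numbers below ($m$, $f$, and the per-term error width) are illustrative.

    import random
    import statistics

    m = 100      # interaction terms per structure (illustrative)
    f = 0.7      # fraction of terms shared by the two structures
    err = 0.1    # width of the error on a single interaction term

    random.seed(0)
    shared = int(f * m)

    deltas = []
    for _ in range(5000):
        common = sum(random.gauss(0.0, err) for _ in range(shared))
        e0 = common + sum(random.gauss(0.0, err) for _ in range(m - shared))
        e1 = common + sum(random.gauss(0.0, err) for _ in range(m - shared))
        # The shared terms cancel in Delta E = dE_1 - dE_0, narrowing P(Delta E).
        deltas.append(e1 - e0)

    print("correlated width :", statistics.pstdev(deltas))
    print("independent width:", (2 * m) ** 0.5 * err)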
The solution of the many structure problem is a straightforward generalization of the solution of the two structure problem presented above. Consider $\Omega$ different structures, labeled $0, 1, 2, \ldots, \Omega-1$. As before, the energies calculated from the approximate potential, $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$, are related to the real energies $E_0, E_1, E_2, \ldots, E_{\Omega-1}$ by

$$ E'_0 = E_0 + \delta E_0 , \quad E'_1 = E_1 + \delta E_1 , \quad \ldots , \quad E'_{\Omega-1} = E_{\Omega-1} + \delta E_{\Omega-1} \qquad (7) $$

where the $\delta E_i$ are errors in the potential $V_{approx}(q)$ and the distribution of the $\delta E_i$ is $p(\delta E_i)$. Once again I may assume without loss of generality that $E'_0 < E'_1, E'_0 < E'_2, \ldots, E'_0 < E'_{\Omega-1}$. What is the probability, $R$, that $E_0 < E_1, E_0 < E_2, \ldots, E_0 < E_{\Omega-1}$, i.e., that the 0 structure has the lowest energy for the real potential? Since $E_0 < E_i$ implies

$$ E'_i - E'_0 > \Delta E_i , \qquad (8) $$

where in analogy with the two structure problem I have defined

$$ \Delta E_i \equiv \delta E_i - \delta E_0 , \qquad (9) $$

the probability of $E_i > E_0$ is $g(E'_i - E'_0)$ where

$$ g(x) \equiv \int_{-\infty}^{x} P(\Delta E)\, d(\Delta E) \qquad (10) $$

and $P(\Delta E)$ is the distribution of the $\Delta E_i$ and is given by equation (6). Within the independent error assumption the probability that all of the inequalities in (8) are simultaneously true is the product of each of them being true, so the probability of predicting the right structure is

$$ R(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}) = \prod_{i=1}^{\Omega-1} g(E'_i - E'_0) . \qquad (11) $$

Equations (6), (10) and (11) are the desired solution of the many structure problem with the independent error assumption.
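Under the same Gaussian, independent-error assumption, each factor $g(E'_i - E'_0)$ in equation (11) is an error function, so the probability of predicting the right structure can be evaluated directly from a list of calculated energies. The sketch below does this; the example energies and error width are hypothetical.

    import math

    def g(x, sigma):
        """Equation (10) for zero-mean Gaussian errors of width sigma:
        P(Delta E) is Gaussian with width sigma*sqrt(2)."""
        return 0.5 * (1.0 + math.erf(x / (2.0 * sigma)))

    def prob_correct(calculated_energies, sigma):
        """Equation (11): probability that the structure with the lowest
        calculated energy is also lowest in the real potential."""
        ordered = sorted(calculated_energies)
        e0 = ordered[0]
        r = 1.0
        for e in ordered[1:]:
            r *= g(e - e0, sigma)
        return r

    # Hypothetical calculated energies for five discrete structures.
    energies = [-10.0, -9.2, -8.5, -7.0, -6.4]
    for sigma in (0.1, 0.5, 1.0):
        print(sigma, prob_correct(energies, sigma))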
4 Stochastic Case
The work in the above Section solves the potential accuracy problem for the case when the calculated energies of all of the structures are known. This is certainly not a realistic assumption for protein structure prediction, or indeed most problems where one wishes to predict the structure of a molecule. A reasonable alternative to the previous formulation is to assume that one knows, or can estimate, the density of states (structures) of the molecule, that is, the probability density $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ that a molecule has structures with calculated energies $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$. The function $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ may be calculated from an approximate model of the potential, and may also include information drawn from simulation or experiment. In the next Section $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ will represent the probability that a sequence, drawn at random from the ensemble of all possible sequences of $N$ amino acids, has structures with calculated energies $E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}$. When only a probability density is known, one can calculate $\bar R$, the average of $R$. This averaging yields

$$ \bar R = \Omega \int p(E'_0, E'_1, \ldots, E'_{\Omega-1}) \left[ \prod_{i=1}^{\Omega-1} g(E'_i - E'_0)\, \theta(E'_i - E'_0) \right] dE'_0\, dE'_1 \cdots dE'_{\Omega-1} \qquad (12) $$

where

$$ \theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 . \end{cases} \qquad (13) $$
Notice that the energies in the argument of $p$ are not arranged in any special order, and in particular, there is no requirement that $E'_0 < E'_i$ for $i > 0$. Therefore, in equation (12) the product of $\theta$-functions ensures that the structure labeled 0 is, indeed, the lowest energy structure, and there is a factor of $\Omega$ in front of the integral because the selection of the 0-labeled structure is arbitrary, as any of the $\Omega$ structures could be the lowest energy structure.
Equation (12) can be simplified for an important special case that occurs when $p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1})$ has a specific form. In this paper I will consider one such special case, namely

$$ p(E'_0, E'_1, E'_2, \ldots, E'_{\Omega-1}) = \prod_{i=0}^{\Omega-1} p(E'_i) . \qquad (14) $$

Equation (14) will hold when the calculated energies of the structures are random, independent variables distributed with probability density $p(E')$. Substituting equation (14) into equation (12) yields

$$ \bar R = \Omega \int_{-\infty}^{+\infty} p(E'_0) \left[ \int_{E'_0}^{+\infty} p(E')\, g(E' - E'_0)\, dE' \right]^{\Omega-1} dE'_0 . \qquad (15) $$
Equation (15) can be approximated with a steepest descent technique. Define

$$ \xi(E'_0) \equiv \int_{E'_0}^{+\infty} p(E')\, g(E' - E'_0)\, dE' \qquad (16) $$

and

$$ \mathcal{P}(E'_0) \equiv (\Omega / \bar R)\, p(E'_0)\, \xi(E'_0)^{\Omega-1} . \qquad (17) $$

Note that $\mathcal{P}(E'_0)$ is normalized so that

$$ \int_{-\infty}^{+\infty} \mathcal{P}(E'_0)\, dE'_0 = 1 . \qquad (18) $$
The identity

$$ \int_{-\infty}^{+\infty} \frac{d}{dE'_0}\left[ \xi(E'_0)^{\Omega} \right] dE'_0 = -1 , \qquad (19) $$

which follows from $\xi(-\infty) = 1$ and $\xi(+\infty) = 0$, can be rewritten as

$$ \bar R \int_{-\infty}^{+\infty} \frac{\xi'(E'_0)}{p(E'_0)}\, \mathcal{P}(E'_0)\, dE'_0 = -1 . \qquad (20) $$

In most relevant cases $\mathcal{P}(E'_0)$ will have a maximum that grows sharper as $\Omega$ becomes larger. Thus, for large $\Omega$, $\mathcal{P}(E'_0)$ is well approximated by a Dirac delta function at $\bar E_0$, the value of $E'_0$ that maximizes $\mathcal{P}(E'_0)$. In this approximation, equation (20) for $\bar R$ becomes

$$ \bar R \approx -\, \frac{p(\bar E_0)}{\xi'(\bar E_0)} . \qquad (21) $$
A useful expression for $\xi'(E'_0)$ is found by differentiating equations (16) and (10) for $\xi$ and $g$ respectively and changing the variable of integration in equation (16) to $\lambda \equiv E' - E'_0$ to find

$$ \xi'(E'_0) = -\, p(E'_0) \int_{-\infty}^{0} P(\lambda)\, d\lambda \; - \int_{0}^{+\infty} p(E'_0 + \lambda)\, P(\lambda)\, d\lambda . \qquad (22) $$

After substituting this expression for $\xi'$, the expression (21) for $\bar R$ can be put in the suggestive form

$$ \bar R = \frac{1}{1 + \epsilon(\bar E_0)} \qquad (23) $$

where

$$ \epsilon(\bar E_0) = \frac{1}{p(\bar E_0)} \int_{0}^{+\infty} \left[ p(\bar E_0 + \lambda) - p(\bar E_0) \right] P(\lambda)\, d\lambda . \qquad (24) $$
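As a consistency check of the stochastic formalism, equation (15) can be evaluated by straightforward numerical quadrature once $p(E')$ and $P(\Delta E)$ are specified. The sketch below does this for independently distributed Gaussian calculated energies and Gaussian errors; the widths, the number of structures $\Omega$, and the integration grid are illustrative assumptions, not values from the text.

    import numpy as np
    from math import erf, sqrt, pi

    sigma_state = 5.0    # width of the density of calculated energies p(E')
    sigma_err = 0.5      # width of the single-structure error density p(dE)
    omega = 1000         # number of discrete structures (illustrative)

    erf_vec = np.vectorize(erf)

    def trapz(y, x):
        return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

    def p_state(e):
        return np.exp(-e ** 2 / (2.0 * sigma_state ** 2)) / (sigma_state * sqrt(2.0 * pi))

    def g(x):
        # Equation (10) for Gaussian errors: P(Delta E) has width sigma_err*sqrt(2).
        return 0.5 * (1.0 + erf_vec(x / (2.0 * sigma_err)))

    grid = np.linspace(-40.0, 40.0, 2001)

    def xi(e0):
        # Equation (16), evaluated with the trapezoidal rule.
        e = grid[grid >= e0]
        return trapz(p_state(e) * g(e - e0), e)

    # Equation (15): R-bar = Omega * integral of p(E'_0) * xi(E'_0)^(Omega-1).
    vals = np.array([p_state(e0) * xi(e0) ** (omega - 1) for e0 in grid])
    print("R-bar =", omega * trapz(vals, grid))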
5 A Random Heteropolymer Model of a Protein Potential Function
The random heteropolymer model is the simplest model of a protein potential function. This model, with some significant extensions, was first proposed as a model of protein folding by Bryngelson and Wolynes, [4, 5] who solved it within a random energy approximation. Later, and independently, Shaknovitch and Gutin also proposed this model and solved it within a mean field approximation. [6, 7] Shaknovitch and Gutin showed that their mean field approximation was equivalent to the random energy approximation of Bryngelson and Wolynes, and were able to obtain further information. I will use the notation of Shaknovitch and Gutin. In the random heteropolymer model the energy, $E$, of a configuration is determined by the contacts between the amino acids, so if

$$ \Delta(i,j) = \begin{cases} 1 & \text{if amino acids } i \text{ and } j \text{ are in contact} \\ 0 & \text{otherwise} \end{cases} \qquad (25) $$

then the energy of state $q = \{\Delta(i,j)\}$ is given by

$$ E = \mathcal{H}(\{\Delta(i,j)\}) = \sum_{i<j}^{N} B_{i,j}\, \Delta(i,j) \qquad (26) $$

where there are $N$ amino acids in the protein. In equation (26) for the energy, the $B_{i,j}$ are the energies of contact between amino acids $i$ and $j$. In the random heteropolymer model, the $B_{i,j}$ are random, with probability distribution

$$ P_{random}(B_{i,j}) = \frac{1}{\sqrt{2\pi B^2}} \exp\left( -\frac{B_{i,j}^2}{2 B^2} \right) . \qquad (27) $$

The quantity $B$ in equation (27) sets the energy scale of the amino acid contact energies.¹

¹ The erudite reader may recall that in the Shaknovitch-Gutin paper the Gaussian for the $B_{i,j}$ was centered about a mean energy $B_0$, and the potential function included a three-body interaction term $C \sum \Delta(r_i, r_j)\, \Delta(r_k, r_j)$ which accounted for excluded volume. The values of $B_0$ and $C$ determine whether or not the protein molecule is collapsed. In this paper I am only interested in the relative energies of the different collapsed conformations. Changes in $B_0$ and $C$ would only add a constant energy to each collapsed conformation, and therefore can be ignored.
The model (26) distinguishes between configurations based on their contacts. Therefore in this model the discrete states $\{q\}$ are represented by their amino acid contacts. This representation is often referred to as the contact or distance map representation [8] of protein structure, and is useful in many applications.
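In the contact (distance) map representation mentioned above, a discrete structure is specified entirely by which residue pairs are in contact. A minimal sketch of building such a map from residue coordinates is shown below; the distance cutoff and the random "coordinates" are purely illustrative.

    import random

    random.seed(0)
    N = 20
    cutoff = 1.5   # contact distance cutoff (illustrative units)

    # Hypothetical residue coordinates; a real application would use model or
    # experimental coordinates instead.
    coords = [(random.uniform(0.0, 5.0), random.uniform(0.0, 5.0), random.uniform(0.0, 5.0))
              for _ in range(N)]

    def delta(i, j):
        """Delta(i, j) of equation (25): 1 if residues i and j are in contact."""
        d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        return 1 if d2 < cutoff ** 2 else 0

    contact_map = {(i, j): delta(i, j) for i in range(N) for j in range(i + 1, N)}
    print(sum(contact_map.values()), "contacts out of", len(contact_map), "pairs")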
For compact structures, the solution of the Shaknovitch-Gutin mean field theory for random heteropolymers has four properties that are important for the present calculation. These properties are a product of both the model and the mean field approximation, and therefore may not be entirely physical. In the conclusion I will discuss possible ways to check and improve the model and approximations. First, the total number of compact structures is

$$ \Omega = \nu^N \qquad (28) $$

where $\nu$ is the average number of conformations per amino acid residue in the compact phase. (Excluded volume effects are included in this counting.) Second, the energies are random, independent variables, so equation (14) holds. Third, the probability density that a structure has energy $E$ is

$$ p(E) = \frac{1}{\sqrt{\pi N z B^2}} \exp\left( -\frac{E^2}{N z B^2} \right) \qquad (29) $$

where $z$ is the average number of contacts each amino acid residue has with other amino acid residues. Fourth, the important low energy structures have few contacts, hence few interactions, in common. Therefore, as noted in Section 3, the independent error approximation is valid.
Inaccuracies in the pair interactions between amino acids are modeled by adding random noise to the contact energies, so the known approximate potential is

$$ E' = \mathcal{H}'(\{\Delta(i,j)\}) = \sum_{i<j}^{N} B'_{i,j}\, \Delta(i,j) \qquad (30) $$

where

$$ B'_{i,j} = B_{i,j} + \eta_{i,j} \qquad (31) $$

and the $\eta_{i,j}$ are random variables distributed with mean zero and standard deviation $\eta$. Since $\mathcal{H}$ and $\mathcal{H}'$ have the same form, they have the same properties. A scientist that uses an approximate potential function $\mathcal{H}'$ to calculate the energy of the state $\{\Delta(i,j)\}$ will err by an energy

$$ \delta E(\{\Delta(i,j)\}) = \sum_{i<j}^{N} \eta_{i,j}\, \Delta(i,j) . \qquad (32) $$

The quantity $\delta E$ is a sum of $\frac{1}{2} z N$ independent random variables, so by the central limit theorem, $\delta E$ is a random variable with probability density

$$ p(\delta E) = \frac{1}{\sqrt{\pi N z \eta^2}} \exp\left( -\frac{\delta E^2}{N z \eta^2} \right) . \qquad (33) $$
Substituting this equation (33) into equation (6) gives the probability density of $\Delta E$, the error in energy with respect to the predicted lowest energy structure,

$$ P(\Delta E) = \frac{1}{\sqrt{2\pi N z \eta^2}} \exp\left( -\frac{\Delta E^2}{2 N z \eta^2} \right) . \qquad (34) $$
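The error statistics in equations (32) and (33) are easy to reproduce by direct sampling: draw noise terms for a fixed set of contacts and compare the spread of the resulting $\delta E$ with the width $\sqrt{N z / 2}\,\eta$ implied by equation (33). The sketch below does this for an arbitrarily chosen random contact map; the values of $N$, $z$, and $\eta$ are illustrative.

    import random
    import statistics

    N = 100       # number of monomers (illustrative)
    z = 4         # average number of contacts per monomer (illustrative)
    eta = 0.2     # scale of the noise on the contact energies, equation (31)

    random.seed(1)
    n_contacts = N * z // 2

    # A fixed, arbitrarily chosen contact map with N*z/2 contacts.
    pairs = random.sample([(i, j) for i in range(N) for j in range(i + 1, N)],
                          n_contacts)

    samples = []
    for _ in range(2000):
        # Equation (32): the error in the state energy is the sum of the
        # noise terms eta_ij over the contacts of the structure.
        samples.append(sum(random.gauss(0.0, eta) for _ in pairs))

    print("sampled width of dE:", statistics.pstdev(samples))
    print("width from equation (33), sqrt(N*z/2)*eta:", (N * z / 2.0) ** 0.5 * eta)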
6 Accuracy Requirements for the Random Heteropolymer Model
At the level of accuracy of the Shaknovitch-Gutin mean field theory, equations (23) and (24) for the probability of predicting the correct structure are valid for the random heteropolymer model. The density of structures $p(E')$ and the distribution of errors in structure energies $P(\Delta E)$ are given in the previous Section, so it only remains to calculate $\bar E_0$ and simplify the resulting expressions.
For $E'_0 = \bar E_0$,

$$ \frac{d\mathcal{P}}{dE'_0} = 0 , \qquad (35) $$

so, after substituting (29) for $p(E')$ in equation (17) for $\mathcal{P}$ and differentiating, I obtain

$$ \frac{2\bar E_0}{N z B^2} = (\Omega - 1)\, \frac{\xi'(\bar E_0)}{\xi(\bar E_0)} . \qquad (36) $$

For all values of $E'_0$, $\xi'(E'_0) < 0$ and $\xi(E'_0) > 0$, so $\bar E_0 < 0$. Substituting equations (29) and (34) for $p(E')$ and $P(\Delta E)$ into equations (16) and (22) for $\xi(\bar E_0)$ and $\xi'(\bar E_0)$ yields

$$ \xi(\bar E_0) = 1 - \frac{1}{4}\, \mathrm{erfc}\!\left( \frac{|\bar E_0|}{(N z)^{1/2} B} \right) \qquad (37) $$

and

$$ \xi'(E'_0) = -\frac{1}{2}\, p(E'_0) \left( 1 + \left[ 1 + 2\left(\frac{\eta}{B}\right)^2 \right]^{-1/2} \exp\left[ \frac{2 (E'_0)^2 \eta^2}{N z B^2 (B^2 + 2\eta^2)} \right] \left\{ 1 + \mathrm{erf}\!\left[ \frac{2^{1/2}\, |E'_0|\, \eta}{(N z)^{1/2} B\, (B^2 + 2\eta^2)^{1/2}} \right] \right\} \right) \qquad (38) $$

for $E'_0 < 0$. Equations (36), (37) and (38) for $\bar E_0$ can be solved for the case of large $N$. First, define $\alpha$ so that

$$ \alpha \equiv \frac{-\bar E_0}{N z^{1/2} B} . \qquad (39) $$

I have shown that $\bar E_0 < 0$, so $\alpha > 0$. I assume that $\alpha$ is of order one and will show that this assumption is self-consistent. I will also assume that $\eta$ is at most the same order of magnitude as $B$, and quite possibly much smaller. This assumption covers all interesting cases, because if $\eta$ is much larger than $B$, then the "signal" (the $B_{i,j}$) is swamped by the "noise" (the $\eta_{i,j}$) so the probability of predicting the correct structure is essentially zero. With these assumptions, the asymptotic expansion of the error function complement for large argument gives

$$ \xi(\bar E_0) = 1 - \gamma\, \frac{e^{-N\alpha^2}}{N^{1/2} \alpha} \qquad (40) $$

where $\gamma$ is a constant of order one. Therefore, the equation for $\alpha$ becomes

$$ 4 \pi^{1/2} N^{1/2} \alpha = \nu^N e^{-N\alpha^2} \left( 1 + \left[ 1 + 2\left(\frac{\eta}{B}\right)^2 \right]^{-1/2} \exp\left( \frac{2 N \alpha^2 \eta^2}{B^2 + 2\eta^2} \right) \left\{ 1 + \mathrm{erf}\!\left[ \frac{(2N)^{1/2} \alpha \eta}{(B^2 + 2\eta^2)^{1/2}} \right] \right\} \right) . \qquad (41) $$
Equation (41) simplifies in two limits, $\sqrt{N}(\eta/B)$ small and $\sqrt{N}(\eta/B)$ large. The expansion of (41) for small $\sqrt{N}(\eta/B)$, to first order in $\sqrt{N}(\eta/B)$, is

$$ 2 \pi^{1/2} N^{1/2} \alpha = \nu^N e^{-N\alpha^2} \left[ 1 + \left(\frac{2}{\pi}\right)^{1/2} \alpha \left( \frac{N^{1/2}\eta}{B} \right) \right] \qquad (42) $$

and similarly for large $\sqrt{N}(\eta/B)$ the leading term in the asymptotic expansion yields

$$ 2 \pi^{1/2} N^{1/2} \alpha = \nu^N \left[ 1 + 2\left(\frac{\eta}{B}\right)^2 \right]^{-1/2} \exp\left( -\frac{N \alpha^2 B^2}{B^2 + 2\eta^2} \right) . \qquad (43) $$

Equations (42) and (43) can be solved for large $N$ by writing $\alpha$ as

$$ \alpha = \alpha_0 + \left( \frac{\log N}{N} \right) \alpha_1 + \left( \frac{1}{N} \right) \alpha_2 + \cdots \qquad (44) $$

and substituting this expression into the logarithms of the above equations for $\alpha$, which read

$$ N(\log\nu - \alpha^2) - \log\left( 2\pi^{1/2} N^{1/2} \alpha \right) + \log\left[ 1 + \left(\frac{2}{\pi}\right)^{1/2} \alpha \left( \frac{N^{1/2}\eta}{B} \right) \right] = 0 \qquad (45) $$

for small $\sqrt{N}(\eta/B)$ and

$$ N\left( \log\nu - \frac{\alpha^2 B^2}{B^2 + 2\eta^2} \right) - \log\left( 2\pi^{1/2} N^{1/2} \alpha \right) - \frac{1}{2} \log\left[ 1 + 2\left(\frac{\eta}{B}\right)^2 \right] = 0 \qquad (46) $$

for large $\sqrt{N}(\eta/B)$. In the large $N$ limit $(\log N)/\sqrt{N}$ goes to zero, so equations (45) and (46) can be solved to leading orders of $N$ by requiring the order $N$, order $\log N$ and order one terms each to vanish separately, to yield
$$ \alpha = (\log\nu)^{1/2} - \frac{\log N}{4 N (\log\nu)^{1/2}} + \frac{1}{4 N (\log\nu)^{1/2}} \log(4\pi \log\nu) - \frac{1}{2 N (\log\nu)^{1/2}} \log\left[ 1 + \frac{(\log\nu)^{1/2}}{2} \left( \frac{N^{1/2}\eta}{B} \right) \right] \qquad (47) $$

for small $\sqrt{N}(\eta/B)$ and
$$ \alpha = \left( \frac{B^2 + 2\eta^2}{B^2} \log\nu \right)^{1/2} - \frac{1}{4N} \left( \frac{B^2 + 2\eta^2}{B^2} \log\nu \right)^{-1/2} \log\left( \frac{4\pi N B^2 \log\nu}{B^2 + 2\eta^2} \right) \qquad (48) $$
for large $\sqrt{N}(\eta/B)$. The leading order terms in both equations (47) and (48) for $\alpha$ are of order $\sqrt{\log\nu}$. A typical estimate gives $\nu \approx 1.4$, [9] which makes $\alpha$ of order one, as promised. Notice that this conclusion still holds if the value of $\nu - 1$ is changed by a factor of ten.
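For instance, with the estimate $\nu \approx 1.4$ the leading term gives $\alpha \approx (\log 1.4)^{1/2} \approx 0.58$, and changing $\nu - 1$ by a factor of ten in either direction ($\nu \approx 1.04$ or $\nu \approx 5$) keeps this leading term between roughly 0.2 and 1.3, as the short check below confirms.

    import math

    # Leading term of equation (47): alpha ~ sqrt(log(nu)).
    for nu in (1.04, 1.4, 5.0):
        print(nu, math.sqrt(math.log(nu)))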
When $p(E')$ and $P(\Delta E)$ are given by (29) and (34) respectively and $\bar E_0 = -N \alpha z^{1/2} B$, then equation (24) for $\epsilon(\bar E_0)$ becomes

$$ \epsilon(\alpha) = \frac{1}{2} \left[ 1 + 2\left(\frac{\eta}{B}\right)^2 \right]^{-1/2} \exp\left( \frac{2 N \alpha^2 \eta^2}{B^2 + 2\eta^2} \right) \left[ 1 + \mathrm{erf}\!\left( \frac{(2N)^{1/2} \alpha \eta}{(B^2 + 2\eta^2)^{1/2}} \right) \right] - \frac{1}{2} . \qquad (49) $$
By inspection of equation (49), $\epsilon$ is small, and hence the probability of predicting the correct structure, $\bar R = 1/(1 + \epsilon)$, is close to one, only if $\sqrt{N}(\eta/B)$ is small. For this case the value of $\alpha$ is given by equation (47), so substituting this into the above expression for $\epsilon(\alpha)$ and expanding to first order in $\sqrt{N}(\eta/B)$ gives

$$ \epsilon = \left( \frac{2}{\pi} \log\nu \right)^{1/2} \left\{ 1 - \frac{\log[(4\pi \log\nu)/N]}{4 N \log\nu} \right\} \left( \frac{N^{1/2}\eta}{B} \right) + O\!\left( \frac{N \eta^2}{B^2} \right) . \qquad (50) $$
Therefore, if $\eta$ is small compared to $B/\sqrt{N}$, then, to first order in $\sqrt{N}(\eta/B)$, the probability of predicting the correct structure is

$$ \bar R = 1 - \left( \frac{2}{\pi} \log\nu \right)^{1/2} \left\{ 1 - \frac{\log[(4\pi \log\nu)/N]}{4 N \log\nu} \right\} \left( \frac{N^{1/2}\eta}{B} \right) \qquad (51) $$

which, for large $N$, is well approximated by

$$ \bar R = 1 - \left( \frac{2}{\pi} \log\nu \right)^{1/2} \left( \frac{N^{1/2}\eta}{B} \right) . \qquad (52) $$
For large $\sqrt{N}\eta/B$, $\epsilon \gg 1$ so $\bar R \approx 1/\epsilon$; therefore, to the leading term in the asymptotic expansion,

$$ \bar R = \left[ 1 + 2\left(\frac{\eta}{B}\right)^2 \right]^{1/2} \left( 4\pi N B^2 \log\nu \right)^{\eta^2/(B^2 + 2\eta^2)}\, \nu^{-2N\eta^2/B^2} . \qquad (53) $$

7 Conclusions
This paper has two principal purposes. First, the general formalism developed in Sections 2, 3, and 4 can potentially be applied to a wide variety of problems in chemical physics. The prerequisites for applying the formalism are fourfold. First, there must be a suitable coarse-graining procedure to obtain a finite number of discrete structures. As I have previously mentioned, this coarse-graining procedure can be tailored to the problem at hand, and one can specify the structure as accurately as desired by making the discretization finer. Second, the solution of the structure prediction problem must be put into the form of a solution to a minimization problem, typically not a stringent requirement. Third, something must be known about the sources of inaccuracy. More specifically, the different possible kinds of inaccuracies must be known and the typical size of each of these inaccuracies must also be known. With this knowledge, one can then usually use the central limit theorem to obtain the probability density for the total inaccuracy in energy. Fourth, there must be some information about the distribution of the calculated energies of the structures. In practice, only the calculated energies of the few structures with lowest calculated energies need be known, because all of the other structures have a negligible probability of being the real lowest energy structure. For large molecules where even these calculations are infeasible, the methods of Section 4 can be used if one can estimate the distribution of calculated energies of structures, using, for example, a simple model like the one described in Section 5. The main assumption in deriving the equations for the probability of predicting the correct structure, equations (5), (11), (12) and (23), is the independent error assumption, which is a worst case for most problems. The derivation of equation (23) also used the independent energy assumption. Extensions of this formalism that lessen these assumptions are under way.
The second purpose of this paper is to report a result for the needed accuracy of a potential that predicts protein structure. The major result of this investigation is equation (52), which states that the probability of predicting the correct structure is given by

$$ \mathrm{probability} = 1 - k \left( \frac{N^{1/2}\eta}{B} \right) \qquad (54) $$

where $B$ is the scale of the monomer-monomer interaction energies, $\eta$ is the scale of the inaccuracy of these interaction energies, $N$ is the number of monomers, and $k$ is a constant of order one. Equation (54) was derived from equation (23) and therefore is based on the independent error and independent energy assumptions. I noted in Section 5 that Shaknovitch and Gutin have shown that, if one models a protein as a random heteropolymer, then these assumptions are correct at the level of accuracy of a mean field theory. Equation (54) implies that, if a potential function is to predict the correct structure, the monomer-monomer interaction energies must have a proportional error of less than $1/\sqrt{N}$. For a globular protein $N$ will typically be between 50 and 400, so the required accuracy in the monomer-monomer interactions is about five to fifteen percent. It is important to note that this result is the accuracy required for getting all of the monomer-monomer contacts right, that is, predicting the entire contact map with perfect accuracy, a stringent requirement for a potential function. Proteins with 60 or more percent of correct contacts are usually considered to be structurally homologous. Therefore, the protein calculation should be extended to calculate the probability of predicting a structure with a specified fraction of correct contacts. This extension will require that the formalism and the model be improved. The formalism must be extended so that it can be used to calculate the probability of predicting one of many, rather than just one, low energy states. The model must be extended so that the low energy states of the model have contacts in common. The model can be extended in two ways. First, the statistical mechanics of the model potential function could be solved in an approximation that is more accurate than the Shaknovitch-Gutin mean field theory. Some progress has already been made in this direction (Silvio Franz, private communication). Second, the model potential function could be extended by incorporating new effects that are alleged to be important in protein folding, such as the principle of minimal frustration. [4, 5, 10, 11]
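Because the random heteropolymer result rests on a random energy picture with independent energies and independent errors, equation (54) can also be checked by direct simulation: draw $\Omega = \nu^N$ independent true energies from equation (29), add independent errors of the width implied by equation (33), and count how often the structure with the lowest calculated energy is also the true minimum. The sketch below does this; the values of $N$, $z$, $\nu$, and $\eta/B$ are illustrative, $N$ is kept small so that $\nu^N$ structures can be enumerated, and for such a small chain the agreement with the leading-order estimate (52) is only rough.

    import math
    import random

    N, z, nu, B = 25, 4, 1.4, 1.0                 # illustrative model parameters
    omega = int(round(nu ** N))                   # number of compact structures, eq. (28)
    sigma_state = math.sqrt(N * z / 2.0) * B      # width of p(E), eq. (29)

    random.seed(0)

    def estimate_r(eta, trials=200):
        sigma_err = math.sqrt(N * z / 2.0) * eta  # width of p(dE), eq. (33)
        hits = 0
        for _ in range(trials):
            true_e = [random.gauss(0.0, sigma_state) for _ in range(omega)]
            calc_e = [e + random.gauss(0.0, sigma_err) for e in true_e]
            i_true = min(range(omega), key=true_e.__getitem__)
            i_calc = min(range(omega), key=calc_e.__getitem__)
            hits += (i_true == i_calc)
        return hits / trials

    for eta in (0.0, 0.02, 0.05):
        leading = 1.0 - math.sqrt(2.0 * math.log(nu) / math.pi) * math.sqrt(N) * eta / B
        print(eta, estimate_r(eta), leading)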
8 Acknowledgements
This work was done under the auspices of the U.S. Department of Energy through the Los Alamos National Laboratory. I wish to thank the Los Alamos National Laboratory and the Department of Energy for generous support of this work. I also wish to thank the Center for Non-Linear Studies and the Santa Fe Institute for their generous hospitality. The work described in this paper owes much to the help and encouragement of many people. In particular I would like to thank Drs. Henrik Bohr, Ken Dill, Walter Fontana, Silvio Franz, John Hopfield, Giulia Iori, Alan Lapedes, Jiri Novotny, Jose Nelson Onuchic, Peter Leopold, Lawrence Pratt, Jeff Skolnick, Paul Stolorz, James Theiler, Miguel Virasoro, David Wolpert and Peter Wolynes, and Miss Anne Keegan for listening to my ideas and for good advice concerning this work.
References
[1] C. B. Anfinsen, "Principles that Govern the Folding of Protein Chains," Science, 181, 223-230 (1973).

[2] Z. Li and H. A. Scheraga, "Monte Carlo-minimization approach to the multiple-minimum problem in protein folding," Proc. Natl. Acad. Sci. USA, 84, 6611-6615 (1987).

[3] E. I. Shaknovitch and A. M. Gutin, "Influence of Point Mutations on Protein Structure: Probability of a Neutral Mutation," J. Theor. Biol., 149, 537-546 (1991).

[4] J. D. Bryngelson and P. G. Wolynes, "Spin glasses and the statistical mechanics of protein folding," Proc. Natl. Acad. Sci. USA, 84, 7524-7528 (1987).

[5] J. D. Bryngelson and P. G. Wolynes, "A Simple Statistical Field Theory of Heteropolymer Collapse with Application to Protein Folding," Biopolymers, 30, 177-188 (1990).

[6] E. I. Shaknovitch and A. M. Gutin, "Formation of unique structure in polypeptide chains: Theoretical investigation with the aid of a replica approach," Biophysical Chemistry, 34, 187-199 (1989).

[7] E. I. Shaknovitch and A. M. Gutin, "Frozen states of a disordered globular heteropolymer," J. Phys. A: Math. Gen., 22, 1647-1659 (1989).

[8] T. E. Creighton, Proteins, W. H. Freeman and Company, New York, 1984, p. 231.

[9] K. A. Dill, "Theory for the Folding and Stability of Globular Proteins," Biochemistry, 24, 1501-1509 (1984).

[10] J. D. Bryngelson and P. G. Wolynes, "Intermediates and Barrier Crossing in a Random Energy Model (with Applications to Protein Folding)," J. Phys. Chem., 93, 6902-6915 (1989).

[11] P. E. Leopold, M. Montal, and J. N. Onuchic, "Protein folding funnels: A kinetic approach to the sequence-structure relationship," Proc. Natl. Acad. Sci. USA, 89, 8721-8725 (1992).