“Sequence space soup” of proteins and copolymers
Hue Sun Chan and Ken A. Dill
Department of Pharmaceutical Chemistry, University of California, San Francisco, California 94143
(Received 1 August 1990; accepted 20 May 1991)
To study the protein folding problem, we use exhaustive computer enumeration to explore
“sequence space soup,” an imaginary solution containing the “native” conformations (i.e., of
lowest free energy) under folding conditions, of every possible copolymer sequence. The model
is of short self-avoiding chains of hydrophobic (H) and polar (P) monomers configured on
the two-dimensional square lattice. By exhaustive enumeration, we identify all native
structures for every possible sequence. We find that random sequences of H/P copolymers will
bear striking resemblance to known proteins: Most sequences under folding conditions will be
approximately as compact as known proteins, will have considerable amounts of secondary
structure, and it is most probable that an arbitrary sequence will fold to a number of lowest
free energy conformations that is of order one. In these respects, this simple model shows that
proteinlike behavior should arise simply in copolymers in which one monomer type is highly
solvent averse. It suggests that the structures and uniquenesses of native proteins are not
consequences of having 20 different monomer types, or of unique properties of amino acid
monomers with regard to special packing or interactions, and thus that simple copolymers
might be designable to collapse to proteinlike structures and properties. A good strategy for
designing a sequence to have a minimum possible number of native states is to strategically
insert many P monomers. Thus known proteins may be marginally stable due to a balance:
More H residues stabilize the desired native state, but more P residues prevent simultaneous
stabilization of undesired native states.
J. Chem. Phys. 95 (5), 1 September 1991
0021-9606/91/173775-13$03.00
© 1991 American Institute of Physics
Downloaded 21 Apr 2006 to 128.100.83.175. Redistribution subject to AIP license or copyright, see http://jcp.aip.org/jcp/copyright.jsp

I. INTRODUCTION

Proteins are polymers. However, the properties and principal conformations of globular proteins are remarkably different from those of better understood polymer/solvent systems. Whereas polymers generally have a large ensemble of configurations, a protein adopts a single unique “native” structure. Different proteins have different native structures depending on the linear sequence of monomers in the chain. Polymers in dilute solution are generally well solvated; proteins are highly compact. Polymers in solution are generally amorphous; proteins have much regular internal organization. The main puzzle of protein structure and stability is how the monomer sequence encodes the unique native conformation. We explore this relationship of monomer sequence to tertiary structure by considering a “sequence space soup,” a solution containing all the possible monomer sequences that are folded to their native conformations. We describe a simplified but rigorous model which has bearing on the following types of questions. Is there some special aspect of the chemical structures of amino acids as monomers that causes proteins to differ from simpler polymers? Or is there something special about the steric precision which can be achieved by packing together amino acids in the compact three-dimensional structure of the chain, as if they were jigsaw puzzle pieces, that leads to these unique properties? Are proteinlike properties the consequence of nature’s use of 20 different monomer types, rather than just two or three? Is the special nature of protein structures due to evolutionary selection of certain specific amino acid sequences? What fraction of all possible amino acid sequences could fold up to proteinlike conformations? Could simpler polymers, or random sequences of simple monomers, fold up to configurations resembling those of native structures of proteins?

Briefly, our approach is to model the conformations of short chain molecules as self-avoiding walks on two-dimensional square lattices. The chain monomers are of two types, H (hydrophobic) and P (polar). In a “folding solvent,” there is a strong mutual pairwise attraction among H-type monomers. Thus in a folding solvent, the conformations of lowest free energy are those with the greatest number of HH contacts. Because the chains are sufficiently short, exhaustive computer enumeration permits us to find unequivocally all the native conformations (i.e., those of lowest free energy). We find the native structures for all the possible H/P sequences.

The main conclusions are summarized as follows. In a folding solvent, a significant fraction of all possible sequences: (i) fold to conformations that have approximately the same compactness as known proteins, (ii) fold to conformations that have approximately the same amount of secondary structure, and (iii) fold to only a very small number (of order one) of native conformations. Hence these proteinlike conformational properties are expected from a wide range of simple copolymers and do not require explanation in the 20 different types of amino acids or in subtle chemical details of amino acids as monomers or in special steric capabilities by which pairs of monomers fit together as pieces in a jigsaw puzzle. The results below suggest that proteinlike structures and functions might be achievable within a relatively large number of different sequences of amino acids or other simple copolymer molecules. It follows that the known biological proteins may have arisen by prebiotic processes from random sequences with considerably higher probability than has been previously thought. The model suggests that a significant fraction of copolymer sequences of chain molecules should be able to fold to unique compact structures with internal secondary structure provided that 30%-80% of the monomers in the chain are highly averse to the solvent.
We first define some useful terms. We refer to the set of
all possible sequences of monomers of a given chain length as
the sequence space. We define sequence space soup as the
following ensemble. Take an ensemble of N copies of every
sequence ($N \to \infty$). Put all the sequences into a solvent that
favors “folding” (Ref. 1). This “soup” (i.e., solution) then contains
an ensemble of all native structures of all possible sequences.
This is sequence space soup. Our purpose here is to explore
the configurational properties of the native structures of proteins or other copolymers in a sequence space soup.
The native structures of a sequence are defined as those
conformations that have the lowest contact free energy, E.
The contact free energy is the sum of the pairwise interactions among “topological neighbor” monomers (Ref. 2). Topological neighbors are pairs of monomers that are adjacent in
space in a given configuration but that are not neighboring in
the sequence, i.e., they are not covalently bonded neighbors.
Thus a native structure is a configuration that is the most
favorable possible for a sequence insofar as it has many advantageous self-contacts of the chain. See Figs. 1 and 2.
The mapping of sequences to native structures entails
three classes of properties. First, there are intrinsic properties of sequence space. For example, different sequences have
different monomer compositions. Also, sequences can be
compared by their “similarities” to each other, for example
the number of positions at which the monomers in two sequences are identical. These are properties of the sequence
space. Second, there are intrinsic properties of the conformational space. Examples of properties of the conformational
space are root mean square end-to-end distance, radius
of gyration $R_G$, distribution of compactness $\rho$ (Ref. 3), and secondary structure $S$ (Refs. 2 and 3).
Third, there are properties of the mapping itself between
the sequence space and the conformational space. Such
properties include the degeneracy and convergence, defined
as follows. Some sequences can configure to have only one
native structure. Other sequences can configure to have
many native structures. We refer to the number of native
structures that a given sequence has as its degeneracy g. The
set of all native structures of a sequence is referred to as its
degeneracy set. Sequences with one native structure (g = 1)
are referred to as singly degenerate; sequences with more
than one native structure are referred to as multiply degenerate. Degeneracy is a property of the mapping of the sequence
space to the conformational space. Another property of this
mapping is the convergence. Whereas the degeneracy set of a
given sequence describes a set of conformations, the convergence set of a conformation describes a set of sequences. A
given native structure may arise from only a single sequence
in the entire sequence space, or it may arise from many different sequences. The convergence set of a given native structure is the set of all sequences that “code” for a given native
structure. The convergence C is simply the number of sequences in the convergence set. Figure 1 shows schematically the mapping between sequence space and conformational space, with examples of degeneracy and convergence. In general, if the native states of a given sequence s are the conformations $c_1, c_2, \ldots, c_g$, the degeneracy set for the sequence s is $\{c_1, c_2, \ldots, c_g\}$. Similarly, if each of the sequences $s_1, s_2, \ldots, s_C$ has the given conformation c as the native state or one of its many native states, the set $\{s_1, s_2, \ldots, s_C\}$ is the convergence set for the conformation c.

FIG. 1. Mapping of a copolymer sequence onto its native structure. The four types of properties are those of: (i) the sequence space, (ii) the conformational space, (iii) the degeneracy set of one sequence onto its set of all possible native structures (right arrow), and (iv) the convergence set of one native structure onto the set of all possible sequences that encode it (left arrow).
Properties of the mapping may be divided into two classes: (i) properties of each sequence’s degeneracy set, and (ii) properties of each conformation’s convergence set (see Fig. 1). The convergence C is a property of the convergence set. Properties of the degeneracy set include the degeneracy g and native contact free energy E. In addition, we are also interested in the average compactness $\langle \rho \rangle_{ns}$, average radius of gyration $\langle R_G \rangle_{ns}$, and average secondary structure $\langle S \rangle_{ns}$, all taken over the degeneracy set of each sequence. Here the symbol $\langle\ \rangle_{ns}$ denotes “averaging over native states of a given sequence.” For example, if the degeneracy of a sequence s is g, and the compactnesses of its g native structures are $\rho_1, \rho_2, \ldots, \rho_g$, respectively, then

$$\langle \rho \rangle_{ns} = \sum_{i=1}^{g} \rho_i / g ,$$

and similar definitions apply for $\langle R_G \rangle_{ns}$ and $\langle S \rangle_{ns}$. As discussed in Ref. 4, the native state average of a quantity is equal to the limit of strong contact attractions $\epsilon \to -\infty$ of the average of that quantity over the full conformational space accessible to a given sequence. In other words, the native state average is the full ensemble average under folding conditions.
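The definitions of degeneracy and convergence can be made concrete with a minimal Python sketch. It assumes that conformations are counted once up to translation, rotation, and reflection, and that the native states are simply the conformations with the most HH contacts. The function names (`conformations`, `hh_contacts`, `sequence_space_soup`) are illustrative choices, not taken from the paper, and the chain length n = 6 is chosen only to keep the run short.

```python
from collections import defaultdict
from itertools import product


def conformations(n):
    """Yield each n-monomer self-avoiding walk on the square lattice once,
    up to translation, rotation, and reflection (assumed counting convention)."""
    path = [(0, 0), (1, 0)]          # first bond fixed along +x: removes rotations
    occupied = set(path)

    def extend(turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            if not turned and dy == -1:
                continue             # first turn must go "up": removes mirror images
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            path.append(site)
            occupied.add(site)
            yield from extend(turned or dy != 0)
            path.pop()
            occupied.remove(site)

    yield from extend(False)


def hh_contacts(conf, seq):
    """Number of nonbonded nearest-neighbor HH pairs in one conformation."""
    index = {site: i for i, site in enumerate(conf)}
    count = 0
    for i, (x, y) in enumerate(conf):
        if seq[i] != "H":
            continue
        for site in ((x + 1, y), (x, y + 1)):   # each lattice edge is visited once
            j = index.get(site)
            if j is not None and abs(i - j) > 1 and seq[j] == "H":
                count += 1
    return count


def sequence_space_soup(n):
    """Map every H/P sequence of length n to its native conformations."""
    confs = list(conformations(n))
    degeneracy_sets = {}                  # sequence -> indices of native conformations
    convergence_sets = defaultdict(set)   # conformation index -> sequences encoding it
    for seq in map("".join, product("HP", repeat=n)):
        scores = [hh_contacts(c, seq) for c in confs]
        best = max(scores)
        natives = [k for k, s in enumerate(scores) if s == best]
        degeneracy_sets[seq] = natives
        for k in natives:
            convergence_sets[k].add(seq)
    return confs, degeneracy_sets, convergence_sets


confs, degeneracy_sets, convergence_sets = sequence_space_soup(6)
g = {seq: len(natives) for seq, natives in degeneracy_sets.items()}     # degeneracy
C = {k: len(seqs) for k, seqs in convergence_sets.items()}              # convergence
print(len(confs), g["PPPPPP"], g["HPPPPH"])
```

A sequence with no H monomers has every conformation as a native state, so its degeneracy equals the total number of conformations. Summing g over sequences and summing C over conformations count the same sequence-structure pairs, so the two totals agree.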
II. THE MODEL
To study how the native states of proteins depend on
their monomer sequences requires a model for computing
the free energies of all the configurations of a chain in order
to find those configurations of lowest free energy. This is not
currently possible using high-resolution force-field methods
because only a very small fraction of conformational space
can be explored by existing computers for the tens of thousands of atoms of the protein plus solvent. Conformational space has been explored more broadly using lower-resolution lattice models either by Monte Carlo sampling (Refs. 5 and 6) or by exhaustive simulation of restricted conformational spaces (Ref. 7). But these methods are also too restricted for our present purposes since they cannot determine the native structure for any arbitrary sequence. The problems with those methods are that (i) the sampling of conformational space is much too sparse to ensure identification of the native state, and/or (ii) they do not identify native structures based on the physical forces alone; considerable prior knowledge of the native structure is required. Our purpose here is to explore the simplest model for protein folding that is based only on the dominant forces.

FIG. 2. (a) Examples of sequences and some native structures for which they code. (b) Examples of n = 14 singly degenerate sequences and their native structures. Open circles represent P residues and filled circles represent H residues.
Proteins are flexible polymers that are driven to collapse mainly by hydrophobic interactions (Ref. 8) and that are subject to severe steric constraints in their compact native states. We assume that native conformations are equilibrium states. Therefore we adopt the model of Lau and Dill (Refs. 4 and 9) in which proteins are modeled as short copolymer chains that undergo self-avoiding walks on two-dimensional square lattices. Every chain has n monomers (Ref. 10), $n\Phi$ of which are of the type H (hydrophobic), and $n(1 - \Phi)$ of which are of the type P (polar, or other). In this model, the hydrophobic interaction is accounted for by a favorable free energy $\epsilon < 0$ for every nonbonded nearest-neighbor HH contact. Therefore the native states of these chains are those configurations that have the largest number of HH contacts. By considering chains that are sufficiently short ($n \le 14$ monomers), we can exhaustively explore the conformational space to find all native structures. In addition, we do so here for every sequence in the sequence space. This mapping process is shown in Fig. 2. Thus, within the framework of this simple model, there is no approximation in the mapping of sequences to native structures. This model is therefore not restricted to proteins; it also describes other copolymers composed of two types of monomers: those with solvent affinity, and those that are strongly solvent averse.
The obvious simplifications of the present treatment as a
model for proteins are that: (i) the chains are short and (ii)
they are two dimensional. To determine the limitations due
to (i), to the extent that computational limitations permit,
we study several different chain lengths. It is more difficult
to address (ii), the effect of dimensionality. However, in at
least one respect the two-dimensional (2D) model offers a
considerable advantage over the corresponding three-dimensional (3D) model. A most important physical variable
for heteropolymer collapse is the ratio of the number of monomers at the surface to those in the interior of the compact
conformation. In a 3D simulation, a realistic value for this
ratio is about 1/4 for a small globular protein of 100 monomers. In 3D, a chain length of n = 27 is at the upper limit of
what can be explored by computer, and yet the 3-cube has
only a single interior site. Thus in 3D it is not possible to
simulate a chain with realistic inside/outside ratio. On the
other hand in two dimensions, a chain of only 16 monomers
can fold to a conformation with 25% of its monomers in an
interior core (four interior sites in a $4 \times 4$ square). Thus the
2D chains studied here reflect approximately the correct ratio of inside/outside monomers.
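The inside/outside arithmetic quoted above is easy to check. The short Python sketch below counts interior sites of a fully filled square or cubic block of lattice sites; the helper name `interior_fraction` is an illustrative choice, not from the paper.

```python
def interior_fraction(side, dim):
    """Return (interior sites, total sites) for a side**dim block of lattice sites."""
    total = side ** dim
    interior = max(side - 2, 0) ** dim   # sites that do not lie on any face of the block
    return interior, total


for side, dim in ((4, 2), (3, 3)):
    interior, total = interior_fraction(side, dim)
    print(f"{side}^{dim} block: {interior}/{total} interior sites = {interior / total:.1%}")
```

The 4 x 4 square (16 monomers) has 4 interior sites, i.e., 25%, whereas the 3 x 3 x 3 cube (27 monomers) has only one.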
With higher dimensionality, the number of possible nearest neighbor interactions and the configurational freedom increase, and excluded volume becomes less constraining. These factors affect the thermodynamics and phase transitions, but the present study is not concerned with issues of the balance of forces. The present work and related studies (Refs. 4 and 9) show that in the following respects the 2D short-chain lattice model mimics the protein folding process. In a solvent compatible with both monomer types, there is a large ensemble of open configurations. When the solvent becomes incompatible with monomers of type H, the chain collapses to a relatively small number (of order one) of compact conformations, in which the core is predominantly H monomers and the surface is predominantly P. The ensemble of compact conformations of the 2D short chains becomes populated with approximately the same distribution of helices, sheets, and irregular conformations as are found in the known proteins (Ref. 11). The mutability properties of the 2D short-chain lattice molecules mimic those of proteins: proteins are plastic, small perturbations of sequence lead to small perturbations of structure, surface sites are relatively insensitive to mutation, and the H residues in the core are most sensitive to mutation. It is in these respects that the present 2D results resemble the folding of real proteins in 3D.
It is useful to first define certain quantities in the model. The compactness is defined as $\rho = t/t_{\max}$, where t is the number of intrachain contacts and $t_{\max}$ is the maximum number of such contacts achievable by lattice chains with the given length (Ref. 3). Compactness is closely related to the radius of gyration $R_G = [\sum_{i=1}^{n} (\mathbf{r}_i - \langle \mathbf{r} \rangle)^2 / n]^{1/2}$, where $\mathbf{r}_i$ is the position of the ith monomer and $\langle \mathbf{r} \rangle = \sum_{i=1}^{n} \mathbf{r}_i / n$ is the position of the centroid of all the monomers.
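As an illustration of these two definitions, the following Python sketch evaluates the compactness and the radius of gyration for two hand-built 9-monomer conformations. The `T_MAX` values are transcribed from the highest populated t column of Table I for each n; the function names and the two example conformations are assumptions made for this example.

```python
import math

# Maximum contact numbers t_max for n = 6..14, transcribed by hand from the
# highest populated t column of Table I.
T_MAX = {6: 2, 7: 2, 8: 3, 9: 4, 10: 4, 11: 5, 12: 6, 13: 6, 14: 7}


def count_contacts(conf):
    """Number t of nonbonded nearest-neighbor pairs in a lattice conformation."""
    index = {site: i for i, site in enumerate(conf)}
    t = 0
    for i, (x, y) in enumerate(conf):
        for site in ((x + 1, y), (x, y + 1)):   # each lattice edge is visited once
            j = index.get(site)
            if j is not None and abs(i - j) > 1:
                t += 1
    return t


def compactness(conf):
    """rho = t / t_max for a conformation of n = len(conf) monomers."""
    return count_contacts(conf) / T_MAX[len(conf)]


def radius_of_gyration(conf):
    """Root mean square distance of the monomers from their centroid."""
    n = len(conf)
    cx = sum(x for x, _ in conf) / n
    cy = sum(y for _, y in conf) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in conf) / n)


# A maximally compact 3 x 3 "snake" and a fully extended rod, both with n = 9.
snake = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1), (0, 2), (1, 2), (2, 2)]
rod = [(i, 0) for i in range(9)]

for name, conf in (("snake", snake), ("rod", rod)):
    print(name, compactness(conf), round(radius_of_gyration(conf), 3))
```

The snake has four contacts and therefore $\rho = 1$, with $R_G = \sqrt{4/3}$ bond lengths; the rod has no contacts and a much larger radius.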
The contact map is defined as the set of all nearest-neighbor contact pairs (i, j) between the ith and the jth monomers along the chain. Since the only contributions to contact free energy are from nearest-neighbor HH contacts in the present model, the contact free energy of a conformation is completely determined by its contact map. Consequently, for a given sequence a contact free energy can be assigned to each contact map. Native structures are therefore obtained in two steps: an exhaustive search through all distinguishable contact maps first identifies those contact maps with the lowest contact free energy E, and the native conformations are then deduced by enumerating all conformations consistent with those contact maps.
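A minimal Python sketch of the contact map and the energy assigned to it is given below. Monomers are indexed from 0 here, whereas the text counts from 1. The function names, the example conformation, and the default `eps=-1.0` (energy in units of the HH contact strength) are assumptions made for illustration.

```python
def contact_map(conf):
    """Set of nonbonded nearest-neighbor pairs (i, j), i < j, with 0-based indices."""
    index = {site: i for i, site in enumerate(conf)}
    pairs = set()
    for i, (x, y) in enumerate(conf):
        for site in ((x + 1, y), (x, y + 1)):   # each lattice edge is visited once
            j = index.get(site)
            if j is not None and abs(i - j) > 1:
                pairs.add((min(i, j), max(i, j)))
    return frozenset(pairs)


def contact_energy(cmap, seq, eps=-1.0):
    """Contact free energy: eps for every HH pair in the contact map."""
    return eps * sum(1 for i, j in cmap if seq[i] == "H" and seq[j] == "H")


# A compact 6-mer filling a 2 x 3 block of lattice sites.
conf = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2), (1, 2)]
cmap = contact_map(conf)
print(sorted(cmap))
for seq in ("HPHHPH", "HPPHPP", "PPPPPP"):
    print(seq, contact_energy(cmap, seq))
```

Because the energy depends only on the contact map and the sequence, two conformations that share a contact map are energetically indistinguishable in this model. This is what allows the search to run over contact maps first.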
The numbers $M^{(t)}(n)$ of distinguishable contact maps for chains with n = 6-14 monomers and $0 \le t \le t_{\max}$ contacts are given in Table I. These numbers are compared with the number of conformations $\Omega^{(t)}(n)$ given (Ref. 12) in Table I of Ref. 3. While the total number of conformations scales approximately (Refs. 13-15) as $\Omega_0(n) \sim (2.728)^n$, the total number of distinguishable contact maps increases approximately as $M_0(n) \sim (2.303)^n$. Furthermore, the quantity $M_0(n)$ is found to be consistently smaller than $\Omega_0(n)$, e.g., for n = 14, $M_0(14)/\Omega_0(14) = 0.054$. Hence for computational efficiency we use an algorithm that identifies native states by first searching through all distinguishable contact maps, as indicated above, instead of directly searching through all possible conformations.
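The counting of conformations and distinguishable contact maps can be sketched in Python for the shortest chains. The sketch assumes that conformations are counted up to rotation and reflection, and that two contact maps are distinguishable when their sets of contact pairs differ. The function names are illustrative, and the range of n is kept small so that the run is fast.

```python
from collections import Counter


def conformations(n):
    """Yield each n-monomer self-avoiding walk once, up to rotation and reflection."""
    path = [(0, 0), (1, 0)]          # first bond fixed along +x
    occupied = set(path)

    def extend(turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            if not turned and dy == -1:
                continue             # first turn must go "up"
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            path.append(site)
            occupied.add(site)
            yield from extend(turned or dy != 0)
            path.pop()
            occupied.remove(site)

    yield from extend(False)


def contact_map(conf):
    """Set of nonbonded nearest-neighbor pairs (i, j), i < j."""
    index = {site: i for i, site in enumerate(conf)}
    pairs = set()
    for i, (x, y) in enumerate(conf):
        for site in ((x + 1, y), (x, y + 1)):
            j = index.get(site)
            if j is not None and abs(i - j) > 1:
                pairs.add((min(i, j), max(i, j)))
    return frozenset(pairs)


def census(n):
    """Return (number of conformations, {t: number of distinct contact maps with t contacts})."""
    maps = set()
    n_conf = 0
    for conf in conformations(n):
        n_conf += 1
        maps.add(contact_map(conf))
    return n_conf, dict(Counter(len(m) for m in maps))


for n in range(6, 11):
    n_conf, by_t = census(n)
    print(n, n_conf, sum(by_t.values()), dict(sorted(by_t.items())))
```

For n = 6 this gives 36 conformations and 8 contact maps (1, 3, and 4 maps with t = 0, 1, and 2). Under the stated assumptions, the per-t counts printed for the other chain lengths can be compared directly with the corresponding rows of Table I.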
It is noteworthy that the average number of conformations per contact map $\Omega^{(t)}(n)/M^{(t)}(n)$ is approximately unity at maximum compactness (Ref. 16) $t = t_{\max}$ for all chain lengths considered here, but the corresponding ratio becomes extremely large as t approaches zero. In general, the degeneracy g of a sequence may be viewed as arising from two sources: (i) the number of distinguishable contact maps with the lowest contact free energy (this number is the “contact-map degeneracy”); and (ii) the number of conformations consistent with all the contact maps with the lowest contact free energy. Since (ii) can only be small for compact conformations, low degeneracies are only possible with high compactness of the native structures, as is observed in globular proteins.

TABLE I. Number of distinguishable contact maps $M^{(t)}(n)$ for the square lattice as a function of the total number of contacts t and the number of residues n in the chains. Note that the null contact map for t = 0 is counted as one contact map. The t = 1 count $M^{(1)}(n)$ is given by $(n^2 - 8n + 24)/4$ for even n and $(n^2 - 8n + 23)/4$ for odd n.

n     M_0(n) (total)   t=0   t=1   t=2   t=3   t=4    t=5    t=6    t=7
6           8           1     3     4
7          14           1     4     9
8          41           1     6    15    19
9          78           1     8    22    42     5
10        212           1    11    42    68    90
11        424           1    14    65   116   180     48
12       1113           1    18   101   276   286    400     31
13       2309           1    22   148   452   617    736    333
14       5953           1    27   218   761  1621   1310   1655    360
III. MOLECULES ARE COMPACT IN SEQUENCE SPACE SOUP
What are the conformational properties of the native
structures in sequence space soup? In this section we explore
the compactness and radius of gyration. The results discussed below show that the chain conformations of native
structures in sequence space soup are, on average, quite
highly compact and have small radii of gyration.
We have performed exhaustive enumerations to determine $\langle \rho \rangle_{ns}$ for every sequence (Ref. 17) with chain length n = 6-14. The results indicate that a substantial fraction of all sequences attain high $\langle \rho \rangle_{ns}$. For example, for n = 12, 13, and 14, the percentages of sequences with $\langle \rho \rangle_{ns} > 0.8$ are 36.1%, 65.8%, and 62.7%, respectively.
Another measure of size of chain molecules is the radius of gyration $R_G$ as defined above. The measures $R_G$ and $\rho$ are closely related: small radius is correlated with high compactness and vice versa. For chains with n monomers on the square lattice, the maximum radius possible is $\sqrt{(n^2 - 1)/12}$ when all the monomers lie on a straight line. The minimum radius $(R_G)_{\min}$ is determined by exhaustive enumeration. For n = 6-14, values of $(R_G)_{\min}$ as well as the average radius $\langle R_G \rangle$ over all possible conformations are listed in Table II.
For each individual sequence, the quantity $\langle R_G \rangle_{ns}$ is defined as the arithmetic mean of all radii $(R_G)_i$, i = 1, 2, ..., g, of the sequence’s g native conformations in its degeneracy set, $\langle R_G \rangle_{ns} = \sum_{i=1}^{g} (R_G)_i / g$. Figure 3 gives the distribution of $\langle R_G \rangle_{ns}$ over all possible n = 14 sequences in comparison with the distribution of $R_G$ over all possible conformations.
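The extreme radii of gyration can be found by brute force for short chains. The Python sketch below enumerates every conformation once, up to rotation and reflection, and records the smallest and largest $R_G$. The function names are illustrative, and only n = 6-9 is scanned to keep the run short.

```python
import math


def conformations(n):
    """Yield each n-monomer self-avoiding walk once, up to rotation and reflection."""
    path = [(0, 0), (1, 0)]          # first bond fixed along +x
    occupied = set(path)

    def extend(turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            if not turned and dy == -1:
                continue             # first turn must go "up"
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            path.append(site)
            occupied.add(site)
            yield from extend(turned or dy != 0)
            path.pop()
            occupied.remove(site)

    yield from extend(False)


def radius_of_gyration(conf):
    """Root mean square distance of the monomers from their centroid."""
    n = len(conf)
    cx = sum(x for x, _ in conf) / n
    cy = sum(y for _, y in conf) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in conf) / n)


def rg_extremes(n):
    """Smallest and largest radius of gyration over all n-monomer conformations."""
    radii = [radius_of_gyration(c) for c in conformations(n)]
    return min(radii), max(radii)


for n in range(6, 10):
    rg_min, rg_max = rg_extremes(n)
    straight = math.sqrt((n * n - 1) / 12)   # closed form for the straight chain
    print(n, round(rg_min, 3), round(rg_max, 3), round(straight, 3))
```

The largest radius found coincides with the straight-chain formula, and the smallest values can be compared with the $(R_G)_{\min}$ column of Table II (for example 0.957 for n = 6 and 1.155 for n = 9).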
FIG. 3. Distribution of average native radius of gyration $\langle R_G \rangle_{ns}$ in the sequence space soup and distribution of radius of gyration $R_G$ in the full conformational space for n = 14. Radii of gyration are measured in units of the n = 14 minimum radius $(R_G)_{\min}$ (Table II). Horizontal axis: $\langle R_G \rangle_{ns}/(R_G)_{\min}$.
The distribution of $R_G$ (dotted curve) peaks around $1.5\,(R_G)_{\min}$. Approximately half of all possible conformations have radii larger than 150% of the minimum radius. On the other hand, the distribution of $\langle R_G \rangle_{ns}$ (solid curve) has a maximum around $1.1\,(R_G)_{\min}$, implying that the majority of sequences under folding conditions have radii close to the minimum allowed by excluded volume.
Figure 4 shows the cumulative distributions of fraction of sequences (vertical scale) with average radii smaller than the value of $\langle R_G \rangle_{ns}$ indicated by the horizontal scale. $\langle R_G \rangle_{ns}$ of every chain length n is measured in units of the minimum radius $(R_G)_{\min}$ for that n (Table II). The curves for different n are all similar; they show that substantial fractions of sequences collapse to radii very close to $(R_G)_{\min}$. For example, for n = 13 and 14, the percentages of sequences whose average radii are less than 10% in excess of the absolute minimum radius [$\langle R_G \rangle_{ns} < 1.1\,(R_G)_{\min}$] are 50.7% and 50.8%, respectively.

TABLE II. Radii of gyration $R_G$ of square lattice chain conformations. n is the number of residues, $(R_G)_{\min}$ is the minimum $R_G$ in the entire conformational space of n-residue chains. $\langle R_G \rangle$ is the average radius over the entire conformational space. All $R_G$’s are measured in units of bond length, i.e., the distance between the centers of two adjacent lattice sites.

n     (R_G)_min   <R_G>
6     0.957       1.228
7     1.030       1.381
8     1.104       1.522
9     1.155       1.666
10    1.269       1.801
11    1.324       1.938
12    1.354       2.067
13    1.418       2.198
14    1.473       2.323

FIG. 4. Cumulative distribution of average radius of gyration $\langle R_G \rangle_{ns}$ for n = 9-14. For each n, $\langle R_G \rangle_{ns}$ is measured in units of $(R_G)_{\min}$ for that n. The inset shows the chain length dependence of the fraction of sequences whose $\langle R_G \rangle_{ns}$ are smaller than specified upper bounds.
To extrapolate these short chain exact results to longer chains, the n dependence of compactness in sequence space soup is plotted in the inset of Fig. 4. The inset of Fig. 4 shows some variations with n; for example, the percentages of sequences with small $\langle R_G \rangle_{ns}$ are higher for n = 10 and 11 than for other n values shown. However, these variations do not suggest any systematic decreasing trend as n increases. These variations are principally reflections of the choice of $(R_G)_{\min}$ as units for $R_G$, which measures radius relative to the absolute minimum radius of the compact state. However, while the n dependence of $\langle R_G \rangle$ on the square lattice is smooth [given quite accurately by $\langle R_G \rangle = \mathrm{const} \times n^{3/4}$], the n dependence of $(R_G)_{\min}$ is not as smooth due to specific shape constraints (Ref. 3) on highly compact lattice conformations. Consequently the short chain lattice $(R_G)_{\min}$ shows considerable variation from the expected $n^{1/2}$ scaling in two dimensions. Hence $\langle R_G \rangle/(R_G)_{\min}$ also shows considerable variation from the expected scaling of $n^{1/4}$. Given that the inset of Fig. 4 shows no evidence of a systematic decreasing trend of the fraction of sequences with small $\langle R_G \rangle_{ns}$, it is reasonable to also expect for longer chains that a substantial fraction of sequences have small $\langle R_G \rangle_{ns}$, with approximately 50% of sequences capable of collapsing into conformations with radii no more than 10% larger than the absolute minimum radius.
It is likely that the lattice model is conservative in estimating the number of folding molecules. On square or simple cubic lattices, only even-numbered monomers can be nearest neighbors of odd-numbered monomers. Consequently, there is a nontrivial number of sequences that are incapable of forming any HH nearest-neighbor contacts despite their relatively high hydrophobic composition. The degeneracy set of each such E = 0 sequence is the entire conformational space, hence $g = \Omega_0(n)$. Explicit construction gives the number $N_0(n)$ of E = 0 sequences for chain length n > 1 in the model as

$$N_0(n) = \begin{cases} 2^{n/2+1} + 2n - 4 & \text{for } n \text{ even}, \\ 3 \cdot 2^{(n-1)/2} + 2n - 4 & \text{for } n \text{ odd}. \end{cases} \tag{3.1}$$

As $N_0(n)$ undoubtedly overestimates the actual number of nonfolding E = 0 sequences in real space (Ref. 21), the present lattice model is likely to be conservative in estimating the number of folding sequences.
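The count in Eq. (3.1) can be cross-checked by brute force for short chains. The Python sketch below collects every contact pair that occurs in at least one conformation, then counts the sequences that place no two H monomers on any such pair. The function names are illustrative, and only n = 2-9 is scanned so the enumeration stays fast.

```python
from itertools import product


def conformations(n):
    """Yield each n-monomer self-avoiding walk once, up to rotation and reflection."""
    path = [(0, 0), (1, 0)]          # first bond fixed along +x
    occupied = set(path)

    def extend(turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            if not turned and dy == -1:
                continue             # first turn must go "up"
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            path.append(site)
            occupied.add(site)
            yield from extend(turned or dy != 0)
            path.pop()
            occupied.remove(site)

    yield from extend(False)


def realizable_contacts(n):
    """All pairs (i, j) that are in contact in at least one conformation."""
    pairs = set()
    for conf in conformations(n):
        index = {site: i for i, site in enumerate(conf)}
        for i, (x, y) in enumerate(conf):
            for site in ((x + 1, y), (x, y + 1)):
                j = index.get(site)
                if j is not None and abs(i - j) > 1:
                    pairs.add((min(i, j), max(i, j)))
    return pairs


def count_e0_bruteforce(n):
    """Number of H/P sequences that cannot form any HH contact."""
    pairs = realizable_contacts(n)
    return sum(
        1
        for seq in product("HP", repeat=n)
        if not any(seq[i] == "H" and seq[j] == "H" for i, j in pairs)
    )


def n0_formula(n):
    """Closed form of Eq. (3.1)."""
    if n % 2 == 0:
        return 2 ** (n // 2 + 1) + 2 * n - 4
    return 3 * 2 ** ((n - 1) // 2) + 2 * n - 4


for n in range(2, 10):
    print(n, count_e0_bruteforce(n), n0_formula(n))
```

Every realizable contact joins monomers whose chain positions differ by an odd number of at least three, which is the parity restriction described above. The two columns printed by the loop agree for all chain lengths scanned.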
The results in this section show that a considerable fraction of all possible sequences will be highly compact under folding conditions. For n = 11-14, the numbers of sequences with $\langle \rho \rangle_{ns} = 1$ are 123, 95, 978, and 992, respectively. How compact does a molecule need to be to resemble a real protein? We have computed the radii of gyration $\langle R_G \rangle_{ns}$ of the native conformations of ten proteins and their theoretical minimum radii of gyration $(R_G)_{\min}$ (Table III). Although most known proteins are highly compact, they are not maximally compact; they have rough surfaces, and generally large active site cavities. Hence the present simulations suggest that a considerable fraction of all sequences under folding conditions will have “proteinlike” compactness.
IV. THE AMOUNT OF SECONDARY STRUCTURE IN SEQUENCE SPACE SOUP IS LARGE
We also find that the amount of secondary structure in
sequence space soup is large. Here S denotes the fraction of
residues (monomers) in a conformation that are in some
form of secondary structure: helices or parallel or antiparallel sheets. The lattice representations of these forms of structure are shown in Fig. 5. Previous exhaustive simulations of conformational space of chains on two-dimensional (Refs. 2 and 3) and three-dimensional (Refs. 11 and 37) lattices have shown that the amount of secondary structure S increases with chain compactness $\rho$. S is greater than 50% for compact conformations ($\rho = 1$) for most chain lengths we have explored in two and three dimensions. In the present study, we are interested in the average secondary structure fraction $\langle S \rangle_{ns}$ for each sequence, defined as the mean $\langle S \rangle_{ns} = \sum_{i=1}^{g} S_i / g$ of the secondary structure fraction $S_i$ (i = 1, 2, ..., g) of the g native conformations in the degeneracy set of the sequence.

The average amount of secondary structure $\langle S \rangle_{ns}$ varies from one sequence to another. The distribution of $\langle S \rangle_{ns}$ over all sequences in sequence space soup is shown in Fig. 6. Given that (i) most molecules in sequence space soup are highly compact, and (ii) S increases with compactness (Ref. 3), it follows that sequence space soup should be characterized by a considerable amount of secondary structure. Figure 6 shows this to be the case. The average of $\langle S \rangle_{ns}$ over all possible n = 14 sequences is 39.6%. For comparison the average S over all $\rho = 1$ conformations is 60.7% (Ref. 3); not all sequences are capable of collapsing to maximal compactness, i.e., not all sequences have $\langle \rho \rangle_{ns} = 1$. This relatively high $\langle S \rangle_{ns}$ for a substantial fraction of random sequences is expected to be valid for long chains (Ref. 3). The present results are supported by experiments of Rao, Carlstrom, and Miller (Ref. 38), who determined [...] in a sequence space soup of random terpolymers of amino acids. The model also demonstrates that neither sequence specificity nor low degeneracy is a necessary condition for large amounts of secondary structure.

FIG. 5. Secondary structures as patterns on the contact map (Refs. 2 and 3), in which a filled circle at the ith row and jth column represents a contact between monomers i and j. The dotted boxes encircle minimal units that must be present to qualify as secondary structures (Refs. 2 and 3). Panels: (a) Helices, (b) Antiparallel Sheets, (c) Parallel Sheets, (d) Turns.

FIG. 6. Secondary structure in the n = 14 sequence space soup, i.e., secondary structure in the native states of all possible n = 14 sequences. The histogram shows the distribution of the average number of residues $n\langle S \rangle_{ns}$ in secondary structure among the native conformations of all sequences. Horizontal axis: average number of residues in secondary structure, $n\langle S \rangle_{ns}$.

TABLE III. Radii of gyration of the native states of ten proteins, from Ref. 22. All radii are measured in units of the $C^{\alpha}$-$C^{\alpha}$ virtual bond length (3.8 Å). For each protein, n is the number of residues, $\langle R_G \rangle_{ns}$ is the radius of gyration calculated from the $C^{\alpha}$ coordinates (Refs. 23-33) of the crystal structure [$\langle R_G \rangle_{ns} = (R_G)_1$ for g = 1]. $(R_G)_{\min}$ is the theoretical minimum radius deduced by assuming that the molecule takes a hypothetical spherical shape with all its residues close packed, thus having a volume equal to the sum of all residue volumes. The radius of this sphere can be calculated using volume data compiled by Miyazawa and Jernigan (Ref. 34). Since the residues are assumed to be uniformly distributed within this sphere, the theoretical $(R_G)_{\min}$ is $\sqrt{3/5}$ times the radius of the sphere (Refs. 35 and 36). The last column in the table gives the percentage by which the actual radius of gyration $\langle R_G \rangle_{ns}$ is in excess of the theoretical minimum $(R_G)_{\min}$.

Protein                  n     <R_G>_ns   (R_G)_min   [<R_G>_ns - (R_G)_min]/(R_G)_min
Crambin (1CRN)           46    2.54       2.28        11%
Ferredoxin (1FDX)        54    2.46       2.39         3%
BPTI (CPTI)              58    2.80       2.56         9%
Ubiquitin (1UBQ)         76    3.02       2.82         7%
Ribonuclease A (1RN3)    124   3.77       3.28        15%
Lysozyme (1LZ1)          130   3.63       3.33         9%
Papain (9PAP)            212   4.16       3.91         6%
Concanavalin A (2CNA)    237   4.47       4.04        11%
Subtilisin (1SBT)        275   4.33       4.14         4%
Thermolysin (3TLN)       316   5.12       4.43        15%
One particularly important property of a globular protein is its extraordinarily small number of conformations of
lowest free energy: typically a protein has only one native
state. In contrast, most polymeric states of matter (apart
from crystalline states) are characterized by a large number
of conformations. Typically the number of chain conformations increases exponentially with chain length, even for the
ensemble of maximally compact chains.3,37 In the present
section, our aim is to explore the distribution of the number
of native states per sequence (i.e., the degeneracy) in sequence space soup, given that the hydrophobic interaction is
the principal energetic feature encoded in the sequence.
We have performed exhaustive enumerations to determine the degeneracy of all sequences. In the present model,
the highest possible degeneracy of any n-monomer sequence
is g = R, (n), corresponding to the N, (n) sequences that
have E = 0 [see Eq. (3.1) 1. The smallest degeneracy is
g = 1. For the relatively longer chains, the distributions of
degeneracy all peak at low g, as shown by the n = 14 example of Fig. 7. The distributions of degeneracy for 6<n< 14 for
g< 10 are also listed in Table IV. The most remarkable result
shown in Fig. 7 is that the histogram of sequence degener-
mined by circular dichroism that the helical content is 46%
acieshas a generaldecreasingtrend. That is, there are more
V. DEGENERACY:
HOW MANY
A SEQUENCE HAVE?
NATIVE
STATES
DOES
[Figure 7: histogram; horizontal axis: degeneracy g, from 0 to 500; inset horizontal axis: degeneracy g, from 0 to 50.]
FIG. 7. Distribution of degeneracy in all possible n = 14 sequences. The
maximum value of g is $\Omega_0(14) = 110\,188$, but only the distribution for
$g \le 500$ is shown here. The distribution peaks at small values of g and diminishes rapidly as g increases. That is, most sequences have relatively few native states. The inset shows the enlarged portion of the distribution for
$g \le 50$.
sequences that have only one or two native structures than
have 100 or 1000 native structures. Consistent with earlier,
more limited results (Ref. 4), this suggests that the striking uniqueness of the native structures of globular proteins may arise
principally from the relatively nonspecific nonlocal (e.g.,
hydrophobic) interactions encoded within the sequence,
rather than from the more subtle specific types of interactions that contribute to protein structure.
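The degeneracy count discussed above can be sketched for short chains in a few lines of Python. This is a minimal illustration, not the authors' program. It assumes that conformations are self-avoiding walks on the square lattice counted once up to rotation and reflection, that all $2^n$ H/P strings are treated as distinct sequences, and that the native states of a sequence are the conformations with the largest number of HH contacts. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

STEPS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Yield every n-monomer self-avoiding walk (n >= 2) on the square lattice
    once, up to rotation and reflection. The first bond is fixed along +x and
    the first step off the x axis must point toward +y."""
    def grow(path, occupied, left_axis):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            if not left_axis and dy == -1:
                continue  # mirror image of a walk that turns toward +y
            path.append(site)
            occupied.add(site)
            yield from grow(path, occupied, left_axis or dy != 0)
            path.pop()
            occupied.remove(site)

    start = [(0, 0), (1, 0)]
    yield from grow(start, set(start), False)


def contacts(conf):
    """List the (i, j) pairs, i < j, of non-bonded monomers on adjacent sites."""
    index = {site: i for i, site in enumerate(conf)}
    pairs = []
    for i, (x, y) in enumerate(conf):
        for dx, dy in ((1, 0), (0, 1)):  # each adjacent pair is seen once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1:
                pairs.append((min(i, j), max(i, j)))
    return pairs


def degeneracy(sequence, contact_lists):
    """Number of conformations that reach the maximum count of HH contacts."""
    hh = [sum(sequence[i] == "H" and sequence[j] == "H" for i, j in pairs)
          for pairs in contact_lists]
    return hh.count(max(hh))


def degeneracy_histogram(n):
    """Map g -> number of the 2**n H/P sequences with degeneracy g."""
    contact_lists = [contacts(c) for c in conformations(n)]
    return Counter(degeneracy(seq, contact_lists)
                   for seq in product("HP", repeat=n))


hist6 = degeneracy_histogram(6)
print(hist6[1], hist6[2])  # 7 2
```

For n = 6 the sketch finds 7 sequences with g = 1 and 2 sequences with g = 2, which agrees with the first two n = 6 entries of Table IV. A sequence with no possible HH contact has every conformation as a native state, so its degeneracy equals the total number of conformations (36 for n = 6 under the counting convention assumed here).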
Our principal interest here is the number of sequences
that have low degeneracies, since these sequences would
then exhibit proteinlike uniqueness of native structure.
Figure 8 shows the fraction of sequences whose degeneracies
are no larger than various upper bounds on g, as a function of n.
Due to the limitation on chain lengths simulated (n = 6-14), extrapolations to longer chain lengths have much
uncertainty. The fractions of singly degenerate (g = 1) sequences for n = 12, 13, and 14 are 2.1%, 2.1%, and 2.4%,
respectively (see also Table IV). There is no clear trend. The
n dependence is estimated as follows. Assuming that the
fraction of sequences whose degeneracies do not exceed a
given upper bound is proportional to $K^n$, where K is a constant, K is estimated to be between 0.91 and 0.97. The
fractions of $g \le 5$ and $g \le 40$ sequences are then estimated
for n = 100 chains to be of order $10^{-5}$-$10^{-3}$ and
$10^{-4}$-$10^{-2}$, respectively. However, the actual values for
these fractions could be much higher, as suggested by the
behavior of g = 1 sequences described above.

TABLE IV. Distribution of sequence degeneracy. The table gives the number of n-residue sequences with degeneracy g. The maximum possible value
of degeneracy is $\Omega_0(n)$ (see the text); only distributions for $g \le 10$ are
shown.

  g   n=6   n=7   n=8   n=9  n=10  n=11  n=12  n=13  n=14
  1     7    10     7     6     6    62    87   173   386
  2     2    12    19    35    54    71   165   340   606
  3     6     4    28    25    33    68   161   213   526
  4     9     6    17    21    52    69   142   278   548
  5     0    16     1    45    21    50   116   254   383
  6    12     2    18    15    60    59   117   232   516
  7     0     6     4    28    27    54    60   200   271
  8     0     0    12     6    53    61    86   193   387
  9     0     0    10    12    24    36    82   153   243
 10     4     1     5    16    13    80    69   130   317

[Figure 8: horizontal axis: number of residues n (chain length), from 6 to 14.]
FIG. 8. Fractions of sequences with low degeneracies. Fractions of sequences whose degeneracies are not larger than 1, 5, 10, 20, 30, and 40 are
plotted as functions of chain length n.

Despite the numerical uncertainties in extrapolation,
the principal trend appears to be that the fraction of low-degeneracy sequences decreases only very gradually with increasing n, so the number of low-degeneracy sequences increases rapidly with chain length. Thus the probability that
an arbitrarily chosen random sequence of length n = 100 is
singly degenerate or has low degeneracy is at worst about
$10^{-6}$ in two dimensions, which translates into $10^{24}$ H/P type
sequences. However, the average degeneracy over all sequences
increases approximately exponentially,
$\langle g\rangle \approx 0.51 \times (1.821)^n$.
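The arithmetic behind these estimates can be checked with a short Python sketch. It uses only numbers quoted in this paper: the g = 1 counts from Table IV and from the note added in proof, the worst-case fraction of about $10^{-6}$, and the fitted growth law for the average degeneracy. The function names are hypothetical, and the sketch does not reproduce the authors' fitting procedure for K.

```python
# g = 1 counts from Table IV (n = 12-14) and the note added in proof (n = 15-17).
G1_COUNTS = {12: 87, 13: 173, 14: 386, 15: 857, 16: 1539, 17: 3404}


def singly_degenerate_fraction(n):
    """Fraction of the 2**n H/P sequences of length n that have g = 1."""
    return G1_COUNTS[n] / 2 ** n


def low_degeneracy_count(n, fraction):
    """Number of length-n sequences if `fraction` of all 2**n qualify."""
    return fraction * 2 ** n


def average_degeneracy(n, prefactor=0.51, base=1.821):
    """Fitted growth law <g> ~ 0.51 * 1.821**n quoted in the text."""
    return prefactor * base ** n


for n in sorted(G1_COUNTS):
    print(n, f"{100 * singly_degenerate_fraction(n):.1f}%")
print(f"{low_degeneracy_count(100, 1e-6):.2e}")
```

Since $2^{100} \approx 1.3 \times 10^{30}$, a fraction of $10^{-6}$ corresponds to roughly $10^{24}$ sequences, as stated above.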
Comparison of the results of this section with those of
Sec. III shows that it is much more improbable to find a
sequence with low degeneracy than to find a sequence that
folds to a compact conformation with much secondary
structure (Ref. 39). Small g is a much more stringent condition to
satisfy than high $\langle p\rangle_{ns}$. Hence most sequences that collapse to high $\langle p\rangle_{ns}$, and thus have large amounts of secondary structure, as observed by Rao et al. (Ref. 38), should
be multiply degenerate. The fact that a sequence has high
average compactness, small average radius, and a large
amount of secondary structure in its native conformations
does not imply that it is singly degenerate or even that it has a
small degeneracy. It therefore may be easier to design a protein sequence, by natural or other means, that folds to a compact state with much secondary structure than to design a
sequence that also specifies only a single native structure.
Figure 7 shows that a substantial fraction of sequences
have low degeneracies; for n = 14, the peak is at g = 2. The
same features are also present for n = 12 and 13 chains.
These results constitute a considerable generalization of earlier results of Lau and Dill (Ref. 4). Here the full conformational
space for each sequence is considered; Lau and Dill focused
only on folding sequences defined by $\langle p\rangle_{ns} = 1$. Although
the most probable degeneracy (peak of the distribution) is
found here to be 2, and was previously found to be 1, these
simulations also differ in the range of chain length explored.
Given the simplicity of the two-dimensional model, and the
neglect of other interactions and subtle details, we believe
little significance should be attached to the numerical value
of the most probable degeneracy. Rather, we believe the most
important conclusions from these simulations are that: (i)
For short chains at least, there are very few ways a chain
with a typical sequence can configure to have the maximum
possible number of HH contacts. The hydrophobic interaction, which is nonlocal along the chain, is sufficient to predict that a very small ensemble of chain conformations has
lowest free energy. (ii) Although the average degeneracy
over all sequences increases with n because of the exponential increase in the number of accessible conformations, the
small degeneracy of native conformations appears to persist
for a nontrivial fraction of sequences to longer chain lengths.
VI. CORRELATION AMONG NATIVE STATE PROPERTIES
In this section, we show that the various properties of
proteinlike configurations (high $\langle p\rangle_{ns}$ and $\langle S\rangle_{ns}$, small
$\langle R_G\rangle_{ns}$ and g, and many HH contacts) are correlated. That
is, sequences that have any one of these properties tend to
have the others. Figure 9 shows the relationship between $\langle p\rangle_{ns}$,
$\langle R_G\rangle_{ns}$, and g, for all possible n = 14 sequences. Figures
9(a) and 9(b) give $\langle p\rangle_{ns}$ and $\langle R_G\rangle_{ns}$ as functions of g. The
upper two panels show that sequences with very low degeneracies always have large $\langle p\rangle_{ns}$ and small $\langle R_G\rangle_{ns}$, and the
variation of $\langle p\rangle_{ns}$ and $\langle R_G\rangle_{ns}$ is relatively smooth at very
small g as compared with the large-amplitude fluctuations at
larger g. In particular, for all the chain lengths (n = 6-14)
considered, the unique native conformations of all the g = 1
sequences have either the maximum number of contacts,
$t = t_{\max}$, or one contact fewer than the maximum,
$t = t_{\max} - 1$. Hence very low degeneracy can only be
achieved by sequences with $\langle p\rangle_{ns}$ approaching unity. However, the converse is not necessarily true. Though sequences
with compact native conformations on average have relatively low degeneracy, the large-amplitude fluctuations at
larger g in Figs. 9(a) and 9(b) indicate that for some sequences large $\langle p\rangle_{ns}$ and small $\langle R_G\rangle_{ns}$ can be consistent with
fairly large degeneracy.
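FIG. 9. Correlation between size parameters $\langle p\rangle_{ns}$ and $\langle R_G\rangle_{ns}$ of the native
structures with degeneracy g (n = 14). (a) and (b) show the average $\langle p\rangle_{ns}$
and average $\langle R_G\rangle_{ns}$ over sequences with the same degeneracy (only data for
$g \le 500$ are shown) and (c) gives the average degeneracy as a function of
$\langle p\rangle_{ns}$.

For the correlation between other conformational properties, the pairwise correlation coefficients

$$ r = \frac{\langle P_1 P_2\rangle_{ns} - \langle P_1\rangle_{ns}\langle P_2\rangle_{ns}}{\sqrt{\langle P_1^2\rangle_{ns} - \langle P_1\rangle_{ns}^2}\,\sqrt{\langle P_2^2\rangle_{ns} - \langle P_2\rangle_{ns}^2}}, \qquad -1 \le r \le 1, \tag{6.1} $$

between every pair of conformational properties $P_1$ and $P_2$
are tabulated in Table V for all n = 14 sequences. Proteinlike
properties are correlated (Ref. 40).

TABLE V. Correlation coefficients among native properties of n = 14 sequences.

                              g      $\langle p\rangle_{ns}$   $\langle R_G\rangle_{ns}$     E
$\langle p\rangle_{ns}$     -0.57
$\langle R_G\rangle_{ns}$    0.64    -0.95
E                            0.40    -0.85      0.76
$\langle S\rangle_{ns}$     -0.23     0.28     -0.27     -0.11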
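The correlation coefficient of Eq. (6.1) is the standard Pearson coefficient, and a direct transcription into Python may make the definition concrete. In this sketch the averages are taken as plain means over a list of per-sequence values. The function name `correlation` and the sample numbers are made up for illustration and are not data from the paper.

```python
from math import sqrt


def correlation(p1, p2):
    """Pearson coefficient r of Eq. (6.1) for two equal-length lists of
    per-sequence property values."""
    n = len(p1)
    mean1 = sum(p1) / n
    mean2 = sum(p2) / n
    mean12 = sum(a * b for a, b in zip(p1, p2)) / n
    var1 = sum(a * a for a in p1) / n - mean1 ** 2
    var2 = sum(b * b for b in p2) / n - mean2 ** 2
    return (mean12 - mean1 * mean2) / (sqrt(var1) * sqrt(var2))


# Made-up values for four hypothetical sequences (not data from the paper).
compactness = [1.0, 0.9, 0.8, 0.6]
radius = [1.00, 1.10, 1.25, 1.50]
r = correlation(compactness, radius)
```

With these invented values r is close to -1, the same sign as the strong negative correlation between compactness and radius reported in Table V.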
VII. CORRELATION OF NATIVE STATE PROPERTIES WITH SEQUENCE HYDROPHOBICITY
We find that proteinlike properties also tend to be correlated with the hydrophobic composition $\Phi$ of the molecule. Figure 10 shows that molecules have greater compactness (and
smaller radius), a greater number of HH contacts, and greater
secondary structure content, on average, as the H content $\Phi$
increases.
Figure 11 shows an interesting result bearing on a strategy for
the design of sequences that will fold to unique native structures. From Fig. 11(b) it is clear that if we have sequences
that fold to compact states, then a strategy to reduce the
degeneracy is to increase the number of P residues. This
suggests a reason that biological proteins are observed to be
marginally stable. On the one hand, increasing the H content
$\Phi$ helps stabilize the desired native structure. On the other
hand, increasing the P content helps reduce the danger that
the sequence can simultaneously fold to other undesired native conformations. This strategy undoubtedly requires judicious placement of the P residues to reside mostly on the
surface.
The distributions of $\Phi$ in two special sequence subspaces,
$\langle p\rangle_{ns} = 1$ and g = 1, are shown in Figs. 12(b) and 12(d),
respectively, for n = 14. The binomial distribution of $\Phi$ for
the full space of random sequences is also included for comparison (dotted line). In both situations a certain nonzero
minimum number of H residues is required for a sequence
to either have maximally compact native structures or be
singly degenerate. But there are also important differences
between the two cases. The $\langle p\rangle_{ns} = 1$ distribution of $\Phi$ is
shifted towards the right, with a mean value of $\Phi$ at 0.69, and
the largest value of $\Phi$ is 1. On the other hand, the g = 1
distribution peaks at middle range (mean $\Phi$ = 0.52), and
[Figure 10: plots of native conformational properties; horizontal axes: number of H residues $n_\Phi$.]
FIG. 10. Correlation of native conformational properties with composition
$\Phi$ in all possible n = 14 sequences (the number of H residues equals $n_\Phi$).
Circles represent average native conformational properties over sequences
with the same $\Phi$; curves without circles denote the minimum and maximum
of the conformational properties among sequences with the same $\Phi$.
[Figure 11: plots, panels (a) and (b); horizontal axes: number of H residues $n_\Phi$.]
FIG. 11. Correlation of degeneracy with composition. (a) Variation of degeneracy in all n = 14 sequences. (b) Variation of degeneracy in $\langle p\rangle_{ns} = 1$,
n = 14 sequences. As in previous figures, circles represent averages over
sequences with the same $\Phi$, and curves without circles represent minimum
and maximum values.
[Figure 12: plots, panels (a)-(d), grouped under the labels "$\langle p\rangle_{ns} = 1$ Sequences" and "g = 1 Sequences".]
FIG. 12. Variation of contact free energy E with composition $\Phi$ (n = 14).
(a) and (c): The circles represent the average E over sequences with any
given $\Phi$, while the continuous curves without circles denote minimum and
maximum E, for $\langle p\rangle_{ns} = 1$ and g = 1 sequences, respectively. For comparison, the dotted curves in (a) and (c) show the dependence of average E on
$\Phi$ for all possible sequences [from Fig. 10(d)]. (b) and (d) show the distribution of $\Phi$ among n = 14 $\langle p\rangle_{ns} = 1$ and g = 1 sequences, respectively. For
comparison, the dotted curves in (b) and (d) indicate the profile of the
binomial distribution of $\Phi$ among all possible sequences.
there are no g = 1 sequences with $\Phi > 10/14$. This demonstrates again that $\langle p\rangle_{ns} = 1$ and g = 1 are quite different
requirements on the sequences. There are many $\langle p\rangle_{ns} = 1$
sequences with $\Phi \approx 1$ that are not singly degenerate.
The correlation of contact free energy E with $\Phi$ is also
shown for the cases of $\langle p\rangle_{ns} = 1$ and g = 1 in Figs. 12(a)
and 12(c), respectively. The average number of HH contacts $E/\epsilon$ as a function of $\Phi$ for the full space of all possible
sequences is also included. In both cases, E and $\Phi$ are well
correlated. There is less variation within sets of sequences
with a fixed $\Phi$ than in the full sequence space [Fig. 12(d)],
presumably due in part to the greatly reduced number of
sequences in these special cases. One noteworthy feature of
these plots is that both $\langle p\rangle_{ns} = 1$ and g = 1 sequences have
more HH contacts than random sequences with the same $\Phi$
(or equivalently the same number of H residues $n_\Phi$). The g = 1
sequences are even able to achieve a slightly higher average number of HH contacts than the $\langle p\rangle_{ns} = 1$ sequences
with the same $\Phi$. In other words, the arrangement of H residues in the g = 1 and $\langle p\rangle_{ns} = 1$ sequences is such that they
are more efficient in forming HH contacts. This selection is
clearly related to the importance of having a sufficient number
of P residues for low degeneracy.
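The binomial reference distribution of composition used for comparison in this section is easy to tabulate. The following Python sketch assumes that every one of the $2^n$ H/P sequences is equally likely, so that the number of H residues follows a binomial distribution with probability 1/2 per site. The names used are hypothetical.

```python
from math import comb


def composition_distribution(n):
    """P(n_H = k) for k = 0..n when each monomer is H or P with equal odds."""
    return [comb(n, k) / 2 ** n for k in range(n + 1)]


dist = composition_distribution(14)
mean_phi = sum(k * p for k, p in enumerate(dist)) / 14
tail = sum(dist[11:])  # fraction of random sequences with Phi > 10/14
```

Under this assumption the mean composition of random n = 14 sequences is 0.5, and 470 of the 16 384 sequences (about 2.9%) have more than 10 H residues, the range in which no g = 1 sequences are found.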
VIII. CONVERGENCE: HOW MANY SEQUENCES FOLD TO A GIVEN NATIVE STRUCTURE?
We now ask: How many different sequences will encode
a given native structure? The set of all sequences that fold to
one particular native structure is the convergence set of that
conformation (see Fig. 1). We find that, in general, the convergence set is large.
Whereas degeneracy g characterizes the multiplicity of
native structures of a given sequence, convergence C characterizes the multiplicity of sequences for a given native structure (Ref. 41). For any pair of sequence subspace S and conformational subspace C, the average degeneracy $\langle g\rangle_{S\to C}$
of the
mapping of sequence subspace S onto the conformational
subspace C is related to the average convergence $\langle C\rangle_{S\to C}$ of
the conformational subspace C from the sequence subspace
S by

$$ \langle C\rangle_{S\to C} = \frac{\text{number of sequences in } S}{\text{number of conformations in } C}\,\langle g\rangle_{S\to C}. \tag{8.1} $$

This relationship applies for any pair of subspaces C and S. If
S or C is not the full sequence or conformational space, the
averages $\langle g\rangle_{S\to C}$ and $\langle C\rangle_{S\to C}$ are restricted. More specifically, $\langle g\rangle_{S\to C}$ in Eq. (8.1) only takes into consideration
those native conformations of sequences in S that belong to
C, though there may be native conformations of sequences in
S that are outside C. Similarly, $\langle C\rangle_{S\to C}$
only counts sequences belonging to S that have conformations in C as their
native structures. Sequences not belonging to S are not
counted even if their native conformations are in C.
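Equation (8.1) is a double-counting identity: the number of (sequence, native conformation) pairs can be summed either over sequences or over conformations. The Python sketch below checks this numerically for n = 6 over the full sequence and conformational spaces. It is an illustration under stated assumptions, not the authors' code: conformations are counted once up to rotation and reflection, all $2^n$ H/P strings are distinct sequences, and native states maximize the number of HH contacts. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

STEPS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def conformations(n):
    """Yield each n-monomer self-avoiding walk (n >= 2) on the square lattice
    once, up to rotation and reflection."""
    def grow(path, occupied, left_axis):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site in occupied or (not left_axis and dy == -1):
                continue  # occupied site, or mirror image of a +y turn
            path.append(site)
            occupied.add(site)
            yield from grow(path, occupied, left_axis or dy != 0)
            path.pop()
            occupied.remove(site)

    start = [(0, 0), (1, 0)]
    yield from grow(start, set(start), False)


def hh_contacts(seq, conf):
    """Count non-bonded H-H pairs on adjacent lattice sites."""
    index = {site: i for i, site in enumerate(conf)}
    count = 0
    for i, (x, y) in enumerate(conf):
        for dx, dy in ((1, 0), (0, 1)):  # each adjacent pair is seen once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and seq[i] == seq[j] == "H":
                count += 1
    return count


def native_map(n):
    """Return (g, c): g[seq] is the degeneracy of each sequence and c[k] is
    the convergence of conformation number k."""
    confs = list(conformations(n))
    g, c = {}, [0] * len(confs)
    for seq in product("HP", repeat=n):
        scores = [hh_contacts(seq, conf) for conf in confs]
        best = max(scores)
        natives = [k for k, s in enumerate(scores) if s == best]
        g[seq] = len(natives)
        for k in natives:
            c[k] += 1
    return g, c


g, c = native_map(6)
mean_g = Fraction(sum(g.values()), len(g))   # average degeneracy
mean_c = Fraction(sum(c), len(c))            # average convergence
n_zero = sum(1 for v in g.values() if v == len(c))  # sequences with no HH contact
print(mean_c == Fraction(len(g), len(c)) * mean_g)  # True
```

The quantity `n_zero` counts the sequences that cannot form any HH contact. Every conformation is a native state of those sequences, so no conformation can have a convergence smaller than `n_zero`.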
Figure 13 shows the restricted average convergence for
different conformational subspaces with different fixed
numbers of intrachain contacts [Fig. 13(a)] and with different radii of gyration $R_G$ [Fig. 13(b)]. These plots show that
compact conformations are likely to have much higher convergence than open conformations. The reason is simple. A
sequence that has a relatively open native structure is most
likely to have the same number of HH contacts in a more
compact configuration. On the other hand, a sequence with a
compact native structure is much less likely to be able to
make the same number of HH contacts in a more open conformation.
[Figure 13: (a) horizontal axis: number of contacts (t), from 0 to 7; (b) horizontal axis: radius of gyration in units of the minimum radius, from 1.0 to 2.5.]
FIG. 13. Average convergence as a function of compactness or radius of the
molecule (n = 14): (a) vs compactness, (b) vs radius of gyration. The larger and less compact the native state is, the fewer sequences
will encode it
(Ref. 42).
The convergences considered above are from the full
sequence space. Convergence from restricted sequence
subspaces, such as low-degeneracy sequences, is more relevant to real proteins, however. In the convergence from the
full sequence space, there are many sequences that have the
given conformation as one among a huge number of degenerate native structures. By restricting to singly degenerate or
low-degeneracy sequences in the study of convergence, we
focus attention on the relatively smaller number of sequences that fold to a given conformation uniquely, or sequences that have only a few other native structures.
We now ask: What characteristic of a conformation
specifies that it will be a unique native structure encodable in
a singly degenerate sequence? Figure 14 shows the fraction
of n = 14 conformations that are encodable by sequences
with low degeneracies. The plot shows that more compact
conformations (higher t) are much more likely to be encodable by low-degeneracy sequences. The fractions of conformations encodable by low-degeneracy sequences diminish
rapidly as compactness decreases. This feature is consistent
with our previous observation (Sec. VI) that low degeneracy
is always correlated with high compactness of the native
structures. Extrapolation of the data shown in the inset of
Fig. 14 strongly suggests that the high encodability of compact structures ($p \approx 1$) is not an artifact of short chains.
From these lattice simulations we are led to expect that a
substantial fraction of highly compact polymeric structures
are in principle designable in copolymers that will collapse
uniquely to the target structure.
To further investigate sequences that encode highly
compact native structures, we also studied the convergence
to maximally compact p = 1 conformations. Extrapolation
from our exact simulation results suggests that the average
convergence for an n = 100 maximally compact structure
from $\langle p\rangle_{ns} = 1$ and $\langle p\rangle_{ns} \ge 0.8$ sequences is of order $10^{5}$
and $10^{14}$-$10^{15}$, respectively. In addition, we found that the
unrestricted convergence from the entire sequence space to
[Figure 14: horizontal axis: number of contacts (t), from 0 to 7; vertical axis from 0 to 1.0; with an inset.]
FIG. 14. Fraction of conformations that are native structures of low-degeneracy sequences (n = 14). The inset shows the chain length (n) dependence of these fractions for the maximally compact ($t = t_{\max}$, p = 1) conformations.
p = 1 conformations scales exponentially with n at approximately the same rate as that from $\langle p\rangle_{ns} \ge 0.8$ sequences. It is
instructive to compare the present estimation with the simple “critical core” approximate model described by Lau and
Dill (Ref. 9). In that model, the convergence is estimated to be of
order $10^{9}$-$10^{10}$ for n = 100 chains in two dimensions, which
is intermediate between the present estimates of $10^{5}$ for
$\langle p\rangle_{ns} = 1$ sequences and $10^{14}$-$10^{15}$ for $\langle p\rangle_{ns} \ge 0.8$ or unrestricted sequences. The “critical core” model assumes all native conformations are maximally compact, i.e., $\langle p\rangle_{ns} = 1$.
Hence its estimation of convergence must be lower than the
average convergence from unrestricted sequences. On the
other hand, chain connectivity is neglected in the critical
core model. Consequently steric constraints due to chain
connectivity are underestimated. This accounts for the overestimation of convergence by the critical core model as compared to the present exact results.
Note added in proof. We have recently extended these
studies to chain lengths n = 15, 16, and 17. These results
continue to support the results presented here. In particular,
the fraction of sequences that are singly degenerate (g = 1)
remains approximately constant with n. As noted in the text,
this fraction is 2.1%, 2.1%, and 2.4%, respectively, for
n = 12, 13, and 14. The numbers of g = 1 sequences for
n = 15-17 are 857, 1539, and 3404, respectively, constituting
2.6%, 2.3%, and 2.6% of the total number of $2^n$ sequences.
ACKNOWLEDGMENTS
We thank Klaus Fiebig and Dave Yee for very helpful
discussions. We thank the NIH, the URI Program of
DARPA, and the Pew Scholars Program in the Biomedical
Sciences for research support.
1 Provided that a copolymer chain is comprised of monomers that are highly incompatible with each other, such as the nonpolar and polar residues that compose proteins, then any solvent compatible with one type of monomer will be incompatible with the other. For proteins, water at 25 °C is an incompatible solvent for the nonpolar monomers, thus this solvent represents “folding conditions” for proteins.
2 H. S. Chan and K. A. Dill, J. Chem. Phys. 90, 492 (1989).
3 H. S. Chan and K. A. Dill, Macromolecules 22, 4559 (1989).
4 K. F. Lau and K. A. Dill, Macromolecules 22, 3986 (1989).
5 N. Gō, H. Abe, H. Mizuno, and H. Taketomi, in Protein Folding, edited by N. Jaenicke (Elsevier/North Holland, Amsterdam, 1980), pp. 167-181, and references therein.
6 J. Skolnick and A. Kolinski, Annu. Rev. Phys. Chem. 40, 207 (1989), and references therein.
7 D. G. Covell and R. L. Jernigan, Biochemistry 29, 3287 (1990).
8 K. A. Dill, Biochemistry 29, 7133 (1990); H. S. Chan and K. A. Dill, Annu. Rev. Biophys. Biophys. Chem. 20, 447 (1991).
9 K. F. Lau and K. A. Dill, Proc. Natl. Acad. Sci. USA 87, 638 (1990).
10 In some of our previous articles, Refs. 2 and 3, the chain length is defined as the number of bonds, which is equal to the number of residues minus one.
11 H. S. Chan and K. A. Dill, Proc. Natl. Acad. Sci. USA 87, 6388 (1990).
12 In Ref. 3, the chain length N refers to the number of bonds. A chain with N bonds has N + 1 residues. Hence the quantities $\Omega^{(t)}(n)$ and $\Omega_0(n)$ used in the present article are respectively equal to $\Omega^{(t)}(N)$ and $\Omega_0(N)$ in Ref. 3 with N = n - 1.
13 In the asymptotic limit, $n \to \infty$, the scaling of $\Omega_0(n)$ is more accurately described by the relation $\Omega_0(n) \propto (n - 1)^{\gamma}\mu^{n}$, where $\gamma = 1/3$ for all two-dimensional lattices and $\mu = 2.64$ for the square lattice (see Refs. 14 and 15). However, for the small and intermediate values of n employed in the present analysis, the simpler exponential relation stated in the text is a reasonable approximation.
14 M. N. Barber and B. W. Ninham, Random and Restricted Walks, Theory and Applications (Gordon and Breach, New York, 1970), and references therein.
15 K. F. Freed, Renormalization Group Theory of Macromolecules (Wiley, New York, 1987), and references therein.
16 It is interesting to note that the ratio $\Omega^{(t_{\max})}(n)/M^{(t_{\max})}(n)$ is unity except when n equals one plus a magic number (Ref. 3). In those cases the ratio is slightly larger than unity: 1.778 for n = 7, 1.089 for n = 10, and 1.102 for n = 13. As discussed in Ref. 3, the $t = t_{\max}$ (p = 1) conformations of chains with n equal to one plus a magic number have the highest surface-to-volume ratios among the class of p = 1 conformations. Inasmuch as the one-to-one correspondence between contact maps and conformations is a consequence of packing, the n-dependence of $\Omega^{(t_{\max})}(n)/M^{(t_{\max})}(n)$ also exhibits the chain length dependence typical of the magic-number-related oscillations as in Ref. 3.
17 Note that while the number of contacts $t\ (= p\,t_{\max})$ is necessarily an integer, the average number of contacts over the native states of a sequence, $\langle t\rangle_{ns} = t_{\max}\langle p\rangle_{ns}$, may take nonintegral values, because it is the arithmetic mean over g native structures [Eq. (1.1)].
18 Since $R_G = \sqrt{R_G^2}$, the quantity considered here, $\langle R_G\rangle = \langle\sqrt{R_G^2}\rangle$, is different from the quantity $\sqrt{\langle R_G^2\rangle}$ usually considered in scaling relations such as $\sqrt{\langle R_G^2\rangle} \propto n^{\nu}$. Nevertheless, the formula $\langle R_G\rangle = 0.32 \times n^{3/4}$ gives values of $\langle R_G\rangle$ that are within 0.4% of the lattice results listed in Table II. See Refs. 14, 15, and 19 for related theoretical treatments of scaling.
19 P.-G. de Gennes, Scaling Concepts in Polymer Physics (Cornell University, Ithaca, 1979).
20 For instance, the $\langle R_G\rangle/(R_G)_{\min}$ ratios are 1.44, 1.42, 1.46, and 1.53, respectively, for n = 9, 10, 11, and 12. The ratio does not increase monotonically. There are proportionally more n = 10 and n = 11 conformations with small $R_G/(R_G)_{\min}$ than a smooth scaling of $(R_G)_{\min}$ would give. Since we also measure $\langle R_G\rangle_{ns}$ in units of $(R_G)_{\min}$, the higher percentages of small $\langle R_G\rangle_{ns}$ sequences for n = 10 and n = 11 should therefore be readily accountable by this lattice artifact.
21 Note that the $N_0(14) = 280$, n = 14, E = 0 sequences are responsible for the small peak in the distribution of $\langle R_G\rangle_{ns}$ in Fig. 6 at $\langle R_G\rangle_{ns} = \langle R_G\rangle = 1.58\,(R_G)_{\min}$. In all the chain lengths studied, we observe that the average radius of gyration $\langle R_G\rangle$ over all conformations is the upper bound for $\langle R_G\rangle_{ns}$. Though the possibility cannot be ruled out that for some longer sequences $\langle R_G\rangle_{ns}$ may be higher than $\langle R_G\rangle$, it seems unlikely that this situation could arise. This is because a sequence with a native structure of low p or a high $R_G$ is most likely to also have some native structures of high p and low $R_G$ in its degeneracy set, such that $\langle R_G\rangle_{ns} > \langle R_G\rangle$ is highly unlikely.
22 F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535; E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng, in Crystallographic Databases - Information Content, Software Systems, Scientific Applications, edited by F. H. Allen, G. Bergerhoff, and R. Sievers, Data Commission of the Int’l Union of Crystallography, Bonn/Cambridge/Chester (1987), pp. 107-132.
23 The coordinates are obtained from the Protein Data Bank (Ref. 22). The authors of the coordinates of the ten proteins used in Table III are acknowledged in the same order as they appear in the table in the following ten references (Refs. 24-33).
24 W. A. Hendrickson and M. M. Teeter (private communication).
25 E. T. Adman, L. C. Sieker, and L. H. Jensen, J. Biol. Chem. 251, 3801 (1976).
26 M. Marquart, J. Walter, J. Deisenhofer, W. Bode, and R. Huber, Acta Crystallogr. B 39, 480 (1983).
27 S. Vijay-Kumar, C. E. Bugg, and W. J. Cook, J. Mol. Biol. 194, 531 (1987).
28 N. Borkakoti, D. S. Moss, and R. A. Palmer, Acta Crystallogr. B 38, 2210 (1982).
29 P. J. Artymiuk and C. C. F. Blake, J. Mol. Biol. 152, 737 (1981).
30 I. G. Kamphuis, K. H. Kalk, M. B. A. Swarte, and J. Drenth, J. Mol. Biol. 179, 233 (1984).
31 G. N. Reeke, Jr., J. W. Becker, and G. M. Edelman, J. Biol. Chem. 250, 1525 (1975).
32 R. A. Alden, J. J. Birktoft, J. Kraut, J. D. Robertus, and C. S. Wright, Biochem. Biophys. Res. Comm. 48, 337 (1971).
33 M. A. Holmes and B. W. Matthews, J. Mol. Biol. 160, 623 (1982).
34 S. Miyazawa and R. L. Jernigan, Macromolecules 18, 534 (1985). Some of the volume data compiled in this reference are adapted from C. Chothia, Nature 254, 304 (1975) and C. Chothia and J. Janin, Nature 256, 705 (1978).
35 The estimates of minimum radii given in Table III assume that the maximum attainable packing density is approximately 0.74, which is the observed average packing density in the core of most globular proteins (Refs. 34 and 36). We adopt the packing density of 0.74 as the maximum because it is approximately equal to the theoretical packing density for close-packed spheres, and the packing density of 0.74 is also observed in crystals of small organic molecules. If instead one assumes that the maximum attainable packing density for globular proteins is unity, the theoretical minimum radius $(R_G)_{\min}$ will then be decreased by a factor of $(0.74)^{1/3} \approx 0.9$ from that given in Table III.
36 F. M. Richards, J. Mol. Biol. 82, 1 (1974).
37 H. S. Chan and K. A. Dill, J. Chem. Phys. 92, 3118 (1990).
38 S. P. Rao, D. E. Carlstrom, and W. G. Miller, Biochemistry 13, 943 (1974).
39 For a recent estimation of evolutionary probability, see F. G. Mosqueira, Origins of Life & Evolution of the Biosphere Journal 18, 143 (1988).
40 Note that the correlation between S and p among all conformations is near perfect, r = 0.99 for n = 14 (data not shown). In comparison, though $\langle S\rangle_{ns}$ and $\langle p\rangle_{ns}$ are positively correlated among all sequences, the correlation is not as perfect; see Table V.
41 J. U. Bowie, J. F. Reidhaar-Olson, W. A. Lim, and R. T. Sauer, Science 247, 1306 (1990).
42 Note that the minimum convergence from the full sequence space for any n-monomer conformation is $N_0(n)$, since every conformation is a native state of the $N_0(n)$ E = 0 sequences [see Eq. (3.1)].