Universal positions in globular proteins

Eur. J. Biochem. 271, 4762–4768 (2004) FEBS 2004
doi:10.1111/j.1432-1033.2004.04440.x
From observation to simulation
Nikolaos Papandreou1, Igor N. Berezovsky2,3, Anne Lopes4, Elias Eliopoulos1 and Jacques Chomilier4
1Laboratory of Genetics, Agricultural University of Athens, Greece; 2Department of Structural Biology, The Weizmann Institute of
Science, Rehovot, Israel; 3Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA;
4Equipe Biologie Structurale, LMCP, Universités Paris 6 and Paris 7, Paris, France
The description of globular protein structures as an ensemble of contiguous 'closed loops' or 'tightened end fragments'
reveals fold elements crucial for the formation of stable
structures and for navigating the very process of protein
folding. These are the ends of the loops, which are spatially
close to each other but are situated apart in the polypeptide
chain by 25–30 residues. They also correlate with the locations of highly conserved hydrophobic residues (referred
to as topohydrophobic), in a structural alignment of the
members of a protein family. This study analysed these
positions in 111 representatives of different protein folds,
and then carried out dynamic Monte Carlo simulations of
the first steps of the folding process, aimed at predicting the
origins of the assembling folds. The simulations demonstrated that there is an obvious trend for certain sets of
residues, named 'mostly interacting residues', to be buried at
the early stages of the folding process. Location of these
residues at the loop ends and correlation with topohydrophobic positions are demonstrated, thereby giving a route to
simulations of the protein folding process.
Despite the continuously increasing number of experimentally determined protein structures, many new folds are still
to be discovered. This was illustrated clearly in a recent
study [1], where a plot of the number of protein families vs.
the number of resolved complete genomes resulted in a
quasi-linearly increasing function. Elucidating the evolutionary mechanisms leading to the emergence of a finite
number of protein folds [2,3] from the vast number of
protein sequences [4,5], as well as the mechanisms of the
formation of mature protein globules [6], remains a topic
both of great challenge and interest. The latter mechanisms
are related to the physical basis of protein structure
formation and stability [7], and thus can point to possible
evolutionary routes [8].
This study is based on universal structural units of protein
folds, named 'closed loops' [9] or 'tightened-end fragments'
(TEFs) [10]. These major elements are universally present in
all types of protein folds and have the following features in
common: (a) they usually start and end in the hydrophobic
core [11]; (b) they form loop-like structures of nearly
standard size (25–30 amino acid residues); (c) they serve as
universal units of protein domain structure [12]; (d) the
ends of these elements (or so-called locks [13]), mainly
correspond to clusters of hydrophobic amino acids in
general (WIMVYLF), and highly conserved ones, the
topohydrophobic (TH) positions [14,15], in particular.
Determination of the TH positions is based on the analysis
of multiple structural alignments of members of a protein
family, limited to a pair sequence identity with a maximum
of 30%. TH positions are of particular importance for the
formation and stability of the protein core [16]. From a
dynamic point of view, the early formation of a nucleus
composed of TH positions would favor the formation of
closed loops and considerably speed up the folding process
[17]. The coupled concepts of TH and closed loops/TEFs
therefore offer a simple and general scenario for the folding
mechanism of globular proteins [11,15] and provide a set of
critical positions in the protein core [10,11,13]. The loop
structure of globular proteins is a general concept, independent from secondary structure, as well as from the
particular folding mechanism of each protein [9,10,13].
This study addresses the question of predicting these
critical positions from the sequence, a task of major
importance for approaching the structure of a protein of
unknown fold. To successfully build such a structure,
numerous pieces of information have to be collected by
combining various methods. An initial calculation of critical
positions could be a first step, providing a frame of
structural restraints, as TEF limits and TH residues are
located mainly inside the protein core.
The notion of topohydrophobic positions suggests that
the forces that bury these residues and lead to a stable core do
not rely on the details of the amino acid side chain structure,
but rather on an adequate succession of hydrophobic and
polar amino acid residues along the polypeptide chain. Thus
simplified protein models, such as lattice ones, are adequate
tools for calculations aimed at locating critical residues.
Correspondence to N. Papandreou, Laboratory of Genetics, Agricultural University of Athens, Iera Odos 75, 11855 Athens, Greece.
Fax: +30 2105294322, Tel.: +30 2105294372,
E-mail: [email protected]
Abbreviations: MIR, mostly interacting residues; PDB, protein data
bank; SCOP, structural classification of proteins; TEF, tightened end
fragment; TH, topohydrophobic.
(Received 29 June 2004, revised 22 September 2004,
accepted 15 October 2004)
Keywords: folding nucleus; hydrophobic core; lattice simulation; protein folding.
This study was carried out on a dataset of 111 globular
proteins with well-defined structures in the Protein Data
Bank (PDB), that were representative of various folds, and
for which the TEFs were available. For a subset of 73
proteins of the above database, the TH positions have also
been determined.
The initial stages of folding were simulated using a
simplified model, which consists of an alpha-carbon reduced
representation of the polypeptide chain on a 24-first
neighbour lattice. A standard Monte Carlo algorithm
dynamically simulated the folding process and a statistical
mean force potential was used to describe the interactions
between noncontiguous residues. A commonly accepted
lattice model [18] was used, and the analysis focused on the first
stages of the folding process by measuring the tendency of
amino acids to be packed inside the hydrophobic core,
depending on the particulars of the polypeptide chain
sequence.
Starting from random conformations, the Monte Carlo
simulations revealed that a subset of hydrophobic residues
had a strong tendency to be buried. These residues, named
'mostly interacting residues' (MIR), were found to statistically match TEF limits and TH positions.
These results are in agreement with the hydrophobic
collapse mechanism, which can be further generalized to
the nucleation–condensation mechanism, a hybrid of the hierarchical and hydrophobic collapse mechanisms [23,24].
Materials and methods
The protein database consisted of 111 globular protein
chains, representing 78 different folds, according to the
structural classification of proteins (SCOP classification)
[22]. In detail, there are 26 α class proteins, 23 β class, 26
α + β class, 18 α/β class and 18 of the small proteins class,
providing a balanced representation of the major known
folds. The polypeptide chain lengths vary between 50 and
250 residues.
Simulations have been carried out using a Cα representation of the polypeptide chains and the lattice geometry
(Fig. 1) is as in [18].
Fig. 1. The lattice model. The solid line represents the backbone from
Cα to Cα positions, while the dotted line is the underlying cubic lattice.
On an underlying cubic lattice (Fig. 1, dotted lines) with
edges of unit length, contiguous alpha carbons are connected by vectors of the form (±2, ±1, 0) (Fig. 1, solid lines).
The length of such a vector is √5 lattice units and is
equivalent to 3.8 Å, the typical distance between contiguous
alpha carbons in proteins. In this geometry, for residue i,
there are 24 possible positions for residue i + 1 to occupy.
This kind of polypeptide chain projection allows for a more
realistic representation of the polypeptide chain [18].
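The geometry described above can be checked by direct enumeration. The following Python sketch is illustrative only; the function name `lattice_vectors` and the constants are assumptions introduced here and are not taken from [18]. It generates every vector obtained from (2, 1, 0) by permuting the coordinates and changing signs, which should give the 24 possible positions for residue i + 1, each at a distance of √5 lattice units.

```python
import itertools
import math

def lattice_vectors():
    """Return all distinct vectors that are signed permutations of (2, 1, 0)."""
    vectors = set()
    for perm in itertools.permutations((2, 1, 0)):
        for signs in itertools.product((1, -1), repeat=3):
            # The sign applied to the zero component has no effect,
            # so the set removes the resulting duplicates.
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)

VECTORS = lattice_vectors()

# One bond (sqrt(5) lattice units) is mapped onto the 3.8 angstrom
# distance between contiguous alpha carbons.
BOND_LENGTH_ANGSTROM = 3.8
LATTICE_UNIT_ANGSTROM = BOND_LENGTH_ANGSTROM / math.sqrt(5)

print(len(VECTORS))                      # number of possible bond vectors
print(round(LATTICE_UNIT_ANGSTROM, 3))   # length of one lattice unit in angstroms
```

Under this mapping one lattice unit corresponds to approximately 1.70 Å, which is the conversion factor implied by the distances quoted in the text.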
Two spatial constraints are implemented. First, the
distance between noncontiguous alpha carbons cannot be
less than 3.8 Å (√5 lattice units), and second (contrary to the
cubic lattice, where only angles of 90° and 180° are possible),
the limit angles here are 66° and 143° (seven possible values),
approximating the range of pseudo-angles in natural
proteins [19].
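The seven angle values can be recovered from the lattice vectors themselves. The sketch below is a reconstruction under stated assumptions, not the authors' code: it assumes that the lower limit follows from the minimum distance of √5 lattice units between residues i and i + 2, and that the fully extended 180° arrangement is excluded, as the quoted upper limit of 143° implies. The name `allowed_bond_angles` is hypothetical.

```python
import itertools
import math

# All 24 bond vectors: signed permutations of (2, 1, 0).
VECTORS = sorted({
    tuple(s * c for s, c in zip(signs, perm))
    for perm in itertools.permutations((2, 1, 0))
    for signs in itertools.product((1, -1), repeat=3)
})

def allowed_bond_angles():
    """Map each allowed dot product of two consecutive bonds to the
    alpha-carbon pseudo-bond angle (in degrees) at the middle residue."""
    angles = {}
    v1 = VECTORS[0]  # by lattice symmetry any first bond gives the same set
    for v2 in VECTORS:
        dot = sum(a * b for a, b in zip(v1, v2))
        # Squared distance between residues i and i+2 is |v1 + v2|^2 = 10 + 2*dot.
        if 10 + 2 * dot < 5:
            continue  # closer than sqrt(5) lattice units: excluded
        if dot == 5:
            continue  # assumed exclusion of the fully extended 180-degree angle
        # The angle at the middle residue is the angle between -v1 and v2.
        angles[dot] = math.degrees(math.acos(-dot / 5))
    return angles

ANGLES = allowed_bond_angles()
for dot, angle in sorted(ANGLES.items()):
    print(dot, round(angle, 1))
```

With these two exclusions the dot product of consecutive bonds takes the integer values from -2 to 4, giving seven angles whose extremes round to 66° and 143°.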
The different nature of amino acids is taken into account
in the force field used to attribute an energy value to each
chain conformation. The distance-independent 20 × 20
residue pair energy matrix of Miyazawa and Jernigan was
used [20]. In detail, if two noncontiguous residues i and j are
found within a distance smaller than or equal to 5.88 Å, a term Eij
is added to the total energy, depending on their nature. The
maximum interaction range of 5.88 Å corresponds to √12
lattice units and seems a reasonable estimate for the mean
noncovalent interaction range between amino acid residues.
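The contact energy described above can be sketched as a sum over noncontiguous residue pairs within the cut-off. In the Python example below, the two-letter alphabet and the values in `TOY_ENERGY` are placeholders chosen for illustration; they are not the Miyazawa and Jernigan energies [20], and the function names are assumptions. Distances are compared as squared lattice distances, so the 5.88 Å cut-off becomes a squared distance of 12.

```python
CUTOFF_SQ = 12  # squared interaction range in lattice units (sqrt(12) units, about 5.88 angstroms)

# Toy placeholder energies for a two-letter alphabet (H = hydrophobic, P = polar).
# These are NOT the Miyazawa-Jernigan values used in the study.
TOY_ENERGY = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}

def pair_energy(a, b):
    """Look up the symmetric pair energy E_ij for residue types a and b."""
    return TOY_ENERGY[tuple(sorted((a, b)))]

def total_energy(coords, residues):
    """Sum E_ij over all noncontiguous pairs (|i - j| >= 2) within the cut-off."""
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 2, n):
            d2 = sum((p - q) ** 2 for p, q in zip(coords[i], coords[j]))
            if d2 <= CUTOFF_SQ:
                energy += pair_energy(residues[i], residues[j])
    return energy

# A four-residue chain fragment on the lattice; every bond has squared length 5.
coords = [(0, 0, 0), (2, 1, 0), (1, 3, 0), (-1, 2, 0)]
print(total_energy(coords, "HPHH"))
```

In this fragment all three noncontiguous pairs are within range, and two of them are H–H pairs, so only those two contribute to the toy energy.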
For each protein, 100 different initial conformations were
randomly generated and used as starting points for 100
simulation runs, to avoid dependence on the initial state.
The only constraint placed on initial states is their
noncompactness, in the sense that amino acid residues
placed far away in the sequence were not allowed to be close
in space, to avoid clustering due to particular initial state
conformation. Quantitatively, this constraint introduces a
minimum spatial distance, dmin, according to the separation Δ = |i–j| between residues i and j: (1) Δ =
6–10, dmin = 7 Å; (2) Δ = 11–15, dmin = 11 Å; (3)
Δ = 16–20, dmin = 19 Å; (4) Δ more than 20,
dmin = 27 Å.
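The noncompactness constraint amounts to a table lookup followed by a pairwise distance test. The Python sketch below is a minimal illustration; the function names are hypothetical, and treating separations below 6 as unconstrained is an assumption, since the text specifies dmin only from a separation of 6 upwards. Lattice coordinates are converted to Å using the bond length of 3.8 Å for √5 lattice units.

```python
import math

LATTICE_UNIT_ANGSTROM = 3.8 / math.sqrt(5)

def min_distance_angstrom(separation):
    """Return d_min (angstroms) for a sequence separation |i - j|,
    or None when no constraint is applied (assumed for separations below 6)."""
    if separation < 6:
        return None
    if separation <= 10:
        return 7.0
    if separation <= 15:
        return 11.0
    if separation <= 20:
        return 19.0
    return 27.0

def is_noncompact(coords):
    """Check that every residue pair with separation >= 6 respects d_min."""
    n = len(coords)
    for i in range(n):
        for j in range(i + 6, n):
            d_min = min_distance_angstrom(j - i)
            dist = math.dist(coords[i], coords[j]) * LATTICE_UNIT_ANGSTROM
            if dist < d_min:
                return False
    return True

# A fully extended zigzag chain alternating bonds (2, 1, 0) and (2, -1, 0).
extended = [(2 * k, k % 2, 0) for k in range(30)]
print(is_noncompact(extended))
```

A random initial conformation would be accepted as a starting point only if this test passes; otherwise a new conformation would be generated.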
The single residue movements [18] are of two kinds: end
flip movements for the N- and C-terminal residues and corner
movements for the others. The choice of the move set is
more or less arbitrary, as the elementary one-residue moves
are sufficient to bring the protein to a folded state. In this
case, the restriction to elementary moves only, apart from
its simplicity, permits a sequential analysis of the chain
tendency to form compact fragments around particular
amino acid residues from the beginning of the simulation.
After each move, the calculated conformational energy
was subjected to a standard Metropolis criterion, at
constant temperature.
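For reference, the standard Metropolis criterion can be written in a few lines. The sketch below is generic and is not taken from the authors' implementation; the function name and the convention that the temperature is expressed in the same units as the energy (so that the Boltzmann constant is absorbed into it) are assumptions.

```python
import math
import random

def metropolis_accept(delta_energy, temperature, rng=random):
    """Accept or reject a trial move with energy change delta_energy.

    Moves that lower or keep the energy are always accepted; moves that
    raise it are accepted with probability exp(-delta_energy / temperature).
    """
    if delta_energy <= 0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)

# Estimate the acceptance frequency of an uphill move with a seeded generator.
rng = random.Random(0)
trials = 20000
accepted = sum(metropolis_accept(1.0, 1.0, rng) for _ in range(trials))
print(accepted / trials)  # should be close to exp(-1)
```

In a simulation of the kind described, `delta_energy` would be the difference between the conformational energies after and before an end flip or corner move.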
Because the goal was to analyse the propensity of residues
to be buried from the start of folding, we ensured that the
maximum number of Monte Carlo steps was sufficient to
allow formation of compact chain fragments. Due to the
serial nature of the algorithm, this time limit is correlated to
protein chain length L. It was empirically determined that
for small proteins of about 50 residues, the value tmax is
around 10^6 Monte Carlo steps. Thus, the following linear
relation was adopted to generalize tmax to proteins of any
length L: tmax = INT(10^6 L/50), where INT is the integer part,
because tmax is an integer by definition (Monte Carlo steps).
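As a worked example of this relation, the short Python function below evaluates tmax for a few chain lengths. The function name `t_max` is an assumption; integer division is used in place of INT, which is equivalent for positive lengths.

```python
def t_max(chain_length, steps_for_50_residues=10**6):
    """Maximum number of Monte Carlo steps, scaled linearly with chain length."""
    return (steps_for_50_residues * chain_length) // 50

for length in (50, 111, 250):
    print(length, t_max(length))
```

For the chain lengths in the dataset (50 to 250 residues), this gives between 10^6 and 5 × 10^6 Monte Carlo steps per run.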
For each simulation, 10^4 records of intermediate conformations were taken at regular time intervals. As the number
of simulations per protein is 100 (one for each initial state),
the end result is a set of 10^6 records per protein.
For every recorded conformation, and for each amino
acid residue the number of residues with which it is in
noncovalent interaction was calculated. In spatial terms,
these noncovalent neighbours are the amino acid residues
lying within a distance of 5.88 Å or √12 lattice units. For a
given protein and for residue i, at the r-th record, the
number of noncovalent neighbours is nc(i,r). The time mean
of this quantity is

$$NC(i) = \frac{1}{10^{6}} \sum_{r=1}^{10^{6}} nc(i,r)$$

NC(i) values are rounded to the nearest integer. This mean
number of noncovalent neighbours is a quantitative measure of the tendency of a residue to be buried from solvent.
The higher the NC(i), the stronger this tendency.
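The calculation of NC(i) can be sketched as follows. This Python example is a minimal illustration with hypothetical function names; it assumes that the covalently bonded neighbours i − 1 and i + 1 are not counted, and it rounds half values upwards, a detail the text does not specify. Two records are used here instead of 10^6.

```python
CUTOFF_SQ = 12  # squared cut-off in lattice units (sqrt(12) units, about 5.88 angstroms)

def neighbour_counts(coords):
    """Return nc(i, r) for one recorded conformation: the number of
    noncontiguous residues within the cut-off of each residue i."""
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 2, n):
            d2 = sum((p - q) ** 2 for p, q in zip(coords[i], coords[j]))
            if d2 <= CUTOFF_SQ:
                counts[i] += 1
                counts[j] += 1
    return counts

def mean_neighbours(records):
    """Return NC(i): the mean of nc(i, r) over all records, rounded to the
    nearest integer (halves rounded up, an assumption)."""
    n = len(records[0])
    totals = [0] * n
    for coords in records:
        for i, count in enumerate(neighbour_counts(coords)):
            totals[i] += count
    return [int(total / len(records) + 0.5) for total in totals]

compact = [(0, 0, 0), (2, 1, 0), (1, 3, 0), (-1, 2, 0)]
extended = [(0, 0, 0), (2, 1, 0), (4, 0, 0), (6, 1, 0)]
print(neighbour_counts(compact))
print(mean_neighbours([compact, extended]))
```

In the study the list of records would contain the conformations saved at regular intervals from all 100 runs of a given protein.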
If $\overline{NC}$ is the mean value of NC(i) over the sequence for a
given protein, the residues for which NC(i) is significantly
higher than $\overline{NC}$ are of particular interest and are called
mostly interacting residues (MIRs). Their selection requires
fixing a cut-off value above the mean value $\overline{NC}$. It was
found that NC(i) varies between 1 and 8 and that $\overline{NC}$ = 4
for all studied proteins. Figure 2 presents the distribution of
the different values of NC(i) over the amino acid residues of
all 111 proteins. The most probable value is four, which
coincides with the mean sequence value, which is also four
for all proteins as stated above. From this distribution, it
appears that 13% of residues have a number of noncovalent
neighbours equal to or higher than six, which was adopted
as the lowest NC(i) value for considering residue i as a MIR.
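Once the NC(i) profile is available, the selection of MIRs reduces to a threshold test. The Python sketch below uses the cut-off of six stated above; the function name, the 1-based residue numbering and the example profile are assumptions made for illustration.

```python
MIR_THRESHOLD = 6  # lowest NC(i) value for which residue i is considered a MIR

def select_mirs(nc_values, threshold=MIR_THRESHOLD):
    """Return the 1-based sequence positions whose NC(i) reaches the threshold."""
    return [i for i, nc in enumerate(nc_values, start=1) if nc >= threshold]

# Hypothetical NC(i) profile for a six-residue fragment.
print(select_mirs([4, 6, 3, 8, 5, 7]))  # prints [2, 4, 6]
```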
In order to validate this model, once the positions of MIRs
were determined they were compared to TEF limits and to
topohydrophobic positions. The comparison with TEFs was
performed on the complete database of 111 proteins. The
comparison with TH positions was performed on a 73-protein subset of this database, where these positions were
determined. For the remaining 38 proteins, the calculation of
TH positions was not possible, because it requires at least
four 3D structures of members of the same family,
with a pair identity not exceeding 30% [14,15]. This
criterion was not fulfilled for these 38 cases.
The PDB codes [21] of the database are given in Table 1.
Results
The Monte Carlo algorithm for folding simulation has been
applied to the entire protein dataset and the histograms
NC(i), containing the distribution of noncovalent neighbours along the amino acid sequence, have been obtained
for each protein.
In Fig. 3 the positions of TEFs, TH and MIR for 10
proteins of the database representative of the various classes
as determined by SCOP [22] are illustrated.
Among the 1920 calculated MIRs, 92% were hydrophobic, following the definition of topohydrophobic residues
(i.e. they belonged to the set 'VIMWYLF'). Also, the total
numbers of MIRs and TH positions, in the 73-protein
subset where they are compared, are relatively close (1299
MIRs vs. 1011 TH). In the same subset, the total number of
TEFs was 309; thus the number of TEF limits was 618,
about half the number of MIRs.
To assess the overall quality of agreement between
predicted critical positions (MIR) and structure-defined ones
(TH and TEF limits), a statistical analysis is required. This
has been carried out over the whole database, i.e. over all 111
proteins for the comparison between MIR and TEF limits
and for the subset of 73 proteins for the comparison between
MIR and TH. The results are presented in two histograms in
Figs 4 and 5. The histogram of Fig. 4 gives the comparison
between MIR and TH positions and is constructed as
follows. Each TH position is placed at the origin of the
abscissa. Then, the neighbouring MIRs that are closer to this
central TH than to any other TH are located. Their number is
plotted as a function of their sequence distance with respect
to the central TH. This is reproduced for all THs along all the
73 proteins of the data set. Thus Fig. 4 shows a histogram of
the separation between TH and the closest MIR. The plotted
distances range from −20 to +20, and MIRs lying at
distances greater than ± 20 residues from the closest TH are
added to the histogram at the ± 20 positions. The second
histogram (Fig. 5) follows the same rules and concerns the
comparison of MIR to TEF limits. It is constructed using the
whole database of 111 proteins. From observation of Figs 4
and 5 it is evident that comparison of MIR with TH and TEF
limits clearly presents a peak at the origin. This is an
indication that the residues predicted to be MIRs actually do
correspond to TH positions. They also statistically correlate
with TEF limits, which are mostly hydrophobic [13], consistent with the
earlier finding that most TH positions are located in, or in the
vicinity of, TEF ends [10]. The agreement between MIR and
TH is very clear and 63% of MIR were found within ± 5
positions from a TH residue. The TEF histogram presents
two main secondary maxima at positions ±3, and 57% of
MIR were found within ±5 positions from a TEF limit. This
good agreement between prediction and analysis [13] is of
great interest in the prediction of elements of the protein core
from the sequence.
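The construction of the histograms in Figs 4 and 5 can be sketched as follows. This Python example is an interpretation of the procedure described above, not the authors' code: each MIR is assigned to its closest reference position (TH or TEF limit), the signed sequence offset is taken as the MIR position minus the reference position, offsets beyond ±20 are added to the ±20 bins, and ties between two equally distant references are resolved in favour of the first one listed. The function names and the example positions are hypothetical.

```python
from collections import Counter

def offset_histogram(mir_positions, ref_positions, max_offset=20):
    """Histogram of signed sequence offsets between each MIR and its
    closest reference position, with offsets clamped to +/- max_offset."""
    hist = Counter()
    for m in mir_positions:
        nearest = min(ref_positions, key=lambda r: abs(m - r))
        offset = max(-max_offset, min(max_offset, m - nearest))
        hist[offset] += 1
    return hist

def fraction_within(hist, window=5):
    """Fraction of MIRs lying within +/- window positions of a reference."""
    total = sum(hist.values())
    return sum(c for off, c in hist.items() if abs(off) <= window) / total

# Hypothetical MIR and reference positions for a single protein.
hist = offset_histogram([10, 12, 40, 95], [10, 50])
print(sorted(hist.items()))
print(fraction_within(hist))
```

Summing such histograms over all proteins of the dataset, and evaluating the fraction within ±5 positions, would correspond to the percentages quoted in the text.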
Discussion
Fig. 2. Distribution of the mean number of noncovalent neighbours over
all 111 sequences of the dataset.
The existence of critical positions in protein structures,
punctuated by TH positions and/or TEF limits, is of great
importance for protein folding and stability. Consecutive
formation of the globule core [10,11,17] composed essentially of these residues [13] leads to tremendous optimi-
Table 1. A list of the PDB codes, names and SCOP classes of the proteins studied. The TEFs are known for all these proteins. Proteins with known
TH positions are in bold. The uppercase letters at the end of the code correspond to the chain.
PDB
code
Name
SCOP
PDB
code
Name
SCOP
PDB
code
Name
SCOP
1aep
Apolipophorin-III
a
2sns
b
1gmpA
RNase Sa
a+b
b
b
1aba
1opr
b
1ble
a
2pelA
Legume lectin
b
3cla
Cytochrome c oxidase
Phospholipase A2
Lysin
Retinoid-X receptor a
Cytochrome c3
Interleukin-10
a
a
a
a
a
a
1knb
2stv
1pmy
1qabA
2plv3
1cbs
b
b
b
b
b
b
5nll
3chy
1 cls
1dhr
5p21
1asu
1rro
Oncomodulin
a
1ivpA
Adenovirus fibre
STNV coat protein
Pseudoazurin
Transthyretin
Picornavirus
Cellular retinoic acid-binding protein
2 (HIV-2) protease
b
1lbbA
2sas
Calcium-binding
protein
Parvalbumin
Myoglobin
Glycera globin
Lamprey globin
Hemoglobin (horse)
Hemoglobin (human)
a
1ptf
a+b
1tml
a
a
a
a
a
a
1ubi
1frd
153L
1lsg
1acf
1ctf
a+b
a+b
a+b
a+b
a+b
a+b
1tpfB
1brsA
1akz
1rvvA
1ns5A
1jkeB
Triosephosphate isomerase
Endonuclease
Uracil-DNA glycosylase
Lumazine synthase
Hypothetical protein YbeA
D-Tyr tRNAtyr deacylase
a/b
a/b
a/b
a/b
a/b
a/b
a
a
1aihA
1apyA
Histidine-containing
phosphocarrier
Ubiquitin
2Fe-2S ferredoxin
Lysozyme, Goose
Lysozyme, Chicken
Profilin
Ribosomal protein
L7/12
Integrase
Glycosylasparaginase
Glutaredoxin
Orotate
phosphoribosyltransferase
Fructose permease,
subunit IIb
Chloramphenicol
acetyltransferase
Flavodoxin
Signal transduction protein
Cutinase
Dihydropteridine reductase
cH-p21 Ras
Retroviral integrase,
catalytic domain
Glutamate receptor
ligand binding core
Cellulase E2
a/b
a/b
1lcl
Staphylococcal
nuclease
Xylanase II
Bacillus 1–3,
1–4-b-glucanase
Serine esterase
1utg
2mhr
Uteroglobin
Myohemerythrin
a
a
1yna
2ayh
256bA
Cytochrome b562
a
1aa0
Fibritin
1occD
1poc
1lis
1lbd
2cy3
2ilk
a+b
a+b
1iodG
1dtdB
Coagulation factor X
Carboxypeptidase inhibitor
small
small
small
small
small
small
4cpv
1bvd
1hbg
2lhb
2mhbA
1dkeA
1eca
1lki
a/b
a/b
a/b
a/b
a/b
a/b
a/b
a/b
a/b
a/b
a
1ast
Astacin
a+b
1icfI
3c2c
1bp2
1enh
Erythrocruorin
Leukemia inhibitory
factor
Mitochondrial
cytochrome c
Cytochrome c2
Phospholipase A2
DNA-binding protein
a
a
a
1dtp
1nox
2pii
a+b
a+b
a+b
2bbkL
1sgpI
1ajj
2erl
Pheromone
a
1durA
a+b
1i8nA
Anti-platelet protein
small
1pht
b
1fxd
a+b
1ejgA
Crambin
small
b
b
b
1c0bA
1shaA
1ag2
a+b
a+b
a+b
1ehs
1tgj
4rxn
b
1abrA
Abrin A-chain
a+b
1caa
1cdcA
2 lm
CD2, first domain
Macromycin
b
b
1plfB
1mgsA
a+b
a+b
1fas
1pk4
1anu
Cohesin-2 domain
b
1hucB
Platelet factor 4
Chemokine (growth
factor)
(Pro)cathepsin B
Heat-stable enterotoxin B
TGF-β3
Rubredoxin,
Clostridium pasteurianum
Rubredoxin, Archaeon
Pyrococcus furiosus
Fasciculin
Plasminogen
small
small
small
1reiA
Phosphatidylinositol
3-kinase
α-Spectrin, SH3 domain
Signal transduction protein
Seed storage protein
7S vicilin
Immunoglobulin
Diphtheria toxin
NADH oxidase
Signal transduction
protein
Ferredoxin II,
Peptostreptococcus
Ferredoxin II,
Desulfovibrio gigas
Ribonuclease A
c-src Tyrosine kinase
Prion protein domain
MHC class II p41
invariant chain fragment
Methylamine dehydrogenase
Ovomucoid III domain
ldl Receptor
a+b
1hpi
small
1f3g
Glucose-specific factor III
b
2act
Actinidin
a+b
1hip
1sno
Staphylococcal nuclease
b
2ci2
a+b
1knt
small
1gpc
DNA-binding protein
b
1fkb
Chymotrypsin
inhibitor CI-2
FK-506 binding
HIPIP, Ectothiorhodospira
vacuolata
HIPIP, Allochromatium
vinosum
Collagen type VI
a+b
1edmB
Factor IX
small
3cytO
1pwt
1semA
1cauB
small
small
small
small
Fig. 3. Examples of comparison of MIR, TH and TEF for 10 sequences of various folds. In each example, the PDB code (with the chain) is given,
followed by the name, the SCOP class and the fold of the protein in parentheses. The following lines represent the sequence and the TEFs. The
residues belonging to a TEF are indicated 'I'. In case of TEF overlap, two lines are used for this representation (for example in protein 1shaA). The
next line shows TH positions, where the corresponding residues are indicated 'T'. The final line shows MIR residues, indicated by 'M'. For 3chy and
5p21, due to the sequence length, the results appear in two consecutive blocks.
zation of the folding process, by reducing the conformational space to be explored. Thus, the prediction of these
'hot' residues becomes an important step in approaching
the native three-dimensional structure. A first approach to
this goal was undertaken in this study. The guiding
hypothesis was that, in order to achieve fast folding,
Fig. 4. Histogram of the correspondence between TH positions and
MIR from a set of 73 proteins.
Fig. 5. Histogram of the correspondence between TEF ends and MIR
from a set of 111 proteins.
critical residues should have a tendency to contact each
other and thus form the origins of the hydrophobic core.
The results confirmed this hypothesis. Using a simple
alpha-carbon lattice model, formation of the nucleation
sites at initial steps of the folding process was demonstrated.
These results suggest that folding initiation can be based
on the early formation of a set of nucleation sites around
selected hydrophobic residues [10,11,13]. This is essentially
the basis of the hydrophobic collapse mechanism [23], which
supposes formation of hydrophobic tertiary interactions
that initiate secondary structure. It can be extended to a
unified nucleation–condensation mechanism, which is a
combination of hierarchical and hydrophobic collapse
mechanisms [23,24]. In the latter case, hydrophobic tertiary
interactions are consolidated at the same time as elements of
secondary structure (with possible variations of the kinetics
of the mechanism caused by the different intrinsic stabilities
of the secondary structural elements). These models have
been developed from experiments and simulations of
folding and unfolding of several small proteins [23,24] and
particularly from the analysis of the residual structure of
denatured states, which are thought to correlate to the
nucleation sites. The comparison of MIR predictions with
this type of data is being considered for future studies.
The secondary peaks in the histogram representing the
correlation between MIR and TEF (Fig. 5) come from
the proteins belonging mainly to the α class. For these
folds, the TEF limits are often located inside α helices and
are mainly hydrophobic. Sometimes, the predicted MIR
are not exactly these limits but are the nearest hydrophobic residues, which in a helix are located three positions
away because of the α-helix periodicity. This observation
is in full agreement with the definition of the van der
Waals locks, as extended (three to five residues long)
segments of polypeptide chains interacting with each
other, and thus forming 'loop-n-lock' structures in globular proteins [13].
The main conclusion of this study is that the burial of MIR
positions can create anchors for the sequential
formation of closed loops. These results corroborate experimental evidence on the initial stages of the
folding process. NMR analysis of folding intermediates of
bovine pancreatic trypsin inhibitor [25] revealed
loop formation in early, non-native states, stabilized by
nonlocal interactions. Also, an NMR study on the folding of
lysozyme [26] showed the early formation of hydrophobic
clusters, which are linked together by long-range interactions. These interactions were shown not to occur in the
native structure, but they are apparently important for
keeping the loop structure and thereby speeding up the
folding procedure. The appearance of these essential features
in this folding simulation permits an initial estimation of
the anchor regions for loop formation. This approach
therefore provides a set of structural constraints from first
principles for an unknown structure. This information
could be incorporated at the early steps of a prediction
method for building protein structures from the sequence by
producing anchor residues known to belong to the structural
core. In a second stage they can be introduced as a set of
constraint distances in a more detailed modeling process.
Acknowledgements
This project has been funded by a Concerted Action from the European
Union, QLG2-CT-2002–01298, and by the Greek-French bilateral
PLATO program (grant no 04146WM). I. N. B. was also supported
by the Post-Doctoral Fellowship of the Feinberg Graduate School,
Weizmann Institute of Science.
References
1. Kunin, V., Cases, I., Enright, A.J., de Lorenzo, V. & Ouzounis,
C.A. (2003) Myriads of protein families, and still counting.
Genome Biol. 4, 401.
2. Koonin, E.V., Wolf, Y.I. & Karev, G.P. (2002) The structure of
the protein universe and genome evolution. Nature 420, 218–223.
3. Xia, Y. & Levitt, M. (2004) Simulating protein evolution in
sequence and structure space. Curr. Opin. Struct. Biol. 14, 202–
207.
4. Rost, B. (2002) Did evolution leap to create the protein universe?
Curr. Opin. Struct. Biol. 12, 409–416.
5. Liu, J. & Rost, B. (2003) Domains, motifs and clusters in the
protein universe. Curr. Opin. Chem. Biol. 7, 5–11.
6. Daggett, V. & Fersht, A. (2003) The present view of the
mechanism of protein folding. Nat. Rev. Mol. Cell. Biol. 4, 497–
502.
7. Shakhnovich, E.I. (1997) Theoretical studies of protein-folding
thermodynamics and kinetics. Curr. Opin. Struct. Biol. 7, 29–40.
8. Tiana, G., Shakhnovich, B.E., Dokholyan, N.V. & Shakhnovich,
E.I. (2004) Imprint of evolution on protein structures. Proc. Natl
Acad. Sci. USA 101, 2846–2851.
9. Berezovsky, I.N., Grosberg, A.Y. & Trifonov, E.N. (2000) Closed
loops of nearly standard size: common basic element of protein
structure. FEBS Lett. 466, 283–286.
10. Lamarine, M., Mornon, J.P., Berezovsky, I.N. & Chomilier, J.
(2001) Distribution of tightened end fragments of globular proteins statistically match that of topohydrophobic positions:
towards an efficient punctuation of protein folding? Cell. Mol. Life
Sci. 58, 492–498.
11. Berezovsky, I.N., Kirzhner, V., Kirzhner, A. & Trifonov, E.N.
(2001) Protein folding: looping from hydrophobic nuclei. Proteins
45, 346–350.
12. Berezovsky, I.N. (2003) Discrete structure of van der Waals
domains in globular proteins. Protein Engineering 16, 161–167.
13. Berezovsky, I.N. & Trifonov, E.N. (2001) Van der Waals locks:
loop-n-lock structure of globular proteins. J. Mol. Biol. 307, 1419–
1426.
14. Poupon, A. & Mornon, J.P. (1998) Populations of hydrophobic
amino acids within protein globular domains; identification
of conserved 'topohydrophobic' positions. Proteins 33, 329–
342.
15. Poupon, A. & Mornon, J.P. (1999) 'Topohydrophobic positions'
as key markers of globular protein folds. Theoret Chem. Accounts
101, 2–8.
16. Poupon, A. & Mornon, J.P. (1999) Predicting the protein folding
nucleus from sequences. FEBS Lett. 452, 283–289.
17. Berezovsky, I.N. & Trifonov, E.N. (2002) Loop fold structure of
proteins: resolution of Levinthal’s paradox. J. Biomol. Struct.
Dynamics 20, 5–6.
18. Skolnick, J. & Kolinski, A. (1991) Dynamic Monte Carlo simulations of a new lattice model of globular protein folding, structure
and dynamics. J. Mol. Biol. 221, 499–531.
19. Labesse, G., Colloc’h, N., Pothier, J. & Mornon, J.P. (1997)
P-SEA: a new efficient assignment of secondary structure from C
alpha trace of proteins. Comput Appl. Biosci. 13, 291–295.
20. Miyazawa, S. & Jernigan, R.L. (1996) Residue-residue potentials
with a favorable contact pair term and an unfavorable high
packing density term for simulation and threading. J. Mol. Biol.
256, 623–644.
21. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N.,
Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000) The Protein
Data Bank. Nucleic Acids Res. 28, 235–242.
22. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995)
SCOP: a structural classification of proteins database for the
investigation of sequences and structures. J. Mol. Biol. 247, 536–
540.
23. Fersht, A. & Daggett, V. (2002) Protein folding at atomic
resolution. Cell 108, 573–582.
24. Fersht, A. (1997) Nucleation mechanisms in protein folding. Curr.
Opin. Struct. Biol. 7, 3–9.
25. Ittah, V. & Haas, E. (1995) Nonlocal interactions stabilize long
range loops in the initial folding intermediates of reduced bovine
pancreatic trypsin inhibitor. Biochemistry 34, 4493–4506.
26. Klein-Seetharaman, J., Oikawa, M., Grimshaw, S.B., Wirmer, J.,
Duchardt, E., Ueda, T., Imoto, T., Smith, L.J., Dobson, C.M. &
Schwalbe, H. (2002) Long-range interactions within a non-native
protein. Science 295, 1719–1722.