Computer search algorithms in protein modification and design
John R Desjarlais* and Neil D Clarke†
The computer-aided design of protein sequences requires
efficient search algorithms to handle the enormous
combinatorial complexity involved. A variety of different
algorithms have now been applied with some success. The
choice of algorithm can influence the representation of the
problem in several important ways — the discreteness of the
configuration, the types of energy terms that can be used and
the ability to find the global minimum energy configuration. The
use of dead end elimination to design the complete sequence
for a small protein motif and the use of genetic and mean-field
algorithms to design hydrophobic cores for proteins represent
the major themes of the past year.
Addresses
*Department of Chemistry, Pennsylvania State University, University
Park, Pennsylvania 16802, USA; e-mail: [email protected]
†Department of Biophysics and Biophysical Chemistry, Johns Hopkins
School of Medicine, Baltimore, Maryland 21205, USA;
e-mail: [email protected]
Current Opinion in Structural Biology 1998, 8:471–475
http://biomednet.com/elecref/0959440X00800471
© Current Biology Publications ISSN 0959-440X
Abbreviations
DEE  dead end elimination
GA   genetic algorithm
MC   Monte Carlo
Introduction
The design of protein sequences, whether intended to
adopt a particular fold or to modify a function, involves
evaluating an extraordinarily large number of sequences
for their ability to ‘fit’ a given structure. Search algorithms
describe how a computer program samples from this enormous set of allowed solutions. All search algorithms necessarily make compromises between computational speed
and thoroughness. Furthermore, there are important
dependencies between the choice of search algorithm,
the way in which the search space is represented and the
energy or scoring function used. The purpose of this brief
review is both to outline the computer search algorithms
that have been used to design protein sequences and to
raise some of the issues involved in understanding how
search algorithms, energy functions and structural representations are inter-related. Finally, we discuss some of
the experimental criteria used to analyze designed proteins, since conclusions regarding the effectiveness of the
search algorithms depend, in part, on the interpretation of
such experiments.
Overview — sampling versus quasi-exhaustive
searching
Search algorithms can be divided into two categories.
The first category encompasses algorithms that sample
solutions semi-randomly and then move from one possible
solution to another in a manner that depends on both the
nature of the energy landscape and the algorithm-specific
rules for movement. Algorithms of this type that have
been applied to proteins include Monte Carlo (MC) techniques [1,2] and genetic algorithms (GAs) [3–6]. An advantage of these algorithms is that they can be applied to
search problems that have an infinite number of possible
solutions. In particular, sidechain and backbone conformations can be allowed to vary continuously ([2]; JR
Desjarlais, TM Handel, unpublished data). On the other
hand, there is no guarantee that these algorithms will
explore solutions near the global energy minimum. In contrast, algorithms that fall into the second category, pruning
algorithms, are intended to be functionally equivalent to
an exhaustive search. Since truly exhaustive searches are
possible only for very small search spaces, pruning algorithms first simplify the search space by allowing only certain discrete conformations. They then apply rejection
criteria in order to eliminate the vast majority of combinatorial possibilities without actually considering them formally. The robustness of these methods obviously
depends both on how finely the conformational space is
represented and on the criteria used for rejection.
Application of the dead end elimination (DEE) theorem
[7] is the most important pruning idea currently used in
the design of protein sequences [8••,9]. Other pruning
methods [10,11] have been successfully used to design
ligand-binding sites [12–17].
Sampling algorithms
In the case of protein-sequence design, sampling algorithms can be used to vary sidechain identity, sidechain
orientation and backbone structure. The simplest type of
sampling procedure is the MC method. The general strategy of MC algorithms is to iteratively propose a modification to a model and then decide whether or not the
proposed modification should be accepted. The most
common way of deciding whether to accept a proposed
modification is to use the Metropolis criterion [18].
According to this method, a modification is always accepted if it lowers the energy of the model. If a modification
increases the energy of the model, acceptance or rejection
is based on the outcome of what is essentially a weighted
coin toss. The relative probabilities of the old (unmodified) and new (modified) models, which are used to weight
the coin, are calculated according to Boltzmann’s relationship between probability and energy differences at a given
temperature for the system. The relationship between
acceptance probability and temperature is useful for simulated annealing. In this variation of the standard sampling
algorithm, the system is slowly cooled throughout the run
in order to gradually decrease the probability that an uphill
energy modification will be accepted.
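The acceptance rule and cooling schedule described above can be sketched in a few lines of Python. This is a minimal illustration rather than any published implementation: the function names, the geometric cooling schedule, the temperature range (in energy units, that is, kT) and the toy energy function are all assumptions made for the example.

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng):
    """Metropolis criterion: downhill moves are always accepted; uphill moves
    are accepted with Boltzmann probability exp(-delta_e / temperature)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def anneal(energy, n_sites, n_states, steps=20000, t_start=5.0, t_end=0.05, seed=0):
    """Simulated annealing over discrete states (e.g. one rotamer per site)."""
    rng = random.Random(seed)
    state = [rng.randrange(n_states) for _ in range(n_sites)]
    current_e = energy(state)
    best_state, best_e = list(state), current_e
    for step in range(steps):
        # Assumed geometric cooling from t_start down to t_end
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        site = rng.randrange(n_sites)
        old_value = state[site]
        state[site] = rng.randrange(n_states)  # proposed modification
        new_e = energy(state)
        if metropolis_accept(new_e - current_e, temperature, rng):
            current_e = new_e
            if current_e < best_e:
                best_state, best_e = list(state), current_e
        else:
            state[site] = old_value  # reject: restore the old model
    return best_state, best_e

# Toy energy (hypothetical): squared distance from an arbitrary target state
target = [2, 0, 3, 1, 1, 2]
toy_energy = lambda s: sum((a - b) ** 2 for a, b in zip(s, target))
best_state, best_energy = anneal(toy_energy, n_sites=6, n_states=4)
```

In a real design problem the energy function and the move set (sidechain identity, rotamer, backbone perturbation) would replace the toy versions used here, and the cooling parameters would need to be tuned to the system.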
GAs are similar in some respects to MC methods. The
major distinctions are that a population of models is propagated (evolved) throughout the course of the run and
genetic operators, such as recombination, are used to create new models from existing parents. The efficacy of the
GA method stems from the implicit parallelism contained
within protein design problems; different segments of the
structure are optimized in parallel and selective recombination between models will sometimes bring two of the
optimized segments together into the same model.
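As a rough illustration of how a population and a recombination operator fit together, the following Python sketch evolves a set of discrete models. The selection scheme (keeping the better half of the population), one-point recombination, the mutation rate and the toy energy function are illustrative assumptions and are not taken from any of the cited programs.

```python
import random

def genetic_search(energy, n_sites, n_states, pop_size=40, generations=60,
                   mutation_rate=0.05, seed=0):
    """Evolve a population of models; each model is a list of discrete states."""
    rng = random.Random(seed)
    pop = [[rng.randrange(n_states) for _ in range(n_sites)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=energy)
        parents = pop[: pop_size // 2]  # assumed truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_sites)  # one-point recombination
            child = a[:cut] + b[cut:]
            # Occasional random mutation of individual sites
            child = [rng.randrange(n_states) if rng.random() < mutation_rate else s
                     for s in child]
            children.append(child)
        pop = parents + children  # parents survive, so the best model is never lost
    return min(pop, key=energy)

# Toy energy (hypothetical): squared distance from an arbitrary target state
ga_target = [2, 0, 3, 1, 1, 2, 0, 3]
ga_energy = lambda s: sum((a - b) ** 2 for a, b in zip(s, ga_target))
ga_best = genetic_search(ga_energy, n_sites=8, n_states=4)
```

Recombination pays off here because the toy energy is a sum of independent per-site terms, so a child can inherit a well-optimized left segment from one parent and a well-optimized right segment from the other.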
Both the MC and GA methods are relatively straightforward to encode into search algorithms, although neither is
guaranteed to converge to a global minimum. Both methods require a thorough optimization of the parameters that
control the convergence properties of the algorithm, with
respect to the system being studied.
Dead end elimination
DEE is arguably the most powerful method for discrete
conformational searches because of its ability to make
enormous reductions in combinatorial complexity using a
robust process of elimination. In simple terms, the DEE
theorem allows individual sidechain identities/rotamers to
be strictly designated as being incompatible with the global energy minimum. In its original form [7], the DEE theorem uses the criterion that if the lowest energy structure
that can be found using a given sidechain rotamer is higher in energy than the highest energy structure that can be
found with a different rotamer, the first rotamer can be
eliminated. A significantly larger reduction of possible
rotamers is attainable with Goldstein’s variation [19] of the
original method. This uses the criterion that if the energy
of a possible structure containing one rotamer is always
lowered by changing to a second rotamer, the first rotamer
can be eliminated. With both methods, extending the concept to include rotamer pairs and higher order combinations results in further improvements in efficiency,
although the application of the algorithm to higher order
combinations obviously poses combinatorial problems of
its own. For some problems, the application of DEE may
result in a unique solution [8••,9,19,20]. Several modifications that improve the efficiency of the DEE process have
been described [20–22].
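The two elimination criteria can be stated compactly. The notation below is an assumption introduced for this summary: E(i_r) denotes the energy of rotamer r at position i with respect to the fixed template, and E(i_r, j_s) the pairwise energy between rotamer r at position i and rotamer s at position j.

```latex
% Total energy as a sum of individual and pairwise terms
E_{\mathrm{total}} = \sum_{i} E(i_r) + \sum_{i<j} E(i_r, j_s)

% Original criterion [7]: rotamer r at position i is eliminated if,
% for some alternative rotamer t at the same position,
E(i_r) + \sum_{j \neq i} \min_{s} E(i_r, j_s)
  \;>\; E(i_t) + \sum_{j \neq i} \max_{s} E(i_t, j_s)

% Goldstein's criterion [19]: rotamer r is eliminated if, for some t,
E(i_r) - E(i_t) + \sum_{j \neq i} \min_{s}
  \bigl[ E(i_r, j_s) - E(i_t, j_s) \bigr] \;>\; 0
```

Any rotamer that satisfies the original criterion also satisfies Goldstein's, since the minimum of a difference is never smaller than the difference between the minimum of the first term and the maximum of the second. This is why the second criterion eliminates at least as many rotamers.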
Like all pruning-type algorithms, the implementation of
DEE requires the use of discrete representations of the
backbone and sidechains. In addition, it is restricted to
energy terms that can be written as the sum of individual
and pairwise energy terms. In some cases, these limitations
might be overly restrictive for the problem at hand, necessitating use of other sampling methods, such as MC or GA.
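To make the dependence on pairwise-decomposable energies concrete, the sketch below applies Goldstein's single-rotamer criterion [19] to a small random problem and compares the result with a brute-force search. The data layout (a list of per-position energies and a dictionary of pairwise tables), the function names and the random test energies are assumptions made for illustration only.

```python
import itertools
import random

def total_energy(assignment, self_e, pair_e):
    """Sum of single-position and pairwise terms for one rotamer assignment."""
    n = len(assignment)
    energy = sum(self_e[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair_e[(i, j)][assignment[i]][assignment[j]]
    return energy

def goldstein_dee(self_e, pair_e):
    """Return, for each position, the set of rotamers that survive elimination."""
    n = len(self_e)
    alive = [set(range(len(rotamers))) for rotamers in self_e]

    def pair(i, r, j, s):
        # pair_e stores each pair of positions once, keyed with i < j
        return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]

    changed = True
    while changed:  # repeat until no further rotamer can be eliminated
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                for t in sorted(alive[i] - {r}):
                    # Smallest possible energy penalty of choosing r over t,
                    # taken over the surviving rotamers at every other position
                    gap = self_e[i][r] - self_e[i][t]
                    for j in range(n):
                        if j != i:
                            gap += min(pair(i, r, j, s) - pair(i, t, j, s)
                                       for s in alive[j])
                    if gap > 0:  # r is always worse than t: a dead end
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

# Small random test problem (hypothetical energies)
rng = random.Random(1)
n_pos, n_rot = 4, 3
self_e = [[rng.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_pos)]
pair_e = {(i, j): [[rng.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_rot)]
          for i in range(n_pos) for j in range(i + 1, n_pos)}
alive = goldstein_dee(self_e, pair_e)
best = min(itertools.product(range(n_rot), repeat=n_pos),
           key=lambda a: total_energy(a, self_e, pair_e))
```

Because a rotamer is removed only when some alternative is strictly better in every surviving context, the brute-force minimum `best` is always contained in the surviving sets. Note that the criterion cannot even be written down unless the energy separates into the `self_e` and `pair_e` tables.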
Sidechain rotamers and the use of discrete
backbone conformations
Although the representation of conformational space as a
set of discrete states is required only for pruning algorithms, this simplification is also very commonly used for
sampling methods such as MC and GA. The level at which
sidechain and backbone conformations are made discrete
can be expected to have a dramatic effect on the ability of
the algorithm to predict the foldability of alternative
sequences. The most obvious problem with the discretization of conformational space is that the number of acceptable packing solutions found will be much smaller than it
should be because of the steric clashes that might be
avoided in continuous space. Recent studies on the effect
of backbone and/or sidechain flexibility have confirmed
that rigidly defined rotamers can be very misleading when
applied to the prediction of allowed sequences ([23]; JR
Desjarlais, TM Handel, unpublished data). These studies
also imply that parameterization of the weights applied to
various energy function terms strongly depends on the
level of discreteness used. As an example of how energy
term parameterization is closely coupled to the choice of
search strategy, an often used compromise for dealing with
the steric problems inherent in the use of rotamers is to
soften the repulsive van der Waals term or to reduce the
size of the atomic radii. These adjustments can themselves
be parameterized according to experimental results indicating whether certain substitutions are functional or
nonfunctional [24,25]. Whether this solution is robust in
the sense that it gives predictions as accurate as those that
could be achieved with finer sampling and a more accurate
potential is unclear.
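The effect of reducing atomic radii can be illustrated with a simple 12-6 Lennard–Jones form. This is a sketch under stated assumptions: the functional form, the parameter names and the numerical values (a 4.0 Å optimal separation, a well depth of 0.1 and a 0.9 scale factor) are chosen for the example and are not taken from the cited studies.

```python
def lennard_jones(r, r_min, well_depth, radius_scale=1.0):
    """12-6 Lennard-Jones energy with its minimum (-well_depth) at the
    separation radius_scale * r_min. A radius_scale below 1 mimics the
    'reduced radii' compromise by shifting the repulsive wall inwards."""
    ratio = (radius_scale * r_min) / r
    return well_depth * (ratio ** 12 - 2.0 * ratio ** 6)

# A mild clash: two atoms 3.4 A apart whose optimal separation is 4.0 A
hard = lennard_jones(3.4, r_min=4.0, well_depth=0.1)
soft = lennard_jones(3.4, r_min=4.0, well_depth=0.1, radius_scale=0.9)
```

With full radii the contact is penalized (`hard` is positive), whereas with radii scaled to 90% the same contact is scored as favorable (`soft` is negative). A rotamer pair that would be rejected on a rigid template can therefore be retained, which is the intended effect, but the scale factor becomes one more parameter that is coupled to the coarseness of the rotamer library.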
Global versus additive energies (scoring terms)
Algorithms that eliminate individual amino acids/rotamers
or pairs of amino acids/rotamers are very powerful approximations, but they disregard the context of the global structure. Whether this kind of pruning is appropriate depends
on how well the global energy can be represented as a simple sum of single and pairwise energy terms. Examples of
terms that might reasonably be considered to be simple in
this way include Lennard–Jones energies, torsional energies, secondary-structure propensities and simple coulombic representations of electrostatic effects. The most
important property that is not additive in this simple way
is the solvent-accessible surface area. This is important
because accessible surface areas are commonly used to
estimate the solvent contribution to the free energy of a
model sequence or structure [26]. Recognition of this
problem [8••] has led to the development of empirical
terms that compensate for the nonadditivity of accessible
surface area [27•]. Other global energy terms that might be
used in design algorithms include compositional biases and
geometric constraints towards preferred structures. One
important advantage of sampling methods such as GA and
MC is that they can use global terms of this type.
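Why buried surface area is not pairwise additive can be seen in a one-dimensional analogy, sketched below. The 'surface' of a residue is represented as a line segment and each neighbor buries an interval of it; this toy model and its function names are assumptions for illustration and do not correspond to any real surface-area algorithm.

```python
def union_length(intervals):
    """Total length covered by a set of possibly overlapping intervals."""
    total, current_end = 0.0, float("-inf")
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # entirely inside what is already covered
        total += end - max(start, current_end)
        current_end = end
    return total

def pairwise_sum(intervals):
    """Pairwise-style estimate: add up each neighbor's burial independently."""
    return float(sum(end - start for start, end in intervals))

# Two neighbors that bury overlapping parts of a 10-unit surface
overlapping = [(0, 6), (4, 10)]
true_buried = union_length(overlapping)      # overlap is counted once
estimated_buried = pairwise_sum(overlapping)  # overlap is counted twice

# Two neighbors that bury separate parts: both measures agree
separate = [(0, 3), (5, 8)]
```

The pairwise sum overestimates burial whenever two neighbors cover the same patch of surface. The pairwise approach of [27•] compensates for this kind of overcounting with parametrized scaling factors.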
Pruning strategies in the design of ligand-binding sites
In addition to the redesign of protein folds, considerable
progress has been made in designing simple
ligand-binding sites. Metal ions offer important experimental advantages for the development of these methods.
These advantages include well-defined coordination
geometries, tight intrinsic binding and, in some cases,
spectroscopic properties that allow the design to be evaluated without requiring a complete structure determination.
Two computer programs have been written that help
design metal sites in proteins of known structure.
Interestingly, the two programs use very different strategies for dealing with the combinatorial complexity of the
problem. DEZYMER [10] uses ‘depth first’ pruning
whereas METAL-SEARCH [11] uses ‘on-the-fly binning’.
METAL-SEARCH is much less versatile than
DEZYMER (it was written to look only for tetrahedral
Cys–His sites) but this lack of versatility is not due to the
choice of pruning algorithm. The different strategies are
illustrated schematically in Figure 1.
Both DEZYMER and METAL-SEARCH assume fixed
backbones, both use rotamers in the initial stages of the
search and both use simple geometric criteria for evaluating potential sites. Despite these simplifications, both programs have been used successfully in the design of
metal-binding sites [12–17,28]. DEZYMER has also been
used to design sites of more complex ligands [29].
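The 'on-the-fly binning' idea can be sketched with a dictionary keyed by grid box. This is not the METAL-SEARCH code: the function names, the tuple layout of the candidates, the grid spacing and the coordinates are hypothetical, and the sketch assumes the idealized metal positions have already been calculated.

```python
import math
from collections import defaultdict

def bin_metal_positions(candidates, cell_size):
    """Group candidate substitutions by the grid box their metal falls into.
    candidates: iterable of (residue, amino_acid, rotamer, (x, y, z))."""
    boxes = defaultdict(list)
    for residue, amino_acid, rotamer, (x, y, z) in candidates:
        key = (math.floor(x / cell_size),
               math.floor(y / cell_size),
               math.floor(z / cell_size))
        boxes[key].append((residue, amino_acid, rotamer))
    return boxes

def candidate_sites(boxes, min_residues=3):
    """Boxes holding metal positions from at least min_residues distinct residues."""
    sites = []
    for key, entries in boxes.items():
        residues = {residue for residue, _, _ in entries}
        if len(residues) >= min_residues:
            sites.append((key, sorted(residues)))
    return sites

# Hypothetical precalculated metal positions (coordinates in angstroms)
candidates = [
    (1, "His", 0, (1.0, 1.2, 0.8)),
    (4, "Cys", 1, (1.5, 0.5, 1.1)),
    (5, "His", 2, (0.3, 1.9, 1.0)),
    (2, "Cys", 0, (5.0, 5.2, 4.6)),
    (3, "His", 1, (4.4, 5.8, 5.0)),
    (4, "Cys", 2, (5.5, 4.1, 4.9)),
    (6, "His", 0, (9.0, 9.0, 9.0)),
]
sites = candidate_sites(bin_metal_positions(candidates, cell_size=2.0))
```

Each position is examined once and no pairwise distances are computed, which is where the efficiency comes from. A simple grid of this kind would miss positions that lie close together but on opposite sides of a box boundary; how that case is handled in practice is not specified here, and checking neighboring boxes would be one option.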
Criteria for evaluating design algorithms
The past couple of years have seen remarkable successes
in the manual design of simple helical proteins, including
one design that was constrained to be 50% identical in
sequence to a predominantly β-sheet protein [30]! Success
in protein design, however, only raises the standards of
what should be considered a success. If it is important that
Figure 1
Pruning strategies used in the design of metal-binding sites. For the purposes of this schematic, potential binding sites involve three substitutions.
(a) ‘Depth first’ pruning. A particular amino acid substitution, sidechain rotamer and, if appropriate, sidechain to metal orientation are first picked as
an ‘anchor’ around which the search for additional coordinating residues is conducted. The coordinates of the bound-metal ion are then calculated
(small black circle at the end of the sidechain on residue 1). Residues that are deemed too far away from the metal ion are immediately rejected
(residues 2 and 3, gray circles). For each remaining residue, all possible rotamers are ‘grown’ in turn, one atom at a time. As each atom is added,
the growing sidechain is evaluated to see whether it is compatible with binding to the anchor ligand. If the growing sidechain is deemed
incompatible with binding, then growth along that branch is stopped (gray branches); otherwise, growth and evaluation are continued (black
branches). In DEZYMER, the criteria for assessing the compatibility with binding include having the anchor position lie within a precalculated
‘ligand expectation sphere’ for the growing sidechain [10]. In this example, sidechains from residues 1, 4, and 5 might meet the geometric criteria
and could be further refined and evaluated. The search must then be repeated using a different initial position for the metal ion, as determined by
different combinations of anchor residue, sidechain rotamer and sidechain metal geometry. (b) Pruning by ‘on-the-fly binning’. METAL-SEARCH
precalculates idealized metal positions at every residue, for each kind of sidechain ligand and for every rotamer [11]. The efficiency of the algorithm
comes in grouping, ‘on-the-fly’, those substitutions that have idealized metal positions near to one another. This is done by placing a grid over the
protein structure prior to the calculation of the metal coordinates. As the metal positions are calculated (small black circles), the algorithm simply
notes which box the metal ion falls into. Information relevant to how the metal got into that box (residue number, amino acid type and rotamer) is
then added to the list that is associated with that particular box. Once all the metal positions have been calculated, the algorithm checks to see
which boxes contain information about three or more residues. In this case, the heavy-lined boxes indicate possible sites involving residues 1, 4 and
5, and residues 2, 3 and 4. Geometric criteria that assess the quality of the site(s) are then applied, and sites that meet the criteria can be further
refined and evaluated. Reproduced with permission from [11].
designed proteins be like natural proteins, then it is important to decide what criteria are most indicative of a natural
native structure. Thermodynamic stability is probably not
a good criterion because increased stability can be
obtained simply by increasing structural degeneracy (configurational entropy) in the native-like state. Indeed, α4,
the first four-helix bundle to be designed, is extremely stable even though it lacks the well-ordered core associated
with natural proteins [31,32]. A high enthalpy change for
unfolding is arguably a more useful criterion. Perhaps the
most useful criteria are amide proton exchange rates, as
these provide insight into the dynamic structure of the protein. In the case of ligand-binding site design, the correct
binding geometry and high binding affinity are probably
the two most important criteria.
Other search algorithms
The search methods discussed above have evolved from
earlier applications in the closely related field of structure
prediction by comparative modeling. It is likely that
emerging methods, such as mean-field approaches [33–35],
will begin to find application in protein design as
well [36,37•].
Parametrized minimization has also been used quite successfully for the prediction [38] and design (P Harbury,
J Plecs, B Tidor, T Alber, P Kim, personal communication)
of coiled-coil proteins. The extension of this method to
more typical proteins, which do not have the high degree
of symmetry inherent in coiled coils, will presumably
require extensive modification in order to accommodate
the increased combinatorics involved.
Conclusions
As the search algorithm determines the types of limitations
and assumptions involved in the search process, one must
carefully consider the trade-offs involved in choosing among
the options. Currently, DEE is the superior method for
guaranteeing convergence to the global minimum energy
conformation. It is, however, important to distinguish
between the global energy minimum of the restricted search
space (determined by the discreteness of allowed conformations) and that of the true conformational and sequence
space of the protein. The impressive success of DEE in
designing sequences for discrete backbone structures suggests that, given sufficiently fine sampling, the two minima
are closely related but not identical [9,39]. We expect that
this will change dramatically as design attempts are extended to include de novo designed backbone structures.
The convergence properties of the GA and MC approaches
have so far proven to be sufficient for hydrophobic core
design and evaluation [2,5,40•], but they may eventually suffer as the size of the search space is increased. Some advantages of these methods, as discussed above, will remain.
The number of designed sequences of the past few years
that have ‘worked’ at some level is truly remarkable. The
next few years will see many more designs. How much of
the success of protein design is due to the search algorithms themselves? How much is due to the plasticity of
protein structures? How much is due to the retention of
wild-type sequence features? How do we decide what
computational and experimental criteria are the most
appropriate for evaluating protein designs? The answers to
these, and other questions, will come only when a great
many more calculations, and experiments, are carried out.
References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:
• of special interest
•• of outstanding interest
1. Lee C, Levitt M: Accurate prediction of the stability and activity
effects of site-directed mutagenesis on a protein core. Nature
1991, 352:448-451.
2. Hellinga HW, Richards FM: Optimal sequence selection in proteins
of known structure by simulated evolution. Proc Natl Acad Sci
USA 1994, 91:5803-5807.
3. Holland JH: Adaptation in Natural and Artificial Systems. Cambridge,
MA: The MIT Press; 1992.
4. Tuffery P, Etchebest C, Hazout S, Lavery R: A new approach to the
rapid determination of protein sidechain conformations. J Biomol
Struct Dyn 1991, 8:1267-1289.
5. Desjarlais JR, Handel TM: De novo design of the hydrophobic cores
of proteins. Protein Sci 1995, 4:2006-2018.
6. Pedersen JT, Moult J: Genetic algorithms for protein structure
prediction. Curr Opin Struct Biol 1996, 6:227-231.
7. Desmet J, De Maeyer M, Hazes B, Lasters I: The dead-end
elimination theorem and its use in protein side-chain positioning.
Nature 1992, 356:539-542.
8. •• Dahiyat BI, Mayo SL: Protein design automation. Protein Sci 1996,
5:895-903.
This paper demonstrates the potential of dead end elimination in designing
complete protein sequences that are consistent with a fixed backbone
template. The structure of the designed protein is very similar to the zinc-finger template that was used for the sequence optimization and it folds
with a stability that is considered to be substantial given the small size of
the domain. There are, however, interesting differences between the
backbone and sidechain structures determined for the real protein and the
designed structure.
9. Dahiyat BI, Mayo SL: De novo protein design: fully automated
sequence selection. Science 1997, 278:82-87.
10. Hellinga HW, Richards FM: Construction of new ligand binding
sites in proteins of known structure. I. Computer-aided modeling
of sites with pre-defined geometry. J Mol Biol 1991, 222:763-785.
11. Clarke ND, Yuan SM: Metal search: a computer program that helps
design tetrahedral metal-binding sites. Proteins 1995, 23:256-263.
12. Klemba M, Regan L: Characterization of metal binding by a
designed protein: single ligand substitutions at a tetrahedral
Cys2His2 site. Biochemistry 1995, 34:10094-10100.
13. Klemba M, Gardner KH, Marino S, Clarke ND, Regan L: Novel metal-binding proteins by design. Nat Struct Biol 1995, 2:368-373.
14. Hellinga HW, Caradonna JP, Richards FM: Construction of new
ligand binding sites in proteins of known structure. II. Grafting of
a buried transition metal binding site into Escherichia coli
thioredoxin. J Mol Biol 1991, 222:787-803.
15. Regan L, Clarke ND: A tetrahedral zinc(II)-binding site introduced
into a designed protein. Biochemistry 1990, 29:10878-10883.
16. Hellinga HW: Metalloprotein design. Curr Opin Biotechnol 1996,
7:437-441.
17. Regan L: Protein design: novel metal-binding sites. Trends
Biochem Sci 1995, 20:280-285.
18. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH: Equation of
state calculations by fast computing machines. J Chem Phys
1953, 21:1087-1092.
19. Goldstein RF: Efficient rotamer elimination applied to protein sidechains and related spin glasses. Biophys J 1994, 66:1335-1340.
20. Lasters I, De Maeyer M, Desmet J: Enhanced dead-end elimination
in the search for the global minimum energy conformation of a
collection of protein sidechains. Protein Eng 1995, 8:815-822.
21. Keller DA, Shibata M, Marcus E, Ornstein RL, Rein R: Finding the
global minimum: a fuzzy end elimination implementation. Protein
Eng 1995, 8:893-904.
22. De Maeyer M, Desmet J, Lasters I: All in one: a highly detailed
rotamer library improves both accuracy and speed in the modelling
of sidechains by dead-end elimination. Fold Des 1997, 2:53-66.
23. Lee C: Testing homology modeling on mutant proteins: predicting
structural and thermodynamic effects in the Ala98→Val mutants
of T4 lysozyme. Fold Des 1996, 1:1-12.
24. Hurley JH, Baase WA, Matthews BW: Design and structural
analysis of alternative hydrophobic core packing arrangements in
bacteriophage T4 lysozyme. J Mol Biol 1992, 224:1143-1159.
25. Dahiyat BI, Mayo SL: Probing the role of packing specificity in
protein design. Proc Natl Acad Sci USA 1997, 94:10172-10177.
26. Eisenberg D, McLachlan AD: Solvation energy in protein folding
and binding. Nature 1986, 319:199-203.
27. • Street AG, Mayo SL: Pairwise calculation of protein solvent-accessible
surface area. Fold Des 1998, 3:253-258.
The authors present the parametrization of scaling factors for pairwise
calculations of solvent-accessible surface areas, a requirement for using
dead end elimination. Excellent correlations between the resulting
approximations and the true surface areas are demonstrated.
28. Pinto AL, Hellinga HW, Caradonna JP: Construction of a
catalytically active iron superoxide dismutase by rational protein
design. Proc Natl Acad Sci USA 1997, 94:5562-5567.
29. Coldren CD, Hellinga HW, Caradonna JP: The rational design and
construction of a cuboidal iron-sulfur protein. Proc Natl Acad Sci
USA 1997, 94:6635-6640.
30. Dalal S, Balasubramanian S, Regan L: Protein alchemy: changing
beta-sheet into alpha-helix. Nat Struct Biol 1997, 4:548-552.
31. Regan L, DeGrado WF: Characterization of a helical protein
designed from first principles. Science 1988, 241:976-978.
32. Handel TM, Williams SA, DeGrado WF: Metal ion-dependent
modulation of the dynamics of a designed protein. Science 1993,
261:879-885.
33. Koehl P, Delarue M: Application of a self-consistent mean field
theory to predict protein side-chains conformation and estimate
their conformational entropy. J Mol Biol 1994, 239:249-275.
34. Lee C: Predicting protein mutant energetics by self-consistent
ensemble optimization. J Mol Biol 1994, 236:918-939.
35. Koehl P, Delarue M: Mean-field minimization methods for
biological macromolecules. Curr Opin Struct Biol 1996, 6:222-226.
36. Kono H, Doi J: Energy minimization method using automata
network for sequence and sidechain conformation prediction
from given backbone geometry. Proteins 1994, 19:244-255.
37. • Kono H, Nishiyama M, Tanokura M, Doi J: Design of hydrophobic
core of E. coli malate dehydrogenase based on the sidechain
packing. Pac Symp Biocomput 1997: 210-221.
This work demonstrates the use of a novel automata network method,
similar to mean-field approaches [35], for hydrophobic core design.
38. Harbury PB, Tidor B, Kim PS: Repacking protein cores with
backbone freedom: structure prediction for coiled coils. Proc Natl
Acad Sci USA 1995, 92:8408-8412.
39. Su A, Mayo SL: Coupling backbone flexibility and amino acid
sequence selection in protein design. Protein Sci 1997, 6:1701-1707.
40. • Lazar GA, Desjarlais JR, Handel TM: De novo design of the
hydrophobic core of ubiquitin. Protein Sci 1997, 6:1167-1178.
The paper represents the most recent application of a genetic algorithm for
hydrophobic core design. Several multiply substituted variants of ubiquitin
were designed and experimentally characterized. The use of different types
of sidechain rotamer libraries, atomic-potential functions and levels of
conformational discreteness are explored.